Last year, as part of the IPAM Upgrade on 2016-03-24, we enabled a new feature called "Immediate Fixed Addresses", which allowed most DHCP-enabled Host and Fixed Address adds, changes, and deletes to take effect immediately (i.e., without waiting 5 minutes for a restart). Unfortunately, the feature caused long-term instability in the DHCP service, so on the vendor's recommendation we disabled it again on 2016-08-15 until it could be fixed in a future software release.
Since then, the vendor has created a diagnostic software patch that reveals more information about the cause of the problem when it occurs. However, after many months, neither we nor the vendor has been able to reproduce the problem in a lab environment.
On Thu Jun 22, we will deploy the diagnostic software patch and temporarily re-enable the Immediate Fixed Addresses feature in production. We expect that at some point afterward, one or both of the main campus DHCP servers will experience a crash event, after which we will disable the feature again to restore full stability. The purpose is to collect diagnostic information about the crash event and provide it to the vendor, so that they can fix the feature in a future software release and we can eventually enable it long-term.
The expected worst-case impact is a single outage of campus DHCP lasting at most 30 minutes, occurring at an unpredictable time, possibly several weeks after the feature is re-enabled. During such an outage, new clients joining or rejoining the network may be unable to obtain an IP address. The service default lease time is 24h, and clients start trying to renew halfway through their lease (at 12h), so clients already on the network will generally NOT be affected. Such an outage also will NOT affect the major Wi-Fi client networks, as they are served by a completely separate pair of DHCP servers.
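For readers who want to check the lease arithmetic, here is a minimal sketch in Python. It uses only the figures stated above (24h lease, renewal attempts from the 12h mark, outage of at most 30 minutes). The function names are hypothetical and exist only for illustration, and the model assumes a client keeps using its address until the lease actually expires.

```python
from datetime import timedelta

LEASE = timedelta(hours=24)      # service default lease time (from this notice)
OUTAGE = timedelta(minutes=30)   # expected worst-case outage (from this notice)
RENEW_FRACTION = 0.5             # clients start renewing halfway through the lease


def renewal_window(lease=LEASE, renew_fraction=RENEW_FRACTION):
    """Time between the first renewal attempt and lease expiry."""
    return lease - lease * renew_fraction


def time_left_after_outage(lease_age_at_outage_start, lease=LEASE, outage=OUTAGE):
    """Lease time remaining when the outage ends.

    A zero or negative result means the lease would expire during the
    outage (assumption: the client cannot renew while the servers are down).
    """
    return lease - lease_age_at_outage_start - outage


def survives_outage(lease_age_at_outage_start, lease=LEASE, outage=OUTAGE):
    """True if the client still holds a valid lease once service is restored."""
    return time_left_after_outage(lease_age_at_outage_start, lease, outage) > timedelta(0)


if __name__ == "__main__":
    print("Renewal window:", renewal_window())
    # A client that has just reached its renewal point when the outage begins:
    print("Time left after outage:", time_left_after_outage(timedelta(hours=12)))
```

Under these assumptions, a client would lose its address only if its lease were already within 30 minutes of expiry when the outage began, which means it had been unable to renew for more than 11 hours beforehand.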
If and when an outage does occur, it will be detected by monitoring, and we have a procedure ready to restore service and disable the feature again.
Alternatively, it is possible that only one of the two servers will experience an outage, or that the problem will not recur at all; either case would result in no client impact.
Note that we have scheduled this work over the summer in order to minimize impact; unfortunately, since it may take weeks to reproduce the problem, it's not possible to narrow down the schedule any more specifically than that.