Build a patch management process to stay cool in an IT crisis
Time is of the essence during a major IT crisis, but rushing to fix a problem can do more harm than good. Ensure version control, documentation and testing for effective patch deployment.
Something has gone wrong. The faces of the help desk staff are awash with red light reflected from their dashboards. The sounds of users wailing and gnashing their teeth, as they're unable to perform their jobs, fill the office.
"Just fix it," think the users. But, for IT teams, that is easier said than done.
So, where should the resolution process start?
In pursuit of a quick fix, it's tempting to turn to rock star developers and assign the issue to them. Sometimes, this patch management process works well -- at least on the surface. The developers quickly find the problem and fix it, and users can continue their work.
However, this method is flawed, as it lacks an audit trail, doesn't necessarily get to the root cause of the issue and might leave teams scrambling if a similar problem arises again.
Begin with the basics
To establish a successful patch management process -- even in times of IT crises -- start with proper root cause analysis. Developers should not mess with things just because they might be broken. Use diagnostic tools to identify the exact problem and where it lies.
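As one small illustration of that kind of diagnosis, the sketch below scans an application log and counts error lines per component to suggest where the fault may lie. The log format it parses ("ERROR component: message") and the file name are assumptions for illustration, not the output of any particular tool.

```python
# triage_log.py -- log triage sketch; the "LEVEL component: message" format is an assumption
from collections import Counter
from pathlib import Path

def error_counts(log_path: Path) -> Counter:
    """Count ERROR lines per component to suggest where the root cause may lie."""
    counts: Counter = Counter()
    for line in log_path.read_text(encoding="utf-8", errors="replace").splitlines():
        parts = line.split(maxsplit=2)  # e.g. "ERROR billing-service: timeout ..."
        if len(parts) >= 2 and parts[0] == "ERROR":
            counts[parts[1].rstrip(":")] += 1
    return counts

if __name__ == "__main__":
    sample = Path("app.log")
    sample.write_text(
        "INFO web: request served\n"
        "ERROR billing-service: timeout calling payment gateway\n"
        "ERROR billing-service: timeout calling payment gateway\n"
        "ERROR web: 502 returned to client\n"
    )
    for component, count in error_counts(sample).most_common():
        print(f"{component}: {count} errors")
```

In practice, teams would lean on their existing monitoring and log aggregation tooling; the point is to narrow the fault to a specific component before anyone changes code.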
If it is a software issue, use the right DevOps tools, such as CloudBees Jenkins, Atlassian, Chef and Puppet, to address the problem. Confirm that version control is in place, and ensure that all developers document what they do, what they change and why.
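As a minimal sketch of what that documentation trail might look like, the hypothetical helper below appends a structured change record -- author, ticket, version-control reference and rationale -- to a JSON Lines audit log. The field names and the log location are illustrative assumptions, not any specific tool's format.

```python
# record_change.py -- illustrative only; field names and log path are assumptions
import datetime
import json
from pathlib import Path

AUDIT_LOG = Path("patch_audit.jsonl")  # hypothetical location for the audit trail

def record_change(author: str, ticket: str, commit: str, rationale: str) -> None:
    """Append one structured change record so every fix has an audit trail."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "author": author,
        "ticket": ticket,        # trouble-ticket or issue ID
        "commit": commit,        # version-control reference for the patch
        "rationale": rationale,  # what changed and why
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    record_change(
        author="a.developer",
        ticket="INC-1042",
        commit="3f2c9ab",
        rationale="Null check added to order lookup; crash traced to empty cart IDs.",
    )
```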
Test, test, test
Do not push the developers' fix directly into the operations environment. Developers must test the patch first, and then it should be tested again, just as any other piece of software would be. Be sure to test against the most realistic production environment possible. This is easier when an organization uses VMs or containers on a cloud platform. Developers or IT ops teams can spin up a virtual environment and test the patch there, without any direct effects on the working -- or, in the case of a crisis, the nonworking -- production environment.
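One way to exercise the patch in such a disposable environment is a basic smoke test against the staging copy of the service. The sketch below assumes a hypothetical health endpoint and response shape; both are placeholders, not a real service.

```python
# smoke_test.py -- minimal smoke test sketch; the staging URL and response shape are assumptions
import json
import sys
import urllib.error
import urllib.request

STAGING_HEALTH_URL = "http://staging.example.internal/health"  # hypothetical endpoint

def smoke_test(url: str, timeout: float = 5.0) -> bool:
    """Return True if the patched service in the test environment reports healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                print(f"Unexpected HTTP status: {response.status}")
                return False
            body = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError) as exc:
        print(f"Smoke test failed: {exc}")
        return False
    # Assumed response shape: {"status": "ok", "version": "<patched build>"}
    return body.get("status") == "ok"

if __name__ == "__main__":
    sys.exit(0 if smoke_test(STAGING_HEALTH_URL) else 1)
```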
If anything looks like a potential issue or vulnerability in testing, ship the code back to the developers along with the test analytics. It is far better to iterate a couple of times and get it right than to rush the patch management process and risk further incidents down the road.
Even when a patch looks good and performs well in tests, don't just roll it out, throw the switch and relax. Ensure that there is a known point to roll back to, in case of another issue. Full version control, along with rollback points throughout the system, enables developers to revert to a previous version and maintain some business continuity when a problem occurs.
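As a minimal sketch of that rollback point, assume releases are deployed as versioned directories with a "current" symlink -- a common but by no means universal pattern. The deploy switches the symlink, runs a health check and flips back to the previous release if the check fails; the paths and the health check are illustrative assumptions.

```python
# deploy_with_rollback.py -- illustrative sketch; directory layout and health check are assumptions
import tempfile
from pathlib import Path

def switch_to(current_link: Path, release_dir: Path) -> None:
    """Point the 'current' symlink at the given release directory."""
    if not release_dir.is_dir():
        raise FileNotFoundError(f"Release not found: {release_dir}")
    if current_link.is_symlink():
        current_link.unlink()
    current_link.symlink_to(release_dir)

def deploy(current_link: Path, new_release: Path, previous_release: Path, health_check) -> bool:
    """Roll out new_release, then revert to previous_release if the health check fails."""
    switch_to(current_link, new_release)
    if health_check():
        return True
    switch_to(current_link, previous_release)  # revert to the known-good rollback point
    return False

if __name__ == "__main__":
    # Demo with a throwaway directory layout; a real deployment would use /srv/app or similar.
    root = Path(tempfile.mkdtemp())
    old, new = root / "releases" / "2024.06.1", root / "releases" / "2024.06.2-hotfix"
    old.mkdir(parents=True)
    new.mkdir()
    current = root / "current"
    ok = deploy(current, new, old, health_check=lambda: False)  # simulate a failing check
    print("Deployed" if ok else f"Rolled back; current -> {current.resolve().name}")
```

The same idea applies whatever the deployment mechanism: the patch goes live only alongside a tested, documented way to undo it.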
Roll out the patch
After teams write and test a patch to address a fully identified root cause, it's time to roll out the fix. If it doesn't work, roll back to a known position for continuity, as mentioned above. If the problem or a similar one occurs again in the future, full documentation of the fix exists for reference.
A similar patch management process can apply to hardware issues as well: Identify the root cause; create a suitable trouble ticket; restrict access to the affected equipment to a designated administrator or engineer; document the change; test the fix before any clients or workloads use it; and roll out the patch, with a plan B in place as required.
While this process might appear to slow down the required fix, it ultimately prevents developers from having to scramble to address the same problem more than once. Of course, the exception is when a developer correctly identifies the exact cause of an issue and fixes it completely on the first try -- but this is rare.
Without a proper patch management process, an IT crisis can leave behind a mess of changed code across a raft of different areas that escalates into new, unrelated problems. A disciplined process keeps the fix contained -- and it secures your job.