21-May-2010

System Admin (un)Truths?

I've just come across an old article by Evan Erwin entitled "Top 10 System Administrator Truths". While I agree with a lot of what Evan has to say, his truth #10 tells me that he's never worked in a mission-critical environment. He calls it "The Holy Grail of Tech Support" and, put bluntly, he's suggesting that you should ask your user to reboot to fix a multitude of issues. Well, try suggesting that you reboot <insert name of your local stock exchange>'s trading platform in the middle of a trading day and see how far you get.

Giving the box the three-finger salute may fix your immediate problem, but it's sure not going to help you find out what was wrong in the first place: a reboot reinitializes the computer's memory and wipes out the evidence you would need to investigate. And because you haven't fixed the underlying cause, the issue will probably happen again. Not a good outcome when you have to explain to the CIO why you are rebooting a production machine for the nineteenth time.

It's far better either to track the problem down while the machine is still functioning, or to force the machine to take a crash dump that captures as much context as possible, so you can investigate the problem after the computer is rebooted.

Posted at May 21, 2010 8:16 AM
Comments

A reboot is rarely the answer. VMS has fine tools for working out what is going on.

There is a good presentation about making the best use of a crash dump. Some preparation ensures that when you do need to crash the system to get a dump, the result is of maximum use to the support centre.

(I've resisted the urge to add various jokes about dumps here, in the tradition of British humor.)

Posted by: Ian Miller at May 21, 2010 8:48 PM
