01-Nov-2003

Problem overview

The last couple of posts have been a little on the negative side. There are more than a few reasons for this. The most obvious is that when I evangelize for OpenVMS to my management, and then get let down by Engineering in more ways than one, I tend to get depressed. And the OpenVMS Team at work ends up looking silly to boot.

The beginning of this problem goes back a few months. Nearly a year, actually. We had a developer writing a bit of kernel threaded code (with Pthreads). It turned out that what he was trying to do was unsupported, but even after getting the code into a supported state, he still managed to crash the development machine a number of times. And all of this was user mode code.

Engineering got rather excited about this, and offered us debug execlets to try and uncover where the problem actually was, because there was undoubtedly a problem with the operating system. And of course I was keen to put the debug code on because we needed the problem fixed. Let alone the thousands of installations out there that may be exposed to the same problem. So on it went.

Versions 1, 2, 3, 4, and 5. Each one crashed the machine in a different way. The first couple actually detected the problem, corrected it, and continued, only to crash moments later with fallout from the "correction". Our management was not happy with this. Not happy at all.

It got to the stage where management ordered me to remove the debug execlets from the development machine, with the edict that no beta or debug code ever be run on our systems again. And of course I did.

Now, some months later, we have another problem. It's just as catastrophic (i.e., the machine needs a reboot), and again Engineering are asking us to deploy some software on our machines. This time a production machine.

Unsurprisingly, management baulked. We had gone as far as filing a change control to deploy the software, but it didn't make the change control meeting. It got shot down before the change control board even got a look-in.

I'm not surprised that our management are taking this stance. But Engineering are sure acting like it's the biggest slap in the face they have ever had. Our problem has been downgraded in priority, and I have heard language like "until the customer wants to work with us, etc, etc...."

Well, we'd love to work with them. Unfortunately we have a business to run. I've suggested ways to reproduce the problem, and tried them on a uniprocessor. Engineering's response: the problem will only occur on an SMP box.

OK. They're the vendor. They should have plenty of SMP boxen lying around to replicate the problem. Go for it, boys.

My position has to be that nothing I do in user mode should crash or hang the box...

Posted at November 1, 2003 7:48 PM