17-May-2003

Continuing the Bugcheck saga

Reviewing some website statistics this morning, I realised that I never followed up on an entry from last November, "Bugcheck traced". That and a previous entry talked about a pesky bugcheck we have been seeing in development for quite some time.

The crash appeared to happen when a developer was running a program that used multiple kernel threads to forward calls to an application in parallel.

It turned out that calling the product he was using from kernel threads was unsupported, but that was not the underlying cause. Something was clearly corrupting a PTE (page table entry), which is a kernel-protected data structure, and nothing the programmer was calling used elevated privileges, so the fault had to lie somewhere in the kernel.

After supplying me with five rounds of debug execlets, OpenVMS Engineering asked us if we could supply them with a full backup of the development cluster. Engineering then built a cluster to restore onto, restored our data, and booted.

Shortly thereafter, they were able to reproduce the problem at will. Unfortunately, the problem is still unidentified. The good news is that the investigation identified and corrected quite a few sections of kernel code that should have held a spinlock but didn't.

The problem is so elusive that, as I understand it, changing just a few lines in some of the debug execlets that Engineering have been using alters the timing just enough that the problem doesn't occur. Roll the execlet out and back it comes. How frustrating!

Hopefully, they'll nail it soon.

Posted at May 17, 2003 9:47 AM