12-Jul-2012

XQPERR crash on 8.4 reproducer?

Recently I published an XQP crash footprint on IA64 8.4. The cause of the crash is quite strange: something appears to be mishandling FCBs that describe a directory index when they are placed on the LRU buffer list.

The FCB contains a status longword with the two bits of interest being FCB$V_ISDIR and FCB$V_DIR. These two bits indicate that the FCB describes a directory and is on the LRU list, respectively.
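To make the two flags concrete, here is a minimal Python sketch of testing bits in a status longword. The bit numbers are hypothetical, chosen only for illustration; the real positions live in the VMS FCB definitions, which this post does not quote. The helper names are likewise made up.

```python
# Hypothetical bit numbers, for illustration only. The real values come
# from the VMS FCB structure definitions and are NOT quoted in this post.
FCB_V_ISDIR = 0  # assumed: FCB describes a directory
FCB_V_DIR = 1    # assumed: FCB is on the LRU list


def bit_set(status: int, bit: int) -> bool:
    """Return True if the given bit number is set in the status longword."""
    return (status >> bit) & 1 == 1


def describe(status: int) -> dict:
    """Decode the two bits of interest from a status longword."""
    return {
        "is_directory": bit_set(status, FCB_V_ISDIR),
        "on_lru": bit_set(status, FCB_V_DIR),
    }


def is_anomalous(status: int, found_on_lru: bool) -> bool:
    """An FCB found on the LRU list without the DIR bit is the odd case."""
    return found_on_lru and not bit_set(status, FCB_V_DIR)
```

With these assumed positions, a status of binary 01 found on the LRU list would be flagged as anomalous, while binary 11 would not.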

The issue appears to be that under certain circumstances, FCBs get placed on the LRU list without the FCB$V_DIR bit being set. This should not be possible (?)

When the kernel needs a new FCB, it scans the LRU list for the first one it can use. In the case of the crash, none of the FCBs on the LRU list met the selection criteria, because none of them had the FCB$V_DIR bit set. The system failed an assert, and down we went.
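The failure mode can be modelled roughly as follows. This is a toy Python sketch, not the XQP's actual code: the mask value, the `Fcb` class, and the function names are all assumptions, and the only selection criterion modelled is the one described above (the DIR bit must be set).

```python
from dataclasses import dataclass

DIR_MASK = 1 << 1  # hypothetical mask standing in for FCB$V_DIR


@dataclass
class Fcb:
    name: str
    status: int


def pick_reusable(lru: list):
    """Return the first FCB on the list with the DIR bit set, else None."""
    for fcb in lru:
        if fcb.status & DIR_MASK:
            return fcb
    return None


def take_fcb(lru: list) -> Fcb:
    """Remove and return a reusable FCB; fail loudly if none qualifies."""
    victim = pick_reusable(lru)
    if victim is None:
        # Stands in for the failed assert that took the system down.
        raise AssertionError("no eligible FCB on the LRU list")
    lru.remove(victim)
    return victim
```

In this model a list containing at least one FCB with the bit set yields that FCB, whereas a list made up entirely of FCBs without it raises the error, which mirrors the situation seen in the crash.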

The problem started to get really interesting when I found that the situation was only occurring on one node of the cluster. With some tips from Engineering, I wrote a command procedure to use SDA to report on these anomalous FCBs. One of the interesting things about them was that all of them were on a single disk.

After a chance remark from the other VMS guy here, I started to think about what created and deleted directories in a certain pattern on that disk, and came to the realisation that we were using the RENAME command to perform file archiving on that disk by renaming the containing directory files. This behaviour matched the appearance of the bad FCBs very well.

With this in mind, it was relatively easy to write a command procedure that reproduced the problem (sort of). In the real issue, the errant FCBs seem to stick around for quite some time. Earlier today, I had 221 of them, and they weren't going anywhere. However, as I write this, the count of bad FCBs is back to zero, but the total number on the LRU list is equal to ACP_DINDXCACHE. Could there be a correlation there?

In the reproducer code, we go from zero bad FCBs to two bad ones. Seconds later, they're gone.

It also appears that the problem is related to ODS-5 disks, as I've tested against ODS-2 and the problem does not occur. I've also tested it against Alphas running 7.3-2 and 8.3, and neither of these systems demonstrates the problem. I wish I still had an 8.3-1H1 IA64 system here ;)

Let's hope Engineering can now reproduce this. I'll post more as it comes to hand.

Posted at July 12, 2012 5:44 PM
Comments

Congratulations on your persistence. If only all HP customers would help this way, or at least report a problem rather than just whine about it on the internet, then more progress could be made :-)

Sounds like a fun one. I can guess at some of the people looking at this and may ask...

Posted by: Ian at July 12, 2012 7:33 PM
