18-Dec-2005

SAN problems

I've been a little quiet on the blog here of late as we have been struggling with a number of problems, the first of which appears to be a firmware interaction issue on our SAN.

We are running the following firmware:

Device           Description                      Firmware
HSG80            StorageWorks array controller    ACS 8.8-4
Silkworm 3200    Brocade 2Gb 16 port FC switch    3.2.0a
Silkworm 2800    Brocade 1Gb 16 port FC switch    2.6.2d
Emulex L9000     2Gb FC host bus adapter          CS3.92A2
Emulex L8000     1Gb FC host bus adapter          DS3.92A2

These revisions are the latest and greatest at this time.

Additionally, we are running OpenVMS 7.3-2, with FIBRE_SCSI-V0400 on (more on this in a minute).
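
If you want to check exactly which patch kits are installed on your own systems, the PCSI database records them; something along these lines will show it (the wildcard pattern is just illustrative):

    $ ! List installed products matching the fibre SCSI patch kits
    $ PRODUCT SHOW PRODUCT *FIBRE*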

The problem we are seeing is that not all fibrechannel paths are available to all hosts in the cluster. By this I mean we expect four paths to exist and be available (as the HSG80s have four ports), but we only see three (although some machines do show four).
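
For those playing along at home, the path count is easy to check from DCL, as SHOW DEVICE/FULL on a multipath device lists every path individually. The device name below is a placeholder, not one of ours:

    $ ! Display full device information, including the
    $ ! "I/O paths to device" count and the per-path status
    $ SHOW DEVICE/FULL $1$DGA100

On a healthy host this reports four paths; on the problem machines, three.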

We strongly suspect a firmware issue because we see the problem on both (independent) fibrechannel fabrics and across multiple physical machines. Interestingly, not all machines in the cluster have the problem, so I suppose it is conceivable that the issue is host-based. However, again, read on.

Our second problem is that we applied the FIBRE_SCSI-V0700 patch at the recommendation of HP, only to find that it caused CPUSPINWAIT crashes in partitions on our GS160 AlphaServers. Note that while the patch was on, we were still experiencing the path problem, which seems to rule out a host-based issue. Needless to say, we removed the patch, and actually thanked the Engineering Team for coming up with $ PRODUCT UNDO.
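
For anyone who hasn't met it, backing out a patch is a one-liner, provided the kit was installed with /SAVE_RECOVERY_DATA (a sketch, not our exact session):

    $ ! Roll back the most recently installed patch kits using
    $ ! the recovery data saved at installation time
    $ PRODUCT UNDO PATCH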

Information on this crash that we have been able to glean from Engineering indicates that FIBRE_SCSI-V0700 aggravates a known issue in IO_ROUTINES.EXE. As more I/O is done, non-paged pool is allocated from the lookaside lists in the RAD where the I/O takes place, eventually exhausting them. When this occurs, the operating system attempts to allocate non-paged pool from the other RAD, and when that request is not fulfilled quickly enough, it crashes the machine.
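
You can watch the broad pool picture from DCL (the detailed per-RAD lookaside breakdown needs SDA and its SHOW POOL command):

    $ ! Display non-paged pool usage, free space, and
    $ ! expansion statistics on the running system
    $ SHOW MEMORY/POOL/FULL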

The reason this particular machine crashes is that it does the vast majority of our backups, and so performs the most I/O in parallel.

Before we removed the patch, we tried things like playing with the SYSGEN parameters NPAG_GENTLE and NPAG_AGGRESSIVE in an attempt to reclaim pool, and also tried allocating more non-paged pool in the RAD that (apparently) was always exhausting it (we have since found that the exhaustion can occur in either RAD).
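
Both parameters are dynamic, so they can be adjusted on a running system. A session looks something like this; the values are examples only, not a recommendation:

    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SHOW NPAG_GENTLE
    SYSGEN> SET NPAG_GENTLE 50
    SYSGEN> SET NPAG_AGGRESSIVE 90
    SYSGEN> WRITE ACTIVE
    SYSGEN> EXIT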

As if this weren't enough, we have a recently modified user-mode image that is crashing a machine in the same cluster on a reasonably regular basis. All the image does is use RMS to extract some information from some indexed files and write a report. However, every second or third night, the machine running it will crash with a system service exception. Note again that we have had crashes on multiple machines, so this looks like a latent bug in the operating system.
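
For the record, the first pass at each of these crashes is the usual SDA session against the dump (assuming the default dump file location):

    $ ANALYZE/CRASH SYS$SYSTEM:SYSDUMP.DMP
    SDA> CLUE CRASH
    SDA> SHOW CRASH

CLUE CRASH gives the one-screen summary (bugcheck type, failing PC, current process), which is where the system service exception shows up.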

We are certainly keeping Engineering busy...

Posted at December 18, 2005 6:12 PM
Comments

Interesting indeed, you may want to look at using Emulex LP10000 HBAs and updating SRM to 7.0B. I'm curious, have you seen any voltage weirdness coming from MBM where it changes from 1.8 to 1.7 and so on? I'm trying to remember if GS160s use MBM?

This same io_routines problem was seen on our ES40 cluster. We have since upgraded to an ES80, and the problem disappeared on that hardware. We did also see some problems with voltage and bad VRM modules on our ES80; after applying SRM updates and firmware to the HBAs, and replacing a faulty VRM module, that problem was resolved too.

Posted by: Bill at December 20, 2005 11:37 AM

Thanks for the info on the IO_ROUTINES issue, Bill. Good to hear that we're not alone.

On the firmware side, we will certainly be rolling out SRM 7.0 ASAP. For better or worse, we try to minimise change (both hardware and software) on the production cluster during the last few months of the year, as this is a critical time for us. The path problem is not crippling, just worrying, and I will cheer if it's resolved by an SRM update. However, as far as I'm aware, HP still have no idea as to the cause, even after we captured some traces with a fibrechannel analyser today.

As I get information, I'll post it here...

Posted by: Jim Duff at December 20, 2005 5:45 PM

The MBM thing: yes, GS160s have MBM, but I won't play with it while production instances are running on that hardware. Hmmm, I could call SYS$GETSYI to display the power vector...
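
If the item is exposed at the DCL level as well as to programs (I haven't checked, and the item name below is a guess on my part), it would be as simple as:

    $ WRITE SYS$OUTPUT F$GETSYI("POWER_VECTOR")

Failing that, a few lines of code calling SYS$GETSYI directly would do it.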

I'll look at this tomorrow, although it is certainly not the cause of any of the issues listed above, as the problems are occurring on multiple physical GS160s.

Posted by: Jim Duff at December 20, 2005 6:02 PM

Turns out that GS160s don't support the power vector item code for SYS$GETSYI. Seems hardware and software engineering need to talk to each other a little more.

Posted by: Jim Duff at December 21, 2005 1:33 PM

Comments are closed