28-Jun-2003

Galaxy Pros and Cons

Supporting our business intelligence data warehouse, we have a pair of GS160 AlphaServers. They are fully decked out with 16 CPUs and 64 GB of memory each. Additionally, they have fully redundant networking, disk controllers, and inter-node communication channels.

As they run OpenVMS, they can take advantage of Galaxy, the soft partitioning method available on the operating system, and each machine is broken up into two soft partitions.

This offers some nice advantages, not least among them the ability to dynamically reallocate a CPU from one partition to the other.

However, there are some gotchas.

Let's talk about the nice things first. Being able to dynamically reallocate a CPU from one partition to another with zero visibility to the end users is fantastic. It gets busy on your batch processing partitions at night? Simply move some CPUs over and move them back for the daytime load. Unexpected online load on the ad hoc inquiry partitions during the day? Move some CPUs from the database partitions.
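To make the day/night shuffle concrete, here is a minimal Python sketch that works out which CPUs would have to move to get from one allocation to another. It is only a planning model: the partition names (DB, BATCH), the CPU counts, and the function name plan_cpu_moves are made up for illustration, and it assumes every partition keeps at least one CPU. It does not talk to OpenVMS; the actual reassignment would still be done with the operating system's own Galaxy tools.

```python
from typing import Dict, List, Tuple


def plan_cpu_moves(current: Dict[str, int],
                   target: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (source, destination, count) moves turning `current` into `target`.

    Partition names and counts are hypothetical; this only plans the moves.
    """
    if set(current) != set(target):
        raise ValueError("current and target must name the same partitions")
    if sum(current.values()) != sum(target.values()):
        raise ValueError("total CPU count must not change")
    if any(count < 1 for count in target.values()):
        # Assumption: a running partition always keeps at least one CPU.
        raise ValueError("every partition must keep at least one CPU")

    # Partitions with CPUs to give away, and partitions that need more.
    surplus = {p: current[p] - target[p] for p in current if current[p] > target[p]}
    deficit = {p: target[p] - current[p] for p in current if target[p] > current[p]}

    moves: List[Tuple[str, str, int]] = []
    for dst in sorted(deficit):
        need = deficit[dst]
        for src in sorted(surplus):
            if need == 0:
                break
            count = min(need, surplus[src])
            if count == 0:
                continue
            moves.append((src, dst, count))
            surplus[src] -= count
            need -= count
    return moves


# Hypothetical split of one 16-CPU machine into two partitions.
day = {"DB": 8, "BATCH": 8}
night = {"DB": 4, "BATCH": 12}

for src, dst, count in plan_cpu_moves(day, night):
    print(f"move {count} CPU(s) from {src} to {dst}")
```

Swapping the arguments gives the plan for moving the CPUs back in the morning.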

Soft partitioning with Galaxy also gives you the advantage of shared memory. "So what?" you ask. OpenVMS can do some extremely cool things with shared memory. If your partitions are networked, the OS will construct a "network" device in shared memory. This means that all network traffic between the partitions within the machine uses the shared memory "network" instead of going out via whatever physical network interface you have.

The same thing happens with cluster communications if the partitions are clustered together. Shared memory is a lot faster than Memory Channel II or Gigabit Ethernet.

Other Galaxy-aware software can also take advantage of shared memory. In particular, Oracle Rdb can do row caching on multiple Galaxy partitions that are clustered together. Normally, row caching is permitted only if the database is open on a single node.

So what are the downsides? The GS160 does have a couple of single points of failure, and either one is capable of taking down all the Galaxy partitions in the one machine.

The first SPOF is an obvious one: the hierarchical switch that joins the "fireboxes". If you have a GS160, you will notice that the CPUs and memory are housed in what are called QBBs, or quad building blocks. These hold up to four CPUs and four memory modules, plus a few other things. Two QBBs back to back form a firebox. Joining the CPUs in the two fireboxes is the hierarchical switch. If this switch fails, down go all Galaxy partitions.

The second SPOF is memory known as "director DIMMs". This memory is used by the architecture as part of the memory management system. If a director DIMM that is controlling part of the shared memory fails (or, for that matter, a DIMM supplying the pool of shared memory fails), wave goodbye to all the partitions.

Of course, both these points of failure can be eliminated by hard partitioning, but that's another story.

So, do the advantages of soft partitioning with Galaxy outweigh the two single points of failure? I think so. Then again, the choice is easy for me, as I have two GS160s in our cluster.
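The reasoning behind "the choice is easy with two machines" can be sketched with some back-of-the-envelope arithmetic. The figures below are placeholders picked purely for illustration, not measured GS160 numbers, and the model assumes the two failure points are independent and that either machine on its own can keep the cluster running.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a system that needs ALL of its components up."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Availability of a group that needs AT LEAST ONE member up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Placeholder figures for illustration only, not real hardware data.
hierarchical_switch = 0.999
shared_memory_dimms = 0.999

# One machine: both single points of failure must stay healthy.
one_machine = series(hierarchical_switch, shared_memory_dimms)

# Two machines in the cluster: losing one still leaves the other.
two_machines = redundant(one_machine, one_machine)

print(f"one machine:  {one_machine:.6f}")
print(f"two machines: {two_machines:.6f}")
```

Whatever the real numbers are, the shape of the result is the same: a second machine turns each machine-wide failure from an outage into a loss of capacity.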

Posted at June 28, 2003 8:06 PM