31-Mar-2009

Contract summary

Today is the last day of my current contract. I've had quite a bit of fun, plenty of learning and teaching opportunities, and my share of frustration and accomplishment in the nearly nine months I've spent working for this company. For a description of the high and low points, read on.

I was initially hired as a contractor because the local guru was on extended sick leave. The company was in the process of attempting to port an in-house COBOL application from AlphaServers to Itanium-based blade servers. And the project was stuck.

The application consists of nearly 2000 individual programs that were initially developed by a third party software developer and then sold to the company. The language was ANSI standard COBOL, and a lot of my frustrations stemmed from the legacy development environment and the "this is the way it's always been done" attitude when presented with possibly better ways of doing things.

Because the application was sold to the company many years ago, there was only a handful of programmers working on the codebase initially, and so no source control or automated build/dependency system was considered necessary. As the company grew, this lack has become a bit of a problem. The development team acknowledges the problem exists, but (due to hysterical historical reasons) expects the technical team to maintain the procedures that they do have, which is awkward when there are no programmers on the technical team.

So I walked into this situation on the premise that I was to do the infrastructure changes and the associated project planning to move the application from one platform to another. Note that I say "infrastructure changes". To me that means hardware and operating system configuration, network configuration, and possibly application environment setup.

It immediately became apparent that while they had new Itanium blades in place, nothing was happening. Within a couple of weeks, I had discovered that the parallel compiles and links that they thought were happening were completely broken on the Itanium development box. Which led to a dive into the DCL code that supports the development and QA environments. Wow, this was messy, but I fixed what was wrong with the build environment and learned a lot about how the application hung together. Which was just as well because...

I was tasked with building the application for deployment on the Itaniums, which was just a "slight" departure from infrastructure management. Having just completed a rummage through the build environment, I was extremely nervous about using any of the pre-existing procedures to build the new version. So I ended up writing a complete "build it from scratch" procedure that didn't depend on anything.
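
The core of it was not much more than "find every source file, compile it, link it". A minimal sketch of the idea, using invented logical names (SRC_DIR, OBJ_DIR, EXE_DIR, APP_LIBS) rather than the real locations:

$! Sketch only: compile and link every COBOL source found, with no
$! reliance on existing object libraries or incremental build state.
$! SRC_DIR, OBJ_DIR, EXE_DIR and APP_LIBS are illustrative logical names.
$ SET NOON                      ! one failed compile shouldn't abort the run
$ BUILD_LOOP:
$   FILE = F$SEARCH("SRC_DIR:[000000...]*.COB")
$   IF FILE .EQS. "" THEN GOTO BUILD_DONE
$   NAME = F$PARSE(FILE,,,"NAME")
$   COBOL /OBJECT=OBJ_DIR:'NAME'.OBJ 'FILE'
$   LINK /EXECUTABLE=EXE_DIR:'NAME'.EXE OBJ_DIR:'NAME'.OBJ, -
        APP_LIBS:APPLIB.OLB/LIBRARY
$   GOTO BUILD_LOOP
$ BUILD_DONE:
$ EXIT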

Obviously, one of the first questions out of my mouth when I was tasked with this project was "Please supply me with a list of programs to be deployed on the new platform". You'd think that this would be easy. Nope. Because the development environment employs no source control system, there is no formal way to determine just what is in production and what isn't. This might have worked nicely when the development team consisted of a couple of programmers, but it starts falling apart when you are maintaining a codebase of approximately 3000 COBOL programs totalling 2.3 million lines of code.

We iterated through the build and test phases a number of times, and when we went live, we still had missed at least one program that was in use in production. "Get a source control system," recommended Jim, a number of times...

Because the application was purchased, some code was not available. This included the library that performed all the standard screen handling, and a module that is involved in all I/O actions performed by the application. So these were binary translated onto the Itaniums.

When we started testing the newly built application using the overnight runs as a benchmark, it quickly became apparent that the performance on the Itaniums sucked, to put it bluntly. We had programs that were taking ten times longer to process the same amount of data. And of course, these programs are run multiple times (read: hundreds of times) a night.

Being Itaniums, I initially suspected that the number of alignment faults we were seeing was responsible for the slowdown, but when I started taking PC samples, to my surprise I found that the programs were spending an inordinate amount of time in the translated image environment (TIE) shareable image while performing I/O. I (theoretically) knew that binary translated images were a Bad Thing, but we didn't seem to have much choice here - no source code. But the problem was so bad that we had to consider the performance slowdown a show stopper, and we started looking at alternatives.

I suggested that the company approach the ex-directors of the now out-of-business software vendor and see if they could acquire the source code for the I/O module. Fortunately, this is what eventually occurred. In parallel with the search for the source code, I reverse engineered the module, and had a fairly comprehensive implementation when the original source was obtained. (Comparing what the original source did with my re-engineered version was interesting. Apart from some implementation issues surrounding data structures, the code was very close to what I'd written based on looking at the API.)

I updated the original source code of the I/O module to eliminate some errors due to the newer, less forgiving compiler (this was source code written in 1992, after all), fixed some alignment faults that were being generated as the I/O module was being called, and presto, the performance was comparable with the Alpha version.

I suggested that the company take the opportunity to acquire the source code for the screens library at the same time, but the ex-directors wanted money for this transaction, and it was decided to stick with the binary translation.

And finally, the system was built. And a compiler difference nearly became show stopper number two.

As part of the port, the operating system was being upgraded. The AlphaServers run OpenVMS 7.3-2, and the newer Itanium blades require at minimum OpenVMS 8.3-1H1. Rather than taking the conventional route of updating the AlphaServers, installing the newest compilers, and recompiling and testing on the AlphaServers before commencing the port, the company had simply attempted the port directly. I stepped in midway through the effort and of course recommended the standard approach, but my recommendation was not acted upon because it would have meant updating the production AlphaServers, delaying the port.

So it wasn't until some parallel testing was performed between the two platforms that the problem became apparent. It turns out that the older Alpha COBOL compiler was very permissive when it came to dealing with uninitialized decimal data variables. A number of data records were corrected, and again the project was back on track.

One of the primary reasons the port was stalled when I came on board was that a third party data query tool in extensive use was not being ported by its owner to Itanium. The development team thought that there was no point in testing until this issue was resolved. And the solution was really an obvious one: retain one Alpha in the production cluster to run these queries until such time as the tool was retired or another solution was found. This turned out to be an easy implementation: when an end user selects the menu option that starts the tool, the DCL redirects them to the Alpha so they can continue their query, and at completion logs them out again so they end up back in the menu system on the Itaniums.
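
The menu hook itself only needs a few lines of DCL. The sketch below is purely illustrative: the node name ALPHA1 and the menu procedure name are invented, and it assumes the account on the retained Alpha starts the query tool at login and logs the user out when they exit it.

$! Illustrative sketch of the menu option. ALPHA1 and APP_MENU are
$! invented names; the Alpha-side account is assumed to start the
$! query tool at login and log out when the user exits it.
$ IF F$GETSYI("ARCH_NAME") .EQS. "IA64"
$ THEN
$   WRITE SYS$OUTPUT "The query tool runs on the Alpha node; connecting..."
$   SET HOST ALPHA1      ! remote session ends when the user exits the tool
$ ENDIF
$! Control returns here after the remote session logs out, dropping
$! the user straight back into the menu system on the Itanium.
$ @APP_MENU:MAIN_MENU.COM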

The application finally went live for approximately 2000 end users Australia-wide, with only a few minor issues. The performance analysis that had been done certainly paid off, with the CPU requirements on the new platform being significantly less than on the AlphaServers.

My biggest problems during this contract were the resistance to change and the ownership issues with the existing build environment command procedures. Unfortunately, the scope of the contract didn't allow me to address either of these issues, and a lot of frustration resulted. But in the end, the application ended up on the new Itaniums, and one example of the comments about the transition was:

"A special mention should go to Jim Duff. Without Jim's tireless efforts, attention to detail and passion for the project and the technology this would not have been possible" - Applications Development Manager.

Oh, infrastructure management. I actually did get to do some. I configured and built the new production cluster from four Itanium-based BL860c blade servers. I put a new disaster recovery plan in place, and built a new Itanium cluster at the disaster recovery site to support it. I implemented NTP time synchronization. And performance monitoring software. And disk defragmentation software. And I wrote a nice little generic disk monitoring package with alerts and alarms. All this and I learned about CIFS (i.e., SAMBA) and configured it too!
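
The disk monitoring is the sort of thing DCL lends itself to. A minimal sketch of the free-space check follows; the device list, threshold, and alert recipient are placeholder values rather than anything that was actually deployed.

$! Sketch of the free-space check. The device list, threshold and
$! mail recipient are placeholders, not the deployed values.
$ THRESHOLD = 10                      ! alert below 10% free
$ DISKS = "DSA1:,DSA2:,DSA3:"
$ I = 0
$ CHECK_LOOP:
$   DISK = F$ELEMENT(I, ",", DISKS)
$   IF DISK .EQS. "," THEN GOTO CHECK_DONE
$   FREE = F$GETDVI(DISK, "FREEBLOCKS")
$   TOTAL = F$GETDVI(DISK, "MAXBLOCK")
$   PERCENT = FREE / (TOTAL / 100)    ! integer maths; avoids overflow on big volumes
$   IF PERCENT .LT. THRESHOLD
$   THEN
$     MAIL /SUBJECT="Disk space alert: ''DISK' at ''PERCENT'% free" NL: SYSTEM
$   ENDIF
$   I = I + 1
$   GOTO CHECK_LOOP
$ CHECK_DONE:
$ EXIT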

All in all, a challenging and rewarding contract, which I'm sad to see the end of. I hope the next job will be just as interesting.

Posted at March 31, 2009 9:14 AM
Comments

Jim, you are a blinding flash and a deafening report... (quote from an old Sci fi book).

I would hire you in a heart beat - you are "old school" on making things work - but nothing on the radar at the moment, will keep ears and eyes open...

q

Posted by: Peter Q at March 31, 2009 7:13 PM

Peter,

Glad to see someone still reads science fiction by Doc Smith.

Posted by: Jim Duff at April 1, 2009 7:40 AM
