20-Mar-2014

Poor man's PCA

As part of DECset, HP have a really useful bit of software called the Performance and Coverage Analyzer (PCA). The program is capable of recording program counter (PC) samples, and then displaying the lines of source code that the most frequently recorded PCs belong to. Unfortunately, I don't have a license.

So, how do I get a program to record its program counter on a regular basis? Read on.

Firstly, why do I want to do this? Well, we have a program that runs in production some time after 3 AM six days a week. Normally, it takes less than a couple of minutes to run. But on some occasions, it can take up to an hour. In both cases, the program terminates normally.

We run multiple streams of our overnight run, one for each business branch. After all branches are processed, a consolidation run wraps all the information up for a national overview, and then opens the branches for a normal day's trading. If one branch is delayed, it will affect all branches.

Unfortunately, all attempts to reproduce the slow processing have been unsuccessful on the test machines. So I thought if I can capture PC information when the program runs slowly, perhaps we can pinpoint where in the code the program is spending excessive time.

(Of course, we've already looked at lock traffic and other possible causes like lack of CPU. Nothing we've looked at so far seems to indicate a potential cause.)

One option to record PCs is of course to use the PCS SDA extension. But because the problematic program I'm attempting to debug only exhibits the issue occasionally in production sometime after 3 AM, I'd have to get up too early for potentially no result. System managers are lazy and want their sleep.

Originally, I was thinking of doing the same thing that PCA does, but oh man, do I really want to write an alternate debugger?

Then I remembered that there is an obscure run-time library routine that lets you declare a routine to be run as a one-time initialization routine. What if I wrote a routine that set a timer, and had the timer run another routine that collected the PC and wrote it out to a file? That sounded like the ticket!

The RTL routine is LIB$INITIALIZE, and it has some really nice features, including the ability to have its associated routine invoked without even modifying the code to be analyzed. All we'd need to do is re-link it.

With all this in mind, I wrote a module that contains a routine to set a timer, and the routine that would be called as an AST when that timer expired.
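To give a feel for the shape of that module without any VMS-specific calls, here is a rough POSIX analogue in C. It is not the code from the zip file: it swaps the VMS timer and AST for setitimer() and a SIGPROF handler, and the names start_sampling, stop_sampling and sample_ticks are made up for illustration. The handler only counts ticks; the real AST routine is where the PC collection would go.

```c
#define _XOPEN_SOURCE 700   /* for sigaction() and setitimer() */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Number of timer expirations seen so far (hypothetical name). */
static volatile sig_atomic_t sample_ticks = 0;

/* Plays the role of the AST routine: runs each time the timer fires. */
static void on_tick(int signo)
{
    (void)signo;
    sample_ticks++;
}

/* Install the handler and start a repeating profiling timer.
 * Returns 0 on success, -1 on failure. */
int start_sampling(long interval_usec)
{
    struct sigaction sa;
    struct itimerval tv;

    assert(interval_usec > 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    /* ITIMER_PROF counts CPU time used by the process and re-arms itself. */
    tv.it_interval.tv_sec = interval_usec / 1000000;
    tv.it_interval.tv_usec = interval_usec % 1000000;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

/* Disarm the timer. Returns 0 on success, -1 on failure. */
int stop_sampling(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}
```

Calling start_sampling(1000) early in the program would ask for a tick roughly every millisecond of CPU time. On POSIX the interval timer repeats by itself, so nothing in this sketch has to re-arm it.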

The AST routine is the one that does the deed and collects the PC. It uses some other really obscure services to walk the call frames, and symbolize the results. All PCs in system space (and in shared images) are ignored. That means the call frames are traversed until a PC is found that lies within the image (which is fine, because the image source code is linked monolithically). A file-backed section is used to record the PC values and counts.
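The filtering and counting part is easy to sketch in portable C. This is only an illustration of the idea, not the code from the zip file: the frame walk is assumed to have already produced an array of PCs (innermost first), the image address range is assumed to be known, and pc_table, first_image_pc, record_pc and record_sample are hypothetical names. In the real thing the table would live in the file-backed section rather than in ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 1024

/* One distinct PC value and how many times it was sampled. */
typedef struct {
    uint64_t pc;
    uint64_t count;
} pc_slot;

/* The table of samples, plus the address range of the main image. */
typedef struct {
    uint64_t image_start;   /* first address inside the image */
    uint64_t image_end;     /* first address past the image   */
    size_t   used;          /* number of slots in use         */
    pc_slot  slots[MAX_SLOTS];
} pc_table;

/* Walk the frame PCs, innermost first, and return the first one that
 * lies inside the image. Returns 0 if every frame is outside it. */
uint64_t first_image_pc(const pc_table *t, const uint64_t *frames, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (frames[i] >= t->image_start && frames[i] < t->image_end)
            return frames[i];
    }
    return 0;
}

/* Bump the count for pc, adding a new slot if needed.
 * Returns 1 on success, 0 if the table is full. */
int record_pc(pc_table *t, uint64_t pc)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->used; i++) {
        if (t->slots[i].pc == pc) {
            t->slots[i].count++;
            return 1;
        }
    }
    if (t->used == MAX_SLOTS)
        return 0;
    t->slots[t->used].pc = pc;
    t->slots[t->used].count = 1;
    t->used++;
    return 1;
}

/* One sample: pick the first in-image PC and count it.
 * Returns 1 if a PC was recorded, 0 otherwise. */
int record_sample(pc_table *t, const uint64_t *frames, size_t n)
{
    uint64_t pc = first_image_pc(t, frames, n);

    return pc != 0 && record_pc(t, pc);
}
```

The linear search keeps the sketch short. With a lot of distinct PCs, a hash on the PC value would be the obvious replacement.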

At the moment, this code is Itanium only, but I'll probably get around to making it work on Alpha in my copious free time.

The code is available as a zip file on my Downloads page.

Posted at March 20, 2014 2:59 PM