14-Jan-2016

Are we getting closer to the bug?

In my previous post, which was terrifyingly over six months ago, I again touched on the issue that I described way back in June 2014 about directory renames sometimes producing catastrophically incorrect results. I thought I'd bring you up to date with what's happening...

Shortly after I published the update in late June 2015, Engineering got in touch with me to note that RMS has a process-wide directory path cache, which gets invalidated on every directory remove operation (delete or rename). Cache invalidation is based on a directory sequence number contained in the UCB of the disk involved. They suggested running some SDA commands to see whether, when the problem occurred, the field was not being updated.
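To make the invalidation scheme a little more concrete, here's a minimal user-mode sketch of a cache that is validated against a sequence number. It is only a toy model: toy_disk, toy_cache, cache_store, cache_lookup and lookup_after_rename are names I've made up for illustration, the file id 42 is arbitrary, and none of this is the actual RMS or XQP implementation. It simply shows why a sequence number that isn't bumped on a rename would leave a stale cache entry looking valid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-disk sequence number (not the real UCB). */
struct toy_disk {
    unsigned short dirseq;
};

/* Hypothetical one-entry path cache; records the dirseq seen when filled. */
struct toy_cache {
    int valid;
    unsigned short seen_dirseq;
    char path[64];
    int file_id;
};

/* Fill the cache entry and remember the current sequence number. */
static void cache_store(struct toy_cache *c, const struct toy_disk *d,
                        const char *path, int file_id)
{
    assert(path != NULL);
    c->valid = 1;
    c->seen_dirseq = d->dirseq;
    strncpy(c->path, path, sizeof c->path - 1);
    c->path[sizeof c->path - 1] = '\0';
    c->file_id = file_id;
}

/* Return the cached file id, or -1 if the entry is missing or stale. */
static int cache_lookup(const struct toy_cache *c, const struct toy_disk *d,
                        const char *path)
{
    if (!c->valid || c->seen_dirseq != d->dirseq)
        return -1;              /* sequence number moved on: entry is stale */
    if (strcmp(c->path, path) != 0)
        return -1;              /* different path: not in this tiny cache */
    return c->file_id;
}

/*
** Cache a path, simulate a directory rename, then look the old path up again.
** bump != 0 models the expected behaviour (rename increments dirseq);
** bump == 0 models the suspected fault (dirseq left unchanged).
*/
static int lookup_after_rename(int bump)
{
    struct toy_disk disk = { 0xc5ff };
    struct toy_cache cache = { 0 };

    cache_store(&cache, &disk, "[BRANCHES]XXXX.DIR", 42);
    if (bump)
        disk.dirseq++;
    return cache_lookup(&cache, &disk, "[BRANCHES]XXXX.DIR");
}
```

In this model, lookup_after_rename(1) returns -1 because the bumped sequence number invalidates the entry, while lookup_after_rename(0) still returns the old file id even though the directory has been renamed.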

The engineer obviously coded a set of SDA commands off the top of their head, as what was supplied didn't function. I corrected the DCL to make a functioning routine, but sent it back for confirmation that this was what the engineer was actually asking for. (As the disk involved is a shadow set, was the UCB in question the shadow set's UCB? The physical disk's? Some combination?)

Receiving no response, I promptly forgot about adding the SDA stuff to our production procedure that experiences the problem. Until the problem happened again...

I rewrote the SDA stuff as a kernel mode hack, to reduce the time taken and the potential for the output to change, and called it from the DCL. Here's the code, demonstrating how to read a mutex-protected memory location in kernel mode. (Warning: don't run this code if you don't understand what's going on. Mistakes or misguided attempts to debug kernel mode code will crash your system.)


#define __NEW_STARLET 1
#include <stdio.h>
#include <stdlib.h>
#include <ssdef.h>
#include <descrip.h>
#include <string.h>
#include <stsdef.h>
#include <ucbdef.h>
#include <ccbdef.h>
#include <psldef.h>
#include <exe_routines.h>
#include <ioc_routines.h>
#include <mutexdef.h>
#include <sch_routines.h>
#include <vms_macros.h>
#include <lib$routines.h>
#include <starlet.h>

/* Signal any failure status; do/while keeps the macro safe inside if/else. */
#define errchk_sig(arg) do { if (!$VMS_STATUS_SUCCESS(arg)) (void)lib$signal(arg); } while (0)


/******************************************************************************/
static int get_dirseq (int channel, unsigned short int *dirseq) {

extern PCB *CTL$GL_PCB;

static int r0_status;
static MUTEX *mutex;
static PCB *pcb_p;
static CCB *ccb_p;
static DT_UCB *ucb_p;

    if ((channel == 0) || (dirseq == NULL)) {
        return SS$_BADPARAM;
    }

    /*
    ** Ensure we can write the variable from the previous mode.
    */
    if (!(exe_std$probew (dirseq, sizeof (unsigned short int), PSL$C_KERNEL) & 1)) {
        return SS$_ACCVIO;
    }

    /*
    ** Get the address of our process control block.
    */
    pcb_p = CTL$GL_PCB;

    /*
    ** Lock I/O database mutex for read.
    */
    mutex = sch_std$iolockr (pcb_p);

    /*
    ** Get the channel control block from the channel provided.
    */
    r0_status = ioc$chan_to_ccb (channel, &ccb_p);
    if (r0_status & 1) {
        /*
        ** Valid channel.  Get the unit control block.
        */
        ucb_p = (DT_UCB *)ccb_p->ccb$l_ucb;
        *dirseq = ucb_p->ucb$w_dirseq;
    }
    sch_std$iounlock (pcb_p);
    (void)setipl (0);
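    /*
    ** Return the channel lookup status so an invalid channel is reported
    ** to the caller rather than being masked by an unconditional success.
    */
    return r0_status;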
}


/******************************************************************************/
int main (int argc, char *argv[]) {

static int r0_status;
static unsigned short int chan;
static unsigned short int dirseq;

static struct {
    int arg_count;
    int channel;
    unsigned short int *dirseq;
} arg_list;

static struct dsc$descriptor_s device_d = { 0,
                                            DSC$K_DTYPE_T,
                                            DSC$K_CLASS_S,
                                            NULL };

    if (argc < 2) {
        fprintf (stderr, "Usage: dirseq disk_name\n");
        return SS$_NORMAL;
    }

    device_d.dsc$w_length = strlen (argv[1]);
    device_d.dsc$a_pointer = argv[1];
    r0_status = sys$assign (&device_d,
                            &chan,
                            0,
                            0,
                            0);
    errchk_sig (r0_status);

    arg_list.arg_count = 2;
    arg_list.channel = chan;
    arg_list.dirseq = &dirseq;

    r0_status = lib$lock_image (0);
    errchk_sig (r0_status);

    r0_status = sys$cmkrnl (get_dirseq, (unsigned int *)&arg_list);
    errchk_sig (r0_status);

    r0_status = lib$unlock_image (0);
    errchk_sig (r0_status);

    printf ("Dirseq = %04x\n", dirseq);

    r0_status = sys$dassgn (chan);
    errchk_sig (r0_status);

    return SS$_NORMAL;
}

Unsurprisingly, when the problem next occurred, the directory sequence number remained unchanged over a directory rename command:


(02:45:04)$     mc sm_exe:dirseq disk_name
Dirseq = c5ff
(02:45:04)$     rename/log disk_name:[branches]xxxx.dir -
                           disk_name:[branches.xxxx_archived]20160111.dir
%RENAME-I-RENAMED, DISK_NAME:[BRANCHES]XXXX.DIR;1 renamed to DISK_NAME:[BRANCHES.XXXX_ARCHIVED]20160111.DIR;1
(02:45:04)$     mc sm_exe:dirseq disk_name
Dirseq = c5ff

The ball is now firmly in Engineering's court...

Posted at January 14, 2016 11:24 AM
Comments

Sporadic directory weirdness has been a long-standing feature. Sounds like they are getting closer.

Have you been told about settings of XQPCTLD7 and diagnostic versions of XQP?

I see there is a new XQP patch recently released. Worth a look.

Posted by: Ian at January 14, 2016 9:05 PM
