QC Procedure: SL-53+VLTSW-2010+DFS-6 Upgrade of DFO Blades


The upgrade of the Quality Control [QC] operational computing environment [dfo21 -- dfo33 and qc01 -- qc20] from Scientific Linux 4.3, VLTSW-2007, DFS-5 [SL-43+VLTSW-2007+DFS-5] based systems to Scientific Linux 5.3, VLTSW-2010, DFS-6 [SL-53+VLTSW-2010+DFS-6] based systems, using the same hardware, namely so-called Blade Center nodes, has been both a long and a long-awaited process. It finally began (from the QC+SOS view) in Oct 2009 and represents a collaboration between:

  • the VLTSW group -- who provide and test the Operating System [OS] and Very Large Telescope Software [VLTSW] components
  • the DFS group -- who provide and test the Data Flow Systems [DFS] components
  • the SOS group -- who provide the hardware and make the OS and software installations
  • the QC group -- who test the integrated environment for QC Operations and ultimately use the systems for QC operations
Operational machines at Paranal were upgraded to SL-53+VLTSW-2010+DFS-6 at the start of P85, April 2010. QC dfoXX operational machines are now set to begin upgrade at the beginning of August 2010. All four groups contributed to the delay between the two.

Before Upgrade

Of course, with such a significant upgrade in OS and software, a number of things have changed, and a certain number of adjustments need to be made to QC files. DFOS software changes have already been implemented transparently (i.e. the modified DFOS software works on both SL-43+VLTSW-2007+DFS-5 and SL-53+VLTSW-2010+DFS-6 based systems).
The following changes to configuration files are intended to be transparent, i.e. they can be made before the upgrade is applied and should work on both pre- and post-upgrade systems.

  1. crontab:
    crontab -l > ${HOME}/crontab.SL-43+VLTSW-2007+DFS-5
    
  2. .qcrc:
    • DFS_RELEASE: If you use:
      export DFS_RELEASE=dfs
      
      then there is nothing to change.
      If you set DFS_RELEASE to a specific version, e.g.
      export DFS_RELEASE=dfs-5_6_7
      
      then add the following code immediately below this:
      [ -d ~flowmgr/dfs-6_0_3 ] && export DFS_RELEASE=dfs-6_0_3
      
      This will be ok until we have to upgrade the DFS version on the new systems, at which point more definitive editing should be made, e.g. comment out the DFS-5 line and simply set the DFS-6 line.
    • Condor: You should have something like:
      source /opsw/packages/vultur/config/bashrc.vultur.private
      export PATH=$PATH:/opsw/condor/bin:/opsw/condor/sbin
      
      Replace with:
      if [ -r /opsw/packages/vultur/config/bashrc.vultur.private ]; then
        ## This must be a pre DFS-6 system...
        source /opsw/packages/vultur/config/bashrc.vultur.private
        export PATH=$PATH:/opsw/condor/bin:/opsw/condor/sbin
      elif [ -r ~flowmgr/${DFS_RELEASE}/config/bashrc.vultur.private ]; then
        ## This must be a DFS-6 or later system...
        . ~flowmgr/${DFS_RELEASE}/config/bashrc.vultur.private
        export PATH=$PATH:/opt/condor/bin:/opt/condor/sbin
      fi
      
  3. XFCE
    • .vnc/xstartup: You should have something like:
      startxfce &
      
      Replace with:
      if which startxfce4 >/dev/null 2>&1 ; then
        startxfce4 &
      elif which startxfce >/dev/null 2>&1 ; then
        startxfce &
      fi
      
    • Migrate old menu to new one:
      export DFOS_FUNCTIONS=/home/uves/DFOS/bin/dfos.functions
      /home/uves/DFOS/bin/dfos.migrate.xfceUserMenu2xfce4
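
As a rough cross-check before the upgrade, the three steps above can be verified with a short script. This is only a sketch: check_pre_upgrade is a hypothetical helper (not part of DFOS), and the strings it greps for are assumptions based on the replacement snippets shown in steps 2 and 3. It does not look at DFS_RELEASE, since nothing needs changing there if you use DFS_RELEASE=dfs.

```shell
# Sketch only: report whether the pre-upgrade edits appear to be in place.
# check_pre_upgrade is a hypothetical helper, not part of DFOS.
check_pre_upgrade() (
  home_dir="${1:-$HOME}"
  status=0

  # Step 1: the saved crontab must exist and be non-empty.
  if [ -s "${home_dir}/crontab.SL-43+VLTSW-2007+DFS-5" ]; then
    echo "OK      crontab backup"
  else
    echo "MISSING crontab backup"
    status=1
  fi

  # Step 2: .qcrc should also know the DFS-6 Condor location (assumed marker).
  if grep -q '/opt/condor/bin' "${home_dir}/.qcrc" 2>/dev/null; then
    echo "OK      .qcrc Condor block"
  else
    echo "MISSING .qcrc Condor block"
    status=1
  fi

  # Step 3: .vnc/xstartup should try startxfce4 (assumed marker).
  if grep -q 'startxfce4' "${home_dir}/.vnc/xstartup" 2>/dev/null; then
    echo "OK      .vnc/xstartup"
  else
    echo "MISSING .vnc/xstartup"
    status=1
  fi

  # The function body runs in a subshell, so exit only ends the function.
  exit "$status"
)
```

Calling check_pre_upgrade with no argument inspects ${HOME}; a non-zero exit status means at least one item was reported as MISSING.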
      

After Upgrade

If you haven't already done step 1 above, it is too late. You must now do steps 2 and 3 above if you have not already done so. You can then proceed with:

  1. ssh to your upgraded dfoNN
  2. TMP_DIR:
    For many of us the DFOS environment variable TMP_DIR is set to something like /tmp/<inst>, and since /tmp was not migrated across, TMP_DIR will not in general exist. So do:
    [ ! -d "${TMP_DIR}" ] && mkdir -pv "${TMP_DIR}"
    
  3. /hsrmnt:
    Replace any and all references to /hsrmnt/home with simply /home in .qcrc, .dfosrc, .pecs/* and anywhere else you may have made a reference to it. If you are feeling brave and trust JP implicitly you could do:
    sed -i 's|/hsrmnt/home|/home|g' <file>
    
    for each <file> you know or think it might be in, or for the bold:
    sed -i 's|/hsrmnt/home|/home|g' .qcrc .dfosrc .pecs/*
    
  4. LD_LIBRARY_PATH: Check if LD_LIBRARY_PATH includes /vlt/FEB2007/NOCCS/lib. If so, find where this is set in .qcrc, .dfosrc, .pecs/* and remove it.
    grep LD_LIBRARY_PATH.\*/vlt/FEB2007/NOCCS/lib .qcrc .dfosrc .pecs/*
    
  5. logout and then ssh to your upgraded dfoNN again
  6. Firefox: Start Firefox, do NOT click any of the DFOS action buttons. Do the following to obtain the "expected" behaviours...
    1. To prevent firefox coming to the front when remote loading a file:
      - In the URL bar type "about:config"
      - Click Ok
      - type "Diverted" in the filter field
      - Double click on the "browser.tabs.loadDivertedInBackground" config key to set it to "true"
      
    2. To prevent firefox opening remote loaded pages in a new tab (or window)
      - In the URL bar type "about:config"
      - Click Ok
      - type "open_external" in the filter field
      - Double click on the "browser.link.open_external" config key and set it to 1
      
    3. To prevent firefox opening clicked links in a new tab (or window)
      - In the URL bar type "about:config"
      - Click Ok
      - type "open_newwindow" in the filter field
      - Double click on the "browser.link.open_newwindow" config key and set it to 1
      
  7. DFOS action buttons:
    Before clicking on any DFOS action button, first recreate the web page it is on with the appropriate tool.
  8. dfoMonitor:
    As a first check of basic health, run dfoMonitor once. Now clicking on the DFOS action buttons should be OK.
  9. SciSoft: conflicts with the VLTSW provided MIDAS and java. The best advice, I think, is not to use SciSoft, i.e. comment out (or remove completely) from .qcrc, .dfosrc and .pecs/* any line that includes:
    . /scisoft/bin/Setup.bash
    
    or
    source /scisoft/bin/Setup.bash
    
    If you really do need SciSoft, then use (something like) the following in .qcrc:
    saveMIDASHOME="${MIDASHOME}"
    saveMIDVERS="${MIDVERS}"
    ## SciSoft...
    if [ -f /scisoft/bin/Setup.bash ]; then
       . /scisoft/bin/Setup.bash > /dev/null 2>&1
    fi
    export MIDASHOME="${saveMIDASHOME}"
    export MIDVERS="${saveMIDVERS}"
    
    And then at the very END of .qcrc:
    ## Move every SciSoft entry to the end of PATH
    for P in $(echo $PATH | tr ":" "\n" | grep scisoft) ; do
      export PATH=${PATH/:${P}}:${P}
    done
    
  10. Python: pyqc was originally written to work with the Python provided by SciSoft. Since version 1.2.1 (or maybe earlier) it works with the QC customised Python version in /qcdp. If you have already migrated to use /qcdp there is nothing to do. If not, now is the time to do so.
  11. crontab:
    crontab ${HOME}/crontab.SL-43+VLTSW-2007+DFS-5
    
  12. Resume normal operations
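
Before resuming, it may be worth scanning your configuration files for leftovers from steps 3, 4 and 9 above. A minimal sketch, assuming a hypothetical helper named find_stale_refs (not part of DFOS) and a grep that supports -H, as GNU grep does:

```shell
# Sketch only: list leftover pre-upgrade references in the given files.
# find_stale_refs is a hypothetical helper, not part of DFOS.
find_stale_refs() (
  found=0
  for f in "$@"; do
    [ -f "$f" ] || continue
    # Patterns taken from steps 3, 4 and 9: the old /hsrmnt/home prefix,
    # the FEB2007 library path, and uncommented SciSoft setup lines.
    if grep -n -H -E '/hsrmnt/home|/vlt/FEB2007/NOCCS/lib|^[[:space:]]*(\.|source)[[:space:]]+/scisoft/bin/Setup\.bash' "$f"; then
      found=1
    fi
  done
  # Exit status 0 means the files look clean, 1 means something was found.
  # The function body runs in a subshell, so exit only ends the function.
  exit "$found"
)
```

For example, from your home directory: find_stale_refs .qcrc .dfosrc .pecs/*. Matching lines are printed with file name and line number. A SciSoft setup line that you kept deliberately (the guarded form in step 9) will be reported too.
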
While practically every piece of software we use has been upgraded, most of the changes are transparent to us. Some, however, are not; here is what I have found so far:
  • Pipelines: Initially the same versions of the Pipelines delivered to Paranal for April 2010 have been installed for the SL-53+VLTSW-2010+DFS-6 based systems.
  • The window manager sawfish is NOT available.
The Upgrade Procedure

The upgrade procedure was first developed on Xen Virtual Machines. Once the beta procedure was established, it was applied to dfo21, where it was fine-tuned in a collaboration between Alexis Huxley of SOS and John Pritchard of QC. Now that the final procedure is established, it is to be applied to the remaining DFO nodes, namely dfo22 -- dfo33.
The nodes will be upgraded between Aug 3rd and Aug 12th at a rate of about 2 per day. SL-53+VLTSW-2010+DFS-6 based versions of each dfoNN will be prepared offline (i.e. on available, identical hardware). Once these are ready and in coordination with the QC scientist concerned, the SL-43+VLTSW-2007+DFS-5 based version will be shut down, the 1Tb /diska disk removed from it and installed into the SL-53+VLTSW-2010+DFS-6 based version, and the SL-53+VLTSW-2010+DFS-6 based version will be started up. Unfortunately the downtime is liable to be of the order of 2-3hrs, as the 1Tb /diska disks are almost certainly all overdue for file system checks; depending on the disk usage, a check will take anywhere from a few mins to perhaps as much as 3hrs.
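
Whether a given /diska will actually trigger a check at boot can be estimated in advance from the mount counters that tune2fs -l reports. The following is a sketch under stated assumptions: that /diska is an ext2/ext3 file system, that only the mount-count criterion matters (the time-based check interval is ignored here), and that the device name in the usage comment is a placeholder. fsck_due is a hypothetical helper.

```shell
# Sketch only: read `tune2fs -l` output on stdin and report whether the
# mount-count based file system check is due. fsck_due is a hypothetical helper.
fsck_due() {
  awk -F: '
    /^Mount count:/         { count = $2 + 0 }   # numeric conversion skips padding
    /^Maximum mount count:/ { max = $2 + 0 }
    END {
      if (max <= 0)          print "mount-count check disabled"
      else if (count >= max) print "check due (" count "/" max " mounts)"
      else                   print "check not due (" count "/" max " mounts)"
    }'
}

# Usage (as root; /dev/sdb1 is an assumed device name for /diska):
#   tune2fs -l /dev/sdb1 | fsck_due
```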
In more detail, the procedure is:

  • The day before the scheduled upgrade of dfoNN (to be announced by SOS by email)
      No impact on operations
    1. Prepare a new machine offline
    2. rsync /diskb onto /diska (the 1Tb disk)
  • The Day
      No operations
    1. 2mins: Shutdown dfoNN
    2. 30mins: Physically remove /diska from dfoNN and install it into the new dfoNN
    3. 5mins-3hrs: Power on the new dfoNN
      - /diska will probably require a file system check, which can take 1-3hrs depending on the quantity of data on that disk.
    4. Release to QC scientist
    5. dfoNN now ready for operations
    6. QC scientist, having already followed the procedures given in the Before Upgrade section above, reactivates cronjobs (see the After Upgrade section above), and operations resume.
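
The timings quoted for "The Day" add up as follows. This is just a back-of-envelope sketch: downtime_range is a hypothetical helper, and the 5-180 minute range for power-on simply restates the 5mins-3hrs given in step 3.

```shell
# Sketch only: sum per-step durations given in minutes as "min-max".
# downtime_range is a hypothetical helper.
downtime_range() (
  lo=0
  hi=0
  for step in "$@"; do
    lo=$((lo + ${step%-*}))   # part before the dash: shortest duration
    hi=$((hi + ${step#*-}))   # part after the dash: longest duration
  done
  echo "${lo} ${hi}"
)

# Shutdown 2 min, disk swap 30 min, power-on with possible fsck 5-180 min.
downtime_range 2-2 30-30 5-180
```

This prints 37 212, i.e. a downtime of between roughly 40 minutes and 3.5 hours, depending on whether /diska needs a file system check.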