Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO
new:

v2.3:
- supports OPSHUB environment (no functional changes)

v2.3.4:
- download progress bar (OPSHUB)
- MAX_CORES configurable (OPSHUB: 1; others: 8)

see also:
- documentation about CONDOR
- AB concepts, AB structure

databases: none
dfos tools: none
output: execAB and execQC files; entries in job file
upload/download: none

createJob

OPSHUB: enabled for OPSHUB workflow

Description

This tool reads the list of created ABs and transforms it into an executable list of process jobs. It creates an execution file for recipe calls ($DFO_JOB_DIR/execAB_<MODE>_<DATE>). It also creates an execution file for QC report jobs ($DFO_JOB_DIR/execQC_<MODE>_<DATE>).

The tool presently supports one type of DRS (data reduction system) for pipelines, CPL, and two types of batch queue systems (BQS), CON and INT.

For historical reasons, all three are valid options for the configuration key DRS_TYPE. Effectively, CPL means serial processing, CON means batch queue processing (many independent jobs in parallel), and INT means serial processing (one AB at a time) but internal splitting into N jobs (N spectrographs in the case of MUSE). In operations, INT is the standard mode of the MUSE pipeline, while CON is the standard mode for all other cases.

For DRS_TYPE=CON the scheduler tool condor is called with the dfs script vultur_exec_cascade. For INT, the linux tool likwid-bin is used.

For the QC report calls, three cases of BQS are configurable in config.processQC (!): SER, PAR and IMPLICIT.

Subcascades. The tool supports creation of subcascades from a pre-existing cascade. This is useful if a certain AB has failed, or was not certified, and all depending ABs need to be reprocessed.

Batch queue system. A batch queue system is able to recognize and manage a set of ABs and their dependencies within a cascade, and to feed the cascade into a processing platform. Presently QC is using a farm of 9 multi-core blades called 'muc<nn>', and the open-source software HTCondor (commonly called CONDOR) is used for scheduling jobs for parallel execution.

CONDOR addresses all available CPUs (up to 8 for muc01-muc07, up to 30 for muc08 and muc09), hence increasing the efficiency of pipeline processing.

The QC jobs can also be executed in parallel, here even without the need to respect dependencies. Since QC reports are coded by the QC scientist, they do not strictly follow a common architecture. Depending on the way they are written, they might or might not execute in parallel.

The standard combination for DFOS (autoDaily) processing is BQS=CON. Sequential processing without CONDOR is also possible on the command line but should be used operationally only in exceptional cases (like for MUSE). Check here for more information about CONDOR.

Sequential scheduling versus scheduling through a batch queue system:

configuration in config.createAB (DRS_TYPE):
  sequential scheduling: INT (for muse@muc09); CPL (only for testing)
  batch queue system: CON
executing:
  sequential scheduling: one AB at a time (in 24 parallel threads for muse@muc09)
  batch queue system: multiple ABs at once
proper sequence controlled by:
  sequential scheduling: execAB file which is filled by createJob sequentially following the cascade
  batch queue system: the BQS evaluating the WAITFOR events (dependencies)
levels to recipe execution:
  sequential scheduling: processAB >> [likwid-bin >>] esorex >> recipe
  batch queue system: BQS >> processAB >> esorex >> recipe
invoking AB (both cases): processAB -a <AB>
pipeline calls (both cases): esorex

The tool writes the name of the execution file (execAB) into the off-line job file (JOBS_NIGHT for off-line processing, JOBS_AUTO for automatic processing with autoDaily). JOBS_NIGHT can be launched anytime off-line.

Find its place in the daily workflow here.

Execution mode for QC report jobs

The tool writes an execution file execQC containing processQC calls. These calls can be managed in three ways:
configuration in config.processQC (QCBQS_TYPE):
  sequential scheduling: SER
  scheduling through a batch queue system: PAR
  implicit scheduling (new with v1.6): IMPLICIT
executing:
  SER: one AB at a time
  PAR: multiple ABs at once
  IMPLICIT: within processAB, by plugin
sequence controlled by:
  SER: execQC file which is filled by createJob (but sequence does not matter)
  PAR: same (no dependencies for QC jobs)
  IMPLICIT: execAB file
levels to QC report creation:
  SER: processQC >> MIDAS/python/... (whatever is the platform for the QC procedure)
  PAR: BQS >> processQC >> MIDAS/python/...
  IMPLICIT: BQS >> processAB >> >> post_plugin >> MIDAS/python/...
invoking AB:
  SER: processQC -a <AB>
  PAR: processQC -a <AB>
  IMPLICIT: none

Note that using QCBQS_TYPE = PAR requires that your QC procedures can handle parallel execution correctly!

Preparation for updating HC plots. In incremental processing mode, the tool writes a call of the executable file execHC_TREND into the jobs file. This file is filled by autoDaily from two sources: qc1Parser (for the most recent opslog entries), and scoreQC (for the most recent QC1 parameters). After execution of execAB and execQC, it finally collects all trendPlotter calls triggered by any of these data sources. Thereby it ensures that all HC reports are updated in due time to feed back product quality information to Paranal SciOps.

Calls of getStatusAB. To keep the AB monitor up to date, the tool getStatusAB is called during execution. Its execution time and load are non-zero (although it has become much faster with v3.0), so it cannot run permanently. As a reasonable compromise, the tool is refreshed every 60 sec, and the web page is auto-refreshed every 300 sec.

Output

The tool produces the following list of calls in $DFO_JOB_DIR/JOBS_AUTO (if called by autoDaily) or $DFO_JOB_DIR/JOBS_NIGHT (if called on the command line) (<mode> and <date> are the command-line parameters):

Each entry lists the call (what), the condition (when) and the purpose (why):

what: dfoMonitor -m -q
when: always
why:  initial refresh of the dfoMonitor (-m: enable display for autoDaily execution messages; -q: quiet mode)

what: echo "starting raw data download ...";
      $DFO_JOB_DIR/rawDown_<mode>_<date>
when: always
why:  launch download file (for raw files): organized download of first 50 raw files required for the cascade (OPSHUB: all raw files); either in max. 8 parallel threads, or one by one (OPSHUB)

what: $DFO_JOB_DIR/mcalDown_<mode>_<date>
when: optional
why:  launch download file (for mcalib files): organized download of all mcalibs required for the cascade (optional, since in most cases they are already/still in $DFO_CAL_DIR)

what: getStatusAB -d <date> -r &
when: always
why:  launch AB monitor in recursive mode (every 60 sec; to reflect the processing status of the cascade)

what: vultur_exec_cascade --dagId=<mode>_<date> \
      --jobs=$DFO_JOB_DIR/execAB_<mode>_<date> --wait [DRS_TYPE=CON]
      OR
      $DFO_JOB_DIR/execAB_<mode>_<date>
when: always
why:  condor call of processing jobs as defined in execAB_<mode>_<date>; the --wait parameter makes sure that the cascade is terminated before the job queue continues

what: echo "vultur_exec_cascade for <date> done."
when: CON only
why:  feedback on the command line

what: cascadeMonitor -d <date> [autoDaily]
      OR
      xterm -e watch ... [command line]
      cascadeMonitor -D <date> &
when: always
why:  update the cascadeMonitor (in case you want to watch the execution details); update overview mode of cascadeMonitor

what: cd $DFO_JOB_DIR; export CALL_ENABLED=YES
when: always
why:  set environment to export the AB monitor and all linked information to the qcweb server

what: $DFO_JOB_DIR/execQC_<mode>_<date>
when: always
why:  call the QC job file

what: echo "cal_QC <date> `date +%Y-%m-%d"T"%H:%M:%S`" >> $DFO_MON_DIR/DFO_STATUS
when: always
why:  add cal_QC (or sci_QC) flag in $DFO_MON_DIR/DFO_STATUS file

what: getStatusAB -d <date> -F &
when: always
why:  refresh the AB monitor; -F: export content to the qcweb server (--> provide the complete processing information, incl. scores, to the calChecker and the HC monitor)

The files rawDown_<mode>_<date>, mcalDown_<mode>_<date>, execAB_<mode>_<date> and execQC_<mode>_<date> are created by createJob (and deleted by finishNight):

There is always only one call execAB_<MODE>_<DATE> per mode and date in the job file (same for execQC).

The rawDown_CALIB_<DATE> and mcalDown_CALIB_<DATE> files are created only for mode=CALIB.

rawDown. The rawDown_CALIB_<DATE> files have the following structure:

ngasClient -f GIRAF.2013-02-12T23:15:02.239.fits &
ngasClient -f GIRAF.2013-02-12T23:19:06.208.fits &
ngasClient -f GIRAF.2013-02-12T23:23:10.238.fits &
ngasClient -f GIRAF.2013-02-12T23:45:49.447.fits &
ngasClient -f GIRAF.2013-02-12T23:47:13.895.fits &
ngasClient -f GIRAF.2013-02-12T23:48:37.191.fits &
ngasClient -f GIRAF.2013-02-12T23:27:52.890.fits &
ngasClient -f GIRAF.2013-02-12T23:32:23.462.fits

ngasClient -f GIRAF.2013-02-12T23:50:42.742.fits &
ngasClient -f GIRAF.2013-02-12T23:55:06.383.fits &
ngasClient -f GIRAF.2013-02-13T10:06:04.870.fits &

etc.

They are intended as protection against the muc blades firing too many download jobs at once. Otherwise the condor dagman could release many (currently 8) download requests at once, potentially overloading ngasClient as well as the local decompression and HOTFLY processes. The download file allows up to 8 parallel downloads (a number that has been tested empirically) but not more. For the OPSHUB, files are downloaded one after the other (MAX_CORES=1): the connection is bandwidth-limited, so running more than one transfer process at a time offers no advantage.
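The batch structure of the rawDown file can be illustrated with a short sketch. It is not the createJob implementation: the function name make_rawdown is hypothetical, and keeping the very last call in the foreground is an assumption, since the listing above is truncated. Within each batch, all calls but the last run in the background, so that at most MAX_CORES downloads are started before the script blocks.

```shell
#!/bin/sh
# Hypothetical sketch: read raw file names (one per line) from stdin and
# print ngasClient calls in batches of at most $1 (MAX_CORES) downloads.
make_rawdown() {
    max_cores=$1
    awk -v n="$max_cores" '
        { name[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                if (i % n == 0 || i == NR) {
                    # last call of a batch (or of the file): foreground
                    print "ngasClient -f " name[i]
                    # separate batches by an empty line, as in the listing
                    if (i < NR && n > 1) print ""
                } else {
                    # all other calls of the batch: background
                    print "ngasClient -f " name[i] " &"
                }
            }
        }'
}
```

With MAX_CORES=1 (the OPSHUB case) every call ends up in the foreground, which gives the one-by-one download described above.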

The mcalDown file has a related but different purpose. It provides a controlled download of master calibration files which have already been transformed into headers (using cleanupProducts). With condor on the muc blades, it might happen that many master_flat ABs are fired at the same time and, under unfortunate circumstances, all attempt to download the same master_bias simultaneously. This is avoided by checking the MCALIB section of all ABs and downloading the required mcalibs for all ABs first, before the condor processing is started. The mcalDown files are only created if there is something to download, which under current standard conditions (incremental processing) happens only rarely.
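The idea of downloading each mcalib only once can be sketched as follows. This is an illustration under assumptions, not the tool's code: the function name missing_mcalibs is hypothetical, the input is assumed to be a plain list of mcalib file names already collected from the MCALIB sections of all ABs, and the calibration directory is assumed to be flat.

```shell
#!/bin/sh
# Hypothetical sketch: read mcalib file names (one per line, possibly
# repeated across ABs) from stdin and print each name exactly once,
# and only if the file is not yet present in the directory $1.
missing_mcalibs() {
    cal_dir=$1
    sort -u | while IFS= read -r name; do
        [ -n "$name" ] || continue                      # skip empty lines
        [ -f "$cal_dir/$name" ] || printf '%s\n' "$name"
    done
}
```

If the function prints nothing, no mcalDown file would be needed, which matches the statement that the file is created only when there is something to download.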

For the OPSHUB, the download jobs have a special feature, the progress monitor. It displays in a simple graphical way the total number of files to be downloaded and the current status of the downloads (one dot per downloaded file). This helps to follow the download status in a bandwidth-limited situation.
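A progress display of this kind could look like the sketch below. The exact layout used by the tool is not documented here, so the bracketed bar and the counter are assumptions, as is the function name progress_line; only the "one dot per downloaded file" rule is taken from the description above.

```shell
#!/bin/sh
# Hypothetical sketch: print one progress line for $1 downloaded files
# out of $2 files in total, with one dot per downloaded file.
progress_line() {
    done_n=$1
    total=$2
    bar=''
    i=0
    while [ "$i" -lt "$total" ]; do
        if [ "$i" -lt "$done_n" ]; then
            bar="${bar}."        # file already downloaded
        else
            bar="${bar} "        # file still pending
        fi
        i=$((i + 1))
    done
    printf '[%s] %s/%s\n' "$bar" "$done_n" "$total"
}
```

Calling progress_line after each finished transfer would redraw the line with one more dot.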

Monitor calls. In automatic calls (within autoDaily), the tool enters into JOBS_AUTO:

dfoMonitor -m -q              initial call of dfoMonitor
getStatusAB -d <date> -r      recursive call of AB monitor
...                           ...
cascadeMonitor -d <date>      final call of cascadeMonitor (details) when the cascade is finished
cascadeMonitor -D <date>      same, for the overview mode
CALL_ENABLED=YES              enabling the environment for exporting ABs and logs
getStatusAB -d <date> -F      final call of AB monitor with export of all logs and QC products
dfoMonitor -m -q              final call of dfoMonitor

In command-line mode, the tool enters into JOBS_NIGHT:

getStatusAB -d <date> -r                             recursive call of AB monitor
...                                                  ...
watch -n $WATCH_CADENCE cascadeMonitor -d <date>     xterm watch call of cascadeMonitor
getStatusAB -d <date>                                final call of AB monitor (no export of logs and QC products!)

How to use

Type createJob -h for on-line help, createJob -v for the version number,

createJob -m CALIB -d 2024-07-12

to create a pipeline processing job for all CALIB ABs from 2024-07-12,

createJob -m CALIB -d 2024-07-12 -a

to create a subcascade.

There is the option -F used if the tool is called by autoDaily (not on the command-line!) and only for mode CALIB:

createJob -m CALIB -d 2024-07-12 -F

This triggers the getStatusAB tool to be run with option -F (immediate transfer to qcweb).

Check the recipe call entries in $DFO_JOB_DIR/execAB_CALIB_2024-07-12, and the job submission entry in $DFO_JOB_DIR/JOBS_NIGHT.

Subcascade. A subcascade is created interactively: the flag -a causes the tool to ask the user about the AB name, and then creates a new job file from the existing execAB file. The new job file is called execAB_<mode>_<date>_sub and contains only those ABs and the corresponding waitfors which depend upon the specified parent AB. That job is ready to be launched in the same way as the main job file.
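The extraction of a subcascade amounts to a transitive closure over the dependencies. The sketch below illustrates this for an execAB file in the CON format (`<job_id> || processAB || -a $AB -u $USER || <dependencies>`). Several points are assumptions and not taken from the tool: the function name subcascade, a dependency field that holds a comma-separated list of job IDs or NONE, and the choice to keep the parent AB itself in the output. Waitfors pointing to jobs outside the subcascade are dropped.

```shell
#!/bin/sh
# Hypothetical sketch: subcascade <ab_name> < execAB_file
# Prints the parent job and all jobs that depend on it, directly or not.
subcascade() {
    awk -v ab="$1" -F ' [|][|] ' '
        {
            gsub(/[ \t]/, "", $4)            # tolerate stray blanks in deps
            id[NR] = $1; cmd[NR] = $2; args[NR] = $3; deps[NR] = $4
            # the parent is the job called with "-a <ab_name>"
            if (index($3 " ", "-a " ab " ") > 0) keep[$1] = 1
        }
        END {
            # repeat until no further job is added (transitive closure)
            do {
                changed = 0
                for (i = 1; i <= NR; i++) {
                    if (id[i] in keep) continue
                    n = split(deps[i], d, ",")
                    for (j = 1; j <= n; j++)
                        if (d[j] in keep) { keep[id[i]] = 1; changed = 1; break }
                }
            } while (changed)
            # print kept jobs, with waitfors restricted to kept jobs
            for (i = 1; i <= NR; i++) {
                if (!(id[i] in keep)) continue
                n = split(deps[i], d, ",")
                out = ""
                for (j = 1; j <= n; j++)
                    if (d[j] in keep) out = (out == "" ? d[j] : out "," d[j])
                if (out == "") out = "NONE"
                print id[i] " || " cmd[i] " || " args[i] " || " out
            }
        }'
}
```

Redirecting the output into execAB_<mode>_<date>_sub would give a job file of the kind described above.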

The typical process calls in execAB_CALIB_2024-07-12 look as follows (for DRS_TYPE = CPL):

processAB -a <ab_name1>
processAB -a <ab_name2>
etc.

The typical QC report calls in execQC_CALIB_2024-07-12 look as follows (for QCBQS_TYPE = SER):

processQC -a <ab_name1>
processQC -a <ab_name2>
etc.

The typical entry in the job file, for the above call and DRS_TYPE = CPL, looks like:

# pipeline jobs for CALIB, 2024-07-12
getStatusAB -d 2024-07-12 -r &
$DFO_JOB_DIR/execAB_CALIB_2024-07-12

$DFO_JOB_DIR/execQC_CALIB_2024-07-12
getStatusAB -d 2024-07-12 -F &

The DRS type is read from config.createAB, the QCBQS_TYPE from config.processQC. The getStatusAB calls provide an up-to-date view of the processing situation every 60 sec as long as the batch is executing. A final call of getStatusAB occurs with flag -F, meaning that the results (the AB monitor and all log and plot files) are written to the QC web server qcweb.
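How the job-file entry differs between DRS types can be sketched as below. This is a simplified illustration: the function name job_entry is hypothetical, and the download, dfoMonitor and cascadeMonitor calls are left out. The vultur_exec_cascade call with --dagId, --jobs and --wait is copied from the table of calls in the Output section.

```shell
#!/bin/sh
# Hypothetical sketch: job_entry <mode> <date> <DRS_TYPE>
# Prints the core lines of the job-file entry for the given DRS type.
job_entry() {
    mode=$1
    day=$2
    drs_type=$3
    case $drs_type in
        CON|CPL|INT) ;;
        *) echo "unknown DRS_TYPE: $drs_type" >&2; return 1 ;;
    esac
    printf '# pipeline jobs for %s, %s\n' "$mode" "$day"
    printf 'getStatusAB -d %s -r &\n' "$day"
    if [ "$drs_type" = CON ]; then
        # batch queue: hand the execAB file over to condor and wait
        printf 'vultur_exec_cascade --dagId=%s_%s --jobs=$DFO_JOB_DIR/execAB_%s_%s --wait\n' \
            "$mode" "$day" "$mode" "$day"
        printf 'echo "vultur_exec_cascade for %s done."\n' "$day"
    else
        # sequential: execute the execAB file directly
        printf '$DFO_JOB_DIR/execAB_%s_%s\n' "$mode" "$day"
    fi
    printf '\n'
    printf '$DFO_JOB_DIR/execQC_%s_%s\n' "$mode" "$day"
    printf 'getStatusAB -d %s -F &\n' "$day"
}
```

For example, job_entry CALIB 2024-07-12 CPL reproduces the entry shown above.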

Re-creation of ABs. To support the re-create mode of createAB, the tool createJob has an option -r <job_id>, where <job_id> is the name of the job file to be created. This option is used only within createAB, not in normal operations. The added functionality is rather small: the tool works on a job_id (with suffix recreate) that is not determined by the tool itself but by createAB.

Installation

Use dfosExplorer, or type dfosInstall -t createJob.

Configuration file

config.createJob defines:

Section 1: general
key                 value     meaning
CAL_CLEANUP         YES|NO    YES: remove all raw FITS files after QC processing (CALIB only); optional, default: NO; useful only if disk space is an issue
PLUGIN              MY_PLUGIN optional name of PLUGIN (used before creating entry in JOB_FILE_NAME)
MAX_JOBS            1         OPSHUB: maximum number of jobs allowed to run in parallel, for DRS_TYPE=CON
FULL_RAWDOWNLOAD    YES       OPSHUB: YES|NO (NO); YES: wait with processing until all files are downloaded (for DRS_TYPE=CON and INT)
MAX_CORES           1 or 8    OPSHUB: 1; DFOS and PHOENIX: 8
 
Section 1b: MEF processing
Obsolete
 
Section 2: List of called helper applications
Obsolete

Status information

The tool writes 'cal_Queued' / 'sci_Queued' into $DFO_MON_DIR/DFO_STATUS.

Operational aspects

Workflow description

1. Find the AB list as $DFO_MON_DIR/AB_list_<DATE>

2. read DRS_TYPE from config.createAB and QCBQS_TYPE from config.processQC

3. loop on ABs:

3.1.1 mode=CALIB: check the ABs for files in the RAWFILE section which need to be downloaded before processing; if found, write the download calls into rawDown_CALIB_<DATE>, write call of rawDown into JOBS_NIGHT or JOBS_AUTO
3.1.2 mode=CALIB: check the ABs for files in the MCALIB section which need to be downloaded before processing; if found, write the download calls into mcalDown_CALIB_<DATE>, write call of mcalDown into JOBS_NIGHT or JOBS_AUTO

3.2 DRS_TYPE=CON: determine job IDs and waitfors (dependencies); DRS_TYPE=CPL/INT: determine dependencies and write processAB calls in proper sequence

3.3 create command-line recipe calls:

3.3.1 CPL: processAB -a $AB
3.3.2 CON: <job_id> || processAB || -a $AB -u $USER || <dependencies>
3.3.3 INT: processAB -a $AB -l

3.4 write result into execAB_<mode>_<DATE>

3.4.1 call PLUGIN if configured
3.4.2 write call of getStatusAB into JOBS_NIGHT or JOBS_AUTO (in recursive mode, -r)

3.5 write call of execAB into JOBS_NIGHT or JOBS_AUTO

3.6 write calls of cascadeMonitor into JOBS_NIGHT (as watch command) or JOBS_AUTO (as final call)

3.7 create command-line QC report calls:

3.7.1 SER: processQC -a $AB
3.7.2 PAR: <job_id> || processQC || -a $AB -u $USER || NONE
3.7.3 IMPLICIT: no call

3.8 write result into execQC_<mode>_<DATE>
3.9 write call of execQC file into JOBS_NIGHT or JOBS_AUTO
3.10 write call of getStatusAB into JOBS_NIGHT or JOBS_AUTO (with option -F if called by autoDaily, incremental mode)
3.11 only if called within autoDaily, incremental mode: write call of execHC_TREND into JOBS_AUTO

3.12 if CAL_CLEANUP=YES and mode=CALIB: write command into JOBS_NIGHT or JOBS_AUTO to clean up all raw files after QC processing

4. set DFO status to 'cal_Queued' or 'sci_Queued'

5. If flag -a is set (subcascade):

5.1 query for the AB name
5.2 extract the subcascade
5.3 exit
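
Steps 3.3 and 3.7 of the workflow (one line per AB in execAB and execQC) can be sketched as follows. The function names ab_call and qc_call are hypothetical, and so is the form of the job ID and of the dependency string; the line formats themselves are the ones listed in steps 3.3.1 to 3.3.3 and 3.7.1 to 3.7.3.

```shell
#!/bin/sh
# Hypothetical sketch: ab_call <DRS_TYPE> <job_id> <AB> <user> <dependencies>
# Prints the execAB line for one AB.
ab_call() {
    case $1 in
        CPL) printf 'processAB -a %s\n' "$3" ;;
        INT) printf 'processAB -a %s -l\n' "$3" ;;
        CON) printf '%s || processAB || -a %s -u %s || %s\n' "$2" "$3" "$4" "$5" ;;
        *)   echo "unknown DRS_TYPE: $1" >&2; return 1 ;;
    esac
}

# Hypothetical sketch: qc_call <QCBQS_TYPE> <job_id> <AB> <user>
# Prints the execQC line for one AB (QC jobs have no dependencies).
qc_call() {
    case $1 in
        SER)      printf 'processQC -a %s\n' "$3" ;;
        PAR)      printf '%s || processQC || -a %s -u %s || NONE\n' "$2" "$3" "$4" ;;
        IMPLICIT) : ;;   # no call: the QC report is created within processAB
        *)        echo "unknown QCBQS_TYPE: $1" >&2; return 1 ;;
    esac
}
```

Looping over the AB list and appending the output of these two functions to execAB_<mode>_<DATE> and execQC_<mode>_<DATE> corresponds to steps 3.4 and 3.8.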

Operational hints