Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO
see also:
    preparation of CONDOR jobs: createJob
    monitoring: mucMonitor, cascadeMonitor
    AB concepts, AB structure

HTCondor® Project, University of Wisconsin
"Like a vulture circling the desert, Condor scavenges for processing power that would otherwise be lost." Myron Livny, "Mr. Condor"

Condor: Batch queue system for DFO

Why do we need a batch queue system?

The dfos tool createJob creates a job file ($DFO_JOB_DIR/execAB_<mode>_<date>) which contains the processAB calls for all ABs. Since QC uses the concept of cascades and virtual calibrations, it is important to control the execution sequence of ABs. Otherwise an AB might be executed before the product files it requires are available.

A batch queue system (BQS) knows about these dependencies. In its simplest form (DRS_TYPE=CPL), the ABs are executed one-by-one in the sequence they show up in the execAB file. This is fully sufficient (and actually all you can do) on a single-processor system.

On a multi-core platform, you can gain a lot more efficiency but you need some more scheduling control. A BQS controls the proper sequence and manages the execution. By evaluating the dependencies between ABs, it finds the ones which are ready to execute (i.e. not blocked).
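The idea of finding the ABs that are ready to execute can be illustrated with a minimal Python sketch. It is not how Condor works internally; the AB names and the helper ready_jobs are hypothetical, and the dependency map is assumed to list, for each AB, the ABs whose products it needs.

```python
def ready_jobs(dependencies, finished):
    """Return the ABs that are not finished and whose prerequisites are all finished."""
    done = set(finished)
    return sorted(
        ab for ab, needs in dependencies.items()
        if ab not in done and set(needs) <= done
    )


# Hypothetical cascade: the flat needs the bias, the science AB needs both.
cascade = {
    "AB_bias": [],
    "AB_dark": [],
    "AB_flat": ["AB_bias"],
    "AB_science": ["AB_bias", "AB_flat"],
}

# Nothing has run yet: only ABs without prerequisites are unblocked.
first_wave = ready_jobs(cascade, finished=set())
# After bias and dark have finished, the flat becomes unblocked.
second_wave = ready_jobs(cascade, finished={"AB_bias", "AB_dark"})

print(first_wave)
print(second_wave)
```

On a multi-core node, all ABs returned by one such call could run in parallel, since none of them waits for another.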

The BQS is a layer between the dfos call to execute a batch of ABs (createJob/execAB), and the individual pipeline recipe calls (processAB/esorex/recipe):

createJob    
  BQS (within execAB)  
    process call of ABs (processAB -> esorex -> pipeline)

QC uses multi-core nodes and hence needs a BQS. Before 2013, we used two kinds of architecture: stand-alone, dual-core, single-user dfo blades, and a powerful cluster made of 20 dual-core blades in multi-user mode ("QC blades"). Since 2013, we use a total of seven multi-core blades ("muc blades"), in either multi-user or in single-user configuration. Each of the muc blades has 12 physical (24 virtual) cores.

The first BQS was developed as a prototype by SDD in 2004 and called REI (recipe execution interface). In 2005 it was realized that a mature BQS system was available on the market, with 20+ years of development and testing built-in: Condor. This system has been developed, and is maintained, at the University of Wisconsin. It was decided to use that system as the standard BQS for QC.

CONDOR in general

"Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion."

CONDOR can come in two flavours: personal (private) and shared. A shared CONDOR is what is installed in complex environments like a campus compute grid. On both the old and the new blades, we are using a 'personal' CONDOR installation (resources are shared among users on the same server but not beyond). A personal CONDOR installation does:

CONDOR can handle any kind of compute jobs, provided

QC uses CONDOR for processing ABs (mandatory) and QC reports (optional).

DAGs and job dependencies

A DAG is a Directed Acyclic Graph, an entity consisting of nodes and dependencies like "don’t run job B until job A has completed successfully". A cascade is a typical DAG. Each node can have any number of “parent” or “children” nodes – as long as there are no loops.
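The "no loops" condition can be checked mechanically. The following Python sketch uses a plain depth-first search; the node names and the helper find_loop are hypothetical and only illustrate the definition, they are not part of Condor or dfos.

```python
def find_loop(children):
    """Return one loop as a list of node names, or None if the graph is a DAG.

    `children` maps each node to the nodes that must wait for it.
    """
    UNVISITED, ON_PATH, EXPLORED = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for child in children.get(node, []):
            child_state = state.get(child, UNVISITED)
            if child_state == ON_PATH:
                # The child is already on the current path: a loop is closed.
                return path[path.index(child):] + [child]
            if child_state == UNVISITED:
                loop = visit(child)
                if loop:
                    return loop
        path.pop()
        state[node] = EXPLORED
        return None

    for node in list(children):
        if state.get(node, UNVISITED) == UNVISITED:
            loop = visit(node)
            if loop:
                return loop
    return None


# A diamond-shaped cascade is a valid DAG; a ring of three jobs is not.
valid = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
broken = {"A": ["B"], "B": ["C"], "C": ["A"]}

print(find_loop(valid))
print(find_loop(broken))
```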

Condor has a component to recognize and manage DAGs: the DAGMan (Directed Acyclic Graph Manager). DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
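For illustration, a DAGMan input file names each node with a JOB line and states dependencies with PARENT ... CHILD lines. The Python sketch below writes such lines from a dependency map. The node names, the submit file names and the helper dagman_lines are hypothetical, and whether the dfos tools generate a file of this kind is an assumption not confirmed here.

```python
def dagman_lines(dependencies, submit_file_for):
    """Build DAGMan input lines from a map {node: [nodes it must wait for]}."""
    # One JOB line per node: node name followed by its submit description file.
    lines = [f"JOB {name} {submit_file_for(name)}" for name in dependencies]
    # One PARENT ... CHILD line per node that has prerequisites.
    for child, parents in dependencies.items():
        if parents:
            lines.append(f"PARENT {' '.join(parents)} CHILD {child}")
    return lines


# Hypothetical three-node cascade; the submit file naming is an assumption.
deps = {"A": [], "B": ["A"], "C": ["A", "B"]}
dag = dagman_lines(deps, submit_file_for=lambda name: f"{name}.sub")

print("\n".join(dag))
```

A file with these lines would be handed to condor_submit_dag, which then releases each node only after all of its parents have completed successfully.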

CONDOR links:

All about CONDOR
CONDOR Users Manual 
In-house CONDOR presentation (Jens Knudstrup, December 2005)
Tutorials

Condor for QC

The current dfos system (running on the muc blades) exploits the following Condor capabilities:

Currently all processing tasks are created, scheduled and executed by autoDaily in automatic mode. Scheduling and execution are managed by Condor.

Syntax

The general syntax of CONDOR calls is:

<job ID>||<executable>||<arguments to executable>||<upload files>||<dependencies>

This is currently used in the execAB files which for DRS_TYPE=CON read:

<AB> || <$DFO_BIN_DIR>/processAB || -a <AB> -u <user> || <NONE, or comma-separated list of depending ABs>
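A line in this four-field form can be split into its parts with a few lines of Python. This is a sketch only: the helper parse_execab_line, the AB names and the path are hypothetical, and the last field is assumed to list the ABs that must finish before this one starts.

```python
def parse_execab_line(line):
    """Split one '||'-separated execAB line into its four fields."""
    fields = [field.strip() for field in line.split("||")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 '||'-separated fields, got {len(fields)}")
    job_id, executable, arguments, deps = fields
    # 'NONE' marks an AB without dependencies; otherwise a comma-separated list.
    if deps == "NONE":
        depends_on = []
    else:
        depends_on = [d.strip() for d in deps.split(",") if d.strip()]
    return {
        "job_id": job_id,
        "executable": executable,
        "arguments": arguments.split(),
        "depends_on": depends_on,
    }


# Hypothetical lines; real AB names and paths will differ.
bias = parse_execab_line("AB_bias || /path/to/bin/processAB || -a AB_bias -u qc || NONE")
flat = parse_execab_line("AB_flat || /path/to/bin/processAB || -a AB_flat -u qc || AB_bias, AB_dark")

print(bias["depends_on"])
print(flat["depends_on"])
```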

Macros

Condor comes with some built-in command-line tools useful to interact with the system.

condor_status
    Show the status of the condor pool (all machines/nodes in the pool).

condor_q
    Check the condor job queue. Shows all jobs submitted and in the condor queue with job ID, OWNER and SUBMITTED (date and time when the job was submitted). A total summary of the number of jobs and their status (idle/running/held) is also shown.
    Example: condor_q -l xshooter

condor_submit
    Submit new jobs.

condor_rm
    Remove jobs from the queue.
    Useful: condor_rm -forcex [constraints]
    where [constraints] are:
        cluster.proc        remove a given job ID
        cluster             remove a given cluster of jobs
        user                remove all jobs owned by user
        -constraint expr    remove all jobs matching the boolean expression
        -all                remove all jobs

condor_history
    Query the job history queue to show all jobs (ID/OWNER/SUBMITTED) submitted to condor (long!).
    Example: condor_history -l 13523 (see details of job with ID 13523)

condor_submit_dag
    Submit a job (or a cluster of jobs) that has dependencies.

condor_config_val
    Get or set values of the condor configuration parameters.
    Useful ones:
        condor_config_val            show options
        condor_config_val -config    print the locations of condor config files
        condor_config_val -owner     show owner of the condor_config_val process
        condor_config_val -dump      dump the current condor configuration

You will probably use only condor_rm, to abort a running job.

There is also a set of DFS-provided tools to interact with the system. Their names start with 'vultur' (mnemonic: a condor is a vulture) and they can be invoked on the command line. They are also used by dfos tools.

vultur_exec_cascade
    Launch a condor job.
    Parameters:
        --dagId     ID used to identify job call
        --tmpdir    temporary directory to store Job# subdirectory log files
    Example (the following call is written by createJob in JOBS_NIGHT):
        vultur_exec_cascade --dagId=CALIB_2006-04-20 --jobs=execAB_CALIB_2006-04-20 --wait

vultur_stop
    Gracefully stop an executing cascade (but note that it stops ALL of the user's jobs indiscriminately!).

vultur_watch_q
    Watch the condor queue. Shows, in one-second intervals, the number of jobs/idle/running/held.

vultur_cascade_stat
    Monitor the cascade entries.
    Example: vultur_cascade_stat --dagId=CALIB_2006-04-20

How to use

CONDOR comes as part of the standard DFS installation, in /etc/condor.

To run your pipeline under the batch queue system CONDOR, the following preparations are necessary: