Common DFOS tools:
|dfos = Data Flow Operations System, the common tool set for DFO|
|see also:|
|preparation of CONDOR jobs: createJob|
|monitoring: mucMonitor, cascadeMonitor|
|AB concepts, AB structure|
The dfos tool createJob creates a job file ($DFO_JOB_DIR/execAB_<mode>_<date>) that contains the processAB calls for all ABs. Since QC uses the concept of cascades and virtual calibrations, it is important to control the execution sequence of ABs. Otherwise an AB might be executed before the product files it requires are available.
A batch queue system (BQS) knows about these dependencies. In its simplest form (DRS_TYPE=CPL), the ABs are executed one-by-one in the sequence they show up in the execAB file. This is fully sufficient (and actually all you can do) on a single-processor system.
On a multi-core platform, you can gain a lot more efficiency but you need some more scheduling control. A BQS controls the proper sequence and manages the execution. By evaluating the dependencies between ABs, it finds the ones which are ready to execute (i.e. not blocked).
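The selection of non-blocked ABs can be illustrated with a minimal Python sketch. It is not the Condor implementation; the function name `ready_abs` and the AB names are hypothetical, and the dependencies are assumed to be available as a simple mapping from each AB to the ABs it waits for.

```python
def ready_abs(deps, done, running=()):
    """Return the ABs that are ready to execute (i.e. not blocked).

    deps    : dict mapping each AB name to the list of ABs it depends on
    done    : ABs that have already finished successfully
    running : ABs that are currently executing
    """
    done = set(done)
    started = done | set(running)
    return sorted(
        ab
        for ab, parents in deps.items()
        # not yet started, and every parent AB has finished
        if ab not in started and set(parents) <= done
    )


# Hypothetical cascade: the flat needs the bias, the science AB needs both.
deps = {
    "BIAS_AB": [],
    "FLAT_AB": ["BIAS_AB"],
    "SCIENCE_AB": ["BIAS_AB", "FLAT_AB"],
}

print(ready_abs(deps, done=[]))
print(ready_abs(deps, done=["BIAS_AB"]))
```

On a multi-core node, every AB returned by such a call could be started in parallel, up to the number of free cores.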
The BQS is a layer between the dfos call to execute a batch of ABs (createJob/execAB), and the individual pipeline recipe calls (processAB/esorex/recipe):
|BQS (within execAB)|
|process call of ABs (processAB -> esorex -> pipeline)|
QC uses multi-core nodes and hence needs a BQS. Before 2013, we used two kinds of architecture: stand-alone, dual-core, single-user dfo blades, and a powerful cluster made of 20 dual-core blades in multi-user mode ("QC blades"). Since 2013, we use a total of seven multi-core blades ("muc blades"), in either multi-user or in single-user configuration. Each of the muc blades has 12 physical (24 virtual) cores.
The first BQS was developed as a prototype by SDD in 2004 and called REI (recipe execution interface). In 2005 it was realized that a mature BQS was already available, with 20+ years of development and testing built in: Condor. This system has been developed, and is maintained, at the University of Wisconsin. It was decided to use it as the standard BQS for QC.
CONDOR in general
"Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion."
Personal/private
CONDOR can come in two flavours: personal and shared. A shared CONDOR is what is installed in complex environments like a campus compute grid. On both the old and the new blades, we are using a 'personal' CONDOR installation (resources are shared among users on the same server but not beyond). A personal CONDOR installation does:
CONDOR can handle any kind of compute jobs, provided
QC uses CONDOR for processing ABs (mandatory) and QC reports (optional).
A DAG is a Directed Acyclic Graph, an entity consisting of nodes and dependencies like "don’t run job B until job A has completed successfully". A cascade is a typical DAG. Each node can have any number of “parent” or “children” nodes – as long as there are no loops.
Condor has a component to recognize and manage DAGs: the DAGMan (Directed Acyclic Graph Manager). DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
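The "no loops" condition can be checked with a few lines of Python. This is only an illustration of the DAG concept using the standard-library `graphlib` module (Python 3.9 or later); it is not how DAGMan works internally, and the function name `execution_order` and the job names are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError


def execution_order(deps):
    """Return one valid execution order for the given dependencies.

    deps maps each job to the jobs that must complete before it.
    Raises ValueError if the dependencies contain a loop (not a DAG).
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes forming the detected cycle
        raise ValueError(f"not a DAG, loop found: {err.args[1]}") from err


# "Don't run job B until job A has completed successfully", and so on.
cascade = {"A": [], "B": ["A"], "C": ["A", "B"]}
print(execution_order(cascade))

# A loop (A waits for B, B waits for A) is rejected.
try:
    execution_order({"A": ["B"], "B": ["A"]})
except ValueError as err:
    print(err)
```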
|All about CONDOR|
|CONDOR Users Manual|
|In-house CONDOR presentation (Jens Knudstrup, December 2005)|
The current dfos system (running on the muc blades) exploits the following Condor capabilities:
Currently all processing tasks are created, scheduled and executed by autoDaily in automatic mode. The scheduling and execution are managed by Condor.
The general syntax of CONDOR calls is:
<job ID>||<executable>||<arguments to executable>||<upload files>||<dependencies>
This is currently used in the execAB files which for DRS_TYPE=CON read:
<AB> || <$DFO_BIN_DIR>/processAB || -a <AB> -u <user> || <NONE, or comma-separated list of depending ABs>
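As a rough illustration of this line format, the following Python sketch splits one execAB entry into its fields. It is not part of dfos: the names `ABJob` and `parse_execab_line`, the sample AB names and the sample path are hypothetical. Because the general syntax lists an upload-files field while the execAB form above shows four fields, the sketch assumes that the dependencies are always the last field.

```python
from typing import NamedTuple


class ABJob(NamedTuple):
    job_id: str
    executable: str
    arguments: str
    dependencies: tuple


def parse_execab_line(line):
    """Split one '||'-separated execAB entry into its fields.

    Assumption: the first three fields are job ID, executable and
    arguments; the last field holds NONE or a comma-separated list of ABs.
    """
    fields = [field.strip() for field in line.split("||")]
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    job_id, executable, arguments = fields[:3]
    last = fields[-1]
    if last in ("", "NONE"):
        deps = ()
    else:
        deps = tuple(d.strip() for d in last.split(",") if d.strip())
    return ABJob(job_id, executable, arguments, deps)


# Hypothetical entries, for illustration only.
first = parse_execab_line(
    "bias.ab || /path/to/bin/processAB || -a bias.ab -u someuser || NONE"
)
second = parse_execab_line(
    "science.ab || /path/to/bin/processAB || -a science.ab -u someuser || bias.ab,flat.ab"
)
print(first.dependencies, second.dependencies)
```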
Condor comes with some built-in command-line tools useful to interact with the system.
||show the status of the condor pool (all machines/nodes in the pool)|
|condor_q||check the condor job queue. Shows all jobs submitted and in the condor queue with job ID, OWNER, and SUBMITTED (date and time when the job was submitted). A summary of the number of jobs and their status (idle/running/held) is also shown. Example: condor_q -l xshooter|
|condor_submit||submit new jobs|
|condor_rm||remove jobs from the queue: condor_rm -forcex [constraints], where [constraints] are:|
|condor_history||query the job history queue to show all jobs (ID/OWNER/SUBMITTED) submitted to condor (long!). Example: condor_history -l 13523 (see details of job with ID 13523)|
|condor_submit_dag||submit a job (or a cluster of jobs) that has dependencies|
||get or set values of the condor configuration parameters|
You will probably use only condor_rm, to abort the execution of a running job.
There is also a set of DFS-provided tools to interact with the system. Their names start with 'vultur' (memo: a condor is a vulture ...) and they can be invoked on the command line. They are also used by dfos tools.
|vultur_exec_cascade||launch a condor job; the following call is written by createJob in JOBS_NIGHT: vultur_exec_cascade --dagId=CALIB_2006-04-20 --jobs=execAB_CALIB_2006-04-20 --wait|
|vultur_stop||gracefully stop an executing cascade (but note that it stops ALL of the user's jobs indiscriminately!)|
|vultur_watch_q||watch the condor queue; shows at one-second intervals the number of jobs/idle/running/held|
|vultur_cascade_stat||monitor the cascade entries|
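The vultur_exec_cascade call quoted in the table follows the execAB_<mode>_<date> naming of the job file. A minimal Python sketch of how such a call could be assembled is given below. The helper name `cascade_command` is hypothetical, only the options visible in the example call are used, and treating --wait as optional is an assumption.

```python
def cascade_command(mode, date, wait=True):
    """Build the argument list for a vultur_exec_cascade call.

    mode, date : e.g. "CALIB" and "2006-04-20", as in execAB_<mode>_<date>
    wait       : append --wait (assumed to be optional)
    """
    tag = f"{mode}_{date}"
    cmd = ["vultur_exec_cascade", f"--dagId={tag}", f"--jobs=execAB_{tag}"]
    if wait:
        cmd.append("--wait")
    return cmd


# Reproduces the call written by createJob in the example above.
print(" ".join(cascade_command("CALIB", "2006-04-20")))
```

The list form can be passed directly to `subprocess.run` without shell quoting issues.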
CONDOR comes as part of the standard DFS installation, in /etc/condor.
To run your pipeline under the batch queue system CONDOR, the following preparations are necessary: