Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO

see also:
	preparation of CONDOR jobs: createJob
	monitoring: mucMonitor, cascadeMonitor
	AB concepts, AB structure

HTCondor® Project, University of Wisconsin
"Like a vulture circling the desert, Condor scavenges for processing power that would otherwise be lost." Myron Livny, "Mr. Condor"

Condor: Batch queue system for DFO

Why do we need a batch queue system?

The dfos tool createJob creates a job file ($DFO_JOB_DIR/execAB_<mode>_<date>) which contains the processAB calls for all ABs. Since QC uses the concept of cascades and virtual calibrations, it is important to control the execution sequence of the ABs. Otherwise an AB might be executed before the product files it requires are available.

A batch queue system (BQS) knows about these dependencies. In its simplest form (DRS_TYPE=CPL), the ABs are executed one-by-one in the sequence they show up in the execAB file. This is fully sufficient (and actually all you can do) on a single-processor system.

On a multi-core platform, you can gain a lot more efficiency but you need some more scheduling control. A BQS controls the proper sequence and manages the execution. By evaluating the dependencies between ABs, it finds the ones which are ready to execute (i.e. not blocked).
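The idea of finding the ABs that are ready to execute can be illustrated with a minimal Python sketch. This is not how Condor is implemented; it only shows the principle, using the standard graphlib module (Python 3.9 or later). The AB names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group ABs into waves: every AB in a wave has all its required ABs finished.

    dependencies maps each AB name to the set of ABs it has to wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies contain a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # ABs that are not blocked right now
        waves.append(ready)
        sorter.done(*ready)  # pretend they finished, which unblocks their children
    return waves


# Hypothetical ABs: two independent biases, a flat and a wavelength calibration.
deps = {
    "BIAS_1": set(),
    "BIAS_2": set(),
    "FLAT_1": {"BIAS_1"},
    "WAVE_1": {"BIAS_1", "FLAT_1"},
}
print(execution_waves(deps))
# -> [['BIAS_1', 'BIAS_2'], ['FLAT_1'], ['WAVE_1']]
```

All ABs within one wave could run in parallel on different cores; on a single-processor system the waves simply collapse into a one-by-one sequence.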

The BQS is a layer between the dfos call to execute a batch of ABs (createJob/execAB), and the individual pipeline recipe calls (processAB/esorex/recipe):

createJob
	BQS (within execAB)
		process call of ABs (processAB -> esorex -> pipeline)

QC uses multi-core nodes and hence needs a BQS. Before 2013, we used two kinds of architecture: stand-alone, dual-core, single-user dfo blades, and a powerful cluster made of 20 dual-core blades in multi-user mode ("QC blades"). Since 2013, we use a total of seven multi-core blades ("muc blades"), in either multi-user or in single-user configuration. Each of the muc blades has 12 physical (24 virtual) cores.

The first BQS was developed as a prototype by SDD in 2004 and called REI (recipe execution interface). In 2005 it was realized that a mature BQS was available on the market, with 20+ years of development and testing behind it: Condor. This system was developed, and is maintained, at the University of Wisconsin. It was decided to use it as the standard BQS for QC.

CONDOR in general

"Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion." more ...

Personal/private. CONDOR comes in two flavours: personal and shared. A shared CONDOR is what is installed in complex environments like a campus compute grid. On both the old and the new blades, we are using a 'personal' CONDOR installation (resources are shared among users on the same server but not beyond). A personal CONDOR installation will:

keep an eye on your jobs and keep you posted on their progress
implement your policy on the execution order of the jobs (determined by the OCA rules)
keep a log of your job activities (under $BQSROOT/condor).

CONDOR can handle any kind of compute job, provided that:

the job can run in the background (no interactive input, windows, GUI, etc. required)
the job knows of its environment (sourcing of .qcrc and .dfosrc needs to be done within the compute job, like for cronjobs).

QC uses CONDOR for processing ABs (mandatory) and QC reports (optional).

DAGs and job dependencies

A DAG is a Directed Acyclic Graph, an entity consisting of nodes and dependencies like "don’t run job B until job A has completed successfully". A cascade is a typical DAG. Each node can have any number of “parent” or “children” nodes – as long as there are no loops.

Condor has a component to recognize and manage DAGs: the DAGMan (Directed Acyclic Graph Manager). DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
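As an illustration, DAGMan reads a plain-text input file in which JOB lines declare the nodes and PARENT ... CHILD lines declare the dependencies. The following Python sketch writes such a file for the "job B after job A" example. It is only a sketch: the submit file names (<node>.sub) are hypothetical, and the way the dfos tools build their own DAG is not described here.

```python
def dag_file_text(dependencies):
    """Return the text of a DAGMan input file.

    dependencies maps each node name to the list of its parent nodes.
    The submit file name '<node>.sub' is a hypothetical naming choice.
    """
    lines = []
    for node in sorted(dependencies):
        lines.append(f"JOB {node} {node}.sub")  # declare the node
    for child in sorted(dependencies):
        parents = sorted(dependencies[child])
        if parents:
            # the child is not started until all parents completed successfully
            lines.append(f"PARENT {' '.join(parents)} CHILD {child}")
    return "\n".join(lines) + "\n"


print(dag_file_text({"A": [], "B": ["A"], "C": ["A", "B"]}), end="")
# JOB A A.sub
# JOB B B.sub
# JOB C C.sub
# PARENT A CHILD B
# PARENT A B CHILD C
```

A file like this would be handed to condor_submit_dag, which starts DAGMan for it.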

CONDOR links:

All about CONDOR

CONDOR Users Manual

In-house CONDOR presentation (Jens Knudstrup, December 2005)

Tutorials

Condor for QC

The current dfos system (running on the muc blades) exploits the following Condor capabilities:

it offers shared compute resources (up to 8 cores for processAB or processQC calls)
it manages different cascades with their internal dependencies and external priorities

Currently all processing tasks are created, scheduled and executed by autoDaily in automatic mode. The scheduling and execution are managed by Condor.

Syntax

The general syntax of CONDOR calls is:

This is currently used in the execAB files which for DRS_TYPE=CON read:

<AB> || <$DFO_BIN_DIR>/processAB || -a <AB> -u <user> || <NONE, or comma-separated list of depending ABs>
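To make the four fields of such a line explicit, here is a small Python sketch that splits one execAB line at the '||' separators. It assumes that the last field lists the ABs which have to finish before this AB can start, and that 'NONE' stands for "no dependencies". The AB names, the path and the user in the example are hypothetical.

```python
from typing import List, NamedTuple


class ExecABEntry(NamedTuple):
    ab: str                  # name of the AB
    executable: str          # full path to processAB
    arguments: str           # arguments passed to processAB
    dependencies: List[str]  # ABs listed in the last field (empty for NONE)


def parse_execab_line(line):
    """Split one execAB line (DRS_TYPE=CON format) into its four fields."""
    fields = [field.strip() for field in line.split("||")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields separated by '||', got {len(fields)}")
    ab, executable, arguments, deps = fields
    if deps == "NONE":
        dependencies = []
    else:
        dependencies = [name.strip() for name in deps.split(",") if name.strip()]
    return ExecABEntry(ab, executable, arguments, dependencies)


# Hypothetical example line
entry = parse_execab_line(
    "AB_0003.ab || /path/to/bin/processAB || -a AB_0003.ab -u qcuser || AB_0001.ab,AB_0002.ab"
)
print(entry.ab, entry.dependencies)
# -> AB_0003.ab ['AB_0001.ab', 'AB_0002.ab']
```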

Macros

Condor comes with some built-in command-line tools useful to interact with the system.

command
	description
	example

condor_status
	show the status of the condor pool (all machines/nodes in pool)

condor_q
	check condor job queue. Shows all jobs submitted and in condor queue with job ID, OWNER, SUBMITTED (date and time when job was submitted). Total summary of the number of jobs and their status (idle/running/held) is also shown.
	example: condor_q -l xshooter

condor_submit
	submit new jobs

condor_rm
	remove jobs from the queue. Useful:
	condor_rm -forcex [constraints]
	where [constraints] are:
		cluster.proc	remove a given job ID
		cluster	remove a given cluster of jobs
		user	remove all jobs owned by user
		-constraint expr	remove all jobs matching the boolean expression
		-all	remove all jobs

condor_history
	query the job history queue to show all jobs (ID/OWNER/SUBMITTED) submitted to condor (long!)
	example: condor_history -l 13523 (see details of job with ID 13523)

condor_submit_dag
	submit a job (or a cluster of jobs) that have dependencies

condor_config_val
	get or set values of the condor configuration parameters. Useful ones:
		condor_config_val	show options
		condor_config_val -config	print the locations of condor config files
		condor_config_val -owner	show owner of the condor_config_val process
		condor_config_val -dump	dump the current condor configuration

You will probably use only condor_rm, to abort the execution of a running job.

There is also a set of DFS-provided tools to interact with the system. Their names start with 'vultur' (mnemonic: the condor is a vulture ...) and they can be invoked on the command line. They are also used by dfos tools.

command
	description
	example

vultur_exec_cascade
	launch a condor job
	parameters:
		--dagId	ID used to identify job call
		--tmpdir	temporary directory to store Job# subdirectory log files
	example: the following call is written by createJob in JOBS_NIGHT:
	vultur_exec_cascade --dagId=CALIB_2006-04-20 --jobs=execAB_CALIB_2006-04-20 --wait

vultur_stop
	gracefully stop an executing cascade (but, stops ALL of user's jobs indiscriminately!)

vultur_watch_q
	watch the condor queue. Shows in one second intervals the number of jobs/idle/running/held

vultur_cascade_stat
	monitors the cascade entries
	example: vultur_cascade_stat --dagId=CALIB_2006-04-20
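The vultur_exec_cascade call written by createJob follows the execAB_<mode>_<date> naming of the job file. A minimal Python sketch of how such a call line could be assembled is given below. It assumes that the dagId is always <mode>_<date>, which is inferred from the single example above; the function name is hypothetical, and this is not the createJob code.

```python
def cascade_call(mode, date):
    """Build the vultur_exec_cascade call for a given mode and date.

    Assumption: the dagId is '<mode>_<date>' and the job file is 'execAB_<dagId>'.
    """
    dag_id = f"{mode}_{date}"
    return f"vultur_exec_cascade --dagId={dag_id} --jobs=execAB_{dag_id} --wait"


print(cascade_call("CALIB", "2006-04-20"))
# -> vultur_exec_cascade --dagId=CALIB_2006-04-20 --jobs=execAB_CALIB_2006-04-20 --wait
```

The same dagId can then be passed to vultur_cascade_stat to monitor that cascade.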

How to use

CONDOR comes as part of the standard DFS installation, in /etc/condor.

To run your pipeline under the batch queue system CONDOR, the following preparations are necessary:

edit .qcrc to include (at the end, just before sourcing .dfosrc):
	source /opsw/packages/config/bashrc.vultur.private
	export PATH=$PATH:/opsw/condor/bin:/opsw/condor/sbin
edit .dfosrc to include $BQSROOT (see reference file)
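After these edits, a quick check that the Condor command-line tools are actually found on the PATH may be useful. The following Python sketch is one possible way to do this; it is not part of dfos, and the function name and the choice of tools to check are illustrative.

```python
import shutil

# A few of the Condor tools listed in the table of command-line tools
CONDOR_TOOLS = ("condor_q", "condor_status", "condor_rm")


def missing_tools(tools=CONDOR_TOOLS, path=None):
    """Return the tools that cannot be found on the search path.

    path=None means: use the PATH of the current environment.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


missing = missing_tools()
if missing:
    print("not found on PATH:", ", ".join(missing))
else:
    print("all Condor tools found")
```

If tools are reported as missing, the most likely reason is that .qcrc has not been sourced in the current shell.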

Last update: April 26, 2021 by rhanusch
