Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO

Batch queue system: CONDOR

Batch queue systems

The dfos tool createJob creates a job file ($DFO_JOB_DIR/execAB_&lt;mode&gt;_&lt;date&gt;) which contains the calls for all ABs to be processed. Since QC uses the concept of cascades and virtual calibrations, it is important to control the execution sequence of ABs; otherwise an AB might be executed before the products from another AB are available.

A batch queue system (BQS) knows about these dependencies. In its simplest form (DRS_TYPE=CPL or RBS), the ABs are already properly sorted in the execAB file. Then all ABs can be called sequentially by processAB and executed.

In its more advanced form, a BQS takes over the sequence control and manages the execution on a multi-processor platform. By evaluating the current dependencies, it finds ABs which are ready to execute (i.e. not blocked). The BQS is a layer between the DFOS call to execute a batch of ABs (createJob/execAB), and the individual pipeline recipe calls (processAB/esorex/recipe):

createJob    
  BQS (within execAB)  
    call of ABs (processAB -> esorex -> pipeline)
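The dependency evaluation performed by the BQS layer can be sketched as follows. This is illustrative Python, not actual dfos or CONDOR code; the AB names and the dependency structure are invented for the example.

```python
# Illustrative sketch (not dfos code): given AB dependencies, find the
# ABs that are ready to execute, i.e. not blocked by an unfinished AB.
def ready_abs(deps, done, running):
    """deps maps each AB to the set of ABs it depends on."""
    return {ab for ab, needs in deps.items()
            if ab not in done and ab not in running and needs <= done}

# Model cascade: FLATs depend on the BIAS, the STD depends on all FLATs.
deps = {"BIAS": set(),
        "FLAT1": {"BIAS"}, "FLAT2": {"BIAS"},
        "STD1": {"FLAT1", "FLAT2"}}

print(ready_abs(deps, done=set(), running=set()))     # only the BIAS is ready
print(ready_abs(deps, done={"BIAS"}, running=set()))  # now the two FLATs
```

Each scheduling cycle, the BQS picks jobs from this ready set; everything else stays queued until its dependencies are done.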

A BQS is required for QC as soon as multi-node processing is intended. Furthermore the functionality of extracting all internal dependencies from ABs is very attractive. The current DFO machines have two processors, and using CONDOR makes it possible to use them in parallel. A performance gain by a factor of up to 1.8 has been measured.

The first BQS was developed as a prototype by DFS in 2004 and called REI (recipe execution interface). In 2005 it was realized that a mature BQS system was available on the market, with 20+ years of development and testing built-in: CONDOR. This system has been developed, and is maintained, at the University of Wisconsin. It was decided to use that system as the standard BQS for DFO.

CONDOR

"Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion." more ...

The current dfos system uses only a subset of CONDOR's abilities. On a standard dfo machine (two processors) it can reduce pipeline processing times by roughly a factor of 2; on a compute cluster it can offer real parallel processing for multi-detector files when the number of processors matches the number of detectors.

Among the features currently not exploited by dfos is grid support (distributed computing: all tasks executable on all dfo machines).

The following simple scheme illustrates the potential of using CONDOR (both on a dual-processor and a multi-processor platform), assuming a model cascade with three levels and up to 6 independent ABs:

FORS1 model cascade: 1 BIAS, 3 dependent FLATs, 6 dependent STDs (ABs numbered 1-10):

      1        BIAS
  2   3   4    FLAT
5 6 7 8 9 10   STD

traditional dfo processing (DRS_TYPE="CPL"): 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
→ 10 cycles (2,3,4 can only start when 1 is done; 5-10 only when 2-4 are done)

slightly more advanced dfo processing (DRS_TYPE="CON", dual processor): 1 | 2 3 | 4 5 | 6 7 | 8 9 | 10
→ 6 cycles

processing on cluster (N=6): 1 | 2 3 4 | 5 6 7 8 9 10
→ 3 cycles (cannot be reduced further!)

The number of cycles assumes the same execution time per AB; each "|" separates one cycle from the next.
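These cycle counts can be reproduced with a toy greedy scheduling model. This is illustrative Python, not dfos or CONDOR code; it assumes equal execution time per AB and that each cycle runs as many ready ABs as there are processors.

```python
# Toy cycle-count model (not dfos code): greedy scheduling of the FORS1
# model cascade on N processors, assuming equal AB execution times.
def cycles(deps, n_proc):
    done, n = set(), 0
    while len(done) < len(deps):
        ready = [ab for ab, needs in deps.items()
                 if ab not in done and needs <= done]
        done |= set(ready[:n_proc])   # run up to n_proc ready ABs this cycle
        n += 1
    return n

# 1 BIAS, 3 dependent FLATs, 6 dependent STDs (each STD needs all FLATs)
flats = {f"FLAT{i}" for i in range(3)}
deps = {"BIAS": set(), **{f: {"BIAS"} for f in flats},
        **{f"STD{i}": flats for i in range(6)}}

print(cycles(deps, 1))   # 10 cycles: sequential processing
print(cycles(deps, 2))   # 6 cycles: dual-processor machine
print(cycles(deps, 6))   # 3 cycles: N=6 cluster, limited by the 3 levels
```

The N=6 result of 3 cycles is the number of dependency levels, which is why adding more processors cannot improve it further.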

The very same cascade on a multi-processor platform with more than 6 nodes cannot be made any faster because of the dependencies. The survey instruments with up to 32 detectors will require a higher processing multiplexity. The idea is to have the pipeline recipe access a particular extension (in both raw and product files) and hence to run the same AB 32 times, indexed by detector ID:

OCAM model cascade: 1 BIAS, 1 dependent FLAT, each split into 32 per-detector ABs plus one join job:

 1 ... 32  33   BIAS (ABs 1-32 per detector, 33 = join job)
34 ... 65  66   FLAT (ABs 34-65 per detector, 66 = join job)

Jobs 33 and 66 are join jobs (to create the MEF product file): 33 depends on 1-32, 34-65 depend on 33, and 66 depends on 34-65.

processing on N=6 cluster: 1-6 | 7-12 | 13-18 | 19-24 | 25-30 | 31 32 | 33 | 34-39 | 40-45 | 46-51 | 52-57 | 58-63 | 64 65 | 66
→ 14 cycles (same execution time per AB assumed)

processing on N=32 cluster: 1-32 | 33 | 34-65 | 66
→ 4 cycles (cannot be reduced)
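The OCAM cycle counts also follow from simple arithmetic: per level, the 32 independent per-detector ABs are spread over N processors, followed by one cycle for the join job, and there are two levels. A back-of-the-envelope check (not dfos code; function name invented):

```python
# Illustrative arithmetic (not dfos code): OCAM cycle counts on an
# N-processor cluster, for 2 levels of 32 per-detector ABs + 1 join job.
from math import ceil

def ocam_cycles(n_proc, n_det=32, n_levels=2):
    # per level: spread n_det ABs over n_proc processors, then 1 join cycle
    return n_levels * (ceil(n_det / n_proc) + 1)

print(ocam_cycles(6))    # 14 cycles on the N=6 cluster
print(ocam_cycles(32))   # 4 cycles on the N=32 cluster
```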

In general, CONDOR will be used by QC to execute all pipeline jobs. Hence a hybrid scenario (with the two cascades from above submitted to the same cluster) could look like the following:

Hybrid scenario: the OCAM cascade (ABs 1-66, as above) and the FORS1 cascade (ABs F1-F10, prefixed with F below) submitted to the same cluster:

processing on N=6 cluster:
1-6 | 7-12 | 13-18 | 19-24 | 25-30 | 31 32 F1 | 33 F2 F3 F4 | 34-39 | 40-45 | 46-51 | 52-57 | 58-63 | 64 65 F5 F6 F7 F8 | 66 F9 F10
→ 14 cycles (the FORS jobs can be scheduled to idle processors)

processing on N=32 cluster:
1-32 | 33 F1 | 34-65 | 66 F2 F3 F4 | F5-F10
→ 5 cycles (an extra cycle is required because of the FORS dependencies)

Buying just one more CPU would reduce the cycle number to 4:

processing on N=33 cluster:
1-32 F1 | 33 F2 F3 F4 | 34-65 F5 | 66 F6-F10
→ 4 cycles (obviously the idle processor cycles could be used for further compute jobs)
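The hybrid numbers can be checked with the same kind of toy greedy scheduling model, with both cascades merged into one dependency set and the OCAM jobs listed first. Illustrative Python only, not dfos or CONDOR code:

```python
# Toy model (not dfos code): OCAM + FORS1 cascades on one cluster,
# greedy scheduling with the OCAM jobs taking priority over FORS jobs.
def cycles(deps, n_proc):
    done, n = set(), 0
    while len(done) < len(deps):
        ready = [j for j, needs in deps.items()
                 if j not in done and needs <= done]
        done |= set(ready[:n_proc])   # run up to n_proc ready jobs per cycle
        n += 1
    return n

# OCAM: 32 per-detector BIAS ABs + join job 33; 32 FLAT ABs + join job 66
deps = {f"O{i}": set() for i in range(1, 33)}
deps["O33"] = {f"O{i}" for i in range(1, 33)}
deps.update({f"O{i}": {"O33"} for i in range(34, 66)})
deps["O66"] = {f"O{i}" for i in range(34, 66)}
# FORS1: 1 BIAS, 3 dependent FLATs, 6 dependent STDs
deps["F1"] = set()
deps.update({f"F{i}": {"F1"} for i in range(2, 5)})
deps.update({f"F{i}": {"F2", "F3", "F4"} for i in range(5, 11)})

for n in (6, 32, 33):
    print(n, cycles(deps, n))   # 14, 5 and 4 cycles respectively
```

With N=33, the FORS jobs ride entirely in the processor slots left idle by the OCAM join jobs, which is why the one extra CPU removes the extra cycle.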

CONDOR links:

All about CONDOR
CONDOR Users Manual (current DFO installation as of March 2006: v6.6)
In-house CONDOR presentation (Jens Knudstrup, December 2005)
Other information about CONDOR

How to use

CONDOR comes as part of your DFS installation. NOTE: It can currently be used only for CPL pipelines. To run your pipeline under the batch queue system CONDOR, the following preparations are necessary:

This page will be updated with more information about CONDOR.

Macros

(memo: condor is a vulture ...) The following macros are provided with the condor installation. They can be executed on the command line:

vultur_exec_cascade - launch a job file
vultur_stop - gracefully stop an executing cascade
vultur_watch_q - watch the queue
vultur_cascade_stat - monitor the cascade entries