Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO

Batch queue system: CONDOR

Batch queue systems

The dfos tool createJob creates a job file ($DFO_JOB_DIR/execAB_&lt;mode&gt;_&lt;date&gt;) which contains the calls for all ABs to be processed. Since QC uses the concept of cascades and virtual calibrations, it is important to control the execution sequence of ABs; otherwise an AB might be executed before the products from another AB are available.

A batch queue system (BQS) knows about these dependencies. In its simplest form (DRS_TYPE=CPL or RBS), the ABs are already properly sorted in the execAB file. Then all ABs can be called sequentially by processAB and executed.

In its more advanced form, a BQS takes over the sequence control and manages the execution on a multi-processor platform. By evaluating the current dependencies, it finds ABs which are ready to execute (i.e. not blocked). The BQS is a layer between the DFOS call to execute a batch of ABs (createJob/execAB), and the individual pipeline recipe calls (processAB/esorex/recipe):

createJob    
  BQS (within execAB)  
    call of ABs (processAB -> esorex -> pipeline)
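The dependency evaluation performed by the BQS layer can be sketched as follows. This is illustrative Python, not actual dfos or CONDOR code; the AB names and the dependency structure are invented for the example.

```python
# Illustrative sketch (not dfos code): given AB dependencies, find the
# ABs that are ready to execute, i.e. not blocked by an unfinished AB.
def ready_abs(deps, done, running):
    """deps maps each AB to the set of ABs it depends on."""
    return {ab for ab, needs in deps.items()
            if ab not in done and ab not in running and needs <= done}

# Model cascade: FLATs depend on the BIAS, the STD depends on all FLATs.
deps = {"BIAS": set(),
        "FLAT1": {"BIAS"}, "FLAT2": {"BIAS"},
        "STD1": {"FLAT1", "FLAT2"}}

print(ready_abs(deps, done=set(), running=set()))     # only the BIAS is ready
print(ready_abs(deps, done={"BIAS"}, running=set()))  # now the two FLATs
```

Each scheduling cycle, the BQS picks jobs from this ready set; everything else stays queued until its dependencies are done.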

A BQS is required for QC as soon as multi-node processing is intended. Furthermore the functionality of extracting all internal dependencies from ABs is very attractive. The current DFO machines have two processors, and using CONDOR makes it possible to use them in parallel. A performance gain by a factor of up to 1.8 has been measured.

The first BQS was developed as a prototype by DFS in 2004 and called REI (recipe execution interface). In 2005 it was realized that a mature BQS system was available on the market, with 20+ years of development and testing built-in: CONDOR. This system has been developed, and is maintained, at the University of Wisconsin. It was decided to use that system as the standard BQS for DFO.

CONDOR

"Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion." more ...

The current dfos system uses only a subset of CONDOR's abilities. On a standard dfo machine (two processors) it can reduce pipeline processing times by roughly a factor of 2; on a compute cluster it can offer real parallel processing for multi-detector files when the number of processors matches the number of detectors.

Among the features currently not exploited by dfos is grid support (distributed computing: all tasks executable on all dfo machines).

The following simple scheme illustrates the potential of using CONDOR (both on a dual-processor and a multi-processor platform), assuming a model cascade with three levels and up to 6 independent ABs:

FORS1 model cascade: 1 BIAS, 3 dependent FLATs, 6 dependent STDs (ABs numbered 1-10):

      1        BIAS
  2   3   4    FLAT
5 6 7 8 9 10   STD

traditional dfo processing (DRS_TYPE="CPL"): 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
→ 10 cycles (2,3,4 can only start when 1 is done; 5-10 only when 2-4 are done)

slightly more advanced dfo processing (DRS_TYPE="CON", dual processor): 1 | 2 3 | 4 5 | 6 7 | 8 9 | 10
→ 6 cycles

processing on cluster (N=6): 1 | 2 3 4 | 5 6 7 8 9 10
→ 3 cycles (cannot be reduced further!)

The number of cycles assumes the same execution time per AB; each "|" separates one cycle from the next.
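These cycle counts can be reproduced with a toy greedy scheduling model. This is illustrative Python, not dfos or CONDOR code; it assumes equal execution time per AB and that each cycle runs as many ready ABs as there are processors.

```python
# Toy cycle-count model (not dfos code): greedy scheduling of the FORS1
# model cascade on N processors, assuming equal AB execution times.
def cycles(deps, n_proc):
    done, n = set(), 0
    while len(done) < len(deps):
        ready = [ab for ab, needs in deps.items()
                 if ab not in done and needs <= done]
        done |= set(ready[:n_proc])   # run up to n_proc ready ABs this cycle
        n += 1
    return n

# 1 BIAS, 3 dependent FLATs, 6 dependent STDs (each STD needs all FLATs)
flats = {f"FLAT{i}" for i in range(3)}
deps = {"BIAS": set(), **{f: {"BIAS"} for f in flats},
        **{f"STD{i}": flats for i in range(6)}}

print(cycles(deps, 1))   # 10 cycles: sequential processing
print(cycles(deps, 2))   # 6 cycles: dual-processor machine
print(cycles(deps, 6))   # 3 cycles: N=6 cluster, limited by the 3 levels
```

The N=6 result of 3 cycles is the number of dependency levels, which is why adding more processors cannot improve it further.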

The very same cascade on a multi-processor platform with more than 6 nodes cannot be made any faster because of the dependencies. The survey instruments with up to 32 detectors will require a higher processing multiplexity. The idea is to have the pipeline recipe access a particular extension (in both raw and product files) and hence to run the same AB 32 times, indexed by detector ID:

OCAM model cascade: 1 BIAS, 1 dependent FLAT, each split into 32 per-detector ABs plus one join job:

 1 ... 32  33   BIAS (ABs 1-32 per detector, 33 = join job)
34 ... 65  66   FLAT (ABs 34-65 per detector, 66 = join job)

Jobs 33 and 66 are join jobs (to create the MEF product file): 33 depends on 1-32, 34-65 depend on 33, and 66 depends on 34-65.

processing on N=6 cluster: 1-6 | 7-12 | 13-18 | 19-24 | 25-30 | 31 32 | 33 | 34-39 | 40-45 | 46-51 | 52-57 | 58-63 | 64 65 | 66
→ 14 cycles (same execution time per AB assumed)

processing on N=32 cluster: 1-32 | 33 | 34-65 | 66
→ 4 cycles (cannot be reduced)
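The OCAM cycle counts also follow from simple arithmetic: per level, the 32 independent per-detector ABs are spread over N processors, followed by one cycle for the join job, and there are two levels. A back-of-the-envelope check (not dfos code; function name invented):

```python
# Illustrative arithmetic (not dfos code): OCAM cycle counts on an
# N-processor cluster, for 2 levels of 32 per-detector ABs + 1 join job.
from math import ceil

def ocam_cycles(n_proc, n_det=32, n_levels=2):
    # per level: spread n_det ABs over n_proc processors, then 1 join cycle
    return n_levels * (ceil(n_det / n_proc) + 1)

print(ocam_cycles(6))    # 14 cycles on the N=6 cluster
print(ocam_cycles(32))   # 4 cycles on the N=32 cluster
```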

In general, CONDOR will be used by QC to execute all pipeline jobs. Hence a hybrid scenario (with the two cascades from above submitted to the same cluster) could look like the following:

Hybrid scenario: the OCAM cascade (ABs 1-66, as above) and the FORS1 cascade (ABs F1-F10, prefixed with F below) submitted to the same cluster:

processing on N=6 cluster:
1-6 | 7-12 | 13-18 | 19-24 | 25-30 | 31 32 F1 | 33 F2 F3 F4 | 34-39 | 40-45 | 46-51 | 52-57 | 58-63 | 64 65 F5 F6 F7 F8 | 66 F9 F10
→ 14 cycles (the FORS jobs can be scheduled to idle processors)

processing on N=32 cluster:
1-32 | 33 F1 | 34-65 | 66 F2 F3 F4 | F5-F10
→ 5 cycles (an extra cycle is required because of the FORS dependencies)

Buying just one more CPU would reduce the cycle number to 4:

processing on N=33 cluster:
1-32 F1 | 33 F2 F3 F4 | 34-65 F5 | 66 F6-F10
→ 4 cycles (obviously the idle processor cycles could be used for further compute jobs)
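The hybrid numbers can be checked with the same kind of toy greedy scheduling model, with both cascades merged into one dependency set and the OCAM jobs listed first. Illustrative Python only, not dfos or CONDOR code:

```python
# Toy model (not dfos code): OCAM + FORS1 cascades on one cluster,
# greedy scheduling with the OCAM jobs taking priority over FORS jobs.
def cycles(deps, n_proc):
    done, n = set(), 0
    while len(done) < len(deps):
        ready = [j for j, needs in deps.items()
                 if j not in done and needs <= done]
        done |= set(ready[:n_proc])   # run up to n_proc ready jobs per cycle
        n += 1
    return n

# OCAM: 32 per-detector BIAS ABs + join job 33; 32 FLAT ABs + join job 66
deps = {f"O{i}": set() for i in range(1, 33)}
deps["O33"] = {f"O{i}" for i in range(1, 33)}
deps.update({f"O{i}": {"O33"} for i in range(34, 66)})
deps["O66"] = {f"O{i}" for i in range(34, 66)}
# FORS1: 1 BIAS, 3 dependent FLATs, 6 dependent STDs
deps["F1"] = set()
deps.update({f"F{i}": {"F1"} for i in range(2, 5)})
deps.update({f"F{i}": {"F2", "F3", "F4"} for i in range(5, 11)})

for n in (6, 32, 33):
    print(n, cycles(deps, n))   # 14, 5 and 4 cycles respectively
```

With N=33, the FORS jobs ride entirely in the processor slots left idle by the OCAM join jobs, which is why the one extra CPU removes the extra cycle.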

CONDOR links:

All about CONDOR
CONDOR Users Manual (current DFO installation as of March 2006: v6.6)
In-house CONDOR presentation (Jens Knudstrup, December 2005)
Other information about CONDOR

How to use

CONDOR comes as part of your DFS installation. NOTE: It can currently be used only for CPL pipelines. To run your pipeline under the batch queue system CONDOR, the following preparations are necessary:

This page will be updated with more information about CONDOR.

Macros

(memo: condor is a vulture ...) The following macros are provided with the condor installation. They can be executed on the command line:

vultur_exec_cascade - launch a job file
vultur_stop - gracefully stop an executing cascade
vultur_watch_q - watch the queue
vultur_cascade_stat - monitor the cascade entries