|dfos = Data Flow Operations System, the common tool set for DFO|
new:
- v1.2.3: muc09/10 added; monthly mode for phoenix (mcalib mode)
|topics: description | navigation and monitor pages: all details | condor concepts | main table | output | operations | configuration | technical details|
|tool monitors condor execution|
This tool visualizes the status of the current condor processing cascade on a muc blade. It helps to understand the dependencies in a cascade and to analyze recipe performance and its interplay with the condor processing nodes on the blade. It also links to the other cascade monitors on the host and to the mucMonitor.
The tool has two main modes: the DATE mode (option -d) and the ALL_DATE mode (option -D).
[It also has a monthly mode, a special feature needed for phoenix 2.0 mcalib production.]
DATE mode:
|Condor cascade for KMOS
Last update: 2013-02-07T15:06:04 (UT) by kmos@muc01 (0d 00h:00m:03s ago)
Browser refresh: every 30 sec | System load past minute: 0.19 Force browser refresh with Ctrl+R
The top panel has update information. In mode -d, the tool can be run in watch mode. The cadence of the watch mode is configurable and should be roughly matched to the typical execution time of ABs (usually 30-60 sec is a good value). The browser refresh interval is set to the same value.
ALL_DATE mode:
|Daily condor processing for muc02
Last update: 2013-02-07T15:06:04 (UT) by uves@muc02
In that mode the tool is called once (either on the command-line or by the JOBS_AUTO file). It collects information from all accounts on the host.
Navigation and monitor pages
The horizontal navigation has links to the overview page ('ALL') and to the detail pages (linked by instrument name). You can switch to the cascadeMonitors of the other operational users of the same muc blade.
ALL (output of the ALL_DATE mode)
|Processing date: 2013-03-03|
You can navigate forward/backwards in time.
The plots show all cascades executed on the specified DFO date and their execution times. All modes are included: CALIB, QCCALIB, and SCIENCE. The width of each green symbol corresponds roughly to the total execution time. The timescale is accurate to roughly one hour. Condor logs come in the time zone set by the account and are corrected to UTC. Here is one of three plots (muc02 has three operational accounts):
pages (last cascade per instrument)
This mode is called during, and immediately after, job execution and provides details of the condor processing. You find the output behind the instrument links (e.g. giraffe (GIRAFFE)). At any given time, there is only one output page, usually the one corresponding to the last executed cascade.
The name of the analyzed job is displayed, as well as the cascade ID and its path. The status is either active or finished.
|Total number of jobs: 28; finished: 28; exec_time: 18.7 m; scheduler:
Average consumption: 36.2% (2.9 of 8 cores)
These are parameters of the cascade. The 'scheduler' parameter displays the time used by the condor scheduler to analyze the cascade dependencies before the cascade is actually launched. The average consumption is the time average of the number of used cores (part c of the display).
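As an illustration of how such a time average can be formed, here is a minimal sketch. It assumes the cascade is available as a list of (elapsed seconds, used cores) state changes; the function name, the input format and the sample numbers are hypothetical and are not taken from cascadeMonitor itself.

```python
def average_consumption(samples, end_time, max_cores=8):
    """Time-weighted average of used cores.

    samples:  list of (elapsed_sec, used_cores) state changes, sorted by time;
              each value holds until the next state change (assumed format).
    end_time: elapsed seconds at which the cascade finished.
    Returns (average number of cores, percentage of max_cores).
    """
    if not samples or end_time <= samples[0][0]:
        return 0.0, 0.0
    core_seconds = 0.0
    # Pair each state change with the time of the next one (or the cascade end).
    next_times = [t for t, _ in samples[1:]] + [end_time]
    for (start, cores), stop in zip(samples, next_times):
        core_seconds += cores * (stop - start)
    avg = core_seconds / (end_time - samples[0][0])
    return avg, 100.0 * avg / max_cores


# Invented example: 8 cores busy for 60 s, then 2 cores busy for 60 s.
avg, pct = average_consumption([(0, 8), (60, 2)], end_time=120)
print(f"Average consumption: {pct:.1f}% ({avg:.1f} of 8 cores)")
# prints: Average consumption: 62.5% (5.0 of 8 cores)
```

Weighting each core count by how long it was in effect keeps short bursts of full load from dominating the result.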
Condor knows four different AB states:
|done||processing job finished (either with or without success, doesn't make a difference for condor)|
|blocked||cannot be executed due to dependencies|
|waiting||could be executed but currently not enough cores|
A cascade can be limited in two fundamental ways: it is dependency limited when some ABs must finish before others can execute (because they provide required virtual calibrations), and core limited when there are more executable ABs than cores. A dependency-limited cascade cannot benefit from more available cores, while a core-limited cascade would execute faster if more cores were available.
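The distinction can be expressed as a simple rule on the AB counts at one moment of the cascade. The sketch below is only an illustration: the helper name and its arguments are hypothetical, and the rule is derived from the state definitions above rather than from the tool's code.

```python
def limitation(blocked, waiting, processing, max_cores=8):
    """Classify one cascade snapshot (hypothetical helper).

    blocked:    ABs that cannot run yet because of dependencies
    waiting:    ABs that could run but have no free core
    processing: ABs currently executing
    """
    if waiting > 0:
        # Executable ABs are queued: more cores would speed things up.
        return "core limited"
    if blocked > 0 and processing < max_cores:
        # Cores are idle, but the remaining ABs wait for others to finish.
        return "dependency limited"
    return "neither"


print(limitation(blocked=10, waiting=0, processing=2))   # dependency limited
print(limitation(blocked=3, waiting=12, processing=8))   # core limited
```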
Condor has an internal watch process which loops every 10-15 sec and finds all status changes in the processing queue. These may be due to ABs having finished, having started processing, or having changed their status from blocked to waiting. All visualizations on the detailed cascade monitor show state changes, i.e. each entry applies to one or more ABs that changed their state since the previous state change. In general an entry does not refer to an individual AB, although under special conditions this might be the case.
The output is divided into three parts: the finished jobs ("Done", green); the blocked jobs (the ones blocked by dependencies, red); the active and waiting jobs ("Used", grey).
The display has numbers of ABs over elapsed time. The scaling factor of the time axis is configurable (to adapt to the specifics of the supported instrument). The cascade always starts with all ABs blocked (red) and ends with all ABs finished (green).
Move your mouse over an icon to see some related technical information. For example, dot.21 refers to the text file CALIB_2013-01-13.dot.21 in /data23/giraffe/condor/CALIB_2013-02-09-1360601594.23658609.
standard installation with dfosInstall
Type cascadeMonitor -h for a quick help, cascadeMonitor -v for the version number. Type
cascadeMonitor -D 2013-03-13
for the ALL_DATE mode: "low-resolution" overview of all cascades on the host for specified date. Type
cascadeMonitor -d 2013-01-13
to create the detailed ("high-resolution") cascade monitor for CALIB_2013-01-13 and your $DFO_INSTRUMENT;
cascadeMonitor -d 2013-01-03 -m SCIENCE
to create the cascade monitor for SCIENCE_2013-01-03 and your $DFO_INSTRUMENT;
cascadeMonitor -c CALIB_2013-02-09-1360601594.23658609
to visualize the specified cascade.
During execution of JOBS_NIGHT, or if you type it on the command-line, the tool runs in a loop, like
watch -n 30 cascadeMonitor -d 2013-01-13
The tool stops looping when it discovers that the cascade is finished (no .lock file found).
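The loop-until-finished behaviour can be sketched as follows. This is not the tool's implementation: the function names are hypothetical, and it is an assumption that the lock is a file whose name ends in .lock inside the cascade directory.

```python
import os
import time


def has_lock(cascade_dir):
    """True while any file ending in '.lock' exists (assumed lock convention)."""
    return any(name.endswith(".lock") for name in os.listdir(cascade_dir))


def watch_cascade(cascade_dir, update, cadence=30, sleep=time.sleep):
    """Call update() every `cadence` seconds until no lock file remains.

    update: callable that regenerates the monitor page (hypothetical).
    Returns the number of refreshes done while the cascade was active.
    """
    rounds = 0
    while has_lock(cascade_dir):
        update()
        rounds += 1
        sleep(cadence)
    update()  # one final refresh showing the finished cascade
    return rounds
```

The `sleep` argument is only there so that the loop can be exercised without real waiting.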
The supported condor cascade types are:
The tool can be called on the command line at any time. In watch mode it is called by createJob. Within autoDaily it is called twice at the end of the cascade: once with option -d for the cascade details, and once with option -D for the overview.
On the AB monitor, you also have a button to start the cascadeMonitor manually if you are interested in the cascade monitor as processing progresses.
The output is linked to the local AB monitor (link casc).
The configuration file config.cascadeMonitor has the following keys:
Section 1: General configuration
|MAX_CORES||8||number of cores on muc blades available for condor processing|
|TIME_SCALE||3||scaling factor for time axis; larger --> image gets compressed; default: 3|
|WATCH_CADENCE||30||cadence in sec of 'watch -n $WATCH_CADENCE ...' call in JOBS_NIGHT|
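To show how these keys feed into the watch call, here is a small sketch. It assumes the configuration file consists of whitespace-separated "KEY VALUE" lines with "#" comments, which is not confirmed by this page; the parser, the sample text and the use of the table values as fallbacks are all hypothetical.

```python
# Fallback values, copied from the table above.
DEFAULTS = {"MAX_CORES": 8, "TIME_SCALE": 3, "WATCH_CADENCE": 30}


def read_config(text):
    """Parse 'KEY VALUE' lines (assumed layout); '#' starts a comment."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] in DEFAULTS:
            config[fields[0]] = int(fields[1])
    return config


# Invented sample content, not a real config.cascadeMonitor file.
sample = """
# Section 1: General configuration
MAX_CORES      8
TIME_SCALE     5    # compress the time axis
WATCH_CADENCE  60
"""
cfg = read_config(sample)
print(f"watch -n {cfg['WATCH_CADENCE']} cascadeMonitor -d 2013-01-13")
# prints: watch -n 60 cascadeMonitor -d 2013-01-13
```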
For the mode -D, the tool needs to collect information from all operational accounts on the muc blade. This is achieved in the following way:
For the mode -d, no such scheme is necessary since it runs stand-alone on the account.
The list of operational accounts is read from the mucMonitor configuration file.