Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO
new:
- v1.2: improved display for CONDOR_AB=NO
- v1.3: information about PHOENIX processing and ingestion jobs; new optional config keys ING_WARNING and PHOENIX_EXEC

see also:
The tool visualizes the condor processing queue on the MUC blades. Current muc02 view ...
See also cascadeMonitor

databases: none
dfos tools: condor_status, condor_q
output: mucMonitor_mucnn.html, exported to http://qcweb.hq.eso.org/ALL/
upload/download: upload: mucMonitor_mucnn.html (see output); download: config.mucMonitor_mucnn (with wget)
topics: description | interface: PHOENIX jobs | condor jobs | condor output | details | queue | cascades | operations | configuration | special

mucMonitor

[ top ] Description

tool monitors condor execution

This tool visualizes the condor processing situation on a MUC blade. It is particularly useful on the muc01...muc05/muc08 systems with their multi-user setup.

The mucMonitor gives an overview of the current condor activity (nodes executing condor jobs) and the pending queue. On the PHOENIX-enabled mucs, it provides an overview of the ongoing PHOENIX processing and ingestion jobs. It further links to the cascadeMonitor of the operational users (called 'registered' users).
Multi-core machine monitor: muc02
Last update: 2013-02-07T15:06:04 (UT) by xshooter@muc02 (0d 00h:00m:03s ago)
Browser refresh: every 10 sec | System load past minute: 0.19

The top panel shows update information. The tool runs at a high cadence because the condor processing can be quite dynamic (changes on timescales of less than a minute). The browser refresh is hard-coded to 10 sec. The 'ago' bracket (calculated by the browser) turns red as soon as the output is older than one minute, indicating that the tool is not up-to-date (either because it is currently not called by any of the operational users, or because there is a problem with condor).

The horizontal navigation has the following links:
muc02   cascade: ALL   giraffe   uves   xshooter   queue   other
muc02 = link to mucMonitor; cascade: ALL = overview per muc; giraffe, uves, xshooter = links for the operational users; queue = list of current queue jobs; other = link to other mucMonitors.

The links 'muc02' and 'queue' relate to the mucMonitor, the others to the cascadeMonitor (the cascadeMonitor links can be de-activated via configuration, DAILY_CASCADES=NO, if accounts do not use the condor queuing system). The cascadeMonitor links display an overview ('ALL') and details for the operational users (these are configured; there could be more accounts on a muc server, but they are not listed here unless configured):

Operational users: giraffe uves xshooter

[ top ] PHOENIX jobs. For phoenix-enabled systems (on muc08...muc11), the mucMonitor displays the executing PHOENIX and ingestion jobs. If none is executing, a PHOENIX-enabled mucMonitor displays:

Enabled for PHOENIX checks
Enabled for INGESTION checks

The PHOENIX systems (in particular on muc08) typically demand more resources than the dfos systems. Knowledge about competing jobs is therefore desirable when scheduling your own PHOENIX job. Likewise, there is information about the IDP ingestion jobs on those systems. If one is running, the system displays a blinking warning. See below for the related operational rules.

[ top ] vultur_exec_cascade jobs. Next the mucMonitor lists the currently executing condor queues:

Scheduled vultur_exec_cascade jobs:
UID STIME JOB
uves 14:40 /home/uves/jobs/execAB_CALIB_2013-01-13
xshooter 14:38 /home/xshooter/jobs/execAB_CALIB_2013-01-13
giraffe 14:42 /home/giraffe/jobs/execAB_CALIB_2013-01-13

UID=user ID; STIME=start time; JOB=job file under $DFO_JOB_DIR being currently executed.

[ top ] Main table. Then it displays the 'condor_q' output:
Condor nodes on muc02
        slot4
        slot8
        slot12
 

busy | idle | reserved/not available

'busy' indicates that condor currently executes a job on that core; 'idle' means no current condor job; 'reserved/not available' means the core is not configured for condor jobs, or is currently not available. Note that this overview indicates only the condor situation. Some cores are reserved for interactive jobs, so they might be busy e.g. with certifyProducts or trendPlotter and would not be indicated on this monitor. Likewise, an idle condor core might actually be running a non-condor job.

The monitor includes the Ganglia load report (the last-hour overview of the load on that muc server).

If a node is condor-active, the job is displayed (e.g. processAB), along with the AB name and the user ID, in the Details panel:

[ top ] Details: This is the output of 'condor_status':

core   status  load   CMD        AB                                    RAW_TYPE               user
slot1  Busy    1.660  processAB  GIRAF.2013-01-14T11:19:27.794.ab                             giraffe
slot2  Busy    0.730  processAB  XSHOO.2013-01-14T00:02:06.516.ab      FLEX_SLIT_NIR          xshooter
slot3  Busy    0.500  processAB  XSHOO.2013-01-14T14:17:56.188_tpl.ab  ARC_SLIT_NIR           xshooter
slot4  Busy    0.460  processAB  XSHOO.2013-01-14T00:25:59.853_tpl.ab  STD_FLUX_SLIT_NOD_UVB  xshooter
slot5  Idle    0.670
slot6  Busy    1.000  processAB  GIRAF.2013-01-14T11:00:00.698.ab                             giraffe
slot7  Idle    1.670
slot8  Busy    1.000  processAB  UVES.2013-01-14T13:53:05.615_tpl.ab                          uves

The load is the last-minute average.
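As an illustration of how the columns of the Details panel can be read, the following sketch counts the busy cores per user. The function name busy_per_user is hypothetical and not part of mucMonitor; the sketch assumes that rows are whitespace-separated and that the user ID is always the last field of a 'Busy' row.

```shell
# Hypothetical helper (not part of mucMonitor): count busy cores per user.
# Reads rows in the Details layout on stdin. Assumes the second field is
# the status and the last field of every 'Busy' row is the user ID.
busy_per_user() {
    awk '$2 == "Busy" { n[$NF]++ } END { for (u in n) print u, n[u] }' | sort
}

# Possible use on a saved copy of the table (file name is an assumption):
#   busy_per_user < details.txt
```

Header and 'Idle' rows are skipped because their second field is not 'Busy'.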

[ top ] Queue. The 'queue' tab lists all jobs in the condor queue that are currently waiting (i.e. not dependency-locked but not yet executing due to the muc blade being core-limited).

[ top ] Cascade monitor. The tabs between the first and the last tab are filled by links to the account-specific cascade monitors. They provide a visualization of the current status of the latest condor processing cascade (e.g. CALIB_2013-01-13 for xshooter). The cascade monitor is created by the tool cascadeMonitor, see more details there.


Output

How to install

How to use

Type mucMonitor -h for a quick help, mucMonitor -v for the version number. Type

mucMonitor

on the command line to create or refresh http://qcweb/~qc/ALL/mucMonitor_<hostname>.html.


[ top ] Operations

While calling the tool on the command line is possible at any time, operationally it should run in an infinite loop on one account per muc blade only. Remember that it is a visualization of 'condor_q', and condor is installed per machine, not per account. For instance, there is one condor installation on muc02, and no matter whether giraffe, uves or xshooter calls 'condor_q', they all see the same result. The same is true for 'mucMonitor'. Calling it individually from each account would not break anything but would cost performance unnecessarily. On the other hand, it does not matter which account executes mucMonitor. For that reason, the output indicates the account and host on which mucMonitor is currently executing.

Self-organized mode. The tool should therefore run operationally in a self-organized way: if the browser shows that the output is no longer refreshed (see 'Outdated' below), start the tool on your own account and thereby "take over". Killing or starting a session on another account is not possible (unless, in an emergency, you log in via qc_shift). Of course, if your own session is hanging, you can and should fix it right at the source.

The proper way of calling the infinite loop is

watch -n 10 mucMonitor

Do this in a dedicated window, ideally a separate xterm so that you can easily recognize it. If needed, you (but nobody else) can always terminate the session with Ctrl+C.

On the AB monitor, you also have a button to start the mucMonitor in the infinite loop. The AB monitor has a browser-independent way of finding whether a mucMonitor session exists on the muc blade: it greps in the 'ps' output for the string mucMonitor. If nothing is found, it displays the standard red cross as alert.
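The 'ps' check described above can be sketched as follows. This is an illustration only, not the AB monitor's actual code; the function name has_mucmonitor_session is hypothetical. It reads 'ps' output on stdin and ignores lines belonging to a grep process.

```shell
# Hypothetical check (not the AB monitor's actual code): succeed if any
# process line mentions mucMonitor. Reads 'ps'-style text on stdin so it
# can be tried on canned input; lines of grep processes are dropped first.
has_mucmonitor_session() {
    grep -v 'grep' | grep -q 'mucMonitor'
}

# Possible use on a live system:
#   if ps -ef | has_mucmonitor_session; then
#       echo "mucMonitor session found"
#   else
#       echo "no mucMonitor session: show the red cross"
#   fi
```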

The output is linked to the dfoMonitor. The output is linked to the public monitor page as "MUC monitor".

Outdated.

last update: 2013-02-07T17:01:55 (UT) by xshooter@muc02 (0d 18h:13m:25s ago)

The browser calculates the difference between the current time and the timestamp in the mucMonitor (same way as e.g. in the trendPlotter or calChecker pages). If the page is older than 1 minute (!), it turns red and tries to catch your attention. Thereby you can normally expect the output to be near-real time.
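The real calculation is done in the browser; the following shell sketch only mirrors the logic. The function names format_age and is_stale are hypothetical, and the age is passed in as a number of seconds.

```shell
# Hypothetical helpers mirroring the browser logic (not the page's code).

# format_age SECONDS: print an age in the style of the 'ago' bracket.
format_age() {
    secs=$1
    printf '%dd %02dh:%02dm:%02ds\n' \
        $((secs / 86400)) \
        $((secs % 86400 / 3600)) \
        $((secs % 3600 / 60)) \
        $((secs % 60))
}

# is_stale SECONDS: succeed if the page is older than one minute.
is_stale() {
    [ "$1" -gt 60 ]
}

# Possible use:
#   is_stale 65605 && echo "outdated: $(format_age 65605) ago"
```

With 65605 seconds as input, format_age reproduces the '0d 18h:13m:25s' shown in the example line above.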

Timeouts. Sometimes the command 'condor_q' hangs. The tool has a timeout mechanism: if there is no condor_q response after 6 seconds, it exits and tries again after the 'watch' interval. The configuration file download also has a timeout mechanism.
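How the tool implements its timeout is not documented here. One common way to get the same behaviour is the 'timeout' command from GNU coreutils, sketched below under that assumption; the wrapper name run_with_limit is hypothetical.

```shell
# Sketch only (assumes GNU coreutils 'timeout' is installed; this is not
# necessarily how mucMonitor does it).
# run_with_limit SECONDS COMMAND [ARGS...]: run COMMAND, but give up after
# SECONDS. 'timeout' exits with status 124 when the limit is hit.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Possible use: stop if condor_q does not answer within 6 seconds and
# leave the retry to the next 'watch' cycle.
#   run_with_limit 6 condor_q > "$tmpfile" || exit 1
```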

[ top ] Configuration file

Because of the self-organized way of operations, the tool needs a central configuration that is identical for all participating accounts. This is why there is a master configuration file http://www.eso.org/~qc/dfos/tools/config.mucMonitor_<hostname> that is downloaded automatically each time the tool executes. It overwrites the local version under $DFO_CONFIG_DIR/config.mucMonitor_<hostname>.local. The download is done via wget and has a timeout protection (10 sec). On timeout, the local version is read instead. It is pointless to edit the local version (it gets overwritten upon the next execution).
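The download-with-fallback behaviour can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name fetch_config and the temporary file name are hypothetical, and the use of a single try (-t 1) is an assumption.

```shell
# Sketch only (not mucMonitor's actual code).
# fetch_config URL LOCAL_FILE: download the master configuration with a
# 10 sec timeout; on success replace the local copy, otherwise keep it.
fetch_config() {
    url=$1
    local_file=$2
    tmp="$local_file.tmp.$$"
    # -q quiet, -T 10 timeout in seconds, -t 1 one try (assumed), -O output
    if wget -q -T 10 -t 1 -O "$tmp" "$url" && [ -s "$tmp" ]; then
        mv "$tmp" "$local_file"   # fresh master copy overwrites the local one
    else
        rm -f "$tmp"              # timeout or error: fall back to the local copy
        echo "config download failed, reading local version" >&2
    fi
    [ -r "$local_file" ]          # fail only if no configuration is readable
}

# Possible use (host name given as an example):
#   host=muc02
#   fetch_config "http://www.eso.org/~qc/dfos/tools/config.mucMonitor_$host" \
#                "$DFO_CONFIG_DIR/config.mucMonitor_$host.local"
```

Downloading to a temporary file first means a failed or empty download never damages the existing local version.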

The central configuration file should be edited only after coordination.

It contains the topology of the executing muc blade, and labels marking the role of the nodes ('condor_execution' or 'free'):

Section 1: Host users

HOST            muc02         hostname
DAILY_CASCADES  YES           YES|NO, optional (default: YES). NO means no links for autoDaily cascades (for PHOENIX nodes only).
CONDOR_AB       YES           YES|NO, optional (default: YES). NO means no vultur_exec calls to be monitored (currently for MUSE nodes only).
ING_WARNING     NO            YES|NO, optional (default: NO). YES means a warning if 'ingestProducts' is currently called (to avoid multiple ingestion streams for PHOENIX-enabled machines). No meaning for dfos accounts.
PHOENIX_EXEC    NO            YES|NO, optional (default: NO). YES means displaying if phoenix is running (for multi-user PHOENIX accounts only).
USER            account name  DFO_INSTRUMENT
USER            giraffe       GIRAFFE
USER            uves          UVES
USER            xshooter      XSHOOTER

Section 2: Host node table (condor_status output)

MUC  core ("slot") name   ROW1 etc.: arrangement in the main table  role
MUC  slot1@muc02.hq.eso   ROW1                                      condor_execution
MUC  slot2@muc02.hq.eso   ROW1                                      condor_execution
MUC  slot9@muc02.hq.eso   ROW3                                      free
etc.
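To illustrate how such a configuration could be read, here is a small sketch using awk. It assumes that the file consists of whitespace-separated fields exactly as in the tables above; the real file syntax may differ, and the function names config_value, list_users and count_role are hypothetical.

```shell
# Sketch only: assumes whitespace-separated fields as in the tables above.

# config_value FILE KEY DEFAULT: print the value of an optional key,
# or DEFAULT if the key is not present.
config_value() {
    awk -v key="$2" -v def="$3" \
        '$1 == key { v = $2 } END { print (v == "" ? def : v) }' "$1"
}

# list_users FILE: print the account names of all USER entries.
list_users() {
    awk '$1 == "USER" && NF == 3 { print $2 }' "$1"
}

# count_role FILE ROLE: count MUC entries with the given role
# ('condor_execution' or 'free').
count_role() {
    awk -v role="$2" '$1 == "MUC" && $4 == role { n++ } END { print n + 0 }' "$1"
}

# Possible use:
#   cfg="$DFO_CONFIG_DIR/config.mucMonitor_muc02.local"
#   config_value "$cfg" DAILY_CASCADES YES
#   list_users "$cfg"
#   count_role "$cfg" condor_execution
```

The default argument of config_value reflects that keys such as DAILY_CASCADES are optional and fall back to a documented default.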

[ top ] Special cases

There are the following special cases:

a) the PHOENIX processing on muc08/muc10/muc11 (having no autoDaily process): there is no monitoring of the condor cascade (cascadeMonitor). The corresponding standard links are disabled there via the optional config key DAILY_CASCADES (by default YES, here set to NO). These machines also have the PHOENIX_EXEC set to YES, to indicate if a phoenix job is being executed. While there is no general operational rule, it seems reasonable to wait with a task if another PHOENIX job is already running.

b) the MUSE processing on muc10/muc11, using DRS_TYPE=CPL (because of internal parallelisation); there is then no need to monitor the condor pattern. This is controlled by the optional key CONDOR_AB (by default YES, here set to NO).

c) the PHOENIX-enabled accounts have ING_WARNING=YES, giving a warning if an ingestion process is currently running, to avoid having several of them running in parallel. If you see such a warning (blinking), please wait with your ingestion job until the blinking is over. In particular, the MUSE IDP ingestion process is quite demanding for the current NGAS system.

This even applies to other mucs: before you start an IDP ingestion process, please check the mucMonitors of muc08, muc10, and muc11 (or check with your colleagues directly).