QC shiftleader: autoDaily


Use the qc_shift account on muc02 to log in to any of the accounts.
 
What to check | How | Issues and solutions
3. autoDaily: incremental daily processing running?

On any HC monitor, go to the MORE page.

The 'autoDaily' box should be green for all instruments (i.e. no gap of more than six hours).
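The six-hour rule can be checked by hand with a small helper. This is a minimal sketch, assuming you already know the time of the last autoDaily run as Unix epoch seconds; the function name `autodaily_gap_ok` is hypothetical and not part of DFOS.

```sh
#!/bin/sh
# Hypothetical helper (not part of DFOS): is the gap since the last
# autoDaily run at most six hours, the limit for a green box on MORE?
# Both arguments are Unix epoch seconds; obtaining the last-run time is
# left to the caller.
max_gap_seconds=$((6 * 3600))

autodaily_gap_ok() {
    last_run=$1
    now=$2
    gap=$((now - last_run))
    if [ "$gap" -le "$max_gap_seconds" ]; then
        echo "OK: last autoDaily run ${gap}s ago"
        return 0
    fi
    echo "ALERT: no autoDaily run for ${gap}s (more than six hours)"
    return 1
}
```

Usage: `autodaily_gap_ok "$last_run_epoch" "$(date +%s)"`, where `last_run_epoch` is a placeholder for the value you read off MORE.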

Check the exported dfoMonitors.

Check the WISQ execution times of calSelector and autoDaily: they should be close to their median values. The WISQ monitors are updated only once per day, but the most recent exec_times values are always visible behind 'Data downloads'.
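"Close to the median" can be made concrete with a quick comparison. This is a sketch under assumptions: the earlier execution times are copied by hand (e.g. from 'Data downloads'), the helper names are hypothetical, and the tolerance factor of 2 is an illustrative choice, not an official limit.

```sh
# Hypothetical helpers: compare the latest execution time of calSelector
# or autoDaily with the median of earlier values.
# The tolerance factor of 2 is an assumption, not an official limit.
tolerance=2

median() {
    # Print the median of the numeric arguments.
    printf '%s\n' "$@" | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

exec_time_ok() {
    # First argument: latest value; remaining arguments: earlier values.
    latest=$1
    shift
    med=$(median "$@")
    # Succeed only if the latest value lies within [median/f, median*f].
    awk -v x="$latest" -v m="$med" -v f="$tolerance" \
        'BEGIN { x += 0; m += 0; exit !(x <= m * f && x >= m / f) }'
}
```

For example, `exec_time_ok 22 18 20 25` succeeds (median 20), while `exec_time_ok 90 18 20 25` fails.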

The following issues with incremental processing show up on these monitors:

  • autoDaily not executed (cronjob disabled)
  • autoDaily stuck because there are too many ABs in $DFO_AB_DIR (more than 2500 gives a red score on the MORE page)

The exported dfoMonitor shows the last created and the last processed AB; these may also indicate issues.

Other indicators of issues:

  • the machine has a high load, or no load at all
  • execution times are abnormal

Once you are alerted to an issue, it is usually easy to spot on the respective dfoMonitor (e.g. from the load indicators, autoDaily active but making no progress, or no ABs processed).

Possible issues and their solutions:
load very high

visible on the Ganglia monitor; check for runaway processes (e.g. with top) and the load (e.g. with uptime); call IT/lcondere@eso.org for help
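As a complement to Ganglia, the load can be classified from the shell. This is a sketch only: the function name is hypothetical and the thresholds (more than twice the CPU count counts as very high, below 0.05 as no load) are assumptions chosen for illustration.

```sh
# Hypothetical helper: classify a load average against the CPU count.
# Thresholds are assumptions; Ganglia remains the reference.
load_status() {
    awk -v l="$1" -v n="$2" 'BEGIN {
        l += 0; n += 0
        if (l > 2 * n)     print "very high"
        else if (l < 0.05) print "no load"
        else               print "normal"
    }'
}

# On Linux, the current 1-minute load and CPU count can be passed as:
#   load_status "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
# and the top CPU consumers (runaway candidates) listed with procps ps:
#   ps -eo pid,user,pcpu,etime,comm --sort=-pcpu | head -n 10
```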

NGAS access not possible

visible on dfoMonitor; send a ticket/mail to AOG

cronjob disabled

check with: crontab -l | grep autoDaily (the last execution is listed on MORE)
if disabled: try to find out why, then re-enable it
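The output of `crontab -l` can be interpreted automatically. A minimal sketch, assuming a disabled job appears as a commented-out line (the usual crontab convention); the function name is hypothetical.

```sh
# Hypothetical helper: read crontab text on stdin and report whether an
# autoDaily entry is active, commented out (disabled), or missing.
# Assumes a disabled job is a line starting with '#'.
autodaily_cron_state() {
    awk '
        /autoDaily/ {
            if ($0 ~ /^[ \t]*#/) disabled = 1
            else active = 1
        }
        END {
            if (active)        print "active"
            else if (disabled) print "disabled"
            else               print "missing"
        }'
}

# Usage on the instrument account:
#   crontab -l | autodaily_cron_state
```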

too many ABs in $DFO_AB_DIR

visible on MORE and on dfoMonitor; autoDaily gets stuck; move older ABs into a separate directory
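Moving older ABs aside can be done with `find`. This is a sketch under assumptions: ABs are `*.ab` files directly in $DFO_AB_DIR, "older" is judged by file modification time, and both the target directory and the age limit are hypothetical choices for illustration.

```sh
# Hypothetical sketch: move ABs older than a given number of days out of
# the AB directory. Target directory and age limit are assumptions.
move_old_abs() {
    src=$1
    dest=$2
    days=$3
    mkdir -p "$dest"
    # -maxdepth 1 keeps find from descending into $dest if it lies in $src.
    find "$src" -maxdepth 1 -type f -name '*.ab' -mtime +"$days" \
        -exec mv {} "$dest"/ \;
}

# Example (hypothetical target directory, 30-day limit):
#   move_old_abs "$DFO_AB_DIR" "$DFO_AB_DIR/OLD_ABS" 30
```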
dfs/pipeline issue

check the version on dfoMonitor; roll back to the previous version only if you know what you are doing

autoDaily: trendPlotter jobs

autoDaily runs nominally and processes ABs, but no new scores or HC plots appear

Possible issues and their solutions:
load very high

visible on the Ganglia monitor; check for runaway processes (e.g. with top) and the load (e.g. with uptime); call IT/lcondere@eso.org for help

database down

qc1_score and other QC1_db tables down: call DBCM for help
$DFO_JOB_DIR/JOBS_TREND: does it exist?
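A quick existence check for the trend job file can look like this. A minimal sketch, assuming JOBS_TREND is a regular file in $DFO_JOB_DIR; reporting an empty file separately is an assumption (that it would mean no trend jobs were written). The function name is hypothetical.

```sh
# Hypothetical check: does $DFO_JOB_DIR/JOBS_TREND exist and have content?
check_jobs_trend() {
    f="${1:-$DFO_JOB_DIR}/JOBS_TREND"
    if [ ! -e "$f" ]; then
        echo "MISSING: $f"
        return 1
    elif [ ! -s "$f" ]; then
        echo "EMPTY: $f"
        return 1
    fi
    echo "OK: $f"
}
```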

 

scp to qc@stargate1 not possible

this should also generate plenty of error mails for other instruments; call IT for help
autoDaily: enough disk space for processing?

check the XDM disk space monitor
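Disk usage can also be checked on the machine itself. This sketch parses POSIX `df -P` output (capacity in column 5, mount point in column 6); the function name and the 90% default threshold are assumptions, and the XDM monitor remains the reference.

```sh
# Hypothetical helper: read `df -P` output on stdin and print file systems
# whose usage exceeds a threshold (default 90%, an assumption).
full_filesystems() {
    limit=${1:-90}
    awk -v lim="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 > lim + 0) print $6 " " $5
    }'
}

# Usage:
#   df -P | full_filesystems 90
```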
Possible issues and their solutions:
the machine cannot process because the fast cache or data disk is full

try to understand why (massive reprocessing? many days affected?)

  identify the cause and fix it (massive reprocessing: stop it; too many days: remove data for older days)