RTC Supervisor

The RTC Supervisor is NOT a deployment tool. Its role is to provide a single entry point to the RTC for state guiding, monitoring, error recovery and population of the runtime repository.

It responds to a number of the standard commands defined by Stdif and exports global state in the OLDB.

Introduction

The current release of the RTC Supervisor performs the following:

  • Guiding the state of all SRTC components by forwarding state change requests to them

  • Evaluating the overall state of the RTC by monitoring the state of all supervised RTC components

  • Monitoring the liveness of all RTC components and detecting if any have crashed

  • Providing a simple interface for recovery when one or more components are in error

Subsequent releases will add the following:

  • Monitoring any error events generated by RTC components and generating an overall error indication

  • Loading the contents of the runtime repository from the persistent repository

  • Implementing a mode switching interface by reloading parts of the runtime repository from the persistent repository

  • Providing a means of updating the persistent repository with items which have been changed in the runtime repository

  • Acting as a base class to which instrument-specific functionality and interfaces can be added

The RTC Supervisor implementation is divided into a library and a server, with the server implementing the usable component. In addition, there are a number of test programs used as integration tests, and an example implementation of some deployment code which allows a set of components defined in a YAML file to be launched.

The library implements the functionality required for communicating with a set of supervised rtcObjects and for sending commands to them as a list, using rtcCommandRequest and rtcCommandRequestSeries. The complete configuration of rtcObjects is managed by the rtcObjectConfig class, which reads its configuration from the runtime repository.

The RTC Supervisor component is currently based on the rtctkExampleComponent structure, i.e. it provides business logic that implements the various activities. THIS MAY WELL CHANGE and become a simple server implementing the Stdif commands directly.

Launching the Server

Because the server is based on the rtctkExampleComponent (at least for the moment), its command line is:

rtctkRtcSupervisor -h
Options:
  -h [ --help ]         print help messages
  -i [ --cid ] arg      component identity
  -s [ --sde ] arg      service discovery endpoint

e.g.:

rtctkRtcSupervisor -i rtc_sup -s file:///$PREFIX/run/exampleEndToEnd/service_disc.yaml

State Guiding

Currently, state guiding is performed from the activities defined by the ExampleComponent. In each activity, AllObjectRequestList() is used to send commands, either in series or in parallel, to the list of supervised components.

The RTC Supervisor implements activities for, and understands, the following commands:

  • Init

  • Reset

  • Recover (currently empty)

  • Enable

  • Disable

NOTE: Run and Idle are not in the above list. The current baseline is that the sequencer code will be responsible for sending the Run commands to the RTC components in the correct order. This may be re-evaluated.

There are configuration flags available for each of these activities indicating if they should be performed in parallel or series on the list of supervised components.
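The choice between series and parallel dispatch can be sketched as follows. This is only an illustration under stated assumptions: SendToAll, SendFn and the in_parallel parameter are hypothetical names and are not part of the RTC Supervisor library, and the actual AllObjectRequestList() interface may look quite different.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for sending one command to one component.
// Returns true when the component accepted the command.
using SendFn = std::function<bool(const std::string& component,
                                  const std::string& command)>;

// Send `command` to every component, either one after another or
// concurrently. Returns the names of the components for which it failed.
std::vector<std::string> SendToAll(const std::vector<std::string>& components,
                                   const std::string& command,
                                   bool in_parallel,
                                   const SendFn& send) {
    std::vector<std::string> failed;
    if (!in_parallel) {
        // Series: each request completes before the next one is issued.
        for (const auto& name : components) {
            if (!send(name, command)) failed.push_back(name);
        }
        return failed;
    }
    // Parallel: issue all requests first, then collect the replies in order.
    std::vector<std::future<bool>> replies;
    replies.reserve(components.size());
    for (const auto& name : components) {
        replies.push_back(std::async(std::launch::async, send, name, command));
    }
    for (std::size_t i = 0; i < components.size(); ++i) {
        if (!replies[i].get()) failed.push_back(components[i]);
    }
    return failed;
}
```

In this sketch an activity such as Init would call SendToAll(objects, "Init", flag, send), with the flag taken from the activity's configuration parameter. The send callable must be safe to invoke from several threads when in_parallel is true.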

Commanding

To send one of the supported commands, use the rtctkSendCommand script, which makes use of the rtctkClient program implementing the Stdif client interface, as follows:

rtctkSendCommand rtc_sup Init
rtctkSendCommand rtc_sup Reset
rtctkSendCommand rtc_sup Recover
rtctkSendCommand rtc_sup Enable
rtctkSendCommand rtc_sup Disable

Here rtc_sup is the name passed to the rtcSupervisor with the -i flag. The rtctkSendCommand script looks in the directory given by the environment variable $REPO_DIR for service_disc.yaml, which it uses to look up the required URIs.

State Evaluation

When the object configuration is built from the runtime repo, a list of publish/subscribe URIs is created, one per supervised component. The business logic creates a StateSubscriber with the list of URIs. The StateSubscriber's callback is invoked whenever an event is received and calls the rtcObjectConfig::OnStateEventReceived() method. This method sets the state attribute of the identified rtcObject in the object list, then evaluates the believed state/substate of the system and publishes it in the OLDB.

Typical content of the OLDB when the system is operational and supervising two components, object1 and object2, would be:

rtc_sup:
  global_display_state:
    type: RtcString
    value: On.Operational.Idle
  global_state:
    type: RtcString
    value: operational
  global_substate:
    type: RtcString
    value: idle
  global_error:
    type: RtcBool
    value: false
  global_error_who:
    type: RtcString
    value: ""
  state:
    type: RtcString
    value: "On.Operational.Idle On.Operational.Update.Idle "
object1:
  state:
    type: RtcString
    value: "On.Operational.Idle On.Operational.Update.Idle "
object2:
  state:
    type: RtcString
    value: "On.Operational.Idle On.Operational.Update.Idle "
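The record-then-evaluate pattern described in this section can be sketched as follows. The text does not give the actual evaluation rule, so the rule used here is an assumption made purely for illustration: the global state is the common state when all components agree, and "Mixed" otherwise. StateTable and the "Unknown" and "Mixed" values are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical aggregate of the per-component states.
class StateTable {
public:
    // Every supervised component starts in the assumed state "Unknown".
    explicit StateTable(const std::vector<std::string>& components) {
        for (const auto& name : components) states_[name] = "Unknown";
    }

    // Called for every state event: record the new state of the component,
    // then re-evaluate and return the global state.
    std::string OnStateEventReceived(const std::string& component,
                                     const std::string& state) {
        states_[component] = state;
        return Evaluate();
    }

    // Assumed rule: the common state if all components agree, else "Mixed".
    std::string Evaluate() const {
        if (states_.empty()) return "Unknown";
        const std::string& first = states_.begin()->second;
        for (const auto& entry : states_) {
            if (entry.second != first) return "Mixed";
        }
        return first;
    }

private:
    std::map<std::string, std::string> states_;  // component name -> state
};
```

In the real component the value returned by the evaluation would be written to the OLDB rather than handed back to the caller.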

Asynchronous Detection of Component Failure

Asynchronous monitoring is performed by the rtcMonitor class. The rtcServer has a member which is an rtcMonitor. A thread is created from the rtcMonitor's constructor which, while the monitor is active, periodically calls the rtcServer's MonitorCycle() method.

The rtcServer marks the monitor as being active whenever the state is at least NotOperational/Ready.

The rtcServer's MonitorCycle() method uses AllObjectRequestList() to send a GetVersion command with a short timeout to each component. If the command fails, the rtcObject sending the command marks the component as having generated an exception, and no further commands are sent to it.

If a component does fail, the InError method is called to set the error flag and record the name of the component in error.
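One monitor cycle, as described above, might look roughly like the following sketch. SupervisedObject, ErrorStatus and PingFn are hypothetical types introduced for illustration; the ping callable stands in for sending GetVersion with a short timeout.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical view of one supervised component.
struct SupervisedObject {
    std::string name;
    bool failed = false;  // set once a ping has failed
};

// Hypothetical error record, mirroring an error flag plus a "who" string.
struct ErrorStatus {
    bool in_error = false;
    std::string who;  // space-separated names of the components in error
};

// Stand-in for "send GetVersion with a short timeout"; true on a reply.
using PingFn = std::function<bool(const std::string& component)>;

// Ping every component that has not already failed and record new failures.
void MonitorCycle(std::vector<SupervisedObject>& objects,
                  const PingFn& ping,
                  ErrorStatus& status) {
    for (auto& object : objects) {
        if (object.failed) continue;  // already marked: send nothing further
        if (!ping(object.name)) {
            object.failed = true;
            // Equivalent of the InError step: raise the flag, record the name.
            status.in_error = true;
            if (!status.who.empty()) status.who += ' ';
            status.who += object.name;
        }
    }
}
```

A component that has failed once is skipped on later cycles, which matches the behaviour described above where no further commands are sent to it.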

Error Notification

In general, when the rtcSupervisor notices that something has gone wrong, it calls the rtcSupervisor::InError method, which updates the OLDB with the error and an indication of the cause.

Mutex Usage

A std::mutex is available in the RtcSupervisor class which can be used to globally lock the component.

This is used, for example, to prevent the monitor thread from trying to “ping” the components while an activity is running.

As new extension points and the ability to add interfaces to the RtcSupervisor are added it will be necessary for programmers to make use of this facility.
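The locking pattern can be sketched as follows. SupervisorSketch and its methods are hypothetical and only illustrate how an activity and the monitor cycle might share one std::mutex; they are not the actual RtcSupervisor interface.

```cpp
#include <functional>
#include <mutex>

// Hypothetical supervisor skeleton; only the locking pattern matters here.
class SupervisorSketch {
public:
    // Run an activity (e.g. Init or Enable) while holding the global lock.
    void RunActivity(const std::function<void()>& activity) {
        std::lock_guard<std::mutex> lock(mutex_);
        activity();
        ++activities_run_;
    }

    // One monitor cycle: waits until no activity holds the lock, then pings.
    void MonitorCycle(const std::function<void()>& ping_all) {
        std::lock_guard<std::mutex> lock(mutex_);
        ping_all();
        ++cycles_run_;
    }

    int activities_run() const { return activities_run_; }
    int cycles_run() const { return cycles_run_; }

private:
    std::mutex mutex_;        // global component lock
    int activities_run_ = 0;  // counters are only modified under mutex_
    int cycles_run_ = 0;
};
```

Because std::mutex is not recursive, code that already holds the lock (for example an activity body in this sketch) must not call another method that locks it again.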

Configuration

The supervisor has the following static parameters which define whether the associated activities are started in the supervised components in parallel or series.

Listing 1 rtc_sup.yaml
cfg_static:
    init_alone:
        type: RtcBool
        value: true
    enable_alone:
        type: RtcBool
        value: true
    disable_alone:
        type: RtcBool
        value: false
    update_alone:
        type: RtcBool
        value: false

The supervisor needs to get a list of components which are supervised. As an INTERIM MEASURE, these are read from a DEPL table. It is likely that whatever component implements RTC deployment set starting/stopping will populate something similar. Any functionality regarding the usage of this DEPL table will be revisited.

The RTC Supervisor only reads the object_list attribute; the others are used by the deployment component. The presence of rtc_sup in the object list is optional: it is used by the DEPL component for deployment, and the supervisor skips it if found. If you give your rtcSupervisor a different name, you will need to modify the rtcSupervisor to skip this new name. Look at the code in the rtcObjectConfig.cpp with the comment

Listing 2 depl.yaml
object_list:
    type: RtcString
    value: "rtc_sup object1 object2"
rtcSupervisor:
    host:
        type: RtcString
        value: "localhost"
    exe:
        type: RtcString
        value: "rtctkRtcSupervisor"
object1:
    host:
        type: RtcString
        value: "localhost"
    exe:
        type: RtcString
        value: "rtctkExampleComponent"
object2:
    host:
        type: RtcString
        value: "localhost"
    exe:
        type: RtcString
        value: "rtctkExampleComponent"
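Reading the object_list value and skipping the supervisor's own entry could be done along the following lines. This is a minimal sketch, not the code in rtcObjectConfig.cpp; SupervisedNames is a hypothetical helper, and it assumes the names in object_list are separated by whitespace as in the listing above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split the whitespace-separated object_list value and drop the
// supervisor's own entry, if it is present.
std::vector<std::string> SupervisedNames(const std::string& object_list,
                                         const std::string& supervisor_name) {
    std::vector<std::string> names;
    std::istringstream stream(object_list);
    std::string name;
    while (stream >> name) {
        if (name != supervisor_name) names.push_back(name);
    }
    return names;
}
```

For the listing above, SupervisedNames("rtc_sup object1 object2", "rtc_sup") yields object1 and object2. Taking the name to skip as a parameter is a design choice of this sketch, made so that a differently named supervisor would not require a code change.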

Deployment and Other Scripts

To aid testing of the rtcSupervisor, a simple Python deployment script is provided. It accepts a file like depl.yaml and deploys each of the components identified in the object_list using the rtctkStartObject.sh wrapper, passing the executable name and the component name.

The rtctkStartObject.sh script acts as a simple wrapper, allowing all RTC components to be identified by looking for rtctkStartObject.sh on the command line. The script launches the object and waits for its completion.

The rtctkRtcSuper_start_components.sh script copies some of the resources into a “run” directory, does some cleanup and uses the deploy mechanism described above.

The rtctkRtcSuper_stop_components.sh script just kills all the rtctkStartComponent instances using killall.

The rtctkRtcSuper_show_oldb.sh script provides a simple way of keeping an eye on the fake OLDB contents.

The rtctkSendCommand.sh script is a simple wrapper for the rtctkClient that passes the SDE file argument. The script assumes it can find this file in the directory identified by $REPO_DIR.

Todo

  • User extensions. Provide a mechanism for users to add their own functionality, e.g. “InError”, “SetMode”, “Run(thing)”.

  • RTR population. The Runtime Repo is not currently populated at init time.

  • SetMode. Mode setting by populating parts of the Runtime Repo is currently not supported.