The Node Framework#

Note

This page was converted to RST from the original Confluence documentation article, which was written by Jacob Glueck on August 21st, 2018.

Motivation#

During the 2015 rover competition, failing to read from the pH sensor on the rover caused the entire system to crash and our rover to speed towards a cliff. Since then, there have been multiple other occasions when a failure of one part of the rover prevented the entire rover from operating.

To solve this problem, the rover program needs to be a fault-tolerant system that can recover from or otherwise manage failures of individual parts without failing as a whole. During the 2015-2016 school year, Changxu Lu (cl795) and Ryan Pindulic (rnp35) attempted to solve this problem using the Component Framework. However, while testing during the Spring 2016 semester, the controls software team found a number of problems with the Component Framework:

It was hard to run things independently because they all required a shared Component Graph Manager.
There was no distinction between a node being enabled externally and being used by another node. Without this distinction, it is unclear if a given node should disable if a node using it does. For example, if the arm enabled a device, and then the arm disabled, should the device disable too? What if something else is using it? These questions led me to develop a better enabling model for the nodes.
It had no support for node configuration.

Ultimately, the whole framework became annoying to deal with during testing because every time we had to make a change to the code, the component framework would get in the way. As such, we mostly removed it during testing, only keeping the essential subsystem nodes, which allowed us to easily enable things like drive, the arm, and the drill.

While we removed the Component Framework during testing, the fault-tolerance problem remained: if the code for one task, like the arm, crashed, the whole rover would crash. While we fixed specific things (like the pH sensor) so that such crashes were less likely, we wanted a strong guarantee that, regardless of how we write the code for each part of the rover, a small failure could never crash the whole rover. As such, Jacob Glueck wrote the Node Framework during the Fall 2016 semester.

Design#

The Node Framework is an elegant way to isolate different functionality in nodes, while allowing nodes to depend on each other, and managing dependency failures. The following defines the main components of the framework:

Node: A simple program with a single purpose.
Name element: A string, without any / characters, representing part of a path.
Path: A globally unique identifier for a given node, composed of many name elements joined by / characters. For example, /rover/drive/wheels/left-front.
Name: The final component of a path. In the example above, the name would be left-front.
Absolute path: A path which begins with a /.
Relative path: A path which does not begin with a /.
Namespace: An absolute path or relative path which is prepended to some other relative path to produce an expanded path. For example, the node wheels/left-front in the namespace /rover/drive/ would have a full path of /rover/drive/wheels/left-front.
Dependency: A mapping between a string identifier and a path. An identifier which is not mapped to a path is an unsatisfied dependency, and an identifier which is mapped to a path is a satisfied dependency. In order for a node to run, all its dependencies must be satisfied. A node generally relies on functions provided by its dependencies. For example, a node may have a laser dependency. That dependency could be satisfied by a node: laser
Required dependencies: The dependencies which must be satisfied in order for a node to run. For example, a given node might require that lasers, rockets, and med-bay all be satisfied.
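
The path terminology above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the framework's own code: the function names is_absolute, expand, and name_of are hypothetical, and the choice to return absolute paths unchanged (and to ignore an empty namespace) is an assumption.

```python
def is_absolute(path: str) -> bool:
    """A path is absolute when it begins with '/'."""
    return path.startswith("/")


def expand(namespace: str, path: str) -> str:
    """Prepend a namespace to a relative path to produce an expanded path.

    Assumption: absolute paths and empty namespaces leave the path unchanged.
    """
    if is_absolute(path) or not namespace:
        return path
    # Avoid a doubled '/' when the namespace already ends with one.
    return namespace.rstrip("/") + "/" + path


def name_of(path: str) -> str:
    """Return the name, i.e. the final name element of a path."""
    return path.rstrip("/").split("/")[-1]


full_path = expand("/rover/drive/", "wheels/left-front")
print(full_path)           # /rover/drive/wheels/left-front
print(name_of(full_path))  # left-front
```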

As the nodes use Robot Operating System (ROS) functions to communicate, the following terms are also important:

Topic: A channel for one-way communication between nodes. Topics do not support acknowledgments: a node can publish a message to a topic, but it receives no response indicating whether any node received it. Topics broadcast their messages, so many nodes can subscribe to a topic at once. (ROS also allows several nodes to publish to the same topic.)
Service: A mechanism for sending a message and receiving a response. A node can advertise a service, and other nodes can call it. Services do not support timeouts, so a node calling a service might block indefinitely.
Action: A mechanism for executing goals. A node can advertise an action, and other nodes can call it. Unlike services, actions can send feedback while the goal is executing, and they can also be canceled and can time out. Furthermore, actions are about 5 times faster than services. (See issue 145 for detailed timing analysis.)

Communication#

Because actions are faster than services, and support timeouts, all nodes use actions for bidirectional communication and topics for one-way communication.

Node State Model#

Each node has 2 states: on and off. The transition from off to on is activation and the transition from on to off is deactivation. However, nodes are not activated and deactivated directly. Instead, nodes expose 4 actions to control their state: enable, disable, acquire, and release.

The enable and disable commands are called by the user in order to activate or deactivate nodes. Calling enable on a node activates the node if the node is not already activated, and marks the node as enabled. Calling disable deactivates the node, and marks the node as disabled.

The acquire and release actions allow nodes to manage their dependencies. When a node activates, it acquires all of its dependencies. Upon being acquired, each dependency adds the path of the node that acquired it to its user list. If the dependency is off, it activates (acquiring its own dependencies as necessary). When a node deactivates, it releases all of its dependencies. Upon being released, each dependency removes the path of the node that released it from its user list. If the user list is now empty and the dependency is disabled, it deactivates, since no other nodes are using it and it was not enabled by a user.
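
The state model can be illustrated with a small in-memory simulation. This is a minimal sketch rather than the framework's implementation: the Node class and its attributes are hypothetical, the four operations are plain method calls instead of ROS actions, and what happens to a node's users when it is disabled is not modeled.

```python
class Node:
    """Hypothetical in-memory model of a node's on/off state."""

    def __init__(self, path, dependencies=()):
        self.path = path
        self.dependencies = list(dependencies)  # Node objects this node relies on
        self.on = False        # the node's actual state
        self.enabled = False   # True when a user enabled the node
        self.users = set()     # paths of the nodes that acquired this node

    def _activate(self):
        if self.on:
            return
        for dep in self.dependencies:
            dep.acquire(self.path)  # activating acquires every dependency
        self.on = True

    def _deactivate(self):
        if not self.on:
            return
        self.on = False
        for dep in self.dependencies:
            dep.release(self.path)  # deactivating releases every dependency

    def enable(self):
        self.enabled = True
        self._activate()

    def disable(self):
        self.enabled = False
        self._deactivate()

    def acquire(self, user_path):
        self.users.add(user_path)
        self._activate()

    def release(self, user_path):
        self.users.discard(user_path)
        # Stay on if another node still uses this one or a user enabled it.
        if not self.users and not self.enabled:
            self._deactivate()


c = Node("/c")
b = Node("/b", [c])
a = Node("/a", [b])
a.enable()   # activates a, which acquires b, which acquires c
a.disable()  # a releases b; b has no users and is disabled, so it releases c
```

Keeping the enabled flag separate from the user list is what lets a node stay on after its last user releases it, provided a user enabled it directly.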

Fault Handling#

When a node fails or is notified that one of its dependencies has failed, it executes the following sequence of actions:

It deactivates.
It notifies all the nodes which acquired it of its failure.
If the node is disabled and thus was only running because other nodes had acquired it, it stops here, and does not execute any more steps.
If the node is enabled, it will attempt to recover, depending on its configuration. Each node has two settings which dictate its recovery behavior:
- restart_delay: the amount of time to wait before trying to activate again.
- max_restart_attempts: the maximum number of times to try to restart before disabling. If 0, then the node will not attempt to recover.
The node will then, up to max_restart_attempts times, wait for the restart_delay, and then attempt to activate again.

Note that this behavior provides support for error handling only at the enabled nodes and the node which first raised the error.
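
The failure sequence above can be sketched as follows. This is an illustration under stated assumptions, not the framework's code: DemoNode, handle_failure, and the method names deactivate, notify_users_of_failure, and try_activate are all hypothetical, and the sleep function is injectable so the example can run without real delays.

```python
import time


class DemoNode:
    """Hypothetical stand-in for a node; activation succeeds on the Nth attempt."""

    def __init__(self, enabled, restart_delay, max_restart_attempts, succeed_on):
        self.enabled = enabled
        self.restart_delay = restart_delay
        self.max_restart_attempts = max_restart_attempts
        self.succeed_on = succeed_on
        self.attempts = 0
        self.on = True
        self.log = []

    def deactivate(self):
        self.on = False
        self.log.append("deactivate")

    def notify_users_of_failure(self):
        self.log.append("notify")

    def try_activate(self):
        self.attempts += 1
        self.on = self.attempts >= self.succeed_on
        return self.on


def handle_failure(node, sleep=time.sleep):
    """Run the failure sequence; return True if the node recovered."""
    node.deactivate()
    node.notify_users_of_failure()
    if not node.enabled:
        return False  # only running because it was acquired: stop here
    for _ in range(node.max_restart_attempts):
        sleep(node.restart_delay)
        if node.try_activate():
            return True
    node.enabled = False  # restart attempts exhausted: the node disables
    return False


delays = []
node = DemoNode(enabled=True, restart_delay=10, max_restart_attempts=3, succeed_on=2)
recovered = handle_failure(node, sleep=delays.append)  # record delays instead of sleeping
```

With max_restart_attempts set to 0 the loop body never runs, which matches the rule that such a node does not attempt to recover.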

An example to enhance your understanding

To understand this design, consider three nodes, A, B, and C, where A depends on B and B depends on C. A is initially enabled. If C notices a problem, it can try to fix the problem itself. If it cannot, it deactivates and notifies B. B now knows that C has failed and that C could not fix the problem itself, so there is nothing B can do to fix C, and B has no choice but to deactivate as well. B could try activating C again, but that is all any node could do. If every node tried to restart the failed nodes below it as the error propagated upward, the result would be a flood of restart attempts that waste time while nothing the high-level control node is trying to do works.

The enabled nodes, however, do get a chance to restart their failing dependencies. An enabled node is like any other node in that it can do nothing special to fix another node's error, but it was enabled externally. If it deactivated because of a failure, the operator who enabled it would probably just try enabling it again to see whether cycling the whole system off and on clears the error. Allowing enabled nodes to restart automatically does this for the operator.

Dependency Management#

Each node maintains a mapping from dependency identifiers to the paths of the nodes which satisfy the dependencies. For example, a starship node might maintain the following mapping:

Identifier	Path
lasers	/ship/laser_cannons
warpdrive	/ship/drive/warp
inertial	/ship/inertial
gravity	/ship/inertial

This allows one node to satisfy multiple dependencies.
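
As a rough illustration, the mapping above can be modeled as a Python dictionary. The helper names unsatisfied and nodes_to_acquire are hypothetical, and the idea that a node shared by several identifiers only needs to be acquired once is an assumption, not something the framework is stated to do.

```python
# The starship node's dependency mapping: identifier -> path.
dependencies = {
    "lasers": "/ship/laser_cannons",
    "warpdrive": "/ship/drive/warp",
    "inertial": "/ship/inertial",
    "gravity": "/ship/inertial",
}


def unsatisfied(required, mapping):
    """Return the required identifiers that are not mapped to a path."""
    return sorted(ident for ident in required if mapping.get(ident) is None)


def nodes_to_acquire(mapping):
    """Return each distinct path once, even if it satisfies several identifiers."""
    return sorted({path for path in mapping.values() if path is not None})


missing = unsatisfied(["lasers", "rockets"], dependencies)
# "rockets" has no mapping, so a node that requires it could not run.
paths = nodes_to_acquire(dependencies)
# /ship/inertial appears once although it satisfies both inertial and gravity.
```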

Configuration#

Each node has a configuration file which defines the recovery constants and dependencies. The configuration file also includes an internal section, in which the node can store any data it wants to. For example, consider a configuration file for a starship node:

node:
    restart_delay: 10
    max_restart_attempts: 1
dependencies:
    lasers: /ship/laser_cannons
    warpdrive: /ship/drive/warp
    inertial_damping_system: /ship/inertial
    gravity_control_system: /ship/inertial
internal:
    artificial_g: 9.8

The file specifies the recovery parameters in the node section, the dependencies in the dependencies section, and the extra configuration in the internal section.
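
One way a node might hold this configuration in memory is sketched below. It assumes the YAML file has already been parsed into a dictionary by a YAML library; the NodeConfig class, its from_dict method, and the default values used for missing keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeConfig:
    """Hypothetical in-memory form of a node's configuration file."""

    restart_delay: float
    max_restart_attempts: int
    dependencies: dict = field(default_factory=dict)
    internal: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw):
        node = raw.get("node") or {}
        return cls(
            # Assumption: missing recovery settings default to "never restart".
            restart_delay=node.get("restart_delay", 0),
            max_restart_attempts=node.get("max_restart_attempts", 0),
            dependencies=dict(raw.get("dependencies") or {}),
            internal=dict(raw.get("internal") or {}),
        )


# The starship configuration above, as it would look once parsed.
raw = {
    "node": {"restart_delay": 10, "max_restart_attempts": 1},
    "dependencies": {
        "lasers": "/ship/laser_cannons",
        "warpdrive": "/ship/drive/warp",
        "inertial_damping_system": "/ship/inertial",
        "gravity_control_system": "/ship/inertial",
    },
    "internal": {"artificial_g": 9.8},
}
config = NodeConfig.from_dict(raw)
```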

The configuration file for a node is stored in a ROS package designated as the configuration package. The configuration package is supplied to every node at runtime through a ROS param. Within the configuration package, the configuration file is stored at the path of the node. (See issue 173 for a discussion of how the configuration file location was picked.) For example, consider a system with the following nodes:

drive/
`-- motors
    |-- left-bank-controller
    |-- left-front-motor
    |-- right-bank-controller
    `-- right-front-motor

The configuration file for the drive/motors/left-front-motor node would then be config/drive/motors/left-front-motor.yaml relative to the configuration package.

The configuration structure also has another parameter, the configuration root, which is the namespace at the root of the configuration directory. For example, if the configuration root in the last example was drive, then the configuration file for the node drive/motors/left-front-motor would be config/motors/left-front-motor.yaml.

The final component of the configuration system is the ability for nodes to read information from other files. To read other files, instead of a .yaml configuration file, a node will have a .d configuration directory. Within that directory, there will be a config.yaml file containing the configuration as discussed above. However, the directory can contain any other files the node needs, of any format. The nodes can access these files while running. For example, if the left-front-motor node above wished to be able to access a wheel.stl file for some reason while running, its configuration file would then be stored at config/drive/motors/left-front-motor.d/config.yaml and the wheel.stl file would be stored at config/drive/motors/left-front-motor.d/wheel.stl.
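
The file layout rules above can be summarized in a short function. This is a sketch of the naming scheme only: the function name config_path and its parameters are hypothetical, and locating the configuration package itself (which happens through ROS) is left out.

```python
from pathlib import PurePosixPath


def config_path(node_path, config_root="", directory_style=False):
    """Return a node's configuration file path relative to the configuration package.

    node_path: the node's path, e.g. "drive/motors/left-front-motor"
    config_root: namespace at the root of the configuration directory
    directory_style: True when the node uses a .d configuration directory
    """
    node = PurePosixPath(node_path.strip("/"))
    root = config_root.strip("/")
    if root:
        # Raises ValueError if the node does not live under the configuration root.
        node = node.relative_to(root)
    base = PurePosixPath("config") / node
    if directory_style:
        return str(base.parent / (base.name + ".d") / "config.yaml")
    return str(base.parent / (base.name + ".yaml"))


print(config_path("drive/motors/left-front-motor"))
# config/drive/motors/left-front-motor.yaml
print(config_path("drive/motors/left-front-motor", config_root="drive"))
# config/motors/left-front-motor.yaml
print(config_path("drive/motors/left-front-motor", directory_style=True))
# config/drive/motors/left-front-motor.d/config.yaml
```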

Dynamic Reconfiguration#

One problem in past years has been that we could not change the rover configuration while it was running. The Node Framework, however, allows for this. Every time a node activates, it reads its configuration file. Thus, to change a node’s configuration, the user can change the file and then restart only the node in question. If restarting that node causes any other failures, the framework handles them automatically, restarting the affected nodes through the fault handling procedure discussed above.

Each node restart is also clean, meaning that it does not retain any state from any previous runs. This means that if a node had an internal issue, a restart will clear it.