STONITH (Shoot The Other Node In The Head)
What is STONITH (Shoot The Other Node In The Head)?
STONITH (Shoot The Other Node In The Head) is a Linux service for maintaining the integrity of nodes in a high-availability (HA) cluster. STONITH automatically powers down a node that is not working correctly.
STONITH is used as part of a cluster's fencing strategy. Fencing provides a mechanism for monitoring the state of a cluster's resources and nodes and then taking action if something does not appear right with one of them. For this reason, fencing is often defined as a method for bringing a cluster to a known state. It ensures that no resources or nodes are operating in an unknown state and perhaps taking actions that could be detrimental to systems or data.
There are two types of fencing in a Linux cluster: resource-level and node-level.
- Resource-level fencing ensures that a resource cannot be accessed from more than one node at the same time. For example, the fencing mechanism might block access to a resource from every node except the currently active one.
- Node-level fencing ensures that a node does not run any resources. This is accomplished by either shutting down the server or changing its status in some other way, such as restarting it and setting it as the secondary node.
The STONITH service is used for node-level fencing. If a node fails to respond or is behaving unusually, STONITH ensures that the node cannot do any damage, especially in the event of a split-brain scenario (when two nodes both think they're the active server).
For example, if a primary database node automatically fails over to a standby node because of a disruption in service, the first node might still try to write data to the shared storage system at the same time as the new primary node, which could corrupt the data or impact its integrity. To prevent this from happening, STONITH will shut down the first node or restart it and set it as the standby server.
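The ordering in this scenario can be sketched in a few lines of Python. This is a toy model rather than how any real cluster manager is written: the `Node` class and the `fence` and `fail_over` functions are hypothetical names invented for illustration. The point it shows is that the resource moves to the standby node only after the fence on the failed node is confirmed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node.
    name: str
    powered_on: bool = True
    resources: set = field(default_factory=set)


def fence(node: Node) -> bool:
    """Simulate a node-level fence: power the node off and drop its resources."""
    node.powered_on = False
    node.resources.clear()
    return not node.powered_on  # True means the fence is confirmed


def fail_over(resource: str, failed: Node, standby: Node) -> bool:
    """Move a resource to the standby only after the failed node is fenced."""
    if not fence(failed):
        # Without a confirmed fence, both nodes might write to shared storage.
        return False
    standby.resources.add(resource)
    return True


primary = Node("db1", resources={"postgres"})
standby = Node("db2")
moved = fail_over("postgres", primary, standby)
print(moved, primary.powered_on, standby.resources)
```

In this model the old primary can never hold the resource at the same time as the new one, which is the guarantee that protects the shared storage.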
How is STONITH implemented?
STONITH is often implemented through a cluster's hardware, where it can respond to an event without involving the operating system (OS). Although hardware-based STONITH works well, it might require special components to be installed in each server, which can make the nodes more expensive and result in hardware vendor lock-in.
Various types of hardware components can serve as STONITH devices, provided that power management is available for all nodes in the cluster. A STONITH device might be installed on each node, such as a Dell Remote Access Controller (DRAC) or an HPE Integrated Lights-Out (iLO) component. STONITH can also be implemented through devices that manage power on multiple nodes, such as a blade power control device, an uninterruptible power supply (UPS) or a power distribution unit (PDU). In addition, STONITH can be used with devices that support the Intelligent Platform Management Interface (IPMI).
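To give a sense of what an out-of-band power request looks like, the following Python sketch builds an `ipmitool` command line that asks a node's management controller to change the chassis power state. It is only an illustration: in a real cluster, a fence agent configured in the cluster manager normally issues such requests, and the function name `ipmi_power_command`, the address and the user name are assumptions made for this example. The sketch only builds the argument list and does not run the command.

```python
# Power actions accepted by "ipmitool chassis power" that this sketch allows.
ALLOWED_ACTIONS = {"status", "on", "off", "cycle", "reset"}


def ipmi_power_command(host, user, action="off"):
    """Build an ipmitool argument list for a chassis power request.

    The -E option tells ipmitool to read the password from the
    IPMI_PASSWORD environment variable, so it is not placed on the
    command line.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over the network
        "-H", host,        # address of the management controller (assumed)
        "-U", user,        # management user name (assumed)
        "-E",
        "chassis", "power", action,
    ]


# Hypothetical address and user, for illustration only.
command = ipmi_power_command("10.0.0.12", "fenceuser")
print(" ".join(command))
```

Because the request goes to the management controller rather than to the node's OS, it can succeed even when the OS on the target node is hung.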
STONITH can also be implemented through a disk-based solution. One of the most common is SBD, an abbreviation that is variously expanded as split-brain detector, STONITH block device or storage-based death. SBD STONITH can be easier to implement because no specific power management hardware is required. However, many implementations require a small amount of shared storage space. In SBD STONITH, the nodes in the Linux cluster use a heartbeat mechanism or messaging service, typically carried over the shared storage, to keep each other updated. If something goes wrong with a node in the cluster, such as the node receiving a fencing message or losing contact with the shared storage, the node will terminate itself.
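The self-termination idea can be illustrated with a small Python simulation. This is a simplified model under stated assumptions, not the actual SBD implementation: the `SharedDisk` and `SbdNode` classes, the per-node message slots and the `"off"` message value are hypothetical names chosen for the sketch. In the model, a peer writes a fencing message into the target node's slot on the shared storage, and the target node terminates itself when it next reads that slot or when it cannot read the storage at all.

```python
class SharedDisk:
    """Stand-in for the small shared storage area: one message slot per node."""

    def __init__(self, node_names):
        self.slots = {name: "clear" for name in node_names}
        self.reachable = True

    def write(self, target, message):
        self.slots[target] = message

    def read(self, name):
        if not self.reachable:
            raise OSError("shared disk unreachable")
        return self.slots[name]


class SbdNode:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.alive = True

    def check_slot(self):
        """One polling cycle: self-fence on a fencing message or lost storage."""
        try:
            message = self.disk.read(self.name)
        except OSError:
            # Modeling assumption: lost storage is treated like a fence request.
            message = "off"
        if message == "off":
            # A real node would be reset here, for example by a watchdog timer.
            self.alive = False
        return self.alive


disk = SharedDisk(["node1", "node2"])
node1 = SbdNode("node1", disk)
node2 = SbdNode("node2", disk)

disk.write("node2", "off")  # node1 asks node2 to fence itself
node1.check_slot()
node2.check_slot()
print(node1.alive, node2.alive)
```

The design choice worth noting is that the fenced node acts on itself, which is why no external power switch is needed.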
Cloud providers also offer STONITH capabilities that are often specific to their platforms and the systems they're running. For example, Azure provides two options for setting up fencing on SUSE Linux Enterprise Server: the Azure fence agent or SBD STONITH. The Azure fence agent restarts a failed node via Azure APIs. SBD STONITH works much as it does in on-premises deployments. However, it requires at least one additional virtual machine (VM) to serve as an Internet Small Computer System Interface (iSCSI) target server and to provide the system with an SBD device.