fault management
What is fault management?
Fault management is the component of network management that detects, isolates and fixes problems. When properly implemented, network fault management can keep connectivity, applications and services running at an optimal level, provide fault tolerance and minimize downtime. Fault management systems are platforms or tools designed specifically for this purpose.
Faults result from malfunctions or events that interfere with, degrade or obstruct service delivery. Examples of faults include hardware failures, connectivity loss and port status changes. Once the fault management platform detects a fault, it notifies the administrator -- and any additional authorized or designated parties -- via an alarm or alert.
Network administrators can view these notifications in the fault management system's GUI, and many platforms can forward these alerts via email, text or a mobile app. Network administrators can also configure fault management systems to automatically fix or prevent certain events using programs and scripts.
Fault management is one component of FCAPS (fault management, configuration, accounting, performance and security), which is a network management framework established by the International Organization for Standardization (ISO).
Important functions of fault management
Network fault management comprises a variety of functions that keep the network operational. A fault management system typically does the following:
- Defines thresholds for potential failure conditions.
- Constantly monitors system status and usage levels.
- Continuously scans for threats, such as viruses and Trojans.
- Provides general diagnostics.
- Remotely controls system elements, including workstations and servers, from a single location.
- Notifies administrators and users of impending and actual malfunctions.
- Traces the locations of potential and actual malfunctions.
- Automatically corrects potential problem-causing conditions.
- Automatically fixes malfunctions.
- Comprehensively logs system status and actions taken.
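To make the first few functions in the list above concrete, here is a minimal sketch of threshold definition and evaluation. The `Threshold` class, the metric names and the "warning"/"critical" severity labels are assumptions made for illustration; they do not come from any particular fault management product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold definition for one monitored metric."""
    metric: str
    warning: float   # value at or above which a warning is raised
    critical: float  # value at or above which a critical alert is raised


def evaluate(threshold: Threshold, value: float) -> str:
    """Classify a single reading against its threshold."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def check_all(thresholds, readings):
    """Return (metric, severity, value) for every reading that breaches a threshold.

    Metrics with no current reading are skipped rather than treated as faults.
    """
    alerts = []
    for threshold in thresholds:
        if threshold.metric not in readings:
            continue
        value = readings[threshold.metric]
        severity = evaluate(threshold, value)
        if severity != "ok":
            alerts.append((threshold.metric, severity, value))
    return alerts
```

A real platform would attach such checks to a scheduler and a notification channel; the sketch covers only the comparison step.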
Types of fault management
There are two types of network fault management: active and passive.
Active fault management
Active fault management uses various tools, such as ping or Transmission Control Protocol/User Datagram Protocol port checks, to continually query devices and determine their status. This is akin to a person asking everyone in a room, "How are you?" at regular intervals. This enables the fault management system to identify and rectify potential issues in real time, sometimes before they become problems. The tradeoff is additional network traffic generated by the polling itself.
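A TCP port check of the kind described above can be sketched with Python's standard `socket` module. The function names `tcp_port_open` and `poll` and the two-second default timeout are illustrative assumptions, not part of any specific tool.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def poll(targets, timeout: float = 2.0):
    """Check every (host, port) pair once and return a status map."""
    return {(host, port): tcp_port_open(host, port, timeout) for host, port in targets}
```

Calling `poll` on a list of `(host, port)` pairs at a fixed interval gives the repeated "How are you?" behavior. Each call opens one connection per target, which is the extra traffic that active monitoring costs.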
Passive fault management
Passive fault management systems monitor network environments for events that indicate a fault or failure has occurred. This information might come from error logs or Simple Network Management Protocol traps, among other sources. This is akin to a person who quietly listens until someone calls out for help. While passive fault management is more conservative in its resource use, its drawback is that it might not discover a fault until it is too late.
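As a rough illustration of the passive approach, the sketch below scans log lines that have already arrived and extracts the ones that indicate a fault. The log line layout (`timestamp device SEVERITY: message`) and the set of severities treated as faults are assumptions made for this example; real devices and collectors use their own formats.

```python
import re

# Hypothetical log line layout: "<timestamp> <device> <SEVERITY>: <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) (?P<device>\S+) (?P<severity>[A-Z]+): (?P<message>.*)$"
)

# Assumption: only these severities are treated as faults.
FAULT_SEVERITIES = {"ERROR", "CRITICAL"}


def extract_faults(lines):
    """Return one dict per log line whose severity indicates a fault.

    Lines that do not match the assumed layout are ignored.
    """
    faults = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("severity") in FAULT_SEVERITIES:
            faults.append(match.groupdict())
    return faults
```

Nothing is sent to the devices here: the code only reacts to what they report, which is why a device that fails silently would go unnoticed.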
Fault management process
The fault management process used in commercial platforms might vary slightly among different vendors, but fault management systems typically follow the same lifecycle:
- Fault detection. The system discovers that service delivery has been interrupted or its performance has degraded.
- Fault diagnosis and isolation. The system identifies the source of the fault, such as a component failure or power outage, and its location in the network topology.
- Event correlation and aggregation. Because a single fault can cause multiple alarms, fault management systems often group related events for administrators and provide a root cause analysis.
- Restoration of service. The fault management system automatically executes any preconfigured scripts or programs to get services up and running as soon as possible.
- Problem resolution. The system corrects, repairs or replaces the source of the fault. In some cases, manual intervention might be necessary based on the cause.
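The event correlation step in the lifecycle above can be sketched as follows. This is one simple, topology-based heuristic, not the method used by any particular product: each alarmed device is attributed to its topmost alarmed upstream device, on the assumption that an upstream failure explains the alarms beneath it. The `parent` map and the device names in the usage note are hypothetical.

```python
def correlate(alarmed, parent):
    """Group alarmed devices under their probable root cause.

    alarmed: set of device names that currently have an alarm.
    parent:  dict mapping each device to its upstream device (None at the top).
    Returns a dict of {root_cause_device: [devices attributed to it]}.
    """
    groups = {}
    for device in sorted(alarmed):
        root = device
        seen = {device}  # guards against accidental loops in the parent map
        node = parent.get(device)
        while node is not None and node not in seen:
            if node in alarmed:
                root = node  # a higher alarmed device is a better root candidate
            seen.add(node)
            node = parent.get(node)
        groups.setdefault(root, []).append(device)
    return groups
```

For example, if access switches `acc1` and `acc2` both sit behind `dist1` and all three raise alarms, `correlate` reports a single group rooted at `dist1` instead of three unrelated alarms. Commercial systems generally combine topology with timing and event-type rules, which this sketch omits.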
Editor's note: This article was reformatted to improve the reader experience.