The role of network observability in distributed systems
Network observability sounds like a new term for an existing practice, but is that the case? This chapter excerpt lays out what observability is and why network teams should care.
When William Shakespeare wrote about how a rose by any other name would smell as sweet, he may as well have been writing about IT, where new terminology pops up regularly and sometimes appears synonymous with existing terms.
This is the case for network observability, a term that network professionals could confuse with network monitoring -- justifiably so. Yet network observability emerged because network systems have become more distributed. Network pros may struggle to fully view or understand distributed systems, and the goal of network observability is to make these systems more transparent and easier to understand, according to author Dinesh Dutt. If network pros can understand their distributed systems, they can then control, build and manage them better as well.
Network observability plays a key role in Dutt's book, Cloud Native Data Center Networking, from O'Reilly Media. In addition to observability, the book touches on other key aspects of cloud-native networking, including why organizations may need it and the roles of virtualization and automation in networking.
Network observability's rose by another name -- network monitoring -- is also featured in the book and is a common question as people start to learn about observability, Dutt said. The key difference between the two terms is that monitoring can alert network pros when a failure occurs, but observability can help them understand why the failure occurred. Observability is about providing answers, Dutt said, whereas monitoring is fault-based and performance-based.
For pros starting out with network observability, Dutt said the first action to take is to forget about vendors and go back to networking basics. The goal is for network pros to be able to abstract away vendor-specific details when they look at a system. Dutt also noted that network observability skills can ease the transition into automation and enable network pros to use and improve the network they already have before designing the next one.
Below is an excerpt from Chapter 11, "Network Observability," which defines the term and explains why network pros should care about it.
Two distributed systems experts, one a theoretician and the other a practitioner, separated by a generation, make the same observation. Distributed systems are hard to understand, hard to control, and always frustrating when things go wrong. And sandwiched in the middle between the endpoints is the network operator. "Is it the network?" is not too far down the list of universal questions such as, "What is the meaning of life, the universe, and everything?" Sadly, network operators do not even have the humor of a Douglas Adams story to fall back on.
The modern data center, with its scale and the increasingly distributed nature of its applications, only makes it more difficult to answer the questions that network operators have been dealing with since the dawn of distributed applications. Observability represents the operator's latest attempt to respond adequately to those questions. Along with automation, observability has become one of the central pillars of the cloud native data center.
The primary goal of this chapter is to leave you with an understanding of the importance of observability and the unique challenges of network observability. You should be able to answer questions such as the following:
- What is observability and why should I care?
- What are the challenges of network observability?
We begin the story with a definition of observability and how it is different from monitoring.
What Is Observability?
Observability can be defined as the property of a system that provides answers to questions ranging from the mundane (What's the state of BGP?) to the existential (Is my network happy?). More precisely, observability is a way for an operator to understand what's going on in a system by examining the outputs provided by the system. As a concrete example, without having examined the BGP process's data structures, packet exchanges, state machine, and so on, we can use the show bgp summary command to infer the state of BGP as a whole on a node.
Twitter, which is credited with the use of the term in its current meaning, had this to say in its announcement: "The Observability Engineering team at Twitter provides full-stack libraries and multiple services to our internal engineering teams to monitor service health, alert on issues, support root cause investigation by providing distributed systems call traces, and support diagnosis by creating a searchable index of aggregated application/system logs." (The emphasis is mine.)
Network operators today have a difficult time answering such questions. Alan Kay, a pioneering computer scientist, once said, "Simple things should be simple, and complex things should be possible." Network operators have not seen network operations satisfy this maxim. Their mean time to innocence -- proving whether the network is at fault and, if so, what the cause is -- has always been arduous. To illustrate why I say this, think of what is needed to answer the question, "Are all my BGP sessions in Established state?" You can tell that by scanning a listing of the state of all BGP sessions. A better question might be "Which of my BGP sessions did not reach the Established state?" But you cannot answer this question unless the system provides you with a list of failed sessions. Providing a list of all peering sessions and leaving you to work out which peerings failed to establish is not as good.
To understand why, consider a network with 128, or even 32, BGP sessions spread across multiple address families and VRFs. You can't list all these sessions on a single screen, and it takes time to eyeball the output for a problem. Automating this also involves writing a more involved program, and, let's face it, how many network operators automate this part of their life? If a system provides the option to list only the failed sessions, you can instantly focus on those. Even better is a command that lists failed sessions along with the reason for the failure. This saves you from examining logs or using some other mechanism to identify the cause of the failure. And now extend the problem to answering the question across tens to thousands of nodes, and you can understand the goal of a well-observed system.
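As a rough illustration of what that automation might look like, here is a minimal sketch that filters a BGP summary down to only the failed sessions. It assumes the routing stack can emit its summary as JSON (FRR, for example, supports show bgp summary json via vtysh); the peers and state key names below follow an FRR-like layout and are an assumption -- adjust them for your platform.

```python
import json
import subprocess


def failed_bgp_sessions(summary: dict) -> dict:
    """Return peer -> state for every BGP session not in the Established state.

    Assumes an FRR-like JSON layout: one object per address family, each with
    a 'peers' map of peer address -> attributes that include a 'state' string.
    These key names are an assumption; other platforms name them differently.
    """
    failed = {}
    for afi_data in summary.values():
        if not isinstance(afi_data, dict):
            continue
        for peer, attrs in afi_data.get("peers", {}).items():
            if attrs.get("state") != "Established":
                failed[peer] = attrs.get("state", "unknown")
    return failed


if __name__ == "__main__":
    # Collection step: ask the local routing daemon for its BGP summary as JSON.
    output = subprocess.run(
        ["vtysh", "-c", "show bgp summary json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for peer, state in sorted(failed_bgp_sessions(json.loads(output)).items()):
        print(f"{peer}: {state}")
```

Run against a fleet of nodes, the same filter reduces thousands of mostly healthy sessions to the handful that actually need attention, which is the property a well-observed system provides out of the box.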
How easily you can answer your question is a measure of how observable a system is. The more easily you can gather the information from the commands, the better you can grasp what's crucial, and the less you need to keep in short-term memory to build a map of the network as it is currently functioning.
Network operators have had some measure of monitoring, but not observability. One of the clearest descriptions of the difference between the two comes from Baron Schwartz, the founder and CTO of VividCortex. He once wrote: "Monitoring tells you whether the system works. Observability lets you ask why it's not working." For example, monitoring can tell you a BGP session is down, but observability can help you answer why. In other words, monitoring assumes we know what to monitor, which also implies that we know the acceptable and abnormal values for the things we monitor. But the data center and the modern application architecture are large and complex enough to leave many things unknown. To reuse a popular cliché, "Monitoring is sufficient for capturing the known knowns, whereas observability is required for tracking down what went wrong when encountering the unknown unknowns."