distributed tracing
What is distributed tracing?
Distributed tracing, also called distributed request tracing, is a method for IT and DevOps teams to monitor applications, especially those composed of microservices. Distributed tracing helps pinpoint where failures occur and what causes suboptimal performance.
Distributed tracing requires software developers to add instrumentation to an application's code. That instrumentation supplies the data that administrators use to analyze performance and that developers use to debug complex software.
Request tracing basics
Request tracing is a fundamental practice in software engineering. Developers add small pieces of instrumentation code that track and report relevant metrics about important behaviors within the application's code. In most cases, the goal is to trace the behavior of requests, such as user requests to a website, as they are handled within the software. This common usage gave the technique the name request tracing.
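As a rough illustration of what such instrumentation can look like, the sketch below wraps a request handler so that every call records its name, duration and outcome. The names traced, RECORDS and handle_cart_request are hypothetical, and the in-memory list stands in for whatever reporting tool would actually receive the metrics.

```python
import functools
import time

# Hypothetical in-memory sink; a real system would ship records to a reporting tool.
RECORDS = []


def traced(operation):
    """Wrap a function so each call records its name, duration and outcome."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = False
            try:
                result = func(*args, **kwargs)
                ok = True
                return result
            finally:
                # Runs whether the handler returned or raised.
                RECORDS.append({
                    "operation": operation,
                    "duration_s": time.perf_counter() - start,
                    "ok": ok,
                })
        return wrapper
    return decorator


@traced("GET /cart")
def handle_cart_request(user_id):
    # Stand-in for real request handling.
    return {"user": user_id, "items": []}


handle_cart_request("u-123")
```

Keeping the instrumentation in a decorator leaves the handler's own logic untouched, which is one common way to limit how much tracing code spreads through an application.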
Request tracing shares similarities with application performance management (APM). A reporting tool organizes, processes and generates visualizations of the stream of metrics from requests, creating a picture of how the software behaves. The application team uses this reporting to quickly spot potential anomalies that could cause performance problems or otherwise poor user experiences.
Request tracing faces serious challenges when used in distributed software architectures with multiple functional modules or services. Services scale independently, resulting in many iterations of the same function, potentially running on different host systems or even different environments.
While it's a simple matter to trace a request through function X within a monolithic application, a microservices-based architecture could run 20 iterations of function X on four servers across two data centers. Those services must all work together and interoperate with other services to form the complete application. Request tracing must continue to collect metrics as requests flow through many different services and multiple instances of services.
Distributed request tracing can follow requests through each module or service. For example, distributed tracing lets an application architect visualize the performance of each iteration of function X. Developers and operations professionals performing support can quickly identify that one instance of function X is introducing significantly more latency than other instances, and remediate the problem.
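The kind of comparison described here can be sketched in a few lines. The example assumes span durations have already been collected and tagged with the instance that produced them. The instance names, the sample data and the threshold factor are all illustrative assumptions, not values from any particular tool.

```python
from collections import defaultdict
from statistics import median

# Hypothetical span records: (service instance, duration in milliseconds).
spans = [
    ("function-x-1", 12.0), ("function-x-1", 14.0),
    ("function-x-2", 11.0), ("function-x-2", 13.0),
    ("function-x-3", 95.0), ("function-x-3", 110.0),
    ("function-x-4", 12.5), ("function-x-4", 13.5),
]


def slow_instances(spans, factor=3.0):
    """Return instances whose median latency exceeds factor times the median across instances."""
    by_instance = defaultdict(list)
    for instance, duration_ms in spans:
        by_instance[instance].append(duration_ms)
    medians = {name: median(values) for name, values in by_instance.items()}
    fleet_median = median(medians.values())
    return sorted(name for name, m in medians.items() if m > factor * fleet_median)


print(slow_instances(spans))  # prints ['function-x-3']
```

Medians are used rather than means so that a single slow request does not flag an otherwise healthy instance.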
Traces and spans
Distributed tracing relies on traces and spans. A trace is the complete processing of a request. The trace represents the whole journey of a request as it moves through all of the services or components of a distributed system. All trace events generated by a request share a trace ID that tools use to organize, filter and search for specific traces.
Each trace is composed of a number of spans. A span, sometimes called a segment, is the activity or operation that takes place within an individual service or component of the distributed system. Each span is another step in the total processing of the overall request. Spans are typically named and timed operations. Each span carries a unique span ID, and can also carry metadata or other annotations.
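One way to picture the relationship between traces and spans is as a small data structure. The sketch below is a minimal model, not the format of any specific tracing product: the Span class, the start_trace helper, and the parent_id field used to link a span to the step that triggered it are assumptions made for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str                        # named operation
    trace_id: str                    # shared by every span in the same trace
    parent_id: Optional[str] = None  # assumed link to the span that triggered this one
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: dict = field(default_factory=dict)  # metadata or annotations

    def child(self, name):
        """Start a new span in the same trace, one step further along the request."""
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.end = time.time()


def start_trace(name):
    """Create the first span of a new trace with a fresh trace ID."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


root = start_trace("checkout")
payment = root.child("charge-card")
payment.tags["host"] = "payments-7"
payment.finish()
root.finish()
```

Because both spans carry the same trace_id, a tool can gather them into a single trace, while the distinct span_id values keep the individual steps apart.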
Software developers and operations staff can trace a user request through each span, correlate each span to a service instance, and even determine the physical location of the host system on which each span executes. The spans form a complete picture of the request trace. By assessing each span within a trace, it is possible to determine the source of a problem.
Trace data is typically not shared and processed in real time. It is generated and collected in local storage by agents or daemons that communicate with the software's instrumentation. That data is then moved to a central location and analyzed on demand. The process is similar to modern event logging and other metrics-gathering activities.
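The collect-locally-then-forward pattern can be sketched as follows. This is a simplified stand-in for an agent or daemon, assuming spans arrive as plain dictionaries and that a caller-supplied send function delivers a batch to the central location. The LocalSpanBuffer class, the file name and the central_store list are all hypothetical.

```python
import json
import os
import tempfile


class LocalSpanBuffer:
    """Append span records to a local file, then hand them off in one batch."""

    def __init__(self, path):
        self.path = path

    def record(self, span):
        # One JSON document per line keeps appends cheap and simple.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(span) + "\n")

    def flush(self, send):
        """Pass all buffered spans to send(batch) and clear the local file."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, encoding="utf-8") as f:
            batch = [json.loads(line) for line in f if line.strip()]
        if batch:
            send(batch)
        os.remove(self.path)
        return len(batch)


central_store = []  # stand-in for the central trace store
buffer = LocalSpanBuffer(os.path.join(tempfile.mkdtemp(), "spans.jsonl"))
buffer.record({"trace_id": "t1", "span_id": "s1", "name": "lookup", "duration_ms": 4.2})
buffer.record({"trace_id": "t1", "span_id": "s2", "name": "render", "duration_ms": 9.8})
sent = buffer.flush(central_store.extend)
```

A production agent would also need to handle retries, concurrent writers and failed deliveries, all of which this sketch leaves out.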
Because each distributed request trace is intended to reflect the complete journey of a request through the application, a trace is almost always an assessment of end-to-end performance: from the time a request arrives at the front end, through middleware and back-end access, to the delivery of results to the requester. The idea of a localized trace is typically not applied to distributed request tracing activities.
Distributed tracing benefits and limitations
Benefits of distributed tracing include accuracy in pinpointing issues and compatibility with modern software architectures. However, there are limitations, such as a difficult setup and the risk of vendor lock-in.
Distributed request tracing is most commonly compared with APM technologies. APM tools monitor and manage the performance and availability of software workloads, with information on performance problems and alerts relating to a minimum established level of service for the workload. APM tools provide metrics that reflect the end-user experience, such as average response time under peak load. They also frequently track the resources used by the workload to identify resource capacity and potential bottlenecks. These tools have no direct connection to the software code.
Distributed request tracing is also frequently compared with event logging, because both logging and tracing rely on some level of instrumentation to monitor and report on key software activities. Event logging differs from request tracing in several key areas. Logging captures high-level event information, such as a resource exceeding a prescribed threshold. It must not contain excessive or redundant information, and often follows a standard format prescribed for log aggregation and analytics. By comparison, request tracing can capture enormous volumes of low-level information in disparate formats.
Distributed tracing is well-suited to debugging and monitoring modern distributed software architectures, such as microservices, where it is necessary for time-sensitive requests to traverse multiple services, systems and even facilities.
Tracing provides accurate details that support decisive isolation of potential problems. However, distributed tracing requires some level of instrumentation added to the production codebase, and the tools used to search and visualize trace data can be complex to set up and use productively.
Distributed request tracing tools
Numerous tools collect, filter, and present distributed request tracing information. Popular examples include:
- Datadog incorporates distributed tracing in its APM upgrade package;
- Zipkin is an open source tool largely based on Google's Dapper research;
- the open source Envoy proxy tool was developed through work done at Lyft;
- AWS offers the X-Ray service to its customers; and
- the LightStep monitoring service also originated from Google's Dapper research.
The distributed tracing community has developed a vendor-neutral open standard, OpenTracing. The aim is to enable developers to easily add instrumentation to the codebase of an application with APIs that do not lock them into any one particular product. OpenTracing has since merged with the OpenCensus project to form OpenTelemetry, which continues that vendor-neutral work.