How to monitor application performance in the cloud
As you move real-time applications to the cloud, you might relinquish some network control. However, several mechanisms are available to monitor application performance.
In general, with real-time applications, a person waits for and uses results from the application. Real-time apps include obvious things, like voice, video and screen sharing, but they also include highly interactive business processes and monitoring systems. To some extent, real-time apps also include streaming video, although buffering can help smooth out the delivery of video streams.
How real-time apps react to common network problems depends on the application and the data transport mechanism it uses. Interactive voice and video are relatively tolerant of up to 1% of randomly distributed packet loss. These applications use User Datagram Protocol (UDP) packet streams, and the endpoint codecs use interpolation to estimate the values of data in individual lost packets.
On the other hand, Transmission Control Protocol (TCP) applications -- like most interactive business applications and streaming voice and video -- are very susceptible to packet loss. More than 0.0001% packet loss has a significant effect on the throughput of an application that uses TCP as its data transport mechanism.
Bursts of packet loss cause a streaming application to pause: playback stalls while the sending system retransmits the lost data and the receiver refills its buffer.
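To get a feel for why even tiny loss rates matter to TCP, one commonly cited approximation (from Mathis et al., not from this article) bounds steady-state throughput by roughly MSS / (RTT x sqrt(loss)). The sketch below is illustrative only: the segment size and round-trip time are assumed values, and modern congestion control algorithms may behave differently.

```python
import math

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on steady-state TCP throughput in bits per second.

    Uses the Mathis et al. approximation: (MSS / RTT) * (C / sqrt(p)).
    loss_rate is a fraction (0.01 means 1%), not a percentage.
    """
    if loss_rate <= 0 or rtt_s <= 0:
        raise ValueError("loss_rate and rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

if __name__ == "__main__":
    # Assumed values: 1,460-byte segments and a 50 ms round-trip time.
    for loss_pct in (0.0001, 0.01, 1.0):
        bps = tcp_throughput_bps(1460, 0.050, loss_pct / 100)
        print(f"{loss_pct}% loss -> about {bps / 1e6:.1f} Mbps")
```

Because the bound scales with the inverse square root of the loss rate, a hundredfold increase in loss cuts the achievable throughput by about a factor of 10.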
Packet loss is typically due to link errors or congestion. Link errors indicate something needs to be investigated and corrected. Congestion is due to link speed differences or aggregation points. Link speed differences occur when data goes from a higher-speed link to a lower-speed link, such as transitioning from a 10 Gbps data center fabric link to a 1 Gbps office or WAN link.
Monitor application performance and pinpoint packet problems
Another source of congestion is at aggregation points where many lower-speed links connect to a router or switch that has one or two higher-speed uplinks. A burst of traffic from multiple end systems could all arrive at the higher-speed link at nearly the same time, potentially overrunning the interface buffers and causing a burst of packet loss.
High jitter has its greatest effect on interactive voice and video. It is caused when real-time packets are queued behind multiple large packets. The real-time traffic must wait its turn, resulting in large variations in latency.
When jitter gets too high, voice and video packets simply arrive at the receiver too late to be passed to the codec for playback at the proper time. The voice endpoints incorporate some buffering to help reduce the effect of jitter, but it is limited in its ability to handle high jitter. As a result, high jitter looks like packet loss.
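The way late packets turn into apparent loss can be sketched in a few lines. This is a simplified model under stated assumptions: a fixed playout buffer, a deadline equal to the smallest observed transit delay plus the buffer depth, and made-up delay values. Real endpoints typically use adaptive buffers.

```python
def effective_loss(transit_delays_ms, buffer_ms):
    """Return the fraction of packets that miss their playout deadline.

    Assumes a fixed jitter buffer: a packet is usable only if its transit
    delay is no more than the minimum observed delay plus the buffer depth.
    """
    if not transit_delays_ms:
        return 0.0
    deadline = min(transit_delays_ms) + buffer_ms
    late = sum(1 for delay in transit_delays_ms if delay > deadline)
    return late / len(transit_delays_ms)

if __name__ == "__main__":
    # Hypothetical one-way delays in milliseconds; two packets were queued
    # behind large packets and arrived far later than the rest.
    delays = [20, 22, 21, 95, 23, 20, 110, 24, 21, 22]
    print(effective_loss(delays, buffer_ms=40))
    print(effective_loss(delays, buffer_ms=100))
```

No packet in the sample was actually dropped by the network, yet with the shallower buffer the codec treats the two late arrivals as missing.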
Because late packets are discarded, high jitter can look like packet loss to the application. Congestion-induced loss, by contrast, occurs when an application competes with other applications and data flows for the same link. You can track congestion by looking for interfaces with high drop counts. Use a top-N report of the 95th percentile of drops to identify interfaces with significant problems. An interface with a consistently high number of drops is likely oversubscribed and needs less traffic or more bandwidth.
A high level of errors is easy to track by looking at interface statistics and could indicate a physical layer problem.
This analysis of real-time traffic suggests your monitoring system should look for several sources of impairment: link errors, high jitter and congestion-induced packet loss caused by link speed differences or aggregation points.
You don't have access to the physical interfaces in the cloud, so it's not possible to monitor interface errors or interface drops. Instead, you must look for application impairment using other mechanisms.
Passive monitoring of application performance
Cloud infrastructure providers may provide packet capture (pcap) mechanisms. An alternative is to determine if your virtual appliances, such as firewalls and switches, provide pcap technology. Look for the ability to export pcap files, allowing you to examine packet traces using a variety of tools.
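Once you have an exported capture, even a small script can surface timing problems. The sketch below reads packet timestamps from a classic libpcap file and reports the largest gap between consecutive packets. It assumes the original microsecond-resolution pcap format (not pcapng or the nanosecond variant), and the function names are illustrative rather than part of any tool.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap, microsecond timestamps

def read_pcap_timestamps(data):
    """Return packet timestamps (seconds, as floats) from classic pcap bytes."""
    if len(data) < 24:
        raise ValueError("too short to contain a pcap global header")
    # The magic number is written in the capturing host's byte order.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC_USEC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC_USEC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    stamps = []
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        stamps.append(ts_sec + ts_usec / 1e6)
        offset += 16 + incl_len  # skip the captured packet bytes
    return stamps

def largest_gap(stamps):
    """Return the largest interval between consecutive timestamps."""
    return max((b - a for a, b in zip(stamps, stamps[1:])), default=0.0)
```

To use it on a file, read the bytes first, for example with `open(path, "rb").read()`, and pass them to `read_pcap_timestamps`. An unusually large gap in an otherwise steady stream is a hint to look more closely with a full packet analysis tool.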
Some applications provide good internal diagnostics that can help identify the network problems that are affecting their operation. For example, voice and video endpoints can use the Real-Time Transport Control Protocol (RTCP) to report packet loss, jitter and round-trip times during the call. This information can be used to discern whether a problem is with a specific endpoint, a group of endpoints, a region or systemwide. A bit of sleuthing may be required to identify the cloud network infrastructure that is causing a problem.
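For context on what the reported jitter figure means, the RTP specification (RFC 3550) defines interarrival jitter as a smoothed running estimate of how much packet spacing at the receiver differs from the spacing at the sender. A minimal sketch of that calculation follows; the function name and the sample times are illustrative, and the inputs can be in any consistent time unit.

```python
def interarrival_jitter(send_times, recv_times):
    """Running interarrival jitter estimate in the style of RFC 3550.

    For each consecutive pair of packets, D is the change in spacing between
    sender and receiver; the estimate moves 1/16 of the way toward |D|.
    """
    if len(send_times) != len(recv_times):
        raise ValueError("send_times and recv_times must have equal length")
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter

if __name__ == "__main__":
    # Packets sent every 20 ms; the third one arrives 16 ms later than expected.
    print(interarrival_jitter([0, 20, 40], [50, 70, 106]))
```

Because the estimate is smoothed, a single late packet moves it only slightly, while sustained variation drives it up steadily.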
Alternatively, you can monitor application performance if parts of the application traffic flows are in a spot where you can place a physical or virtual appliance. These systems become more powerful as the breadth of the monitored infrastructure increases. Ideally, all tiers of a multi-tier application will be monitored, allowing the tool to identify both network and non-network problems that affect application performance. Some of these tools can import pcap files for their analysis.
Testing your network infrastructure
Active path testing tools have advantages that make them attractive. There are several types of active path testing:
- Synthetic transactions. Create real transactions. For example, place a call between specific endpoints to ensure the call controller is functioning correctly, as well as validating the data path between endpoints.
- Simulate application traffic. Test probes exchange packets that match the application, but carry a diagnostic payload, like a packet counter and timestamp, to measure path characteristics. This requires probes to be distributed throughout the infrastructure, but it doesn't add load to the application systems.
- Standard network diagnostics. No special capabilities are needed. Traceroute may provide path information that is not visible via other tools. Test probes often provide this capability, in addition to the other two types of tests.
Active path testing allows the network infrastructure to be tested when the critical applications are not running or when they are not being used. It is useful for early problem identification and for collecting information about intermittent problems. The combination of standard network diagnostics with synthetic transactions or simulated traffic can provide visibility into the infrastructure that is not available with other tools.
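The diagnostic payload used in simulated traffic can be very small. The sketch below shows one hypothetical layout, a sequence counter plus a send timestamp, and how a receiving probe might turn the packets it sees into loss and delay figures. The payload format and function names are assumptions for illustration, and the delay figures are only meaningful if the two probes have synchronized clocks.

```python
import struct

# Hypothetical payload: 32-bit sequence number, 64-bit float send time,
# both in network byte order (12 bytes total).
PROBE_FORMAT = "!Id"
PROBE_SIZE = struct.calcsize(PROBE_FORMAT)

def build_probe(seq, send_time):
    """Pack a probe payload for the sending side."""
    return struct.pack(PROBE_FORMAT, seq, send_time)

def parse_probe(payload):
    """Unpack (seq, send_time) from a received payload."""
    return struct.unpack(PROBE_FORMAT, payload[:PROBE_SIZE])

def summarize(sent_count, received):
    """Summarize a test run.

    received is a list of (payload, recv_time) tuples. Duplicates are ignored.
    """
    seen = set()
    delays = []
    for payload, recv_time in received:
        seq, send_time = parse_probe(payload)
        if seq in seen:
            continue
        seen.add(seq)
        delays.append(recv_time - send_time)
    loss = 1 - len(seen) / sent_count if sent_count else 0.0
    return {
        "loss": loss,
        "min_delay": min(delays) if delays else None,
        "max_delay": max(delays) if delays else None,
    }
```

In practice, the payload would ride inside packets that use the same protocol, ports and markings as the application, so the probes experience the same treatment along the path.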
Sorting through the data
Network management tools can provide an overwhelming amount of data. Averaging data over hours or a day can hide problems, because long periods of low values mask short bursts of high values. Instead, use ranking functions like the 95th percentile to identify items that have bursts of problems. This approach is especially useful for finding interfaces and links that have bursts of high packet loss. A concise report of the top 10 instances in each problem category allows you to focus on the most problematic instances.
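A short sketch shows how the two views can disagree. It uses a nearest-rank percentile and made-up drop counts for two hypothetical interfaces. Both have the same average, but only one has bursts.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def top_n_by_p95(samples_by_interface, n=10):
    """Rank interfaces by the 95th percentile of their samples, highest first."""
    ranked = sorted(
        ((percentile(samples, 95), name)
         for name, samples in samples_by_interface.items() if samples),
        reverse=True,
    )
    return [(name, p95) for p95, name in ranked[:n]]

if __name__ == "__main__":
    # Hypothetical drops per polling interval over 20 intervals.
    drops = {
        "steady": [5] * 20,             # constant low-level drops
        "bursty": [0] * 18 + [50, 50],  # quiet, then two bursts
    }
    for name, samples in drops.items():
        print(name, "average:", sum(samples) / len(samples))
    print(top_n_by_p95(drops, n=10))
```

The averages are identical, so a report sorted by average would not distinguish the two interfaces, while the 95th percentile ranking puts the bursty one at the top.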
You don't need all of the above tools to get started. Use what is available. Think about the complete application infrastructure, what tools you have and what you can get from those tools to monitor application performance. A little resourcefulness will go a long way.