Troubleshooting application performance: Issues and answers
Workflow congestion, sluggish switches, misconfigured parameter settings -- application flaws don't come from a single source, but they're easier to solve when you know where to look first.
No application, not even the most feature-rich and user-intuitive, can survive major performance problems. Maintain worker, customer and supplier quality of experience within the expected range, or face productivity drops, abandoned transactions and other bad news.
Poor application quality of experience comes from myriad sources, so troubleshooting application performance issues is the first step to fixing them.
Performance problems disrupt live operations precisely because they weren't caught in testing. They're often subtle, the product of a diverse set of conditions, and hard to remedy effectively. Many represent programming errors that surface only under extreme or unusual conditions, which means administrators must monitor system and network logs and track users' complaints just to realize a problem exists.
The following application performance issues are typical in modern IT operations, but easy to overlook.
Subpar application performance
Users typically identify poor response time as an application performance problem. Slow response is most often caused by resource congestion, either within the application's components or in the network connections. The explanation is simple, but it's tricky to find out what's getting congested and why. Before you try to fix anything, examine system and network logs for reported resource problems during the periods when response time is poor.
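As a starting point, a simple script can pull resource-related errors out of a system log for the window when users reported slow response. The sketch below is illustrative only: it assumes a classic syslog-format file and a handful of example keywords, so adjust the path, time window and patterns to your own environment.

```python
import re
from datetime import datetime

# Hypothetical log path and time window; substitute your own syslog source
LOG_PATH = "/var/log/syslog"
WINDOW_START = datetime(2024, 5, 1, 9, 0)
WINDOW_END = datetime(2024, 5, 1, 11, 0)

# Keywords that commonly flag resource exhaustion in system logs
RESOURCE_PATTERNS = re.compile(
    r"out of memory|oom-killer|too many open files|connection refused|"
    r"queue full|no buffer space|timed out",
    re.IGNORECASE,
)

def parse_syslog_timestamp(line, year=2024):
    """Parse the leading 'May  1 09:15:02' timestamp of a classic syslog line."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        stamp = parse_syslog_timestamp(line)
        if stamp and WINDOW_START <= stamp <= WINDOW_END and RESOURCE_PATTERNS.search(line):
            print(line.rstrip())
```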
Pros find simple network or hosting problems easily enough, but hidden workflow congestion proves difficult to uncover. IT organizations commonly deploy scalable application components, particularly in the front-end, cloud-hosted portion of an application, to handle variable workloads. All that front-end elasticity can funnel work into a single thread of an inflexible legacy application or a single database update process. Compare transaction volumes or database access rates to front-end user access rates; if the former hit a plateau while the latter keep climbing, workflow congestion is the likely culprit.
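A quick way to spot that plateau is to line up per-interval counters from both tiers. The figures in this sketch are placeholders for numbers you would export from your own monitoring system.

```python
# A minimal sketch: flag intervals where the back end stops keeping pace with
# the front end. All counts below are illustrative, not measurements.
front_end_rpm = [120, 240, 480, 900, 1750, 3400]   # front-end requests per minute as load ramps
back_end_tpm  = [118, 236, 470, 610, 615, 612]     # back-end transactions per minute, same intervals

for fe, be in zip(front_end_rpm, back_end_tpm):
    ratio = be / fe
    flag = "  <-- back end plateauing" if ratio < 0.8 else ""
    print(f"front end {fe:>5}/min  back end {be:>5}/min  ratio {ratio:.2f}{flag}")
```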
When troubleshooting application performance issues, check whether configuration limits were set too low. Many applications, and even middleware tools, have variable parameters that cap buffers, threads and other resources. Developers tend to leave these settings at default levels, which means the application can run out of a resource simply because the configuration never allowed enough of it. This kind of error almost always leaves a message in a system log, so engage in thorough log analytics.
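One way to put that advice into practice is to compare configured ceilings against observed peaks. The sketch below assumes a Java-style .properties file and peak figures pulled from a monitoring tool; every parameter name and number shown is hypothetical.

```python
# A minimal sketch, assuming a simple key=value properties file and peak usage
# figures from monitoring; all parameter names here are hypothetical examples.
CONFIGURED_LIMITS_FILE = "app.properties"     # e.g. pool.max.connections=50
OBSERVED_PEAKS = {                            # peak values from your monitoring tool
    "pool.max.connections": 49,
    "worker.thread.count": 16,
    "request.buffer.kb": 3900,
}

limits = {}
with open(CONFIGURED_LIMITS_FILE) as cfg:
    for line in cfg:
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            try:
                limits[key.strip()] = float(value)
            except ValueError:
                pass  # skip non-numeric parameters

# Flag any parameter whose peak usage is within 10% of its configured ceiling
for key, peak in OBSERVED_PEAKS.items():
    limit = limits.get(key)
    if limit and peak >= 0.9 * limit:
        print(f"{key}: peak {peak} is within 10% of configured limit {limit:g}")
```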
Sometimes poor performance is a problem unintentionally baked into application design, most commonly through over-componentization. Development teams think more about component reuse and composability than about operational impact in production. When an application is divided into separately hosted components, the workflow between those components incurs network delays, which accumulate. Any delays in the processing the components perform also add up, and the result can be poor performance. Design issues are often hard to detect without an application performance management tool, because the problem isn't one massive flaw but a series of small, accumulating delays. Turn to an APM tool for help, or do some more log analytics: Dig through the individual network and hosting logs from the systems involved, and sum up the processing and transit times.
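If an APM tool isn't available, even a rough tally of per-hop times makes accumulated design delay visible. The component names and millisecond figures below are purely illustrative stand-ins for values you would pull from your own logs.

```python
# A minimal sketch of summing per-hop delays for one transaction, assuming each
# component logs its own processing time and the network transit to the next hop.
hops = [
    {"component": "api-gateway",   "processing_ms": 4,   "transit_ms": 2},
    {"component": "auth-service",  "processing_ms": 11,  "transit_ms": 3},
    {"component": "order-service", "processing_ms": 27,  "transit_ms": 5},
    {"component": "legacy-erp",    "processing_ms": 160, "transit_ms": 9},
    {"component": "database",      "processing_ms": 45,  "transit_ms": 0},
]

total = 0
for hop in hops:
    hop_total = hop["processing_ms"] + hop["transit_ms"]
    total += hop_total
    print(f'{hop["component"]:<14} {hop_total:>4} ms  (running total {total} ms)')

print(f"End-to-end delay across {len(hops)} hops: {total} ms")
```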
Application performance issue remedies
Never stop at troubleshooting application performance issues in production -- that step only solves the here and now. Examine the testing flow and application lifecycle management tools, particularly as they relate to volume and performance testing. Consider pushing the envelope more in the testing phase to simulate production-like conditions.
Application scaling is no simple feat. Some application teams apply scaling to just one or a few front-end components, which merely moves the congestion back a step. Others scale too many components and add delay. Scale the components whose workload varies directly with the number of users, and stop scaling at the point where user tasks complete and the workflow ends. If workflows connect all the way back to the database, take steps to improve database performance through changes in parameterization, server upgrades and solid-state drives instead of spinning disks.
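A back-of-the-envelope utilization check helps decide whether further front-end scaling is worthwhile or will only queue work at the database. The service time and arrival rate below are assumed figures for illustration, not measurements.

```python
# A rough capacity check, assuming a single serialized database update path;
# both numbers below are illustrative placeholders.
db_update_ms = 12               # average time the database spends per update
front_end_updates_per_s = 70    # update rate generated by the scaled-out front end

utilization = front_end_updates_per_s * db_update_ms / 1000
print(f"Estimated database utilization: {utilization:.0%}")
if utilization >= 0.8:
    print("The database tier will queue under this load; scaling the front end "
          "further only moves congestion back a step.")
```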
Seek out data center network congestion next. An axiom of data center switching is that horizontal, east-west traffic between components moves less efficiently through traditional switch hierarchies than vertical, north-south flows to and from the user. Work with vendors to ensure that the network has an optimal switch topology in place, and use the fastest possible Ethernet connections between switches. If the application workflows move among several data centers, investigate a project to increase the capacity of data center interconnect trunks.
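Before engaging vendors, it can help to establish a rough baseline of east-west latency between component hosts. The sketch below samples TCP connect times from the machine it runs on; the hostnames and ports are placeholders for your own component endpoints.

```python
# A minimal sketch that samples TCP connect times to peer hosts to get a rough
# feel for server-to-server latency; targets below are hypothetical examples.
import socket
import statistics
import time

TARGETS = [("order-service.internal", 8080), ("legacy-erp.internal", 8443)]
SAMPLES = 5

for host, port in TARGETS:
    times_ms = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=2):
                times_ms.append((time.perf_counter() - start) * 1000)
        except OSError:
            pass  # unreachable samples are simply skipped
        time.sleep(0.2)
    if times_ms:
        print(f"{host}:{port}  median connect {statistics.median(times_ms):.1f} ms "
              f"over {len(times_ms)} samples")
    else:
        print(f"{host}:{port}  unreachable")
```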
Review the component hosting plan as part of efforts to mitigate and prevent application performance issues. Co-host tightly linked application components in the same data center, as a routine matter. If you're worried about diversity and availability, the application's failover plan can include redeploying components to another location, but its normal configuration should optimize the intercomponent traffic in a complex workflow. Also check that the amount of resource sharing on servers is within an acceptable range, by counting the number of containers or VMs per server. Overloaded servers will usually run out of memory or show a sharp dip in performance when load increases.
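In a Kubernetes environment, a short script can do that per-server counting. This sketch assumes the official kubernetes Python client and a working kubeconfig, and the pods-per-node threshold is an arbitrary example rather than a recommendation.

```python
# A minimal sketch, assuming the 'kubernetes' Python client and kubeconfig access;
# it counts running pods per node and flags nodes above a chosen threshold.
from collections import Counter
from kubernetes import client, config

MAX_PODS_PER_NODE = 60   # illustrative threshold; set to your own comfort level

config.load_kube_config()
pods = client.CoreV1Api().list_pod_for_all_namespaces(watch=False)

per_node = Counter(
    pod.spec.node_name
    for pod in pods.items
    if pod.status.phase == "Running" and pod.spec.node_name
)

for node, count in per_node.most_common():
    marker = "  <-- above threshold" if count > MAX_PODS_PER_NODE else ""
    print(f"{node}: {count} running pods{marker}")
```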
Middleware configuration can be a big source of performance problems, in particular for workloads hosted on a cloud infrastructure and for container setups. Check configuration parameters to verify nothing is pushing the limits. With platforms such as Kubernetes for container orchestration and OpenStack for cloud hosting, the administrator can specify a networking tool; research whether the available tools can scale to the number of components you expect, or whether another choice might offer better performance. OpenStack's Neutron networking is known to have limitations with some plug-in configurations, for example.
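For containerized middleware specifically, one telltale sign of limits set too low is containers being killed for exceeding their memory allocation. The sketch below, again assuming the kubernetes Python client and kubeconfig access, surfaces recent out-of-memory kills.

```python
# A minimal sketch that lists containers whose last termination reason was
# OOMKilled -- a common sign that configured memory limits are too low.
from kubernetes import client, config

config.load_kube_config()
pods = client.CoreV1Api().list_pod_for_all_namespaces(watch=False)

for pod in pods.items:
    for status in (pod.status.container_statuses or []):
        term = status.last_state.terminated if status.last_state else None
        if term and term.reason == "OOMKilled":
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container {status.name} was OOMKilled "
                  f"(restarts: {status.restart_count})")
```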
As production deployment gets more complicated, troubleshooting application performance issues necessarily covers a greater pool of things that might have gone wrong. The overall strategy has to include limiting deployment complexity wherever the benefits don't justify the level of componentization and virtualization in the design.