Optimize serverless apps with an observability strategy
Serverless apps come with unexpected challenges and can be nearly impossible to troubleshoot and optimize. Move away from traditional monitoring practices to succeed.
Observability is rooted in complete visibility into application workflows, including those of serverless apps. A strong observability strategy can uncover problems, but it must be targeted properly and implemented carefully.
Serverless deployments, unlike both data center hosting and traditional cloud services like IaaS, don't commit resources to applications in anticipation of need. Instead, small components of the application -- often called functions or microservices -- are deployed on demand and released based on usage parameters. Resources are not explicit, so they don't have to be managed, and enterprises don't pay for unused resources. But resources aren't persistent, so traditional monitoring strategies are likely to fail because they rely on resource state to detect issues.
The best strategy depends on how enterprise cloud teams weigh the following three factors for serverless deployments:
- Complexity of the workflows that link serverless components. The greater the complexity, the more the observability strategy turns toward stack tracing and probes.
- Breadth of the resource base on which components are deployed. The base might be one cloud, a hybrid cloud or multiple clouds; the broader the base, the more difficult it is to rely on cloud provider tools.
- Primary mission of observability. Is the primary mission resource and cost management, user quality of experience (QoE) or troubleshooting and debugging? As the mission shifts away from resources and costs, observability necessarily focuses more on workflows and logic.
Serverless observability strategies rely on a variety of methods to ensure efficient and healthy applications. Below, we explore factors such as logging, tracing, latency and cost considerations.
Logging
A cloud provider can give users the history of deployed serverless components in the form of logs and stack traces. This information tracks component activations and enables teams to analyze the workflow, assuming they know the transaction or activity that initiated the application.
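As a rough illustration of what makes such logs usable for workflow analysis, the sketch below has each activation write one structured log line that carries a transaction ID. The field names, the `handler` entry point and the `price-order` component are hypothetical, chosen for this example rather than taken from any provider's log format.

```python
import json
import time
import uuid


def log_activation(component, transaction_id, started, ended, status="ok"):
    """Return one JSON log line describing a single component activation."""
    record = {
        "transaction_id": transaction_id,  # ties the activation to its initiating transaction
        "component": component,
        "start_ms": round(started * 1000),
        "duration_ms": round((ended - started) * 1000),
        "status": status,
    }
    return json.dumps(record, sort_keys=True)


def handler(event):
    """Hypothetical function entry point; `event` is a plain dict in this sketch."""
    # Reuse the caller's transaction ID if one was passed in, else start a new one.
    transaction_id = event.get("transaction_id") or str(uuid.uuid4())
    started = time.time()
    result = {"transaction_id": transaction_id, "total": sum(event.get("items", []))}
    # Printing is assumed to be enough here; many platforms capture stdout as logs.
    print(log_activation("price-order", transaction_id, started, time.time()))
    return result
```

Passing the transaction ID along in the result lets the next component log under the same ID, which is what later makes the activations traceable as one workflow.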
Single cloud serverless observability
Simple serverless workflows -- those that have few logic branches and limited use of serverless components in multiple workflows -- that use a single cloud provider can likely rely on the following basic cloud provider tools:
- Amazon CloudWatch.
- Azure Monitor.
- Google Cloud's operations suite.
These tools work because each application user's activity maps to a static set of serverless components. All that's required to monitor resource usage and QoE, and even to debug applications, is the sequence and timing of component activations. The tools rely on logs and stack traces to present data on activations, and they're dependent on the features of each cloud service.
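To make the idea of sequence and timing concrete, here is a minimal sketch that rebuilds each transaction's path from activation records. It assumes the records have already been exported from the provider's logs as dictionaries with `transaction_id`, `component`, `start_ms` and `duration_ms` fields; those names and the sample data are invented for illustration.

```python
from collections import defaultdict


def activation_sequences(records):
    """Group activation records by transaction and order them by start time."""
    by_tx = defaultdict(list)
    for rec in records:
        by_tx[rec["transaction_id"]].append(rec)

    sequences = {}
    for tx, recs in by_tx.items():
        recs.sort(key=lambda r: r["start_ms"])
        first_start = recs[0]["start_ms"]
        last_end = max(r["start_ms"] + r["duration_ms"] for r in recs)
        sequences[tx] = {
            "path": [r["component"] for r in recs],           # order of activations
            "elapsed_ms": last_end - first_start,             # wall-clock time for the workflow
            "busy_ms": sum(r["duration_ms"] for r in recs),   # time spent inside components
        }
    return sequences


# Hypothetical records, deliberately out of order as logs often are.
records = [
    {"transaction_id": "tx-1", "component": "charge-card", "start_ms": 140, "duration_ms": 60},
    {"transaction_id": "tx-1", "component": "validate-order", "start_ms": 0, "duration_ms": 40},
    {"transaction_id": "tx-2", "component": "validate-order", "start_ms": 10, "duration_ms": 35},
    {"transaction_id": "tx-1", "component": "price-order", "start_ms": 55, "duration_ms": 70},
]
summary = activation_sequences(records)
```

The gap between `elapsed_ms` and `busy_ms` is a rough indicator of time lost between activations, such as component load time or queuing, which is the kind of delay that affects QoE.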
Multi-cloud serverless observability
The situation gets more complicated for complex workflows that include logic branches through components or use some components in multiple applications, and for simple workflows that span multiple clouds.
Here, proprietary tools are valuable both because they can provide in-component telemetry to trace logic flows and because they often support multiple clouds. For multi-cloud observability of simple flows, products like Lumigo or Honeycomb can incorporate information from monitoring and events, log analysis, tracing and stack tracing.
Tracing
The application itself might provide trace or probe capability, which generates events at key points in processing. These events can then track work through the application.
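A probe of this kind can be very small. The sketch below, using only the Python standard library, wraps key steps in a context manager that emits a start event and an end event. The `EVENTS` list stands in for whatever event sink a team uses, and `process_order` and the probe names are hypothetical.

```python
import time
from contextlib import contextmanager

EVENTS = []  # stand-in for an event sink such as a log stream or collector


@contextmanager
def probe(name, transaction_id, clock=time.perf_counter):
    """Emit a start and an end event around one key step of processing."""
    started = clock()
    EVENTS.append({"tx": transaction_id, "probe": name, "phase": "start"})
    try:
        yield
    finally:
        # The end event is emitted even if the wrapped step raises.
        elapsed_ms = (clock() - started) * 1000
        EVENTS.append({"tx": transaction_id, "probe": name, "phase": "end",
                       "elapsed_ms": round(elapsed_ms, 3)})


def process_order(order, transaction_id):
    """Hypothetical component logic with probes at two key points."""
    with probe("validate", transaction_id):
        if not order["items"]:
            raise ValueError("empty order")
    with probe("price", transaction_id):
        return sum(order["items"])


total = process_order({"items": [5, 7]}, "tx-1")
```

Because every event carries the transaction ID, the events from different components can later be stitched together to follow one piece of work through the application.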
Stack tracing is the minimum requirement to support complex workflows. It enables operations teams to correlate serverless component activations as a specific sequence. When those sequences are linked to the transactions involved, teams can verify resource usage, track latency, allocate and optimize costs, and debug.
Datadog is a commonly used tool here that can track workflows across single-cloud, multi-cloud and hybrid cloud deployments. Other tools such as Elastic Observability and AWS X-Ray improve observability by adding stack tracing, particularly when combined with cloud provider monitoring tools.
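One common way tracing tools correlate activations into a specific sequence is to hand a trace ID and a parent ID from each component to the next. The sketch below shows that idea in plain Python. It is not the API of any of the tools named above, and the span fields and component names are assumptions made for the example.

```python
import itertools

_ids = itertools.count(1)  # deterministic IDs for the sketch; real tracers use random IDs


def start_span(name, parent=None):
    """Create a span record; child spans inherit the parent's trace ID."""
    span_id = next(_ids)
    return {
        "trace_id": parent["trace_id"] if parent else span_id,
        "span_id": span_id,
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
    }


def render(spans, parent_id=None, depth=0):
    """Return the spans as an indented call tree, one line per span."""
    lines = []
    for span in sorted(spans, key=lambda s: s["span_id"]):
        if span["parent_id"] == parent_id:
            lines.append("  " * depth + span["name"])
            lines.extend(render(spans, span["span_id"], depth + 1))
    return lines


# Hypothetical workflow: a gateway triggers two functions, one of which calls a third.
root = start_span("api-gateway")
validate = start_span("validate-order", parent=root)
price = start_span("price-order", parent=root)
tax = start_span("lookup-tax", parent=price)

# Spans usually arrive at the collector out of order.
tree = render([tax, root, price, validate])
```

In a serverless deployment, the parent span's IDs would typically travel in the event payload or request headers, since the components share no memory.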
Logic tracing
Complex workflows sometimes require logic tracing, so teams need to incorporate probes and application performance management tools in the observability strategy. This is a necessary step because it's difficult to compare the path a transaction took with the path it should have taken without knowing the logic used to decide the workflow path. It's also difficult to track utilization and latency for each transaction. So, teams might add probes to generate events.
The OpenTelemetry and OpenTracing groups define standards for in-code probes, although OpenTracing has since been archived in favor of OpenTelemetry. Companies like Dynatrace support the collection and analysis of data from these in-code instrumentation probes. Meanwhile, open source projects, such as Jaeger and Appdash, support instrumentation. Some cloud services might also offer these tools. Oracle Cloud Infrastructure Application Performance Monitoring supports OpenTracing probes.
In-code probes or instrumentation are likely critical in debugging serverless code. It's difficult to use interactive debugging with serverless components because the introduction of real-time debugging changes the timing of the components. Probes introduced at key points in the logic can help narrow down problems with the logic. They can also refine a team's knowledge of the paths a workflow takes through an application and the reasons a path was selected.
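A logic probe differs from a timing probe in that it records which branch was taken and why. The following sketch assumes a hypothetical order-routing function and an in-memory `DECISIONS` list as the event sink; the rule thresholds and names are invented for illustration.

```python
DECISIONS = []  # stand-in for wherever probe events are shipped


def decide(transaction_id, point, choice, reason):
    """Record which branch was taken at a decision point, and why."""
    DECISIONS.append({"tx": transaction_id, "point": point,
                      "choice": choice, "reason": reason})
    return choice


def route_order(order, transaction_id):
    """Hypothetical routing logic with a probe on each branch."""
    if order["total"] >= 1000:
        return decide(transaction_id, "review", "manual-review", "total >= 1000")
    if order["country"] != "US":
        return decide(transaction_id, "review", "customs-check", "non-US destination")
    return decide(transaction_id, "review", "auto-approve", "no rule matched")


def path_mismatches(expected):
    """Return recorded decisions that differ from the path a transaction should have taken."""
    return [d for d in DECISIONS
            if expected.get(d["tx"]) not in (None, d["choice"])]


route_order({"total": 1200, "country": "US"}, "tx-1")
route_order({"total": 80, "country": "DE"}, "tx-2")

# The team expected tx-2 to be auto-approved; the probe shows which rule diverted it.
mismatches = path_mismatches({"tx-1": "manual-review", "tx-2": "auto-approve"})
```

Recording the reason alongside the choice is what lets a team debug the logic without attaching an interactive debugger and disturbing component timing.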
Latency and cost considerations
It might seem operationally valuable to avoid managing servers, VMs or containers. But on-demand execution of microservices creates additional costs and latency when each service is loaded. As the number of triggering events increases, serverless costs can rise alarmingly. Costs can reach the point where it would be cheaper to either deploy a serverless framework in the cloud via Kubernetes or Knative, or simply use containers instead.
It's difficult to know where the value boundary with serverless lies. But it's smart to run a pilot test for an alternative approach if resource usage and costs are unexpectedly high, or if QoE is affected by accumulated delays in the workflow. Many observability tools serve in container or Knative environments too, so enterprises can sustain observability across all reasonable hosting options.
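A back-of-the-envelope calculation can help decide when a pilot test is worth running. The sketch below assumes a pay-per-use model with a per-request fee plus compute billed in GB-seconds, compared against a flat monthly cost for a container alternative. All prices are illustrative placeholders, not any provider's actual rates, and the flat-cost model ignores scaling and operations effort.

```python
def serverless_monthly_cost(invocations, avg_duration_s, memory_gb,
                            price_per_request, price_per_gb_second):
    """Pay-per-use model: a per-request fee plus compute billed in GB-seconds."""
    compute = invocations * avg_duration_s * memory_gb * price_per_gb_second
    return invocations * price_per_request + compute


def break_even_invocations(fixed_monthly_cost, avg_duration_s, memory_gb,
                           price_per_request, price_per_gb_second):
    """Invocations per month at which serverless cost equals a fixed-cost alternative."""
    per_invocation = price_per_request + avg_duration_s * memory_gb * price_per_gb_second
    return fixed_monthly_cost / per_invocation


# Illustrative numbers only; substitute measured durations and current provider rates.
rates = dict(avg_duration_s=0.5, memory_gb=0.5,
             price_per_request=0.0000002, price_per_gb_second=0.00002)

# Hypothetical container alternative costing a flat 104.0 per month.
threshold = break_even_invocations(fixed_monthly_cost=104.0, **rates)
```

With these placeholder inputs the threshold comes out at roughly 20 million invocations per month. The inputs that matter most, average duration and invocation count per transaction, are exactly what the tracing data described earlier can supply.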
Tom Nolle is founder and principal analyst at Andover Intel, a consulting and analysis firm that looks at evolving technologies and applications first from the perspective of the buyer and the buyer's needs. By background, Nolle is a programmer, software architect, and manager of software and network products, and he has provided consulting services and technology analysis for decades.