4 integrated monitoring strategies that will improve IT operations
Digital transformation expert Isaac Sacolick offers four ways to build an integrated monitoring architecture that will actually improve IT operations -- not slow them down.
Does your organization need an integrated monitoring strategy? Consider first how your IT organization tracks the reliability and performance of the network, systems, applications and other infrastructure elements in your data center and public cloud environments.
Chances are it does so with many different -- and unintegrated -- monitoring tools.
The arsenal likely includes several monitoring tools to alert on uptime, performance and incidents. Each tool provides a view into the underlying issues and delivers events, metrics, logs and diagnostics within a defined operational scope. Every public cloud provider also has its own native tools. There might also be tools to monitor application health, diagnose database bottlenecks, display the status of data integrations or provide insights into the throughput of APIs.
In addition, the collection is almost certainly growing. A survey from a few years ago on application performance monitoring tools found that 65% of companies owned more than 10 different commercial monitoring products -- with 50% or fewer of them in active use. With the migration to the cloud and accelerated application development spurred by DevOps practices, the number of tools is increasing. Indeed, other vendors report that the typical customer uses between 30 and 50 unique monitoring tools, with multiple instances of each deployed across a sprawling infrastructure.
The case for an integrated monitoring strategy
For many IT organizations, these siloed, unused and unintegrated monitoring tools purchased over many years not only use up limited resources, but also cause performance issues:
- Recovering from complex incidents is hard and takes too long because of the number of people and tools needed to diagnose the root cause. IT organizations either can't measure the mean time to recovery (MTTR), or the number they measure isn't very good. It just takes too long and requires too much expertise to diagnose issues.
- Key performance indicators need to be met on the health of business applications, systems and networks, but it's hard to arrive at a holistic value that accurately represents end-user experience and the health of underlying services and systems.
- The performance of applications that are growing in usage and data volumes must be tracked to help forecast when infrastructure needs to be scaled out or when applications need upgrades to address scalability limitations.
- Routing alerts to the right people has become more complex now that there are multiple IT workflow tools and lots of systems putting out alerts. Nobody wants the alerts triggering hundreds or thousands of tickets from multiple monitoring tools for a single incident.
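To make the MTTR measurement concrete, here is a minimal sketch that averages the time between detection and resolution across incidents. The incident records and the `mean_time_to_recovery` helper are hypothetical; a real calculation would pull these timestamps from whatever ticketing or monitoring tool the organization uses.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),     # 90 minutes
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 45)),  # 45 minutes
    (datetime(2024, 1, 20, 22, 15), datetime(2024, 1, 21, 0, 0)),   # 105 minutes
]


def mean_time_to_recovery(records):
    """Return the average of (resolved - detected) over all incidents."""
    if not records:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:20:00
```

The calculation itself is simple. The hard part described above is getting consistent detection and resolution timestamps when they are scattered across many tools.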
Four integrated monitoring strategies
Organizations that are hampered by too many monitoring tools and need to develop an integrated monitoring strategy basically have four main options:
1. One approach is to reduce the number of monitoring tools. One organization I know that is standardized on a single public cloud uses the cloud's native monitoring tools to cover the infrastructure and relies heavily on Splunk to report issues from database and application log files.
- This approach works well for organizations with standardized architectures, strong application development standards and less-demanding service-level requirements. In other words, not for very many organizations. It is less viable for larger organizations with more heterogeneous environments, legacy platforms and complex application architectures.
2. A second approach is to develop monitoring approaches directly tied and integrated into the application architecture. A monitoring dashboard for IoT data using AWS serverless and managed services is an example of this approach.
- This approach works well for newly developed architectures where monitoring and service-level requirements can be factored in from the ground up. It's unlikely to be viable for legacy architectures or for enterprises with multiple computing architectures.
3. A third approach involves larger organizations with more complex environments trying to develop an integrated monitoring system of their own. This can be done by aggregating logs and data from all the monitoring tools into one central data warehouse. Once the data is centralized, a common set of reporting dashboards, predictive analytics to forecast capacity and more intelligent alerts based on inputs from multiple monitoring tools can be developed. If your organization already has expertise in cloud databases such as Amazon Relational Database Service (RDS), a data integration tool like Talend, modeling expertise with Databricks, and data visualization tools like Tableau, then this could be an attractive option.
- As elegant as this approach is, it's also time-intensive and expensive to develop and support.
4. One final option is to consider an autonomous operations platform. Platforms such as BigPanda deliver out-of-the-box integrations with monitoring tools, data warehousing and alert aggregation. Coupled with AI and machine learning, these platforms create a virtual, unified monitoring architecture that enables intelligent incident management.
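Whether the aggregation is built in-house (the third option) or delivered by a platform (the fourth), the first step is usually mapping each tool's output onto one common record format. The sketch below illustrates that idea in Python. The two input formats, the field names and the `Event` schema are all hypothetical and do not correspond to any real product's payload.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical common schema for events from any monitoring tool."""
    source: str
    host: str
    severity: str
    timestamp: datetime
    message: str


def from_tool_a(raw: dict) -> Event:
    # Assumed format: epoch seconds and a numeric severity level.
    severity = {1: "info", 2: "warning", 3: "critical"}[raw["level"]]
    when = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    return Event("tool_a", raw["hostname"], severity, when, raw["msg"])


def from_tool_b(raw: dict) -> Event:
    # Assumed format: ISO 8601 timestamp and an upper-case severity string.
    when = datetime.fromisoformat(raw["time"])
    return Event("tool_b", raw["node"], raw["severity"].lower(), when,
                 raw["description"])


# Two made-up payloads describing the same underlying problem.
raw_a = {"hostname": "db01", "level": 3, "ts": 1704445200,
         "msg": "connection pool exhausted"}
raw_b = {"node": "api07", "severity": "WARNING",
         "time": "2024-01-05T08:59:30+00:00",
         "description": "p95 latency above threshold"}

# Once normalized, events from different tools can be sorted and queried together.
events = sorted([from_tool_a(raw_a), from_tool_b(raw_b)],
                key=lambda e: e.timestamp)
for event in events:
    print(event.timestamp.isoformat(), event.source, event.severity, event.message)
```

With every tool's events in one shape, the shared dashboards, capacity forecasts and cross-tool alerts described above become queries over a single table rather than separate integrations.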
Creating business value with integrated monitoring
The key to driving business value lies not in selecting additional monitoring tools, but in using the speed, insights and collaboration that an integrated monitoring architecture or autonomous operations platform enables.
One area DevOps teams should target is using machine learning to drive improvements in incident response. Once the data is centralized, machine learning algorithms can be used to correlate alerts, simplify diagnostics and improve the MTTR for critical incidents.
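As a rough illustration of what correlating alerts means, the sketch below groups alerts that arrive close together in time. This is a simple time-window heuristic, not a machine learning algorithm, and the alert fields and the five-minute window are assumptions chosen for the example. Production systems typically also weigh topology, alert content and historical patterns.

```python
from datetime import datetime, timedelta


def group_by_time_window(alerts, window=timedelta(minutes=5)):
    """Group alerts; an alert joins the current group if it arrives
    within `window` of the previous alert, otherwise it starts a new group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if groups and alert["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Hypothetical alerts from several monitors.
alerts = [
    {"monitor": "api-latency", "time": datetime(2024, 1, 5, 9, 2)},
    {"monitor": "db-connections", "time": datetime(2024, 1, 5, 9, 0)},
    {"monitor": "checkout-errors", "time": datetime(2024, 1, 5, 9, 4)},
    {"monitor": "disk-usage", "time": datetime(2024, 1, 5, 10, 0)},
]

for number, group in enumerate(group_by_time_window(alerts), start=1):
    print(f"incident {number}:", [a["monitor"] for a in group])
```

Here the first three alerts collapse into one candidate incident and the later disk alert stands alone, so four alerts become two items to triage.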
Integrated monitoring strategies will drive efficiencies. Once data is aggregated and alerts are intelligently grouped into incidents, routing them to the appropriate people helps free up others who do not need to be involved in diagnosing or resolving the incident. If a single incident trips up multiple monitors but the system recognizes that a database sent out the first alert, the data operations team can be the first, and possibly the only, group alerted to this incident. This works well when an integrated monitoring architecture is also integrated with workflow tools such as Jira, ServiceNow, Slack or others used in most enterprises today. This approach can also drive significant customer experience improvements when communications around incidents are shared with customer service teams.
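The database example can be sketched as a small routing rule: find the earliest alert in an incident and notify the team that owns that component. The `OWNERS` mapping, team names and alert fields are hypothetical, and the earliest alert is only a proxy for the root cause, so treat this as a starting heuristic rather than a diagnosis.

```python
from datetime import datetime

# Hypothetical mapping from component type to the owning team.
OWNERS = {"database": "data-ops", "api": "platform", "network": "net-ops"}


def first_responder(incident_alerts, owners=OWNERS, default="service-desk"):
    """Return the team owning the component that raised the earliest alert."""
    first = min(incident_alerts, key=lambda a: a["time"])
    return owners.get(first["component"], default)


# One incident that tripped three monitors; the database alerted first.
incident = [
    {"component": "api", "time": datetime(2024, 1, 5, 9, 2)},
    {"component": "database", "time": datetime(2024, 1, 5, 9, 0)},
    {"component": "frontend", "time": datetime(2024, 1, 5, 9, 4)},
]

print(first_responder(incident))  # data-ops
```

The returned team name is what would then be handed to the workflow tool, so one ticket or channel message goes out instead of one per monitor.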
About the author
Isaac Sacolick, president of StarCIO, is the author of Driving Digital: The Leader's Guide to Business Transformation through Technology, which covers many practices such as agile, DevOps and data science that are critical to successful digital transformation programs. Sacolick is a recognized top social CIO, digital transformation influencer, industry speaker and blogger at Social, Agile and Transformation.