Why the Future of Monitoring Is Agentless



Traditionally, monitoring software has relied heavily on agent-based approaches for extracting telemetry data from systems. Observability requires better telemetry than agents currently provide. OpenTelemetry is driving advances in this area by creating a standard format and APIs to create, transmit, and store telemetry data. This unlocks new opportunities in observability.
Oct 15, 2021 · 11 min read
by Austin Parker, reviewed by Daniel Bryant
I like to joke that any sufficiently advanced software can be reduced to “fancy ETL (extract-transform-load).” Admittedly, this is a blithe oversimplification that gets me booed out of fancy cocktail parties, but the best oversimplifications have at least one foot in the truth.
This adage holds especially true in the realm of monitoring and observability; the first step on your journey to measure performance is often a shell script or installer, plus an awful lot of configuration files, that extracts performance data, transforms it into a proprietary format, and loads it into a database.
Commercial and open source monitoring systems alike generally require installing and maintaining these agent processes on virtual machines, injecting them into your process, or running them as container sidecars – perhaps even all three.
These agents require care and feeding, such as security and configuration updates, and bring with them a variety of business concerns such as cost management and resource tuning. They also become a load-bearing dependency for our tech stacks, so deeply wedged into our systems that we can’t imagine life without them.
For the practices of monitoring and observability to truly move forward, however, we must seek to pry these agents out of our systems. I believe that the future of monitoring is going to be agentless in order to keep up with the increasing complexity of our systems. 
To understand why this is the case, though, we should first step back and understand why agents are so popular and widespread today. After that, I’ll discuss why agent-based monitoring approaches negatively impact our systems and organizations. Finally, I’ll touch on why the future of monitoring is agentless, and what we can look forward to over the next several years.
The proliferation of agent-based monitoring is perhaps a natural consequence of several trends in application development and system design. 
Let’s explore this in a little more detail, and discuss why each contributes to the problem.
The pace of innovation is one of the biggest contributing factors to agent sprawl. The past decade has seen the rise of virtualization become eclipsed by the rise of infrastructure-as-a-service and the public cloud become eclipsed by the rise of containers become eclipsed by the rise of container orchestration and Kubernetes become eclipsed by the rise of serverless and edge computing and… well, you see my point. 
Each of these technologies offers new opportunities and challenges in terms of monitoring, which has resulted in something of a monitoring arms race between incumbent vendors striving to keep up with the newest platforms and new insurgent vendors who focus on providing solutions to whatever the newest and hottest tech is. Both parties often arrive at the same logical destination, however – to incorporate data from these changing platforms and runtimes, they need to deploy something to collect telemetry. 
The reality of building an observability platform is that the data format you use isn’t necessarily going to be something you can ship to your clients or build into their software, or the platforms they’re using – you need a translation layer. Agents, traditionally, have provided that layer and integration point.
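To make that translation-layer role concrete, here is a toy sketch (not any vendor’s actual agent code) that takes a backend-specific stats payload and re-emits it in one neutral shape; the field names are illustrative stand-ins.

```python
# A toy "translation layer": convert a backend-specific stats payload
# into a neutral metric shape a monitoring backend could ingest.
# The input field names are illustrative, not any specific agent's schema.
def translate_cache_stats(raw: dict) -> list[dict]:
    return [
        {"name": "cache.connections", "value": raw["connected_clients"]},
        {"name": "cache.memory_bytes", "value": raw["used_memory"]},
    ]

print(translate_cache_stats({"connected_clients": 12, "used_memory": 4_194_304}))
```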
These myriad platforms have also led to myriad applications written in myriad languages, communicating with each other over increasingly standardized protocols. This has stymied efforts to standardize the language of performance monitoring: the way that applications and platforms emit telemetry data, the query language and structure for that data, and how the data is visualised and parsed.
While we’ve seen ‘vertical’ integration of these technologies in specific platforms and runtimes (such as .NET or Spring), there’s been less success in creating broad ‘horizontal’ integration across a variety of them: standardized telemetry from Go, JavaScript, C#, various network appliances, container runtimes, and so on is much harder to come by, especially when you try to integrate multiple versions of any of these, a common occurrence in most companies.
Agents obviously serve as a point solution here, but an imperfect one. An agent can not only consume telemetry from multiple systems, it can also process that telemetry into a desired format. Quite often, agents are even capable of hooking into systems that don’t emit certain types of telemetry and producing it for you; for example, an agent can generate trace data from web applications by wrapping connections in order to perform APM or RUM.
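The wrapping technique itself is simple at heart. The following is a minimal, hypothetical Python sketch of the idea: intercept a call, time it, and emit a span-like record. Real agents achieve the same effect by patching client libraries and web frameworks rather than asking you to apply a decorator.

```python
import functools
import time

def traced(operation_name):
    """Wrap a callable and emit a crude span-like record around each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                duration_ms = (time.monotonic() - start) * 1000
                # A real agent would build and export a span here, not print.
                print(f"span={operation_name} duration_ms={duration_ms:.1f}")
        return wrapper
    return decorator

@traced("db.query")
def fetch_user(user_id):
    time.sleep(0.01)  # stand-in for a database call
    return {"id": user_id}

fetch_user(42)
```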
With this in mind, it’s easy to see how commercial products have embraced agents. They’re easy to reason about, their value can be explained clearly, and they dangle the promise of immediate results for little effort on your part. You can certainly see how they made sense in a world where server boxes and virtual machines were the primary unit of computing. What is odd, however, is how commoditized these agents have become. It’s true that there are distinctions between the agents in terms of configuration, performance, and other features, but there isn’t a lot of daylight between them in terms of the actual telemetry data they generate.
This isn’t terribly surprising if you consider that the agents are scraping telemetry data that’s made available through whatever commodity database, web server, or cache you’re using. We could ask ourselves why these commodity services don’t all adopt a lingua franca – a single format for expressing metrics, logs, and traces in. Perhaps a more interesting question is why the agents that are monitoring them don’t share a common output format.
Your personal answer to these questions has a lot to do with how cynical you are. A more virtuous-minded individual may surmise that these agents are a necessary and valuable component of a monitoring system. The most cynical might suggest that vendors have explicitly relied on these agents being difficult to migrate away from, and have resisted standardization efforts in order to encourage lock-in to their products. 
Personally, I land somewhat in the middle. Obviously, vendors prefer a system that’s easy to set up and hard to leave, but agents have made it easy for engineers and operators to quickly derive value from monitoring products. However, the impedance mismatch between what’s good for quickly onboarding and what’s good for long-term utility can be staggering, and contribute to flaws in observability practice.
Agent-based monitoring systems tend to discourage good observability development practices in applications. They encourage a reliance on “black box” metrics from other components in the stack, or on automatic instrumentation of existing libraries with little extension into business logic. This isn’t meant to downplay the usefulness of these metrics or instrumentations, but to point out that they aren’t enough. 
Think about it this way — what would you say if the only logs you had access to for your application were the ones that your web framework or database provided? You could muddle through, but it’d be awfully hard to track down errors and faults caused by bugs in the code you wrote. We don’t bat an eyelash, though, when it’s suggested that our metrics and traces be generated entirely through agents!
Let me be clear, I’m not advocating that we throw our agents into the trash (at least, not immediately), or that the telemetry we’re getting from them is somehow valueless. What I am saying, however, is that our reliance on these agents is a stumbling block that makes it harder to do the right thing overall. The fact that they’re so ubiquitous means that we rarely get the chance to think about the telemetry our entire application generates in a holistic fashion. 
More often than not, we’re being reactive rather than proactive. This can manifest in many different ways — suddenly dealing with cardinality explosions caused by scaling insufficiently pruned database metrics, an inability to monitor all of our environments due to agent cost or overhead, or poor adoption due to dashboards that tell us about everything except what’s actually happening. New telemetry points are added in response to failures, but are often disconnected from the whole, leading to even more complexity. Our dashboards proliferate for each new feature or bug, but failures tend to manifest in new and exciting ways over time, leaving us with a dense sprawl of underutilized and unloved metrics, logs, and status pages.
Agents have placed us in an untenable position, and not just as the engineers tasked with maintaining complex systems; our colleagues and organizations are affected too. Observability is a crucial aspect of scaling systems and adopting DevOps practices such as CI/CD, chaos engineering, feature flags, etc. Agents don’t make a lot of sense in this world! They artificially limit your view of a system to what the agent can provide.
The insights delivered by agents can be useful if you know what you’re looking at, but the custom dashboards and metrics delivered for a piece of software such as, say, Kafka, can be inscrutable to people that aren’t already experts. Spurious correlation due to data misinterpretation can lead to longer downtime, on-call fatigue, and more. While considering the engineering impact of agents is important, the business impact is crucial as well.

Datadog’s default Kafka dashboard. If I don’t know a lot about Kafka, how can I interpret this?
If it isn’t obvious by now, I don’t think that agents are the future. In fact, I think that agents will become less and less important over time, and that developers will begin to incorporate observability planning into their code designs. There are a few reasons I believe this to be the case.
It’s striking to see how OpenTelemetry has changed the conversation about monitoring and observability. Perhaps it’s unsurprising, given its popularity and the level of commitment that the open source and vendor community have given it already. Kubernetes has integrated OpenTelemetry for API server tracing, and multiple companies have started to accept native OpenTelemetry format data.
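As a rough sketch of what accepting native OpenTelemetry data means in practice, an application can ship spans over OTLP straight from the OpenTelemetry Python SDK, with no vendor agent in the path; the service name and endpoint below are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service and send spans directly to an OTLP endpoint
# (placeholder address) instead of routing through a vendor agent.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otlp.example.com:4317"))
)
trace.set_tracer_provider(provider)
```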

OpenTelemetry is the second-most active CNCF project, after Kubernetes.
OpenTelemetry enables developers to spend less time relying on agents for ‘table stakes’ telemetry data, and more time designing metrics, logs, and traces that give actionable insights into the business logic of their applications. Standard attributes for compute, database, serverless, containers, and other resources can lead to a new generation of performance and cost optimization technology. This latter step is crucial, as it’s not enough to simply log everything and age it out; we need to measure what matters in the moment, while preserving historical trend data for future use.
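As an illustration of that shift, here is what hand-designed telemetry around business logic can look like with the OpenTelemetry Python API; the operation, attribute names, and order payload are hypothetical stand-ins for whatever matters in your own domain.

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout.service")
meter = metrics.get_meter("checkout.service")

# A business-level metric that no external agent could infer on its own.
orders_placed = meter.create_counter(
    "orders.placed", unit="1", description="Orders accepted at checkout"
)

def place_order(order: dict) -> None:
    # Name the span after the business operation, not the framework hop,
    # and attach attributes that answer real questions during an incident.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.item_count", len(order["items"]))
        span.set_attribute("order.payment_method", order["payment_method"])
        # ... the actual business logic would go here ...
        orders_placed.add(1, {"order.payment_method": order["payment_method"]})

place_order({"items": ["sku-1", "sku-2"], "payment_method": "card"})
```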
The world that gave us agent-based, per-host monitoring isn’t the world we live in today. Systems aren’t a collection of virtual machines or servers in racks any more; they’re a multi-faceted mesh of clients and servers scattered across public and private networks. The agents that exist in this world will be smarter, more lightweight, and more efficient. Instead of doing all the work themselves, they’ll exist more as stream processors, intelligently filtering and sampling data to reduce overhead, save on network bandwidth, and control storage and processing costs.
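Some of that filtering and sampling can already live in the SDK rather than in a heavyweight agent. A minimal sketch with the OpenTelemetry Python SDK keeps only a fraction of traces at the root and follows the parent’s decision elsewhere; the 10% ratio is an arbitrary example.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample roughly 10% of new traces at the root and follow the parent's
# decision for child spans, trimming volume before data leaves the process.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```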
One thing I try to keep in mind when I look to the future is that there are millions of professional developers in the world who don’t even use the state of the art today. The amount of digital ink spilled over endless arguments about observability, what it is and isn’t, and the best way to use tracing to monitor a million microservices pales in comparison to the number of people who are stuck with immature logging frameworks and a Nagios instance. I don’t say this to demean or diminish those developers or their work – in fact, I think that they exemplify the importance of getting this next generation right.
If observability becomes a built-in feature, rather than an add-on, then it becomes possible to embrace and extend that observability and incorporate it into how we build, test, deploy, and run software. We’ll be able to better understand these complex systems, reduce downtime, and create more reliable applications. At the end of the day, that gives us more time doing things we want to do, and less time tracking down needles in haystacks. Who could ask for anything more?
Austin Parker is the Principal Developer Advocate at Lightstep, and has been creating problems with computers for most of his life. He’s a maintainer of the OpenTelemetry project, the host of several podcasts, organizer of Deserted Island DevOps, infrequent Twitch streamer, conference speaker, and more. When he’s not working, you can find him posting on Twitter, cooking, and parenting. His most recent book is Distributed Tracing in Practice, published by O’Reilly Media.

 
