
Why Telemetry & Observability Are Broken for Today's Applications and the Path Ahead

By Fausto Bernardini, Aravind Kalaiah
observability, telemetry, distributed-systems, microservices

Observability matters more than ever — yet we're still in the dark.

In the era of cloud-native architectures and microservices, observability has become the lifeline for engineering teams tasked with maintaining complex distributed systems.

Unlike traditional monitoring, which tells you when something is wrong, observability provides the why—offering deep insights into system behavior that enable rapid incident resolution and continuous optimization.

Flying Blind at Scale Is Not an Option

As applications decompose into hundreds of microservices, each with its own failure modes and performance characteristics, the ability to understand system state through telemetry data has transformed from a luxury to an absolute necessity.

Without comprehensive observability, teams are essentially flying blind, unable to meet the demanding SLAs that modern businesses require or optimize systems for cost and performance in an increasingly competitive landscape.

Observability is crucial, yes. But implementing it well is not easy...

Writing and Maintaining Telemetry Sucks

We all know it. We all know it's true. Writing telemetry is a grind — and it never stays right.

It Takes Skills Most Teams Don't Have

Establishing a well-thought-out observability posture is a complex task that requires specific expertise—expertise that many organizations lack or cannot afford to dedicate to this critical but often undervalued function.

The reality is that observability implementation rarely follows best practices or strategic planning.

Instead, it evolves organically through a series of compromises and reactive decisions that compound into technical debt.

Telemetry Added in Spurts

Typically, telemetry is added in spurts, following a predictable but problematic pattern. Teams add instrumentation to their services and applications at launch, hoping to catch expected behaviors and common failure modes.

However, this initial instrumentation is based on assumptions about how the application will behave in production—assumptions that rarely survive contact with reality.

End users don't follow predefined paths, traffic patterns can change abruptly based on external factors, and the complex interactions between services create emergent behaviors that no one anticipated during the design phase.

Always Adding After the Fire

The gaps in this initial instrumentation become painfully apparent during incidents. Additional telemetry is invariably added as a response to each incident, appearing as action items in post-mortem documents.

While this reactive approach helps prevent the exact same issue from occurring again, it always leaves unknown unknowns unaddressed.

The result is a game of whack-a-mole where incidents keep happening, often with long remediation times as engineers struggle to understand what's happening in systems with inconsistent observability coverage.

Everyone Speaks a Different Language

The problem is further exacerbated by organizational silos. Each development squad establishes its own standards for what to measure, at what level of detail, and with which associated context.

One team might use "user_id" while another uses "userId" or "customer_identifier" for the same concept. Labeling of attributes is left to individual developers, leading to a chaos of naming conventions that makes correlation across services nearly impossible when searching for root causes downstream.

This semantic inconsistency transforms what should be straightforward queries into complex investigations requiring tribal knowledge of each team's conventions.
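To make the drift concrete, here is a minimal sketch, using OpenTelemetry's Python API, of three teams tagging the same concept three different ways; the attribute keys and the shared-constants module are purely illustrative assumptions, not anyone's actual convention.

```python
# Hypothetical sketch of naming drift (requires the opentelemetry-api package).
from opentelemetry import trace

tracer = trace.get_tracer("naming-drift-demo")

def checkout_service(user):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user_id", user)              # team A's convention

def billing_service(user):
    with tracer.start_as_current_span("charge") as span:
        span.set_attribute("userId", user)               # team B's convention

def support_service(user):
    with tracer.start_as_current_span("ticket") as span:
        span.set_attribute("customer_identifier", user)  # team C's convention

# One low-cost remedy: a tiny shared module of attribute keys that every team
# imports, echoing OpenTelemetry's semantic-conventions approach.
class Attrs:
    USER_ID = "app.user.id"  # the single agreed-upon key (name assumed here)

def checkout_service_v2(user):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute(Attrs.USER_ID, user)
```

With a shared key, a single query on app.user.id follows the user across all three services instead of requiring tribal knowledge of three spellings.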

Cascading Fragility

As observability platforms evolve or organizations decide to migrate to different solutions, the situation deteriorates further.

It becomes time-prohibitive to rewrite telemetry consistently across all services, leaving the observability landscape as a mishmash of tools, standards, and semantic conventions.

Legacy services continue to emit telemetry in old formats while new services adopt modern standards, creating a Tower of Babel effect in the observability stack.

Perhaps most critically, the reality in most organizations is that telemetry is perpetually relegated to low-priority status.

Sprint after sprint, the tickets for improving observability get pushed below the line in favor of feature development or urgent bug fixes.

The irony is palpable—the very instrumentation that would help prevent and quickly resolve those urgent bugs never gets implemented because teams are too busy fighting fires.

The Problem with Legacy Thinking and Tooling

Much of the instrumentation added to applications today still reflects thinking from previous-generation platforms, focusing heavily on low-level infrastructure signals.

CPU Utilization Isn't the Root Cause

Metrics like CPU utilization, memory consumption, and disk space remaining certainly have their place in the observability toolkit, but they fall woefully short of providing the level of detail needed to quickly find bugs in highly distributed and dynamic systems.

These infrastructure metrics tell you that something is wrong but provide little insight into what is causing the problem or how to fix it.

You Need to See the Whole Story

A much better approach is to instrument code for detailed end-to-end traces that capture the complete journey of a request through the system.

Many teams start with auto-instrumentation available for common languages, frameworks, and platforms—tools that can automatically capture basic trace data without code changes.

This approach provides immediate value and can illuminate the basic flow of requests through the system.
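As a rough illustration, the sketch below wires up OpenTelemetry's Python SDK with a single instrumentation library so that outbound HTTP calls are traced without changing any handler code; the packages named are real, but treat the exact wiring as an assumption to verify against the current OpenTelemetry docs.

```python
# Minimal sketch: library-level auto-instrumentation with OpenTelemetry.
# Assumes: pip install opentelemetry-sdk opentelemetry-instrumentation-requests requests
import requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# A basic tracer that prints finished spans to stdout (a real deployment
# would use an OTLP exporter pointed at a backend instead).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# One call patches the requests library; every outbound HTTP call now emits
# a client span with method, URL, and status-code attributes.
RequestsInstrumentor().instrument()

requests.get("https://example.com")  # traced automatically, no handler code touched
```

The same idea extends to web frameworks, database clients, and message queues, which is what makes auto-instrumentation such a quick first win.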

However, to fully reap the benefits of distributed tracing, organizations need to go beyond auto-instrumentation.

Beyond Auto: You Need Custom Instrumentation

Ideally, developers would add custom instrumentation throughout the code itself, capturing details about business logic, user context, and application state as high-dimensionality, high-cardinality attributes.

This rich contextual information is what transforms traces from simple performance timelines into powerful debugging tools.

Custom spans can capture database query parameters, business rule evaluations, external API responses, and user-specific context that makes it possible to understand not just that a request failed, but why it failed for that specific user under those specific conditions.
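Here is a hedged sketch of what such a span might look like with OpenTelemetry's Python API; the attribute names, discount rule, and data shapes are all invented for illustration.

```python
# Illustrative custom span carrying business context (opentelemetry-api only).
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")

def apply_discount(order, user):
    with tracer.start_as_current_span("apply_discount") as span:
        # High-cardinality, business-level context: the "why", not just the "what".
        span.set_attribute("app.user.id", user["id"])
        span.set_attribute("app.user.plan", user["plan"])
        span.set_attribute("app.order.value_usd", order["total"])
        span.set_attribute("app.discount.rule", "loyalty-tier-2")
        try:
            discount = order["total"] * (0.10 if user["plan"] == "pro" else 0.0)
            span.set_attribute("app.discount.amount_usd", discount)
            return order["total"] - discount
        except Exception as exc:
            # Record the failure on the span so the trace explains itself.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```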

But Rich Context Takes Time You Don't Have

But the engineering time needed to properly instrument all your code can be staggering, often requiring weeks or months of dedicated effort that many organizations cannot afford to allocate.

The High Cost of the Wrong Signals

So what is the observability industry's answer to the difficulty of collecting just the right, rich instrumentation?

"Collect it all!"

Turn on the firehose of data. (Vendors with usage-based pricing love this.)

The "Just in Case" Mentality

Faced with the complexity of proper instrumentation, many organizations settle for what seems like the easiest path: capturing vast amounts of high-frequency, low-value metrics.

The thinking goes that if you collect enough data, surely the answer will be in there somewhere. This approach leads to metric explosion, where every possible measurement is collected "just in case" it might be useful someday.

Sometimes teams compound this problem by adding high-cardinality attributes to these metrics, exponentially increasing the volume and cost of storing and processing this data deluge.
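A back-of-the-envelope sketch, with every count invented for illustration, shows how a single high-cardinality label multiplies the number of time series a backend must store and bill for:

```python
# Hypothetical cardinality math for one request counter.
endpoints = 50         # distinct API routes
regions   = 5          # deployment regions
statuses  = 10         # HTTP status codes actually observed
users     = 100_000    # unique user IDs: the high-cardinality culprit

# Every unique label combination becomes its own time series.
series_without_user_label = endpoints * regions * statuses
series_with_user_label    = series_without_user_label * users

print(series_without_user_label)  # 2,500 series
print(series_with_user_label)     # 250,000,000 series
```

Context like a user ID is far better carried on traces, where it enriches individual events, than on metrics, where it multiplies every series.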

Buried in Endless Logs

Logging often becomes the default fallback, with applications generating verbose logs for every operation.

While logs can be valuable, the reality is demonstrated nightly by the many bleary-eyed on-callers spending hours grepping through infinite amounts of unstructured text, desperately searching for clues about system behavior.

Without structure, context, or correlation, logs become a haystack in which the crucial needles of insight are nearly impossible to find.
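For illustration, here is a minimal before-and-after sketch using only Python's standard library; the field names, IDs, and trace ID are made up, but the contrast is the point.

```python
# Unstructured vs. structured logging (standard library only; values invented).
import json, logging, sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

# Before: free-form text that can only be grepped.
log.info("payment failed for user 4812 after retrying gateway, err=timeout")

# After: one JSON object per event, carrying the identifiers needed to pivot
# from a log line to the exact trace and customer involved.
log.info(json.dumps({
    "event": "payment.failed",
    "user_id": "4812",
    "gateway": "example-psp",
    "error": "timeout",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # propagated from the active span
    "duration_ms": 3021,
}))
```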

Vendors Love the Waste

Observability platform vendors are often complicit in encouraging this wasteful behavior.

Their consumption-based pricing models create a perverse incentive where vendor revenue grows linearly with data volume, regardless of whether that data provides value to customers.

The more data you send, the more you pay, creating a situation where vendors have little motivation to help customers optimize their telemetry for quality over quantity.

Budgets Spiral Out of Control

The financial impact of this approach is staggering. Industry reports indicate that organizations commonly spend between 5% and 30% of their total infrastructure budget on observability tools alone.

As one startup founder put it: "The cost... is just insane. I can't stomach spending nearly 50% of our AWS bill on one tool that just watches the things that are actually making us money."

For many companies, observability has become one of the largest line items in their technology budget, sometimes exceeding the cost of the infrastructure being monitored.

This cost spiral is unsustainable, forcing organizations to make difficult trade-offs between observability coverage and budget constraints.

The statistic that 71% of CIOs say cloud complexity exceeds human ability to manage isn't just about technical complexity—it's about the economic impossibility of achieving comprehensive observability using traditional approaches.

At 2 AM, It All Comes Crashing Down

The pain downstream...

More Noise = Less Reliability

But perhaps more important than the direct vendor costs are the costs of the signals missed amid the deluge of noise.

Whether it's the warning signs that could have been spotted early, hiding in volumes of data, or the hard-to-find root causes and remedies buried in the flood of data during the 2 AM firefighting calls, noise drowns out the signals that matter.

These costs, in engineer pay, burnout, the ability to hire, and customer trust, are harder to measure.

Pager Duty Nightmares

The lack of proper instrumentation becomes painfully visible when the dreaded 2 AM alert pages an on-caller.

Without comprehensive telemetry, incident root cause analysis and remediation require heroics that should never be necessary in a well-instrumented system.

Engineers find themselves reverse-engineering system behavior from incomplete data, forming hypotheses based on intuition rather than evidence, and often resorting to "restart and pray" tactics when the true cause remains elusive.

Failures Hide in the Cracks

More often than not, incidents are caused by unpredictable combinations of factors that only proper observability could have revealed: subtle network slowdowns that cascade into timeout failures, race conditions that manifest only under specific load patterns, or database queries that became terribly slow after the last seemingly innocent code update.

The complex interactions between services in a microservices architecture create failure modes that are impossible to anticipate without comprehensive observability.

Customers Notice Before You Do

Perhaps most damaging to team morale and company reputation is when customers become the monitoring system, alerting you to problems you weren't even aware existed.

Nothing undermines confidence in a service more than users reporting issues before internal monitoring catches them. Yet this scenario plays out daily across the industry, a direct result of observability strategies that focus on infrastructure metrics while ignoring user experience signals.

Connecting Telemetry to Business Value

The Textbook Is Clear

Best practices for observability have been well-documented in recent years, from Google's influential SRE book to countless blog posts and conference talks.

The consensus is clear: effective telemetry starts with understanding the end user experience and business priorities. Service Level Indicators (SLIs) and Service Level Objectives (SLOs) should be driven by associated business KPIs and risks, creating a direct line from technical metrics to business outcomes.

This approach ensures that observability efforts focus on what matters most to the organization and its customers.
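As a simple illustrative example, with invented numbers, once the business agrees on a target, an availability SLI and its error budget reduce to a few lines of arithmetic:

```python
# Hypothetical checkout-availability SLO over a 30-day window (numbers invented).
SLO_TARGET  = 0.999        # business decision: 99.9% of checkouts must succeed
WINDOW_DAYS = 30

total_requests  = 4_200_000   # measured from traces or metrics over the window
failed_requests = 2_900

sli = 1 - failed_requests / total_requests             # measured availability
error_budget     = (1 - SLO_TARGET) * total_requests   # failures we can afford
budget_remaining = error_budget - failed_requests

print(f"SLI {sli:.5f} vs SLO {SLO_TARGET}")
print(f"Error budget {error_budget:.0f} requests, remaining {budget_remaining:.0f}")
```

Burning through the budget too fast is the signal to shift effort from features to reliability; staying well under it is license to ship faster.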

Reality Rarely Matches the Playbook

Yet despite the wealth of available knowledge, few organizations successfully deploy these best practices across their technology stack.

More often, the state of telemetry reflects decades of half-baked projects started with the best intentions but never completed.

Pockets of excellence where one team has implemented exemplary observability practices exist alongside poorly managed stacks where even basic monitoring is absent.

The overall approach is haphazard, driven more by individual initiative than organizational strategy.

Why Teams Fall Short

The challenges preventing adoption of best practices are well known but difficult to overcome.

Competing priorities mean observability improvements constantly lose out to feature development. Reduced budgets force teams to choose between investing in better tooling or maintaining existing services.

The lack of specialized observability skills means even well-intentioned efforts may be misdirected.

And perhaps most frustratingly, each new generation of technology seems to make tooling more complicated to use, raising the bar for effective implementation even as the need for observability grows more acute.

The Path Ahead

There is a better way forward. The current state of observability—with its high costs, complexity, and inconsistent implementation—is not an inevitable consequence of modern architectures.

It's a solvable problem that requires new approaches and tools designed for the realities of distributed systems.

Better Observability at the Source is Sylogic's Mission

At Sylogic, we are working on AI-powered agents that help you automate your way out of this mess: intelligently instrumenting applications, correlating signals across services, and focusing on the metrics that truly matter for your business.

The future of observability lies not in collecting more data but in collecting the right data, not in manual instrumentation but in intelligent automation, and not in reactive fixes but in proactive system understanding.

Stay tuned for more—help is on the way!

Fausto Bernardini
Fausto is the co-founder & CTO at Sylogic. He has built data platforms and led cloud organizations at hyperscalers such as Google and IBM Cloud.
Aravind Kalaiah
Aravind is the co-founder and CEO of Sylogic. He has built large-scale AI platforms at organizations such as OpenAI, Meta, and Cerebras.