Your Telemetry Doesn't Need More Data. It Needs Better Context.

By Fausto Bernardini, Aravind Kalaiah

Observability is supposed to help you fix problems faster. But for too many teams, it feels like it's become a source of them.

Teams burn hours digging through logs, patching gaps, and wrestling with brittle, inconsistent telemetry—only to miss the next incident anyway. The data is noisy, expensive, and can get in the way, making the truth harder to discover.

And despite best practices being well known, most orgs still struggle to connect observability to business impact.

It doesn't have to be this way. Here are a few things we learned leading and using observability at a couple of hyperscalers.

In this post, we'll unpack why telemetry is broken for modern apps—and what a smarter, nimbler, more reliable, proactive, business-aware approach looks like.

“Hold on. I'm not a Hyperscaler!”

Yes, hyperscalers have teams of very smart people dedicated not only to implementing best practices, but to defining them.

Your team, however, is probably like most: stretched thin, keeping systems running, putting out fires, and shipping the features users care about.

Which is why these lessons apply especially to you…

Fast Teams Win and Observability Adds Speed

Microservices, cloud architectures, LLM calls, SaaS ecosystems. As applications decompose into hundreds of services, understanding system behavior through telemetry data stops being a luxury and becomes a necessity.

Without clear observability, teams cannot uncover root causes efficiently, meet demanding SLAs or optimize systems effectively. They move slowly.

Root cause determination is a big challenge. Lots of churn across departments. Debugging takes a lot of time.

— Former SRE Lead at a Major Financial Institution

But when done right, observability is a key ingredient in how teams build faster than their competitors.

1. Make Maintaining Observability Suck Less for Your Teams

Writing and maintaining observability sucks.

Establishing effective observability requires specific expertise that many organizations lack. Telemetry is typically added in reactive spurts—initial instrumentation at launch based on assumptions, followed by patches after incidents. This approach leaves unknown unknowns unaddressed while incidents keep occurring with long remediation times.

The problem compounds when each development squad creates their own standards.

Different teams capture different dimensions on their metrics; one team uses "user_id" while another prefers "userId" for the same concept. This chaos of naming conventions makes cross-service correlation time-consuming during incident investigations. And as platforms evolve and migrations pile up, rewriting telemetry becomes time-prohibitive, leaving a mishmash of tools and standards.
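
One low-effort fix, sketched below: keep attribute names in a single shared module and make every service import it. The module and key names here are illustrative, not an existing standard (OpenTelemetry's semantic conventions already cover many common attributes); the point is that each spelling gets decided once, not once per squad.

```python
# conventions.py -- a hypothetical shared module; the keys are illustrative.
from opentelemetry import trace

# One canonical spelling per concept, imported by every service.
ATTR_USER_ID = "app.user.id"      # not "userId", not "user_id", decided once
ATTR_TENANT_ID = "app.tenant.id"
ATTR_ORDER_ID = "app.order.id"

def tag_current_span(user_id: str, tenant_id: str) -> None:
    """Attach the shared identifiers to whatever span is currently active."""
    span = trace.get_current_span()
    span.set_attribute(ATTR_USER_ID, user_id)
    span.set_attribute(ATTR_TENANT_ID, tenant_id)
```

Cross-service correlation then stops being an exercise in guessing which spelling each team chose.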

Telemetry is always low priority. Sprint after sprint it gets pushed below the line.

— Fractional CTO & Advisor to Startups

Most critically, observability remains perpetually low-priority. Sprint after sprint, instrumentation improvements get pushed below the line in favor of features or urgent fixes—ironically, the very fixes that better observability would prevent.

Modern observability requires a system. Not ad-hoc spurts, but a continuous process that reviews gaps and suggests fixes against policies built on org-wide best practices.

A system that makes it too easy to keep telemetry up to date and in sync across the org as apps change.
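
As a taste of what the gap-review piece could look like, here's a rough policy-as-code sketch you could run in CI against spans captured from an integration test. The policy contents and the exported-span shape (a plain list of dicts) are assumptions for illustration, not a real format.

```python
# A hypothetical org-wide policy: attributes each service's spans must carry.
REQUIRED_ATTRIBUTES = {
    "checkout-service": {"app.user.id", "app.order.id"},
    "payment-service": {"app.user.id", "app.payment.method"},
}

def find_gaps(exported_spans: list[dict]) -> list[str]:
    """Flag spans that are missing attributes the policy requires."""
    gaps = []
    for span in exported_spans:
        required = REQUIRED_ATTRIBUTES.get(span.get("service"), set())
        missing = required - set(span.get("attributes", {}))
        if missing:
            gaps.append(f"{span['service']}/{span['name']}: missing {sorted(missing)}")
    return gaps

# Run against spans captured from a test environment:
spans = [{"service": "checkout-service", "name": "place_order",
          "attributes": {"app.user.id": "u-123"}}]
print(find_gaps(spans))  # ["checkout-service/place_order: missing ['app.order.id']"]
```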

What are the components of this system? Read on…

2. Start With Business + User Context

We have a problem with legacy thinking and tooling.

Much instrumentation still reflects previous-generation thinking, focusing on infrastructure metrics like CPU and memory usage. While useful, these signals fall short in distributed systems—they indicate something is wrong but not why or how to fix it.

End-to-end distributed tracing offers a better approach.

Auto-instrumentation provides immediate value, but to truly benefit, teams need custom instrumentation capturing business logic and user context as high-cardinality attributes.

This transforms traces from simple timelines into powerful debugging tools.
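
Here's a minimal sketch of what that custom layer can look like, assuming OpenTelemetry; the span and attribute names are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def process_checkout(cart, user):
    with tracer.start_as_current_span("process_checkout") as span:
        # Business and user context as high-cardinality attributes.
        span.set_attribute("app.user.id", user.id)
        span.set_attribute("app.user.plan", user.plan)            # e.g. "enterprise"
        span.set_attribute("app.cart.value_usd", cart.total_usd)
        span.set_attribute("app.cart.item_count", len(cart.items))
        # ... the actual checkout logic ...
```

During an incident, those attributes are what let you slice latency and error rates by plan, cart size, or a specific user, instead of staring at an undifferentiated p99.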

However, the engineering time required for comprehensive instrumentation often proves prohibitive. Read on for solutions to this bottleneck…

3. Connect Telemetry to Business Value

Best practices are well documented: effective telemetry starts with understanding user experience and business priorities, and letting those guide what you instrument.

SLIs and SLOs should align with business KPIs, creating a direct line from technical metrics to business outcomes.
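
Here's a hedged sketch of what that alignment can look like in practice. The KPI, the 99.5% objective, and the window are illustrative numbers, not recommendations:

```python
# Tie the SLO to something the business already tracks.
SLO = {
    "business_kpi": "completed checkouts",
    "sli": "successful checkout requests / all checkout requests",
    "objective": 0.995,       # 99.5% over the window
    "window_days": 30,
}

def error_budget_remaining(good: int, total: int, objective: float = 0.995) -> float:
    """Fraction of the error budget left; negative means the SLO is already blown."""
    allowed_failures = (1 - objective) * total
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures if allowed_failures else 0.0

# 1,000,000 checkouts this window, 3,000 of them failed:
print(error_budget_remaining(997_000, 1_000_000))  # ~0.4, i.e. about 40% of the budget left
```

When the budget runs low, that's a business-legible signal to shift effort from features to reliability work.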

Yet few organizations successfully deploy these practices. Why? The state of telemetry typically reflects decades of incomplete projects—pockets of excellence alongside poorly managed stacks. Competing priorities, reduced budgets, lack of skills, and increasingly complex tooling prevent consistent implementation of known best practices.

Rethinking telemetry from first principles has historically been unattainable: an impossible-to-justify rewrite that would consume precious engineering cycles.

But what if that could be automated?

What if you could go from business goals to recommended Service Level Objectives to the necessary telemetry, without burdening your eng team?

4. Stop Throwing Money at the Wrong Signals

The high cost of the wrong signals.

Datadog was 50% of our AWS bill. Just to monitor the code that's actually doing all the work!

— Founder & Engineering Leader at Two Venture-Backed Startups

Faced with complexity, organizations often choose the "easy" path: collecting vast amounts of high-frequency, low-value metrics.

This metric explosion, sometimes compounded by high-cardinality labels attached to those metrics, creates a costly data deluge. Verbose logging becomes the fallback, and on-callers end up spending hours grepping through unstructured text for clues.

Observability vendors encourage this waste through consumption-based pricing that grows with data volume regardless of value.

Organizations commonly spend 5% to 30% of their infrastructure budgets on observability tools, and in extreme cases observability costs exceed those of the infrastructure being monitored. The statistic that 71% of CIOs find cloud complexity exceeds human ability to manage reflects both the technical and the economic impossibility of comprehensive observability using traditional approaches.

The solution? When Step 3's business context is applied to your end-to-end telemetry, you suddenly see what data is needed, and what is not. And you can turn down the firehose of data you're collecting.

You can simplify.
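
One concrete knob, sketched here with OpenTelemetry's head sampling. The 10% ratio is purely illustrative, and truly business-aware keep/drop decisions (always keep checkout errors, always keep enterprise tenants) usually need tail sampling in a collector instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Trace ~10% of requests end to end; child spans follow their parent's decision,
# so sampled traces stay complete instead of arriving full of holes.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```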

5. Better Data Upstream = Easier Troubleshooting Downstream

The pain downstream. Poor instrumentation becomes painfully visible during the dreaded 2 AM incident calls.

Without comprehensive telemetry, root cause analysis requires heroics. Engineers reverse-engineer system behavior from incomplete data, forming hypotheses based on intuition rather than evidence.

When something goes wrong we spend days digging through logs; there's no business lens on any of it.

— Architect at a Fortune 500 company

Incidents often stem from unpredictable factor combinations: network slowdowns cascading into failures, race conditions under specific loads, or queries slowing after innocent-looking updates.

Most damaging is when customers report issues before internal monitoring catches them—a daily occurrence resulting from observability strategies that ignore user experience signals.

Context-rich traces have become the standard debugging tool for highly distributed systems.

You can get halfway there with auto-instrumentation if you use modern frameworks. For even better coverage and detail, custom instrumentation of code spans is the way to go.
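
A rough sketch of both halves, assuming a Flask service and the OpenTelemetry Flask instrumentation package; the route, span, and attribute names are illustrative:

```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)   # the free half: a span per request

tracer = trace.get_tracer("orders")

@app.route("/orders", methods=["POST"])
def create_order():
    # The other half: a custom span around business logic the framework
    # can't see, carrying the context you'll want at 2 AM.
    with tracer.start_as_current_span("reserve_inventory") as span:
        span.set_attribute("app.order.priority", "expedited")
        # ... actual inventory logic ...
    return {"status": "created"}, 201
```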

Except that custom instrumentation has, so far, been prohibitively expensive. But again, help is on the way...

The Path Ahead

Expensive, noisy data and brittle instrumentation that breaks under pressure aren't helping us build faster.

But this isn't inevitable. We don't need more dashboards.

We need observability that's continuous, business-aware, and proactive by design—built to evolve with your systems and your team.

At Sylogic, we're turning these lessons into action—developing AI-powered agents that generate more reliable telemetry, close instrumentation gaps, and help teams stay ahead of issues before they ever show up in a log file.

Because faster teams win. And reliable observability is a key part of how they do it.

Fausto Bernardini
Fausto is the co-founder & CTO at Sylogic. He has built data platforms and led cloud organizations at hyperscalers such as Google and IBM Cloud.
Aravind Kalaiah
Aravind is the co-founder and CEO of Sylogic. He has built large-scale AI platforms at organizations such as OpenAI, Meta, and Cerebras.