Distributed Tracing Is Overrated
Distributed tracing—especially as popularized by OpenTelemetry—has become a default checkbox in modern observability stacks. Turn on auto-instrumentation, collect traces everywhere, sample aggressively, and assume value will emerge.
But for most real-world systems, it doesn’t.
In fact, tracing today is widely overused and misunderstood, and it delivers a terrible noise-to-value ratio. The industry has normalized a tool that benefits only a very small class of workloads while imposing significant cost, complexity, and confusion on everyone else.
This post explains why.
The Core Promise of Tracing (and Who It Actually Helps)
The original promise of distributed tracing is simple:
Connect a single request as it flows through multiple distributed components.
That’s genuinely useful—but only in very specific cases:
Low-volume, request-driven systems
High-fan-out architectures
Latency-sensitive user requests
Clear causal chains across services
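For concreteness, here is a minimal sketch of that happy path using the OpenTelemetry Python API. SDK and exporter setup are omitted, and the service names and the call_service_b helper are invented for illustration:

```python
# Minimal sketch of the case tracing was designed for: one synchronous request
# crossing two services, stitched into a single trace via propagated context.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout")

# Service A: start the trace and inject its context into outgoing headers.
def handle_checkout(request):
    with tracer.start_as_current_span("POST /checkout"):
        headers = {}
        inject(headers)                       # writes the traceparent header
        call_service_b("/charge", headers)    # hypothetical HTTP client call

# Service B: extract the incoming context and continue the same trace.
def handle_charge(request_headers):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("POST /charge", context=ctx):
        ...                                   # both spans share one trace ID
```

When your system actually looks like this, tracing earns its keep.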
In practice, the number of workloads that meaningfully benefit from stitching together dozens of spans under a trace ID is tiny.
Most production systems do not look like textbook microservices diagrams. They process batches, streams, queues, cron jobs, background workers, ETL pipelines, and async workflows. Once you leave the synchronous request/response world, tracing collapses.
Tracing and Batch Workloads Don’t Mix
Anything that works on batches of data immediately breaks the tracing mental model.
What does a trace even mean when:
One message represents 10,000 records?
One job runs for 30 minutes?
One consumer processes messages from multiple producers?
Retries, partial failures, and replays are normal?
You end up with:
Artificial trace boundaries
Arbitrary parent/child relationships
Spans that represent work units, not causality
At that point, trace IDs become decorative metadata—not a useful debugging primitive.
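To make the mismatch concrete, here is a rough sketch of the two choices a batch consumer faces, again with the OpenTelemetry Python API. The message objects, their dict-like headers carrier, and the handle function are hypothetical:

```python
# Sketch of the batch-consumer dilemma: each message arrives with its own
# trace context, so there is no natural parent for the batch span.
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("batch-worker")

def process_batch(messages):
    # Choice 1 (common): adopt the first message's context as the parent.
    # Every other producer's trace is silently orphaned.
    parent_ctx = extract(messages[0].headers)
    with tracer.start_as_current_span("process_batch", context=parent_ctx) as span:
        span.set_attribute("batch.size", len(messages))   # illustrative attribute
        for msg in messages:
            handle(msg)

def process_batch_with_links(messages):
    # Choice 2 ("correct"): attach a span link for every producer context.
    # Most backends render links poorly, and the causal story is still a fiction.
    links = [
        trace.Link(trace.get_current_span(extract(m.headers)).get_span_context())
        for m in messages
    ]
    with tracer.start_as_current_span("process_batch", links=links):
        for msg in messages:
            handle(msg)
```

Neither option recovers real causality; both just make the trace view look populated.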
Sampling Makes Tracing Actively Worse
Tracing at scale is expensive, so everyone samples. And sampling quietly destroys most of tracing’s supposed value.
Once you sample:
You cannot reliably derive metrics from span attributes
You lose statistical validity
Rare-but-important events disappear
Latency distributions become lies
Yet tracing systems still look rich—full of spans, attributes, and diagrams—creating false confidence.
Even when users try to treat traces as a source of metrics (“latency by customer”, “errors by queue name”), sampling makes those numbers meaningless. Tracing is neither complete enough for metrics nor focused enough for debugging.
It lives in an awkward, broken middle.
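A back-of-the-envelope simulation shows how bad this gets. The numbers below (1% head-based sampling, a failure hitting 0.05% of requests) are invented but representative:

```python
# Toy simulation: why metrics derived from head-sampled spans mislead.
import random

random.seed(7)
REQUESTS = 1_000_000
SAMPLE_RATE = 0.01        # head-based: decided before the outcome is known
ERROR_RATE = 0.0005       # roughly 500 real errors

real_errors = 0
traced_errors = 0
for _ in range(REQUESTS):
    is_error = random.random() < ERROR_RATE
    real_errors += is_error
    if random.random() < SAMPLE_RATE:
        traced_errors += is_error

print(f"real errors:              {real_errors}")
print(f"errors visible in traces: {traced_errors}")
print(f"naive scaled-up estimate: {traced_errors / SAMPLE_RATE:.0f}")
# Typical run: ~500 real errors, a handful visible in traces, and a
# scaled-up estimate that swings wildly; an unlucky run can see none at all.
```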
Auto-Instrumentation: Noise at Industrial Scale
Auto-instrumentation was supposed to lower the barrier to observability. Instead, it massively increased noise.
Out of the box, auto-instrumentation:
Produces spans for every HTTP call
Wraps every Redis command
Instruments every database query
Emits enormous numbers of low-level spans
The result:
Tens or hundreds of thousands of spans per second
Most of them irrelevant
Almost no semantic context
No opinionated aggregation
There is no zoom-out mode.
You’re dropped into a microscopic view of the system with no map, no summary, and no guidance on what matters. Engineers drown in traces while still being unable to answer basic questions like:
Which workload is unhealthy?
What changed?
Where is time actually being spent?
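For contrast, here is a rough sketch of the kind of zoom-out the tooling never gives you: rolling exported spans up into per-operation counts and totals. The span dicts below are a stand-in for whatever your exporter actually emits:

```python
# Sketch of the missing zoom-out: collapse low-level spans into a
# per-operation summary instead of staring at them one by one.
from collections import defaultdict

def summarize(spans):
    agg = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for span in spans:
        bucket = agg[span["name"]]
        bucket["count"] += 1
        bucket["total_ms"] += span["duration_ms"]
    return sorted(agg.items(), key=lambda kv: kv[1]["total_ms"], reverse=True)

# One "simple" request under auto-instrumentation: 180 cache spans,
# 3 queries, and the HTTP span you actually cared about.
spans = (
    [{"name": "GET redis", "duration_ms": 0.4}] * 180
    + [{"name": "SELECT orders", "duration_ms": 12.0}] * 3
    + [{"name": "HTTP GET /checkout", "duration_ms": 310.0}]
)
for name, stats in summarize(spans):
    print(f"{name:20s} count={stats['count']:4d} total={stats['total_ms']:7.1f} ms")
```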
Nobody Agrees What Tracing Is For
Ask ten engineers what tracing is for and you’ll get incompatible answers:
Debugging a specific exception
Finding latency bottlenecks
Understanding system topology
Performance profiling
Root cause analysis
Tracing is bad at most of these—and excellent at none of them.
This confusion leads to broken user experiences:
Debugging workflows that require manual hunting
Performance investigations built on sampled data
Profiling attempts using request-level spans
Visualizations that imply causality where none exists
A tool without a clear primary use case becomes a dumping ground.
HTTP Requests Are the Wrong Primitive
Tracing assumes that HTTP requests are the fundamental unit of work.
In many systems, they aren’t.
Real cost and failure often live in:
Database queries
Cache behavior
Queue depth
Batch size
Lock contention
Cold starts
Downstream throttling
Auto-instrumented traces explode these into thousands of tiny spans, but provide no way to reason at the right level of abstraction. You see every Redis call—but not the workload pattern that caused them.
More data, less understanding.
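If the workload pattern is what matters, the signals in the list above can be recorded directly instead of being inferred from span confetti. A sketch using the OpenTelemetry metrics API, where the instrument names, attribute keys, and the lookup_cache helper are all invented for illustration:

```python
# Sketch of instrumenting at the workload level instead of per-call spans.
from opentelemetry import metrics

meter = metrics.get_meter("order-worker")
batch_size = meter.create_histogram("worker.batch.size", unit="{record}")
queue_lag = meter.create_histogram("worker.queue.lag", unit="s")
cache_requests = meter.create_counter("worker.cache.requests")

def process(batch, lag_seconds):
    # Record the signals you actually reason about: how big the batch is,
    # how far behind the queue is, and whether the cache is doing its job.
    batch_size.record(len(batch), {"queue": "orders"})
    queue_lag.record(lag_seconds, {"queue": "orders"})
    for record in batch:
        hit = lookup_cache(record)      # hypothetical cache helper
        cache_requests.add(1, {"queue": "orders", "result": "hit" if hit else "miss"})
```

A handful of well-chosen aggregates like these answers "which workload is unhealthy?" far faster than any waterfall view.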
Conclusion
Distributed tracing is treated as a foundational pillar of observability, yet for most systems it produces enormous cost and complexity with minimal practical return. It’s time to seriously question whether this “centerpiece” earns its place—or whether we’ve simply accepted it without scrutiny.

