Engineering · 2026-08-05
OpenTelemetry traces when most of the latency is in someone else's GPU
A request that spends fifty milliseconds in our gateway and several seconds inside a provider's inference is hard to instrument with the usual OpenTelemetry patterns. Here is the small set of conventions that, after some iteration, produce a trace worth reading.
A typical request through an LLM gateway has a latency profile whose dominant component is somebody else's compute. Tens of milliseconds in our code; tens of milliseconds in network; several seconds inside a provider's inference; tens of milliseconds back. Total wall-clock often in the four-to-six-second range, of which the gateway is responsible for a couple of percent.
The standard OpenTelemetry pattern — emit a span per logical step, propagate trace context across service boundaries, build a tree — collapses on this shape. Most LLM providers do not propagate trace context, so we cannot extend our trace into theirs. The naïve thing to do is emit one span called external_call covering the multi-second provider phase and treat it as opaque. The team that has to debug a slow request then has nothing to look at for the part that mattered.
We landed on three conventions that, together, make the trace useful again.
Use the gen_ai.* semantic conventions. OpenTelemetry's Generative AI semantic conventions took shape through 2024 and 2025. They give every span over an LLM call a small set of standard attributes: gen_ai.system (which provider; newer revisions of the conventions rename this to gen_ai.provider.name), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, plus several more. Emit these on the gateway-side span that wraps the external call, sourcing the token counts from the response body or the streaming summary. Because the convention is shared, the trace store can group, alert on, and aggregate across LLM-shaped spans without each team teaching it a private vocabulary. This is the highest-yield move on the list. Before we adopted the conventions, every dashboard had its own ad-hoc filter expression; now they share one.
Distinguish queueing from generation by measuring TTFT and TBT. A four-second provider span looks the same whether the request sat in a queue for 3.8s and generated in 0.2s, or queued for 0.1s and then streamed at thirty tokens per second for the rest. The two failure modes call for different mitigations: queueing suggests failover or backoff; slow generation suggests a different model or more capacity. Both are observable, even when the provider doesn't propagate context, by measuring time-to-first-token (TTFT) and time-between-tokens (TBT) on the streaming side. Emit TTFT as a span attribute and TBT as a histogram metric per (provider, model) pair. When latency alarms trip, the on-call engineer has the decomposition without having to dig.
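To make the decomposition concrete, here is a minimal sketch of the arithmetic, using only the standard library. Decompose and its millisecond-offset input are illustrative names, not part of our gateway or of any OpenTelemetry API; in the gateway the offsets would come from the timestamps taken in the streaming loop.

```go
package main

import "fmt"

// Decompose splits one streamed completion into its two latency components.
// arrivalsMs holds each token's arrival time in milliseconds since the request
// was sent, in arrival order. It returns the time-to-first-token and the gaps
// between consecutive tokens (the TBT samples a histogram would record).
// Hypothetical helper for illustration only.
func Decompose(arrivalsMs []int64) (ttftMs int64, tbtMs []int64) {
	for i, at := range arrivalsMs {
		if i == 0 {
			ttftMs = at
			continue
		}
		tbtMs = append(tbtMs, at-arrivalsMs[i-1])
	}
	return ttftMs, tbtMs
}

func main() {
	// Two streams that both finish 4000ms after the request was sent.
	queuedTTFT, queuedTBT := Decompose([]int64{3800, 3900, 4000})
	slowTTFT, slowTBT := Decompose([]int64{100, 2050, 4000})

	// Same total latency, very different shapes.
	fmt.Println("queue-bound:     ", queuedTTFT, queuedTBT)
	fmt.Println("generation-bound:", slowTTFT, slowTBT)
}
```

The first stream has a large TTFT and small gaps, the second the reverse, which is exactly the distinction a single opaque span hides.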
Use timestamped events on a single streaming span, not one span per token. A first attempt at fine-grained streaming visibility might be a span per output token. The trace store will not enjoy this: sustained traffic produces tens of thousands of spans per second, which is not what an OpenTelemetry collector is designed to absorb. The OTel-correct shape is one span per streaming operation with timestamped events at token boundaries. Events are cheap, spans are not, and the use case here is annotating moments inside a span, which is what events are for.
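One caveat worth planning for: OpenTelemetry SDKs cap the number of events kept per span (128 by default, adjustable through OTEL_SPAN_EVENT_COUNT_LIMIT), so a long completion can overflow the cap and lose events. A rough sketch of one way to stay under it is to record every Nth token instead of every token. EventStride, ShouldRecord, and CountRecorded are hypothetical helpers, and using the request's max-token setting as the upper bound is an assumption.

```go
package main

import "fmt"

// defaultEventLimit mirrors the SDK's default per-span event limit.
const defaultEventLimit = 128

// EventStride returns how many tokens to skip between recorded events so that
// a stream of at most maxTokens tokens produces no more than budget events.
func EventStride(maxTokens, budget int) int {
	if maxTokens <= 0 || budget <= 0 {
		return 1
	}
	return (maxTokens + budget - 1) / budget // ceiling division
}

// ShouldRecord reports whether the token at tokenIndex (zero-based) gets an
// event. Index 0 is always recorded, so the first token is never lost.
func ShouldRecord(tokenIndex, stride int) bool {
	if stride <= 1 {
		return true
	}
	return tokenIndex%stride == 0
}

// CountRecorded counts how many events a stream of the given length emits.
func CountRecorded(tokens, stride int) int {
	n := 0
	for i := 0; i < tokens; i++ {
		if ShouldRecord(i, stride) {
			n++
		}
	}
	return n
}

func main() {
	stride := EventStride(4096, defaultEventLimit)
	fmt.Println("stride:", stride)
	fmt.Println("events for a 4096-token stream:", CountRecorded(4096, stride))
}
```

Sampling the events does not affect the TBT histogram, which can still be fed from every token; only the per-span annotations are thinned.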
A code skeleton, in Go, of the gateway-side instrumentation:
    // Imports assumed: "time", plus go.opentelemetry.io/otel/attribute,
    // go.opentelemetry.io/otel/codes, and go.opentelemetry.io/otel/trace.
    ctx, span := tracer.Start(ctx, "gen_ai.completion",
        trace.WithSpanKind(trace.SpanKindClient),
        trace.WithAttributes(
            attribute.String("gen_ai.system", provider.Name),
            attribute.String("gen_ai.request.model", req.Model),
            attribute.Int("gen_ai.request.max_tokens", req.MaxTokens),
        ),
    )
    defer span.End()

    start := time.Now()
    stream, err := provider.StreamComplete(ctx, req)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    var firstTokenAt time.Time
    var tokenCount int
    for ev := range stream.Events() {
        now := time.Now()
        if firstTokenAt.IsZero() {
            // First streamed event: record time-to-first-token once.
            // gen_ai.ttft_ms is a custom attribute, not one the conventions define.
            firstTokenAt = now
            span.SetAttributes(
                attribute.Int64("gen_ai.ttft_ms",
                    now.Sub(start).Milliseconds()),
            )
        }
        span.AddEvent("token", trace.WithTimestamp(now))
        tokenCount++
        _ = ev // forward to client...
    }

    // tokenCount counts stream events, which matches output tokens only if
    // the provider emits one event per token.
    span.SetAttributes(
        attribute.Int("gen_ai.usage.output_tokens", tokenCount),
        attribute.String("gen_ai.response.id", stream.ResponseID()),
    )
What this misses, deliberately: the input-token count. Emit it after the call returns, from the response body's usage block, not from a tokenizer run locally. Running the tokenizer locally is tempting (it gives the count earlier in the span) but it is wrong often enough — different models tokenize differently, and the provider may surprise you with a tokenizer change — that the emitted count would diverge from the billed count. Accuracy after the fact beats a fast estimate.
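As a sketch of the after-the-fact approach, the snippet below reads the counts out of a response body. The JSON shape (a top-level usage object with input_tokens and output_tokens) is an assumption for illustration; real providers name and nest these fields differently, so ParseUsage would need a per-provider variant.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Usage holds the provider-reported token counts. The field names are a
// hypothetical response shape, not any specific provider's schema.
type Usage struct {
	InputTokens  int `json:"input_tokens"`
	OutputTokens int `json:"output_tokens"`
}

// response models only the parts of the body this sketch needs.
type response struct {
	ID    string `json:"id"`
	Usage *Usage `json:"usage"`
}

// ParseUsage extracts the usage block from a provider response body.
// It returns an error rather than a guess when the block is absent, so the
// span attribute is either the billed count or missing, never an estimate.
func ParseUsage(body []byte) (Usage, error) {
	var r response
	if err := json.Unmarshal(body, &r); err != nil {
		return Usage{}, fmt.Errorf("decode response: %w", err)
	}
	if r.Usage == nil {
		return Usage{}, errors.New("response has no usage block")
	}
	return *r.Usage, nil
}

func main() {
	body := []byte(`{"id":"resp_1","usage":{"input_tokens":42,"output_tokens":7}}`)
	u, err := ParseUsage(body)
	if err != nil {
		fmt.Println("no usage:", err)
		return
	}
	// In the gateway these values would be set as gen_ai.usage.input_tokens
	// and gen_ai.usage.output_tokens on the span before it ends.
	fmt.Println("input tokens:", u.InputTokens, "output tokens:", u.OutputTokens)
}
```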
A smaller note. Some providers do propagate trace context, and the temptation is to set the W3C traceparent header on the outbound call so the trace extends naturally. Our default is to strip traceparent from outbound provider calls unless there is an explicit per-provider opt-in. The reasoning is that a propagated header is a correlation identifier a third party may persist, and the right default for outbound LLM calls is the same as the right default for outbound HTTP in general: don't decorate the request with our internal identifiers unless we mean to.
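One way to enforce that default is at the HTTP transport layer. The following is a minimal sketch using net/http only: StripTrace, its AllowPropagation field, and the Probe demo helper are hypothetical names, and stripping tracestate and baggage alongside traceparent is an assumption that goes beyond what is described above.

```go
package main

import (
	"fmt"
	"net/http"
)

// traceHeaders lists the propagation headers removed by default.
var traceHeaders = []string{"traceparent", "tracestate", "baggage"}

// StripTrace wraps another RoundTripper and removes trace-context headers
// unless the provider has been explicitly opted in.
type StripTrace struct {
	Next             http.RoundTripper
	AllowPropagation bool // per-provider opt-in
}

func (s StripTrace) RoundTrip(req *http.Request) (*http.Response, error) {
	if !s.AllowPropagation {
		// RoundTrippers must not modify the caller's request, so clone first.
		req = req.Clone(req.Context())
		for _, h := range traceHeaders {
			req.Header.Del(h)
		}
	}
	return s.Next.RoundTrip(req)
}

// recorder is a fake transport that captures headers instead of dialing out.
type recorder struct{ seen http.Header }

func (r *recorder) RoundTrip(req *http.Request) (*http.Response, error) {
	r.seen = req.Header.Clone()
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// Probe sends one request through StripTrace and reports which headers
// reached the (fake) wire. Demo helper only.
func Probe(optIn bool) (sawTraceparent, sawAuth bool) {
	rec := &recorder{}
	client := &http.Client{Transport: StripTrace{Next: rec, AllowPropagation: optIn}}
	req, err := http.NewRequest(http.MethodPost, "https://provider.example/v1/complete", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	req.Header.Set("Authorization", "Bearer test")
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	return rec.seen.Get("traceparent") != "", rec.seen.Get("Authorization") != ""
}

func main() {
	tp, auth := Probe(false)
	fmt.Println("default: traceparent sent =", tp, "authorization sent =", auth)
	tp, auth = Probe(true)
	fmt.Println("opt-in:  traceparent sent =", tp, "authorization sent =", auth)
}
```

If an instrumented transport injects the header itself, this wrapper has to sit closer to the wire than that instrumentation, otherwise the header is added after the strip.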
The metrics half of the observability story is shorter, because the conventions handle most of it. Export histograms per (provider, model) for TTFT, TBT, total latency, input tokens, output tokens, and per-call cost. Export counters per (provider, model, status) for completions, failures, retries, and policy-eliminations. Both feed alerting that fires on regressions in any of those dimensions, not on absolute thresholds. Absolute thresholds age badly when models change underneath you; relative thresholds — this provider's TTFT is 2σ slower than its trailing seven-day baseline — age much better.
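The relative-threshold check is simple enough to sketch. The function below flags a value that sits more than k standard deviations above the mean of a trailing baseline; IsRegression, the sample values, and the choice of sample (n-1) standard deviation are illustrative assumptions, and a production alert rule would normally live in the metrics backend instead of application code.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and sample standard deviation of xs.
// It expects at least two values.
func meanStd(xs []float64) (mean, sd float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var ss float64
	for _, x := range xs {
		d := x - mean
		ss += d * d
	}
	return mean, math.Sqrt(ss / float64(len(xs)-1))
}

// IsRegression reports whether current is more than k standard deviations
// slower than the baseline mean. With fewer than two baseline points there is
// no spread to compare against, so it reports false.
func IsRegression(baseline []float64, current, k float64) bool {
	if len(baseline) < 2 {
		return false
	}
	mean, sd := meanStd(baseline)
	return current > mean+k*sd
}

func main() {
	// Hypothetical daily median TTFT values (ms) for one (provider, model) pair.
	trailing := []float64{400, 420, 380, 410, 390}
	fmt.Println("450ms regressed:", IsRegression(trailing, 450, 2))
	fmt.Println("425ms regressed:", IsRegression(trailing, 425, 2))
}
```

Because the baseline moves with the provider, the same rule keeps working after a model swap shifts typical TTFT, where a fixed millisecond threshold would need retuning.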

