Welcome to the 3rd installment of “OTel in a minute” (or so) – short bites of OpenTelemetry you can use. Today, we’re going to take a look at something called “tail sampling”, a way of sampling our telemetry that is typically applied to our distributed traces.
So why sample telemetry at all? In short, because it’s not cost-effective to send all of our traces to be ingested by our observability vendors, and furthermore, it creates a lot of potential noise that our engineering and dev teams shouldn’t have to wade through.
There are different types of sampling that can be done – many current vendors, for example, support head or probabilistic sampling, where the decision about whether to keep a trace is made up front and often at random. This can lead to poor outcomes, though: we hear many eng and dev teams complaining that when they have an issue in production, such as high latency or errors, they turn to their observability vendor only to find that those traces have been thrown away!
“Tail sampling”, on the other hand, waits until the entire trace has completed (or until some reasonable amount of time has passed) before deciding whether to sample the trace – this way we can look at the complete properties of the trace and the spans that make it up before deciding whether to keep it and send it on to our observability backend or vendor. Now, for example, we can ensure we sample all of the traces that have “high” latency or that contain errors, focusing our eng and dev teams on the key signals of interest while reducing cost.
With a sufficient sample of “good” behavior, plus the traces we keep for these “outliers”, we can bring better outcomes to our observability and tracing efforts, as shown visually in this OpenTelemetry blog post.
The OTel collector supports this through the “tail sampling processor”, where these policies can be configured. Let’s take a look at our OTel collector configuration – we’ve got our two key policies configured here, one to sample traces with high latency and one to sample traces with errors (see the sketch below). We’ve also configured our favorite “tap processor” from episode 1 in our traces pipeline so we can look at the telemetry before and after it goes through the tail sampling processor.
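Here’s a minimal sketch of what that configuration could look like, assuming an OTLP receiver and a debug exporter – the receiver, exporter, and the 200ms threshold are illustrative placeholders, and the tap processor from episode 1 is left out for brevity:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    # how long to wait after the first span of a trace before deciding
    decision_wait: 10s
    policies:
      # keep any trace whose overall latency exceeds the threshold
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 200
      # keep any trace containing a span with ERROR status
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

exporters:
  debug:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [debug]
```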
We run this configuration in our collector and have it receive telemetry from Otelgen – and using some “jq” magic (something like the snippet below) we can compute the span latencies before and after the processor, shown on the left and right of the screen respectively.
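As a rough sketch, assuming the collector writes its traces as OTLP/JSON to files (for example via the file exporter) and using hypothetical file names, the latency calculation could look like this:

```sh
# Compute each span's latency in milliseconds from OTLP/JSON output.
# traces-before.json / traces-after.json are hypothetical file names.
jq -r '.resourceSpans[].scopeSpans[].spans[]
       | "\(.name): \(((.endTimeUnixNano|tonumber) - (.startTimeUnixNano|tonumber)) / 1e6) ms"' \
  traces-before.json
```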
We are looking at individual spans here and their latencies: the spans with latency under 200ms do not get sampled and don’t come out of the pipeline, while spans of 200ms or more do, since if an individual span in our trace lasts 200ms, the entire trace must be at least 200ms long and therefore matches the latency policy.
We can do a similar verification if we focus our collector configuration on errors only [show side by side errors] – in this case the status is not being set on our spans, and since none of them contain an error, none of them are sampled or come out of the traces pipeline.
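A trimmed-down version of that check might keep only the status-code policy, again just as a sketch:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # keep only traces that contain at least one span with ERROR status;
      # spans whose status is left unset will not match this policy
      - name: errors-only
        type: status_code
        status_code:
          status_codes: [ERROR]
```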
Well, that’s today’s OTel in a minute (or so) – we’ve seen that the OTel collector can be a powerful tool for both reducing costs and increasing insight into our traces through tail sampling – see the blog on our website for step-by-step instructions. Until next time.