Ziinc
👋 I'm a dev at Supabase

I work on logging and analytics, and manage the underlying service that powers Supabase Logs and Logflare. The service handles over a billion requests each day, with traffic constantly growing, and these devlog posts talk a bit about my day-to-day open source dev work.

It serves as some insight into what one can expect when working on high-performance and high-availability software, with real code snippets and PRs to boot. Enjoy! 😊

This week, I'm implementing OpenTelemetry, which generates traces of our HTTP requests to Logflare, the underlying analytics server of Supabase. For Elixir, we have the following dependencies that we need to add:

# mix.exs
[
  ...
  {:opentelemetry, "~> 1.3"},
  {:opentelemetry_api, "~> 1.2"},
  {:opentelemetry_exporter, "~> 1.6"},
  {:opentelemetry_phoenix, "~> 1.1"},
  {:opentelemetry_cowboy, "~> 0.2"}
]

A quick explanation of each package:

  • :opentelemetry - the underlying core Erlang modules that implement the OpenTelemetry Spec
  • :opentelemetry_api - the Erlang/Elixir API for starting custom traces
  • :opentelemetry_exporter - the functionality that hooks into the recorded traces and exports them to an external destination
  • :opentelemetry_phoenix - automatic tracing for the Phoenix framework
  • :opentelemetry_cowboy - automatic tracing for the cowboy webserver

Excluding ingestion and querying routes

Logflare handles a ton of ingestion and querying requests every second, and if we were to trace every single one of them, we would generate a huge amount of traces. This would be neither desirable nor useful, because storage costs would be quite high and most of the traces would be noise.

What we need is to exclude these specific API routes, but record the rest. We don't want to record every remaining request either, as a sample of a large amount of traffic usually suffices for a good analysis of overall performance.

Of course, a sampled dataset of traces is not wholly representative of real-world performance. For practical purposes, however, we will be using the OpenTelemetry traces to optimize the happy paths of the majority of requests.

In order to do so, I had to implement a custom sampler for OpenTelemetry. The main pull request is here, and I'll break down some parts of the code for easy digestion.

Configuration Adjustments

We need to make the configuration flexible enough to allow for self-hosting users to increase/decrease the default sampling probability. This also allows us to configure the sampling probability differently for different clusters, such as having higher sampling for our canary cluster.

# runtime.exs
if System.get_env("LOGFLARE_OTEL_ENDPOINT") do
  config :logflare, opentelemetry_enabled?: true

  config :opentelemetry,
    traces_exporter: :otlp,
    sampler:
      {:parent_based,
       %{
         root:
           {LogflareWeb.OpenTelemetrySampler,
            %{
              probability:
                System.get_env("LOGFLARE_OPEN_TELEMETRY_SAMPLE_RATIO", "0.001")
                |> String.to_float()
            }}
       }}
end

Lines on GitHub

We set our custom sampler, LogflareWeb.OpenTelemetrySampler, as the :root sampler of the built-in :parent_based sampler, so it makes the decision for root spans while child spans follow their parent's decision. The :probability option is passed to the sampler as a map key.
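One thing to keep in mind with this configuration is that String.to_float/1 raises on values without a decimal point, such as "1". A minimal sketch of a more forgiving parser is shown here, assuming we would rather fall back to the default than crash at boot. The module name SampleRatioSketch and its fallback behaviour are hypothetical and not part of Logflare:

```elixir
defmodule SampleRatioSketch do
  # Hypothetical helper, not part of Logflare: parses a sampling ratio from a
  # string and falls back to a default when the value is missing or invalid.
  @default 0.001

  def parse(nil), do: @default

  def parse(value) when is_binary(value) do
    case Float.parse(String.trim(value)) do
      # Accept only fully consumed input that lies within 0.0..1.0.
      {ratio, ""} when ratio >= 0.0 and ratio <= 1.0 -> ratio
      # Anything else (garbage, trailing text, out of range) uses the default.
      _ -> @default
    end
  end
end

# Usage: the environment variable name is the one from runtime.exs above.
System.get_env("LOGFLARE_OPEN_TELEMETRY_SAMPLE_RATIO")
|> SampleRatioSketch.parse()
```

Whether silently falling back is preferable to failing fast is a judgment call; raising on a bad value makes misconfiguration more visible.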

We also conditionally start the OpenTelemetry setup code for the Cowboy and Phoenix plugins based on whether the OpenTelemetry exporting endpoint is provided:

# lib/logflare/application.ex
if Application.get_env(:logflare, :opentelemetry_enabled?) do
  :opentelemetry_cowboy.setup()
  OpentelemetryPhoenix.setup(adapter: :cowboy2)
end

Lines on GitHub

This is the :opentelemetry_enabled? flag that we set in runtime.exs above.

Custom Sampler

The custom OpenTelemetry sampler works by wrapping the base sampler :otel_sampler_trace_id_ratio_based with our own module. The logic is in two main portions of the module: the setup/1 callback, and the should_sample/7 callback.

In the setup/1 callback, we delegate to :otel_sampler_trace_id_ratio_based.setup/1 with the probability float as input. This generates a map with two keys: the probability as is, and something called :id_upper_bound.

# lib/logflare/open_telemetry_sampler.ex
@impl :otel_sampler
def setup(opts) do
  :otel_sampler_trace_id_ratio_based.setup(opts.probability)
end

How the trace ID sampling works is that each trace has a generated ID, which is a very large integer like 75141356756228984281078696925651880580. A bitwise AND is performed between the trace ID and a hardcoded max trace ID value, and the result is compared against the upper bound ID. If it is smaller than the upper bound ID, the trace is recorded and sampled; otherwise it is dropped. This is implementation-specific and out of scope for this blog post, but you can read more in the TraceIdRatioBased sampler section of the OpenTelemetry specification.
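To make the idea concrete, here is a rough sketch of that decision in plain Elixir. It is not the library's source: the module name, the function names, and the mask value (assumed here to be the largest signed 64-bit integer) are illustrative assumptions.

```elixir
defmodule TraceIdRatioSketch do
  # Assumed mask for illustration: the largest signed 64-bit integer (2^63 - 1).
  @max_value 9_223_372_036_854_775_807

  # Mirrors the shape described above: the probability as is, plus an upper
  # bound scaled from the probability.
  def setup(probability)
      when is_float(probability) and probability >= 0.0 and probability <= 1.0 do
    %{probability: probability, id_upper_bound: probability * @max_value}
  end

  # Keep only the low bits of the trace ID, then compare to the upper bound.
  def decide(trace_id, %{id_upper_bound: upper_bound}) when is_integer(trace_id) do
    masked = Bitwise.band(trace_id, @max_value)
    if masked < upper_bound, do: :record_and_sample, else: :drop
  end
end

config = TraceIdRatioSketch.setup(0.5)
TraceIdRatioSketch.decide(1, config)
```

Because trace IDs are effectively random, roughly the configured fraction of them falls below the upper bound, and every service that sees the same trace ID reaches the same decision without coordination.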

Here is the code. For brevity's sake, I have omitted the arguments of should_sample/7 in both the function definition and the delegated call:

# lib/logflare/open_telemetry_sampler.ex
@impl :otel_sampler
def should_sample(...) do
  tracestate = Tracer.current_span_ctx(ctx) |> OpenTelemetry.Span.tracestate()

  # Hot ingestion and querying routes are never traced.
  exclude_route? =
    case Map.get(attributes, "http.target") do
      "/logs" <> _ -> true
      "/api/logs" <> _ -> true
      "/api/events" <> _ -> true
      "/endpoints/query" <> _ -> true
      "/api/endpoints/query" <> _ -> true
      _ -> false
    end

  if exclude_route? do
    {:drop, [], tracestate}
  else
    # Everything else goes through the ratio-based sampler.
    :otel_sampler_trace_id_ratio_based.should_sample(...)
  end
end

Lines on GitHub

Here, because this will be a highly executed code path, we use a case for the path check instead of multiple if-do blocks or a cond-do, because binary pattern matching in Elixir is very performant. Furthermore, binary pattern matching is perfect for our situation because we only need to check the first part of the HTTP route that is called, instead of the whole path.
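As a standalone illustration of the prefix check, the sketch below lifts the case expression into a hypothetical module (RouteFilterSketch is not part of Logflare) so that its behaviour can be tried in IEx:

```elixir
defmodule RouteFilterSketch do
  # Hypothetical module for illustration; the prefixes mirror the sampler above.
  def exclude?(attributes) when is_map(attributes) do
    case Map.get(attributes, "http.target") do
      "/logs" <> _ -> true
      "/api/logs" <> _ -> true
      "/api/events" <> _ -> true
      "/endpoints/query" <> _ -> true
      "/api/endpoints/query" <> _ -> true
      # Also covers a missing attribute, where Map.get/2 returns nil.
      _ -> false
    end
  end
end

RouteFilterSketch.exclude?(%{"http.target" => "/api/logs?source=abc"})
RouteFilterSketch.exclude?(%{"http.target" => "/dashboard"})
```

Note that a prefix match is deliberately loose: a path such as "/logs-archive" would also be excluded, since only the leading bytes are compared.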

The code is relatively simple: it delegates to :otel_sampler_trace_id_ratio_based.should_sample/7 if the request is not on one of the excluded paths. If it is on one of these hot paths, we drop the trace. As this sampler works on the parent, it will drop all child traces as well.
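The way a root decision carries over to child spans can be modelled in a few lines. This is a simplified sketch of the default parent-based behaviour, not the library's code: the real :parent_based sampler also distinguishes remote from local parents and lets each case be configured, and the module and field names here are hypothetical.

```elixir
defmodule ParentBasedSketch do
  # `parent` is nil for a root span, or a map carrying the parent's decision.
  # `root_sampler` is a zero-arity function standing in for our custom sampler.

  # No parent: this is a root span, so ask the root sampler.
  def decide(nil, root_sampler), do: root_sampler.()

  # Parent was sampled: the child is sampled too.
  def decide(%{sampled?: true}, _root_sampler), do: :record_and_sample

  # Parent was dropped: the child is dropped too.
  def decide(%{sampled?: false}, _root_sampler), do: :drop
end

# A root span on an excluded route is dropped by the root sampler...
root_decision = ParentBasedSketch.decide(nil, fn -> :drop end)

# ...and its children then inherit that decision.
ParentBasedSketch.decide(%{sampled?: root_decision == :record_and_sample}, fn ->
  :record_and_sample
end)
```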

Arguably, we could optimize this even further by rewriting the conditional delegation into multiple function heads, pattern matching on the attributes argument and doing the binary check within the function guard. As always, premature optimization is the enemy of all software engineers, so I'll defer this refactor until the next time I need to improve this module.
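For the curious, such a refactor might look roughly like the sketch below. It is an assumption about the shape, not the eventual implementation: decide/2 stands in for should_sample/7, the fallback function stands in for the wrapped ratio-based sampler, and the prefix check is done by pattern matching in the function head rather than in a guard.

```elixir
defmodule SamplerHeadsSketch do
  # Hypothetical sketch: one function head per excluded route prefix.
  def decide(%{"http.target" => "/logs" <> _}, _fallback), do: :drop
  def decide(%{"http.target" => "/api/logs" <> _}, _fallback), do: :drop
  def decide(%{"http.target" => "/api/events" <> _}, _fallback), do: :drop
  def decide(%{"http.target" => "/endpoints/query" <> _}, _fallback), do: :drop
  def decide(%{"http.target" => "/api/endpoints/query" <> _}, _fallback), do: :drop

  # Any other route (or a missing attribute) is handed to the base sampler.
  def decide(attributes, fallback) when is_function(fallback, 1) do
    fallback.(attributes)
  end
end

SamplerHeadsSketch.decide(%{"http.target" => "/dashboard"}, fn _attrs ->
  :record_and_sample
end)
```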

Wrap up

And that is how you implement a custom OpenTelemetry sampler!