Wiring Sentry/GlitchTip across 12 Go services without breaking the hot path

When you have one Go service, error tracking is a half-hour job: drop in sentry-go, init it from SENTRY_DSN, call sentry.CaptureException in the few places that matter, ship. When you have twelve Go services, that same half-hour decision becomes a tax that compounds: every service grows its own slightly different init code, its own slightly different middleware, its own opinion about what "release tag" means. By the time a production panic happens, you discover that three services aren't initialising the SDK at all because someone forgot the env var in the deployment manifest.

We just finished that wiring at Elido — twelve Go services plus an audit-chain backfill CLI plus three Next.js apps plus two Node services, all feeding a self-hosted GlitchTip at sentry.elido.app. The interesting parts weren't the SDK calls. They were the shape of the shared package that makes the SDK calls disappear into one line per service, and the constraints that fall out of needing the middleware on edge-redirect's hot path without burning the p95 15ms budget.

This post is a complete account of how the wiring works, what we got right, and the two compromises we made deliberately.

TL;DR

One shared package, pkg/sentryinit, replaces twelve func main copies. Adding a new service is a single defer sentryinit.Init(logger, "service-name")() plus one middleware line.
ChiMiddleware() auto-captures panics and non-panicking 5xx responses on warm-path services. FastHTTPMiddleware() does the same for edge-redirect and is zero-alloc on the happy path — verified by a benchmark that ships in the package.
We chose GlitchTip (Sentry-compatible, self-hosted) over Sentry SaaS for EU residency. The SDK is unchanged.
The hot path explicitly does NOT call sentry.CaptureException from handler code. All capture happens at the middleware boundary, where the cost only materialises when there's something to report.

Why a shared package, not twelve copies

The minimum viable Sentry wiring in Go is six lines:

sentry.Init(sentry.ClientOptions{
    Dsn:              os.Getenv("SENTRY_DSN"),
    Environment:      os.Getenv("ENV"),
    Release:          os.Getenv("ELIDO_VERSION"),
    ServerName:       "api-core",
    AttachStacktrace: true,
})
defer sentry.Flush(2 * time.Second)

Six lines, twelve services. Seventy-two lines that diverge over time. The problem isn't the count — it's the drift. One service forgets Release. Another sets Environment from a slightly differently named env var. A third has a one-second flush and loses events on a fast SIGTERM. The behaviour of error tracking across the fleet stops being a property of the platform and starts being a property of whichever engineer wrote that service's main.go.

pkg/sentryinit is the un-clever fix. It lives in the Go workspace, every service requires it via a local replace directive, and the call site is one line:

defer sentryinit.Init(logger, "api-core")()
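
The trailing () in that line matters: Init runs immediately at the defer statement, and only the function it returns is deferred. A minimal sketch of that evaluation order, using a hypothetical initTracker in place of sentryinit.Init:

```go
package main

import "fmt"

// events records the order in which things happen.
var events []string

// initTracker is a hypothetical stand-in for sentryinit.Init: it does its
// setup immediately and returns a cleanup function for the caller to defer.
func initTracker(name string) func() {
	events = append(events, "init "+name)
	return func() { events = append(events, "flush "+name) }
}

// run mimics a service's main function.
func run() {
	// initTracker(...) is evaluated right here; only the returned
	// function waits until run returns.
	defer initTracker("api-core")()
	events = append(events, "serve")
}

func main() {
	run()
	fmt.Println(events) // [init api-core serve flush api-core]
}
```

One consequence worth keeping in mind: deferred functions do not run when the process leaves through os.Exit or log.Fatal, so a main that exits that way skips the flush.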

The package itself is small. The whole runtime surface is one Init function, two HTTP middlewares (chi and net/http), one fasthttp middleware, and a debug endpoint for proving the wiring end-to-end in production. The relevant bits of the implementation:

func Init(logger *zap.Logger, serverName string) func() {
    dsn := os.Getenv("SENTRY_DSN")
    if dsn == "" {
        return func() {}
    }
    env := os.Getenv("ENV")
    if env == "" {
        env = "production"
    }
    release := os.Getenv("ELIDO_VERSION")
    if err := sentry.Init(sentry.ClientOptions{
        Dsn:              dsn,
        Environment:      env,
        Release:          release,
        ServerName:       serverName,
        AttachStacktrace: true,
        EnableTracing:    false,
        SampleRate:       1.0,
        IgnoreErrors: []string{
            "context canceled",
            "http: Server closed",
        },
    }); err != nil {
        if logger != nil {
            logger.Warn("sentry init failed", zap.Error(err), zap.String("service", serverName))
        }
        return func() {}
    }
    sentry.ConfigureScope(func(scope *sentry.Scope) {
        scope.SetTag("service", serverName)
    })
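    // flushTimeout is defined elsewhere in the package; Flush blocks for at
    // most that long while queued events are sent.
    return func() { sentry.Flush(flushTimeout) }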
}

Three things in that snippet that earn their lines.

First, the empty-DSN early return. Local development doesn't have a DSN. CI tests don't either. As far as we can tell, sentry-go treats an empty DSN as a disabled client rather than an error, but returning early makes that an explicit decision in our own code: no client is built, no scope is configured, and the returned cleanup is a no-op. The early return means the call site never has to branch: defer sentryinit.Init(logger, "api-core")() is correct in every environment.

Second, the service tag pinned on the global scope. GlitchTip already segments events by project (one project per service), but the tag lets cross-project searches and dashboards filter by service slug without having to parse the DSN's project ID. When the same panic class appears in three services within an hour, the tag makes that pattern findable in one query.

Third, IgnoreErrors. context canceled is what every gRPC client returns when a downstream request is cancelled by an upstream timeout — a normal control-flow event in a chained microservice graph, not a bug. http: Server closed is what the stdlib HTTP server returns during graceful shutdown. Both produce noise that drowns the signal. The deny-list filters them before they reach the queue.

Wiring a new service is appending it to go.work, dropping a one-line require + replace in the service's go.mod, and adding the defer line in main.go. That's the contract. Everything else — flush timeout, sample rate, ignored error patterns — is centralised.
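
For illustration, the two module-level changes might look like the fragment below. The module paths and directory layout are invented for the example; only the use and replace mechanics are the point.

```
// go.work (repo root): add the new service's directory
use (
    ./pkg/sentryinit
    ./services/api-core
    ./services/new-service
)

// services/new-service/go.mod: require the shared package, resolve it locally
require example.com/elido/pkg/sentryinit v0.0.0

replace example.com/elido/pkg/sentryinit => ../../pkg/sentryinit
```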

The chi middleware

On warm-path services — api-core, analytics-api, billing, domain-manager, search, url-scanner, qr-generator, metadata-fetcher, webhook-dispatcher — the auto-capture surface is HTTP. A handler can panic, or it can return a 5xx without panicking, and we want both visible.

The naïve approach is to use sentry-go/http's built-in Handle middleware. We didn't, for two reasons. First, that middleware always starts a transaction even when EnableTracing is false — wasted allocation on every request. Second, it captures panics but not non-panicking 5xx responses, which means a handler that returns 503 because Postgres dropped the connection stays invisible.

The replacement is small:

func ChiMiddleware() func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            hub := sentry.GetHubFromContext(r.Context())
            if hub == nil {
                hub = sentry.CurrentHub().Clone()
                r = r.WithContext(sentry.SetHubOnContext(r.Context(), hub))
            }
            hub.Scope().SetRequest(r)

            ww := chimw.NewWrapResponseWriter(w, r.ProtoMajor)
            defer func() {
                if rvr := recover(); rvr != nil {
                    if rvr == http.ErrAbortHandler {
                        panic(rvr)
                    }
                    hub.RecoverWithContext(r.Context(), rvr)
                    if ww.Status() == 0 {
                        ww.WriteHeader(http.StatusInternalServerError)
                    }
                    return
                }
                if status := ww.Status(); status >= 500 && status < 600 {
                    hub.WithScope(func(scope *sentry.Scope) {
                        scope.SetLevel(sentry.LevelError)
                        scope.SetTag("status_code", strconv.Itoa(status))
                        hub.CaptureMessage(fmt.Sprintf("HTTP %d %s %s", status, r.Method, r.URL.Path))
                    })
                }
            }()

            next.ServeHTTP(ww, r)
        })
    }
}

The hub is cloned per request and stored on the context. That lets handlers attach domain-specific breadcrumbs (sentry.GetHubFromContext(r.Context()).AddBreadcrumb(...)) without leaking into other in-flight requests. The chi-internal WrapResponseWriter preserves the http.Flusher / http.Hijacker / http.Pusher interfaces — some chi middleware downstream peeks at those, and a hand-rolled wrapper loses them. For services that don't use chi (click-ingester and analytics-export mount plain http.ServeMux), the package ships a stdlib-only twin called HTTPMiddleware().

A subtle bit of behaviour: http.ErrAbortHandler is re-panicked rather than captured. It is the stdlib's sentinel panic value for aborting a handler on purpose, typically because the client went away, and net/http recovers it without logging a stack trace. Capturing it as an exception would flood the queue with non-bugs.

The wiring is identical across the warm-path services:

r := chi.NewRouter()
r.Use(middleware.RequestID)
r.Use(middleware.RealIP)
r.Use(sentryinit.ChiMiddleware())
r.Use(oteltrace.ChiMiddleware("api-core"))
// ... rest of the middleware stack

sentryinit.ChiMiddleware goes before oteltrace.ChiMiddleware so panics in the tracing layer still get captured.

The hard part: fasthttp on the redirect hot path

edge-redirect is a different animal. Its budget is p50 5ms / p95 15ms on a cache hit, measured across three production POPs. Anything that allocates per request shows up in the GC profile and eventually in the p99 tail. The chi middleware above is fine for warm-path services that allocate freely; on the edge it would be a problem.

sentry-go/fasthttp.Handle was a non-starter for the same reason sentry-go/http.Handle was: it builds an http.Request snapshot on every request, including the happy path, even when there's nothing to report. For a service serving thousands of requests per second per POP, that's thousands of unnecessary http.Request structs per second per POP.

The fasthttp middleware in pkg/sentryinit flips the cost model: nothing allocates until there's actually something to capture.

func FastHTTPMiddleware() func(fasthttp.RequestHandler) fasthttp.RequestHandler {
    return func(next fasthttp.RequestHandler) fasthttp.RequestHandler {
        return func(ctx *fasthttp.RequestCtx) {
            defer func() {
                if rvr := recover(); rvr != nil {
                    if rvr == http.ErrAbortHandler {
                        panic(rvr)
                    }
                    hub := sentry.CurrentHub().Clone()
                    req := fasthttpRequestSnapshot(ctx)
                    hub.Scope().SetRequest(req)
                    hub.RecoverWithContext(
                        context.WithValue(context.Background(), sentry.RequestContextKey, req),
                        rvr,
                    )
                    ctx.Response.Reset()
                    ctx.SetStatusCode(fasthttp.StatusInternalServerError)
                    return
                }
                if status := ctx.Response.StatusCode(); status >= 500 && status < 600 {
                    hub := sentry.CurrentHub().Clone()
                    req := fasthttpRequestSnapshot(ctx)
                    hub.WithScope(func(scope *sentry.Scope) {
                        scope.SetRequest(req)
                        scope.SetLevel(sentry.LevelError)
                        scope.SetTag("status_code", strconv.Itoa(status))
                        hub.CaptureMessage("HTTP " + strconv.Itoa(status) + " " + string(ctx.Method()) + " " + string(ctx.Path()))
                    })
                }
            }()
            next(ctx)
        }
    }
}

The shape is the same as the chi version, but the hub clone and request-snapshot construction are pushed inside the recover / 5xx branches. On a 302 cache-hit response, the overwhelmingly common case, the defer body fires, recover() returns nil, the status check returns false, and nothing else runs. The defer itself is cheap too: since Go 1.14 the compiler can open-code a single defer that sits outside a loop, so the deferred call costs about as much as a direct call and the closure stays off the heap.

There's a benchmark in the package (fasthttp_test.go) that pins this down:

func BenchmarkFastHTTPMiddleware_HappyPath(b *testing.B) {
    noop := func(ctx *fasthttp.RequestCtx) {
        ctx.SetStatusCode(fasthttp.StatusFound)
    }
    wrapped := FastHTTPMiddleware()(noop)

    ctx := &fasthttp.RequestCtx{}
    ctx.Init(&ctx.Request, &net.TCPAddr{IP: net.ParseIP("127.0.0.1"), Port: 1234}, nil)
    ctx.Request.SetRequestURI("/abc123")
    ctx.Request.Header.SetMethod("GET")
    ctx.Request.Header.SetHost("f.elido.me")

    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        wrapped(ctx)
    }
}

Paired with BenchmarkFastHTTPHandler_Bare (same handler, no middleware), the delta on a 2024 M3 dev box is in the noise, and the wrapped version reports zero additional allocations per op. On the happy path, the Sentry middleware on edge-redirect adds no allocations and no latency we can measure. It costs something only when there's a panic or a 5xx, which is precisely when you don't mind paying.
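
The lazy-capture shape can also be checked in isolation, without fasthttp or Sentry. The sketch below is not the package's benchmark: reqCtx is a hypothetical stand-in for *fasthttp.RequestCtx, the reported slice stands in for the capture calls, and testing.AllocsPerRun does the counting.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// reqCtx is a hypothetical stand-in for *fasthttp.RequestCtx.
type reqCtx struct {
	status int
	path   string
}

type handler func(*reqCtx)

// reported collects messages produced on the slow path.
var reported []string

// lazyCapture does all allocating work inside the panic and 5xx branches,
// so a normal response only pays for the deferred call and two checks.
func lazyCapture(next handler) handler {
	return func(c *reqCtx) {
		defer func() {
			if rvr := recover(); rvr != nil {
				reported = append(reported, "panic on "+c.path)
				c.status = 500
				return
			}
			if c.status >= 500 && c.status < 600 {
				reported = append(reported, "HTTP "+strconv.Itoa(c.status)+" "+c.path)
			}
		}()
		next(c)
	}
}

func main() {
	c := &reqCtx{path: "/abc123"}
	redirect := lazyCapture(func(c *reqCtx) { c.status = 302 })
	failing := lazyCapture(func(c *reqCtx) { c.status = 503 })

	// testing.AllocsPerRun reports average heap allocations per call.
	allocs := testing.AllocsPerRun(1000, func() { redirect(c) })
	fmt.Println("happy path allocs per call:", allocs)

	failing(c)
	fmt.Println(reported)
}
```

The string concatenation and the append only execute when the status is a 5xx or a panic was recovered, which is the same cost model the fasthttp middleware relies on.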

The wiring in edge-redirect's main.go is one line:

rootHandler := sentryinit.FastHTTPMiddleware()(h.Route)

What this explicitly does NOT do: it doesn't sprinkle sentry.CaptureException calls through the redirect handler itself. The handler stays the way the latency budget needs it to — no Sentry awareness, no per-request allocation for error-tracking purposes. The middleware boundary is the only place capture happens, and the middleware boundary is structurally free on the happy path.

This is a deliberate compromise. If edge-redirect has a logic bug that produces a wrong destination URL without crashing or returning 5xx — say, a misconfigured rule that routes EU traffic to the wrong fallback — Sentry won't see it. The bot dashboards and the synthetic monitoring will. The trade is that we keep the redirect cheap; observability for non-error correctness lives outside the SDK.

Why GlitchTip, not Sentry SaaS

A GDPR-first product writing customer data to a US-hosted error-tracking service is a contradiction that auditors notice. Stack traces from api-core include URL paths, occasionally tenant IDs, sometimes IP addresses (we redact them via Sentry's BeforeSend hook, but the redaction can be bypassed by mistake). The cleanest path is keeping the data plane inside our own EU region.
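
As an illustration of the kind of scrubbing a BeforeSend hook could apply to event text, here is a minimal sketch. It is not our actual redaction code: scrubIPs is a hypothetical helper, it covers IPv4 only, and its pattern is deliberately loose.

```go
package main

import (
	"fmt"
	"regexp"
)

// ipv4Like matches dotted-quad sequences. It is deliberately loose: it also
// matches out-of-range values such as 999.1.1.1 and four-part version
// strings, which is acceptable when the goal is redaction.
var ipv4Like = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)

// scrubIPs replaces anything that looks like an IPv4 address.
func scrubIPs(s string) string {
	return ipv4Like.ReplaceAllString(s, "[ip]")
}

func main() {
	msg := "dial tcp 203.0.113.7:5432: connect: connection refused"
	fmt.Println(scrubIPs(msg)) // dial tcp [ip]:5432: connect: connection refused
}
```

The example also shows why redaction alone is a weak guarantee: the helper only helps on fields it is actually applied to, and IPv6 addresses pass straight through.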

GlitchTip is the choice. It speaks the Sentry wire protocol, so the SDK is byte-identical: no fork, no shim, no second auth library. The dashboard is Sentry-shaped and lives at sentry.elido.app behind our wg-easy VPN. The ingestion endpoint at o<projectId>.sentry.elido.app/api/<id>/store/ is reachable from every service over the public internet, with rate limits at the nginx layer. The recent commit "fix(ops/nginx): open Sentry ingestion endpoints; keep dashboard VPN-only" captures that exact split.
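
For orientation, the relationship between a DSN and that store path can be sketched with net/url. The DSN below is made up, storeEndpoint is a hypothetical helper rather than anything in the SDK, and the sketch assumes a DSN of the form scheme://publicKey@host/projectID with no path prefix.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// storeEndpoint derives the store URL and project ID from a Sentry-style DSN.
func storeEndpoint(dsn string) (string, string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", "", err
	}
	projectID := strings.Trim(u.Path, "/")
	if u.User == nil || u.Host == "" || projectID == "" {
		return "", "", errors.New("dsn is missing key, host or project")
	}
	endpoint := fmt.Sprintf("%s://%s/api/%s/store/", u.Scheme, u.Host, projectID)
	return endpoint, projectID, nil
}

func main() {
	// Invented DSN for illustration only.
	endpoint, project, err := storeEndpoint("https://examplekey@glitchtip.example.invalid/7")
	fmt.Println(endpoint, project, err)
}
```

Nothing in that derivation depends on who runs the host, which is why swapping backends comes down to swapping the DSN.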

The migration cost from Sentry SaaS to GlitchTip is roughly one DNS change, one DSN swap per project, and one Postgres + Redis deployment behind the dashboard host. We never ran on SaaS — we wired GlitchTip from day one — but the path is open in either direction. The SDK doesn't know which backend it's talking to.

There are two GlitchTip-specific caveats we hit and fixed during rollout. First, GlitchTip's signup flow requires registration to be open for the initial admin invite to work; we toggled it open during bootstrap, sent the invites, and toggled it back closed. Second, GlitchTip's outbound email goes out via Resend, and the from-domain has to be verified before email verification on signup will succeed; we skipped email verification until the Resend domain was green and re-enabled it afterwards. Both are documented in the runbook for anyone repeating this.

The debug-panic endpoint

End-to-end testing the wiring in production without a fresh deploy is the kind of thing that quietly never gets done — until a real panic happens and you discover the wiring was broken three weeks ago. We added a standing diagnostic surface for exactly this.

func DebugPanicHandler() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        expected := os.Getenv(debugTokenEnv)
        if expected == "" || r.URL.Query().Get("token") != expected {
            http.NotFound(w, r)
            return
        }
        // Keep the query string out of the message: it carries the token.
        panic("elido sentry-debug panic: " + r.RemoteAddr + " " + r.URL.Path)
    }
}

Mounted at GET /debug/sentry-panic, gated by ELIDO_SENTRY_DEBUG_TOKEN. With the env var unset, the route 404s — safe to ship to production. When the var is set and the request carries ?token=<value>, the handler panics on purpose. The middleware catches it, the SDK transports it to GlitchTip, the event lands in the right project. The whole round trip can be verified in under a minute without redeploying.
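
The three gate outcomes are easy to pin down in a test. The sketch below uses a simplified, hypothetical copy of the handler that takes the expected token as an argument instead of reading the environment, plus a probe helper written for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// debugPanicHandler is a simplified copy of the handler: the expected token
// is passed in rather than read from the environment.
func debugPanicHandler(expected string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if expected == "" || r.URL.Query().Get("token") != expected {
			http.NotFound(w, r)
			return
		}
		panic("sentry-debug panic")
	}
}

// probe returns the status code, or panicked=true if the handler panicked.
func probe(h http.Handler, target string) (status int, panicked bool) {
	defer func() {
		if recover() != nil {
			status, panicked = 0, true
		}
	}()
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, false
}

func main() {
	const path = "/debug/sentry-panic"
	fmt.Println(probe(debugPanicHandler(""), path+"?token=anything"))     // 404 false
	fmt.Println(probe(debugPanicHandler("s3cret"), path+"?token=wrong"))  // 404 false
	fmt.Println(probe(debugPanicHandler("s3cret"), path+"?token=s3cret")) // 0 true
}
```

An unset token and a wrong token are indistinguishable from the outside: both return a plain 404, so the route does not advertise itself.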

There's a fasthttp twin for the edge:

func DebugPanicFastHTTPHandler() fasthttp.RequestHandler {
    return func(ctx *fasthttp.RequestCtx) {
        expected := os.Getenv(debugTokenEnv)
        if expected == "" || string(ctx.QueryArgs().Peek("token")) != expected {
            ctx.SetStatusCode(fasthttp.StatusNotFound)
            return
        }
        // Keep the query string out of the message: it carries the token.
        panic("elido sentry-debug panic: " + ctx.RemoteAddr().String() + " " + string(ctx.Path()))
    }
}

Same token gate, same hidden-when-unconfigured behaviour. The first thing that happens after a deployment is that the on-call engineer hits the debug endpoint on the affected service. If the event lands in GlitchTip within ten seconds, the wiring is healthy. If it doesn't, the deployment is rolled back before the next outage discovers the broken wiring the hard way.

What we didn't wire

Three things that look like obvious additions but stay deliberately out of scope.

Tracing. EnableTracing: false in Init. We use OpenTelemetry for distributed tracing (the pkg/oteltrace package wires it across the same services). Letting Sentry do tracing in parallel would double the per-request transaction allocations and double the cost of context propagation through the call graph. Sentry's strength is errors; OTel's strength is spans. We use each for what it's good at.

Manual CaptureException on the redirect path. Covered above. The hot path doesn't import sentryinit for the purpose of calling it from handlers. The middleware is the only capture boundary.

Performance monitoring (transactions). Same reason as tracing. redirect_duration_seconds is a Prometheus histogram with region and cache_tier labels. That's the source of truth for latency. Pushing the same data through Sentry's performance monitoring would be a duplicate pipeline with worse aggregation.

What it looks like from outside

Twelve services, one shared package, one line per main.go, one line of middleware per router. When a panic happens — and they do — it shows up in GlitchTip under the right project with the right service tag, the right Environment, the right Release, and a stack trace deep enough to find the line. When a non-panicking 5xx escapes — and those happen too, usually after a database hiccup — it shows up the same way.

The compromises are explicit, written down in the package's package-level doc comment, and tested with a benchmark. The wiring is documented in the same place as the runbooks, not in tribal knowledge. Adding the thirteenth service will take fifteen minutes — five of which are writing the test, five of which are wiring the DSN into the deployment manifest, and five of which are running make build and proving it with the debug endpoint.

That's the shape that holds up. Six lines per service was always going to drift. One line, plus one shared package, plus one benchmark, doesn't.

The wiring is open in the monorepo at pkg/sentryinit/ for anyone running a Go fleet on Sentry or GlitchTip who wants a shape to copy. The associated runbook covers the rotation procedure for DSNs, the GlitchTip bootstrap caveats, and the rollback path. For teams self-hosting the whole Elido stack, the k3s playbook covers where the SDK fits into the broader Kubernetes deployment. For a deep dive on what "zero-alloc on the happy path" actually means under load, the redirect p95 post is the companion piece.

Marius Voß is DevRel and edge infra at Elido. He shipped the sentryinit package alongside the rollout described above and has spent the last week watching the GlitchTip dashboard fill up with events that were previously invisible.