The redirect path of a URL shortener has exactly one job: resolve a slug to a destination and return a 301 in single-digit milliseconds. Everything else is bookkeeping. Click analytics, attribution, geo enrichment, fraud scoring, webhook fan-out — none of it can sit on the request path. The latency budget does not allow it.
This is the engineering trick that lets the analytics pipeline coexist with the p95 < 15ms redirect target: the edge fires a click event into Redpanda and forgets it. A separate worker, click-ingester, picks it up later, enriches it, and writes it to ClickHouse in batches. The redirect process never blocks. The analytics pipeline never touches the hot path. The tradeoff is durability, and it is a smaller tradeoff than it first looks.
What "fire and forget" actually means here
The edge-redirect handler, after picking the destination URL out of the two-tier cache, does three things before the Location header goes out:
- Builds an in-memory click.Event struct from the request (slug, workspace ID, user agent, referer, IP, geo from the local GeoLite2-City mmdb, device/browser parse, suspicion flags).
- Calls producer.Emit(ctx, event) on the franz-go Kafka producer.
- Writes HTTP/1.1 301 and the Location header to the response buffer.
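The three steps above can be sketched as a handler. This is a minimal illustration, not the production code: it uses the standard library's net/http in place of fasthttp so it runs without dependencies, and the Emitter interface, the resolve lookup, recordingEmitter, and runExample are all hypothetical names introduced for the example.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Event is a trimmed-down stand-in for the click event; the real struct
// carries more fields (workspace ID, geo, device, suspicion flags).
type Event struct {
	Slug      string
	UserAgent string
	Referer   string
	IP        string
}

// Emitter is a hypothetical interface over the best-effort producer.
type Emitter interface {
	Emit(ctx context.Context, e Event)
}

// recordingEmitter is a test double that keeps events in memory.
type recordingEmitter struct{ events []Event }

func (r *recordingEmitter) Emit(_ context.Context, e Event) {
	r.events = append(r.events, e)
}

// redirectHandler resolves the slug, emits the click, then writes the 301.
// resolve is a hypothetical stand-in for the two-tier cache lookup.
func redirectHandler(resolve func(slug string) (string, bool), em Emitter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		slug := strings.TrimPrefix(r.URL.Path, "/")
		dest, ok := resolve(slug)
		if !ok {
			http.NotFound(w, r)
			return
		}
		// Step 1: build the event in memory from the request.
		ev := Event{Slug: slug, UserAgent: r.UserAgent(), Referer: r.Referer(), IP: r.RemoteAddr}
		// Step 2: hand it to the producer; no ack is awaited here.
		em.Emit(r.Context(), ev)
		// Step 3: write the status line and the Location header.
		w.Header().Set("Location", dest)
		w.WriteHeader(http.StatusMovedPermanently)
	}
}

// runExample serves one request in memory and reports what happened.
func runExample() (status int, location string, emitted int) {
	em := &recordingEmitter{}
	resolve := func(slug string) (string, bool) {
		return "https://example.com/landing", slug == "abc"
	}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/abc", nil)
	redirectHandler(resolve, em)(rec, req)
	return rec.Code, rec.Header().Get("Location"), len(em.events)
}

func main() {
	fmt.Println(runExample())
}
```

Keeping the producer behind a small interface is one way to test the redirect path with an in-memory double instead of a broker.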
The producer call returns immediately. It does not wait for an ack from any Redpanda broker. The franz-go library buffers the record in-process and dispatches it on a background goroutine; the production callback is invoked later, on a worker pool that does not own the request goroutine. If the produce fails, the callback logs the error and the event is dropped. The redirect has already been served.
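func (p *Producer) Emit(ctx context.Context, e Event) {
	// A nil producer is a no-op so local dev can run without a broker.
	if p == nil {
		return
	}
	b, err := json.Marshal(e)
	if err != nil {
		// Guard the logger here too, matching the produce callback below.
		if p.log != nil {
			p.log.Warn("click marshal", zap.Error(err))
		}
		return
	}
	rec := &kgo.Record{Topic: p.topic, Value: b}
	// Produce buffers the record; the callback runs later, off the
	// request goroutine, once the broker acks or the record fails.
	p.client.Produce(ctx, rec, func(_ *kgo.Record, err error) {
		if err != nil && p.log != nil {
			p.log.Warn("click produce", zap.Error(err))
		}
	})
}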
That is the entire interface. No retry queue inside the edge process, no synchronous ack wait, no disk spool. The contract with the rest of the system is simple: best-effort emit, log failures, never block.
A nil-receiver guard lets local dev run without a Kafka broker. Without it, every contributor would need a Redpanda container running just to test the redirect path against fasthttp handlers.
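The guard relies on a Go property worth spelling out: calling a method on a nil pointer is legal as long as the method does not dereference the receiver. A minimal sketch, with a stand-in Producer and a hypothetical newProducer constructor that returns nil when no broker address is configured:

```go
package main

import "fmt"

// Event is a stand-in for the click event.
type Event struct{ Slug string }

// Producer is a stand-in for the real producer; sent counts emitted events.
type Producer struct{ sent int }

// Emit is safe on a nil *Producer because it returns before touching any field.
func (p *Producer) Emit(e Event) {
	if p == nil {
		return
	}
	p.sent++
}

// newProducer is a hypothetical constructor: an empty broker list means
// "local dev, no broker", and the caller gets a nil producer.
func newProducer(brokers string) *Producer {
	if brokers == "" {
		return nil
	}
	return &Producer{}
}

func main() {
	dev := newProducer("")
	dev.Emit(Event{Slug: "abc"}) // no panic, nothing emitted

	prod := newProducer("redpanda:9092")
	prod.Emit(Event{Slug: "abc"})

	fmt.Println(dev == nil, prod.sent)
}
```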
Why we did not pick a synchronous write
The obvious alternative is to write each click directly to ClickHouse from the edge. We considered it. We rejected it for three reasons that compound.
Latency. ClickHouse INSERT round-trip from the Frankfurt POP to a same-region ClickHouse cluster sits at 3-6ms p50 on a quiet network, 12-20ms p95 under load. That is the entire redirect budget. Adding it to the response path would push p95 past the 15ms SLO before anything else went wrong. The cache strategy post explains how tight the budget is in practice.
Backpressure. ClickHouse is happy ingesting batches of 1000-10000 rows per INSERT. It is unhappy ingesting single rows in tight loops: the MergeTree engine writes a new data part per insert and a background process merges parts. A direct-write pattern from a multi-region edge fleet would create millions of tiny parts and the merge queue would never catch up. The ClickHouse documentation is explicit: insert in batches of at least 1000 rows, no more than once per second.
Failure isolation. A ClickHouse cluster restart, a network blip, or a slow query that locks a replica would propagate directly into redirect failures. The edge process would either start timing out (worsening p95) or start dropping clicks (worsening data quality). Sticking a message bus between the two lets each side fail independently — the edge keeps redirecting even when ClickHouse is degraded, and ClickHouse keeps ingesting even when one POP is offline.
Redpanda absorbs all three pressures. It is Kafka-protocol-compatible so franz-go talks to it transparently. It has a single-binary footprint with no JVM. It buffers on disk so a multi-hour ClickHouse outage does not lose events as long as the topic retention window holds.
The click-ingester worker
click-ingester is a Go service that runs as a consumer group on the click events topic. One replica per region, three regions, no sharding by slug or workspace — the consumer group rebalances if a replica restarts and partitions are assigned by Redpanda. The consumer's job is small:
- Poll fetches from the topic.
- Decode each record's JSON into a typed Event.
- Push the event into a writer's in-memory buffer.
- Sometimes: fire webhooks, forward to Klaviyo / Mixpanel / GA4 MP, publish to the in-app live click stream.
The writer batches by count or by time, whichever fires first. Defaults: 1000 events per batch, 5-second flush interval. Each batch is sent to ClickHouse as a single INSERT INTO click_events through a PrepareBatch call and committed as one server-side append. On success, the writer marks the underlying Kafka record offsets as committed; on failure, nothing is committed and the consumer re-fetches from the last successful offset on its next poll.
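The count-or-time trigger can be sketched with a channel and a ticker. This is an illustration under assumptions, not the ingester's writer: runBatcher, batchSizes, and the flush callback are hypothetical names, and the flush here only records batch sizes instead of talking to ClickHouse.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a stand-in for the decoded click event.
type Event struct{ ID int }

// runBatcher buffers events from in and calls flush when the buffer reaches
// maxSize or when interval elapses with a non-empty buffer, whichever comes
// first. It flushes whatever is left when in is closed.
func runBatcher(in <-chan Event, maxSize int, interval time.Duration, flush func([]Event)) {
	buf := make([]Event, 0, maxSize)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	emit := func() {
		if len(buf) == 0 {
			return
		}
		batch := make([]Event, len(buf))
		copy(batch, buf) // hand off a copy so buf can be reused
		flush(batch)
		buf = buf[:0]
	}

	for {
		select {
		case ev, ok := <-in:
			if !ok {
				emit()
				return
			}
			buf = append(buf, ev)
			if len(buf) >= maxSize {
				emit()
				ticker.Reset(interval) // restart the timer after a count-triggered flush
			}
		case <-ticker.C:
			emit()
		}
	}
}

// batchSizes feeds n events through the batcher with a long interval, so only
// the count trigger and the final flush fire, and returns the batch sizes.
func batchSizes(n, maxSize int) []int {
	in := make(chan Event, n)
	for i := 0; i < n; i++ {
		in <- Event{ID: i}
	}
	close(in)
	var sizes []int
	runBatcher(in, maxSize, time.Hour, func(b []Event) { sizes = append(sizes, len(b)) })
	return sizes
}

func main() {
	fmt.Println(batchSizes(7, 3))
}
```

With the defaults quoted above, maxSize would be 1000 and interval 5 seconds.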
The offset-after-flush contract is the durability guarantee. The consumer never tells Redpanda "I've processed this record" until the record has landed in ClickHouse as part of a successful batch. A crash between consume and flush means the consumer group rebalances, the new owner re-polls from the last committed offset, and the events are reprocessed. Reprocessing is safe because the click_events table is a ReplacingMergeTree keyed on a synthetic event ID, so duplicate inserts collapse on merge. Merges run in the background, so a query issued before the merge can still see both copies unless it uses FINAL or deduplicates by event ID.
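The ordering is the whole contract, so it is worth seeing in isolation. A minimal sketch, assuming a single partition and hypothetical flush and commit callbacks (the real service goes through the ClickHouse and franz-go clients instead):

```go
package main

import (
	"errors"
	"fmt"
)

// Record is a stand-in for a consumed Kafka record.
type Record struct {
	Offset int64
	Value  []byte
}

// flushThenCommit writes the batch first and commits only if the write
// succeeded. By Kafka convention the committed offset is the next offset to
// read, hence the +1. On flush failure nothing is committed, so the batch is
// redelivered after a restart or rebalance.
func flushThenCommit(batch []Record, flush func([]Record) error, commit func(next int64) error) error {
	if len(batch) == 0 {
		return nil
	}
	if err := flush(batch); err != nil {
		return fmt.Errorf("flush failed, offsets left uncommitted: %w", err)
	}
	return commit(batch[len(batch)-1].Offset + 1)
}

// demoCommit runs one batch of three records and returns the committed
// offset, or -1 if commit was never called.
func demoCommit(failFlush bool) (committed int64, err error) {
	committed = -1
	batch := []Record{{Offset: 0}, {Offset: 1}, {Offset: 2}}
	flush := func([]Record) error {
		if failFlush {
			return errors.New("clickhouse unavailable")
		}
		return nil
	}
	commit := func(next int64) error {
		committed = next
		return nil
	}
	err = flushThenCommit(batch, flush, commit)
	return committed, err
}

func main() {
	fmt.Println(demoCommit(false))
	fmt.Println(demoCommit(true))
}
```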
Bad messages are not retried. A JSON decode failure is marked committed immediately so the consumer does not get stuck on a poison record. This is a small but real source of data loss; the rate sits at single-digit events per day across the full fleet, and the affected events show up in the consumer's decode_error_total Prometheus counter.
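The skip-and-count behaviour can be sketched as follows. The Event fields, decodeBatch, and the plain integer counter are assumptions for illustration; the counter stands in for the decode_error_total metric.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a stand-in for the typed click event.
type Event struct {
	ID   string `json:"id"`
	Slug string `json:"slug"`
}

// decodeResult holds the good events and the number of skipped records.
type decodeResult struct {
	Events       []Event
	DecodeErrors int // stand-in for the decode_error_total counter
}

// decodeBatch decodes each payload and skips the ones that fail, so a single
// poison record cannot stall the partition. The caller still advances past
// the skipped records when it commits offsets.
func decodeBatch(values [][]byte) decodeResult {
	var res decodeResult
	for _, v := range values {
		var e Event
		if err := json.Unmarshal(v, &e); err != nil {
			res.DecodeErrors++
			continue
		}
		res.Events = append(res.Events, e)
	}
	return res
}

// demoDecode runs two valid payloads and one truncated payload through decodeBatch.
func demoDecode() (good, bad int) {
	res := decodeBatch([][]byte{
		[]byte(`{"id":"1","slug":"a"}`),
		[]byte(`{not json`),
		[]byte(`{"id":"2","slug":"b"}`),
	})
	return len(res.Events), res.DecodeErrors
}

func main() {
	fmt.Println(demoDecode())
}
```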
The durability tradeoff in numbers
Fire-and-forget gives up some events. The question is how many, and whether that matters for the use case.
We measured the production loss rate over a 90-day window. The number is approximately 0.04% of emitted events — about four lost clicks per ten thousand. The breakdown:
- Edge process restart with in-flight buffer. franz-go buffers up to a few hundred milliseconds of records before flushing to a broker. A SIGTERM during a deploy can drop whatever is in the buffer. The deploy script issues a clean shutdown that drains the buffer with a 2-second timeout, which catches most cases but not all.
- Redpanda broker unavailability beyond the producer retry window. franz-go retries produce failures, but the retry budget is bounded. If a region's Redpanda cluster is unhealthy for more than about 30 seconds the buffer overflows and new records are dropped at the producer's edge.
- Network partition between the edge POP and the regional Redpanda cluster. The same effect as above. The producer logs warnings and drops events until connectivity returns.
For the URL shortener workload, 0.04% loss is acceptable. Clicks are statistical signal, not financial transactions. Cohort analytics, conversion attribution, and geo distribution all aggregate well across a sample with that miss rate. Use cases that would not tolerate it — regulated industries with audit requirements, billing-tied click counts — are not what the redirect tier serves directly.
For workspaces that need higher durability, we offer a separate audit-log mode that writes every click synchronously to Postgres in addition to the fire-and-forget path. The synchronous write is opt-in and off by default; it adds 3-5ms at p95 to the redirect. The ClickHouse export guide documents the audit-log shape for compliance teams that need to reconcile counts.
Replay strategy when ClickHouse is down
The producer is fire-and-forget but the consumer side has a real replay story.
When ClickHouse is unavailable, the writer's flush calls fail. The consumer continues polling — franz-go's poll loop is independent of the writer's flush loop — but offsets are not committed because the flush did not succeed. Redpanda retention is set to 72 hours, which is the maximum tolerable outage before events start aging out.
During a real outage (we have had three of meaningful duration in 18 months), the recovery sequence is:
- ClickHouse comes back online.
- The next flush attempt succeeds and commits offsets.
- The consumer catches up by draining the backlog at the configured batch rate. With a 1000-event batch and a 5-second flush, the consumer can drain about 200 events per second per replica; three replicas means roughly 36k events per minute.
- The Grafana dashboard for the click_events table shows the catch-up curve: the row insert rate stays elevated until the backlog clears.
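The drain-rate figure in the recovery sequence is plain arithmetic, reproduced here so the inputs can be varied. The sketch follows the text in treating the 5-second flush as the pacing factor and ignores new clicks arriving during catch-up; the function names and the backlog size in main are hypothetical.

```go
package main

import "fmt"

// drainRatePerSec returns fleet-wide events per second if every replica
// flushes one full batch per flush interval.
func drainRatePerSec(batchSize int, flushSeconds float64, replicas int) float64 {
	return float64(batchSize) / flushSeconds * float64(replicas)
}

// drainSeconds estimates how long a backlog takes to clear at a given rate.
func drainSeconds(backlog int, perSec float64) float64 {
	return float64(backlog) / perSec
}

func main() {
	rate := drainRatePerSec(1000, 5, 3) // 1000-event batch, 5 s flush, 3 replicas
	fmt.Println("events/sec:", rate)
	fmt.Println("events/min:", rate*60)

	// Hypothetical backlog, chosen only to show the calculation.
	backlog := 1080000
	fmt.Println("minutes to drain:", drainSeconds(backlog, rate)/60)
}
```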
The 72-hour retention is sized to absorb a multi-day ClickHouse rebuild without data loss. We have never used more than 4 hours of it in production. Disk on the Redpanda brokers is the cost, and it is small relative to losing analytics data.
A replay-from-archive is also possible. Redpanda has tiered storage shipping closed segments to S3-compatible object storage. We have it configured but have not needed it — hot replay covers every incident we have seen.
What the consumer also does
Click ingestion is not just ClickHouse writes. The consumer is the central fan-out point for every downstream system that cares about clicks.
- Webhook dispatcher. Customer-configured webhooks fire from the consumer, not from the edge. The consumer enqueues a webhook job per click that matches a configured filter. Retries, signing, and delivery happen in webhook-dispatcher.
- Server-side event forwarding. Klaviyo, Mixpanel, GA4 Measurement Protocol, Meta CAPI. The consumer holds a per-workspace config cache and fires the appropriate POST for each click that the workspace has wired up. Forwarders are best-effort with a small in-memory retry; persistent failures land in a dead-letter table.
- Live click stream. The in-app "watch a campaign drop live" view subscribes to a Redis pub/sub channel. The consumer publishes a minimal-shape event for each click that matches an active live session. This is the only synchronous-feeling part of the pipeline, and it is best-effort: events are dropped when the channel is congested.
- Pixel firing. Conversion pixels (retargeting and offline conversion) fire from the consumer based on per-link configuration. Pixel firing is its own fault domain; failures are logged but do not back-pressure the ClickHouse writer.
All of these run after the offset commit but before the next poll. A slow pixel endpoint can slow effective consumer throughput. A per-forwarder timeout (1 second hard cap) and a per-batch concurrency limit (16 in flight) keep the slow path from dominating.
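A per-call timeout combined with a bounded number of in-flight calls can be sketched with a buffered channel used as a semaphore. This is an illustration, not the consumer's code: forwardFunc, runForwarders, and demoForwarders are hypothetical names, and the demo uses a limit of 2 and a 20ms timeout so it finishes quickly, where the text above quotes 16 and 1 second.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// forwardFunc is a hypothetical forwarder call that honours ctx cancellation.
type forwardFunc func(ctx context.Context) error

// runForwarders runs jobs with at most limit in flight, each bounded by
// timeout, and returns how many returned an error.
func runForwarders(parent context.Context, jobs []forwardFunc, limit int, timeout time.Duration) int {
	sem := make(chan struct{}, limit)
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for _, job := range jobs {
		sem <- struct{}{} // blocks while limit jobs are already running
		wg.Add(1)
		go func(job forwardFunc) {
			defer wg.Done()
			defer func() { <-sem }()
			ctx, cancel := context.WithTimeout(parent, timeout)
			defer cancel()
			if err := job(ctx); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(job)
	}
	wg.Wait()
	return failed
}

// demoForwarders runs four fast jobs and one slow job and reports the failure
// count and the highest concurrency observed.
func demoForwarders() (failed, maxInFlight int) {
	var mu sync.Mutex
	inFlight := 0
	track := func(fn forwardFunc) forwardFunc {
		return func(ctx context.Context) error {
			mu.Lock()
			inFlight++
			if inFlight > maxInFlight {
				maxInFlight = inFlight
			}
			mu.Unlock()
			defer func() {
				mu.Lock()
				inFlight--
				mu.Unlock()
			}()
			return fn(ctx)
		}
	}
	fast := func(ctx context.Context) error {
		time.Sleep(time.Millisecond)
		return nil
	}
	slow := func(ctx context.Context) error {
		select {
		case <-ctx.Done():
			return ctx.Err() // cut off by the per-call timeout
		case <-time.After(time.Second):
			return nil
		}
	}
	jobs := []forwardFunc{track(fast), track(slow), track(fast), track(fast), track(fast)}
	failed = runForwarders(context.Background(), jobs, 2, 20*time.Millisecond)
	return failed, maxInFlight
}

func main() {
	fmt.Println(demoForwarders())
}
```

The timeout only helps if the forwarder passes ctx down to its HTTP call; a call that ignores ctx still holds its slot until it returns.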
Why this shape and not Kinesis or a queue
We evaluated a few alternative event-bus shapes and did not choose them.
SQS or RabbitMQ as a queue. Neither has the throughput-per-broker Redpanda offers at click-event volume. SQS bills per request, which makes high-volume streams expensive; RabbitMQ pushes back on dense topics.
AWS Kinesis. Reasonable if we were AWS-resident. We are not — Hetzner FRA, Hetzner ASH, OVH SGP. Self-hosted Kafka or Redpanda is the right shape for an EU-first deployment.
Plain Kafka. Works. We picked Redpanda for the operational profile: single binary, no ZooKeeper, no JVM tuning. The wire protocol is identical and franz-go cannot tell the difference. A self-hosted Elido deployment can swap in Apache Kafka without code changes.
Managed services like Confluent Cloud. Not EU-resident in the way we want. The redirect tier needs same-region message-bus latency.
The decision is documented in more detail in the edge-redirect architecture page, which is the source of truth for the redirect-tier configuration choices.
What we would do differently next time
The fire-and-forget pattern is correct. The implementation has rough edges worth flagging for anyone copying the design.
Shutdown drain. The 2-second drain timeout on the franz-go producer has lost events during deploys when the buffer is busy. The fix is a SIGTERM hook that flushes synchronously before the process exits, with a longer timeout and a hard kill if the broker is unreachable.
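One possible shape for that hook, sketched under assumptions: the flusher interface mirrors the Flush(ctx) method on franz-go's client, while drainOnSignal, fakeFlusher, and the timeouts in the demo are hypothetical. The demo pushes SIGTERM into the channel by hand; a real service would register the channel with signal.Notify and exit once the function returns, whether or not the drain completed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

// flusher is the one method the drain needs from the producer.
type flusher interface {
	Flush(ctx context.Context) error
}

// drainOnSignal waits for a signal, then flushes with a bounded timeout.
// A non-nil error means some buffered records may not have been delivered.
func drainOnSignal(sig <-chan os.Signal, f flusher, timeout time.Duration) error {
	<-sig
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := f.Flush(ctx); err != nil {
		return fmt.Errorf("drain incomplete after %s: %w", timeout, err)
	}
	return nil
}

// fakeFlusher simulates a producer whose flush takes a fixed amount of time.
type fakeFlusher struct{ takes time.Duration }

func (f fakeFlusher) Flush(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(f.takes):
		return nil
	}
}

// demoDrain runs one drain that finishes in time and one that hits the timeout.
func demoDrain() (drained, timedOut bool) {
	sig := make(chan os.Signal, 2)
	sig <- syscall.SIGTERM
	sig <- syscall.SIGTERM
	drained = drainOnSignal(sig, fakeFlusher{takes: time.Millisecond}, 200*time.Millisecond) == nil
	err := drainOnSignal(sig, fakeFlusher{takes: time.Second}, 20*time.Millisecond)
	timedOut = errors.Is(err, context.DeadlineExceeded)
	return drained, timedOut
}

func main() {
	fmt.Println(demoDrain())
}
```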
Dead-letter path for decode failures. Marking poison records committed and moving on is fine for throughput but loses observability. A future iteration writes the raw bytes plus the decode error into a click_events_decode_failures table so the team can audit what shows up.
Per-workspace forwarder concurrency. Today every workspace's forwarders share the consumer's global pool. A noisy workspace with a slow Mixpanel endpoint can starve others. A per-workspace cap is the obvious fix; we have not built it.
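A per-workspace cap could look something like the sketch below. It is hypothetical, since the text says the feature has not been built: workspaceLimiter and its methods are invented names, and what the caller does when a slot is refused (queue, retry, or dead-letter) is left open.

```go
package main

import (
	"fmt"
	"sync"
)

// workspaceLimiter allows at most perWorkspace forwards in flight per workspace ID.
type workspaceLimiter struct {
	mu           sync.Mutex
	perWorkspace int
	inFlight     map[string]int
}

func newWorkspaceLimiter(perWorkspace int) *workspaceLimiter {
	return &workspaceLimiter{perWorkspace: perWorkspace, inFlight: make(map[string]int)}
}

// tryAcquire reports whether the workspace may start another forward.
// It never blocks, so one workspace at its cap cannot stall the others.
func (l *workspaceLimiter) tryAcquire(workspace string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[workspace] >= l.perWorkspace {
		return false
	}
	l.inFlight[workspace]++
	return true
}

// release frees one slot and drops idle workspaces from the map.
func (l *workspaceLimiter) release(workspace string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[workspace] > 0 {
		l.inFlight[workspace]--
	}
	if l.inFlight[workspace] == 0 {
		delete(l.inFlight, workspace)
	}
}

func main() {
	l := newWorkspaceLimiter(2)
	fmt.Println(l.tryAcquire("noisy"), l.tryAcquire("noisy"), l.tryAcquire("noisy"))
	fmt.Println(l.tryAcquire("quiet")) // unaffected by the noisy workspace
	l.release("noisy")
	fmt.Println(l.tryAcquire("noisy"))
}
```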
None of these have caused a production incident. They are the kind of thing you log in the ADR backlog and chip away at.
Related reading
- Hitting p95 < 15ms for redirects from FRA, ASH, and SGP — the cornerstone latency-budget piece this post sits next to.
- Cache strategy for URL redirects: L1 LRU and L2 Redis — the other half of the hot-path story.
- Why we use ClickHouse for click analytics (not Postgres) — the downstream-of-this-pipeline decision.
- Smart links explained — what the destination URL field actually resolves to before the click event is emitted.
- Short links as Terraform — operational walkthrough of the redirect tier configuration.
- Wiring Sentry across 12 Go services — the panic and 5xx capture path that runs alongside the consumer.
- Architecture: /docs/architecture/edge-redirect.
- Operational guide: /docs/guides/clickhouse-export, the audit-log mode for workspaces that need per-click durability.
- External: Redpanda tiered storage, ClickHouse bulk inserts, fasthttp.