Elido
8 min read · Engineering

Short link monitoring with Sentry and Datadog

Forward 4xx/5xx redirect events and edge latency p99 to Sentry as issues and Datadog as metrics. Sample dashboards, alert thresholds.

Marius Voß
DevRel · edge infra
Diagram showing short link monitoring signals routed to Sentry and Datadog dashboards

If a short link returns a 5xx for 30 seconds during an Instagram campaign push, you lose roughly 4-7% of the cohort. Most engineering teams find out the next morning when somebody pastes a Slack screenshot. This guide is the playbook we use at Elido to catch redirect failures in under 60 seconds using two tools you probably already pay for: Sentry for issues and Datadog for metrics. It is the same wiring we run for our own edge POPs, which serve roughly 240M redirects/month at a p99 of 13ms.

The short version: Sentry is good at one thing for redirects, and that thing is "one issue per broken destination, with a list of slugs that hit it." Datadog is good at the orthogonal thing, time-series. You want both, and Elido emits to both natively. Sentry is currently Beta (paste a DSN and you are done); Datadog is Live with a dedicated metric collector. Below: which signals matter, how the Sentry wiring works under the hood, and what a Datadog dashboard for redirect health should actually contain.

Which signals matter for redirect monitoring

Before you wire anything, decide what you actually care about. Redirect monitoring is a narrower problem than full APM, and the signal set is small. Four signals cover roughly 95% of real-world incidents:

4xx redirect events. A 404 on a short link is almost always one of three things: a slug got deleted, a slug expired, or somebody is fuzzing your domain. A 410 is intentional and noisy, so we suppress it from alerts. A 451 (geo-block) is interesting only in aggregate. Per-event 4xx volume is too noisy for paging; treat it as a metric, not an issue.

5xx redirect events. These are page-the-on-call. A 5xx means either the edge couldn't reach Redis (L2 cache), couldn't reach api-core (origin gRPC), or the destination URL had a DNS failure during a HEAD-check. Each of these has a different runbook. The Sentry transformer in api-core tags the root cause so the issue title is something like 5xx: redis-timeout (12 slugs affected, last seen 14s ago) rather than a generic Internal Server Error.

Edge latency p99. A cache HIT redirect should serve in under 15ms at p99 from any of our three POPs. We alert if p99 stays above 50ms for 5 minutes. The 5-minute window is deliberate: a single slow query won't hold p99 above the threshold for that long, but a Redis replica falling out of sync will. See redirect p95 under 15ms for the latency budget breakdown.
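To see why a single slow request does not move p99 while a degraded dependency does, here is a minimal sketch using a nearest-rank percentile. The sample values and the `p99` helper are illustrative assumptions, not Elido's collector code.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical window: 999 cache-HIT redirects at 8 ms plus one slow outlier.
one_slow_query = [8.0] * 999 + [900.0]

# Hypothetical window: 3% of requests are slow, e.g. a replica out of sync.
degraded_replica = [8.0] * 970 + [120.0] * 30

print(p99(one_slow_query))    # the outlier sits above the 99th percentile
print(p99(degraded_replica))  # a persistent slow tail lifts p99 past 50 ms
```

With these inputs the first window still reports 8 ms and the second reports 120 ms, which is the behaviour the 50ms-for-5-minutes rule relies on.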

Click-rate anomaly and scan failure. Click-rate anomalies are the late warning system. If a campaign normally does 4k clicks/hour and suddenly does 200, something upstream broke (your ad got disapproved, your QR sticker peeled off, somebody pulled the wrong link). Scan failures come from the url-scanner service, which screens destinations for malware. A spike in scan-failure usually means somebody's account got compromised and is creating phishing links.
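A click-rate anomaly check can be as simple as comparing the current hour against a trailing baseline. The sketch below assumes a median-of-recent-hours baseline and a hypothetical `min_ratio` cutoff of 25%; neither value comes from Elido's implementation.

```python
from statistics import median

def click_rate_anomaly(recent_hours, current_hour, min_ratio=0.25):
    """Return True when the current hour falls below min_ratio of baseline.

    recent_hours: clicks per hour for the trailing window (hypothetical input)
    current_hour: clicks observed in the hour being evaluated
    min_ratio:    assumed cutoff; tune per campaign
    """
    if not recent_hours:
        return False  # no baseline yet, nothing to compare against
    baseline = median(recent_hours)
    if baseline <= 0:
        return False
    return current_hour / baseline < min_ratio

# The example from the text: a campaign that normally does ~4k clicks/hour.
history = [3900, 4100, 4000, 4050, 3950]
print(click_rate_anomaly(history, 200))   # collapse to 200 clicks/hour
print(click_rate_anomaly(history, 3600))  # ordinary hour-to-hour variation
```

The median keeps one unusually busy hour from inflating the baseline; a production version would likely also account for time-of-day seasonality.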

Routing signals to the right tool

Not every signal belongs in every tool. Sending 4xx volume to Sentry as issues will bury the actual broken-destination issue under noise. Sending p99 latency to Sentry as alerts is awkward because Sentry's alerting is built around issue frequency, not time-series. The mental model: Sentry = exceptions, Datadog = metrics, Slack = humans, Linear = follow-up tickets.

Matrix showing 4xx, 5xx, latency, and scan failure signals routed across Sentry, Datadog, Slack, and Linear

Elido emits where the X is. We don't push 4xx events to Sentry because they aren't exceptions. We don't push every click event to Datadog because the volume isn't worth the cost (Datadog custom metrics are billed per unique tag combination, and the cardinality of slug × region × tier would burn $4k/month for a mid-size workspace). The split above is what we converged on after 9 months of running the system internally.
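The cardinality argument is easy to check with arithmetic. The slug and domain counts below are hypothetical. The region, tier, cache_result and status_class counts follow the values this article mentions (three POPs, f/s/b tiers, HIT/MISS/BYPASS, 2xx to 5xx). The result is an upper bound on time series, not a dollar figure.

```python
from math import prod

def max_series(*tag_cardinalities):
    """Upper bound on unique tag combinations for one metric."""
    return prod(tag_cardinalities)

# Hypothetical mid-size workspace: 20,000 slugs, 3 regions, 3 tiers.
per_slug = max_series(20_000, 3, 3)

# Pre-aggregated tags instead of slugs (domain count is an assumption):
# 5 domains x 3 tiers x 3 regions x 3 cache_result x 4 status_class.
pre_aggregated = max_series(5, 3, 3, 3, 4)

print(per_slug)        # 180000
print(pre_aggregated)  # 540
```

Dropping the slug tag is what shrinks the series count by more than two orders of magnitude in this example.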

Sentry wiring: DSN paste and the envelope transformer

Sentry integration in Elido is in Beta but functionally complete. The setup is three clicks. You go to /integrations, find Sentry, paste a DSN, and pick which event types to forward. The DSN is the only secret. We store it in Postgres with envelope encryption (KMS-wrapped per ADR-0036) so even our DB admins can't read it raw.

What happens under the hood is that api-core has a webhook transformer that listens to the internal event bus (Redpanda topic redirect.errors) and packages matching events into Sentry envelopes. The envelope format is documented in Sentry's envelope spec - it is just an HTTP POST with a JSON header line, a JSON item header, and a JSON item payload, separated by newlines. There is no Sentry SDK in the request path. That keeps the edge code (services/edge-redirect) small and avoids a hot-path dependency.
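The three-line envelope described above can be assembled without any SDK. The following is a minimal sketch, not Elido's transformer: the DSN and event fields are placeholders, and it only builds the request body and URL, leaving out the HTTP call. It assumes the envelope header may carry the `dsn` for authentication and that the body is sent with `Content-Type: application/x-sentry-envelope`. Check both against Sentry's envelope documentation before relying on them.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib.parse import urlsplit

def envelope_url(dsn):
    """Derive the envelope endpoint from a DSN like https://KEY@HOST/PROJECT."""
    parts = urlsplit(dsn)
    host = parts.netloc.rsplit("@", 1)[-1]          # drop the public key
    prefix, _, project_id = parts.path.rpartition("/")
    return f"{parts.scheme}://{host}{prefix}/api/{project_id}/envelope/"

def build_envelope(dsn, event):
    """Return envelope bytes: header line, item header line, item payload."""
    event = dict(event)  # avoid mutating the caller's dict
    event.setdefault("event_id", uuid.uuid4().hex)
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    header = {
        "event_id": event["event_id"],
        "dsn": dsn,
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }
    item_header = {"type": "event", "length": len(payload)}
    lines = [
        json.dumps(header).encode("utf-8"),
        json.dumps(item_header).encode("utf-8"),
        payload,
    ]
    return b"\n".join(lines) + b"\n"

# Placeholder DSN and event, for illustration only.
DSN = "https://abc123@o0.ingest.example.invalid/42"
body = build_envelope(DSN, {"message": "5xx: redis-timeout", "level": "error"})
print(envelope_url(DSN))
print(body.decode("utf-8"))
```

Because the payload is compact JSON, it contains no raw newlines, so the newline-separated framing stays unambiguous.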

The transformer does three useful things:

Fingerprinting. Sentry groups events by fingerprint. A naive fingerprint would group every 5xx into one giant issue, which is useless. Our transformer fingerprints by error_class:destination_host so a Redis timeout on links pointing to acme.com is a separate issue from a Redis timeout on links pointing to globex.com. This makes "one broken destination = one issue" actually true.

Slug aggregation. Each Sentry event carries a tags block listing the first 50 slugs affected, the workspace ID, and the redirect domain. When 800 slugs share a destination and that destination starts returning DNS NXDOMAIN, you see one issue with slugs_affected: 800 and a sample of 50, not 800 separate alerts.
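Fingerprinting and slug aggregation fit together naturally, so here is a sketch of both under stated assumptions. The input event shape (`error_class`, `destination`, `slug`) is hypothetical. The `error_class:destination_host` key and the 50-slug sample come from the description above.

```python
from urllib.parse import urlsplit

SAMPLE_SIZE = 50  # first N slugs attached to each issue, per the text

def fingerprint(error_class, destination_url):
    """Build the grouping key error_class:destination_host."""
    host = urlsplit(destination_url).hostname or "unknown"
    return f"{error_class}:{host}"

def aggregate(events):
    """Collapse raw error events into one record per fingerprint."""
    slugs_by_key = {}
    for ev in events:
        key = fingerprint(ev["error_class"], ev["destination"])
        # dict keys act as an insertion-ordered set of slugs
        slugs_by_key.setdefault(key, {})[ev["slug"]] = None
    return {
        key: {
            "slugs_affected": len(slugs),
            "sample": list(slugs)[:SAMPLE_SIZE],
        }
        for key, slugs in slugs_by_key.items()
    }

# 800 slugs share one dead destination; one unrelated Redis timeout.
events = [
    {"error_class": "dns-nxdomain",
     "destination": "https://acme.com/summer", "slug": f"s{i}"}
    for i in range(800)
]
events.append({"error_class": "redis-timeout",
               "destination": "https://globex.com/x", "slug": "g1"})

issues = aggregate(events)
print(len(issues))
print(issues["dns-nxdomain:acme.com"]["slugs_affected"])
```

In a Sentry event payload the same key would typically go into the `fingerprint` field as a list of strings, for example `["dns-nxdomain", "acme.com"]`.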

Rate limiting per workspace. A workspace running a bad campaign can generate 10k 5xx in 60 seconds. Sentry will accept all of them and bill you for it. The transformer rate-limits at 50 envelopes per minute per workspace and rolls the rest into a single "suppressed" event with a count. We learned this the hard way when a customer pointed 4M short links at a domain that started returning 503.
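The per-workspace limit can be sketched with a fixed-window counter. The article does not say which algorithm the transformer uses, so the fixed window, the class name, and the method names here are assumptions. Only the 50-per-minute limit and the rolled-up "suppressed" count come from the text.

```python
class WorkspaceRateLimiter:
    """Fixed-window limiter: at most `limit` envelopes per window per workspace."""

    def __init__(self, limit=50, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._window_start = {}
        self._sent = {}
        self._suppressed = {}

    def allow(self, workspace_id, now):
        """Return True if this envelope may be sent at time `now` (seconds)."""
        start = self._window_start.get(workspace_id)
        if start is None or now - start >= self.window_s:
            # open a fresh window for this workspace
            self._window_start[workspace_id] = now
            self._sent[workspace_id] = 0
        if self._sent[workspace_id] < self.limit:
            self._sent[workspace_id] += 1
            return True
        self._suppressed[workspace_id] = self._suppressed.get(workspace_id, 0) + 1
        return False

    def take_suppressed(self, workspace_id):
        """Return and reset the count to report in one 'suppressed' event."""
        return self._suppressed.pop(workspace_id, 0)

# A bad campaign: 10,000 errors spread over just under 60 seconds.
limiter = WorkspaceRateLimiter()
sent = sum(limiter.allow("ws-1", now=i * 0.006) for i in range(10_000))
print(sent)                              # envelopes forwarded
print(limiter.take_suppressed("ws-1"))   # rolled into one summary event
```

Passing `now` explicitly keeps the limiter deterministic and easy to test; a real caller would pass a monotonic clock reading.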

If you want to handle ingest yourself rather than through Elido's transformer, the observability docs cover the alternate path: subscribe to our webhook event bus and convert events to Sentry envelopes in your own infra. Most teams don't bother: using the built-in transformer is faster than building your own.

A note on what shows up as an "issue": Sentry's UI treats each grouped event as an issue card with a sparkline, a sample event, and a list of tags. For redirect errors, the most useful tag is cache_result (HIT, MISS, BYPASS). If you see a wave of 5xx with cache_result: BYPASS, somebody on your team probably deployed a change that forced cache bypass for testing and forgot to flip it back. Real story, twice in the last year.

Datadog wiring: metric collector and dashboards

Datadog is Live. The wiring is also three clicks, but the architecture is different. Instead of a per-event transformer, we run a metric collector on the api-core side that aggregates redirect telemetry into Datadog's metric format and submits batches every 10 seconds via Datadog's custom metric API. The collector pre-aggregates so we never submit raw events. That keeps custom-metric cardinality low and keeps your Datadog bill under control.
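Pre-aggregation means counting events per tag combination in memory and submitting only the totals at each flush. A minimal sketch, assuming a hypothetical `RedirectAggregator` class and a generic row shape (metric, timestamp, value, tags), not Datadog's exact submission payload:

```python
from collections import Counter

class RedirectAggregator:
    """Counts redirects per tag combination between flushes."""

    def __init__(self):
        self._counts = Counter()

    def record(self, domain, tier, region, cache_result, status):
        # 302 -> "3xx", 503 -> "5xx"
        status_class = f"{status // 100}xx"
        self._counts[(domain, tier, region, cache_result, status_class)] += 1

    def flush(self, timestamp):
        """Return one row per tag combination and reset the counters."""
        rows = [
            {
                "metric": "elido.redirect.count",
                "timestamp": timestamp,
                "value": count,
                "tags": [
                    f"domain:{domain}", f"tier:{tier}", f"region:{region}",
                    f"cache_result:{cache}", f"status_class:{status_class}",
                ],
            }
            for (domain, tier, region, cache, status_class), count
            in sorted(self._counts.items())
        ]
        self._counts.clear()
        return rows

agg = RedirectAggregator()
for _ in range(1000):
    agg.record("link.acme.com", "b", "fra", "HIT", 302)
for _ in range(3):
    agg.record("link.acme.com", "b", "fra", "MISS", 503)

batch = agg.flush(timestamp=1_700_000_000)
print(len(batch))  # 1,003 raw events collapse into 2 rows
```

A caller would invoke `flush` on a 10-second timer and hand the rows to whatever client submits them.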

The metrics we emit out of the box:

  • elido.redirect.count - counter, tagged by domain, tier, region, cache_result, status_class (2xx/3xx/4xx/5xx)
  • elido.redirect.latency.ms - distribution, tagged by domain, tier, region, cache_result
  • elido.click.count - counter, tagged by domain, tier (deduped at the click-ingester boundary)
  • elido.scanner.failure.count - counter, tagged by reason (malware, phishing, expired_cert, dns_nxdomain)

Tags are the lever. You can pull up "p99 latency for link.acme.com in FRA during the last 4 hours" with a one-line query. You don't need to pre-build dashboards for every domain. See /integrations/datadog for the metric reference and tag taxonomy.

Datadog dashboard mockup with four panels showing p99 latency by region, error rate per domain, click volume by tier, and broken-redirect count

The four panels above are what we put on our own NOC TV. They cover the daily on-call view:

  • p99 edge latency by region - catches POP-level regressions (a Hetzner FRA blip looks different from an OVH SGP blip, and you want to see them side by side).
  • Error rate per domain, top 10 - surfaces the noisy customers. If acme.com is at 8% 5xx and everyone else is at 0.02%, you don't have an Elido problem, you have an acme problem.
  • Click volume by tier (f / s / b for free, starter, business via tier-isolation) - tells you whether a traffic spike is from a paying tenant or a free-tier campaign that should be rate-limited.
  • Broken-redirect count, last 24h - the closing-the-loop metric. A redirect that 4xx'd should either have been fixed or expired-and-removed within 24 hours; see link rotting prevention for the auto-repair path.

Recommended alert thresholds (these are our defaults; you can override per workspace):

  • elido.redirect.latency.ms p99 > 50ms sustained 5 min → page on-call
  • elido.redirect.count{status_class:5xx} rate > 0.5% sustained 2 min → page on-call
  • elido.redirect.count{status_class:4xx} rate > 5% sustained 10 min → Slack only
  • elido.scanner.failure.count rate > 10/min for a workspace → security review, no page

The 0.5% 5xx threshold is conservative. Our baseline is ~0.01% (mostly DNS hiccups on customer destinations), so 0.5% is a 50x deviation, which is real.
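Each threshold above pairs a value with a "sustained" duration. One way to evaluate that, sketched here with a hypothetical `sustained_breach` helper and made-up samples, is to walk backward from the newest sample and measure how long the value has stayed above the threshold without interruption.

```python
def sustained_breach(samples, threshold, duration_s):
    """True if the value has exceeded threshold continuously for duration_s.

    samples: list of (timestamp_s, value) tuples in ascending time order.
    """
    if not samples:
        return False
    last_ts = samples[-1][0]
    breach_start = None
    for ts, value in reversed(samples):
        if value <= threshold:
            break  # the streak of breaching samples ends here
        breach_start = ts
    return breach_start is not None and last_ts - breach_start >= duration_s

# 5xx rate in percent, sampled every 10 s (values are illustrative).
steady = [(t, 0.8) for t in range(0, 130, 10)]
with_dip = [(t, 0.1 if t == 60 else 0.8) for t in range(0, 130, 10)]

# Rule from the list above: 5xx rate > 0.5% sustained 2 min.
print(sustained_breach(steady, threshold=0.5, duration_s=120))
print(sustained_breach(with_dip, threshold=0.5, duration_s=120))
```

The first series pages and the second does not, because the dip at t=60 resets the streak. That reset is what keeps brief spikes from waking anyone up.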

When to use what

For a small team running a developer-focused product on /solutions/developers, Sentry alone is probably enough. You'll get paged on real 5xx, you'll see issues, and you'll fix them. You won't have the dashboard culture to make Datadog worth $1.50/host/month for the metric submission overhead.

For a larger company on /solutions/enterprise with an SRE on-call rotation, you want both. Sentry for the issue stream, Datadog for the dashboards, Slack alerts wired to PagerDuty for the page. The observability guide walks through the PagerDuty service mapping if you go that route.

For everyone in between, our recommendation is: Sentry on day one (free tier of Sentry is fine for under 5k events/month), Datadog when you start having more than one redirect domain or more than one region of traffic. The bill for the Datadog metric collector on a typical Elido Business workspace is around $35/month, which is the price of one engineer not having to grep nginx logs on a Sunday.

What this gets you that generic uptime monitors don't

A Pingdom or UptimeRobot check on f.elido.me will tell you whether the edge is up. It will not tell you that the destination of slug summer24 started returning DNS NXDOMAIN 12 minutes ago, or that p99 in SGP is 4x p99 in FRA because a Redpanda partition leader bounced. Redirect monitoring is a destination-aware problem. The redirect itself can be healthy while the link is dead.

The Sentry + Datadog combo above gives you destination-aware visibility without writing custom probes. Sentry tells you what's broken at the destination level. Datadog tells you what's degrading at the edge level. Slack tells humans, Linear holds the follow-up. The wiring is paste-a-DSN for Sentry and a single OAuth flow for Datadog. Start with Sentry today; add Datadog when your redirect domain count goes above one.

For pricing and what tier includes which integration, see /pricing. For the API surface around event subscriptions if you want to roll your own, /features/analytics and the Sentry-across-12-Go-services breakdown cover the event taxonomy.

