Cache strategy for URL redirects: L1 LRU and L2 Redis

The redirect tier of a URL shortener is one of the few production systems where cache strategy is the architecture. There is no other meaningful work happening on the hot path - every request resolves a key (the short slug), reads a destination URL, and emits a 301 or 302. Everything else is observability and bookkeeping. The cache is what determines whether the median request takes 800 microseconds or 12 milliseconds.

This post documents the cache strategy behind Elido's edge-redirect service: two tiers, an eviction policy chosen to optimise for tail latency rather than hit rate, a warming strategy that is more boring than it sounds, and the failure modes we have seen in 18 months of production. The cornerstone post, "Hitting p95 < 15ms for redirects from FRA, ASH, and SGP", covers the full latency budget; this is the cache-specific deep dive.

Why two tiers

The simplest cache architecture for a redirect service is a single tier: a Redis cluster between the redirect process and the origin database. Every request goes to Redis first, and only a Redis miss reaches the database. The Redis hop adds about 1ms when Redis is in the same region.

Two-tier caches add an in-process layer in front of Redis. The first tier - call it L1 - lives inside the redirect process address space. A hit at L1 returns the destination URL in a few hundred nanoseconds, no network round-trip required. A miss at L1 falls through to Redis (L2), which serves at sub-millisecond latency. A miss at L2 falls through to the origin gRPC call against the canonical Postgres database.
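The fall-through order described above can be sketched in a few lines of Go. This is a minimal illustration, not the production code: the `Tier` and `Origin` interfaces, the map-backed fakes, and the decision to back-fill L2 from the redirect process on an origin hit are all assumptions made to keep the example self-contained.

```go
package main

import "errors"

// ErrNotFound is returned when no tier knows the slug.
var ErrNotFound = errors.New("redirect: slug not found")

// Tier is a hypothetical read/write interface shared by the fakes below;
// a real L1 would wrap the in-process cache and a real L2 a Redis client.
type Tier interface {
	Get(slug string) (dest string, ok bool)
	Set(slug, dest string)
}

// Origin stands in for the gRPC call against the canonical database.
type Origin interface {
	Lookup(slug string) (dest string, err error)
}

// mapTier is an in-memory Tier used only to make the sketch runnable.
type mapTier map[string]string

func (m mapTier) Get(slug string) (string, bool) { d, ok := m[slug]; return d, ok }
func (m mapTier) Set(slug, dest string)          { m[slug] = dest }

// mapOrigin is an in-memory Origin that counts how often it is called.
type mapOrigin struct {
	rows  map[string]string
	calls int
}

func (o *mapOrigin) Lookup(slug string) (string, error) {
	o.calls++
	d, ok := o.rows[slug]
	if !ok {
		return "", ErrNotFound
	}
	return d, nil
}

// Resolve checks L1, then L2, then the origin, and back-fills the faster
// tiers on the way out so the next request for the slug stops earlier.
func Resolve(l1, l2 Tier, origin Origin, slug string) (string, error) {
	if d, ok := l1.Get(slug); ok {
		return d, nil
	}
	if d, ok := l2.Get(slug); ok {
		l1.Set(slug, d)
		return d, nil
	}
	d, err := origin.Lookup(slug)
	if err != nil {
		return "", err
	}
	l2.Set(slug, d)
	l1.Set(slug, d)
	return d, nil
}
```

Calling `Resolve` twice for the same slug touches the origin only once; the second call is answered by the L1 fake.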

The choice between one tier and two is essentially a question of how flat your tail latency needs to be. Redis is fast but it is not free. A 1ms p50 to Redis becomes a 4-6ms p99 under load, and the p99.9 can exceed 20ms when the network has any contention. For an SLO that targets p95 < 15ms, every Redis hit is consuming a meaningful fraction of the budget. For p99.9 < 50ms, the Redis tail is the dominant contributor.

An in-process LRU absorbs the highest-frequency keys, the ones that drive most of the traffic. At Elido's traffic distribution, the top 1000 short links by request volume account for over 70% of redirect requests. Those keys are easy to serve in-process; the long tail can fall through to Redis without degrading the p95.

Figure: Flow diagram of the two-tier redirect cache. 98% of requests hit the in-process L1 LRU, 1.8% fall through to the L2 Redis cluster, and 0.2% reach the origin gRPC database.

L1: a per-process LRU

The L1 cache uses Ristretto, the same admission-policy LRU used by Caddy and by Dgraph. We picked it for three reasons:

Concurrent reads scale linearly with CPU cores. A simpler sync.Map cache pegs at about 4M ops/sec on a typical edge POP machine; Ristretto sustains 30M+ in our benchmarks.
TinyLFU admission policy prevents one-shot scan workloads from evicting hot keys. A bot crawl that touches 10,000 unique slugs once each does not displace the genuinely popular links from the cache.
Bounded memory rather than bounded key count. We can set "use up to 256MB" rather than "store up to 100,000 entries", which is the configuration that matters for capacity planning.

The configuration we ship is:

cache, err := ristretto.NewCache(&ristretto.Config{
    NumCounters: 10_000_000, // 10M counters → tracks ~1M items
    MaxCost:     256 << 20,   // 256MB
    BufferItems: 64,
    Metrics:     true,
})

NumCounters is the TinyLFU frequency-tracking table size; the rule of thumb in the Ristretto docs is 10× the expected item count. With a 256MB budget and average link record at 200 bytes, the cache holds about 1.3M entries when full.
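The sizing arithmetic can be checked with a small helper. This is a sketch under two assumptions: every entry is stored with a cost equal to its size in bytes (Ristretto only treats MaxCost as a byte budget if the cost passed on each write is a byte count), and the 200-byte average holds. The function names are illustrative.

```go
package main

// EstimateEntries returns how many items fit in a byte budget, assuming
// every item is stored with a cost equal to its size in bytes.
func EstimateEntries(maxCostBytes, avgItemBytes int64) int64 {
	if avgItemBytes <= 0 {
		return 0
	}
	return maxCostBytes / avgItemBytes
}

// CountersFor applies the 10x rule of thumb for sizing the frequency table
// from an expected item count.
func CountersFor(expectedItems int64) int64 {
	return expectedItems * 10
}
```

With the shipped values, `EstimateEntries(256<<20, 200)` gives 1,342,177 entries, which is the "about 1.3M" figure quoted above, and `CountersFor(1_000_000)` gives the 10M counters in the configuration.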

The TTL on L1 entries is 60 seconds. This is deliberately short. A redirect can have its destination changed in the dashboard at any time, and the L1 cache is the slowest layer to invalidate (the API can overwrite a Redis entry directly; L1 lives in each process and needs a coordinated invalidation path).

A 60-second TTL means worst-case staleness is 60 seconds after a destination update. For most use cases this is acceptable; for the use cases where it is not (immediate destination changes during a live campaign), the dashboard's invalidation button issues a fanout that purges all L1 caches across the fleet. The fanout uses Redis pub/sub on a channel each edge process subscribes to at startup.
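To make the staleness bound concrete, here is a deliberately simple stand-in for L1: a mutex-guarded map with per-entry expiry and an `Invalidate` method that a pub/sub handler could call. It is not Ristretto and has no size bound or admission policy; `ttlCache`, `manualClock`, and `l1TTL` are names invented for this sketch, and the clock is injected only so expiry can be exercised without sleeping.

```go
package main

import (
	"sync"
	"time"
)

// l1TTL mirrors the 60-second TTL described in the text.
const l1TTL = 60 * time.Second

type entry struct {
	dest    string
	expires time.Time
}

// manualClock is a hand-advanced clock used to demonstrate expiry.
type manualClock struct{ t time.Time }

func (m *manualClock) Now() time.Time          { return m.t }
func (m *manualClock) Advance(d time.Duration) { m.t = m.t.Add(d) }

// ttlCache is a toy L1: a map with per-entry expiry, nothing more.
type ttlCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	now  func() time.Time
	data map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, data: make(map[string]entry)}
}

// Set stores dest and stamps it with an expiry of now + ttl.
func (c *ttlCache) Set(slug, dest string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[slug] = entry{dest: dest, expires: c.now().Add(c.ttl)}
}

// Get returns the cached destination unless the entry is missing or expired.
func (c *ttlCache) Get(slug string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.data[slug]
	if !ok || !c.now().Before(e.expires) {
		delete(c.data, slug)
		return "", false
	}
	return e.dest, true
}

// Invalidate drops one slug; a pub/sub message handler would call this.
func (c *ttlCache) Invalidate(slug string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.data, slug)
}
```

An entry written at time zero is still served just before the 60-second mark and is a miss at or after it, which is the worst-case staleness window when an invalidation message never arrives.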

L2: Redis cluster with read replicas

L2 is a Redis cluster, deployed in each region (Frankfurt, Ashburn, Singapore). Reads go to local replicas; writes go to the regional primary and replicate asynchronously under Redis's standard replication model.

The data format is small. A redirect record at L2 looks like:

KEY:   redirect:f.elido.me:abc123
VALUE: {"d":"https://shop.example.com/spring","f":0,"v":12}

Three fields: destination URL, flags (bot-filtering enabled, password required, etc., packed into a uint16), and version. The version is the row version from Postgres; it lets us detect stale cache entries on read.
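A possible Go mapping for this record is shown below. The struct, the key helper, and the staleness check follow the format above; the individual flag bit positions are hypothetical, since the post only says that flags are packed into a uint16.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Flag bits are hypothetical; only the uint16 packing comes from the text.
const (
	FlagBotFilter uint16 = 1 << iota
	FlagPasswordRequired
)

// Record mirrors the three-field L2 value.
type Record struct {
	Dest    string `json:"d"`
	Flags   uint16 `json:"f"`
	Version int64  `json:"v"`
}

// Key builds the L2 key for a short domain and slug.
func Key(domain, slug string) string {
	return fmt.Sprintf("redirect:%s:%s", domain, slug)
}

// Decode parses a raw L2 value into a Record.
func Decode(raw []byte) (Record, error) {
	var r Record
	if err := json.Unmarshal(raw, &r); err != nil {
		return Record{}, fmt.Errorf("decode redirect record: %w", err)
	}
	return r, nil
}

// IsStale reports whether the record is older than a known row version.
func (r Record) IsStale(currentVersion int64) bool {
	return r.Version < currentVersion
}

// Has reports whether a flag bit is set.
func (r Record) Has(flag uint16) bool { return r.Flags&flag != 0 }
```

Decoding the example value yields destination `https://shop.example.com/spring`, flags 0, and version 12; a reader that knows the row is at version 13 would treat that record as stale.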

The TTL at L2 is 24 hours. This is much longer than L1 because L2 has a working invalidation path: when a link is created or updated in the origin database, the API layer overwrites the L2 entry directly. It then publishes a Redis pub/sub message to the regional invalidation channel, and the redirect processes evict their L1 entries.

The pub/sub invalidation has a subtle property: it is lossy. If a redirect process is restarting when the invalidation message is published, it does not see the message and its L1 cache may serve the stale value for up to 60 seconds. We accept this because the TTL is the backstop - the staleness is bounded.

The Redis cluster size at each POP is small. Frankfurt runs three primary nodes plus three replicas; the total dataset fits in about 4GB. At our cache hit rate (98% L1, 1.8% L2, 0.2% origin under normal load), the throughput requirement on Redis is moderate - usually 5-15k ops/sec at peak per POP, well within a single primary node's capacity if we had to consolidate.

The eviction policy choice

Ristretto's TinyLFU admission policy is the choice that matters most for tail latency.

A naïve LRU evicts the least-recently-used key whenever it needs to make room. That is fine when the access pattern is reasonably uniform - the keys that were most recently used are the ones most likely to be used again. It falls apart under two specific patterns:

Scan workloads. A bot crawl that hits 50,000 unique slugs in rapid succession will, under naïve LRU, evict every hot key and replace them with crawl keys that will never be accessed again. The cache hit rate drops, the origin sees a load spike, and the p95 jumps because most requests are now hitting the slow path.
Bursty hot keys. A link that is normally cold but suddenly receives 100k requests in 30 seconds (a viral social post, a TV campaign drop) needs to be cached fast. Under naïve LRU, it will displace one of the existing hot keys.

TinyLFU handles both. The admission policy tracks key frequencies and only admits a new key to the cache if it is more frequent than the eviction candidate. A one-shot bot crawl does not displace the hot keys because the crawl keys have a frequency count of 1. A bursty hot key does enter the cache, but only after its frequency exceeds the eviction candidate's - which happens within a few hundred requests.
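The admission rule itself is small enough to sketch. The version below is a simplification for illustration, not Ristretto's implementation: it keeps exact per-key counts in a map with no aging, whereas TinyLFU uses a compact approximate frequency structure that is periodically reset. The type and method names are invented for the example.

```go
package main

// admission is a toy frequency-based admission filter with exact counts.
type admission struct {
	freq map[string]int
}

func newAdmission() *admission {
	return &admission{freq: make(map[string]int)}
}

// Record notes one access to key, whether or not the key is cached.
func (a *admission) Record(key string) {
	a.freq[key]++
}

// Admit reports whether candidate should replace victim: only when the
// candidate has been seen more often than the entry it would evict.
func (a *admission) Admit(candidate, victim string) bool {
	return a.freq[candidate] > a.freq[victim]
}
```

A crawl key seen once never beats a hot key seen hundreds of times, so the scan cannot displace it. A newly viral key is rejected until its count passes the eviction candidate's, which is where the slow first few hundred requests come from.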

Figure: Side-by-side comparison under a 50,000-slug bot scan. Naive LRU evicts hot keys and drops to a 71% hit rate, while Ristretto TinyLFU rejects the one-shot scan keys and holds the hit rate near 98%.

The cost is that the first 100-500 requests for a newly-popular link are slow (fall through to L2 or origin) until the admission policy decides to cache it. For most use cases this is the right tradeoff; for campaigns where we know in advance that a link will spike, we have a pre-warm endpoint described below.

Cache warming

The L2 cache cold-starts when a new Redis cluster comes online. We do not warm it from a snapshot; the first 5 minutes after a cluster restart see elevated origin traffic until the cache fills naturally.

The L1 cache cold-starts when a redirect process restarts (deploys, OOM kills, scale-up). The first 30 seconds after a process restart see most requests fall through to L2; the next 60 seconds see L1 fill to its hot-key working set. Total cold-start contribution to origin load is small (a typical edge process runs for far longer than the cache TTL between restarts).

The exception: when a campaign manager pre-publishes a link they know will spike - a TV-ad URL, a press-release URL, a launch announcement - the dashboard offers a "pre-warm" toggle. Toggling it issues a no-op redirect against the edge-redirect service at every POP, which populates L1 in advance. This is unglamorous and rarely necessary; the autoscaler handles unanticipated traffic spikes adequately. The pre-warm is the answer to anticipated spikes where the first 60 seconds of cold-cache latency would be visible.

What happens at L1 capacity

A 256MB L1 cache fills in less than a minute on a typical edge POP. Once full, every new key requires the TinyLFU admission policy to decide whether it should evict an existing key.

The interesting observation: at our distribution, the L1 hit rate plateaus around 98% once warm. The 2% miss rate is the long tail - the ~30% of links that account for less than 30% of traffic and therefore do not pass the TinyLFU frequency threshold. These miss at L1 and are served by L2 (1.8% of total requests). The remaining 0.2% of total requests fall through to the origin.

We measured this distribution at three workload shapes - heavy bot traffic, viral spike, steady state - and the L1 hit rate fluctuates between 95% and 99%. The L2 hit rate is more stable at 98-99.5%. Total origin load from the redirect tier is therefore bounded at about 0.5% of incoming request volume, which is the number that matters for origin capacity planning.
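The bound can be sanity-checked with one multiplication. The helper below assumes the L2 hit rate is measured over requests that already missed L1, which is an interpretation rather than something the measurements above state explicitly.

```go
package main

// OriginFraction returns the share of all requests that reach the origin,
// given the L1 hit rate and the L2 hit rate measured on L1 misses.
func OriginFraction(l1Hit, l2Hit float64) float64 {
	return (1 - l1Hit) * (1 - l2Hit)
}
```

At the low end of both measured ranges, `OriginFraction(0.95, 0.98)` is 0.05 × 0.02 = 0.001, or about 0.1% of incoming requests, which sits inside the 0.5% planning bound.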

Cache invalidation in detail

The invalidation flow is the part most often misunderstood by anyone reading the architecture from outside. The detail:

When the API receives a PATCH /v1/links/{id} that changes the destination URL, three things happen in order:

Postgres commits the change with the new row version (UPDATE links SET destination = ?, version = version + 1 WHERE id = ?).
Redis is written directly with the new value at every regional Redis cluster. The write fans out from the API to each region's Redis through a write-through layer.
Pub/sub invalidation is published on each regional invalidate:redirect channel. Edge redirect processes subscribe to this channel at startup and evict the L1 entry for the key.

Figure: Ordered invalidation pipeline for a destination update. Postgres commits the new row version, Redis is written through across regions, and pub/sub publishes the L1 eviction, with the 60-second TTL as backstop.

The ordering matters. Postgres-first ensures the canonical store has the new value. Redis-write-through-before-publish ensures any process that misses the publish but reads from Redis sees the new value. The publish is the optimisation that keeps L1 in sync; the TTL is the backstop if a publish is missed.
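One way to express that ordering in code is a function that runs the three steps in sequence and stops at the first failure. The interfaces, the `recorder` fake, and the choice to finish every regional write before publishing anywhere are assumptions made for this sketch; they are not Elido's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RowStore is a hypothetical seam over the Postgres update.
type RowStore interface {
	// UpdateDestination commits the change and returns the new row version.
	UpdateDestination(id, dest string) (version int64, err error)
}

// Region is a hypothetical seam over one region's Redis.
type Region interface {
	WriteThrough(id, dest string, version int64) error
	PublishInvalidate(id string) error
}

// UpdateLink commits first, then writes every region's L2, then publishes.
// No invalidation is published until all L2 writes have succeeded.
func UpdateLink(db RowStore, regions []Region, id, dest string) error {
	version, err := db.UpdateDestination(id, dest)
	if err != nil {
		return fmt.Errorf("commit: %w", err)
	}
	for i, r := range regions {
		if err := r.WriteThrough(id, dest, version); err != nil {
			return fmt.Errorf("write-through region %d: %w", i, err)
		}
	}
	for i, r := range regions {
		if err := r.PublishInvalidate(id); err != nil {
			return fmt.Errorf("publish region %d: %w", i, err)
		}
	}
	return nil
}

// recorder is a fake that logs the order of calls; fail names a step to break.
type recorder struct {
	log  []string
	fail string
}

func (r *recorder) step(name string) error {
	if name == r.fail {
		return errors.New(name + " failed")
	}
	r.log = append(r.log, name)
	return nil
}

func (r *recorder) UpdateDestination(id, dest string) (int64, error) {
	return 13, r.step("commit")
}
func (r *recorder) WriteThrough(id, dest string, v int64) error { return r.step("write") }
func (r *recorder) PublishInvalidate(id string) error           { return r.step("publish") }
func (r *recorder) String() string                              { return strings.Join(r.log, ",") }
```

With two regions, a successful update records `commit,write,write,publish,publish`. If a write-through fails, nothing is published, so no process is told to evict an entry whose replacement is not yet in Redis.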

The known race is between a redirect process reading from Redis (because of an L1 miss) and a concurrent update. Because Redis is written before the publish, a read that lands after the write returns the new value; a read that lands just before it returns the old value. If that old value is stored in L1 after the process has already handled the eviction message, the process may serve it for up to 60 seconds, until the TTL expires. This is acceptable; the alternative - a synchronous lock around the read-publish race - adds latency to every request to avoid an edge case that affects under 0.01% of invalidations.

For use cases where the staleness window is unacceptable (a destination URL is being taken down for legal reasons, a destination is suddenly malicious), the dashboard's "purge cache" action issues an aggressive invalidation: it pauses all L1 reads for 100ms across the fleet, evicts the key from every L1, then resumes. This is rarely used and is subject to a per-second rate limit.

Failure modes we have actually seen

Three failures from the 18-month production history that are worth documenting because they shaped the current configuration.

Redis primary failover with stale replicas. In month 4 of production, a primary node in the Frankfurt cluster failed. The replica was promoted within 30 seconds (Sentinel-driven failover). The replicas had been about 200ms behind the primary at the moment of failure, which meant the few hundred updates written in the moments before the failover never reached the promoted replica. Result: a brief window where about 0.3% of redirects served stale destinations. Resolution: we now set min-replicas-to-write 1 and min-replicas-max-lag 10 (seconds), which trades a small write-availability hit for a tighter replication lag guarantee.

L1 cache thrashing during a synthetic monitoring scan. In month 9, a third-party uptime monitoring service was misconfigured to probe every short link in a customer's workspace once per minute. The customer had 18,000 short links. The probe pattern was a complete scan every 60 seconds. Effect: the L1 cache hit rate dropped from 98% to 71% on three edge POPs because the scan pattern admitted every probed key to the cache. Resolution: we added User-Agent-based filtering before the cache admission layer - requests from known monitoring User-Agents bypass L1 and are served from L2 directly. This was a TinyLFU edge case: the scan keys looked frequent enough to displace genuinely hot keys.

Pub/sub disconnect during a long-running deploy. In month 13, a deploy that took longer than expected (about 4 minutes) caused several edge processes to stay subscribed through the old Redis primary after it had failed over. Invalidations published to the new primary did not reach those processes; their L1 caches served stale values for the duration of the deploy. Resolution: pub/sub connection heartbeats with auto-reconnect on missed heartbeats, and a deploy-time L1 flush as a precaution.
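The User-Agent filter from the monitoring-scan incident can be sketched as a small predicate that runs before L1 admission. The substrings below are an illustrative list chosen for the example, not the production list, and `BypassL1` is a hypothetical name.

```go
package main

import "strings"

// monitoringAgents is an illustrative list of User-Agent substrings.
var monitoringAgents = []string{"uptimerobot", "pingdom", "statuscake"}

// BypassL1 reports whether a request should skip L1 and be answered from L2,
// so that probe traffic never competes with real traffic for L1 admission.
func BypassL1(userAgent string) bool {
	ua := strings.ToLower(userAgent)
	for _, m := range monitoringAgents {
		if strings.Contains(ua, m) {
			return true
		}
	}
	return false
}
```

A substring match is crude and easy to spoof, but spoofing only costs the spoofer an L1 hit; the check exists to protect the cache from well-behaved scanners, not to authenticate anyone.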

What we considered and rejected

A few alternatives evaluated and not chosen:

A single in-process cache, no Redis. Tested. The miss-to-origin rate at any single process is too high without an L2; the origin database would need 3-5× more capacity. The marginal cost of Redis is small relative to the origin-capacity savings.

A CDN like Cloudflare or Fastly for redirect caching. Tested in staging. The CDN's 1-2ms regional latency on a cache hit is roughly the same as Redis, but the invalidation story is materially worse (CDN purges have minute-scale latency and per-URL purge costs). The CDN added complexity without improving the latency or the hit rate.

A larger L1. The 256MB budget is sized to the per-process memory envelope; doubling it does not double the hit rate because the hot working set already fits. The diminishing returns kick in at about 128MB on our distribution; 256MB has headroom for traffic growth.

Observability

The metrics we track per edge process:

cache_l1_hit_total, cache_l1_miss_total - derived hit rate per process.
cache_l2_hit_total, cache_l2_miss_total - derived hit rate per region.
cache_origin_request_total - origin request volume; the SLO target is < 1% of total requests.
cache_invalidation_total{source="pubsub|ttl|purge"} - counts of invalidation by mechanism.
cache_l1_memory_bytes - actual memory used by the L1 cache; alerted at 90% of the configured budget.

All metrics are scraped by Prometheus and visualised in the observability guide's dashboard set. The Grafana dashboards at the regional level show the regional cache hit rate over time; the per-process dashboards (used during incidents) show the per-process L1 hit rate and memory usage.
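The derived values behind those dashboards and alerts are simple ratios. The helpers below are a sketch of that arithmetic using the thresholds named above (90% of the memory budget, origin traffic under 1% of requests); the function names are invented, and in practice the same expressions would more likely live in Prometheus queries than in application code.

```go
package main

// HitRate derives a hit rate from a hit counter and a miss counter.
func HitRate(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

// MemoryAlert reports whether L1 usage has reached 90% of its budget.
func MemoryAlert(usedBytes, budgetBytes uint64) bool {
	return usedBytes*10 >= budgetBytes*9
}

// OriginWithinSLO reports whether origin requests are under 1% of the total.
func OriginWithinSLO(originRequests, totalRequests uint64) bool {
	return originRequests*100 < totalRequests
}
```

For a 256MB budget, `MemoryAlert` fires from 231MB upward, and a process sending 2 of every 1000 requests to the origin passes `OriginWithinSLO` with room to spare.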

When to use this strategy and when not

A two-tier cache makes sense when:

The workload is read-heavy with a long-tailed key distribution.
The hot working set fits in per-process memory (a few hundred megabytes).
Cache misses are expensive enough that the second tier saves database load.
The staleness budget is tight enough that L1's TTL alone is not acceptable.

It does not make sense when:

The hot working set does not fit in process memory. In that case the L1 misses fall through to L2 frequently enough that L1 contributes little.
Writes are frequent relative to reads. The invalidation cost dominates.
The data is per-request unique (no benefit from caching at all).

For the URL shortener workload, all four "yes" conditions hold and the configuration above has held up across 18 months of production growth. For other workloads, the tier count and the eviction policy need re-evaluation.

Related reading

Hitting p95 < 15ms for redirects from FRA, ASH, and SGP - the cornerstone for the engineering cluster; this post is the cache-specific deep dive.
Why we use ClickHouse for click analytics (not Postgres) - adjacent engineering decision in the same architecture.
Fire-and-forget click ingestion with Redpanda - the click-event pipeline that runs alongside the redirect cache.
Short links as Terraform - the operational walkthrough for the redirect-tier configuration.
Edge architecture: /docs/architecture/edge-redirect.
Operational guide: /docs/guides/observability - the metrics dashboard set referenced above.
Product surface: /solutions/developers and /solutions/analytics.
External: Ristretto's design paper and the TinyLFU paper for the admission-policy theory.
