URL Shortener — Instant Recall Notes


The five step framework — apply to every HLD problem

1. Clarify requirements      never draw anything first
2. Capacity estimation       numbers drive component decisions
3. High level components     skeleton first, complexity only where needed
4. Deep dive                 interviewer picks, you go deep
5. Non-functional            scalability, availability, consistency

Capacity estimation — the mechanics

Always start here. Every component decision references these numbers.

The one number to memorise: 86,400 — seconds in a day.

Writes:
100M URLs/day ÷ 86,400 = ~1,200 writes/sec
→ single primary DB handles this fine

Reads:
100M links * 10 clicks avg = 1B reads/day
1B ÷ 86,400 = ~11,500 reads/sec
→ DB cannot handle this alone → need cache

Read/write ratio: 10:1 → read dominant → cache is mandatory

Storage:
100M records/day * 365 * 5 years = 182B records
182B * 500 bytes = ~91TB
→ single node viable
→ raise TTL/archiving strategy for inactive links
→ 91TB tells you sharding is a future conversation not immediate
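
The same arithmetic as code, for a quick sanity check (constants are the estimates above):

// back-of-envelope check of the capacity numbers
const long UrlsPerDay     = 100_000_000;   // 100M writes/day
const long SecondsPerDay  = 86_400;
const long ClicksPerUrl   = 10;
const long BytesPerRecord = 500;

long writesPerSec = UrlsPerDay / SecondsPerDay;                    // ≈1,157 → ~1,200
long readsPerSec  = UrlsPerDay * ClicksPerUrl / SecondsPerDay;     // ≈11,574 → ~11,500
long records      = UrlsPerDay * 365 * 5;                          // 182.5B over 5 years
long storageTB    = records * BytesPerRecord / 1_000_000_000_000;  // ≈91TB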

What each number drives:

Number                  Decision it drives
1,200 writes/sec        Single primary DB is fine
11,500 reads/sec        Cache is mandatory
10:1 read/write ratio   Read path needs most attention
91TB over 5 years       Single node viable, raise archiving
182B records needed     7-character Base62 sufficient

Base62 vs Base64 vs Hash — the decision

Why not Base64:

Base64 includes + and /. Both break URLs.

+ = space in URL encoding
/ = path separator

You'd need URL encoding — %2B, %2F — making codes ugly and longer. Base64 is designed for binary-to-text encoding, not URLs. Wrong tool.

Fix exists — Base64URL (RFC 4648) swaps + for - and / for _. Safe but you've now got 64 characters instead of 62 for no meaningful benefit over Base62.

Why not Hash:

Three problems:

  1. Collisions — two different URLs can produce the same first 7 characters of an MD5/SHA256 hash. Must detect, handle, retry. Complexity for no benefit.

  2. Deterministic — same URL always produces same code. Two campaign links to the same destination URL get the same short code. Analytics breaks — can't distinguish clicks between campaigns.

  3. Always fixed length — no efficiency at low scale, always 7 characters whether you have 100 URLs or 100 billion.

Why Base62:

a-z = 26
A-Z = 26
0-9 = 10
    = 62 URL-safe characters, no encoding needed
62^7 = 3.5 trillion combinations
3.5T ÷ 100M per day = 35,000 days coverage (~95 years)

Auto-increment to Base62 — the mechanics

The confusion point: DB always stores integers. Base62 is application-level encoding on top. Two separate concerns.

Base62 is positional notation like base 10, just with 62 digits instead of 10. Same repeated-division algorithm you use to convert decimal to binary.

Alphabet: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Position: 0         1         2         3         4         5         6
          01234567890123456789012345678901234567890123456789012345678901

Conversion example — ID 1000 to Base62:

1000 ÷ 62 = 16 remainder 8  → Alphabet[8]  = '8'
  16 ÷ 62 =  0 remainder 16 → Alphabet[16] = 'g'
Read bottom to top: "g8"

Verify:

g = position 16
8 = position 8
16 * 62¹ + 8 * 62⁰ = 992 + 8 = 1000 ✓

using System.Text;   // StringBuilder

private static readonly string Alphabet =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

public static string Encode(long id)
{
    if (id == 0) return "0";
    var result = new StringBuilder();
    while (id > 0)
    {
        result.Insert(0, Alphabet[(int)(id % 62)]);
        id /= 62;
    }
    return result.ToString();
}
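
The inverse, for completeness: a minimal Decode sketch over the same Alphabet (only needed if you never store the code column, see the read path below):

public static long Decode(string code)
{
    long id = 0;
    foreach (var c in code)
        id = id * 62 + Alphabet.IndexOf(c);   // shift left one base-62 digit, add digit value
    return id;
}

// Decode("g8") = 16 * 62 + 8 = 1000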

The write path — full flow

The confusion point: when does the code get saved back to the DB?

POST /shorten { url: "https://very-long-url.com" }

Step 1  Insert URL, get ID back in same round trip
 INSERT INTO urls (url, created_at) VALUES (@url, now()) RETURNING id
 id = 1000

Step 2  Encode in application
 Base62.Encode(1000) = "g8"

Step 3  Store code back
 UPDATE urls SET code = 'g8' WHERE id = 1000

Step 4  Return "g8" to client

Two DB operations. RETURNING id avoids a separate SELECT round trip.
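
The same flow as a sketch. Npgsql and the _connectionString field are assumptions here; any data access layer works, and the id column is assumed bigint:

public async Task<string> CreateAsync(string longUrl)
{
    await using var conn = new NpgsqlConnection(_connectionString);
    await conn.OpenAsync();

    // Step 1: insert URL, get ID back in the same round trip
    await using var insert = new NpgsqlCommand(
        "INSERT INTO urls (url, created_at) VALUES (@url, now()) RETURNING id", conn);
    insert.Parameters.AddWithValue("url", longUrl);
    var id = Convert.ToInt64(await insert.ExecuteScalarAsync());

    // Step 2: encode in the application
    var code = Base62.Encode(id);

    // Step 3: store the code back
    await using var update = new NpgsqlCommand(
        "UPDATE urls SET code = @code WHERE id = @id", conn);
    update.Parameters.AddWithValue("code", code);
    update.Parameters.AddWithValue("id", id);
    await update.ExecuteNonQueryAsync();

    // Step 4: return the code to the client
    return code;
}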

Two valid designs:

Design                 Write                    Read
Store integer only     encode ID, return code   decode code → lookup by ID
Store code + integer   encode ID, store code    lookup by code directly

Store code + integer is simpler on the read path. No decoding needed.


The read path — you do NOT decode

The confusion point: you never decode the short code on GET requests.

Decoding gives you the integer ID. You could then lookup by ID. But you already stored the code — just look up by code directly. Simpler, one less step.

GET /g8
→ SELECT url FROM urls WHERE code = 'g8'   ← direct lookup, no decode
→ redirect

Decoding is only necessary if you never stored the code column — then you look up by the decoded integer ID instead.
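
As a minimal API endpoint sketch (IUrlRepository is the abstraction from the cache-aside section below):

app.MapGet("/{code}", async (string code, IUrlRepository repo) =>
{
    var longUrl = await repo.GetLongUrlAsync(code);   // direct lookup by code, no decode
    return longUrl is null
        ? Results.NotFound()
        : Results.Redirect(longUrl);                  // 302 by default, see 301 vs 302 below
});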


Cache population and eviction — LRU is automatic

The confusion point: you do not run a separate service to decide what's hot and manually populate the cache.

LRU handles it automatically through access patterns.

Every cache miss  populate Redis
Every cache hit   Redis internally refreshes LRU timestamp for that key
Cache full        Redis evicts key with oldest LRU timestamp automatically

Hot links stay cached — they keep getting hit, LRU timestamp keeps refreshing. Cold links get evicted — nobody clicks them, LRU timestamp gets stale.

You write zero eviction logic. Redis handles it entirely.
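
The only setup involved is Redis configuration, not application code. A redis.conf sketch, values illustrative:

# evict the least recently used keys across all keys once memory is full
maxmemory 2gb
maxmemory-policy allkeys-lru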

Cache warming — optional optimisation on top. Background job pre-loads top N clicked links into Redis before they get a cache miss. Eliminates cold start after TTL expiry. Not a core requirement.

Analytics service — separate concern entirely. Records click counts, geography, referrer for user-facing reporting. Has nothing to do with cache management.

LRU eviction   → automatic, Redis internal, zero code
Cache populate → on cache miss, your application code
Cache warming  → optional background job, uses analytics data
Analytics      → separate service, user-facing click reporting

301 vs 302

                          301 Permanent                  302 Temporary
Browser caches            Yes — permanently              No
Every click hits server   No                             Yes
Analytics possible        No — browser bypasses server   Yes
If destination changes    Browser ignores update         Always gets latest

Analytics required → always 302. Every click must pass through your server.


Analytics off the critical path

Never block the redirect on analytics recording.

GET /g8
  cache hit → long URL retrieved
  publish ClickEvent to queue   ← async, fire and forget
  return 302 immediately        ← user redirected, done

Analytics Service (separate)
  consumes ClickEvent
  increments click count
  stores geography, referrer, timestamp
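
The fire-and-forget shape as a sketch. _repository, _publisher and ClickEvent are hypothetical names; the point is that the publish is never awaited before the redirect:

public async Task<IResult> RedirectAsync(string code)
{
    var longUrl = await _repository.GetLongUrlAsync(code);
    if (longUrl is null) return Results.NotFound();

    // fire and forget: the redirect never waits on analytics delivery
    _ = _publisher.PublishAsync(new ClickEvent(code, DateTime.UtcNow));

    return Results.Redirect(longUrl);   // 302
}

In practice the publisher writes to an in-process channel or broker client that handles retries; a bare discarded task swallows failures silently.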

Redis in .NET — canonical pattern

IDistributedCache — canonical for simple get/set/remove with TTL. Abstraction over Redis. Swappable in tests with in-memory implementation.

StackExchange.Redis — when you need Redis-specific features: counters, sorted sets, pub/sub, Lua scripts.

Critical: IConnectionMultiplexer is always singleton. Manages connection pool internally. Creating it per request destroys performance.

// WRONG
using var redis = await ConnectionMultiplexer.ConnectAsync("localhost");

// CORRECT
builder.Services.AddSingleton<IConnectionMultiplexer>(
    ConnectionMultiplexer.Connect("localhost"));

Cache-aside decorator pattern

The pattern: repository owns cache logic. Service above it has zero cache knowledge.

Service → IUrlRepository → result
               ↑
    CachedUrlRepository      ← decorator, adds cache
               ↑
       UrlRepository         ← DB only, no cache knowledge

public class CachedUrlRepository : IUrlRepository
{
    private readonly IUrlRepository _inner;
    private readonly IDistributedCache _cache;

    public CachedUrlRepository(IUrlRepository inner, IDistributedCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<string?> GetLongUrlAsync(string code)
    {
        var cached = await _cache.GetStringAsync(code);
        if (cached != null) return cached;

        var longUrl = await _inner.GetLongUrlAsync(code);
        if (longUrl == null) return null;

        await _cache.SetStringAsync(code, longUrl, new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24)
        });

        return longUrl;
    }

    public async Task<string> CreateAsync(string longUrl)
    {
        // writes pass straight through — no cache on write
        return await _inner.CreateAsync(longUrl);
    }
}

Registration:

builder.Services.AddScoped<UrlRepository>();
builder.Services.AddScoped<IUrlRepository>(sp =>
    new CachedUrlRepository(
        sp.GetRequiredService<UrlRepository>(),
        sp.GetRequiredService<IDistributedCache>()));

Why no cache on write: link just created, nobody has the code yet, cache hit rate is zero. Populate on first read — lazy population. Cache-aside pattern.


Full read path end to end

GET /g8
  UrlShortenerService.RedirectAsync("g8")
  CachedUrlRepository.GetLongUrlAsync("g8")
  Redis hit?
   yes  return longUrl
          publish ClickEvent (async)
          302 redirect

   no   UrlRepository.GetLongUrlAsync("g8")
          DB query: SELECT url FROM urls WHERE code = 'g8'
          found     SET Redis 'g8' = longUrl, TTL 24hrs
                     publish ClickEvent (async)
                     302 redirect
          not found  404

URL Shortener — Gaps and Weak Points Recap


Concepts you arrived at correctly without prompting

These are solid. Don't second guess them in interviews.

  • Storage estimation drives archiving strategy not just infrastructure
  • LRU eviction is automatic — you don't run a service to decide what's hot
  • Analytics and cache are separate concerns entirely
  • Callback endpoint is just a GET — no UI needed
  • BFF only decrypts cookie when it needs to call a downstream resource
  • SignInAsync is the correct way to establish session — not manual cookie append
  • Database-per-tenant is not sharding — different intent, same mechanism
  • Sharding is a last resort — single node first, replicas second, shard only when genuinely needed

Gaps identified — things that needed correction or didn't land first time


Gap 1 — Storage estimation

You understood the calculation but didn't connect it to a decision.

Storage estimation drives:

  • Single node viable or not
  • Whether archiving/TTL strategy needs to be raised
  • Whether sharding is a near-term concern

At 91TB — single node viable, raise TTL for inactive links. That's the answer. If it came out at 500TB+ — sharding conversation starts immediately.

Remember: always state what the storage number means, not just what it is.


Gap 2 — Base62 vs Base64

You didn't know why Base64 fails for URLs.

Base64 includes + and /
+ = space in URL encoding
/ = path separator
Both break URL parsing

Base64URL fixes this with - and _ substitution but gives you 64 characters for no meaningful benefit over Base62's 62. Base62 is the standard because it's URL-safe by construction.


Gap 3 — When you decode the short code

Initial instinct was to decode on every GET request.

You never decode unless you never stored the code column. If you stored the code — look up by code directly. Decoding just to get the ID to look up by ID is an unnecessary extra step.

Stored code column → SELECT WHERE code = 'g8'        ← direct
No code column     → decode 'g8' → 1000 → SELECT WHERE id = 1000

Gap 4 — When the code gets written back to the DB

You didn't see where the code gets stored after the ID is generated.

INSERT RETURNING id → encode id → UPDATE SET code = @code WHERE id = @id

Two DB operations on write. RETURNING id avoids a separate SELECT round trip. Alternatively — Snowflake ID generator removes the chicken-and-egg problem entirely by generating the ID before touching the DB.
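
A compact Snowflake-style generator sketch, assuming the common 41-bit timestamp / 10-bit machine ID / 12-bit sequence layout (epoch and locking are illustrative choices):

public class SnowflakeIdGenerator
{
    private static readonly DateTime Epoch = new(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    private readonly object _lock = new();
    private readonly long _machineId;   // 10 bits: up to 1,024 nodes
    private long _lastTimestamp = -1;
    private long _sequence;             // 12 bits: 4,096 IDs per ms per node

    public SnowflakeIdGenerator(long machineId) => _machineId = machineId;

    public long NextId()
    {
        lock (_lock)
        {
            var ts = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
            if (ts == _lastTimestamp)
            {
                _sequence = (_sequence + 1) & 0xFFF;   // wrap within 12 bits
                if (_sequence == 0)                    // this millisecond exhausted
                    while (ts <= _lastTimestamp)       // spin to the next millisecond
                        ts = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
            }
            else
            {
                _sequence = 0;
            }
            _lastTimestamp = ts;
            // 41 bits timestamp | 10 bits machine | 12 bits sequence
            return (ts << 22) | (_machineId << 12) | _sequence;
        }
    }
}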


Gap 5 — Cache hot link decision

Instinct was that a separate async service analyses metrics and decides what to populate or evict.

Reality — LRU handles this automatically. No service needed. Access patterns decide what stays cached.

Cache miss → populate
Cache hit  → LRU timestamp refreshes internally
Cache full → LRU evicts least recently accessed key automatically

Cache warming is an optional optimisation on top. Not a core requirement.


Gap 6 — Rate limiting across distributed instances

Initial instinct was that gateway-level rate limiting avoids the distributed state problem.

Gateway itself is scaled to multiple instances. Same problem exists. In-memory counter on each gateway instance is isolated.

Fix is always Redis-backed shared state regardless of where the limiter lives — service or gateway.

Any component scaled to multiple instances
→ in-memory rate limiting is useless
→ Redis shared counter required
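
A fixed-window sketch of the shared counter using StackExchange.Redis (window size and limit are illustrative):

public class RedisRateLimiter
{
    private readonly IConnectionMultiplexer _redis;

    public RedisRateLimiter(IConnectionMultiplexer redis) => _redis = redis;

    public async Task<bool> IsAllowedAsync(string clientId)
    {
        var db = _redis.GetDatabase();
        // one counter per client per minute, shared by every instance
        var key = $"ratelimit:{clientId}:{DateTime.UtcNow:yyyyMMddHHmm}";

        var count = await db.StringIncrementAsync(key);
        if (count == 1)
            await db.KeyExpireAsync(key, TimeSpan.FromMinutes(1));   // expire with the window

        return count <= 100;   // illustrative limit
    }
}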

Gap 7 — Sharding vs database-per-tenant

You conflated the two initially.

Sharding — one logical dataset split for scale. All shards together form one complete picture.

Database-per-tenant — isolated datasets per tenant. Not for scale. For isolation. Tenants never share data, never need to be queried together.

Mechanism is identical. Intent is different. Use the right term in interviews.


Gap 8 — Cross-shard queries in a genuinely sharded system

You correctly identified that different endpoints with different parameters might need different shards.

The answer is not a complex resolver. The answer is:

  • Choose shard key so most common queries are single-shard
  • Group IDs by shard, multiple DbContexts, aggregate in memory for known cross-shard reads (sketch below)
  • Fan-out at write time so reads are always single-shard
  • Read model for complex cross-shard reporting

Shard key selection is the most important decision. Wrong key means cross-shard queries everywhere.
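
The grouped-IDs bullet as a sketch, assuming EF Core; UrlRecord, ShardFor and CreateContextForShard are hypothetical names:

// known cross-shard read: group IDs per shard, query each, aggregate in memory
var byShard = ids.GroupBy(id => ShardFor(id));
var results = new List<UrlRecord>();
foreach (var group in byShard)
{
    var idsInShard = group.ToList();
    await using var ctx = CreateContextForShard(group.Key);   // one DbContext per shard
    results.AddRange(await ctx.Urls
        .Where(u => idsInShard.Contains(u.Id))
        .ToListAsync());
}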


Vocabulary gaps — you have the concept, not the term

What you called it                 Interview term
Status locking and cleanup         Choreography-based Saga
Correct over available             CP under CAP
Winner gets rowCount 1             Pessimistic locking
Read loosely, lock on write        Optimistic reads, pessimistic writes
Correlation ID spanning services   Distributed tracing correlation
Pre-computed pricing read model    CQRS read model
Loading different DB per tenant    Database-per-tenant multi-tenancy

What's next

  1. Custom aliases — how user-defined short codes work alongside auto-generated ones
  2. Non-functionals for URL shortener — scalability, availability, consistency
  3. You drive the next system design problem end to end using the five step framework