URL Shortener — HLD Session Recap


Starting Point

First HLD problem attempted. Five step framework introduced and demonstrated by tutor end to end. Student observed the full thinking process before attempting independently. Goal was to understand each building block deeply before drilling the framework. Student covered concepts one at a time — capacity estimation, Base62, write path, read path, cache, sharding, rate limiting, custom aliases, non-functionals.


The Five Step Framework

Apply to every HLD problem. No exceptions.

1. Clarify requirements      before drawing anything, 2-3 minutes
2. Capacity estimation       numbers drive component decisions, 2-3 minutes
3. High level components     skeleton first, add complexity only where needed
4. Deep dive                 interviewer picks one or two areas, 20-30 minutes
5. Non-functional            scalability, availability, consistency

Clarified Requirements

  • Public URL shortener (bit.ly scale)
  • Basic click count analytics required
  • Custom aliases optional feature
  • Read heavy — links created once, clicked many times

Capacity Estimation

The one number to memorise: 86,400 — seconds in a day.

Mental shortcut for big number division: convert to powers of 10, subtract exponents.

10 billion ÷ 100 thousand
= 10^10 ÷ 10^5
= 10^(10-5)
= 10^5
= 100,000

Powers to memorise:

thousand  = 10^3
million   = 10^6
billion   = 10^9
trillion  = 10^12

Derivations:

Writes:
100M URLs shortened/day ÷ 86,400 = ~1,200 writes/second
→ single primary DB handles this fine

Reads:
100M links * 10 clicks avg = 1B reads/day
1B ÷ 86,400 = ~11,500 reads/second
→ DB cannot handle this alone → cache mandatory

Read/write ratio: 10:1 → read dominant

Storage:
short code + long URL + created_at + user_id = ~500 bytes per record
100M records/day * 365 * 5 years = 182B records
182B * 500 bytes = ~91TB over 5 years
→ single node viable
→ raise TTL/archiving strategy
→ sharding is future concern not day one

Short code space:
62^7 = 3.5 trillion combinations
3.5T ÷ 100M per day = 35,000 days coverage (~95 years)
→ 7 character Base62 is sufficient

What each number drives:

Number                   Decision
1,200 writes/sec         Single primary DB sufficient
11,500 reads/sec         Cache mandatory
10:1 read/write ratio    Read path needs most attention
91TB over 5 years        Single node viable, raise archiving
182B records needed      7 character Base62 sufficient

What storage estimation drives specifically: Not just infrastructure — storage number raises the archiving conversation. At 91TB, single node viable. But adding ~18TB/year means raising TTL policy — archive links inactive for 2 years. Keeps active dataset at ~20-30TB. Always state what the storage number means, not just what it is.


Base62 vs Base64 vs Hash — The Decision

Why not Base64:

Base64 includes + and /. Both break URLs.

+ = space in URL encoding
/ = path separator

URL encoding required — %2B, %2F — making codes ugly and longer. Base64 designed for binary-to-text encoding. Wrong tool for URLs.
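
A quick illustration, with bytes chosen (hypothetically) to hit the problem characters:

var bytes = new byte[] { 0xFB, 0xEF, 0xFF };
Convert.ToBase64String(bytes);   // "++//" — both URL-unsafe characters appear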

Base64URL (RFC 4648) fixes this by substituting - and _. URL-safe, but its 64 characters offer no meaningful benefit over Base62.

Why not Hash:

Three problems:

  1. Collisions — two different URLs can produce same first 7 characters of MD5/SHA256. Must detect, handle, retry. Complexity for no benefit.

  2. Deterministic — same URL always produces same code. Two campaign links to same destination URL get same short code. Analytics breaks — can't distinguish clicks between campaigns.

  3. Fixed length — always 7 characters whether storing 100 URLs or 100 billion. No efficiency gain early on.

Why Base62:

a-z = 26 characters
A-Z = 26 characters
0-9 = 10 characters
    = 62 total, all URL-safe, no encoding needed
62^7 = 3.5 trillion combinations

Auto-Increment to Base62 — The Mechanics

Key insight: DB always stores integers. Base62 is application-level encoding on top. Two separate concerns.

Base62 works like base 10, just with 62 digits instead of 10. Same repeated-division algorithm as converting decimal to binary.

Alphabet:

Position 0-9:   '0'-'9'
Position 10-35: 'a'-'z'
Position 36-61: 'A'-'Z'

Full: "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

Conversion example — ID 1000:

1000 ÷ 62 = 16 remainder 8  → Alphabet[8]  = '8'
  16 ÷ 62 =  0 remainder 16 → Alphabet[16] = 'g'
Read remainders bottom to top: "g8"

Verify:

g = position 16
8 = position 8
16 * 62^1 + 8 * 62^0 = 992 + 8 = 1000 ✓

Code:

private static readonly string Alphabet =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

public static string Encode(long id)
{
    if (id == 0) return "0";
    var result = new StringBuilder();
    while (id > 0)
    {
        // Remainder selects the digit; prepending reads the remainders bottom to top
        result.Insert(0, Alphabet[(int)(id % 62)]);
        id /= 62;
    }
    return result.ToString();
}
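
For completeness, the inverse, needed only in the store-integer-only design covered below. A minimal sketch:

public static long Decode(string code)
{
    long id = 0;
    foreach (var c in code)
        id = id * 62 + Alphabet.IndexOf(c);   // each digit shifts the value one base-62 place
    return id;
}

Decode("g8") = 16 * 62 + 8 = 1000 ✓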

Why guaranteed unique: DB auto-increment gives unique integers. Encoding a unique integer always gives a unique string. No collision possible.

The enumeration problem: codes are sequential. Someone with code "g8" can guess "g9" exists. Acceptable for public URL shortener. Not acceptable for private document sharing — switch to random Base62.


The Write Path — Full Flow

The confusion point — when does the code get saved back to DB:

POST /shorten { url: "https://very-long-url.com" }

Step 1  INSERT URL, get ID back in same round trip
INSERT INTO urls (url, created_at) VALUES (@url, now()) RETURNING id
→ id = 1000

Step 2  Encode in application
Base62.Encode(1000) = "g8"

Step 3  Store code back
UPDATE urls SET code = 'g8' WHERE id = 1000

Step 4  Return "g8" to client

Two DB operations. RETURNING id avoids separate SELECT round trip.

Two valid designs:

Design                 Write                    Read
Store code + integer   encode ID, store code    lookup by code directly
Store integer only     encode ID, return code   decode code → lookup by ID

Store code + integer is simpler on read path. No decoding needed.
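
A minimal sketch of the store-code-plus-integer write path, assuming Dapper and an injected connection _db (names beyond Base62.Encode are illustrative):

public async Task<string> CreateAsync(string longUrl)
{
    // Step 1: insert and get the generated ID in the same round trip
    var id = await _db.QuerySingleAsync<long>(
        "INSERT INTO urls (url, created_at) VALUES (@url, now()) RETURNING id",
        new { url = longUrl });

    // Step 2: encode at the application level
    var code = Base62.Encode(id);

    // Step 3: store the code back so reads can look up by code directly
    await _db.ExecuteAsync(
        "UPDATE urls SET code = @code WHERE id = @id", new { code, id });

    // Step 4: return the code to the client
    return code;
}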


The Read Path — Never Decode

The confusion point — student initially thought decoding was needed on GET requests.

You never decode unless you never stored the code column.

Stored code column:
GET /g8
→ SELECT url FROM urls WHERE code = 'g8'   ← direct lookup
→ redirect

No code column stored:
GET /g8
→ decode "g8" → 1000
→ SELECT url FROM urls WHERE id = 1000
→ redirect

Decoding just to get the ID when the code is already stored is an unnecessary extra step.


The Cache — LRU Is Automatic

The confusion point — student thought a separate service decides what's hot and manually populates cache.

LRU handles everything automatically through access patterns. You write zero eviction logic.

Cache miss → populate Redis
Cache hit  → Redis internally refreshes LRU timestamp
Cache full → Redis evicts key with oldest LRU timestamp automatically

Hot links stay cached — accessed constantly, LRU timestamp keeps refreshing. Cold links evicted — nobody clicks them, LRU timestamp goes stale.
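
In Redis this is configuration, not application code. A typical setup (size illustrative):

maxmemory 4gb
maxmemory-policy allkeys-lru   # evict the least-recently-used keys when memory is full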

Power law distribution: top 1% of links get ~80% of clicks. Cache the hot links, absorb 80% of read traffic without touching DB.

Cache warming — optional optimisation. Background job pre-loads top N clicked links before TTL expiry. Eliminates cold start problem. Not a core requirement.

Analytics service — entirely separate concern. Records click counts, geography, referrer for user-facing reporting. Has nothing to do with cache management.

LRU eviction   → automatic, Redis internal, zero code
Cache populate → on cache miss, application code
Cache warming  → optional background job
Analytics      → separate service, user-facing reporting

301 vs 302

                           301 Permanent          302 Temporary
Browser caches             Yes — permanently      No
Every click hits server    No                     Yes
Analytics possible         No                     Yes
If destination changes     Browser ignores update Always gets latest

Analytics required → always 302. Every click must pass through server.


Analytics Off Critical Path

Never block the redirect on analytics recording.

GET /g8
→ cache hit, long URL retrieved
→ publish ClickEvent to queue   (async, fire and forget, never blocks)
→ return 302 immediately

Analytics Service (separate)
→ consumes ClickEvent
→ increments click count
→ stores geography, referrer, timestamp
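
A minimal in-process sketch of the fire-and-forget publish using System.Threading.Channels. A real system would publish to Kafka or RabbitMQ; ClickEvent and ClickEventPublisher are illustrative names:

using System.Threading.Channels;

public record ClickEvent(string Code, DateTime ClickedAt);

public class ClickEventPublisher
{
    private readonly Channel<ClickEvent> _channel = Channel.CreateUnbounded<ClickEvent>();

    // TryWrite never blocks — the redirect returns immediately
    public void Publish(ClickEvent evt) => _channel.Writer.TryWrite(evt);

    // The analytics consumer drains the channel in the background
    public ChannelReader<ClickEvent> Reader => _channel.Reader;
}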

Redis in .NET — Canonical Pattern

IDistributedCache — canonical for simple get/set/remove with TTL. Abstraction over Redis. Swappable with in-memory for tests.

StackExchange.Redis — when Redis-specific features needed: counters, sorted sets, pub/sub, Lua scripts.

Critical — IConnectionMultiplexer is always singleton:

// WRONG — new connection per request, destroys performance
using var redis = await ConnectionMultiplexer.ConnectAsync("localhost");

// CORRECT — one connection pool, reused everywhere
builder.Services.AddSingleton<IConnectionMultiplexer>(
    ConnectionMultiplexer.Connect("localhost"));

Cache-Aside Decorator Pattern

Repository owns cache logic. Service has zero cache knowledge. Decorator pattern.

Service → IUrlRepository → result
               ↑
    CachedUrlRepository      ← decorator, adds cache
               ↑
       UrlRepository         ← DB only, no cache knowledge

public class CachedUrlRepository : IUrlRepository
{
    private readonly IUrlRepository _inner;
    private readonly IDistributedCache _cache;

    // Constructor wires the decorated repository and the cache
    public CachedUrlRepository(IUrlRepository inner, IDistributedCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<string?> GetLongUrlAsync(string code)
    {
        var cached = await _cache.GetStringAsync(code);
        if (cached != null) return cached;

        var longUrl = await _inner.GetLongUrlAsync(code);
        if (longUrl == null) return null;

        await _cache.SetStringAsync(code, longUrl, new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24)
        });

        return longUrl;
    }

    public async Task<string> CreateAsync(string longUrl)
    {
        // writes pass straight through — no cache on write
        return await _inner.CreateAsync(longUrl);
    }
}

Why no cache on write: link just created, nobody has code yet, cache hit rate is zero. Populate on first read — lazy population. Cache-aside pattern.

Registration:

builder.Services.AddScoped<UrlRepository>();
builder.Services.AddScoped<IUrlRepository>(sp =>
    new CachedUrlRepository(
        sp.GetRequiredService<UrlRepository>(),
        sp.GetRequiredService<IDistributedCache>()));

Full Read Path End to End

GET /g8

UrlShortenerService.RedirectAsync("g8")
  CachedUrlRepository.GetLongUrlAsync("g8")
  Redis hit?
    yes → publish ClickEvent (async, fire and forget)
          return 302 redirect

    no  → UrlRepository.GetLongUrlAsync("g8")
          SELECT url FROM urls WHERE code = 'g8'
          found     → SET Redis 'g8' = longUrl, TTL 24hrs
                      publish ClickEvent (async)
                      return 302 redirect
          not found → 404
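
Tying the pieces together, a minimal endpoint sketch (Minimal API; ClickEventPublisher is the illustrative type from the analytics section):

app.MapGet("/{code}", async (string code, IUrlRepository repo, ClickEventPublisher clicks) =>
{
    var longUrl = await repo.GetLongUrlAsync(code);    // cache-aside happens inside the decorator
    if (longUrl is null) return Results.NotFound();

    clicks.Publish(new ClickEvent(code, DateTime.UtcNow));   // fire and forget
    return Results.Redirect(longUrl, permanent: false);      // 302 — analytics requires it
});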

Sharding

What it solves: single node storage and write throughput limits. Read replicas solve read throughput but don't reduce storage or improve write throughput — replicas are full copies, all writes still go to one primary.

When to shard URL shortener: 91TB over 5 years is viable on single node. Shard when writes grow 10x (~12,000/sec) or storage exceeds single node comfort. Not a day-one requirement.

Hash-based sharding:

hash(shortCode) % N = shard index

Even distribution. No hotspots. Standard for content systems without natural range key.

Consistent hashing: minimises redistribution when nodes added or removed. Simple modulo means adding one shard requires moving almost all data. Consistent hashing redistributes only 1/N of data.

Read replicas per shard: each shard is an independent replication unit. Shard 1 primary replicates to Shard 1 replicas only. No knowledge of other shards.

ShardDescriptor in .NET:

public class ShardDescriptor
{
    public string WriteConnectionString { get; init; }
    public string ReadConnectionString { get; init; }
}

public class ShardResolver
{
    private readonly ShardDescriptor[] _shards;

    public ShardResolver(ShardDescriptor[] shards) => _shards = shards;

    public ShardDescriptor Resolve(string shardKey)
    {
        // NOTE: string.GetHashCode() is randomised per process in .NET Core —
        // production sharding needs a stable hash (e.g. FNV-1a or xxHash)
        var index = Math.Abs(shardKey.GetHashCode()) % _shards.Length;
        return _shards[index];
    }
}
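
Usage sketch: the shard list comes from configuration, routing happens per request (NpgsqlConnection assumed for Postgres):

var shard = resolver.Resolve(shortCode);
await using var conn = new NpgsqlConnection(shard.ReadConnectionString);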

Interview framing:

"At current scale — single primary with read replicas. If writes grow 10x or storage exceeds single node comfort I'd introduce hash-based sharding on the short code using consistent hashing to minimise redistribution when adding nodes."


Rate Limiting

What it solves: prevents abuse — bots submitting millions of URLs, spam link generation, API hammering.

Where it lives: API Gateway. Services never see rejected traffic.

What to rate limit on:

  • IP address — simplest, but corporate NAT means many users share one IP
  • API key — more precise, bad actor gets key revoked without affecting others
  • User ID — most precise, requires auth

Two main algorithms:

Token Bucket: Each client has a bucket of N tokens. Refills at fixed rate. Each request consumes one token. Empty bucket = rejected. Allows bursting up to capacity. Natural for real users who are sometimes active in bursts.

Sliding Window: Track exactly how many requests in last N seconds. Exceeds limit = rejected. More precise than token bucket. No burst allowance. Strict enforcement.
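
A minimal single-instance token bucket sketch (in-memory only, so subject to the Redis caveat below; all names illustrative):

public class TokenBucket
{
    private readonly int _capacity;
    private readonly double _refillPerSecond;
    private double _tokens;
    private DateTime _lastRefill = DateTime.UtcNow;
    private readonly object _lock = new();

    public TokenBucket(int capacity, double refillPerSecond)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _tokens = capacity;
    }

    public bool TryConsume()
    {
        lock (_lock)
        {
            var now = DateTime.UtcNow;
            // Refill proportionally to elapsed time, capped at bucket capacity
            _tokens = Math.Min(_capacity,
                _tokens + (now - _lastRefill).TotalSeconds * _refillPerSecond);
            _lastRefill = now;

            if (_tokens < 1) return false;   // empty bucket: reject
            _tokens -= 1;                    // consume one token
            return true;
        }
    }
}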

Must use Redis for distributed rate limiting:

Any component scaled to multiple instances
→ in-memory rate limiting is useless
→ each instance has isolated counter
→ client bypasses limit by hitting different instance
→ Redis shared counter required

This applies even when rate limiting at the gateway — gateway itself scaled to multiple instances behind a load balancer. In-memory fails. Redis required.
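
A minimal fixed-window sketch of the shared counter with StackExchange.Redis (key scheme illustrative; a production sliding window would use a Lua script to make increment-and-expire atomic):

public class RedisFixedWindowLimiter
{
    private readonly IDatabase _db;

    public RedisFixedWindowLimiter(IConnectionMultiplexer mux) => _db = mux.GetDatabase();

    public async Task<bool> IsAllowedAsync(string clientId, int limit, TimeSpan window)
    {
        // One counter per client per window, shared by every instance
        var key = $"rl:{clientId}:{DateTime.UtcNow.Ticks / window.Ticks}";
        var count = await _db.StringIncrementAsync(key);   // atomic INCR
        if (count == 1)
            await _db.KeyExpireAsync(key, window);         // first request sets the TTL

        return count <= limit;
    }
}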

ASP.NET Core built-in rate limiting (.NET 7+):

builder.Services.AddRateLimiter(options =>
{
    options.AddSlidingWindowLimiter("shorten-limit", o =>
    {
        o.PermitLimit = 100;
        o.Window = TimeSpan.FromMinutes(1);
        o.SegmentsPerWindow = 6;
        o.QueueLimit = 0;  // reject immediately, no queuing
    });
    options.RejectionStatusCode = 429;
});

Built-in is in-memory. For distributed — use RedisRateLimiting library or handle at NGINX/YARP gateway level.

Why reads are not rate limited: clicks on legitimate links. Real users click. Bots click for valid reasons (link preview, social crawlers). Rate limiting reads breaks legitimate use cases. Rate limit writes only.


Custom Aliases

User picks their own short code instead of auto-generated.

Happy path:

POST /shorten { url: "...", alias: "my-brand" }
→ validate format
→ check reserved list
→ INSERT with code = 'my-brand'
    unique constraint violation → 409
    success → return "my-brand"

Gotcha 1 — Race condition:

Two users request same alias simultaneously. Both check, both see nothing, both try to insert. One wins, one gets DB error.

Fix — unique constraint on code column. Try the insert, catch constraint violation. One round trip instead of two. Race condition impossible.

try
{
    await _db.ExecuteAsync(
        "INSERT INTO urls (code, url) VALUES (@code, @url)",
        new { code = alias, url = longUrl });
}
catch (PostgresException ex) when (ex.SqlState == "23505")
{
    return Conflict("Alias already taken");
}

Gotcha 2 — Namespace collision:

User requests alias "g8". That code already exists as auto-generated. Or user reserves "g8" today, auto-increment reaches ID 1000 tomorrow which encodes to "g8".

Initial fix considered — prefix auto-generated codes with "a-". Problem — user can submit "a-g8" as alias, colliding with prefixed auto-generated code "a-g8".

Correct fix — single table, single unique constraint:

CREATE TABLE urls (
    id        BIGINT PRIMARY KEY,
    code      VARCHAR(30) NOT NULL UNIQUE,  -- covers everything
    url       TEXT NOT NULL,
    is_alias  BOOLEAN DEFAULT FALSE
);

One table. One unique constraint. Auto-generated codes and aliases live side by side. The constraint prevents any collision regardless of origin.

Separate tables were considered and rejected — they create false safety. User can submit any string as alias including auto-generated patterns. Physical separation without global uniqueness check is insufficient.

Gotcha 3 — Alias validation:

private static readonly Regex ValidAlias =
    new Regex("^[a-zA-Z0-9_-]{3,30}$", RegexOptions.Compiled);  // hyphen last to avoid range ambiguity

Rules: alphanumeric plus hyphen and underscore (URL safe), minimum 3 characters, maximum 30 characters.

Gotcha 4 — Reserved aliases:

private static readonly HashSet<string> Reserved = new()
{
    "health", "api", "admin", "callback", "login", "logout"
};

User registers alias "health" — now GET /health hits redirect instead of health check endpoint.
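
Combining the two checks, a small helper sketch built on the regex and set above:

public static bool IsValidAlias(string alias) =>
    ValidAlias.IsMatch(alias) && !Reserved.Contains(alias.ToLowerInvariant());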

Write path with aliases:

POST /shorten { url, alias? }

Alias provided:
→ validate format (regex)
→ check reserved list
→ INSERT with alias as code
    unique constraint violation → 409 Conflict
    success → return alias

No alias:
→ generate Snowflake ID
→ encode to Base62
→ INSERT with generated code
→ return generated code

Read path — no change:

GET /my-brand
→ SELECT url FROM urls WHERE code = 'my-brand'
→ found → 302 redirect

Identical to auto-generated lookup. Same column, same index, same query.


Non-Functional Requirements

Scalability:

Read path — stateless services scale horizontally, Redis Cluster distributes cache keys, read replicas handle cache misses.

Write path — 1,200/sec comfortable on single primary. If it grows 10x:

1. Vertical scale — bigger machine
2. PgBouncer — connection pooling, reduces overhead
3. Write batching — buffer writes, flush in batches
4. Sharding — last resort, hash by short code, consistent hashing

Availability:

Component fails    Impact                                              Recovery
Redis              Reads fall through to DB, slower but functional     Cache warms naturally on restart
Read replica       Reads fall back to primary, higher load             Add new replica
Primary DB         Writes fail, reads served from cache + replicas     Promote replica, small data loss window
Service instance   LB health check routes traffic to healthy instances Auto-scaling replaces it

Consistency:

  • Write path: strong — unique constraint on primary, two users cannot get same code
  • Read path: eventual — cache TTL 24hrs, stale data possible after URL update
  • Explicit invalidation on URL change: await _cache.RemoveAsync(code) tightens window to near-zero
  • Analytics: eventual — click events async via queue, counts lag by seconds. Acceptable.
  • Read replica lag: 50-100ms, acceptable. Newly created link read immediately from primary not replica.

Handling viral links:

"Viral link cached after first click. Every subsequent click hits Redis — sub-millisecond. DB never sees it. Redis handles hundreds of thousands of reads/second on single node. Cache-aside pattern absorbs viral traffic naturally."

Data loss if primary goes down:

"Replication lag at failure time — typically milliseconds to seconds. Acceptable for URL shortener. Synchronous replication would eliminate loss at cost of higher write latency — not justified for this use case."


Interview Framing — Key Statements

On sharding:

"At current scale a single primary with read replicas is sufficient. If write throughput or storage hits single node limits I'd introduce hash-based sharding on the short code using consistent hashing to minimise redistribution when scaling out."

On rate limiting:

"Rate limit at the API Gateway on the write path — IP-based for anonymous clients, API key based for registered users with higher limits. Token bucket for burst tolerance, sliding window for strict enforcement. State in Redis so limits are consistent across all service instances regardless of how many gateway instances are running."

On Base62:

"Base64 has URL-unsafe characters. Hashing introduces collision handling complexity for no benefit. Base62 with auto-increment gives guaranteed uniqueness, URL-safe characters, simple implementation. If link privacy is a concern I'd switch to random Base62 — collision probability negligible until billions of records."

On custom aliases:

"Custom aliases use the same code column as auto-generated codes. Uniqueness enforced at DB level via unique constraint — try the insert, catch constraint violation, eliminates race condition. Single table with one unique constraint is the only true guarantee — separate tables create false safety because users can submit any string as alias."


Gaps Identified During Session

Gap 1 — Storage estimation not connected to decisions. Calculated 91TB but didn't use it. Storage drives the archiving conversation, not just infrastructure. Always state what the number means.

Gap 2 — Base64 URL safety. Didn't know + and / were the problem characters. Now known: Base64URL substitutes - and _, but Base62 is simpler and standard.

Gap 3 — Never decode on read. Initial instinct was to decode the short code on every GET. Resolved — if the code is stored, look up by code directly. Decode only if the code column was never stored.

Gap 4 — When the code gets written back. Didn't see the two-step write flow. INSERT with RETURNING id, then UPDATE SET code. Or a Snowflake ID generator removes the chicken-and-egg entirely.

Gap 5 — LRU is automatic. Thought a separate service manages cache population and eviction. Resolved — LRU handles it entirely through access patterns. Zero eviction code needed.

Gap 6 — Rate limiting across instances. Thought gateway-level rate limiting avoided the distributed-state problem. Resolved — the gateway itself is scaled. Same problem. Redis required regardless of where the limiter lives.

Gap 7 — Separate tables don't prevent alias collision. Proposed separate tables for aliases and generated codes. Resolved — a user can submit any string as an alias, including auto-generated patterns. Only a single table with a unique constraint is a true guarantee.


Unanswered / Needs More Depth

  • Snowflake ID generator mechanics — time component, machine ID, sequence
  • Link expiration — user-defined TTL on short codes
  • Link analytics dashboard design — reporting DB, event pipeline
  • URL validation on write — check destination is reachable, not on malicious domain blocklist
  • Geographic analytics — tracking click location, CDN implications
  • Custom domain support — user brings their own domain for short links

What's Next

  • Student drives Twitter feed problem end to end
  • Consistent hashing in depth — virtual nodes, redistribution mechanics
  • Design a rate limiter as a standalone HLD problem
  • Design a distributed cache
  • Resume finalisation — URL shortener design knowledge translatable to interview answers on caching and sharding