
Notification System — HLD Session Recap


Starting Point

Twitter feed design completed. Notification system introduced as the second problem — same patterns, different domain. Goal was to force application of patterns rather than recall of a specific solution. Student drove the design more independently than the Twitter feed session.


Problem Scope

Notification system for Reddit-style platform. When a post is made in a subreddit — all subscribers are notified.

Clarified requirements:

  • Average subscribers per subreddit: 100,000
  • Average posts per day across all subreddits: 500,000
  • Each user has 2 devices on average
  • Scope: notification delivery only — post content fetching not in scope


Capacity Estimation

Student derived independently:

Posts per second:
500,000 posts/day ÷ 86,400 = ~5 posts/second

Devices:
100,000 avg subscribers * 2 devices = 200,000 devices per subreddit post

Notification deliveries per second:
5 posts/sec * 200,000 devices = 1,000,000 deliveries/second

What the number drives: 1 million deliveries/second is not a read/write problem. It is a throughput and fan-out problem. Same pattern as Twitter feed. One event fans out to many recipients.


The Naive Approach and Why It Breaks

One Fan-out Service receives post event. Loops through 200,000 subscribers. Calls SendNotification() for each.

Problems:

  • Sequential processing of 200,000 operations per post
  • At 5 posts/second = 1,000,000 sequential operations/second in one service
  • Service cannot keep up

Student's instinct — multiple parallel consumers: correct direction. But spinning 20 permanent instances for notification workload is wasteful. Auto-scaling is too slow for spikes. The real fix is message batching.


Message Batching Pattern

Instead of one event per post triggering all work in one consumer — Dispatcher Service fetches subscriber list, splits into batches, publishes many small batch messages.

PostCreated event arrives
→ Dispatcher Service fetches 200,000 subscriber IDs for subreddit
→ splits into batches of 1,000
→ publishes 200 BatchNotification messages to RabbitMQ

Each BatchNotification message:
{
  postId: 123,
  subscriberIds: [userId1, userId2, ... userId1000]
}

200 small messages instead of one massive operation. N Notification Workers pull from queue. Each processes 1,000 notifications. Auto-scaling based on queue depth — not permanent instances.

Why not put everything in one message: 200,000 user IDs in one message is megabytes. RabbitMQ messages should be small. Batching keeps messages small and distributes work.

Batch size is configurable: 1,000 per batch is a starting point. Tune based on worker processing time and acceptable latency.
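The batching step can be sketched in a few lines. A minimal sketch, assuming a `make_batches` helper; the RabbitMQ publish itself is omitted.

```python
BATCH_SIZE = 1_000  # starting point from this section; tune per workload

def make_batches(post_id, subscriber_ids, batch_size=BATCH_SIZE):
    """Split one post's subscriber list into small BatchNotification payloads."""
    return [
        {"postId": post_id, "subscriberIds": subscriber_ids[i:i + batch_size]}
        for i in range(0, len(subscriber_ids), batch_size)
    ]

# 200,000 subscribers -> 200 messages of 1,000 IDs each
messages = make_batches(123, list(range(200_000)))
```

Each element would be published as one queue message, so workers pull fixed-size units of work regardless of subreddit size.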


Device Token Delivery

Worker has 1,000 user IDs. Needs device tokens to call FCM/APNs.

Two options:

Option 1 — Include tokens in batch message: Dispatcher fetches tokens alongside subscriber IDs. Bundles into message.

{
  postId: 123,
  subscribers: [
    { userId: 1, tokens: ["token_abc", "token_xyz"] },
    ...1000 entries
  ]
}

Message size: ~300KB per batch. Workers are simple — no additional lookups.

Option 2 — Token lookup in worker: Keep messages small. Worker fetches tokens from Redis using MGET.

BatchNotification message: { postId, userIds: [...1000 IDs] }  ← ~8KB

Worker:
MGET device_tokens:1 device_tokens:2 ... device_tokens:1000
→ one Redis call, batch fetch all tokens
→ call FCM/APNs

Preferred approach: Option 2. Messages stay small at ~8KB. Dispatcher stays focused on fan-out only. Workers take on a single Redis dependency; one extra batched lookup per message is an acceptable tradeoff.
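A minimal worker sketch for Option 2. Names are illustrative; the Redis MGET and FCM/APNs calls are replaced by an in-memory dict and a callback so the flow is runnable.

```python
def fetch_tokens(store, user_ids):
    """Batched lookup, analogous to MGET device_tokens:1 ... device_tokens:1000."""
    return [store.get(f"device_tokens:{uid}", []) for uid in user_ids]

def process_batch(message, store, push):
    """Handle one BatchNotification: look up tokens, push to each device."""
    for tokens in fetch_tokens(store, message["userIds"]):
        for token in tokens:
            push(token, message["postId"])  # FCM/APNs call in production

# usage with stubbed storage and push
store = {"device_tokens:1": ["token_abc", "token_xyz"],
         "device_tokens:2": ["token_p"]}
sent = []
process_batch({"postId": 123, "userIds": [1, 2]}, store,
              lambda token, post_id: sent.append((token, post_id)))
```

In production `fetch_tokens` would be one `MGET` round trip, keeping the worker at a single Redis call per batch.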


The Large Subreddit Problem

r/funny — 67 million subscribers. ~134 million devices at 2 per user.

67,000,000 ÷ 1,000 = 67,000 batch messages from one post

67,000 messages flood the queue. All other subreddits' notifications are delayed. Workers overwhelmed. Same backlog problem as Twitter celebrity fan-out.

Student correctly identified: the problem is the number of messages created, not the per-instance processing speed. Increasing batch size just shifts the bottleneck — doesn't eliminate it.


Tiered Handling — Same Pattern as Twitter

Small subreddit (< 100K subscribers): push model, full fan-out, real-time push notification to device.

Medium subreddit (100K – 1M subscribers): push model, batched fan-out.

Large subreddit (> 1M subscribers): pull model. No fan-out. Dispatcher publishes a single message:

{
  postId: 123,
  subredditId: "funny",
  type: "large_subreddit"
}

One message. No user IDs. No device tokens. Consumer writes one record to notification store. Users discover it when they open the app.

Why this works for large subreddits: r/funny is entertainment content. Low urgency. Users browse when they open the app. Real-time push notification from a 67M member community would be noise anyway.

Why it doesn't work for small subreddits: niche communities post rarely. Users want real-time notification because posts are high signal and infrequent.
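The tier thresholds above reduce to a simple classification. A sketch, assuming a `tier` helper with the cutoffs from this section:

```python
def tier(subscriber_count):
    """Classify a subreddit into a delivery tier by subscriber count."""
    if subscriber_count < 100_000:
        return "small"    # push model, full fan-out
    if subscriber_count < 1_000_000:
        return "medium"   # push model, batched fan-out
    return "large"        # pull model, one subreddit-level message
```

The dispatcher would call this once per post and either fan out batches or publish the single subreddit-level message.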


Pull Mechanism for Large Subreddits

App open pull:

User opens app
→ GET /notifications?userId=123&since=lastSeen
→ server checks user's large subreddit subscriptions
→ queries notification store for new posts since lastSeen
→ returns count or list
→ app shows badge or notification banner

Not periodic polling. Event-driven — check happens when user opens app. No battery drain. No background polling.
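The app-open check can be sketched as a filter over the notification store. Function name and store shape are assumptions, not a real API:

```python
def notifications_since(user_large_subs, notification_store, last_seen):
    """Return new large-subreddit posts since the user's last check."""
    return [
        n for n in notification_store
        if n["subredditId"] in user_large_subs and n["createdAt"] > last_seen
    ]

# usage: one user subscribed to r/funny, checking since timestamp 15
store = [
    {"subredditId": "funny", "postId": 1, "createdAt": 10},
    {"subredditId": "funny", "postId": 2, "createdAt": 20},
    {"subredditId": "aww",   "postId": 3, "createdAt": 30},
]
fresh = notifications_since({"funny"}, store, last_seen=15)
```

In the real system this is the SQL query shown in the Storage section; the work happens only when the user opens the app.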

Websocket for active users:

Users currently active in app maintain a websocket connection. Server pushes large subreddit event down the websocket in real time.

Server → websocket → "r/funny has a new post"

Not user-specific fan-out. Subreddit-level broadcast.

Why websockets are tractable: Not every user has a persistent connection. Only active users — roughly 2-5% at any moment.

200M users * 2% active = 4M concurrent connections
4M ÷ 50,000 per server = 80 websocket servers

Manageable. Distributed across many servers.


Websocket Connection Registry

Websocket connections are stateful. User 123 is connected to Websocket Server A. Notification arrives for r/funny. System needs to know which server to push to.

Redis connection registry:

Redis: user:123 → websocket_server_A

When notification arrives:

→ lookup user:123 in Redis → websocket_server_A
→ send message to websocket_server_A via Redis pub/sub
→ server_A pushes to user 123's connection

For large subreddit broadcast — no user lookup needed:

Each websocket server maintains local map:

subreddit_connections:funny → [conn1, conn2, conn3...]

Notification published to Redis pub/sub channel. All websocket servers receive it. Each pushes to local connections subscribed to that subreddit. No user-specific routing.
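A sketch of the subreddit-level broadcast on one websocket server. Connections are stubbed as lists that collect pushed events; a real server would write to open sockets, and the event would arrive via the Redis pub/sub subscription.

```python
class WebsocketServer:
    """One websocket server's local state; events arrive via pub/sub."""

    def __init__(self):
        # subreddit -> connections on THIS server subscribed to it
        self.subreddit_connections = {}

    def subscribe(self, subreddit, conn):
        self.subreddit_connections.setdefault(subreddit, []).append(conn)

    def on_pubsub_event(self, event):
        # No user-specific routing: push to every local connection
        for conn in self.subreddit_connections.get(event["subredditId"], []):
            conn.append(event)

# usage: the same event reaches all servers; each pushes only locally
server_a, server_b = WebsocketServer(), WebsocketServer()
conn1, conn2 = [], []
server_a.subscribe("funny", conn1)
server_b.subscribe("aww", conn2)
event = {"subredditId": "funny", "postId": 1}
for server in (server_a, server_b):
    server.on_pubsub_event(event)
```

Server B receives the event but has no local r/funny connections, so it does nothing — which is exactly why no central user-to-server lookup is needed for broadcasts.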


Three Delivery Mechanisms for Three User States

User state                   Delivery mechanism
Active in app                Websocket push, real-time
App closed, small sub        FCM/APNs push notification
App closed, large sub        Pull on next app open

One mechanism cannot serve all states efficiently. Design explicitly for each.


Activity Filter Optimisation

Before fan-out — filter subscribers to only those active in last 30 days.

r/funny: 67M subscribers
Active in last 30 days: ~10M
Fan-out: 10M instead of 67M

A ~7x reduction without changing architecture. Some users miss notifications. Acceptable — they're inactive anyway.
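The filter is one pass before batching. A sketch, assuming a map of days since each user was last active:

```python
def filter_active(subscriber_ids, days_since_active, window_days=30):
    """Drop subscribers with no activity in the window before fanning out."""
    return [
        uid for uid in subscriber_ids
        if days_since_active.get(uid, float("inf")) <= window_days
    ]
```

Unknown users default to infinity, i.e. never active, so they are skipped rather than notified.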


Complete Architecture

Post created
→ Dispatcher Service fetches subscriber list
→ checks subreddit size

Small/Medium subreddit:
→ activity filter (active in last 30 days)
→ batch into 1,000 per message
→ publish N BatchNotification messages to RabbitMQ
→ Notification Workers consume
→ MGET device tokens from Redis
→ call FCM/APNs

Large subreddit:
→ publish ONE subreddit-level message
→ Notification Store Consumer writes one record
→ active users receive via websocket broadcast
→ inactive users discover on next app open via REST poll

Storage

Device token store (Redis):

key: device_tokens:{userId}
value: ["token_abc", "token_xyz"]  ← user's device tokens

Large subreddit notification store (DB):

subreddit_notifications
    id            BIGINT
    subreddit_id  VARCHAR
    post_id       BIGINT
    created_at    TIMESTAMPTZ

Queried on app open:

SELECT * FROM subreddit_notifications
WHERE subreddit_id IN (user's large subreddits)
AND created_at > lastSeen

User subscription store:

small_subs:{userId}  → set of small subreddit IDs
large_subs:{userId}  → set of large subreddit IDs

Separated at subscribe time. No classification needed at notification time.
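Subscribe-time classification can be sketched as one write into the right bucket. Key names follow the storage keys above; the threshold matches the tiering section.

```python
LARGE_MIN = 1_000_000  # same cutoff as the tiered-handling section

def subscribe(store, user_id, subreddit_id, subscriber_count):
    """Write into small_subs or large_subs at subscribe time, so no
    classification is needed when a notification arrives."""
    bucket = "large_subs" if subscriber_count >= LARGE_MIN else "small_subs"
    store.setdefault(f"{bucket}:{user_id}", set()).add(subreddit_id)

# usage with an in-memory stand-in for the subscription store
store = {}
subscribe(store, 123, "funny", 67_000_000)
subscribe(store, 123, "rarebooks", 4_000)
```

A subreddit crossing the threshold later would need a one-off migration of its entries between buckets; that edge case is out of scope here.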


Patterns Used

Pattern                              Where used
Fan-out on write                     Small subreddit — push to all subscriber devices
Pull on read                         Large subreddit — user fetches on app open
Tiered handling for outliers         Small/medium push, large pull
Message batching                     200K subscribers → 200 batch messages of 1K each
Async fan-out via queue              Post event → RabbitMQ → Dispatcher → Workers
Activity filter                      Only notify users active in last 30 days
Websocket with connection registry   Active users get real-time delivery
Three delivery mechanisms            Match mechanism to user state

The Meta-Pattern — Same Arc as Twitter Feed

Naive approach    → loop through all subscribers synchronously
First problem     → too slow, one service can't handle it
Fix               → async via queue
Second problem    → large subreddit floods queue (67K messages)
Fix               → tiered handling, pull model for large subs
Third problem     → delivery to device
Fix               → match mechanism to user state

Identical arc to Twitter feed. Celebrities = large subreddits. Push model = fan-out. Pull model = client fetches on open.


Key Confusions and Resolutions

Confusion 1 — Spinning 20 instances vs batching Student initially thought scaling instances was the fix for fan-out volume. Resolved — batching reduces message count from 200,000 to 200. Workers process batches. Queue absorbs spikes. Auto-scale based on queue depth, not permanent load.

Confusion 2 — What the large subreddit consumer does Student asked what the consumer does with the subreddit-level message. Resolved — almost nothing. Writes one record to notification store. No user lookup. No device tokens. One DB write. Users pull from this store on app open.

Confusion 3 — Websocket at scale Student correctly identified that one websocket per user at 200M users is not feasible. Resolved — only active users maintain connections (~2-5%). 4M concurrent connections distributed across 80 websocket servers. Manageable. Large subreddit broadcast uses Redis pub/sub to all websocket servers — no user-specific routing needed.

Confusion 4 — Connecting patterns to each other Student felt patterns weren't forming a cohesive mental model. Resolved — the three questions that drive every HLD problem: how does data get written, how does data get read, how do you handle scale when naive approaches break. Same arc repeats across every system.


Unanswered / Needs More Depth

  • Notification preferences — user can mute specific subreddits
  • Read receipts — marking notifications as seen
  • Notification grouping — "5 new posts in r/funny" vs 5 separate notifications
  • Rate limiting per user — avoid spamming users with too many notifications

What's Next

  • Ecommerce ordering flow (covered next)
  • Deep dive on websocket scaling
  • Notification delivery guarantees — at least once vs at most once vs exactly once