
Notification System — HLD Session Recap


Starting Point

Twitter feed design completed. Notification system introduced as the second problem — same patterns, different domain. Goal was to force application of patterns rather than recall of a specific solution. Student drove the design more independently than the Twitter feed session.


Problem Scope

Notification system for Reddit-style platform. When a post is made in a subreddit — all subscribers are notified.

Clarified requirements:

  • Average subscribers per subreddit: 100,000
  • Average posts per day across all subreddits: 500,000
  • Each user has 2 devices on average
  • Scope: notification delivery only — post content fetching not in scope


Capacity Estimation

Student derived independently:

Posts per second:
500,000 posts/day ÷ 86,400 = ~5 posts/second

Devices:
100,000 avg subscribers * 2 devices = 200,000 devices per subreddit post

Notification deliveries per second:
5 posts/sec * 200,000 devices = 1,000,000 deliveries/second

What the number drives: 1 million deliveries/second is not a read/write problem. It is a throughput and fan-out problem. Same pattern as Twitter feed. One event fans out to many recipients.


The Naive Approach and Why It Breaks

One Fan-out Service receives post event. Loops through 200,000 subscribers. Calls SendNotification() for each.

Problems:

  • Sequential processing of 200,000 operations per post
  • At 5 posts/second = 1,000,000 sequential operations/second in one service
  • Service cannot keep up

Student's instinct — multiple parallel consumers: correct direction. But spinning 20 permanent instances for notification workload is wasteful. Auto-scaling is too slow for spikes. The real fix is message batching.


Message Batching Pattern

Instead of one event per post triggering all work in one consumer — Dispatcher Service fetches subscriber list, splits into batches, publishes many small batch messages.

PostCreated event arrives
→ Dispatcher Service fetches 200,000 subscriber IDs for subreddit
→ splits into batches of 1,000
→ publishes 200 BatchNotification messages to RabbitMQ

Each BatchNotification message:
{
  postId: 123,
  subscriberIds: [userId1, userId2, ... userId1000]
}

200 small messages instead of one massive operation. N Notification Workers pull from queue. Each processes 1,000 notifications. Auto-scaling based on queue depth — not permanent instances.

Why not put everything in one message: 200,000 user IDs in one message is megabytes. RabbitMQ messages should be small. Batching keeps messages small and distributes work.

Batch size is configurable: 1,000 per batch is a starting point. Tune based on worker processing time and acceptable latency.
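The batching step can be sketched in a few lines. A minimal sketch, assuming a `make_batches` helper; the RabbitMQ publish itself is omitted.

```python
BATCH_SIZE = 1_000  # starting point from this section; tune per workload

def make_batches(post_id, subscriber_ids, batch_size=BATCH_SIZE):
    """Split one post's subscriber list into small BatchNotification payloads."""
    return [
        {"postId": post_id, "subscriberIds": subscriber_ids[i:i + batch_size]}
        for i in range(0, len(subscriber_ids), batch_size)
    ]

# 200,000 subscribers -> 200 messages of 1,000 IDs each
messages = make_batches(123, list(range(200_000)))
```

Each element would be published as one queue message, so workers pull fixed-size units of work regardless of subreddit size.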


Device Token Delivery

Worker has 1,000 user IDs. Needs device tokens to call FCM/APNs.

Two options:

Option 1 — Include tokens in batch message: Dispatcher fetches tokens alongside subscriber IDs. Bundles into message.

{
  postId: 123,
  subscribers: [
    { userId: 1, tokens: ["token_abc", "token_xyz"] },
    ...1000 entries
  ]
}

Message size: ~300KB per batch. Workers are simple — no additional lookups.

Option 2 — Token lookup in worker: Keep messages small. Worker fetches tokens from Redis using MGET.

BatchNotification message: { postId, userIds: [...1000 IDs] }  ← ~8KB

Worker:
MGET device_tokens:1 device_tokens:2 ... device_tokens:1000
→ one Redis call, batch fetch all tokens
→ call FCM/APNs

Preferred approach: Option 2. Messages stay small at ~8KB. Dispatcher stays focused on fan-out only. Workers take on a single Redis dependency; one extra batched lookup per message is an acceptable tradeoff.
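A minimal worker sketch for Option 2. Names are illustrative; the Redis MGET and FCM/APNs calls are replaced by an in-memory dict and a callback so the flow is runnable.

```python
def fetch_tokens(store, user_ids):
    """Batched lookup, analogous to MGET device_tokens:1 ... device_tokens:1000."""
    return [store.get(f"device_tokens:{uid}", []) for uid in user_ids]

def process_batch(message, store, push):
    """Handle one BatchNotification: look up tokens, push to each device."""
    for tokens in fetch_tokens(store, message["userIds"]):
        for token in tokens:
            push(token, message["postId"])  # FCM/APNs call in production

# usage with stubbed storage and push
store = {"device_tokens:1": ["token_abc", "token_xyz"],
         "device_tokens:2": ["token_p"]}
sent = []
process_batch({"postId": 123, "userIds": [1, 2]}, store,
              lambda token, post_id: sent.append((token, post_id)))
```

In production `fetch_tokens` would be one `MGET` round trip, keeping the worker at a single Redis call per batch.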


The Large Subreddit Problem

r/funny — 67 million subscribers. ~134 million devices at 2 per user.

67,000,000 ÷ 1,000 = 67,000 batch messages from one post

67,000 messages flood the queue. All other subreddits' notifications are delayed. Workers overwhelmed. Same backlog problem as Twitter celebrity fan-out.

Student correctly identified: the problem is the number of messages created, not the per-instance processing speed. Increasing batch size just shifts the bottleneck — doesn't eliminate it.


Tiered Handling — Same Pattern as Twitter

Small subreddit (< 100K subscribers): push model, full fan-out, real-time push notification to device.

Medium subreddit (100K – 1M subscribers): push model, batched fan-out.

Large subreddit (> 1M subscribers): pull model. No fan-out. Dispatcher publishes a single message:

{
  postId: 123,
  subredditId: "funny",
  type: "large_subreddit"
}

One message. No user IDs. No device tokens. Consumer writes one record to notification store. Users discover it when they open the app.

Why this works for large subreddits: r/funny is entertainment content. Low urgency. Users browse when they open the app. Real-time push notification from a 67M member community would be noise anyway.

Why it doesn't work for small subreddits: niche communities post rarely. Users want real-time notification because posts are high signal and infrequent.
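The tier thresholds above reduce to a simple classification. A sketch, assuming a `tier` helper with the cutoffs from this section:

```python
def tier(subscriber_count):
    """Classify a subreddit into a delivery tier by subscriber count."""
    if subscriber_count < 100_000:
        return "small"    # push model, full fan-out
    if subscriber_count < 1_000_000:
        return "medium"   # push model, batched fan-out
    return "large"        # pull model, one subreddit-level message
```

The dispatcher would call this once per post and either fan out batches or publish the single subreddit-level message.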


Pull Mechanism for Large Subreddits

App open pull:

User opens app
→ GET /notifications?userId=123&since=lastSeen
→ server checks user's large subreddit subscriptions
→ queries notification store for new posts since lastSeen
→ returns count or list
→ app shows badge or notification banner

Not periodic polling. Event-driven — check happens when user opens app. No battery drain. No background polling.
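The app-open check can be sketched as a filter over the notification store. Function name and store shape are assumptions, not a real API:

```python
def notifications_since(user_large_subs, notification_store, last_seen):
    """Return new large-subreddit posts since the user's last check."""
    return [
        n for n in notification_store
        if n["subredditId"] in user_large_subs and n["createdAt"] > last_seen
    ]

# usage: one user subscribed to r/funny, checking since timestamp 15
store = [
    {"subredditId": "funny", "postId": 1, "createdAt": 10},
    {"subredditId": "funny", "postId": 2, "createdAt": 20},
    {"subredditId": "aww",   "postId": 3, "createdAt": 30},
]
fresh = notifications_since({"funny"}, store, last_seen=15)
```

In the real system this is the SQL query shown in the Storage section; the work happens only when the user opens the app.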

Websocket for active users:

Users currently active in app maintain a websocket connection. Server pushes large subreddit event down the websocket in real time.

Server → websocket → "r/funny has a new post"

Not user-specific fan-out. Subreddit-level broadcast.

Why websockets are tractable: Not every user has a persistent connection. Only active users — roughly 2-5% at any moment.

200M users * 2% active = 4M concurrent connections
4M ÷ 50,000 per server = 80 websocket servers

Manageable. Distributed across many servers.


Websocket Connection Registry

Websocket connections are stateful. User 123 is connected to Websocket Server A. Notification arrives for r/funny. System needs to know which server to push to.

Redis connection registry:

Redis: user:123 → websocket_server_A

When notification arrives:

→ lookup user:123 in Redis → websocket_server_A
→ send message to websocket_server_A via Redis pub/sub
→ server_A pushes to user 123's connection

For large subreddit broadcast — no user lookup needed:

Each websocket server maintains local map:

subreddit_connections:funny → [conn1, conn2, conn3...]

Notification published to Redis pub/sub channel. All websocket servers receive it. Each pushes to local connections subscribed to that subreddit. No user-specific routing.
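A sketch of the subreddit-level broadcast on one websocket server. Connections are stubbed as lists that collect pushed events; a real server would write to open sockets, and the event would arrive via the Redis pub/sub subscription.

```python
class WebsocketServer:
    """One websocket server's local state; events arrive via pub/sub."""

    def __init__(self):
        # subreddit -> connections on THIS server subscribed to it
        self.subreddit_connections = {}

    def subscribe(self, subreddit, conn):
        self.subreddit_connections.setdefault(subreddit, []).append(conn)

    def on_pubsub_event(self, event):
        # No user-specific routing: push to every local connection
        for conn in self.subreddit_connections.get(event["subredditId"], []):
            conn.append(event)

# usage: the same event reaches all servers; each pushes only locally
server_a, server_b = WebsocketServer(), WebsocketServer()
conn1, conn2 = [], []
server_a.subscribe("funny", conn1)
server_b.subscribe("aww", conn2)
event = {"subredditId": "funny", "postId": 1}
for server in (server_a, server_b):
    server.on_pubsub_event(event)
```

Server B receives the event but has no local r/funny connections, so it does nothing — which is exactly why no central user-to-server lookup is needed for broadcasts.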


Three Delivery Mechanisms for Three User States

User state                   Delivery mechanism
Active in app                Websocket push, real-time
App closed, small sub        FCM/APNs push notification
App closed, large sub        Pull on next app open

One mechanism cannot serve all states efficiently. Design explicitly for each.


Activity Filter Optimisation

Before fan-out — filter subscribers to only those active in last 30 days.

r/funny: 67M subscribers
Active in last 30 days: ~10M
Fan-out: 10M instead of 67M

A ~7x reduction without changing architecture. Some users miss notifications. Acceptable — they're inactive anyway.
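The filter is one pass before batching. A sketch, assuming a map of days since each user was last active:

```python
def filter_active(subscriber_ids, days_since_active, window_days=30):
    """Drop subscribers with no activity in the window before fanning out."""
    return [
        uid for uid in subscriber_ids
        if days_since_active.get(uid, float("inf")) <= window_days
    ]
```

Unknown users default to infinity, i.e. never active, so they are skipped rather than notified.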


Complete Architecture

Post created
→ Dispatcher Service fetches subscriber list
→ checks subreddit size

Small/Medium subreddit:
→ activity filter (active in last 30 days)
→ batch into 1,000 per message
→ publish N BatchNotification messages to RabbitMQ
→ Notification Workers consume
→ MGET device tokens from Redis
→ call FCM/APNs

Large subreddit:
→ publish ONE subreddit-level message
→ Notification Store Consumer writes one record
→ active users receive via websocket broadcast
→ inactive users discover on next app open via REST poll

Storage

Device token store (Redis):

key: device_tokens:{userId}
value: ["token_abc", "token_xyz"]  ← user's device tokens

Large subreddit notification store (DB):

subreddit_notifications
    id            BIGINT
    subreddit_id  VARCHAR
    post_id       BIGINT
    created_at    TIMESTAMPTZ

Queried on app open:

SELECT * FROM subreddit_notifications
WHERE subreddit_id IN (user's large subreddits)
AND created_at > lastSeen

User subscription store:

small_subs:{userId}  → set of small subreddit IDs
large_subs:{userId}  → set of large subreddit IDs

Separated at subscribe time. No classification needed at notification time.
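Subscribe-time classification can be sketched as one write into the right bucket. Key names follow the storage keys above; the threshold matches the tiering section.

```python
LARGE_MIN = 1_000_000  # same cutoff as the tiered-handling section

def subscribe(store, user_id, subreddit_id, subscriber_count):
    """Write into small_subs or large_subs at subscribe time, so no
    classification is needed when a notification arrives."""
    bucket = "large_subs" if subscriber_count >= LARGE_MIN else "small_subs"
    store.setdefault(f"{bucket}:{user_id}", set()).add(subreddit_id)

# usage with an in-memory stand-in for the subscription store
store = {}
subscribe(store, 123, "funny", 67_000_000)
subscribe(store, 123, "rarebooks", 4_000)
```

A subreddit crossing the threshold later would need a one-off migration of its entries between buckets; that edge case is out of scope here.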


Patterns Used

Pattern                              Where used
Fan-out on write                     Small subreddit — push to all subscriber devices
Pull on read                         Large subreddit — user fetches on app open
Tiered handling for outliers         Small/medium push, large pull
Message batching                     200K subscribers → 200 batch messages of 1K each
Async fan-out via queue              Post event → RabbitMQ → Dispatcher → Workers
Activity filter                      Only notify users active in last 30 days
Websocket with connection registry   Active users get real-time delivery
Three delivery mechanisms            Match mechanism to user state

The Meta-Pattern — Same Arc as Twitter Feed

Naive approach    → loop through all subscribers synchronously
First problem     → too slow, one service can't handle it
Fix               → async via queue
Second problem    → large subreddit floods queue (67K messages)
Fix               → tiered handling, pull model for large subs
Third problem     → delivery to device
Fix               → match mechanism to user state

Identical arc to Twitter feed. Celebrities = large subreddits. Push model = fan-out. Pull model = client fetches on open.


Key Confusions and Resolutions

Confusion 1 — Spinning 20 instances vs batching Student initially thought scaling instances was the fix for fan-out volume. Resolved — batching reduces message count from 200,000 to 200. Workers process batches. Queue absorbs spikes. Auto-scale based on queue depth, not permanent load.

Confusion 2 — What the large subreddit consumer does Student asked what the consumer does with the subreddit-level message. Resolved — almost nothing. Writes one record to notification store. No user lookup. No device tokens. One DB write. Users pull from this store on app open.

Confusion 3 — Websocket at scale Student correctly identified that one websocket per user at 200M users is not feasible. Resolved — only active users maintain connections (~2-5%). 4M concurrent connections distributed across 80 websocket servers. Manageable. Large subreddit broadcast uses Redis pub/sub to all websocket servers — no user-specific routing needed.

Confusion 4 — Connecting patterns to each other Student felt patterns weren't forming a cohesive mental model. Resolved — the three questions that drive every HLD problem: how does data get written, how does data get read, how do you handle scale when naive approaches break. Same arc repeats across every system.


Unanswered / Needs More Depth

  • Notification preferences — user can mute specific subreddits
  • Read receipts — marking notifications as seen
  • Notification grouping — "5 new posts in r/funny" vs 5 separate notifications
  • Rate limiting per user — avoid spamming users with too many notifications

What's Next

  • Ecommerce ordering flow (covered next)
  • Deep dive on websocket scaling
  • Notification delivery guarantees — at least once vs at most once vs exactly once