# Notification System — HLD Session Recap

## Starting Point
Twitter feed design completed. Notification system introduced as the second problem — same patterns, different domain. Goal was to force application of patterns rather than recall of a specific solution. Student drove the design more independently than the Twitter feed session.
## Problem Scope
Notification system for Reddit-style platform. When a post is made in a subreddit — all subscribers are notified.
Clarified requirements:

- Average subscribers per subreddit: 100,000
- Average posts per day across all subreddits: 500,000
- Each user has 2 devices on average
- Scope: notification delivery only — post content fetching not in scope
## Capacity Estimation
Student derived independently:
Posts per second:
500,000 posts/day ÷ 86,400 = ~5 posts/second
Devices:
100,000 avg subscribers * 2 devices = 200,000 devices per subreddit post
Notification deliveries per second:
5 posts/sec * 200,000 devices = 1,000,000 deliveries/second
What the number drives: 1 million deliveries/second is not a read/write problem. It is a throughput and fan-out problem. Same pattern as Twitter feed. One event fans out to many recipients.
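The arithmetic above can be sanity-checked in a few lines of Python (the 5 posts/second figure is the rounded value used in the session):

```python
# Re-deriving the session's capacity estimates.
POSTS_PER_DAY = 500_000
SECONDS_PER_DAY = 86_400
AVG_SUBSCRIBERS = 100_000
DEVICES_PER_USER = 2

posts_per_second = POSTS_PER_DAY / SECONDS_PER_DAY     # ~5.8, rounded to ~5 in the notes
devices_per_post = AVG_SUBSCRIBERS * DEVICES_PER_USER  # 200,000 devices per post
deliveries_per_second = 5 * devices_per_post           # 1,000,000 deliveries/second

print(posts_per_second, devices_per_post, deliveries_per_second)
```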
## The Naive Approach and Why It Breaks
One Fan-out Service receives post event. Loops through 200,000 subscribers. Calls SendNotification() for each.
Problems:

- Sequential processing of 200,000 operations per post
- At 5 posts/second = 1,000,000 sequential operations/second in one service
- Service cannot keep up
Student's instinct — multiple parallel consumers: correct direction. But spinning up 20 permanent instances for a bursty notification workload is wasteful, and auto-scaling alone reacts too slowly to spikes. The real fix is message batching behind a queue.
## Message Batching Pattern
Instead of one event per post triggering all work in one consumer — Dispatcher Service fetches subscriber list, splits into batches, publishes many small batch messages.
PostCreated event arrives
→ Dispatcher Service
→ fetch 200,000 subscriber IDs for subreddit
→ split into batches of 1,000
→ publish 200 BatchNotification messages to RabbitMQ
Each BatchNotification message:
{
postId: 123,
subscriberIds: [userId1, userId2, ... userId1000]
}
200 small messages instead of one massive operation. N Notification Workers pull from queue. Each processes 1,000 notifications. Auto-scaling based on queue depth — not permanent instances.
Why not put everything in one message: 200,000 user IDs in one message is megabytes. RabbitMQ messages should be small. Batching keeps messages small and distributes work.
Batch size is configurable: 1,000 per batch is a starting point. Tune based on worker processing time and acceptable latency.
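A minimal sketch of the batching step, assuming a `publish` callable that stands in for the RabbitMQ publish call (`make_batches` and `dispatch` are illustrative names, not from the session):

```python
def make_batches(subscriber_ids, batch_size=1000):
    """Split the full subscriber list into fixed-size chunks."""
    return [subscriber_ids[i:i + batch_size]
            for i in range(0, len(subscriber_ids), batch_size)]

def dispatch(post_id, subscriber_ids, publish):
    """Publish one small BatchNotification message per chunk.
    `publish` stands in for a RabbitMQ publish call."""
    batches = make_batches(subscriber_ids)
    for batch in batches:
        publish({"postId": post_id, "subscriberIds": batch})
    return len(batches)

# 200,000 subscribers -> 200 messages of 1,000 IDs each
queue = []
assert dispatch(123, list(range(200_000)), queue.append) == 200
```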
## Device Token Delivery
Worker has 1,000 user IDs. Needs device tokens to call FCM/APNs.
Two options:
Option 1 — Include tokens in batch message: Dispatcher fetches tokens alongside subscriber IDs. Bundles into message.
{
postId: 123,
subscribers: [
{ userId: 1, tokens: ["token_abc", "token_xyz"] },
...1000 entries
]
}
Message size: ~300KB per batch. Workers are simple — no additional lookups.
Option 2 — Token lookup in worker: Keep messages small. Worker fetches tokens from Redis using MGET.
BatchNotification message: { postId, userIds: [...1000 IDs] } ← ~8KB
Worker:
MGET device_tokens:1 device_tokens:2 ... device_tokens:1000
→ one Redis call, batch fetch all tokens
→ call FCM/APNs
Preferred approach: Option 2. Messages stay small at ~8KB. Dispatcher stays focused on fan-out only. Workers gain a single Redis dependency. The extra Redis round trip per batch is an acceptable tradeoff.
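A sketch of the Option 2 worker. `redis_client.mget` mirrors redis-py's batch get; `push` is a hypothetical wrapper around the FCM/APNs client:

```python
import json

def process_batch(message, redis_client, push):
    """Option 2 worker: one MGET for all tokens in the batch, then push.
    `redis_client.mget` matches redis-py's signature; `push` is a
    hypothetical FCM/APNs wrapper, not a real library call."""
    user_ids = message["userIds"]
    keys = [f"device_tokens:{uid}" for uid in user_ids]
    values = redis_client.mget(keys)   # single round trip for ~1,000 users
    for uid, value in zip(user_ids, values):
        if value is None:
            continue                   # user has no registered devices
        for token in json.loads(value):
            push(token, message["postId"])
```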
## The Large Subreddit Problem

r/funny — 67 million subscribers. 134 million devices.
67,000,000 ÷ 1,000 = 67,000 batch messages from one post
67,000 messages flood the queue. All other subreddits' notifications are delayed. Workers overwhelmed. Same backlog problem as Twitter celebrity fan-out.
Student correctly identified: the problem is the number of messages created, not the per-instance processing speed. Increasing batch size just shifts the bottleneck — doesn't eliminate it.
## Tiered Handling — Same Pattern as Twitter

Small subreddit (< 100K subscribers): push model, full fan-out, real-time push notification to device.
Medium subreddit (100K–1M subscribers): push model, batched fan-out.
Large subreddit (> 1M subscribers): pull model. No fan-out. Dispatcher publishes a single message:
{
postId: 123,
subredditId: "funny",
type: "large_subreddit"
}
One message. No user IDs. No device tokens. Consumer writes one record to notification store. Users discover it when they open the app.
Why this works for large subreddits: r/funny is entertainment content. Low urgency. Users browse when they open the app. Real-time push notification from a 67M member community would be noise anyway.
Why it doesn't work for small subreddits: niche communities post rarely. Users want real-time notification because posts are high signal and infrequent.
## Pull Mechanism for Large Subreddits
App open pull:
User opens app
→ GET /notifications?userId=123&since=lastSeen
→ server checks user's large subreddit subscriptions
→ queries notification store for new posts since lastSeen
→ returns count or list
→ app shows badge or notification banner
Not periodic polling. Event-driven — check happens when user opens app. No battery drain. No background polling.
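The app-open pull can be sketched as a handler, assuming a `get_large_subs` lookup and a parameterised `query` helper (both hypothetical stand-ins for the subscription store and DB client):

```python
def notifications_on_open(user_id, last_seen, get_large_subs, query):
    """Sketch of the GET /notifications handler. Runs once per app open,
    not on a polling timer."""
    subs = get_large_subs(user_id)        # e.g. {"funny"}
    if not subs:
        return {"count": 0, "items": []}
    items = query(
        "SELECT post_id, subreddit_id, created_at "
        "FROM subreddit_notifications "
        "WHERE subreddit_id = ANY(%(subs)s) AND created_at > %(since)s",
        {"subs": sorted(subs), "since": last_seen},
    )
    return {"count": len(items), "items": items}
```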
Websocket for active users:
Users currently active in app maintain a websocket connection. Server pushes large subreddit event down the websocket in real time.
Server → websocket → "r/funny has a new post"
Not user-specific fan-out. Subreddit-level broadcast.
Why websockets are tractable: Not every user has a persistent connection. Only active users — roughly 2-5% at any moment.
200M users * 2% active = 4M concurrent connections
4M ÷ 50,000 per server = 80 websocket servers
Manageable. Distributed across many servers.
## Websocket Connection Registry
Websocket connections are stateful. User 123 is connected to Websocket Server A. Notification arrives for r/funny. System needs to know which server to push to.
Redis connection registry:
Redis: user:123 → websocket_server_A
When notification arrives:
→ lookup user:123 in Redis → websocket_server_A
→ send message to websocket_server_A via Redis pub/sub
→ server_A pushes to user 123's connection
For large subreddit broadcast — no user lookup needed:
Each websocket server maintains local map:
subreddit_connections:funny → [conn1, conn2, conn3...]
Notification published to Redis pub/sub channel. All websocket servers receive it. Each pushes to local connections subscribed to that subreddit. No user-specific routing.
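A sketch of the per-server broadcast state, assuming each `conn` object exposes a `send` method (a stand-in for a real websocket write):

```python
from collections import defaultdict

class WebsocketServer:
    """Each websocket server keeps a local subreddit -> connections map.
    No user-specific routing for subreddit-level broadcasts."""
    def __init__(self):
        self.subreddit_connections = defaultdict(set)

    def subscribe(self, conn, subreddit_id):
        self.subreddit_connections[subreddit_id].add(conn)

    def on_pubsub_event(self, event):
        # Fired on every server by the Redis pub/sub channel; each server
        # pushes only to its own local connections for that subreddit.
        for conn in self.subreddit_connections.get(event["subredditId"], ()):
            conn.send(f'r/{event["subredditId"]} has a new post')
```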
## Three Delivery Mechanisms for Three User States

| User state | Delivery mechanism |
|---|---|
| Active in app | Websocket push, real-time |
| App closed, small sub | FCM/APNs push notification |
| App closed, large sub | Pull on next app open |
One mechanism cannot serve all states efficiently. Design explicitly for each.
## Activity Filter Optimisation
Before fan-out — filter subscribers to only those active in last 30 days.
r/funny: 67M subscribers
Active in last 30 days: ~10M
Fan-out: 10M instead of 67M
A ~7x reduction without changing the architecture. Some users miss notifications. Acceptable — they're inactive anyway.
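The filter itself is a one-liner over a last-seen store, sketched here with `last_active` as a userId → datetime map (an assumed interface):

```python
from datetime import datetime, timedelta

def active_subscribers(subscriber_ids, last_active, now, window_days=30):
    """Keep only subscribers seen within the activity window.
    `last_active` maps userId -> last-seen datetime (assumed store)."""
    cutoff = now - timedelta(days=window_days)
    return [uid for uid in subscriber_ids
            if uid in last_active and last_active[uid] >= cutoff]
```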
## Complete Architecture
Post created
→ Dispatcher Service
→ fetch subscriber list
→ check subreddit size
Small/Medium subreddit:
→ activity filter (active 30 days)
→ batch into 1,000 per message
→ publish N BatchNotification messages to RabbitMQ
→ Notification Workers consume
→ MGET device tokens from Redis
→ call FCM/APNs
Large subreddit:
→ publish ONE subreddit-level message
→ Notification Store Consumer writes one record
→ Active users receive via websocket broadcast
→ Inactive users discover on next app open via REST poll
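The tier check at the top of the dispatcher can be sketched as follows (the threshold constant and both `publish_*` callables are illustrative stand-ins for queue publishers):

```python
LARGE_THRESHOLD = 1_000_000

def route_post(post_id, subreddit_id, subscriber_count,
               publish_batches, publish_single):
    """Tiered dispatch: batched fan-out for small/medium subreddits,
    one subreddit-level message for large ones."""
    if subscriber_count > LARGE_THRESHOLD:
        publish_single({"postId": post_id, "subredditId": subreddit_id,
                        "type": "large_subreddit"})
        return "pull"
    publish_batches(post_id, subreddit_id)
    return "push"
```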
## Storage
Device token store (Redis):
key: device_tokens:{userId}
value: ["token_abc", "token_xyz"] ← user's device tokens
Large subreddit notification store (DB):
subreddit_notifications
id BIGINT
subreddit_id VARCHAR
post_id BIGINT
created_at TIMESTAMPTZ
Queried on app open:
SELECT * FROM subreddit_notifications
WHERE subreddit_id IN (user's large subreddits)
AND created_at > lastSeen
User subscription store:
small_subs:{userId} → set of small subreddit IDs
large_subs:{userId} → set of large subreddit IDs
Separated at subscribe time. No classification needed at notification time.
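The subscribe-time classification might look like this, assuming redis-py's `sadd` and the key names above:

```python
LARGE_THRESHOLD = 1_000_000

def on_subscribe(user_id, subreddit_id, subscriber_count, redis_client):
    """Classify once, at subscribe time, so the dispatcher never has to
    classify at notification time. `sadd` matches redis-py's signature."""
    prefix = "large_subs" if subscriber_count > LARGE_THRESHOLD else "small_subs"
    redis_client.sadd(f"{prefix}:{user_id}", subreddit_id)
```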
## Patterns Used
| Pattern | Where used |
|---|---|
| Fan-out on write | Small subreddit — push to all subscriber devices |
| Pull on read | Large subreddit — user fetches on app open |
| Tiered handling for outliers | Small/medium push, large pull |
| Message batching | 200K subscribers → 200 batch messages of 1K each |
| Async fan-out via queue | Post event → Dispatcher → RabbitMQ → Workers |
| Activity filter | Only notify users active in last 30 days |
| Websocket with connection registry | Active users get real-time delivery |
| Three delivery mechanisms | Match mechanism to user state |
## The Meta-Pattern — Same Arc as Twitter Feed
Naive approach → loop through all subscribers synchronously
First problem → too slow, one service can't handle it
Fix → async via queue
Second problem → large subreddit floods queue (67K messages)
Fix → tiered handling, pull model for large subs
Third problem → delivery to device
Fix → match mechanism to user state
Identical arc to Twitter feed. Celebrities = large subreddits. Push model = fan-out. Pull model = client fetches on open.
## Key Confusions and Resolutions

**Confusion 1 — Spinning 20 instances vs batching.** Student initially thought scaling instances was the fix for fan-out volume. Resolved — batching reduces message count from 200,000 to 200. Workers process batches. Queue absorbs spikes. Auto-scale based on queue depth, not permanent load.
**Confusion 2 — What the large subreddit consumer does.** Student asked what the consumer does with the subreddit-level message. Resolved — almost nothing. Writes one record to notification store. No user lookup. No device tokens. One DB write. Users pull from this store on app open.
**Confusion 3 — Websocket at scale.** Student correctly identified that one websocket per user at 200M users is not feasible. Resolved — only active users maintain connections (~2-5%). 4M concurrent connections distributed across 80 websocket servers. Manageable. Large subreddit broadcast uses Redis pub/sub to all websocket servers — no user-specific routing needed.
**Confusion 4 — Connecting patterns to each other.** Student felt patterns weren't forming a cohesive mental model. Resolved — the three questions that drive every HLD problem: how does data get written, how does data get read, how do you handle scale when naive approaches break. Same arc repeats across every system.
## Unanswered / Needs More Depth
- Notification preferences — user can mute specific subreddits
- Read receipts — marking notifications as seen
- Notification grouping — "5 new posts in r/funny" vs 5 separate notifications
- Rate limiting per user — avoid spamming users with too many notifications
## What's Next
- Ecommerce ordering flow (covered next)
- Deep dive on websocket scaling
- Notification delivery guarantees — at least once vs at most once vs exactly once