Ecommerce Ordering Flow — HLD Session Recap
Starting Point
Twitter feed and notification system completed. Ecommerce ordering flow introduced as third problem — different problem class. Not a fan-out problem. A transactional system under extreme load. New concepts: write contention, Redis atomic stock gate, multi-product atomicity with Lua scripts, Indian payment gateway flow, DDD persistence without EF Core, optimistic vs pessimistic concurrency.
Problem Scope
Single product purchase flow:
- Read product
- Click buy
- Add payment details
- Make payment
Cart with multiple products addressed as an extension. Student correctly scoped to single product first.
Capacity Estimation
Normal day:
1 million orders/day
80/20 rule — 80% in 6 hour peak window (6PM-12AM)
800,000 orders in 6 hours = 800,000 ÷ 21,600 = ~37 orders/second
Flash sale:
10x normal = 10 million orders/day
50% in first hour = 5 million in 60 minutes
5,000,000 ÷ 3,600 = ~1,400 orders/second
Key insight: spike is concentrated. 1,400/second at exactly 12:00 when sale starts. Auto-scaling too slow — takes minutes to spin up. Fix is pre-scaling — manually scale to 10x capacity before sale starts. Scale back down after first hour normalises.
Product page reads:
1 billion reads/day normal
10x on sale day = 10 billion reads
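Concentrated in first few minutes
~100,000 reads/second as an order-of-magnitude working figure (10 billion ÷ 86,400 is already ~115,000/second as a flat daily average, so the true peak is higher)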
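The estimates above are plain division, so they are easy to recheck. A minimal sketch, where the helper name `per_second` is an assumption made for illustration:

```python
SECONDS_PER_HOUR = 3_600


def per_second(total_events: float, window_hours: float) -> float:
    """Average rate over a window, in events per second."""
    return total_events / (window_hours * SECONDS_PER_HOUR)


# Normal day: 80% of 1 million orders inside a 6 hour peak window.
normal_peak_orders = per_second(0.8 * 1_000_000, 6)

# Flash sale: 50% of 10 million orders inside the first hour.
flash_peak_orders = per_second(0.5 * 10_000_000, 1)

# Sale-day reads: 10 billion spread flat over 24 hours.
# Traffic is concentrated, so this is only a lower bound on the real peak.
flat_reads = per_second(10_000_000_000, 24)

print(round(normal_peak_orders), round(flash_peak_orders), round(flat_reads))
# prints: 37 1389 115741
```

The flash-sale order rate is roughly 37x the normal peak on these numbers, which is the gap pre-scaling has to cover.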
Normal Day Order Flow — Saga with Orchestrator
Student derived the full flow independently. Chose orchestration-based Saga (not choreography) because the transactional flow has compensating actions that need central coordination.
POST /orders/initiate
→ validate request
→ DECRBY stock:productId qty in Redis ← reserve stock
→ INSERT order, status = awaiting_payment
→ UPDATE stock in DB
→ create payment session with Razorpay/PayU
→ respond with payment_url
User redirected to gateway (Razorpay)
→ user enters UPI PIN or card + OTP on gateway's page
→ gateway processes payment
→ two things happen simultaneously:
Webhook: Razorpay → POST /webhooks/payment
Redirect: user browser → GET /payment/callback
Backend on webhook receipt:
→ verify HMAC signature
→ UPDATE order status = confirmed or failed
→ raise OrderConfirmed or OrderFailed event
Frontend on redirect:
→ ignore status parameter in URL
→ poll GET /orders/{orderId}/status every 2 seconds
→ until confirmed or failed
→ show result to user
Why polling, not redirect status: the webhook and the redirect are sent at about the same time, so the browser redirect can arrive before the webhook has been processed. The status parameter on the redirect URL is therefore unreliable. Frontend polling converges on the correct state once the webhook updates the order. The polled order status is the authoritative read.
Why HMAC verification on webhook: any party can POST to your webhook endpoint. HMAC signature with shared secret proves the webhook actually came from Razorpay.
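A minimal sketch of the signature check, assuming the gateway sends a hex HMAC-SHA256 of the raw request body keyed with the shared webhook secret. The secret, the payload and the function names are hypothetical; the exact header name and algorithm should be confirmed against the gateway's documentation.

```python
import hashlib
import hmac


def sign(raw_body: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def is_authentic(raw_body: bytes, received_signature: str, secret: bytes) -> bool:
    expected = sign(raw_body, secret)
    # compare_digest runs in constant time, so the comparison does not
    # leak how many leading characters matched.
    return hmac.compare_digest(expected, received_signature)


WEBHOOK_SECRET = b"example-secret"                          # hypothetical value
body = b'{"order_id": "order_123", "status": "captured"}'   # hypothetical payload
good_signature = sign(body, WEBHOOK_SECRET)

print(is_authentic(body, good_signature, WEBHOOK_SECRET))         # True
print(is_authentic(body + b" ", good_signature, WEBHOOK_SECRET))  # False
```

The check has to run on the raw bytes as received. Re-serialising parsed JSON can change whitespace or key order and break the signature.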
Saga State Machine
awaiting_payment
   ↓ webhook success      ↓ webhook failure      ↓ timeout 10min
confirmed                 failed                 expired
   ↓                      ↓                      ↓
fulfillment               release stock          release stock
email/push                notify user            notify user
analytics
Compensating actions:
Payment failed → release stock reservation → mark order failed
Stock release service down → RabbitMQ retries the command until the service recovers. Background reconciliation job finds orders in failed state with stock still reserved and releases it after a threshold.
This is the same pattern as PEPPOL cleanup service. Same instinct, different domain.
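The state machine and its compensating actions can be written as a transition table. This is a sketch only: the trigger names and action labels are assumptions chosen to mirror the diagram, not identifiers from the session.

```python
from enum import Enum


class OrderStatus(Enum):
    AWAITING_PAYMENT = "awaiting_payment"
    CONFIRMED = "confirmed"
    FAILED = "failed"
    EXPIRED = "expired"


# (current status, trigger) -> (next status, follow-up actions)
TRANSITIONS = {
    (OrderStatus.AWAITING_PAYMENT, "webhook_success"):
        (OrderStatus.CONFIRMED, ["fulfillment", "email_push", "analytics"]),
    (OrderStatus.AWAITING_PAYMENT, "webhook_failure"):
        (OrderStatus.FAILED, ["release_stock", "notify_user"]),
    (OrderStatus.AWAITING_PAYMENT, "timeout"):
        (OrderStatus.EXPIRED, ["release_stock", "notify_user"]),
}


def apply(status: OrderStatus, trigger: str):
    """Return (next_status, actions). Unknown pairs leave the order unchanged."""
    return TRANSITIONS.get((status, trigger), (status, []))


print(apply(OrderStatus.AWAITING_PAYMENT, "webhook_failure"))
print(apply(OrderStatus.CONFIRMED, "timeout"))  # terminal state: no change, no actions
```

Only awaiting_payment has outgoing transitions, so a late timeout on a confirmed order, or a duplicate webhook, produces no second round of compensating actions.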
Stock Reservation — The Contention Problem
Normal day (~37 orders/sec), optimistic concurrency is sufficient:
UPDATE stock
SET quantity = quantity - @qty
WHERE product_id = @productId
AND quantity >= @qty
rowsAffected = 0 means the WHERE clause matched nothing: stock is insufficient, possibly because a concurrent order took it first. Return out of stock or retry. Conflicts are rare at normal-day volume.
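The conditional UPDATE can be exercised end to end with an in-memory SQLite database standing in for PostgreSQL. This is a sketch under that assumption; the table layout and the `reserve` helper are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (product_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)")
db.execute("INSERT INTO stock VALUES (123, 3)")
db.commit()


def reserve(product_id: int, qty: int) -> bool:
    """Decrement only if enough stock is left; report whether a row changed."""
    cur = db.execute(
        "UPDATE stock SET quantity = quantity - ? "
        "WHERE product_id = ? AND quantity >= ?",
        (qty, product_id, qty),
    )
    db.commit()
    return cur.rowcount == 1


first = reserve(123, 2)    # 3 -> 1, one row affected
second = reserve(123, 2)   # only 1 left, WHERE matches nothing
remaining = db.execute("SELECT quantity FROM stock WHERE product_id = 123").fetchone()[0]
print(first, second, remaining)  # True False 1
```

The check and the decrement are a single statement, so there is no window between reading the quantity and writing it.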
Flash sale (1,400 orders/sec on one hot product) — optimistic concurrency breaks:
1,400 concurrent writers on the same row. PostgreSQL serialises them. Massive retry storm. Connection pool exhausts. Timeouts cascade. System collapses not because of correctness failure but because of sheer contention volume.
Redis Stock Gate
Move stock count to Redis. Stock check and reservation become one atomic Redis operation.
key: stock:product_123
value: 1000 ← available units
Single product reservation:
// redis is a StackExchange.Redis IDatabase; StringDecrementAsync issues DECRBY
var remaining = await redis.StringDecrementAsync("stock:product_123", qty);
if (remaining >= 0)
{
    // stock reserved: proceed to DB write
}
else
{
    // overshot: undo immediately (INCRBY)
    await redis.StringIncrementAsync("stock:product_123", qty);
    return OutOfStock();
}
Why DECRBY is safe without explicit locking:
Redis executes commands on a single thread. DECRBY is atomic: there is no read-then-write gap. All requests queue and execute one at a time. 1,400/second on one key is trivial for Redis. No row contention. No serialisation overhead. Sub-millisecond.
This is not pessimistic locking. It is atomic operation with post-fact check. If result is negative — you went below zero. Revert immediately.
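The decrement-then-check gate can be simulated without a Redis server. In this sketch `FakeCounterStore` is a hypothetical in-memory stand-in whose lock mimics Redis running one command at a time; 50 concurrent buyers compete for 10 units.

```python
import threading


class FakeCounterStore:
    """In-memory stand-in for Redis: every call is atomic, like a single command."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._values[key] = value

    def decrby(self, key, amount):
        with self._lock:
            self._values[key] = self._values.get(key, 0) - amount
            return self._values[key]

    def incrby(self, key, amount):
        return self.decrby(key, -amount)

    def get(self, key):
        with self._lock:
            return self._values.get(key, 0)


def try_reserve(store, key, qty):
    remaining = store.decrby(key, qty)
    if remaining >= 0:
        return True          # stock reserved
    store.incrby(key, qty)   # overshot: undo immediately
    return False


store = FakeCounterStore()
store.set("stock:product_123", 10)
results = []


def buyer():
    results.append(try_reserve(store, "stock:product_123", 1))


threads = [threading.Thread(target=buyer) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), store.get("stock:product_123"))  # 10 0
```

The counter can dip below zero for a moment while overshoots are being reverted, but a reservation only succeeds on a non-negative result, so the number of successes never exceeds the stock.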
Redis and PostgreSQL Consistency
Redis is the gate. PostgreSQL is the source of truth. They must stay in sync.
Happy path:
DECRBY stock:123 qty → result >= 0 (success)
→ UPDATE stock SET quantity = quantity - @qty
WHERE product_id = @productId AND quantity >= @qty
→ success → both in sync
Postgres fails after Redis success:
DECRBY success → Postgres UPDATE fails
→ INCRBY stock:123 qty ← revert Redis immediately
→ return error to orchestrator
→ orchestrator compensates
Redis revert also fails:
Postgres fails → INCRBY fails
→ Redis says X, Postgres says X+qty → diverged
→ reconciliation job runs every 5 minutes
→ SELECT quantity FROM stock WHERE product_id = 123 → DB value
→ SET stock:123 = DB value ← Redis reset from DB
→ back in sync within 5 minutes
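DB is always source of truth. Redis is always reset from DB on divergence. In this failure mode Redis ends up lower than the DB, so the gate under-sells rather than over-sells. A 5 minute inconsistency window is acceptable: nobody loses money, and at worst buyers see a slightly stale stock count on the product page.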
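The reconciliation pass reduces to "overwrite the cache counter with the database value wherever they differ". A minimal sketch using plain dicts as stand-ins for PostgreSQL and Redis; the function name and the data are hypothetical.

```python
def reconcile(db_stock: dict, cache: dict) -> list:
    """Overwrite every diverged cache counter with the database value."""
    repaired = []
    for product_id, db_quantity in db_stock.items():
        key = f"stock:{product_id}"
        if cache.get(key) != db_quantity:
            cache[key] = db_quantity   # DB always wins
            repaired.append(key)
    return repaired


db_stock = {123: 40, 456: 7}
cache = {"stock:123": 38, "stock:456": 7}   # an INCRBY of 2 for product 123 was lost

repaired = reconcile(db_stock, cache)
print(repaired, cache["stock:123"])  # ['stock:123'] 40
```

One caveat for a live system: a reservation that has already decremented Redis but not yet committed its Postgres UPDATE would be overwritten by a blind reset, so a real job likely needs to account for in-flight reservations or run when write traffic is quiet.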
Multi-Product Atomicity — Lua Script
Single product — one DECRBY, atomic, clean.
Multiple products in one order:
Product A — quantity 2
Product B — quantity 1
Product C — quantity 3
Three separate DECRBYs. Not atomic together. Race condition:
DECRBY stock:A 2 → success, result = 8
DECRBY stock:B 1 → success, result = 0
DECRBY stock:C 3 → failure, result = -1 ← out of stock
Now must revert A and B. But competing orders may have decremented them further between your DECRBY and your INCRBY.
Student also identified a second race condition, in check-then-act: A is checked and available, B is checked and available, but by the time the check on B finishes, A has been taken by another order.
Fix — Redis Lua script:
Lua script executes atomically. No other operation can interleave. Two-pass approach: check all first, decrement all only if all pass.
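-- KEYS[i] = stock key, ARGV[i] = quantity wanted for that key
-- first pass: check all
for i=1,#KEYS do
  -- GET on a missing key yields false, tonumber(false) is nil: treat as no stock
  local stock = tonumber(redis.call('GET', KEYS[i]))
  if not stock or stock < tonumber(ARGV[i]) then
    return 0 -- fail, nothing touched
  end
end
-- second pass: decrement all
for i=1,#KEYS do
  redis.call('DECRBY', KEYS[i], ARGV[i])
end
return 1 -- success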
If any product is out of stock — return 0 immediately. Nothing has been touched. No rollback needed. No compensating action needed at Redis level.
var script = @"
for i=1,#KEYS do
  local stock = tonumber(redis.call('GET', KEYS[i]))
  if not stock or stock < tonumber(ARGV[i]) then
    return 0
  end
end
for i=1,#KEYS do
  redis.call('DECRBY', KEYS[i], ARGV[i])
end
return 1
";
// keys and quantities are matched by position: KEYS[i] pairs with ARGV[i]
var keys = new RedisKey[] { "stock:A", "stock:B", "stock:C" };
var vals = new RedisValue[] { 2, 1, 3 };
var result = await redis.ScriptEvaluateAsync(script, keys, vals);
if ((int)result == 0)
    return OutOfStock();
// all reserved, proceed to DB writes
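The check-all-then-decrement-all logic can be tried without Redis. In this sketch a lock stands in for Redis running the whole script without interleaving; `reserve_all` and the stock numbers are hypothetical.

```python
import threading

_script_lock = threading.Lock()   # stands in for Redis running a script atomically


def reserve_all(stock: dict, wanted: dict) -> bool:
    with _script_lock:
        # First pass: check every line item; a missing key counts as zero stock.
        for key, qty in wanted.items():
            if stock.get(key, 0) < qty:
                return False          # nothing has been touched
        # Second pass: every check passed, so decrement them all.
        for key, qty in wanted.items():
            stock[key] -= qty
        return True


stock = {"stock:A": 10, "stock:B": 1, "stock:C": 2}

rejected = reserve_all(stock, {"stock:A": 2, "stock:B": 1, "stock:C": 3})
after_reject = dict(stock)            # C is short by one, so A and B stay untouched

accepted = reserve_all(stock, {"stock:A": 2, "stock:B": 1})
print(rejected, after_reject, accepted, stock)
```

Because the failing order returns before the second pass, there is nothing to roll back, unlike the three separate DECRBY calls.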
Payment Abandonment — Reservation TTL
User goes to gateway. Never completes payment. Stock held forever.
Fix — TTL on reservation. Background job finds orders in awaiting_payment status older than 10 minutes. Releases stock. Marks order expired.
Order status = awaiting_payment
AND created_at < now - 10 minutes
→ INCRBY stock:productId qty (release Redis)
→ UPDATE stock SET quantity = quantity + qty (release DB)
→ UPDATE order SET status = expired
User has 10 minute window to complete payment. After that — stock available for other buyers.
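A sketch of the expiry sweep, using in-memory SQLite in place of PostgreSQL and leaving out the Redis INCRBY step. The schema, the integer timestamps and the function name are assumptions. The status update is made conditional so that an order already moved by a late webhook is skipped and its stock is not released twice.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stock  (product_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER,
                         status TEXT, created_at INTEGER);
    INSERT INTO stock  VALUES (123, 5);
    INSERT INTO orders VALUES (1, 123, 2, 'awaiting_payment', 0);
    INSERT INTO orders VALUES (2, 123, 1, 'awaiting_payment', 900);
    INSERT INTO orders VALUES (3, 123, 1, 'confirmed', 0);
""")

RESERVATION_TTL_SECONDS = 600   # the 10 minute window


def expire_stale_orders(now: int) -> int:
    cutoff = now - RESERVATION_TTL_SECONDS
    stale = db.execute(
        "SELECT id, product_id, qty FROM orders "
        "WHERE status = 'awaiting_payment' AND created_at < ?",
        (cutoff,),
    ).fetchall()
    expired = 0
    for order_id, product_id, qty in stale:
        # Conditional transition: rowcount is 0 if the order changed state meanwhile.
        cur = db.execute(
            "UPDATE orders SET status = 'expired' "
            "WHERE id = ? AND status = 'awaiting_payment'",
            (order_id,),
        )
        if cur.rowcount == 1:
            db.execute(
                "UPDATE stock SET quantity = quantity + ? WHERE product_id = ?",
                (qty, product_id),
            )
            expired += 1
    db.commit()
    return expired


expired_count = expire_stale_orders(now=1000)
stock_after = db.execute("SELECT quantity FROM stock WHERE product_id = 123").fetchone()[0]
print(expired_count, stock_after)  # 1 7
```

Only order 1 is older than the cutoff and still awaiting payment, so its 2 units go back and the stock rises from 5 to 7.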
Catalogue Cache — Product Page at Scale
100,000 product page reads/second on sale day. Cannot serve from DB.
Cache structure:
key: product:{productId}
value: { name, price, stock_level, category... }
TTL: none — invalidated by events
key: catalogue:{category}
value: [productId1, productId2, ...] ← just IDs, pre-sorted
TTL: none — invalidated by events
Read path:
GET /catalogue/electronics
→ LRANGE catalogue:electronics 0 49 ← IDs
→ MGET product:101 product:102 ... ← details
→ return
Two Redis calls. No DB involved.
Event-driven cache invalidation:
Not TTL-based. Immediate on the event that matters.
StockExhausted event
→ DEL product:123 ← remove product detail
→ LREM catalogue:electronics 0 123 ← remove from category list
StockUpdated event
→ SET product:123 { updated stock level }
PriceUpdated event
→ SET product:123 { updated price }
Surgical update. No full cache rebuild. No rebuild window. No DB spike.
Cache miss — rebuild from DB:
Product not in cache → fetch from DB → repopulate → return. Slow for first request, fast after.
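The read path and the cache-miss rebuild together form a cache-aside lookup. A minimal sketch with dicts standing in for Redis and the database; the product data and function names are hypothetical.

```python
# Stand-ins: a dict for the database and a dict for Redis.
product_db = {
    101: {"name": "Phone", "price": 9999},
    102: {"name": "Laptop", "price": 49999},
}
cache = {
    "catalogue:electronics": [101, 102],   # just IDs, pre-sorted
    "product:101": product_db[101],        # 101 is warm, 102 is not cached yet
}
db_reads = []   # records every fall-through to the database


def get_product(product_id: int) -> dict:
    key = f"product:{product_id}"
    product = cache.get(key)
    if product is None:
        # Cache miss: fetch from DB, repopulate, return.
        db_reads.append(product_id)
        product = product_db[product_id]
        cache[key] = product
    return product


def get_catalogue(category: str, limit: int = 50) -> list:
    ids = cache.get(f"catalogue:{category}", [])[:limit]
    return [get_product(pid) for pid in ids]


first = get_catalogue("electronics")    # product 102 falls through to the DB once
second = get_catalogue("electronics")   # served entirely from cache
print(db_reads)  # [102]
```

Under flash-sale load many requests can miss on the same key at the same moment, so a real implementation would likely warm hot products before the sale rather than rely on the first miss.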
Indian Payment Flow — Full Sequence
User never enters card details on merchant site. Redirected to gateway.
Step 1 — Order initiation (merchant backend)
POST /orders/initiate
→ Redis stock gate
→ INSERT order, status = awaiting_payment
→ create Razorpay order via Razorpay API
→ respond with razorpay_order_id + payment_url
Step 2 — Frontend redirects to Razorpay
User enters UPI PIN or card + OTP on Razorpay page
Merchant backend uninvolved during this step
Step 3 — Payment processed by Razorpay
Two simultaneous events:
Webhook → POST /webhooks/razorpay (server to server)
Redirect → user browser back to merchant
Step 4 — Backend processes webhook
→ verify HMAC signature
→ UPDATE order status = confirmed or failed
→ raise domain events
Step 5 — Frontend polls
→ GET /orders/{orderId}/status every 2 seconds
→ until confirmed or failed
→ show result
Why two simultaneous events: Razorpay sends the webhook AND redirects the browser. This is a race: the browser can arrive before the webhook has been processed. Solution: frontend ignores the redirect status and polls until the backend confirms via webhook.
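The polling step can be sketched as a loop that stops on a terminal status. A real frontend would run this in the browser; Python is used here only to show the control flow. The overall timeout and the `fetch_status` callable are assumptions added for the sketch, since the notes only say to poll until confirmed or failed.

```python
import time

TERMINAL = {"confirmed", "failed", "expired"}


def poll_order_status(fetch_status, interval_seconds=2.0, timeout_seconds=60.0,
                      sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if waited >= timeout_seconds:
            return "timed_out"       # let the UI show a "still processing" message
        sleep(interval_seconds)
        waited += interval_seconds


# Simulated backend: the webhook is processed just before the fourth poll.
responses = iter(["awaiting_payment", "awaiting_payment", "awaiting_payment", "confirmed"])
result = poll_order_status(lambda: next(responses), sleep=lambda s: None)
print(result)  # confirmed
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.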
DDD Without EF Core
Student asked whether DDD requires EF Core. Answer: no. EF Core is a tool. DDD is a design philosophy. Repository pattern abstracts persistence completely.
DDD only needs:
// load aggregate
var order = await _repository.GetByIdAsync(orderId);
// save aggregate
await _repository.SaveAsync(order);
How those two methods are implemented — EF Core, Dapper, raw ADO.NET — domain never knows.
DDD with Dapper:
public class StockRepository : IStockRepository
{
    // _db (IDbConnection) and _dispatcher are assumed to be injected via the constructor
    public async Task<Stock> GetByProductIdAsync(Guid productId)
    {
        var row = await _db.QuerySingleOrDefaultAsync(
            "SELECT * FROM stock WHERE product_id = @productId",
            new { productId });
        if (row == null)
            return null; // no stock row for this product
        // reconstruct aggregate from raw data
        return Stock.Reconstitute(row.product_id, row.quantity, row.version);
    }
    public async Task SaveAsync(Stock stock)
    {
        // version in the WHERE clause is the optimistic concurrency check
        var rowsAffected = await _db.ExecuteAsync(@"
            UPDATE stock
            SET quantity = @quantity, version = version + 1
            WHERE product_id = @productId AND version = @version",
            new { stock.Quantity, stock.ProductId, stock.Version });
        if (rowsAffected == 0)
            throw new ConcurrencyException();
        // dispatch events only after the row was written
        foreach (var evt in stock.DomainEvents)
            await _dispatcher.DispatchAsync(evt);
        stock.ClearDomainEvents();
    }
}
Domain model has no idea Dapper exists. Repository handles mapping. Clean separation.
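The same separation can be shown in a few lines with the persistence swapped for an in-memory store, which is the point of the abstraction. All class and method names in this sketch are hypothetical; the version check mirrors the `WHERE version = @version` clause.

```python
from dataclasses import dataclass
from typing import Protocol


class ConcurrencyError(Exception):
    pass


@dataclass
class Stock:
    """Aggregate: knows its invariant, knows nothing about storage."""
    product_id: str
    quantity: int
    version: int

    def reserve(self, qty: int) -> None:
        if qty > self.quantity:
            raise ValueError("insufficient stock")
        self.quantity -= qty


class StockRepository(Protocol):
    def get(self, product_id: str) -> Stock: ...
    def save(self, stock: Stock) -> None: ...


class InMemoryStockRepository:
    def __init__(self):
        self._rows = {}   # product_id -> (quantity, version)

    def add(self, product_id: str, quantity: int) -> None:
        self._rows[product_id] = (quantity, 1)

    def get(self, product_id: str) -> Stock:
        quantity, version = self._rows[product_id]
        return Stock(product_id, quantity, version)

    def save(self, stock: Stock) -> None:
        _, stored_version = self._rows[stock.product_id]
        if stored_version != stock.version:      # someone else saved first
            raise ConcurrencyError(stock.product_id)
        self._rows[stock.product_id] = (stock.quantity, stock.version + 1)


repo = InMemoryStockRepository()
repo.add("p1", 5)

a = repo.get("p1")
b = repo.get("p1")       # loaded at the same version as a
a.reserve(2)
repo.save(a)             # version 1 -> 2
b.reserve(1)
try:
    repo.save(b)         # stale version: rejected
    conflict = False
except ConcurrencyError:
    conflict = True

print(conflict, repo.get("p1"))
```

Swapping `InMemoryStockRepository` for a SQL-backed class would not change `Stock` or any code that calls `get` and `save`.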
What you lose without EF Core: migrations (use DbUp or Flyway instead), change tracking (manual), navigation properties (explicit loading — actually better for DDD), LINQ queries (write SQL directly).
What you gain: full SQL control, no query translation surprises, significant performance on hot paths, simpler mental model.
Optimistic vs Pessimistic Concurrency
Student's existing pattern in PEPPOL:
UPDATE orders
SET status = 'invoicing'
WHERE id = @id
AND status = 'created'
This is optimistic concurrency. Status column is the concurrency token. No lock acquired before update. Check happens at save time. rowsAffected = 0 means conflict — back off.
Optimistic concurrency — the definition: Assume no conflict will happen. Proceed without locking. Check at save time. Retry on conflict.
Pessimistic locking — the definition: Assume conflict will happen. Acquire lock before reading. Nobody else can touch the row while you hold it.
-- pessimistic
BEGIN;
SELECT * FROM stock WHERE product_id = @id FOR UPDATE;
-- row locked until COMMIT
UPDATE stock SET quantity = quantity - 1 WHERE product_id = @id;
COMMIT;
Student's insight: UPDATE WHERE is checking after the fact, not before. Correct — that IS optimistic. The strategy is optimistic (proceed without lock) even though the detection is post-fact.
EF Core Row Versioning
Two mechanisms:
ConcurrencyToken — self-managed:
[ConcurrencyCheck]
public string Status { get; set; }
EF Core automatically includes Status in WHERE clause on UPDATE. You decide the value. Good when a natural domain field exists (status, phase, state).
Timestamp/RowVersion — auto-generated:
[Timestamp]
public uint RowVersion { get; set; }
Database generates and manages the value. Changes on every UPDATE automatically. You never set it. Good when no natural concurrency field exists.
PostgreSQL xmin — built-in system column, no schema change needed:
modelBuilder.Entity<Stock>()
.Property<uint>("xmin")
.HasColumnType("xid")
.IsRowVersion();
After SaveChangesAsync, EF Core does NOT do a full SELECT. It reads back only database-generated columns, using a RETURNING clause on the same statement. If the entity has no ValueGeneratedOnAddOrUpdate columns, nothing is read back at all.
Avoid database-generated update columns at high scale: set timestamps in application code instead. Keeps EF Core in minimal operation path. No extra SELECT.
Student's correct insight: row versioning with retry loop is bad at high scale not because of extra reads but because of retry storm. 1,400 orders/second on one product — every instance constantly retrying. DB load compounds. Redis gate eliminates the retry loop entirely.
CQRS — Queries vs Write Path
Misconception clarified: queries are not locked to one use case or external consumers only. Queries are any operation that reads without changing state. They can be used anywhere — including inside command handlers.
public class ReserveStockCommandHandler
{
public async Task Handle(ReserveStockCommand command)
{
// reading inside a command handler — fine
var stock = await _stockQueries.GetStockLevelAsync(command.ProductId);
if (stock < command.Quantity)
throw new InsufficientStockException();
await _stockRepository.DecrementAsync(command.ProductId, command.Quantity);
}
}
The clean separation:
Query read path → external consumers, UI, API responses
bypasses domain, returns DTOs
thin layer, no business logic
Repository → internal domain operations, command handlers
loads aggregates, returns domain objects
enforces invariants
Same data. Different paths. Different purposes. Never crossed.
In hexagonal architecture: query read path bypasses domain entirely. Command handlers use repositories which load full aggregates. The hot path (Redis gate + raw Dapper) bends the architecture deliberately for performance — domain events still raised, domain is still aware of what happened, just didn't participate in the how.
Where EF Core Is and Isn't Appropriate
EF Core appropriate:
- Complex domain aggregates with relationships (PEPPOL invoice lifecycle)
- Standard CRUD operations not in hot path
- Schema migrations
- Rich domain model with invariants
Dapper or raw SQL appropriate:
- High throughput write paths (stock reservation, payment recording)
- Complex sharding scenarios
- Reporting and read models (CQRS read path)
- Tight loops: notification workers, fan-out services
Mixed approach in same codebase:
Domain write path → EF Core (rich model, change tracking useful)
Hot write path → Dapper (stock updates, high frequency inserts)
Read path → Dapper or raw ADO.NET (CQRS projections)
Cache operations → StackExchange.Redis (no ORM involved)
Bulk operations → COPY command, bulk insert libraries
Flash Sale — Complete Picture
11:45 → pre-scale to 10x normal capacity
12:00 → sale starts
Product page:
→ catalogue cache hit (Redis)
→ product detail cache hit (Redis)
→ no DB involved for hot products
Order flow:
→ Redis Lua script gates all products atomically
→ DB write only for successful reservations
→ Razorpay payment session created
→ user redirected to gateway
→ webhook confirms payment
→ frontend polls for result
→ post-confirmation async via RabbitMQ
Stock exhausted:
→ StockExhausted event
→ DEL product:123 from cache
→ LREM from catalogue:electronics
→ product disappears from catalogue immediately
Key Confusions and Resolutions
Confusion 1: Auto-scaling for flash sale spike
Student initially thought auto-scaling was the fix. Resolved: auto-scaling is too slow for the initial spike. Pre-scaling is the correct answer. Scale up before the sale starts, using knowledge of the event.
Confusion 2: How Redis DECRBY is safe without explicit WHERE check
Student asked how DECRBY prevents overselling without a WHERE quantity >= qty equivalent. Resolved: Redis executes commands on a single thread, so DECRBY is atomic with no read-then-write gap. The check is post-fact on the result value. If it is negative, the request overshot and reverts immediately. Equivalent safety to the DB WHERE clause without the contention.
Confusion 3: Redis and Postgres staying in sync
Student correctly identified this as a concern. Resolved: Redis is the gate, Postgres is the source of truth. Revert Redis on Postgres failure. Reconciliation job resets Redis from DB every 5 minutes. DB always wins on divergence.
Confusion 4: Indian payment flow vs western synchronous charge
Student correctly pushed back on the synchronous card charging assumption. India uses UPI, netbanking and OTP, all via redirect to the gateway. Two separate phases: order initiation and payment confirmation. Webhook is the authoritative signal. Frontend polls. Same Saga pattern, different payment step implementation.
Confusion 5: DDD requires EF Core
Student wondered if DDD was possible without EF Core. Resolved: DDD is a design philosophy, not an ORM choice. Repository pattern abstracts persistence. Any data access technology works. EF Core is convenient, Dapper is faster, neither is required by DDD.
Confusion 6: UPDATE WHERE is pessimistic locking
Student thought UPDATE WHERE status = 'created' was pessimistic. Resolved: it is optimistic. No lock is acquired upfront. The check happens at save time. That is the definition of optimistic. Student had been doing optimistic concurrency correctly all along without the name.
Confusion 7: EF Core does full SELECT after UPDATE
Student asked if EF Core reads the full row after update. Resolved: no. It reads back only database-generated columns via RETURNING in the same operation. Nothing is read back unless ValueGeneratedOnAddOrUpdate columns are configured. Avoid those at high scale.
Patterns Used
| Pattern | Where used |
|---|---|
| Saga with orchestrator | Order flow state machine, compensating actions |
| Redis as atomic gate | Stock reservation, prevents overselling |
| Lua script for atomicity | Multi-product reservation, check-all-then-decrement-all |
| Event-driven cache invalidation | StockExhausted removes product from catalogue immediately |
| Reservation TTL | Payment abandonment — 10 minute window |
| Webhook + polling | Authoritative payment confirmation, race condition handled |
| Pre-scaling | Known traffic spike — scale before not after |
| Reconciliation job | Redis/Postgres divergence safety net |
| Optimistic concurrency | DDD aggregate version field, status as concurrency token |
Unanswered / Needs More Depth
- Order idempotency — what if frontend retries POST /orders/initiate twice
- Partial fulfilment — some products in stock, some not. Ship partial or hold entire order
- Inventory reservation vs hard decrement — reserve during checkout, decrement on confirmation
- Payment retry — user's card declined, wants to try different card, order already created
- Refund flow — payment succeeded, user cancels, stock must be returned
What's Next
- Drive a full problem end to end from Step 1 without guidance
- Design WhatsApp / chat system — new problem class (websockets, message ordering, delivery guarantees)
- Design a rate limiter — applies patterns from both URL shortener and ecommerce
- Resume finalisation — all three systems (PEPPOL, ERP, AverAzure) with correct terminology