
Ecommerce Ordering Flow — HLD Session Recap


Starting Point

Twitter feed and notification system completed. Ecommerce ordering flow introduced as third problem — different problem class. Not a fan-out problem. A transactional system under extreme load. New concepts: write contention, Redis atomic stock gate, multi-product atomicity with Lua scripts, Indian payment gateway flow, DDD persistence without EF Core, optimistic vs pessimistic concurrency.


Problem Scope

Single product purchase flow:

  • Read product
  • Click buy
  • Add payment details
  • Make payment

Cart with multiple products addressed as an extension. Student correctly scoped to single product first.


Capacity Estimation

Normal day:

1 million orders/day
80/20 rule → 80% in 6-hour peak window (6PM-12AM)
800,000 orders in 6 hours = 800,000 ÷ 21,600 = ~37 orders/second

Flash sale:

10x normal = 10 million orders/day
50% in first hour = 5 million in 60 minutes
5,000,000 ÷ 3,600 = ~1,400 orders/second

Key insight: the spike is concentrated. 1,400/second arrives at exactly 12:00 when the sale starts. Auto-scaling is too slow because new instances take minutes to spin up. The fix is pre-scaling: manually scale to 10x capacity before the sale starts, then scale back down once traffic normalises after the first hour.
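The estimates above can be re-derived with a few lines of arithmetic. This is a minimal sketch; the function name `peak_rate` is an assumption made for illustration, and the inputs are the figures from the notes.

```python
SECONDS_PER_HOUR = 3_600

def peak_rate(total_per_day: int, share_in_window: float, window_hours: float) -> float:
    """Average requests per second inside the peak window."""
    return total_per_day * share_in_window / (window_hours * SECONDS_PER_HOUR)

# Normal day: 80% of 1 million orders inside a 6-hour window
normal_peak = peak_rate(1_000_000, 0.80, 6)

# Flash sale: 50% of 10 million orders inside the first hour
flash_peak = peak_rate(10_000_000, 0.50, 1)

print(f"normal peak: {normal_peak:.0f} orders/s")
print(f"flash peak:  {flash_peak:.0f} orders/s")
```

These are window averages. The instantaneous rate in the first seconds of a sale can be higher, which is one more reason to provision before the event rather than react to it.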

Product page reads:

1 billion reads/day normal
10x on sale day = 10 billion reads
Concentrated in first few minutes
~100,000 reads/second

Normal Day Order Flow — Saga with Orchestrator

Student derived the full flow independently. Chose orchestration-based Saga (not choreography) because the transactional flow has compensating actions that need central coordination.

POST /orders/initiate
→ validate request
→ DECRBY stock:productId qty in Redis    ← reserve stock
→ INSERT order, status = awaiting_payment
→ UPDATE stock in DB
→ create payment session with Razorpay/PayU
→ respond with payment_url

User redirected to gateway (Razorpay)
→ user enters UPI PIN or card + OTP on gateway's page
→ gateway processes payment
→ two things happen simultaneously:
    Webhook: Razorpay → POST /webhooks/payment
    Redirect: user browser → GET /payment/callback

Backend on webhook receipt:
→ verify HMAC signature
→ UPDATE order status = confirmed or failed
→ raise OrderConfirmed or OrderFailed event

Frontend on redirect:
→ ignore status parameter in URL
→ poll GET /orders/{orderId}/status every 2 seconds
→ until confirmed or failed
→ show result to user

Why polling, not redirect status: webhook and redirect are sent at about the same time, with no guaranteed order. The redirect can arrive before the webhook is processed, so the redirect status parameter is unreliable. Frontend polling converges on the correct state once the webhook updates the order. The polled order status is the authoritative read.

Why HMAC verification on webhook: any party can POST to your webhook endpoint. HMAC signature with shared secret proves the webhook actually came from Razorpay.
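A minimal sketch of the signature check, assuming the gateway sends a hex-encoded HMAC-SHA256 of the raw request body computed with the shared webhook secret. The secret, payload and function name below are hypothetical; the exact header name and encoding should be taken from the gateway's own documentation.

```python
import hashlib
import hmac

def verify_webhook(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Return True only if signature_hex is the HMAC-SHA256 of raw_body under secret."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest takes constant time, so response timing does not leak the expected value
    return hmac.compare_digest(expected.encode(), signature_hex.encode())

# Hypothetical secret and payload, for illustration only
secret = b"webhook-secret"
body = b'{"order_id": "order_42", "status": "captured"}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_webhook(body, good_signature, secret))         # genuine payload
print(verify_webhook(body + b" ", good_signature, secret))  # tampered payload
```

The check has to run over the raw bytes as received. Re-serialising parsed JSON can change whitespace or key order and break the comparison.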


Saga State Machine

awaiting_payment
    ↓ webhook success    ↓ webhook failure    ↓ timeout 10min
  confirmed             failed               expired
    ↓                     ↓                     ↓
fulfillment           release stock          release stock
email/push            notify user            notify user
analytics

Compensating actions:

Payment failed → release stock reservation → mark order failed

Stock release service down → RabbitMQ retries command until service recovers. Background reconciliation job finds orders in failed state with stock still reserved and releases it after a threshold.
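The state machine and its compensating actions can be written as a small transition table. This is a sketch only: the event and command names are assumptions chosen to mirror the diagram, not names from a real codebase.

```python
from enum import Enum

class OrderStatus(Enum):
    AWAITING_PAYMENT = "awaiting_payment"
    CONFIRMED = "confirmed"
    FAILED = "failed"
    EXPIRED = "expired"

# event -> (next status, commands the orchestrator would publish)
TRANSITIONS = {
    "webhook_success": (OrderStatus.CONFIRMED, ["start_fulfillment", "send_email_push", "record_analytics"]),
    "webhook_failure": (OrderStatus.FAILED, ["release_stock", "notify_user"]),
    "timeout": (OrderStatus.EXPIRED, ["release_stock", "notify_user"]),
}

def apply_event(status: OrderStatus, event: str):
    """Return (new_status, commands). Events on an already terminal order are ignored."""
    if status is not OrderStatus.AWAITING_PAYMENT:
        return status, []  # duplicate or late event: no state change, no side effects
    next_status, commands = TRANSITIONS[event]
    return next_status, list(commands)

print(apply_event(OrderStatus.AWAITING_PAYMENT, "webhook_failure"))
print(apply_event(OrderStatus.CONFIRMED, "timeout"))
```

Ignoring events on terminal orders is what keeps a redelivered webhook or a late timeout from releasing stock twice.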

This is the same pattern as PEPPOL cleanup service. Same instinct, different domain.


Stock Reservation — The Contention Problem

Normal day (~37 orders/sec), optimistic concurrency sufficient:

UPDATE stock 
SET quantity = quantity - @qty
WHERE product_id = @productId 
AND quantity >= @qty

rowsAffected = 0 means insufficient stock, possibly because a concurrent order took the last units. Return out of stock or retry. Conflicts rare at ~37/sec.
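The conditional UPDATE can be tried end to end with an in-memory SQLite database standing in for PostgreSQL. This is an assumption made purely so the sketch runs anywhere; the SQL shape is the same, and the `reserve` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (product_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)")
conn.execute("INSERT INTO stock VALUES (123, 3)")
conn.commit()

def reserve(product_id: int, qty: int) -> bool:
    """Decrement only if enough stock remains; True when exactly one row changed."""
    cur = conn.execute(
        "UPDATE stock SET quantity = quantity - ? WHERE product_id = ? AND quantity >= ?",
        (qty, product_id, qty),
    )
    conn.commit()
    return cur.rowcount == 1

first = reserve(123, 2)   # 3 units available, 2 requested
second = reserve(123, 2)  # only 1 unit left, 2 requested
remaining = conn.execute("SELECT quantity FROM stock WHERE product_id = 123").fetchone()[0]
print(first, second, remaining)
```

The check and the write happen in one statement, so there is no window between reading the quantity and changing it.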

Flash sale (1,400 orders/sec on one hot product) — optimistic concurrency breaks:

1,400 concurrent writers on the same row. PostgreSQL serialises them. Massive retry storm. Connection pool exhausts. Timeouts cascade. System collapses not because of correctness failure but because of sheer contention volume.


Redis Stock Gate

Move stock count to Redis. Stock check and reservation become one atomic Redis operation.

key: stock:product_123
value: 1000  ← available units

Single product reservation:

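// redis is a StackExchange.Redis IDatabase; StringDecrementAsync issues DECRBY
var remaining = await redis.StringDecrementAsync("stock:product_123", qty);

if (remaining >= 0)
{
    // stock reserved, proceed to DB write
}
else
{
    // overshot: undo immediately (StringIncrementAsync issues INCRBY)
    await redis.StringIncrementAsync("stock:product_123", qty);
    return OutOfStock();
}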

Why DECRBY is safe without explicit locking:

Redis executes commands on a single thread. DECRBY is atomic, with no read-then-write gap. All requests queue and execute one at a time. 1,400/second on one key is trivial for Redis. No row contention. No serialisation overhead. Sub-millisecond.

This is not pessimistic locking. It is atomic operation with post-fact check. If result is negative — you went below zero. Revert immediately.
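The decrement-then-check behaviour can be simulated without a Redis server. In this sketch the `StockGate` class is a hypothetical in-memory stand-in, and its lock models Redis executing one command at a time; 1,400 concurrent buyers compete for 1,000 units.

```python
import threading

class StockGate:
    """In-memory stand-in for a Redis counter. The lock models one-command-at-a-time execution."""

    def __init__(self, initial: int):
        self._value = initial
        self._lock = threading.Lock()

    def decr_by(self, qty: int) -> int:
        with self._lock:
            self._value -= qty
            return self._value

    def incr_by(self, qty: int) -> int:
        with self._lock:
            self._value += qty
            return self._value

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

def try_reserve(gate: StockGate, qty: int) -> bool:
    if gate.decr_by(qty) >= 0:
        return True
    gate.incr_by(qty)  # overshot: undo immediately
    return False

gate = StockGate(1000)
results = []
results_lock = threading.Lock()

def buyer() -> None:
    ok = try_reserve(gate, 1)
    with results_lock:
        results.append(ok)

threads = [threading.Thread(target=buyer) for _ in range(1400)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results), gate.value)
```

One caveat the simulation does not show: with mixed quantities, a large request that overshoots can briefly push the counter negative and cause a smaller concurrent request to be rejected even though stock remains. The gate never oversells, but it can occasionally report a false out of stock.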


Redis and PostgreSQL Consistency

Redis is the gate. PostgreSQL is the source of truth. They must stay in sync.

Happy path:

DECRBY stock:123 qty → result >= 0 (success)
→ UPDATE stock SET quantity = quantity - @qty
  WHERE product_id = @productId AND quantity >= @qty
→ success → both in sync

Postgres fails after Redis success:

DECRBY success → Postgres UPDATE fails
→ INCRBY stock:123 qty  ← revert Redis immediately
→ return error to orchestrator
→ orchestrator compensates

Redis revert also fails:

Postgres fails → INCRBY fails
→ Redis says X, Postgres says X+qty → diverged
→ reconciliation job runs every 5 minutes
→ SELECT quantity FROM stock WHERE product_id = 123  → DB value
→ SET stock:123 = DB value  ← Redis reset from DB
→ back in sync within 5 minutes

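DB is always source of truth. Redis is always reset from DB on divergence. In this failure mode Redis undercounts (X against X+qty in the DB), so the risk is underselling, not overselling. A 5 minute inconsistency window is acceptable: nobody loses money, and users may just see a slightly stale stock count on the product page.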
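The reconciliation pass amounts to "copy the DB value over any Redis value that disagrees". A minimal sketch with plain dictionaries standing in for both stores; the function name and data are hypothetical, and the sketch ignores reservations that are in flight between the Redis decrement and the DB write, which a real job would have to account for.

```python
def reconcile(db_stock: dict, redis_stock: dict) -> list:
    """Reset every diverged Redis counter from the DB value; return the keys that were repaired."""
    repaired = []
    for product_id, db_quantity in db_stock.items():
        key = f"stock:{product_id}"
        if redis_stock.get(key) != db_quantity:
            redis_stock[key] = db_quantity  # DB wins on divergence
            repaired.append(key)
    return repaired

db_stock = {123: 10, 456: 5}
redis_stock = {"stock:123": 8, "stock:456": 5}  # 123 diverged after a failed revert

repaired = reconcile(db_stock, redis_stock)
print(repaired, redis_stock)
```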


Multi-Product Atomicity — Lua Script

Single product — one DECRBY, atomic, clean.

Multiple products in one order:

Product A — quantity 2
Product B — quantity 1
Product C — quantity 3

Three separate DECRBYs. Not atomic together. Race condition:

DECRBY stock:A 2  → success, result = 8
DECRBY stock:B 1  → success, result = 0
DECRBY stock:C 3  → failure, result = -1  ← out of stock

Now must revert A and B. But competing orders may have decremented them further between your DECRBY and your INCRBY.

Student also identified a second race condition with check-then-decrement: A is checked and available, but by the time B is being checked, A has been taken by another order.

Fix — Redis Lua script:

Lua script executes atomically. No other operation can interleave. Two-pass approach: check all first, decrement all only if all pass.

-- first pass: check all (a missing key counts as zero stock)
for i=1,#KEYS do
    local stock = tonumber(redis.call('GET', KEYS[i])) or 0
    if stock < tonumber(ARGV[i]) then
        return 0  -- fail, nothing touched
    end
end

-- second pass: decrement all
for i=1,#KEYS do
    redis.call('DECRBY', KEYS[i], ARGV[i])
end

return 1  -- success

If any product is out of stock — return 0 immediately. Nothing has been touched. No rollback needed. No compensating action needed at Redis level.

var script = @"
    -- first pass: check all (a missing key counts as zero stock)
    for i=1,#KEYS do
        local stock = tonumber(redis.call('GET', KEYS[i])) or 0
        if stock < tonumber(ARGV[i]) then
            return 0
        end
    end
    -- second pass: decrement all
    for i=1,#KEYS do
        redis.call('DECRBY', KEYS[i], ARGV[i])
    end
    return 1
";

var keys = new RedisKey[] { "stock:A", "stock:B", "stock:C" };
var vals = new RedisValue[] { 2, 1, 3 };

var result = await redis.ScriptEvaluateAsync(script, keys, vals);

if ((int)result == 0)
    return OutOfStock();
// all reserved, proceed to DB writes
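The all-or-nothing behaviour of the two-pass script can be modelled in memory. In this sketch a lock plays the role of Redis running the whole script with nothing interleaved; `reserve_all` and the stock figures are assumptions for illustration.

```python
import threading

_script_lock = threading.Lock()  # models Redis running a script with no interleaving

def reserve_all(stock: dict, wanted: dict) -> bool:
    """Reserve every item or none. Returns False without touching stock if any item is short."""
    with _script_lock:
        # first pass: check all, a missing key counts as zero stock
        for key, qty in wanted.items():
            if stock.get(key, 0) < qty:
                return False
        # second pass: decrement all
        for key, qty in wanted.items():
            stock[key] -= qty
        return True

stock = {"stock:A": 10, "stock:B": 1, "stock:C": 2}

rejected = reserve_all(stock, {"stock:A": 2, "stock:B": 1, "stock:C": 3})  # C is short by one
after_reject = dict(stock)
accepted = reserve_all(stock, {"stock:A": 2, "stock:B": 1, "stock:C": 2})

print(rejected, after_reject)
print(accepted, stock)
```

Because the failing order returns before the second pass starts, there is nothing to roll back for A and B.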

Payment Abandonment — Reservation TTL

User goes to gateway. Never completes payment. Stock held forever.

Fix — TTL on reservation. Background job finds orders in awaiting_payment status older than 10 minutes. Releases stock. Marks order expired.

Order status = awaiting_payment
AND created_at < now - 10 minutes
→ INCRBY stock:productId qty  (release Redis)
→ UPDATE stock SET quantity = quantity + qty  (release DB)
→ UPDATE order SET status = expired

User has 10 minute window to complete payment. After that — stock available for other buyers.
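A sketch of the expiry job over in-memory data, assuming orders are plain dictionaries and stock is keyed by product id. All names and the fixed timestamp are hypothetical, and a real job would make the status change conditional on the order still being in awaiting_payment at write time.

```python
from datetime import datetime, timedelta, timezone

RESERVATION_TTL = timedelta(minutes=10)

def expire_stale_orders(orders: list, stock: dict, now: datetime) -> list:
    """Expire awaiting_payment orders older than the TTL and return their stock."""
    expired_ids = []
    for order in orders:
        is_stale = now - order["created_at"] > RESERVATION_TTL
        if order["status"] == "awaiting_payment" and is_stale:
            stock[order["product_id"]] += order["qty"]  # release the reservation
            order["status"] = "expired"
            expired_ids.append(order["id"])
    return expired_ids

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
orders = [
    {"id": 1, "product_id": 123, "qty": 2, "status": "awaiting_payment", "created_at": now - timedelta(minutes=11)},
    {"id": 2, "product_id": 123, "qty": 1, "status": "awaiting_payment", "created_at": now - timedelta(minutes=5)},
    {"id": 3, "product_id": 123, "qty": 4, "status": "confirmed", "created_at": now - timedelta(minutes=30)},
]
stock = {123: 0}

expired = expire_stale_orders(orders, stock, now)
print(expired, stock)
```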


Catalogue Cache — Product Page at Scale

100,000 product page reads/second on sale day. Cannot serve from DB.

Cache structure:

key: product:{productId}
value: { name, price, stock_level, category... }
TTL: none — invalidated by events

key: catalogue:{category}
value: [productId1, productId2, ...]  ← just IDs, pre-sorted
TTL: none — invalidated by events

Read path:

GET /catalogue/electronics
→ LRANGE catalogue:electronics 0 49  → IDs
→ MGET product:101 product:102 ...   → details
→ return

Two Redis calls. No DB involved.

Event-driven cache invalidation:

Not TTL-based. Immediate on the event that matters.

StockExhausted event
→ DEL product:123                    ← remove product detail
→ LREM catalogue:electronics 0 123  ← remove from category list

StockUpdated event
→ SET product:123 { updated stock level }

PriceUpdated event
→ SET product:123 { updated price }

Surgical update. No full cache rebuild. No rebuild window. No DB spike.

Cache miss — rebuild from DB:

Product not in cache → fetch from DB → repopulate → return. Slow for first request, fast after.
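The read path, the cache-miss rebuild and the StockExhausted invalidation fit in one small in-memory model. The class and its method names are assumptions for illustration; dictionaries stand in for Redis and for the product table.

```python
class CatalogueCache:
    def __init__(self, db_products: dict):
        self._db = db_products  # stand-in for the product table
        self.products = {}      # product:{id} -> details
        self.categories = {}    # catalogue:{category} -> ordered list of ids
        self.db_reads = 0

    def get_product(self, product_id: int) -> dict:
        if product_id not in self.products:  # cache miss: rebuild from DB
            self.db_reads += 1
            self.products[product_id] = dict(self._db[product_id])
        return self.products[product_id]

    def list_category(self, category: str, limit: int = 50) -> list:
        ids = self.categories.get(category, [])[:limit]  # LRANGE 0 49
        return [self.get_product(pid) for pid in ids]    # MGET

    def on_stock_exhausted(self, product_id: int, category: str) -> None:
        self.products.pop(product_id, None)  # DEL product:{id}
        ids = self.categories.get(category, [])
        self.categories[category] = [i for i in ids if i != product_id]  # LREM ... 0 id

db = {101: {"name": "Phone", "price": 9999}, 102: {"name": "Laptop", "price": 49999}}
cache = CatalogueCache(db)
cache.categories["electronics"] = [101, 102]

cache.list_category("electronics")  # two misses, two DB reads
cache.list_category("electronics")  # served entirely from cache
reads_after_warmup = cache.db_reads

cache.on_stock_exhausted(101, "electronics")
visible = cache.list_category("electronics")
print(reads_after_warmup, visible)
```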


Indian Payment Flow — Full Sequence

User never enters card details on merchant site. Redirected to gateway.

Step 1 — Order initiation (merchant backend)
POST /orders/initiate
→ Redis stock gate
→ INSERT order, status = awaiting_payment
→ create Razorpay order via Razorpay API
→ respond with razorpay_order_id + payment_url

Step 2 — Frontend redirects to Razorpay
User enters UPI PIN or card + OTP on Razorpay page
Merchant backend uninvolved during this step

Step 3 — Payment processed by Razorpay
Two simultaneous events:
  Webhook → POST /webhooks/razorpay (server to server)
  Redirect → user browser back to merchant

Step 4 — Backend processes webhook
→ verify HMAC signature
→ UPDATE order status = confirmed or failed
→ raise domain events

Step 5 — Frontend polls
→ GET /orders/{orderId}/status every 2 seconds
→ until confirmed or failed
→ show result

Why two simultaneous events: Razorpay sends the webhook AND redirects the browser. Race condition: the browser can arrive before the webhook is processed. Solution: frontend ignores redirect status and polls until backend confirms via webhook.
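The polling step from the sequence above, sketched as a plain loop. `fetch_status` stands for the GET /orders/{orderId}/status call. The overall timeout and the treatment of expired as a terminal state are assumptions added here so the loop always ends, and the injectable `sleep` exists only to keep the example fast.

```python
import time

TERMINAL = {"confirmed", "failed", "expired"}

def poll_order_status(fetch_status, interval_s=2.0, timeout_s=120.0, sleep=time.sleep) -> str:
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if waited >= timeout_s:
            return "timed_out"
        sleep(interval_s)
        waited += interval_s

# Simulated backend: the webhook lands after the second poll
responses = iter(["awaiting_payment", "awaiting_payment", "confirmed"])
result = poll_order_status(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```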


DDD Without EF Core

Student asked whether DDD requires EF Core. Answer: no. EF Core is a tool. DDD is a design philosophy. Repository pattern abstracts persistence completely.

DDD only needs:

// load aggregate
var order = await _repository.GetByIdAsync(orderId);

// save aggregate
await _repository.SaveAsync(order);

How those two methods are implemented — EF Core, Dapper, raw ADO.NET — domain never knows.

DDD with Dapper:

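using System.Collections.Generic;
using System.Data;
using Dapper;

public class StockRepository : IStockRepository
{
    private readonly IDbConnection _db;
    // assumed application-level dispatcher interface, not part of Dapper
    private readonly IDomainEventDispatcher _dispatcher;

    public StockRepository(IDbConnection db, IDomainEventDispatcher dispatcher)
    {
        _db = db;
        _dispatcher = dispatcher;
    }

    public async Task<Stock> GetByProductIdAsync(Guid productId)
    {
        var row = await _db.QuerySingleOrDefaultAsync(
            "SELECT product_id, quantity, version FROM stock WHERE product_id = @productId",
            new { productId });

        if (row == null)
            throw new KeyNotFoundException($"No stock row for product {productId}");

        // reconstruct aggregate from raw data
        return Stock.Reconstitute(row.product_id, row.quantity, row.version);
    }

    public async Task SaveAsync(Stock stock)
    {
        // version check: the UPDATE matches no row if another writer saved first
        var rowsAffected = await _db.ExecuteAsync(@"
            UPDATE stock
            SET quantity = @quantity, version = version + 1
            WHERE product_id = @productId AND version = @version",
            new { stock.Quantity, stock.ProductId, stock.Version });

        if (rowsAffected == 0)
            throw new ConcurrencyException();

        foreach (var evt in stock.DomainEvents)
            await _dispatcher.DispatchAsync(evt);

        stock.ClearDomainEvents();
    }
}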

Domain model has no idea Dapper exists. Repository handles mapping. Clean separation.

What you lose without EF Core: migrations (use DbUp or Flyway instead), change tracking (manual), navigation properties (explicit loading — actually better for DDD), LINQ queries (write SQL directly).

What you gain: full SQL control, no query translation surprises, significant performance on hot paths, simpler mental model.


Optimistic vs Pessimistic Concurrency

Student's existing pattern in PEPPOL:

UPDATE orders 
SET status = 'invoicing' 
WHERE id = @id 
AND status = 'created'

This is optimistic concurrency. Status column is the concurrency token. No lock acquired before update. Check happens at save time. rowsAffected = 0 means conflict — back off.

Optimistic concurrency — the definition: Assume no conflict will happen. Proceed without locking. Check at save time. Retry on conflict.

Pessimistic locking — the definition: Assume conflict will happen. Acquire lock before reading. Nobody else can touch the row while you hold it.

-- pessimistic
BEGIN;
SELECT * FROM stock WHERE product_id = @id FOR UPDATE;
-- row locked until COMMIT
UPDATE stock SET quantity = quantity - 1 WHERE product_id = @id;
COMMIT;

Student's insight: UPDATE WHERE is checking after the fact, not before. Correct — that IS optimistic. The strategy is optimistic (proceed without lock) even though the detection is post-fact.


EF Core Row Versioning

Two mechanisms:

ConcurrencyToken — self-managed:

[ConcurrencyCheck]
public string Status { get; set; }

EF Core automatically includes Status in WHERE clause on UPDATE. You decide the value. Good when a natural domain field exists (status, phase, state).

Timestamp/RowVersion — auto-generated:

[Timestamp]
public uint RowVersion { get; set; }

Database generates and manages the value. Changes on every UPDATE automatically. You never set it. Good when no natural concurrency field exists.

PostgreSQL xmin — built-in system column, no schema change needed:

modelBuilder.Entity<Stock>()
    .Property<uint>("xmin")
    .HasColumnType("xid")
    .IsRowVersion();

EF Core after SaveChangesAsync: does NOT do a full SELECT. Reads back only database-generated columns using RETURNING clause in same operation. No extra round trip unless you have ValueGeneratedOnAddOrUpdate columns.

Avoid database-generated update columns at high scale: set timestamps in application code instead. Keeps EF Core in minimal operation path. No extra SELECT.

Student's correct insight: row versioning with retry loop is bad at high scale not because of extra reads but because of retry storm. 1,400 orders/second on one product — every instance constantly retrying. DB load compounds. Redis gate eliminates the retry loop entirely.
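The cost of the retry storm can be put in numbers with a deliberately pessimistic model: in each round every pending writer reads the same version, exactly one UPDATE matches, and the rest retry. This is an assumed worst case for illustration, not a measurement.

```python
def total_update_attempts(writers: int) -> int:
    """Worst case: each round all pending writers issue an UPDATE and exactly one succeeds."""
    attempts = 0
    pending = writers
    while pending > 0:
        attempts += pending  # every pending writer tries this round
        pending -= 1         # one matches the version and commits
    return attempts

for writers in (10, 100, 1400):
    print(writers, total_update_attempts(writers))
```

Attempts grow as n(n+1)/2, so 1,400 writers on one row can generate close to a million UPDATE statements to complete 1,400 orders. A single atomic decrement per order, as with the Redis gate, keeps that linear.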


CQRS — Queries vs Write Path

Misconception clarified: queries are not locked to one use case or external consumers only. Queries are any operation that reads without changing state. They can be used anywhere — including inside command handlers.

public class ReserveStockCommandHandler
{
    public async Task Handle(ReserveStockCommand command)
    {
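        // reading inside a command handler is fine, but this is only a fast pre-check
        var stock = await _stockQueries.GetStockLevelAsync(command.ProductId);

        if (stock < command.Quantity)
            throw new InsufficientStockException();

        // the write must still enforce quantity >= qty atomically (conditional UPDATE or Redis gate)
        await _stockRepository.DecrementAsync(command.ProductId, command.Quantity);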
    }
}

The clean separation:

Query read path → external consumers, UI, API responses
                  bypasses domain, returns DTOs
                  thin layer, no business logic

Repository      → internal domain operations, command handlers
                  loads aggregates, returns domain objects
                  enforces invariants

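Same data. Different paths. Different purposes. A command handler may call a query for a pre-check, but a query never changes state and never enforces an invariant.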

In hexagonal architecture: query read path bypasses domain entirely. Command handlers use repositories which load full aggregates. The hot path (Redis gate + raw Dapper) bends the architecture deliberately for performance — domain events still raised, domain is still aware of what happened, just didn't participate in the how.


Where EF Core Is and Isn't Appropriate

EF Core appropriate:

  • Complex domain aggregates with relationships (PEPPOL invoice lifecycle)
  • Standard CRUD operations not in hot path
  • Schema migrations
  • Rich domain model with invariants

Dapper or raw SQL appropriate:

  • High throughput write paths (stock reservation, payment recording)
  • Complex sharding scenarios
  • Reporting and read models (CQRS read path)
  • Tight loops: notification workers, fan-out services

Mixed approach in same codebase:

Domain write path     → EF Core (rich model, change tracking useful)
Hot write path        → Dapper (stock updates, high frequency inserts)
Read path             → Dapper or raw ADO.NET (CQRS projections)
Cache operations      → StackExchange.Redis (no ORM involved)
Bulk operations       → COPY command, bulk insert libraries

Flash Sale — Complete Picture

11:45  pre-scale to 10x normal capacity
12:00  sale starts

Product page:
→ catalogue cache hit (Redis)
→ product detail cache hit (Redis)
→ no DB involved for hot products

Order flow:
→ Redis Lua script gates all products atomically
→ DB write only for successful reservations
→ Razorpay payment session created
→ user redirected to gateway
→ webhook confirms payment
→ frontend polls for result
→ post-confirmation async via RabbitMQ

Stock exhausted:
→ StockExhausted event
→ DEL product:123 from cache
→ LREM from catalogue:electronics
→ product disappears from catalogue immediately

Key Confusions and Resolutions

Confusion 1: Auto-scaling for flash sale spike
Student initially thought auto-scaling was the fix. Resolved: auto-scaling too slow for initial spike. Pre-scaling is the correct answer. Scale up before sale starts using knowledge of the event.

Confusion 2: How Redis DECRBY is safe without explicit WHERE check
Student asked how DECRBY prevents overselling without a WHERE quantity >= qty equivalent. Resolved: Redis is single-threaded. DECRBY atomic. No read-then-write gap. Check is post-fact on result value. If negative, overshot, revert immediately. Equivalent safety to DB WHERE clause without the contention.

Confusion 3: Redis and Postgres staying in sync
Student correctly identified this as a concern. Resolved: Redis is gate, Postgres is source of truth. Revert Redis on Postgres failure. Reconciliation job resets Redis from DB every 5 minutes. DB always wins on divergence.

Confusion 4: Indian payment flow vs western synchronous charge
Student correctly pushed back on synchronous card charging assumption. India uses UPI, netbanking, OTP, all via redirect to gateway. Two separate phases: order initiation and payment confirmation. Webhook is authoritative signal. Frontend polls. Same Saga pattern, different payment step implementation.

Confusion 5: DDD requires EF Core
Student wondered if DDD was possible without EF Core. Resolved: DDD is a design philosophy, not an ORM choice. Repository pattern abstracts persistence. Any data access technology works. EF Core is convenient, Dapper is faster, neither is required by DDD.

Confusion 6: UPDATE WHERE is pessimistic locking
Student thought UPDATE WHERE status = 'created' was pessimistic. Resolved: it's optimistic. No lock acquired upfront. Check happens at save time. That's the definition of optimistic. Student had been doing optimistic concurrency correctly all along without the name.

Confusion 7: EF Core does full SELECT after UPDATE
Student asked if EF Core reads full row after update. Resolved: no. Reads back only database-generated columns via RETURNING in same operation. No extra round trip unless ValueGeneratedOnAddOrUpdate columns configured. Avoid those at high scale.


Patterns Used

Pattern                          Where used
Saga with orchestrator           Order flow state machine, compensating actions
Redis as atomic gate             Stock reservation, prevents overselling
Lua script for atomicity         Multi-product reservation, check-all-then-decrement-all
Event-driven cache invalidation  StockExhausted removes product from catalogue immediately
Reservation TTL                  Payment abandonment, 10 minute window
Webhook + polling                Authoritative payment confirmation, race condition handled
Pre-scaling                      Known traffic spike, scale before not after
Reconciliation job               Redis/Postgres divergence safety net
Optimistic concurrency           DDD aggregate version field, status as concurrency token

Unanswered / Needs More Depth

  • Order idempotency — what if frontend retries POST /orders/initiate twice
  • Partial fulfilment — some products in stock, some not. Ship partial or hold entire order
  • Inventory reservation vs hard decrement — reserve during checkout, decrement on confirmation
  • Payment retry — user's card declined, wants to try different card, order already created
  • Refund flow — payment succeeded, user cancels, stock must be returned

What's Next

  • Drive a full problem end to end from Step 1 without guidance
  • Design WhatsApp / chat system — new problem class (websockets, message ordering, delivery guarantees)
  • Design a rate limiter — applies patterns from both URL shortener and ecommerce
  • Resume finalisation — all three systems (PEPPOL, ERP, AverAzure) with correct terminology