Indexing — Session Recap


Starting point

Indexing came up as part of the DB Design pillar drill. Student had covered normalisation and was moving through the topic list. No prior indexing coverage in this session series. Starting level: could name relevant columns to index but couldn't explain why or in what order.


What we did

  1. Started with a concrete query on an orders table — WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC
  2. Student named the right columns but couldn't explain the reasoning behind their order
  3. Explained composite index column ordering — cardinality rule
  4. Explained covering indexes
  5. Explained B-tree structure from first principles — sorted structure with row pointers
  6. Explained three scan types — index seek, index scan, sequential scan
  7. Explained when Postgres ignores your index — low cardinality, cost model threshold
  8. Explained write overhead of indexes — every write maintains every index
  9. Covered bulk insert pattern — drop, insert, rebuild
  10. Covered over-indexing — pg_stat_user_indexes for dead index detection
  11. Partial indexes introduced but skipped at student's request

Concepts covered

What an index is

A separate data structure — a sorted copy of one or more columns with pointers back to the actual rows. Without an index, a query filtering on those columns does a full table scan — reads every row, O(n). With an index the DB walks the sorted structure straight to the relevant rows in O(log n).

Default index type in PostgreSQL and MySQL (InnoDB) is a B-tree — a balanced tree. Each node in the tree narrows the search until you reach the matching leaf entries. In Postgres the leaves carry row pointers (TIDs — tuple identifiers pointing to heap pages); in InnoDB secondary indexes the leaves carry primary key values instead.

Index on customer_id:

customer_id   row_pointer
1             → row 45
1             → row 892
2             → row 12
3             → row 567

Three scan types

Scan type                     What happens                                        When
Index seek                    Binary search to matching entries, follow pointers  Highly selective query, small result set
Index scan                    Walks entire index                                  Less selective, index still cheaper than heap
Sequential scan / table scan  Reads every row in the table                        Low cardinality, large result set, no useful index
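A hedged illustration of how this shows up in practice: EXPLAIN prints which scan the planner picked. The plan lines below are illustrative output shapes, not from a real run, and the index name idx_orders_customer_id is hypothetical (assumes an index on customer_id exists). Note that Postgres itself labels seek-style access "Index Scan" (plus "Index Only Scan" and "Bitmap Index Scan") — "index seek" is SQL Server terminology.

```sql
-- Selective predicate — planner picks the index (illustrative output):
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- → Index Scan using idx_orders_customer_id on orders ...

-- Unselective predicate — planner reads the whole table instead:
EXPLAIN SELECT * FROM orders WHERE status = 'completed';
-- → Seq Scan on orders ...
```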

Composite index column ordering

Rule: equality columns first, highest cardinality first, sort column next, range column last.

For WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC:

CREATE INDEX idx_orders_customer_status_date 
ON orders(customer_id, status, created_at DESC);

customer_id first — equality, high cardinality (many distinct values), eliminates most rows immediately. status second — equality, low cardinality (3 values) but still narrows the set. created_at last — not a filter, but in the index so DB skips the filesort entirely.

Why cardinality first matters: status first would still leave a large chunk of the table after the equality match — with only three status values, a single value can match a third or more of the rows (here 'completed' is 70%). customer_id first eliminates far more rows at the first step.

Why sort column in the index: If ORDER BY created_at DESC and the index is already sorted by created_at DESC, the DB returns rows in index order and skips the sort operation entirely. No separate in-memory sort step (the "filesort" in MySQL terms; a Sort node in a Postgres plan).
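A consequence of column order worth keeping in mind: a composite B-tree can only seek on a leading prefix of its columns. A sketch, using the index defined above:

```sql
-- Can seek via idx_orders_customer_status_date (leading column present):
SELECT * FROM orders WHERE customer_id = 123;

-- Cannot seek via it (leading column missing) — the planner falls back to
-- a sequential scan or, rarely, a full scan of the index:
SELECT * FROM orders WHERE status = 'pending';
```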

The bookshelf analogy (from pagination session): Books sorted by spine label (status, amount, id). Walk to the right shelf, right section, right position, pull 20 books. Zero wasted work. The order of labels on the spine determines how efficiently you can find what you need.

Covering index

If all columns the query needs are inside the index itself, the DB never touches the actual table. The index alone satisfies the query. This is the fastest possible read path — no heap fetches at all.

-- This query is satisfied entirely by the index above
-- if SELECT only needs customer_id, status, created_at
SELECT customer_id, status, created_at 
FROM orders 
WHERE customer_id = 123 AND status = 'pending' 
ORDER BY created_at DESC;

As soon as SELECT * or any column not in the index is needed, the DB has to follow pointers to the heap for every matching row.
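If the query needs one extra column, Postgres (11+) can keep the index seekable on the original keys while still covering, via INCLUDE. A sketch, assuming a hypothetical amount column on orders:

```sql
-- INCLUDE adds payload columns to the leaf pages without making them part
-- of the search key — the upper levels of the tree stay small
CREATE INDEX idx_orders_customer_status_date_amt
ON orders (customer_id, status, created_at DESC)
INCLUDE (amount);

-- Now this can be satisfied from the index alone too:
SELECT customer_id, status, created_at, amount
FROM orders
WHERE customer_id = 123 AND status = 'pending'
ORDER BY created_at DESC;
```

In practice an Index Only Scan also depends on the visibility map being current (i.e. recent VACUUM); otherwise Postgres still checks the heap for tuple visibility.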

When Postgres ignores your index — low cardinality

CREATE INDEX idx_status ON orders (status);
-- status has 3 values: pending, completed, cancelled
-- 'completed' = 70% of rows

Postgres looks at this and calculates: to satisfy WHERE status = 'completed' I would fetch TIDs for 70% of the heap. Random heap fetches for 70% of the table is slower than just reading the whole table sequentially. Sequential scan wins.

The index exists, costs writes to maintain, and never gets used for high-frequency status values.

The threshold: Postgres has a cost model. If the planner estimates an index scan would touch more than roughly 5–15% of the table it switches to sequential scan. The exact threshold depends on data distribution and the random_page_cost setting.
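For diagnosis you can watch the planner flip between plans. enable_seqscan and random_page_cost are real Postgres settings; per-session SET is safe for experimentation (a sketch, not production advice):

```sql
-- Discourage sequential scans for this session only, to see which index
-- plan the planner would otherwise choose
SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE status = 'completed';
RESET enable_seqscan;

-- On SSDs, lowering random_page_cost (default 4.0) makes random heap
-- fetches — and therefore index scans — look cheaper to the cost model
SET random_page_cost = 1.1;
```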

Write overhead

Every INSERT, UPDATE, or DELETE has to update every index on the table. A table with 15 indexes: every write touches all 15 index structures (at least one page each). Adds disk I/O, increases vacuum work, causes index bloat over time.

Few indexes  → fast writes, slower reads
Many indexes → slow writes, faster reads

Bulk insert pattern

Maintaining indexes row by row during a large insert is expensive. Rebuilding from scratch on sorted data is faster:

DROP INDEX idx_one;
DROP INDEX idx_two;
INSERT INTO orders SELECT * FROM staging;  -- 100k rows, no index maintenance
CREATE INDEX idx_one ON orders (...);      -- rebuild once from sorted data
CREATE INDEX idx_two ON orders (...);
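One caveat with drop-and-rebuild on a live table: plain CREATE INDEX takes a lock that blocks writes for the duration of the build. CREATE INDEX CONCURRENTLY avoids that, at the cost of a slower build, and it cannot run inside a transaction block. Sketch (the column choice is illustrative):

```sql
-- Builds without blocking concurrent writes on the table; slower than a
-- plain build, and must run outside any transaction
CREATE INDEX CONCURRENTLY idx_one ON orders (customer_id);
```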

Over-indexing detection

SELECT indexrelname, idx_scan 
FROM pg_stat_user_indexes 
WHERE relname = 'orders'
ORDER BY idx_scan;

Any index with zero or near-zero idx_scan is dead weight — costs writes to maintain, never used for reads. Drop it.
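The same view can be joined with size information to prioritise which dead index to drop first — pg_relation_size and the columns shown are standard Postgres:

```sql
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY pg_relation_size(indexrelid) DESC;
```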

Partial indexes — introduced, not drilled

A partial index indexes only a subset of rows matching a condition:

CREATE INDEX idx_orders_pending ON orders (customer_id, created_at)
WHERE status = 'pending';

Smaller index, only covers rows you actually query frequently. Not covered in depth — student skipped.
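For completeness, a query shape that could use it: the planner applies a partial index only when the query's WHERE clause provably implies the index predicate. Illustrative sketch:

```sql
-- Usable: the query's status = 'pending' implies the index predicate
SELECT customer_id, created_at
FROM orders
WHERE status = 'pending' AND customer_id = 123
ORDER BY created_at;

-- Not usable: nothing implies status = 'pending'
SELECT customer_id, created_at
FROM orders
WHERE customer_id = 123;
```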


What we messed up

Named correct index columns but couldn't explain the order — listed customer_id, status, created_at correctly but when pushed on why customer_id first, said "it's the most searched column." That's a hand-wavy answer. The precise reason is cardinality — customer_id eliminates the most rows at the first step because it has the most distinct values. Low cardinality columns like status belong after high cardinality equality columns.

"Selectivity goes down as pages go deeper" (from pagination session) — said this confidently about indexes but couldn't back it up when challenged. Selectivity is a real concept — strictly, it's the fraction of rows a predicate matches, closely tied to cardinality (distinct values relative to total rows), and it's what determines whether the query planner uses the index at all. It was the right word used in the wrong context. Needs consolidating.


Key values and config to remember

Thing                            Value / Pattern
Default index type               B-tree — balanced tree, O(log n) lookup
Composite index order            Equality → highest cardinality → sort column → range
Planner ignores index threshold  Roughly 5–15% of table rows estimated to be touched
Covering index benefit           No heap fetches — query satisfied entirely from index
Bulk insert pattern              Drop indexes → insert → rebuild
Dead index detection             pg_stat_user_indexes — check idx_scan column
Write cost                       Every write maintains every index on the table

Unanswered questions / things to investigate

Partial indexes — introduced and skipped. When do you use them, what are the practical use cases, how does the planner decide to use one. Relevant for the PEPPOL platform — indexing only active invoices for example.

Selectivity and the query planner — came up in indexing and again in pagination. What exactly is selectivity, what is cardinality, how does the planner's cost model work, when does it make wrong decisions, and how do you influence it — Postgres has no native query hints, so that means statistics updates (ANALYZE), cost settings, or extensions like pg_hint_plan.

Index bloat and VACUUM — mentioned in passing. Postgres MVCC means dead tuples accumulate. Indexes hold references to dead tuples too. How does VACUUM clean indexes, what is REINDEX, when do you need it.

EXPLAIN ANALYZE — not covered at all. The actual tool for understanding what the planner is doing. Essential for diagnosing slow queries in production. Should be drilled with real queries.

Hash indexes — B-tree was covered. Hash indexes exist for pure equality lookups and can be faster for that case. Not covered.

GIN / GiST indexes — relevant for full text search and array columns. Not covered. Could be relevant for ERP search features.


What's next

  1. Selectivity drill — close the gap that appeared twice across indexing and pagination sessions
  2. EXPLAIN ANALYZE — most practical indexing skill for production work and interviews
  3. Partial indexes — quick follow-up, was skipped mid-session
  4. API + Application Design pillar — all DB Design topics now covered: normalisation, indexes, pagination. Move to REST design, idempotency, consistency patterns