# Indexing — Session Recap

## Starting point
Indexing came up as part of the DB Design pillar drill. Student had covered normalisation and was moving through the topic list. No prior indexing coverage in this session series. Starting level: could name relevant columns to index but couldn't explain why or in what order.
## What we did
- Started with a concrete query on an `orders` table — `WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC`
- Student named the right columns but not the reasoning behind the order
- Explained composite index column ordering — cardinality rule
- Explained covering indexes
- Explained B-tree structure from first principles — sorted structure with row pointers
- Explained three scan types — index seek, index scan, sequential scan
- Explained when Postgres ignores your index — low cardinality, cost model threshold
- Explained write overhead of indexes — every write maintains every index
- Covered bulk insert pattern — drop, insert, rebuild
- Covered over-indexing — `pg_stat_user_indexes` for dead index detection
- Partial indexes introduced but skipped at student's request
## Concepts covered

### What an index is
A separate data structure — a sorted copy of one or more columns with pointers back to the actual rows. Without an index, a query filtering on those columns does a full table scan — reads every row. With an index the DB descends the sorted structure to the relevant rows in O(log n) instead of O(n).
Default index type in PostgreSQL and MySQL is a B-tree — a balanced tree. Each node in the tree narrows the search until you reach the matching leaf nodes which carry the row pointers (TIDs — tuple identifiers pointing to heap pages).
Index on `customer_id`:

```text
customer_id   row_pointer
1             → row 45
1             → row 892
2             → row 12
3             → row 567
```
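In SQL terms the diagram corresponds to something like the following — a sketch; the table definition is assumed for illustration, not taken from the session:

```sql
-- Hypothetical minimal orders table (column types assumed)
CREATE TABLE orders (
    id          bigserial PRIMARY KEY,
    customer_id bigint      NOT NULL,
    status      text        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- The single-column index sketched above: a sorted copy of
-- customer_id values, each entry carrying a pointer (TID) to its row
CREATE INDEX idx_orders_customer ON orders (customer_id);
```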
### Three scan types
| Scan type | What happens | When |
|---|---|---|
| Index seek | Binary search to matching entries, follow pointers | Highly selective query, small result set |
| Index scan | Walks entire index | Less selective, index still cheaper than heap |
| Sequential scan / table scan | Reads every row in the table | Low cardinality, large result set, no useful index |
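The three plans can be observed with plain `EXPLAIN` — a sketch, assuming an `orders` table with an index on `customer_id`; actual plan choice depends on table size and statistics:

```sql
-- Likely an index seek (Postgres labels it "Index Scan"):
-- highly selective equality predicate, small result set
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

-- Likely a full index walk: every row, but only indexed columns
-- are needed, so the heap can largely be skipped
EXPLAIN SELECT customer_id FROM orders ORDER BY customer_id;

-- Likely a "Seq Scan": low-cardinality predicate matching most rows
EXPLAIN SELECT * FROM orders WHERE status = 'completed';
```

Note that Postgres shows both seek-like and walk-like access as Index Scan / Index Only Scan; the seek-versus-scan distinction shows up in the estimated row counts rather than the node name.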
### Composite index column ordering
Rule: equality columns first (highest cardinality among them leading), then the sort column, range columns last.
For WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC:
```sql
CREATE INDEX idx_orders_customer_status_date
ON orders (customer_id, status, created_at DESC);
```
customer_id first — equality, high cardinality (many distinct values), eliminates most rows immediately. status second — equality, low cardinality (3 values) but still narrows the set. created_at last — not a filter, but in the index so DB skips the filesort entirely.
Why cardinality first matters: status first would still leave a large chunk of the table after the equality match — with only three status values, each match keeps a third of the rows on average, and far more for a skewed value. customer_id first eliminates far more rows at the first step.
Why sort column in the index: If ORDER BY created_at DESC and the index is already sorted by created_at DESC, the DB skips the sort operation entirely. No in-memory filesort.
The bookshelf analogy (from pagination session): Books sorted by spine label (status, amount, id). Walk to the right shelf, right section, right position, pull 20 books. Zero wasted work. The order of labels on the spine determines how efficiently you can find what you need.
### Covering index
If all columns the query needs are inside the index itself, the DB never touches the actual table — Postgres calls this an index-only scan. The index alone satisfies the query. This is the fastest read path — no heap fetches at all.
```sql
-- This query is satisfied entirely by the index above
-- if SELECT only needs customer_id, status, created_at
SELECT customer_id, status, created_at
FROM orders
WHERE customer_id = 123 AND status = 'pending'
ORDER BY created_at DESC;
```
As soon as SELECT * or any column not in the index is needed, the DB has to follow pointers to the heap for every matching row.
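Postgres 11+ can also carry non-key payload columns in an index via `INCLUDE`, widening the covered column set without affecting ordering — a sketch; the `total` column is hypothetical:

```sql
-- Key columns drive filtering and ordering; INCLUDE columns are
-- payload only, stored in the leaf entries for index-only scans
CREATE INDEX idx_orders_customer_status_date_total
    ON orders (customer_id, status, created_at DESC)
    INCLUDE (total);

-- Now this query can also skip the heap:
SELECT customer_id, status, created_at, total
FROM orders
WHERE customer_id = 123 AND status = 'pending'
ORDER BY created_at DESC;
```

One caveat: Postgres index-only scans also depend on the visibility map being current (recent `VACUUM`), so a covering index does not guarantee zero heap fetches on a heavily updated table.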
### When Postgres ignores your index — low cardinality
```sql
CREATE INDEX idx_status ON orders (status);
-- status has 3 values: pending, completed, cancelled
-- 'completed' = 70% of rows
```
Postgres looks at this and calculates: to satisfy WHERE status = 'completed' I would fetch TIDs for 70% of the heap. Random heap fetches for 70% of the table is slower than just reading the whole table sequentially. Sequential scan wins.
The index exists, costs writes to maintain, and never gets used for high-frequency status values.
The threshold: Postgres has a cost model. If the planner estimates an index scan would touch more than roughly 5–15% of the table it switches to sequential scan. The exact threshold depends on data distribution and the random_page_cost setting.
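The relevant cost knob can be inspected and experimented with per session — a sketch:

```sql
-- Default random_page_cost is 4.0, tuned for spinning disks;
-- on SSDs a lower value (e.g. 1.1) makes index scans look cheaper
SHOW random_page_cost;

-- Session-local experiment: see whether the planner flips from
-- sequential scan to index scan for a borderline query
SET random_page_cost = 1.1;
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
```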
### Write overhead
Every INSERT, UPDATE, or DELETE has to maintain every index on the table (Postgres HOT updates are a partial exception when no indexed column changes). A table with 15 indexes pays 15 index updates per write. Adds disk I/O, increases vacuum work, causes index bloat over time.
- Few indexes → fast writes, slower reads
- Many indexes → slow writes, faster reads
### Bulk insert pattern
Maintaining indexes row by row during a large insert is expensive. Rebuilding from scratch on sorted data is faster:
```sql
DROP INDEX idx_one;
DROP INDEX idx_two;
INSERT INTO orders SELECT * FROM staging;  -- 100k rows, no index maintenance
CREATE INDEX idx_one ON orders (...);      -- rebuild once from sorted data
CREATE INDEX idx_two ON orders (...);
```
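A hedged refinement of the pattern: since DDL is transactional in Postgres, the whole sequence can be wrapped in one transaction so other sessions never observe the table without its indexes (index name and column are illustrative):

```sql
BEGIN;
DROP INDEX idx_orders_customer;
INSERT INTO orders SELECT * FROM staging;  -- no per-row index maintenance
CREATE INDEX idx_orders_customer ON orders (customer_id);
COMMIT;
```

The trade-off: `DROP INDEX` takes an exclusive lock, so this blocks concurrent access for the duration. For a live table, `CREATE INDEX CONCURRENTLY` (which cannot run inside a transaction) is the gentler rebuild.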
### Over-indexing detection
```sql
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan;
```

Any index with zero or near-zero idx_scan is dead weight — it costs writes to maintain and is never used for reads. Drop it, after checking it isn't backing a primary key or unique constraint and that the statistics weren't recently reset.
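The same view can be joined with size information to prioritise which dead indexes to drop first — a sketch:

```sql
-- Unused indexes ranked by how much disk they waste
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
  AND idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```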
### Partial indexes — introduced, not drilled
A partial index indexes only a subset of rows matching a condition:
```sql
CREATE INDEX idx_orders_pending ON orders (customer_id, created_at)
WHERE status = 'pending';
```
Smaller index, only covers rows you actually query frequently. Not covered in depth — student skipped.
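For completeness, the kind of query the partial index above would and would not serve — a sketch:

```sql
-- Matches the index's WHERE clause, so the planner can use the
-- small partial index:
SELECT customer_id, created_at
FROM orders
WHERE status = 'pending' AND customer_id = 123
ORDER BY created_at;

-- Does NOT match the predicate, so the partial index is unusable:
SELECT customer_id, created_at
FROM orders
WHERE status = 'completed' AND customer_id = 123;
```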
## What we messed up
Named correct index columns but couldn't explain the order — listed customer_id, status, created_at correctly but when pushed on why customer_id first, said "it's the most searched column." That's a hand-wavy answer. The precise reason is cardinality — customer_id eliminates the most rows at the first step because it has the most distinct values. Low cardinality columns like status belong after high cardinality equality columns.
"Selectivity goes down as pages go deeper" (from pagination session) — said this confidently about indexes but couldn't back it up when challenged. Selectivity is a real concept — it refers to how many distinct values a column has relative to total rows, and it's what determines whether the query planner uses the index at all. It was the right word used in the wrong context. Needs consolidating.
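One way to make the cardinality/selectivity distinction concrete for the consolidation drill: Postgres exposes its own cardinality estimates in `pg_stats` — a sketch, requiring a prior `ANALYZE`:

```sql
ANALYZE orders;

-- n_distinct: estimated distinct values (a negative value means a
-- fraction of total rows). Low n_distinct (e.g. 3 for status) means
-- low cardinality — poor selectivity for equality on common values.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('customer_id', 'status');
```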
## Key values and config to remember
| Thing | Value / Pattern |
|---|---|
| Default index type | B-tree — balanced tree, O(log n) lookup |
| Composite index order | Equality → highest cardinality → sort column → range |
| Planner ignores index threshold | Roughly 5–15% of table rows estimated to be touched |
| Covering index benefit | No heap fetches — query satisfied entirely from index |
| Bulk insert pattern | Drop indexes → insert → rebuild |
| Dead index detection | pg_stat_user_indexes — check idx_scan column |
| Write cost | Every write maintains every index on the table |
## Unanswered questions / things to investigate
Partial indexes — introduced and skipped. When do you use them, what are the practical use cases, how does the planner decide to use one. Relevant for the PEPPOL platform — indexing only active invoices for example.
Selectivity and the query planner — came up in indexing and again in pagination. What exactly is selectivity, what is cardinality, how does the planner's cost model work, when does it make wrong decisions and how do you force it with hints or statistics updates (ANALYZE).
Index bloat and VACUUM — mentioned in passing. Postgres MVCC means dead tuples accumulate. Indexes hold references to dead tuples too. How does VACUUM clean indexes, what is REINDEX, when do you need it.
EXPLAIN ANALYZE — not covered at all. The actual tool for understanding what the planner is doing. Essential for diagnosing slow queries in production. Should be drilled with real queries.
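A possible starting point for that drill — a sketch; note that `EXPLAIN ANALYZE` actually executes the query, so run it on reads or inside a rolled-back transaction:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, status, created_at
FROM orders
WHERE customer_id = 123 AND status = 'pending'
ORDER BY created_at DESC;
-- Look for: node type (Index Only Scan vs Seq Scan), estimated vs
-- actual row counts, and shared buffer hits vs reads
```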
Hash indexes — B-tree was covered. Hash indexes exist for pure equality lookups and can be faster for that case (crash-safe in Postgres since version 10). Not covered.
GIN / GiST indexes — relevant for full text search and array columns. Not covered. Could be relevant for ERP search features.
## What's next
- Selectivity drill — close the gap that appeared twice across indexing and pagination sessions
- EXPLAIN ANALYZE — most practical indexing skill for production work and interviews
- Partial indexes — quick follow-up, was skipped mid-session
- API + Application Design pillar — all DB Design topics now covered: normalisation, indexes, pagination. Move to REST design, idempotency, consistency patterns