Database Query Optimization for Large Datasets

When Catalyst PSA hit 10 million time entries and queries started taking 30+ seconds, we had to get serious about database optimization. After 27 years of database work and optimizing queries across PostgreSQL, SQL Server, MySQL, and MongoDB, here are the techniques that made the biggest impact.

Start with EXPLAIN: Understand Before Optimizing

Never optimize blind. Always start with query execution plans.

PostgreSQL EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT p.id, p.name, COUNT(te.id) as entry_count
FROM projects p
LEFT JOIN time_entries te ON te.project_id = p.id
WHERE p.status = 'active'
GROUP BY p.id, p.name;

Output shows:

GroupAggregate  (cost=125.45..130.12 rows=150 width=520) (actual time=2543.234..2544.012 rows=148 loops=1)
  ->  Hash Join  (cost=12.50..100.00 rows=500 width=512) (actual time=0.234..2500.123 rows=500000 loops=1)
        Hash Cond: (te.project_id = p.id)
        ->  Seq Scan on time_entries te  (cost=0.00..50000.00 rows=10000000 width=16) (actual time=0.012..1500.456 rows=10000000 loops=1)
        ->  Hash  (cost=10.00..10.00 rows=150 width=512) (actual time=0.123..0.123 rows=148 loops=1)
              ->  Seq Scan on projects p  (cost=0.00..10.00 rows=150 width=512) (actual time=0.001..0.050 rows=148 loops=1)
                    Filter: (status = 'active'::text)
Planning Time: 0.234 ms
Execution Time: 2544.123 ms

Key insights:

  • Seq Scan: Full table scan (bad for large tables)
  • actual time: Real execution time
  • rows: Estimated vs actual rows
  • loops: How many times operation ran

The sequential scan on 10M time_entries is killing performance.
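The fix is an index on the join column, which the indexing sections below cover in more detail. A minimal sketch (whether the planner actually switches from the hash join to index scans depends on its row estimates):

-- Index the join column so the planner isn't forced to scan all 10M rows
CREATE INDEX idx_time_entries_project_id ON time_entries(project_id);

-- Re-check the plan
EXPLAIN ANALYZE
SELECT p.id, p.name, COUNT(te.id) as entry_count
FROM projects p
LEFT JOIN time_entries te ON te.project_id = p.id
WHERE p.status = 'active'
GROUP BY p.id, p.name;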

Indexing: The Most Important Optimization

Indexes are to databases what a table of contents is to a book.

Basic Single-Column Index

-- Problem: Slow filtering by status
SELECT * FROM projects WHERE status = 'active';
 
-- Solution: Index on status
CREATE INDEX idx_projects_status ON projects(status);
 
-- Result: 2500ms -> 15ms

Composite Indexes

Order matters! Most selective column first:

-- Query filtering by tenant and status
SELECT * FROM projects
WHERE tenant_id = '123' AND status = 'active';
 
-- Good: tenant_id first (more selective)
CREATE INDEX idx_projects_tenant_status ON projects(tenant_id, status);
 
-- Bad: status first (less selective)
CREATE INDEX idx_projects_status_tenant ON projects(status, tenant_id);

With the good index: 0.5ms. With the bad index: 50ms. The tenant-first index also serves queries that filter on tenant_id alone, which the status-first index cannot.

Covering Indexes

Include all columns needed by query:

-- Query needs id, name, status
SELECT id, name, status FROM projects
WHERE tenant_id = '123' AND status = 'active';
 
-- Covering index - no table lookup needed
CREATE INDEX idx_projects_covering ON projects(tenant_id, status) INCLUDE (id, name);
 
-- Result: 50ms -> 2ms

PostgreSQL can answer the query from the index alone (an index-only scan) without touching the table.
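You can confirm this in the plan; a quick check (a high "Heap Fetches" count means the visibility map is stale and the index-only scan is still visiting the table until VACUUM runs):

-- Look for "Index Only Scan" and a low "Heap Fetches" count
EXPLAIN ANALYZE
SELECT id, name, status FROM projects
WHERE tenant_id = '123' AND status = 'active';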

Partial Indexes

Index only relevant rows:

-- Only query active projects
SELECT * FROM projects WHERE status = 'active';
 
-- Partial index - smaller, faster
CREATE INDEX idx_projects_active ON projects(tenant_id)
WHERE status = 'active';
 
-- Index is 80% smaller, queries are faster

Index on Computed Columns

-- Query on lowercase email
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
 
-- Index on expression
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
 
-- Now query uses index

Query Optimization Techniques

Avoid SELECT *

Select only needed columns:

-- Bad: Fetches all columns (wasteful)
SELECT * FROM projects;
 
-- Good: Fetch only what you need
SELECT id, name, status FROM projects;
 
-- Result: 500ms -> 100ms (5x faster)

With 50 columns and millions of rows, this matters.

Use LIMIT

Always limit results:

-- Bad: Returns all 10 million records
SELECT * FROM time_entries ORDER BY created_at DESC;
 
-- Good: Return first page
SELECT * FROM time_entries ORDER BY created_at DESC LIMIT 20 OFFSET 0;
 
-- For pagination, use cursor-based:
SELECT * FROM time_entries
WHERE created_at < '2024-01-01T00:00:00Z'
ORDER BY created_at DESC
LIMIT 20;

Cursor-based pagination is faster than OFFSET for large datasets because OFFSET still reads and discards every skipped row.
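Ties on created_at can make a single-column cursor skip or repeat rows. A sketch of keyset pagination with an id tie-breaker, assuming an index on (created_at, id) and that the literal values come from the last row of the previous page:

-- Composite index supports the keyset ordering (scanned backward for DESC)
CREATE INDEX idx_time_entries_created_id ON time_entries(created_at, id);

-- Fetch the page after the last row already shown
SELECT * FROM time_entries
WHERE (created_at, id) < ('2024-01-01T00:00:00Z', '00000000-0000-0000-0000-000000000000')  -- placeholder: last row's values
ORDER BY created_at DESC, id DESC
LIMIT 20;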

Avoid N+1 Queries

Classic ORM problem:

// Bad: N+1 queries (1 + 100 queries)
const projectRows = await db.select().from(projects);
 
for (const project of projectRows) {
  // one extra round trip per project; destructure the single-row result
  const [client] = await db.select().from(clients).where(eq(clients.id, project.clientId));
  project.client = client;
}
 
// Good: Single query with join
const projectsWithClients = await db
  .select({
    project: projects,
    client: clients,
  })
  .from(projects)
  .leftJoin(clients, eq(projects.clientId, clients.id));

Result: 100 queries -> 1 query. 5000ms -> 50ms.

Batch Operations

Insert/update in batches:

-- Bad: 1000 separate inserts
INSERT INTO time_entries (project_id, hours) VALUES ('123', 8);
INSERT INTO time_entries (project_id, hours) VALUES ('124', 6);
-- ... 998 more
 
-- Good: Single batch insert
INSERT INTO time_entries (project_id, hours)
VALUES
  ('123', 8),
  ('124', 6),
  ('125', 7),
  -- ... all 1000 rows
;
 
-- Result: 30 seconds -> 0.5 seconds

With ORMs:

// Drizzle batch insert
await db.insert(timeEntries).values(
  entries.map(e => ({
    projectId: e.projectId,
    hours: e.hours,
  }))
);

Avoid Subqueries When Possible

-- Bad: Correlated subquery (runs for each row)
SELECT p.id, p.name,
  (SELECT COUNT(*) FROM time_entries WHERE project_id = p.id) as entry_count
FROM projects p;
 
-- Good: Join and aggregate
SELECT p.id, p.name, COUNT(te.id) as entry_count
FROM projects p
LEFT JOIN time_entries te ON te.project_id = p.id
GROUP BY p.id, p.name;
 
-- Result: 8000ms -> 200ms

JOIN Optimization

Choose the Right JOIN Type

-- INNER JOIN: Only matching rows (fastest)
SELECT * FROM projects p
INNER JOIN clients c ON p.client_id = c.id;
 
-- LEFT JOIN: All left rows + matching right rows (often slower)
SELECT * FROM projects p
LEFT JOIN clients c ON p.client_id = c.id;
 
-- Use INNER JOIN when you don't need the unmatched rows

Join Order Matters

Smaller table first:

-- Bad: 10M rows joined with 1K rows
SELECT * FROM time_entries te
LEFT JOIN projects p ON te.project_id = p.id;
 
-- Good: 1K rows joined with 10M rows
SELECT * FROM projects p
LEFT JOIN time_entries te ON p.id = te.project_id;

Modern query planners reorder inner joins automatically, so the written order matters less than it used to. With LEFT JOIN, the order also determines which side's rows are preserved, so make sure the driving table matches the result you actually want.

Index Foreign Keys

Always index columns used in joins:

CREATE INDEX idx_time_entries_project_id ON time_entries(project_id);
CREATE INDEX idx_time_entries_user_id ON time_entries(user_id);

Without these indexes, joins do full table scans.
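A rough way to spot foreign keys that are missing an index, sketched against the system catalogs (it only checks whether the FK column leads some index, so treat the output as a starting point):

-- Foreign-key columns that are not the leading column of any index
SELECT
  c.conrelid::regclass AS table_name,
  a.attname            AS fk_column
FROM pg_constraint c
JOIN pg_attribute a
  ON a.attrelid = c.conrelid
 AND a.attnum   = ANY (c.conkey)
WHERE c.contype = 'f'
  AND NOT EXISTS (
    SELECT 1
    FROM pg_index i
    WHERE i.indrelid = c.conrelid
      AND i.indkey[0] = a.attnum
  );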

Aggregation Optimization

Use Indexed Columns in GROUP BY

-- Query
SELECT client_id, COUNT(*) FROM projects GROUP BY client_id;
 
-- Index helps grouping
CREATE INDEX idx_projects_client_id ON projects(client_id);
 
-- Result: 5000ms -> 50ms

Materialized Views for Complex Aggregations

Pre-compute expensive aggregations:

-- Create materialized view
CREATE MATERIALIZED VIEW project_stats AS
SELECT
  p.id,
  p.name,
  COUNT(DISTINCT te.id) as total_entries,
  SUM(te.hours) as total_hours,
  COUNT(DISTINCT te.user_id) as team_size
FROM projects p
LEFT JOIN time_entries te ON te.project_id = p.id
GROUP BY p.id, p.name;
 
-- REFRESH ... CONCURRENTLY (below) requires a unique index on the view
CREATE UNIQUE INDEX idx_project_stats_id ON project_stats(id);
 
-- Query materialized view (instant)
SELECT * FROM project_stats WHERE id = '123';
 
-- Refresh periodically
REFRESH MATERIALIZED VIEW CONCURRENTLY project_stats;

Complex aggregation: 30s -> 0.01s.
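Refreshes can be scheduled from application code or a job runner. If the pg_cron extension happens to be available (an assumption, it is not part of the setup above), the schedule can live in the database:

-- Refresh every 15 minutes via pg_cron (assumes the extension is installed)
CREATE EXTENSION IF NOT EXISTS pg_cron;
SELECT cron.schedule(
  'refresh-project-stats',
  '*/15 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY project_stats'
);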

Date Range Queries

Use Date Indexes

-- Query by date range
SELECT * FROM time_entries
WHERE entry_date BETWEEN '2024-01-01' AND '2024-12-31';
 
-- Index on date column
CREATE INDEX idx_time_entries_date ON time_entries(entry_date);
 
-- For timestamp queries, use BRIN index for large tables
CREATE INDEX idx_time_entries_created_brin ON time_entries USING BRIN (created_at);

BRIN indexes are tiny and fast for chronological data: they store only min/max summaries per block range, which works well when created_at correlates with the physical order of rows, as it does for append-mostly tables.

Partition by Date

For massive tables, partition by date:

CREATE TABLE time_entries (
  id UUID,
  project_id UUID,
  entry_date DATE,
  hours DECIMAL
) PARTITION BY RANGE (entry_date);
 
CREATE TABLE time_entries_2024 PARTITION OF time_entries
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
 
CREATE TABLE time_entries_2023 PARTITION OF time_entries
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

Queries that filter on entry_date scan only the relevant partitions. 10M row scan -> 1M row scan.
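To confirm pruning is happening, check the plan; only the matching partition should appear. A default partition (not in the schema above) is also worth considering as a catch-all:

-- Only time_entries_2024 should show up in this plan
EXPLAIN
SELECT SUM(hours) FROM time_entries
WHERE entry_date BETWEEN '2024-01-01' AND '2024-03-31';

-- Optional catch-all for rows outside the defined ranges
CREATE TABLE time_entries_default PARTITION OF time_entries DEFAULT;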

Full-Text Search

Use PostgreSQL's Built-in FTS

-- Add tsvector column
ALTER TABLE projects ADD COLUMN search_vector tsvector;
 
-- Generate search vector (coalesce keeps a NULL description from nulling the whole vector)
UPDATE projects SET search_vector = to_tsvector('english', name || ' ' || coalesce(description, ''));
 
-- Index it
CREATE INDEX idx_projects_search ON projects USING GIN(search_vector);
 
-- Query
SELECT * FROM projects
WHERE search_vector @@ to_tsquery('english', 'website & redesign');
 
-- Automatic updates
CREATE TRIGGER projects_search_update
BEFORE INSERT OR UPDATE ON projects
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(search_vector, 'pg_catalog.english', name, description);

Full-text search: 5000ms -> 20ms.
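On PostgreSQL 12 or newer, a generated column is a simpler alternative to the trigger above (a sketch, assuming it replaces the plain search_vector column rather than being added alongside it):

-- Instead of the ALTER TABLE + trigger combination above
ALTER TABLE projects
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', name || ' ' || coalesce(description, ''))
  ) STORED;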

Connection Pooling

Don't create new connections for each query:

// Bad: New connection per query
import { Client } from 'pg';
 
export async function query(sql: string) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  const result = await client.query(sql);
  await client.end();
  return result;
}
 
// Good: Connection pool
import { Pool } from 'pg';
 
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20, // max connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});
 
export async function query(sql: string) {
  return await pool.query(sql);
}

With Drizzle:

import { drizzle } from 'drizzle-orm/node-postgres';
import { Pool } from 'pg';
 
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 20 });
const db = drizzle(pool);

Analyze and Vacuum

PostgreSQL needs maintenance:

-- Update statistics for query planner
ANALYZE projects;
ANALYZE time_entries;
 
-- Or all tables
ANALYZE;
 
-- Clean up dead rows
VACUUM time_entries;
 
-- Full vacuum (locks table)
VACUUM FULL time_entries;
 
-- Autovacuum is on by default; make sure it hasn't been disabled for the table
ALTER TABLE time_entries SET (autovacuum_enabled = true);

We run ANALYZE after bulk imports. Without it, the query planner makes bad decisions.

Monitoring and Profiling

Find Slow Queries

-- Enable slow query log (postgresql.conf)
log_min_duration_statement = 1000  # Log queries > 1s
 
-- Or use pg_stat_statements extension
CREATE EXTENSION pg_stat_statements;
 
-- Find slowest queries
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

This showed us that a single report query was taking 80% of database time.
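To see how much of the total load a single query accounts for, a windowed version of the same view works (a sketch; the *_exec_time columns are the PostgreSQL 13+ names):

SELECT
  query,
  calls,
  round(total_exec_time::numeric, 1) AS total_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_of_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;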

Monitor Index Usage

-- Find unused indexes (never drop ones backing PRIMARY KEY or UNIQUE constraints)
SELECT
  schemaname,
  relname,
  indexrelname,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

We found 15 unused indexes consuming 2GB. Dropped them.
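When removing them, CONCURRENTLY avoids blocking writes; using the redundant index from earlier as an example:

-- Cannot run inside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS idx_projects_status_tenant;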

Results

After applying these optimizations to Catalyst PSA:

  • Dashboard load: 30s -> 1.5s (20x faster)
  • Report generation: 45s -> 3s (15x faster)
  • API average response time: 500ms -> 50ms (10x faster)
  • Database size: Reduced 20% by removing unused indexes
  • Query throughput: 500 qps -> 2000 qps

Lessons Learned

  1. Index everything used in WHERE/JOIN: Most important optimization
  2. EXPLAIN is your best friend: Always check execution plans
  3. Avoid SELECT *: Select only needed columns
  4. Batch operations: Never insert/update one row at a time
  5. Monitor in production: Find slow queries with logging
  6. Partition large tables: Especially time-series data
  7. Connection pooling is essential: Never create connections per query

Conclusion

Database optimization is about understanding how databases work: indexes, query plans, joins, and aggregations. With the right indexes and query patterns, even databases with hundreds of millions of rows can be fast.

After 27 years of database work, I've learned that most performance problems come from missing indexes or N+1 queries. Fix those, and you're 90% of the way there.

Jason Cochran

Software Engineer | Cloud Consultant | Founder at Strataga

27 years of experience building enterprise software for oil & gas operators and startups. Specializing in SCADA systems, field data solutions, and AI-powered rapid development. Based in Midland, TX serving the Permian Basin.