Joins and Relational Thinking

  • ID: DBSQL-F-L04
  • Type: Lesson
  • Audience: Public
  • Theme: Keys, joins, and avoiding duplication

Joins reconstruct meaning across tables.

Most SQL mistakes in analytics come from joins that change row counts in unexpected ways.

Relational thinking means understanding:


Setup


INNER JOIN

INNER JOIN keeps rows that match in both tables.

Orders with customer names:

query = '''
SELECT o.order_id,
       c.full_name,
       o.order_date,
       o.status
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
ORDER BY o.order_date;
'''

df = pd.read_sql(query, con)
print(df)
  order_id      full_name  order_date     status
0    O3001   Amina Hassan  2026-01-15       paid
1    O3002   Amina Hassan  2026-02-02       paid
2    O3003  Joseph Kimaro  2026-02-05       paid
3    O3004    Neema Msuya  2026-02-11  cancelled
4    O3005  Grace Nyerere  2026-02-14       paid

LEFT JOIN

LEFT JOIN keeps all rows from the left table, even when no match exists on the right.

Customers and their orders:

query = '''
SELECT c.full_name,
       o.order_id,
       o.order_date,
       o.status
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.full_name, o.order_date;
'''

df = pd.read_sql(query, con)
print(df)
       full_name order_id  order_date     status
0   Amina Hassan    O3001  2026-01-15       paid
1   Amina Hassan    O3002  2026-02-02       paid
2   David Okello      NaN         NaN        NaN
3  Grace Nyerere    O3005  2026-02-14       paid
4  Joseph Kimaro    O3003  2026-02-05       paid
5    Neema Msuya    O3004  2026-02-11  cancelled

Cardinality and Duplication

One-to-many relationships are common:

  • one customer → many orders
  • one order → many order_items

Duplication is expected when you join across one-to-many relationships.

The error happens when duplication is interpreted as more customers, more orders, or more revenue without verifying row counts.


Verify Row Counts

orders_n = pd.read_sql("SELECT COUNT(*) AS n_orders FROM orders;", con)
items_n  = pd.read_sql("SELECT COUNT(*) AS n_items FROM order_items;", con)

print(orders_n)
print(items_n)
   n_orders
0         5
   n_items
0        8

Row count changes after a join are not automatically wrong.

They must match the structure you expect.