Outer Joins
Outer joins are one of the most widely misunderstood table operators. As useful as they are, it is very easy to incorrectly apply predicates in outer join queries and introduce logical bugs that are very difficult to troubleshoot (or spend hours trying to figure out why a particular query does not produce the desired results). This article demonstrates common misunderstanding about outer joins and how to properly use them.
First, let’s start with explanation of the logical processing of a query (listing here only the steps related to outer join operators, for full explanation of logical query processing read Anatomy of a Query).
1). FROM/CROSS JOIN: The FROM clause is processed to identify source tables; a cross join (Cartesian product) is formed between the two tables involved in the table operator.
2). ON: The ON clause predicates are applied and only rows that satisfy the predicate(s) (for which the predicates evaluate to TRUE) are included in the temporary result set.
3). ADD OUTER ROWS: The left side table in LEFT OUTER JOIN and right side table in RIGHT OUTER JOIN (or both tables in FULL OUTER JOIN) are the preserved tables. That means all rows (and selected attribute values) from this table(s) are present in the result set after the outer join operator is applied. At this phase the non-matching rows from the preserved table(s) are added back (non-matched based on the predicates in the ON clause). The attributes for the non-matched rows from the non-preserved table are added as NULLs.
4). WHERE: The predicates in the WHERE clause are applied and only rows for which the predicates evaluate to TRUE are included in the temporary result set.
The process repeats for any other table operators in the FROM clause taking the temporary result set from the prior table operator as left input.
The key concept here is to understand well phases 2, 3, and 4, and how filtering affects the result set. To illustrate this with example, let’s create two tables with loans and customers. Each customer can have personal or business loans (or both), or no loans at all.
CREATE TABLE Loans (
loan_nbr INT NOT NULL,
customer_nbr INT NOT NULL,
loan_date DATETIME NOT NULL,
loan_amount DECIMAL(15, 2) NOT NULL,
loan_type CHAR(1) NOT NULL,
CONSTRAINT ck_loan_type
CHECK (loan_type IN ('P', 'B')), -- P=Personal; B=Business
CONSTRAINT pk_loans
PRIMARY KEY (loan_nbr));
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(1, 1, '20080101', 1500.00, 'P');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(2, 2, '20080215', 1000.00, 'P');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(3, 1, '20080311', 5000.00, 'B');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(4, 3, '20080312', 2000.00, 'P');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(5, 4, '20080325', 1200.00, 'P');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(6, 3, '20080327', 4000.00, 'B');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(7, 5, '20080410', 3500.00, 'B');
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
VALUES(8, 2, '20080412', 2000.00, 'P');
CREATE TABLE Customers (
customer_nbr INT NOT NULL,
customer_name VARCHAR(35),
PRIMARY KEY (customer_nbr));
INSERT INTO Customers (customer_nbr, customer_name)
VALUES(1, 'Peter Brown');
INSERT INTO Customers (customer_nbr, customer_name)
VALUES(2, 'Jeff Davis');
INSERT INTO Customers (customer_nbr, customer_name)
VALUES(3, 'Kevin Fox');
INSERT INTO Customers (customer_nbr, customer_name)
VALUES(4, 'Donna Louis');
INSERT INTO Customers (customer_nbr, customer_name)
VALUES(5, 'Greg Barrow');
INSERT INTO Customers (customer_nbr, customer_name)
VALUES(6, 'Richard Douglas');
-- Add foreign key for Loans
ALTER TABLE Loans
ADD CONSTRAINT fk_loans_customers
FOREIGN KEY (customer_nbr)
REFERENCES Customers(customer_nbr);
CASE #1: Predicates on the non-preserved table attributes
The first request is to retrieve list of all customers and only the total personal loan amount (if any) they have. The following query seems to describe the request properly:
SELECT C.customer_name,
SUM(COALESCE(L.loan_amount, 0)) AS total
FROM Customers AS C
LEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr
WHERE L.loan_type = 'P'
GROUP BY C.customer_name
ORDER BY customer_name;
However, the results do not seem correct:
customer_name total
---------------- --------
Donna Louis 1200.00
Jeff Davis 3000.00
Kevin Fox 2000.00
Peter Brown 1500.00
Customers Greg and Richard are missing from the output. To understand the problems let’s run this step by step:
1). Cross join:
SELECT C.customer_name, L.loan_amount, L.loan_type
FROM Customers AS C
CROSS JOIN Loans AS L;
This returns Cartesian product (every possible combination of a row from the Customers table and a row from the Loans table).
2). ON predicates:
SELECT C.customer_name, L.loan_amount, L.loan_type
FROM Customers AS C
INNER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr;
This results in the following output:
customer_name loan_amount loan_type
--------------- ------------ ---------
Peter Brown 1500.00 P
Jeff Davis 1000.00 P
Peter Brown 5000.00 B
Kevin Fox 2000.00 P
Donna Louis 1200.00 P
Kevin Fox 4000.00 B
Greg Barrow 3500.00 B
Jeff Davis 2000.00 P
At this stage only rows that match based on the predicate for customer match are included.
3). Add outer rows:
SELECT C.customer_name, L.loan_amount, L.loan_type
FROM Customers AS C
LEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr;
The query returns this result set:
customer_name loan_amount loan_type
--------------- ------------- ---------
Peter Brown 1500.00 P
Peter Brown 5000.00 B
Jeff Davis 1000.00 P
Jeff Davis 2000.00 P
Kevin Fox 2000.00 P
Kevin Fox 4000.00 B
Donna Louis 1200.00 P
Greg Barrow 3500.00 B
Richard Douglas NULL NULL
Here the outer rows are added, resulting in adding back a row for customer Richard, who has no loans at all and was excluded in the prior phase when the ON clause predicates were applied.
4). WHERE predicates:
SELECT C.customer_name, L.loan_amount, L.loan_type
FROM Customers AS C
LEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr
WHERE L.loan_type = 'P';
This result set is the base for the results in the first query attempt:
customer_name loan_amount loan_type
--------------- ------------ ---------
Peter Brown 1500.00 P
Jeff Davis 1000.00 P
Kevin Fox 2000.00 P
Donna Louis 1200.00 P
Jeff Davis 2000.00 P
Now it is very clear that the predicate in the WHERE clause filters the NULL for loan type for customer Richard, and customer Greg is excluded because he has only business loans. Incorrectly placing the predicate in the WHERE clause turns the outer join to inner join.
To correct this query it only requires to move the predicate on loan type from the WHERE clause to the ON clause:
SELECT C.customer_name,
SUM(COALESCE(L.loan_amount, 0)) AS total
FROM Customers AS C
LEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr
AND L.loan_type = 'P'
GROUP BY C.customer_name
ORDER BY customer_name;
This results in correct result set listing all customers and the total personal loan amount.
customer_name total
---------------- --------
Donna Louis 1200.00
Greg Barrow 0.00
Jeff Davis 3000.00
Kevin Fox 2000.00
Peter Brown 1500.00
Richard Douglas 0.00
CASE #2: Predicates on the preserved table attributes
Let’s look at another example demonstrating how filtering affect the preserved table attributes. The request is to retrieve the total loan amount for customer Richard Douglas, even if Richard does not have any loans at all. Since it is required to return always Richard’s account info, the following query seems to satisfy the request:
SELECT C.customer_name,
SUM(COALESCE(L.loan_amount, 0)) AS total
FROM Customers AS C
LEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr
AND C.customer_name = 'Richard Douglas'
GROUP BY C.customer_name
ORDER BY customer_name;
However, the result set returns information for all customers:
customer_name total
--------------- -----
Donna Louis 0.00
Greg Barrow 0.00
Jeff Davis 0.00
Kevin Fox 0.00
Peter Brown 0.00
Richard Douglas 0.00
As explained earlier, the reason for that is that outer rows are added back after the outer join predicates are applied, resulting in adding back all other customers. To get only Richard’s loan information requires moving the predicate filtering on customer name to the WHERE clause:
SELECT C.customer_name,
SUM(COALESCE(L.loan_amount, 0)) AS total
FROM Customers AS C
LEFT OUTER JOIN Loans AS L
ON C.customer_nbr = L.customer_nbr
WHERE C.customer_name = 'Richard Douglas'
GROUP BY C.customer_name
ORDER BY customer_name;
This results in correctly returning only Richard’s loan information:
customer_name total
---------------- -----
Richard Douglas 0.00