Refactoring is misunderstood so many times, mostly by developers living by the motto “If it’s not broken don’t fix it”. But in most cases this is invalid argument and leads to keeping inefficient and difficult to maintain solutions around. With tools and languages evolving, as well as knowledge about specific technology, there are so many opportunities to improve existing code in a way that will make a huge difference in many aspects. As Sir Winston Churchill said “To improve is to change; to be perfect is to change often”.
One example of refactoring that can lead to better code is the problem of finding ranges of existing values in SQL. A range (or also known as “island”) refers to finding and grouping consecutive values (dates, numbers, etc.) to a row representing the start point and the end point of the range. There are many variations of this problem but in essence the logic always drills down to finding a grouping factor to collapse the values to a range.
Let’s look at one example to illustrate the refactoring process. Given Sales table containing sale transactions our goal is to find:
1). Consecutive ranges of transaction numbers (transaction numbers that do not have gaps)
2). Consecutive ranges of sale dates (sale dates with no missing dates in the range)
These results can be used for variety of reporting and data analysis purposes. Here is how the Sales table looks:
CREATE TABLE Sales (
transaction_nbr INT NOT NULL PRIMARY KEY,
sale_date DATE NOT NULL,
amount DECIMAL(18, 2) NOT NULL);
INSERT INTO Sales VALUES
(1, '20091201', 100.00),
(3, '20091202', 15.00),
(4, '20091203', 102.50),
(5, '20091204', 110.00),
(6, '20091207', 98.25),
(9, '20091208', 20.00),
(11, '20091209', 160.00),
(12, '20091211', 250.00);
SELECT transaction_nbr, sale_date, amount
FROM Sales;
/*
transaction_nbr sale_date amount
--------------- ---------- ----------
1 2009-12-01 100.00
3 2009-12-02 15.00
4 2009-12-03 102.50
5 2009-12-04 110.00
6 2009-12-07 98.25
9 2009-12-08 20.00
11 2009-12-09 160.00
12 2009-12-11 250.00
*/
The requirement is to return the following two result sets:
/*
Ranges by transaction_nbr:
range_start range_end
----------- -----------
1 1
3 6
9 9
11 12
Ranges by sale_date:
range_start range_end
----------- ----------
2009-12-01 2009-12-04
2009-12-07 2009-12-09
2009-12-11 2009-12-11
*/
On versions prior to SQL Server 2005, a very common solution is to find the grouping factor for the ranges by using a subquery to find the minimum value in the range and then group all consecutive values in the range based on that. For transaction numbers the query to find grouping factor looks like this:
SELECT transaction_nbr,
(SELECT MIN(B.transaction_nbr)
FROM Sales AS B
WHERE B.transaction_nbr >= A.transaction_nbr
AND NOT EXISTS
(SELECT *
FROM Sales AS C
WHERE C.transaction_nbr = B.transaction_nbr + 1)) AS grp
FROM Sales AS A;
Here is the result set:
/*
transaction_nbr grp
--------------- -----------
1 1
3 6
4 6
5 6
6 6
9 9
11 12
12 12
*/
It is easy to see the groups created by the subquery. Now the task to finalize ranges is very trivial, simply grouping by the grouping factor and retrieving the MIN and MAX values in the range:
SELECT MIN(transaction_nbr) AS range_start,
MAX(transaction_nbr) AS range_end
FROM (
SELECT transaction_nbr,
(SELECT MIN(B.transaction_nbr)
FROM Sales AS B
WHERE B.transaction_nbr >= A.transaction_nbr
AND NOT EXISTS
(SELECT *
FROM Sales AS C
WHERE C.transaction_nbr = B.transaction_nbr + 1)) AS grp
FROM Sales AS A) AS T
GROUP BY grp;
This query satisfies the first task to find ranges by transaction number:
/*
range_start range_end
----------- -----------
1 1
3 6
9 9
11 12
*/
The logic for finding ranges by sale date is very similar. The only difference is that the subquery to find the grouping factor uses the date/time functions in SQL Server to check if the dates are consecutive. Here is the query solving the second task and the corresponding result set:
SELECT MIN(sale_date) AS range_start,
MAX(sale_date) AS range_end
FROM (
SELECT sale_date,
(SELECT MIN(B.sale_date)
FROM Sales AS B
WHERE B.sale_date >= A.sale_date
AND NOT EXISTS
(SELECT *
FROM Sales AS C
WHERE C.sale_date = DATEADD(DAY, 1, B.sale_date))) AS grp
FROM Sales AS A) AS T
GROUP BY grp;
/*
range_start range_end
----------- ----------
2009-12-01 2009-12-04
2009-12-07 2009-12-09
2009-12-11 2009-12-11
*/
While in this example the query does not look so difficult, with more complex scenarios it can become very difficult to understand, and performance will not be great.
SQL Server 2005 introduced the ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE). This provides a new tool to solve out problem in more simplified and efficient manner. We can use a very simple math to find the grouping factor. Take a look at the following query and the results:
SELECT transaction_nbr,
ROW_NUMBER() OVER(ORDER BY transaction_nbr) AS rk,
transaction_nbr - ROW_NUMBER() OVER(ORDER BY transaction_nbr) AS grp
FROM Sales;
/*
transaction_nbr rk grp
--------------- -------------------- --------------------
1 1 0
3 2 1
4 3 1
5 4 1
6 5 1
9 6 3
11 7 4
12 8 4
*/
The query simply generates rank by transaction number, and defines expression subtracting the rank from the transaction number. Observing the transaction number and the rank columns it is easy to see that transaction numbers increase with 1 when there are no gaps, while ranks always increase with 1. Subtracting sequentially increasing numbers from set of numbers without gaps results in constant number (as both sequences increase with 1). When the set of numbers has gaps the subtraction results in different number. This is the base to define the grouping factor for our ranges.
Here is the final query to solve the first task to find ranges by transaction number:
SELECT MIN(transaction_nbr) AS range_start,
MAX(transaction_nbr) AS range_end
FROM (
SELECT transaction_nbr,
transaction_nbr - ROW_NUMBER() OVER(ORDER BY transaction_nbr) AS grp
FROM Sales AS A) AS T
GROUP BY grp;
/*
range_start range_end
----------- -----------
1 1
3 6
9 9
11 12
*/
Utilizing the ranking functions allows using new algorithm which results in simplified and better performing solution.
The solution for finding date ranges is very similar. Here are two versions with minor differences. The first version uses the difference between a fixed date (January 1, 2000) and the sale date value. This difference will generate sequential numeric values when there are no gaps between dates and will skip numbers when gaps exist.
-- Using number as grouping factor
SELECT MIN(sale_date) AS range_start,
MAX(sale_date) AS range_end
FROM (
SELECT sale_date,
DATEDIFF(DAY, '20000101', sale_date) -
ROW_NUMBER() OVER(ORDER BY sale_date) AS grp
FROM Sales AS A) AS T
GROUP BY grp;
/*
range_start range_end
----------- ----------
2009-12-01 2009-12-04
2009-12-07 2009-12-09
2009-12-11 2009-12-11
*/
The second version subtracts days (represented by rank based on sale date) from the sale date. This in effect generates constant date when sale dates are in sequence.
-- Using date as grouping factor
SELECT MIN(sale_date) AS range_start,
MAX(sale_date) AS range_end
FROM (
SELECT sale_date,
DATEADD(DAY, -1 * ROW_NUMBER() OVER(ORDER BY sale_date), sale_date) AS grp
FROM Sales AS A) AS T
GROUP BY grp;
The same technique can be used in many different scenarios. In more complex cases (like when partitioning by a column is required) there may be a need for multiple ranking functions. Here is one example:
Grouping with ROW_NUMBER
http://www.tangrainc.com/blog/2008/03/grouping-with-row_number/