Defensive Database Programming

Defensive programming is a topic developers tend to avoid. Rushing to deliver features is almost always prioritized over spending time making code robust, especially in the database world, where the concept is widely misunderstood. This is why it was a real pleasure to read Alex Kuznetsov's book "Defensive Database Programming with SQL Server".

Alex deals with this "inconvenient" topic using a very practical approach. Instead of filling pages with theory and reasons why defensive programming is good, he dives right into simple examples from the daily work of every database professional. We have all seen (and ignored) many of these issues, but demonstrating how they affect our code, and how simple they are to avoid, is what makes this book shine. It is not a complete catalog of defensive techniques but rather a good collection of examples that illustrate the need for defensive coding and the applicable methods. It builds the mindset to think proactively and create robust solutions.

The book covers the following topics: basic defensive techniques, code vulnerabilities, changes to database objects, upgrades, reusing code, data integrity and constraints, error handling, and concurrency.
 
In many ways this book reminds me of Donald Norman's classic The Design of Everyday Things, which in a similar manner demonstrates how defensive design can prevent human errors.

In conclusion, "Defensive Database Programming with SQL Server" is a wonderful addition to the library of every database professional. It should be required reading for all SQL practitioners.

Refactoring Entity-Attribute-Value Design

Entity-Attribute-Value (often referred to as EAV) table design is one of the "wonderful" things that newcomers to the SQL world discover very quickly! Traditional procedural programmers have a very hard time understanding why tables have a fixed number of columns, and why adding a new column requires a table change and possibly a code change. Their great problem-solving skills from the procedural world help them find a "natural" solution: the EAV design, where a pair of a key (describing the attribute) and a value allows data to be stored with complete flexibility.
    An example is storing all employee information in three columns: a key column to identify the employee (entity), an attribute key column to identify the attribute stored (name, birth date, SSN, date of hire, etc.), and a value column to store the actual value.
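For illustration, here is a minimal sketch of that employee table (hypothetical, my addition; the loans example refactored below has the same shape):

-- Hypothetical EAV storage of employee data: one row per attribute,
-- everything stored as text in the value column.
CREATE TABLE EAV_Employees (
  employee_nbr INT NOT NULL,
  attribute VARCHAR(30) NOT NULL,
  value VARCHAR(200) NULL,
  CONSTRAINT pk_eav_employees
    PRIMARY KEY (employee_nbr, attribute));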

More on EAV…
Tony Andrews has an excellent article explaining the problems with EAV design: OTLT and EAV: the two big design mistakes all beginners make. It is worth noting that in some cases, where many attributes must be stored and they change very often (new attributes added, existing attributes deleted), and there is no need for referential or domain integrity or for query logic based on these attributes, EAV may be a valid design choice.
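As a hedged sketch of that narrow valid case (a hypothetical table, my illustration): free-form per-user preferences, where attributes come and go and nothing joins against or validates them.

-- Hypothetical acceptable EAV use: ad-hoc user preferences with no
-- referential or domain integrity and no attribute-based query logic.
CREATE TABLE UserPreferences (
  user_id INT NOT NULL,
  pref_name VARCHAR(100) NOT NULL,
  pref_value VARCHAR(200) NULL,
  CONSTRAINT pk_user_preferences
    PRIMARY KEY (user_id, pref_name));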

Normally the value column has to be a character data type of generous length, like VARCHAR(200), so data has to be cast to text when saved and then cast back to the original data type when retrieved.
    Once all the coding is done comes the realization of how bad this approach is… And sometimes you simply walk into a project where EAV has already been deployed and it is now your job to fix it. The typical signs of an EAV implementation are very complex (and slow) queries to satisfy relatively trivial user requests for data, and a lack of data integrity.

How do you fix EAV design?
Here is one example of EAV design and the refactoring process to correct it. In our case, information for bank loans is stored in a table with an EAV design. Listing 1 below shows the scripts to create the table and insert sample data.

Listing 1
-- EAV table
CREATE TABLE EAV_Loans (
  loan_nbr INT NOT NULL,
  customer_nbr INT NOT NULL,
  code VARCHAR(30) NOT NULL,
  value VARCHAR(200),
  CONSTRAINT pk_eav_loans
    PRIMARY KEY (loan_nbr, customer_nbr, code));

INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'date', '20080110');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'amount', '1500.00');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'type', 'personal');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'date', '20080215');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'amount', '3500.00');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'type', 'personal');

SELECT loan_nbr, customer_nbr, code, value
FROM EAV_Loans;

/*

loan_nbr    customer_nbr code      value
----------- ------------ --------- ----------
1           1            amount    1500.00
1           1            date      20080110
1           1            type      personal
2           2            amount    3500.00
2           2            date      20080215
2           2            type      personal

*/

Here the pair of loan number and customer number identifies the entity. The code column stores the attribute (amount, date, and type of loan) and the value column holds the actual value.
    Let's quickly demonstrate the problems this design causes. We will write a query to retrieve customers with personal loans over 1,000 for the period January 1, 2008 through January 31, 2008. Here is a first attempt to satisfy the request:

Listing 2
SELECT A.loan_nbr,
       A.customer_nbr,
       CAST(A.value AS DATETIME) AS loan_date,
       CAST(B.value AS DECIMAL(15, 2)) AS loan_amount
FROM EAV_Loans AS A
INNER JOIN EAV_Loans AS B
   ON A.loan_nbr = B.loan_nbr
  AND A.customer_nbr = B.customer_nbr
INNER JOIN EAV_Loans AS C
   ON A.loan_nbr = C.loan_nbr
  AND A.customer_nbr = C.customer_nbr
WHERE A.code = 'date'
  AND CAST(A.value AS DATETIME) >= '20080101'
  AND CAST(A.value AS DATETIME) < '20080201'
  AND B.code = 'amount'
  AND CAST(B.value AS DECIMAL(15, 2)) > 1000.00
  AND C.code = 'type'
  AND C.value = 'personal';

One of the first issues to notice is the multiple joins needed to retrieve each attribute. Because each attribute is stored in a separate row, we have to piece the customer loan information back together. Aside from that, the code seems like it should be OK… except for one little problem! The nature of SQL is such that nothing guarantees the predicates in the WHERE clause will be evaluated in the order listed. They are expanded in the query plan, and the cast to DATETIME or DECIMAL can be applied to a loan type value, resulting in a conversion error. I described this in more detail in my post Predicates in SQL.
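To see the failure in isolation, here is a minimal sketch (my illustration, using the table from Listing 1): even though the query filters on code = 'date', the optimizer is free to evaluate the CAST against rows where value holds a loan type or an amount.

-- Minimal repro sketch: nothing guarantees the code = 'date' filter is
-- applied before the CAST, so the cast may be attempted on values like
-- 'personal' and fail with a conversion error.
SELECT loan_nbr, customer_nbr, CAST(value AS DATETIME) AS loan_date
FROM EAV_Loans
WHERE code = 'date'
  AND CAST(value AS DATETIME) >= '20080101';

On SQL Server 2012 and later, TRY_CAST(value AS DATETIME) returns NULL instead of raising an error, which is another way to guard against this.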
    To fix this you may be tempted to try a different approach: using table expressions to retrieve each attribute. Listing 3 shows a version of such a query.

Listing 3
SELECT A.loan_nbr,
       A.customer_nbr,
       loan_date,
       loan_amount
FROM (SELECT loan_nbr, customer_nbr,
             CAST(value AS DATETIME) AS loan_date
      FROM EAV_Loans
      WHERE code = 'date') AS A
INNER JOIN (SELECT loan_nbr, customer_nbr,
                   CAST(value AS DECIMAL(15, 2)) AS loan_amount
            FROM EAV_Loans
            WHERE code = 'amount') AS B
   ON A.loan_nbr = B.loan_nbr
  AND A.customer_nbr = B.customer_nbr
INNER JOIN (SELECT loan_nbr, customer_nbr,
                   value AS loan_type
            FROM EAV_Loans
            WHERE code = 'type') AS C
   ON A.loan_nbr = C.loan_nbr
  AND A.customer_nbr = C.customer_nbr
WHERE loan_date >= '20080101'
  AND loan_date < '20080201'
  AND loan_amount > 1000.00
  AND loan_type = 'personal';

However, this query has the same problem: the table expressions are expanded in the query plan, and nothing guarantees the order in which the predicates are evaluated. One approach that retrieves the requested data reliably is a pivoting technique. The query below illustrates it.

Listing 4
SELECT loan_nbr,
       customer_nbr,
       loan_date,
       loan_amount
FROM (SELECT loan_nbr,
             customer_nbr,
             MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
             MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
             MAX(CASE WHEN code = 'type' THEN value END)
      FROM EAV_Loans
      GROUP BY loan_nbr, customer_nbr
     ) AS L(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
WHERE loan_date >= '20080101'
  AND loan_date < '20080201'
  AND loan_amount > 1000.00
  AND loan_type = 'personal';

Here the CASE expressions guarantee that only the correct values are cast to each specific data type. But then, how efficient is this query for such a trivial request?

The refactoring process
Usually the first step in correcting an EAV design is to create a normalized table for storing the data. In our case the table structure can look like Listing 5 (note that for clarity the loan type is spelled out as a whole word; in a real business scenario it may be sufficient to store only 'P' or 'B', or alternatively to create a table of loan types and reference it via a foreign key constraint).

Listing 5
CREATE TABLE Loans (
  loan_nbr INT NOT NULL,
  customer_nbr INT NOT NULL,
  loan_date DATETIME NOT NULL,
  loan_amount DECIMAL(15, 2) NOT NULL,
  loan_type VARCHAR(10) NOT NULL,
  CONSTRAINT ck_loan_type
    CHECK (loan_type IN ('personal', 'business')),
  CONSTRAINT pk_loans
    PRIMARY KEY (loan_nbr));

Now each attribute is in a separate column with an appropriate data type. This guarantees the integrity of the data, and we can also define constraints to restrict data values (like the CHECK constraint for loan type).
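For completeness, here is a sketch of the lookup-table alternative mentioned above (the LoanTypes table is hypothetical, not part of the original listings):

-- Hypothetical lookup-table alternative to the CHECK constraint.
CREATE TABLE LoanTypes (
  loan_type VARCHAR(10) NOT NULL PRIMARY KEY);

INSERT INTO LoanTypes VALUES ('personal');
INSERT INTO LoanTypes VALUES ('business');

-- In Loans, loan_type would then reference the lookup table:
--   loan_type VARCHAR(10) NOT NULL
--     REFERENCES LoanTypes (loan_type)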
    The next step is to convert and transfer the data from the old format to the new table. We already know that a pivoting technique works, and it can be used here. Here is the query to transfer from EAV to the normalized format.

Listing 6
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
SELECT loan_nbr,
       customer_nbr,
       MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
       MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
       MAX(CASE WHEN code = 'type' THEN value END)
FROM EAV_Loans
GROUP BY loan_nbr, customer_nbr;
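As a quick sanity check (my addition, not in the original article), selecting from the new table should show one properly typed row per loan:

SELECT loan_nbr, customer_nbr, loan_date, loan_amount, loan_type
FROM Loans;

/*

loan_nbr    customer_nbr loan_date               loan_amount  loan_type
----------- ------------ ----------------------- ------------ ----------
1           1            2008-01-10 00:00:00.000 1500.00      personal
2           2            2008-02-15 00:00:00.000 3500.00      personal

*/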

The last step in the refactoring process is to replace the old EAV table with a view for backward compatibility (if there is code referencing the original table that cannot be changed at this time). There are different methods to accomplish this. For simplicity, here we can unpivot using a UNION of one query per attribute (note that you have to drop the original EAV table first, because the view has the same name).

Listing 7
CREATE VIEW EAV_Loans
(loan_nbr, customer_nbr, code, value)
AS
SELECT loan_nbr, customer_nbr,
       CAST('date' AS VARCHAR(30)),
       CONVERT(VARCHAR(200), loan_date, 112)
FROM Loans
UNION
SELECT loan_nbr, customer_nbr,
       CAST('amount' AS VARCHAR(30)),
       CAST(loan_amount AS VARCHAR(200))
FROM Loans
UNION
SELECT loan_nbr, customer_nbr,
       CAST('type' AS VARCHAR(30)),
       CAST(loan_type AS VARCHAR(200))
FROM Loans;

If any old code references the original EAV table for data modifications, you will have to add INSTEAD OF triggers to the view to handle them.
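Here is a minimal sketch of such a trigger (an assumption on my part: it handles only INSERT, and only when all the attributes of a loan arrive in the same statement; legacy code inserting one attribute per statement would need more elaborate handling):

CREATE TRIGGER trg_EAV_Loans_insert
ON EAV_Loans
INSTEAD OF INSERT
AS
BEGIN
  -- Pivot the incoming EAV rows back into the normalized Loans table.
  INSERT INTO Loans
  (loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
  SELECT loan_nbr,
         customer_nbr,
         MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
         MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
         MAX(CASE WHEN code = 'type' THEN value END)
  FROM inserted
  GROUP BY loan_nbr, customer_nbr;
END;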

Now, let’s answer the same data request to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008, this time using the normalized table.

Listing 8
SELECT loan_nbr, customer_nbr, loan_date, loan_amount
FROM Loans
WHERE loan_date >= '20080101'
  AND loan_date < '20080201'
  AND loan_amount > 1000.00
  AND loan_type = 'personal';

Simple and elegant, just as it should be!
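One closing aside (my suggestion, not part of the original refactoring): the normalized table also makes it straightforward to index for this kind of query. A hypothetical supporting index might look like this:

-- Hypothetical index to support the query in Listing 8 (INCLUDE requires
-- SQL Server 2005+); loan_nbr comes along via the clustered primary key.
CREATE INDEX ix_loans_type_date
ON Loans (loan_type, loan_date)
INCLUDE (customer_nbr, loan_amount);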

Refactoring Ranges

Refactoring is misunderstood time and again, mostly by developers living by the motto "If it's not broken, don't fix it". In most cases this is an invalid argument, and it leads to keeping inefficient, hard-to-maintain solutions around. As tools, languages, and knowledge of a specific technology evolve, there are many opportunities to improve existing code in ways that make a huge difference. As Sir Winston Churchill said, "To improve is to change; to be perfect is to change often".

One example of refactoring that can lead to better code is the problem of finding ranges of existing values in SQL. A range (also known as an "island") refers to finding and grouping consecutive values (dates, numbers, etc.) into a row representing the start point and the end point of the range. There are many variations of this problem, but in essence the logic always drills down to finding a grouping factor that collapses the values into a range.

Let's look at one example to illustrate the refactoring process. Given a Sales table containing sale transactions, our goal is to find:

1) Consecutive ranges of transaction numbers (transaction numbers with no gaps)

2) Consecutive ranges of sale dates (sale dates with no missing dates in the range)

These results can be used for a variety of reporting and data analysis purposes. Here is how the Sales table looks:

CREATE TABLE Sales (
  transaction_nbr INT NOT NULL PRIMARY KEY,
  sale_date DATE NOT NULL,
  amount DECIMAL(18, 2) NOT NULL);

INSERT INTO Sales VALUES
(1, '20091201', 100.00),
(3, '20091202', 15.00),
(4, '20091203', 102.50),
(5, '20091204', 110.00),
(6, '20091207', 98.25),
(9, '20091208', 20.00),
(11, '20091209', 160.00),
(12, '20091211', 250.00);

SELECT transaction_nbr, sale_date, amount
FROM Sales;

/*

transaction_nbr sale_date  amount
--------------- ---------- ----------
1               2009-12-01 100.00
3               2009-12-02 15.00
4               2009-12-03 102.50
5               2009-12-04 110.00
6               2009-12-07 98.25
9               2009-12-08 20.00
11              2009-12-09 160.00
12              2009-12-11 250.00

*/

The requirement is to return the following two result sets:

/*

Ranges by transaction_nbr:

range_start range_end
----------- -----------
1           1
3           6
9           9
11          12

Ranges by sale_date:

range_start range_end
----------- ----------
2009-12-01  2009-12-04
2009-12-07  2009-12-09
2009-12-11  2009-12-11

*/

On versions prior to SQL Server 2005, a very common solution is to compute the grouping factor with a subquery that, for each value, finds the end point of its range (the smallest value at or after it that has no successor), and then to group all consecutive values based on that. For transaction numbers, the query to find the grouping factor looks like this:

SELECT transaction_nbr,
       (SELECT MIN(B.transaction_nbr)
        FROM Sales AS B
        WHERE B.transaction_nbr >= A.transaction_nbr
          AND NOT EXISTS
              (SELECT *
               FROM Sales AS C
               WHERE C.transaction_nbr = B.transaction_nbr + 1)) AS grp
FROM Sales AS A;

Here is the result set:

/*

transaction_nbr grp
--------------- -----------
1               1
3               6
4               6
5               6
6               6
9               9
11              12
12              12

*/

It is easy to see the groups created by the subquery. Finalizing the ranges is now trivial: simply group by the grouping factor and retrieve the MIN and MAX values in each range:

SELECT MIN(transaction_nbr) AS range_start,
       MAX(transaction_nbr) AS range_end
FROM (SELECT transaction_nbr,
             (SELECT MIN(B.transaction_nbr)
              FROM Sales AS B
              WHERE B.transaction_nbr >= A.transaction_nbr
                AND NOT EXISTS
                    (SELECT *
                     FROM Sales AS C
                     WHERE C.transaction_nbr = B.transaction_nbr + 1)) AS grp
      FROM Sales AS A) AS T
GROUP BY grp;

This query satisfies the first task to find ranges by transaction number:

/*

range_start range_end
----------- -----------
1           1
3           6
9           9
11          12

*/

The logic for finding ranges by sale date is very similar. The only difference is that the subquery computing the grouping factor uses SQL Server's date/time functions to check whether the dates are consecutive. Here is the query solving the second task, with the corresponding result set:

SELECT MIN(sale_date) AS range_start,
       MAX(sale_date) AS range_end
FROM (SELECT sale_date,
             (SELECT MIN(B.sale_date)
              FROM Sales AS B
              WHERE B.sale_date >= A.sale_date
                AND NOT EXISTS
                    (SELECT *
                     FROM Sales AS C
                     WHERE C.sale_date = DATEADD(DAY, 1, B.sale_date))) AS grp
      FROM Sales AS A) AS T
GROUP BY grp;


/*

range_start range_end
----------- ----------
2009-12-01  2009-12-04
2009-12-07  2009-12-09
2009-12-11  2009-12-11

*/

While the query does not look too difficult in this example, in more complex scenarios it can become very hard to understand, and its performance will not be great.

SQL Server 2005 introduced the ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE). They provide a new tool to solve our problem in a simpler and more efficient manner: very simple math can produce the grouping factor. Take a look at the following query and its results:

SELECT transaction_nbr,
       ROW_NUMBER() OVER(ORDER BY transaction_nbr) AS rk,
       transaction_nbr - ROW_NUMBER() OVER(ORDER BY transaction_nbr) AS grp
FROM Sales;

/*

transaction_nbr rk                   grp
--------------- -------------------- --------------------
1               1                    0
3               2                    1
4               3                    1
5               4                    1
6               5                    1
9               6                    3
11              7                    4
12              8                    4

*/

The query generates a rank by transaction number and defines an expression subtracting that rank from the transaction number. Comparing the transaction number and rank columns, it is easy to see that transaction numbers increase by 1 only where there are no gaps, while ranks always increase by 1. Subtracting a sequentially increasing number from a set of numbers without gaps yields a constant (both sequences increase by 1); where the set has gaps, the subtraction yields a different number. This is the basis for defining the grouping factor for our ranges.

Here is the final query to solve the first task to find ranges by transaction number:

SELECT MIN(transaction_nbr) AS range_start,
       MAX(transaction_nbr) AS range_end
FROM (SELECT transaction_nbr,
             transaction_nbr - ROW_NUMBER() OVER(ORDER BY transaction_nbr) AS grp
      FROM Sales AS A) AS T
GROUP BY grp;


/*

range_start range_end
----------- -----------
1           1
3           6
9           9
11          12

*/

Utilizing the ranking functions allows a new algorithm, which results in a simpler and better performing solution.

The solution for finding date ranges is very similar. Here are two versions with minor differences. The first version uses the difference in days between a fixed date (January 1, 2000) and the sale date. This difference generates sequential numeric values when there are no gaps between dates and skips numbers when gaps exist.

-- Using a number as grouping factor
SELECT MIN(sale_date) AS range_start,
       MAX(sale_date) AS range_end
FROM (SELECT sale_date,
             DATEDIFF(DAY, '20000101', sale_date) -
             ROW_NUMBER() OVER(ORDER BY sale_date) AS grp
      FROM Sales AS A) AS T
GROUP BY grp;

/*

range_start range_end
----------- ----------
2009-12-01  2009-12-04
2009-12-07  2009-12-09
2009-12-11  2009-12-11

*/

The second version subtracts a number of days (the rank based on sale date) from the sale date. This in effect generates a constant date when sale dates are in sequence.

-- Using a date as grouping factor
SELECT MIN(sale_date) AS range_start,
       MAX(sale_date) AS range_end
FROM (SELECT sale_date,
             DATEADD(DAY, -ROW_NUMBER() OVER(ORDER BY sale_date), sale_date) AS grp
      FROM Sales AS A) AS T
GROUP BY grp;

The same technique can be used in many different scenarios. In more complex cases (for example, when partitioning by a column is required) multiple ranking functions may be needed. Here is one example:

Grouping with ROW_NUMBER
http://www.tangrainc.com/blog/2008/03/grouping-with-row_number/
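And as a hedged sketch of the partitioned variant (using a hypothetical CustomerSales table, not the example from the post above): adding PARTITION BY to the ranking function computes the islands independently per customer.

-- Hypothetical: islands of consecutive sale dates per customer, assuming
-- CustomerSales has at most one row per customer per sale date.
SELECT customer_nbr,
       MIN(sale_date) AS range_start,
       MAX(sale_date) AS range_end
FROM (SELECT customer_nbr,
             sale_date,
             DATEADD(DAY,
                     -ROW_NUMBER() OVER(PARTITION BY customer_nbr
                                        ORDER BY sale_date),
                     sale_date) AS grp
      FROM CustomerSales) AS T
GROUP BY customer_nbr, grp;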