Cleaning Data with Recursive CTE
SQL Server 2005 added a great new feature: Common Table Expressions (CTE). And even better than that – recursive CTEs. That provides a new powerful tool to solve many SQL problems. One of the areas where recursive CTEs shine is the hierarchical data management.
Here is another side of the recursive CTEs – utilizing them for some common tasks like cleaning data. The problem: a table has a column with values that have invalid characters. The task is to replace all those invalid characters with a space. Unfortunately the REPLACE function does not support pattern matching and each character in the column has to be verified individually and replaced if it falls in the invalid range. The solution below utilizes a recursive CTE to walk though the ACSII table of characters and to replace the invalid characters in the column values.
-- Create test table.
CREATE TABLE Foobar (
key_col INT PRIMARY KEY,
text_col NVARCHAR(100));
-- Populate sample data.
INSERT INTO Foobar VALUES (1, N'ABC!@#%DEFgh');
INSERT INTO Foobar VALUES (2, N'~!102WXY&*()_Z');
-- Perform the cleanup with recursive CTE.
WITH Clean (key_col, text_col, ch)
AS
(SELECT key_col,
REPLACE(text_col, CHAR(255), ' '),
255
FROM Foobar
UNION ALL
SELECT key_col,
CASE WHEN
CHAR(ch - 1) NOT LIKE '[A-Z]'
THEN REPLACE(text_col, CHAR(ch - 1), ' ')
ELSE text_col END,
ch - 1
FROM Clean
WHERE ch > 1)
SELECT key_col, text_col
FROM Clean
WHERE ch = 1
OPTION (MAXRECURSION 255);
On a side note – the recursive CTEs are not the best performers. Also, by default a CTE allows only 100 levels of recursion. The MAXRECURSION hint can be used to set higher level (a value between 0 and 32767; setting to 0 will remove the limit). Be aware that settings MAXRECURSION to 0 may create an infinite loop.
Here is a different method using utility table with numbers and FOR XML PATH, which is more effective:
WITH Clean (key_col, text_col)
AS
(SELECT key_col, REPLACE(CAST(
(SELECT CASE
WHEN SUBSTRING(text_col, n, 1) LIKE '[A-Z]'
THEN SUBSTRING(text_col, n, 1)
ELSE '.'
END
FROM (SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND 100) AS Nums(n)
WHERE n <= LEN(text_col)
FOR XML PATH('')) AS NVARCHAR(100)), '.', ' ')
FROM Foobar)
SELECT key_col, text_col
FROM Clean;