Obscure Real Data for Testing
Very often we have a good set of production data but it cannot be used for training or testing because of data confidentiality. And the solution sometimes is just to tweak pieces of the data so it cannot be linked to the original data. Here is one method for obscure data in SQL Server 2005. It is based on a couple techniques:
– Using table with numbers (here generated on the fly by cross joining common table expressions)
– Using the table with numbers to slice each value into individual characters
– Generating random values in the normal character range that will be used to replace the real characters
– Using FOR XML PATH with empty element to concatenate back the obscured value.
CREATE TABLE Patients (
pname VARCHAR(35),
pname_masked VARCHAR(35));
INSERT INTO Patients VALUES('John Doe', NULL);
INSERT INTO Patients VALUES('Jeff Smith', NULL);
WITH
N1 (n) AS (SELECT 1 UNION ALL SELECT 1),
N2 (n) AS (SELECT 1 FROM N1 AS X, N1 AS Y),
N3 (n) AS (SELECT 1 FROM N2 AS X, N2 AS Y),
N4 (n) AS (SELECT ROW_NUMBER()
OVER(ORDER BY X.n) FROM N3 AS X, N3 AS Y)
UPDATE Patients
SET pname_masked =
CAST(
CAST((SELECT CASE
WHEN SUBSTRING(pname, n, 1) = ' '
THEN ' '
ELSE CHAR(x + ASCII(SUBSTRING(pname, n, 1)))
END
FROM N4 AS Nums
CROSS APPLY
(SELECT TOP(1) T.n AS x
FROM N4 AS T
WHERE T.n + ASCII(UPPER(SUBSTRING(pname, Nums.n, 1))) BETWEEN 65 AND 90
ORDER BY CHECKSUM(NEWID())) AS N
WHERE Nums.n <= LEN(pname)
FOR XML PATH('')) AS XML) AS VARCHAR(35));
SELECT pname, pname_masked
FROM Patients;
-- Results
pname pname_masked
------------ -------------
John Doe Xwlv Lyi
Jeff Smith Qmol Yuyvs
Great post! Very simple and helpful technique. Thanks!