February 25, 2017

Sampling all data in tables in MS SQL Server

Sampling data

Data warehouse comes in different shape and sizes, our legacy implementation of Data warehouse is Microsoft SQL Server 2008 with many tables inside. Recently we have to export a sample of all tables in our database for auditing purpose, specifically verifying there is no PII (Personal Identifiable Information) in our database.

A simple approach is doing a SELECT TOP 100 * from each table. This can be a little tedious if you have hundreds of tables. One way is to use a for loop to go through each of the tables.

This query will return the names of all the tables in the current database.
SELECT name FROM sysobjects WHERE xtype='U' order by 1;
Or if your database is organised in a schema, you will need the schema name together with the table name.
SELECT
    t.name AS table_name,
    SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

With that principle, we can loop through each table and apply SELECT TOP 100 * on each of the tables.
DECLARE @table_name nvarchar(200)
DECLARE @schema_name nvarchar(200)
DECLARE @sql nvarchar(max)

DECLARE db_cursor CURSOR FOR
SELECT
    t.name AS table_name,
    SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @table_name, @schema_name

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @schema_name + '.' + @table_name

    SELECT
        @sql = N'SELECT TOP 100 * FROM [' + @schema_name + '].[' + @table_name + ']'
    EXEC sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @table_name, @schema_name
END

CLOSE db_cursor
DEALLOCATE db_cursor

Well, turn out that MSSQL has a built-in stored procedure to loop through each of the tables.
exec sp_MSforeachtable @command1='SELECT TOP 100 * FROM ?'

Another improvement: the statement 'SELECT TOP 100 *' does just that: top 100 rows from the table. It might not give a good idea about the data in the table. The top 100 rows might contain only bad data and do not give us a grasp of the data in the table. We can use 'TABLESAMPLE' to get more random rows, a true sampling of data.
   SELECT * FROM [TABLENAME] TABLESAMPLE(100 ROWS);
OR SELECT * FROM [TABLENAME] TABLESAMPLE(1 PERCENT);
So sampling of data in tables can be achieved with a one-liner:
exec sp_MSforeachtable @command1='SELECT * FROM ? TABLESAMPLE(100 ROWS)'