Sampling data |
Data warehouse comes in different shape and sizes, our legacy implementation of Data warehouse is Microsoft SQL Server 2008 with many tables inside. Recently we have to export a sample of all tables in our database for auditing purpose, specifically verifying there is no PII (Personal Identifiable Information) in our database.
A simple approach is doing a SELECT TOP 100 * from each table. This can be a little tedious if you have hundreds of tables. One way is to use a for loop to go through each of the tables.
This query will return the names of all the tables in the current database.
SELECT name FROM sysobjects WHERE xtype='U' order by 1;
Or if your database is organised in a schema, you will need the schema name together with the table name.
With that principle, we can loop through each table and apply SELECT TOP 100 * on each of the tables.
Well, turn out that MSSQL has a built-in stored procedure to loop through each of the tables.
Another improvement: the statement 'SELECT TOP 100 *' does just that: top 100 rows from the table. It might not give a good idea about the data in the table. The top 100 rows might contain only bad data and do not give us a grasp of the data in the table. We can use 'TABLESAMPLE' to get more random rows, a true sampling of data.
So sampling of data in tables can be achieved with a one-liner:
SELECT t.name AS table_name, SCHEMA_NAME(schema_id) AS schema_name FROM sys.tables AS t ORDER BY schema_name, table_name;
With that principle, we can loop through each table and apply SELECT TOP 100 * on each of the tables.
DECLARE @table_name nvarchar(200) DECLARE @schema_name nvarchar(200) DECLARE @sql nvarchar(max) DECLARE db_cursor CURSOR FOR SELECT t.name AS table_name, SCHEMA_NAME(schema_id) AS schema_name FROM sys.tables AS t ORDER BY schema_name, table_name; OPEN db_cursor FETCH NEXT FROM db_cursor INTO @table_name, @schema_name WHILE @@FETCH_STATUS = 0 BEGIN PRINT @schema_name + '.' + @table_name SELECT @sql = N'SELECT TOP 100 * FROM [' + @schema_name + '].[' + @table_name + ']' EXEC sp_executesql @sql; FETCH NEXT FROM db_cursor INTO @table_name, @schema_name END CLOSE db_cursor DEALLOCATE db_cursor
Well, turn out that MSSQL has a built-in stored procedure to loop through each of the tables.
exec sp_MSforeachtable @command1='SELECT TOP 100 * FROM ?'
Another improvement: the statement 'SELECT TOP 100 *' does just that: top 100 rows from the table. It might not give a good idea about the data in the table. The top 100 rows might contain only bad data and do not give us a grasp of the data in the table. We can use 'TABLESAMPLE' to get more random rows, a true sampling of data.
SELECT * FROM [TABLENAME] TABLESAMPLE(100 ROWS); OR SELECT * FROM [TABLENAME] TABLESAMPLE(1 PERCENT);
exec sp_MSforeachtable @command1='SELECT * FROM ? TABLESAMPLE(100 ROWS)'