September 11, 2017

Using AWS Athena to query CSV files in S3



Have you thought of trying out AWS Athena to query your CSV files in S3? This post outlines some steps you would need to do to get Athena parsing your files correctly. Let's walk through it step by step.

Pet data

Let's start with a simple data about our pets.

$ cat pet.csv
date_of_birth,pet_type,pet_name,weight,age
"July 10, 2016",cat,"Bella, The Great",10.5,1
"January 1, 2015",cat,Tiger,4.2,2
"January 10, 2007",cat,Chole,6.5,3
"March 10, 2007",dog,"Max, The Big",20.1,2
"June 2, 2007",dog,Molly,22.8,3
"August 1, 2011",dog,Lulu,40.4,2
"September 11, 2007",dog,Coco,35.6,5

Our pet data has some dates, doubles, ints, and strings with commas in it. It is a good representation of a typical CSV file data you might have. I uploaded this pet.csv into an S3 location: S3://test-athena-linh/pet/, then headed to AWS Athena to create the table. (NOTE: run any required Drop table statement before running create statement).

CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
  `date_of_birth` date,
  `pet_type` string,
  `pet_name` string,
  `weight` double,
  `age` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');

Our DDL is quite ambitious with double, int, and date as the data types. Let's see if Athena can parse our data correctly. Viewing the data is interesting because with the above table definition Athena doesn't parse the comma in quote correctly using LazySimpleSerDe. Because of having a comma in the middle of a field, columns are shifted.

Columns are shifted with quoted comma

Quirk #1: with all default setting, Athena doesn't deal with quoted commas in CSV very well.

LazySimpleSerDe doesn't handle quoted comma very well. I've discovered OpenCSVSerde can work with quoted commas by specifying quoteChar = '"'. Let's give it a try:

CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
  `date_of_birth` date,
  `pet_type` string,
  `pet_name` string,
  `weight` double,
  `age` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'quoteChar' = '"',
  'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');


Create table ran fine, but I hit an error when viewing the data. The error message wasn't very helpful, it just mentioned the column rather than the actual value causing the error.

Athena error message is not very detailed

Quirk #2: When there is an error, there is little extra information about why it occurs.

I am guessing error relates to parsing data type. Let's change all types to 'string' and see what the data look like.

CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
  `date_of_birth` string,
  `pet_type` string,
  `pet_name` string,
  `weight` string,
  `age` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'quoteChar' = '"',
  'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');

Now with all string data types, our CSV can be read correctly with quoted commas.

No error when parsing String data type

From the output, we can see header row is included and break type parsing. Searching on the Internet suggested OpenCSVSerde has a config in TBLPROPERTIES 'skip.header.line.count'='1' which could be useful. However, trying it out in Athena didn't lead to the expected outcome. AWS Support confirmed:  "it's a known issue in Athena that property "skip.header.line.count"="1" does not work because of the following Presto issue: https://github.com/prestodb/presto/issues/1848. Hive does honour the skip.header.line property and skips header while querying the table. However, Presto displays the header record when querying the same table. The Athena Product team is aware of this issue and is planning to fix it."

Quirk #3: header row is included in the result set when using OpenCSVSerde.

We can certainly exclude header using query condition, but we can't do arithmetic operations (SUM, AVG) on strings.

Arithmetic on Strings doesn't work

To move forward with our data and accomodating all Athena quirks so far, we will need to run CREATE table as strings and do type conversion on the fly.

SELECT SUM(weight)
FROM 
  (
    SELECT 
      date_of_birth,
      pet_type,
      pet_name,
      cast(weight AS DOUBLE) as weight,
      cast(age AS INTEGER) as age
    FROM athena_test."pet_data" 
    WHERE date_of_birth <> 'date_of_birth'
  )

Typecasting in an inner query allows an outer query to do arithmetic

Yayyy, we got the expected result.

Let's attempt to parse the date in the format "MMMM dd, yyyy" (e.g. "July 10, 2017") as we might want to do arithmetic on dates rather than leaving it as a string. Presto has some date_parse function taking in a string and a pattern, date patterns are defined here: https://prestodb.io/docs/current/functions/datetime.html. The above date pattern is represented as '%M %d, %Y'.

select date_parse('July 10, 2017', '%M %d, %Y');
-- result: 2017-07-10 00:00:00.000

Our complete casting of pet_data to its right data types looks like this:

SELECT
  date_parse(date_of_birth, '%M %d, %Y') as date_of_birth,
  pet_type,
  pet_name,
  cast(weight AS DOUBLE) as weight,
  cast(age AS INTEGER) as age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'

Unfortunately, we will need to wrap this as a subquery for any query we want to do on our pet_data table. It would be better to be able to create a view on this. Potentially we could create a view to encapsulate the inner query, but as far as I know, Athena doesn't support View. I have tried a version of Presto Create Table AS but hit an error. I think Athena SQL is not the full set of Presto SQL: https://prestodb.io/docs/current/sql/create-table-as.html

-- This query has error
CREATE TABLE athena_test.pet_data_typed(date_of_birth, pet_type, pet_name, weight, age)
AS
SELECT 
  date_of_birth,
  pet_type,
  pet_name,
  cast(weight AS DOUBLE) as weight,
  cast(age AS INTEGER) as age
FROM athena_test."pet_data" 
WHERE date_of_birth <> 'date_of_birth'

Athena doesn't support CREATE TABLE AS

Quirk #4: Athena doesn't support View

From my trial with Athena so far, I am quite disappointed in how Athena handles CSV files. There is a lot of fiddling around with typecasting. Not sure what I did wrong there, please point out how I could improve on the above if you have a better way, and thanks in advance. I've heard Athena might work better with JSON or Avro files, I will try this out next. I will also write about my experience with Google BigQuery in future blog posts. Stay tuned!

February 25, 2017

Sampling all data in tables in MS SQL Server

Sampling data

Data warehouse comes in different shape and sizes, our legacy implementation of Data warehouse is Microsoft SQL Server 2008 with many tables inside. Recently we have to export a sample of all tables in our database for auditing purpose, specifically verifying there is no PII (Personal Identifiable Information) in our database.

A simple approach is doing a SELECT TOP 100 * from each table. This can be a little tedious if you have hundreds of tables. One way is to use a for loop to go through each of the tables.

This query will return the names of all the tables in the current database.
SELECT name FROM sysobjects WHERE xtype='U' order by 1;
Or if your database is organised in a schema, you will need the schema name together with the table name.
SELECT
    t.name AS table_name,
    SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

With that principle, we can loop through each table and apply SELECT TOP 100 * on each of the tables.
DECLARE @table_name nvarchar(200)
DECLARE @schema_name nvarchar(200)
DECLARE @sql nvarchar(max)

DECLARE db_cursor CURSOR FOR
SELECT
    t.name AS table_name,
    SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @table_name, @schema_name

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @schema_name + '.' + @table_name

    SELECT
        @sql = N'SELECT TOP 100 * FROM [' + @schema_name + '].[' + @table_name + ']'
    EXEC sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @table_name, @schema_name
END

CLOSE db_cursor
DEALLOCATE db_cursor

Well, turn out that MSSQL has a built-in stored procedure to loop through each of the tables.
exec sp_MSforeachtable @command1='SELECT TOP 100 * FROM ?'

Another improvement: the statement 'SELECT TOP 100 *' does just that: top 100 rows from the table. It might not give a good idea about the data in the table. The top 100 rows might contain only bad data and do not give us a grasp of the data in the table. We can use 'TABLESAMPLE' to get more random rows, a true sampling of data.
   SELECT * FROM [TABLENAME] TABLESAMPLE(100 ROWS);
OR SELECT * FROM [TABLENAME] TABLESAMPLE(1 PERCENT);
So sampling of data in tables can be achieved with a one-liner:
exec sp_MSforeachtable @command1='SELECT * FROM ? TABLESAMPLE(100 ROWS)'