September 11, 2017

Using AWS Athena to query CSV files in S3



Have you thought of trying out AWS Athena to query your CSV files in S3? This post outlines the steps you need to take to get Athena parsing your files correctly. Let's walk through it step by step.

Pet data

Let's start with some simple data about our pets.

$ cat pet.csv
date_of_birth,pet_type,pet_name,weight,age
"July 10, 2016",cat,"Bella, The Great",10.5,1
"January 1, 2015",cat,Tiger,4.2,2
"January 10, 2007",cat,Chole,6.5,3
"March 10, 2007",dog,"Max, The Big",20.1,2
"June 2, 2007",dog,Molly,22.8,3
"August 1, 2011",dog,Lulu,40.4,2
"September 11, 2007",dog,Coco,35.6,5

Our pet data has some dates, doubles, ints, and strings with commas in them. It is a good representation of a typical CSV file you might have. I uploaded this pet.csv into an S3 location, s3://test-athena-linh/pet/, then headed to AWS Athena to create the table. (NOTE: drop the table first if it already exists before running the create statement.)
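If the table already exists from an earlier attempt, a drop statement along these lines (using the database and table names from the DDL below) does the job:

DROP TABLE IF EXISTS athena_test.pet_data;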

CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
  `date_of_birth` date,
  `pet_type` string,
  `pet_name` string,
  `weight` double,
  `age` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');

Our DDL is quite ambitious, with double, int, and date as the data types. Let's see if Athena can parse our data correctly. Viewing the data is interesting: with the above table definition, Athena doesn't handle the quoted commas, because LazySimpleSerDe splits on every comma regardless of quoting. A value like "Bella, The Great" is broken into two fields, and every column after it shifts one place.
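A quick preview query (not from the original screenshots, just for illustration) is enough to see it:

SELECT * FROM athena_test.pet_data LIMIT 10;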

Columns are shifted with quoted comma

Quirk #1: with all default settings, Athena doesn't deal with quoted commas in CSV very well.

LazySimpleSerDe doesn't handle quoted commas very well. I've discovered that OpenCSVSerde can handle quoted commas by specifying quoteChar = '"'. Let's give it a try:

CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
  `date_of_birth` date,
  `pet_type` string,
  `pet_name` string,
  `weight` double,
  `age` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'quoteChar' = '"',
  'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');


The CREATE TABLE ran fine, but I hit an error when viewing the data. The error message wasn't very helpful; it just mentioned the column rather than the actual value causing the error.

Athena error message is not very detailed

Quirk #2: When there is an error, there is little extra information about why it occurs.

I am guessing the error relates to parsing the data types. Let's change all the types to string and see what the data looks like.

CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
  `date_of_birth` string,
  `pet_type` string,
  `pet_name` string,
  `weight` string,
  `age` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'quoteChar' = '"',
  'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');

Now, with all string data types, our CSV with quoted commas can be read correctly.

No error when parsing String data type

From the output, we can see that the header row is included in the result set and breaks the type parsing. Searching on the Internet suggested OpenCSVSerde has a TBLPROPERTIES config, 'skip.header.line.count'='1', which could be useful. However, trying it out in Athena didn't lead to the expected outcome. AWS Support confirmed: "it's a known issue in Athena that property "skip.header.line.count"="1" does not work because of the following Presto issue: https://github.com/prestodb/presto/issues/1848. Hive does honour the skip.header.line property and skips header while querying the table. However, Presto displays the header record when querying the same table. The Athena Product team is aware of this issue and is planning to fix it."
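For reference, the property sits in the table properties of the DDL, so the last line of the CREATE statement would become something like the following (shown only for completeness since, as described, it has no effect in Athena):

TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');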

Quirk #3: header row is included in the result set when using OpenCSVSerde.

We can certainly exclude the header row with a query condition, but we can't do arithmetic operations (SUM, AVG) on strings.

Arithmetic on Strings doesn't work
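For illustration, excluding the header but aggregating the string column directly, something like the query below, is what gets rejected, because SUM isn't defined for varchar:

SELECT SUM(weight)
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'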

To move forward with our data while accommodating all the Athena quirks so far, we will need to create the table with string columns and do the type conversion on the fly.

SELECT SUM(weight)
FROM 
  (
    SELECT 
      date_of_birth,
      pet_type,
      pet_name,
      cast(weight AS DOUBLE) as weight,
      cast(age AS INTEGER) as age
    FROM athena_test."pet_data" 
    WHERE date_of_birth <> 'date_of_birth'
  )

Typecasting in an inner query allows an outer query to do arithmetic

Yayyy, we got the expected result.

Let's attempt to parse the date in the format "MMMM dd, yyyy" (e.g. "July 10, 2017"), as we might want to do arithmetic on dates rather than leaving them as strings. Presto has a date_parse function that takes a string and a pattern; the date patterns are defined here: https://prestodb.io/docs/current/functions/datetime.html. The above date format is represented as '%M %d, %Y'.

select date_parse('July 10, 2017', '%M %d, %Y');
-- result: 2017-07-10 00:00:00.000

Our complete casting of pet_data to its right data types looks like this:

SELECT
  date_parse(date_of_birth, '%M %d, %Y') as date_of_birth,
  pet_type,
  pet_name,
  cast(weight AS DOUBLE) as weight,
  cast(age AS INTEGER) as age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'

Unfortunately, we will need to wrap this as a subquery in any query we want to run on our pet_data table. It would be better to create a view that encapsulates the inner query, but as far as I know, Athena doesn't support views. I have tried a version of Presto CREATE TABLE AS but hit an error; I think Athena SQL is not the full set of Presto SQL: https://prestodb.io/docs/current/sql/create-table-as.html

-- This query has error
CREATE TABLE athena_test.pet_data_typed(date_of_birth, pet_type, pet_name, weight, age)
AS
SELECT 
  date_of_birth,
  pet_type,
  pet_name,
  cast(weight AS DOUBLE) as weight,
  cast(age AS INTEGER) as age
FROM athena_test."pet_data" 
WHERE date_of_birth <> 'date_of_birth'

Athena doesn't support CREATE TABLE AS

Quirk #4: Athena doesn't support View

From my trial with Athena so far, I am quite disappointed with how it handles CSV files; there is a lot of fiddling around with typecasting. I'm not sure what I did wrong there, so please point out how I could improve on the above if you have a better way, and thanks in advance. I've heard Athena might work better with JSON or Avro files, so I will try that out next. I will also write about my experience with Google BigQuery in future blog posts. Stay tuned!

February 25, 2017

Sampling all data in tables in MS SQL Server

Sampling data

Data warehouses come in different shapes and sizes; our legacy data warehouse is Microsoft SQL Server 2008, with many tables inside. Recently we had to export a sample of all tables in our database for auditing purposes, specifically to verify there is no PII (Personally Identifiable Information) in our database.

A simple approach is to do a SELECT TOP 100 * from each table. This can be a little tedious if you have hundreds of tables. One way is to use a loop to go through each of the tables.

This query will return the names of all the tables in the current database.
SELECT name FROM sysobjects WHERE xtype='U' order by 1;
Or, if your database is organised into schemas, you will need the schema name together with the table name.
SELECT
    t.name AS table_name,
    SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

With that principle, we can loop through the tables and apply SELECT TOP 100 * to each one.
DECLARE @table_name nvarchar(200)
DECLARE @schema_name nvarchar(200)
DECLARE @sql nvarchar(max)

DECLARE db_cursor CURSOR FOR
SELECT
    t.name AS table_name,
    SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @table_name, @schema_name

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @schema_name + '.' + @table_name

    SELECT
        @sql = N'SELECT TOP 100 * FROM [' + @schema_name + '].[' + @table_name + ']'
    EXEC sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @table_name, @schema_name
END

CLOSE db_cursor
DEALLOCATE db_cursor

Well, it turns out that MSSQL has a built-in stored procedure to loop through each of the tables.
exec sp_MSforeachtable @command1='SELECT TOP 100 * FROM ?'

Another improvement: the statement 'SELECT TOP 100 *' does just that: it returns the top 100 rows from the table, which might not give a good picture of the data; the first 100 rows could contain only bad data. We can use 'TABLESAMPLE' to get more random rows, a true sampling of the data.
SELECT * FROM [TABLENAME] TABLESAMPLE(100 ROWS);
-- or
SELECT * FROM [TABLENAME] TABLESAMPLE(1 PERCENT);
So sampling of data in tables can be achieved with a one-liner:
exec sp_MSforeachtable @command1='SELECT * FROM ? TABLESAMPLE(100 ROWS)'

January 28, 2016

How to reduce AWS Windows server creation time



So I've been working with the AWS cloud to host some Windows servers. Embracing the concept of "infrastructure as code", I coded the server setup as an AWS CloudFormation template. As part of server creation, some custom scripts might need to be run to initialise the server. Those scripts can be specified in an AWS::CloudFormation::Init block. A snippet of that is below:
    "WindowsInstance": {
      "Metadata": {
        "AWS::CloudFormation::Init": {
          "config": {
            "commands": {
              "00-configEnvAndCopyScripts": {
                "command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\config_env_and_copy_scripts.ps1"
              },
              "01-settingTime": {
                "command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\set_time.ps1"
              },
              ...

All well and good so far. But I can tell you how much I hate Windows servers... In the AWS cloud, it can take up to 30 minutes to spin up a Windows server. Luckily, AWS outputs the log of the initialisation process to the file cfn-init.log so that we can trace what is going on. A sample of the output looks like this:
2016-01-14 05:08:50,957 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.ap-southeast-2.amazonaws.com
2016-01-14 05:08:50,957 [DEBUG] Describing resource WindowsInstance in stack arn:aws:cloudformation:ap-southeast-2:4623423112354:stack/tableau-a/abcdef-123456
2016-01-14 05:08:51,535 [INFO] -----------------------Starting build-----------------------
2016-01-14 05:08:51,737 [DEBUG] Creating Scheduled Task for cfn-init resume
2016-01-14 05:08:51,816 [DEBUG] Scheduled Task created
2016-01-14 05:08:51,832 [INFO] Running configSets: default
2016-01-14 05:08:51,832 [INFO] Running configSet default
2016-01-14 05:08:51,832 [INFO] Running config config
2016-01-14 05:08:51,832 [DEBUG] No packages specified
2016-01-14 05:08:51,832 [DEBUG] No groups specified
2016-01-14 05:08:51,832 [DEBUG] No users specified
2016-01-14 05:08:51,832 [DEBUG] No sources specified
2016-01-14 05:08:51,878 [DEBUG] Running command 00-configEnvAndCopyScripts
2016-01-14 05:08:51,878 [DEBUG] No test for command 00-configEnvAndCopyScripts
2016-01-14 05:10:47,171 [INFO] Command 00-configEnvAndCopyScripts succeeded
2016-01-14 05:10:47,171 [DEBUG] Command 00-configEnvAndCopyScripts output: The operation completed successfully.
2016-01-14 05:10:47,173 [INFO] Waiting 60 seconds for reboot
2016-01-14 05:10:48,187 [DEBUG] Running command 01-settingTime
2016-01-14 05:10:48,187 [DEBUG] No test for command 01-settingTime
2016-01-14 05:10:48,437 [INFO] Command 01-settingTime succeeded
2016-01-14 05:10:48,437 [DEBUG] Command 01-settingTime output: The operation completed successfully.
2016-01-14 05:10:48,437 [INFO] Waiting 60 seconds for reboot
Looking at the output, I was curious about the line Waiting 60 seconds for reboot. What was it doing? Why does it wait for 60 seconds? I searched around and finally got an answer: by default, every command in AWS::CloudFormation::Init waits 60 seconds in case the command causes a restart. But if you are sure that your command does NOT cause a restart, you don't need to wait. In that case, you can add "waitAfterCompletion": "0" to the command. It will save one minute per command, so if you have 60 commands, that is 60 minutes saved.

The new CloudFormation template looks like this:
      
"WindowsInstance": {
    "Metadata": {
        "AWS::CloudFormation::Init": {
          "config": {
            "commands": {
              "00-configEnvAndCopyScripts": {
                "command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\config_env_and_copy_scripts.ps1",
                "waitAfterCompletion": "0"
              },
              "01-settingTime": {
                "command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\set_time.ps1",
                "waitAfterCompletion": "0"
              },
              ...
Unfortunately, you need to repeat that waitAfterCompletion line in every command. As far as I know, there is no way to set the default wait to 0 seconds. If you find a better way to achieve this, please leave a comment below.

So you might say, "why don't you set everything up, then build an AMI from it, rather than doing the initialisation every time?". Good question! I like to build everything from scratch as I know exactly what goes into the server, and those changes are version controlled. An AMI is like a black box; you want to keep what is baked into it to a minimum, because if you lose that AMI, you might not know how to rebuild it.

Thanks and good luck with your AWS adventure.

August 9, 2015

Groovy script structure done right



Recently, I was tasked with writing a relatively complex data migration script. The script involves connecting to a MySQL database, querying existing data and then inserting it into a destination schema. Doing this in Bash would be quite hard to test and error-prone. A modern scripting language provides a better solution, e.g. Ruby, Scala, or Groovy. We opted to use Groovy as some of the team members have a Java background, so there is less friction when doing maintenance. This blog post shows you how to set up the basic structure of a Groovy script with Spock for unit testing and Gradle for building.

Groovy CLI

Firstly, we set up a basic script structure with Groovy CLI. Script: data-fix.groovy
#!/usr/bin/env groovy

def cli = new CliBuilder(usage:'data-fix')
cli.with {
    u longOpt: 'user', args: 1, argName: 'user', required: true, 'DB user'
    p longOpt: 'password', args: 1, argName: 'password', required: true, 'DB password'
    s longOpt: 'sourceSchema', args: 1, argName: 'sourceDbSchema', required: true, 'staging DB schema'
    d longOpt: 'destinationSchema', args: 1, argName: 'destDbSchema', required: true, 'production DB schema'
    h longOpt: 'host', args: 1, argName: 'dbHost', 'DB host, default to be localhost'
}

def opts = cli.parse(args)
if (!opts) {
    System.exit(1)
}

new Processor(opts).run()

Basic Processor class:
class Processor {
    def opts

    Processor(opts) {
        this.opts = opts
    }

    void run() {
        println "Running..."
    }
}

The above code can be viewed in this Github commit. Next up, we will set up Unit Testing.

Unit Testing with Spock and Gradle

Spock provides a nice testing framework. I am a fan of its easy mocking syntax and its BDD (Behaviour-Driven Development) "given, when, then" syntax. One way to set up Spock in Groovy is by using Gradle for build and dependency management.

By default, Gradle assumes a certain directory structure: src/main/groovy and src/test/groovy. (You can change this structure as described here.) We will move our code into that directory structure and create an empty test file ProcessorSpec.groovy under the src/test/groovy directory.
.
├── README.md
└── src
    ├── main
    │   └── groovy
    │       ├── data-fix.groovy
    │       └── Processor.groovy
    └── test
        └── groovy
            └── ProcessorSpec.groovy
Setting up build.gradle in the top directory:
apply plugin: "groovy"

version = "1.0"
description = "Spock Framework - Data fix Project"

// Spock works with Java 1.5 and above
//sourceCompatibility = 1.5

repositories {
  // Spock releases are available from Maven Central
  mavenCentral()
  // Spock snapshots are available from the Sonatype OSS snapshot repository
  maven { url "http://oss.sonatype.org/content/repositories/snapshots/" }
}

dependencies {
  // mandatory dependencies for using Spock
  compile "org.codehaus.groovy:groovy-all:2.4.1"
  testCompile "org.spockframework:spock-core:1.0-groovy-2.4"
  testCompile "cglib:cglib:2.2"
  testCompile "org.objenesis:objenesis:1.2"
}
Let's modify the file ProcessorSpec.groovy to have a failing test, so that we can confirm the test is actually run and everything is set up correctly.
import spock.lang.*

class ProcessSpec extends Specification {
    def "#first test"() {
        when:
        def a = true

        then:
        a == false
    }
}
Executing the Gradle build to see the test fail:
$ gradle --info clean test
...
Gradle Test Executor 2 finished executing tests.

ProcessSpec > #first test FAILED
    Condition not satisfied:

    a == false
    | |
    | false
    true
        at ProcessSpec.#first test(ProcessorSpec.groovy:9)

1 test completed, 1 failed
The above changes can be viewed in this Github commit.
The Gradle wrapper is great for ensuring the build runs the same way across different machines. On a machine that does not have Gradle installed, it will first download Gradle and then execute the build task. We can set up the Gradle wrapper with this simple command:
$ gradle wrapper
# The above command generates the wrapper scripts, and we can then execute our build with:
$ ./gradlew --info clean test

Adding libraries

We've got the basic skeleton done. The next step is to add logic to our script. The script will connect to a MySQL database, so we will add the mysql-connector driver. In addition, to debug the script, I'm a fan of adding logging statements to the flow. We will use @Grab to add these dependencies to the script data-fix.groovy.
file: data-fix.groovy
#!/usr/bin/env groovy

@GrabConfig(systemClassLoader=true)
@Grab('mysql:mysql-connector-java:5.1.27')
@Grab('log4j:log4j:1.2.17')
...

file: Processor.groovy
import groovy.sql.Sql
import org.apache.log4j.*
import groovy.util.logging.*

@Log4j
class Processor {
    def opts

    Processor(opts) {
        log.level = Level.DEBUG
        this.opts = opts
    }

    void run() {
        log.info "Running..."
    }
}
Running the script gives the expected log statement. However, running the Gradle build now fails with an exception in the :compileGroovy task:
[src/main/groovy] $ ./data-fix.groovy -h localhost -u root -p somepassword -s staging -d prod
INFO - Running...
[  top level dir] $ ./gradlew --info clean test
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':compileGroovy'.
> org/apache/ivy/core/report/ResolveReport
So what went wrong? @Grab uses Grape to manage dependencies, while Gradle has its own dependency management. At this point, we have two options: use Gradle to manage all dependencies and execute the script via Gradle, or mix and match Gradle and Grape (Grape for runtime, Gradle only for testing). Both options have their merits. I prefer the simplicity of Grape at runtime, so I will continue with the latter.
We will need to configure build.gradle to ignore Grape:
test {                                        
  systemProperty 'groovy.grape.enable', 'false'  
}

compileGroovy {
  groovyOptions.forkOptions.jvmArgs = [ '-Dgroovy.grape.enable=false' ]
}
compileTestGroovy {
  groovyOptions.forkOptions.jvmArgs = [ '-Dgroovy.grape.enable=false' ]
}
The above change can be viewed in this Github commit.

Using this method violates DRY (Don't Repeat Yourself), as dependencies are defined in two places: in the @Grab annotations and in the Gradle dependencies block. You can have a look at mrhaki's blog post if you want to invoke the Groovy script from a Gradle task instead; I found passing the script's command-line options as Gradle run properties a bit awkward, as the sketch below shows.
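For completeness, such a task would look roughly like this. It is not part of the original project: it assumes the script were renamed to DataFix.groovy (so it compiles to a conventional class name), and the property names (dbUser, dbPassword, sourceSchema, destSchema) are made up for illustration:

// hypothetical Gradle task, not in the original build.gradle
task runDataFix(type: JavaExec) {
    classpath = sourceSets.main.runtimeClasspath
    main = 'DataFix'
    // every CLI option has to be threaded through as a -P project property, e.g.
    //   ./gradlew runDataFix -PdbUser=root -PdbPassword=secret -PsourceSchema=staging -PdestSchema=prod
    args = ['-u', project.hasProperty('dbUser') ? project.dbUser : 'root',
            '-p', project.hasProperty('dbPassword') ? project.dbPassword : '',
            '-s', project.hasProperty('sourceSchema') ? project.sourceSchema : 'staging',
            '-d', project.hasProperty('destSchema') ? project.destSchema : 'prod']
}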

Adding more logic and tests

Simple logic - default localhost if the host is not provided

Now that we have a structure going, we can add more logic to our script. The first easy piece is to set the host to the parameter provided, otherwise defaulting to 'localhost'.
file: ProcessorSpec.groovy
    def "#new set host to parameter, or default to localhost"() {
        expect:
        new Processor([]).host == 'localhost'
        new Processor([h: 'myserver']).host == 'myserver'
    }

file: Processor.groovy
    Processor(opts) {
        log.level = Level.DEBUG
        this.opts = opts
        this.host = opts.h ?: 'localhost'
    }

    void run() {
        log.info "Host               : $host"
        log.info "User               : ${opts.u}"
        log.info "Password           : ${opts.p}"
        log.info "Source schema      : ${opts.s}"
        log.info "Destination schema : ${opts.d}"
    }
Running test:
[  top level dir] $ ./gradlew --info clean test
BUILD SUCCESSFUL
[src/main/groovy] $ ./data-fix.groovy -h myserver -u root -p somepassword -s staging -d prod
INFO - Host               : myserver
INFO - User               : root
INFO - Password           : somepassword
INFO - Source schema      : staging
INFO - Destination schema : prod
[src/main/groovy]$ ./data-fix.groovy -u root -p somepassword -s staging -d prod
INFO - Host               : localhost
INFO - User               : root
INFO - Password           : somepassword
INFO - Source schema      : staging
INFO - Destination schema : prod
The above changes can be viewed in this Github commit.

Summary

As you can see, Groovy is very easy to work with and powerful as a scripting language. Together with unit testing, you can be confident that your script does the right thing and is production-ready. I truly believe you should unit test everything, including scripts; and the above is the setup to achieve just that.


January 21, 2015

Query with MongoDB


As part of my work, we have been working with MongoDB. Coming from the SQL world of WHERE clauses and cryptic JOINs, I find MongoDB's JavaScript queries easy to work with. I will share some example queries in this post.
As background, we implemented a feature where users can endorse terms, similar to LinkedIn endorsements. We have the concepts of term and endorsement. A term is just a representation of a piece of text. An endorsement is an association between a term, a user, and the listing the user is endorsing. Here's what they look like:
testx:PRIMARY> db.terms.findOne();
{
    "_id" : ObjectId("53aa7ac62ecea3e746beaef2"),
    "text" : "Locally Owned",
    "origin" : "TERM_DB",
    "_created" : NumberLong("1403681478027"),
    "_modified" : NumberLong("1403681478027")
}

testx:PRIMARY> db.endorsements.findOne();
{
    "_id" : ObjectId("53aa7ba8d0adeb95abec766f"),
    "listingId" : "14938311",
    "userId" : "1",
    "headingCode" : "33650",
    "term" : DBRef("terms", ObjectId("53aa7acc2ecea3e746beaf20")),
    "_created" : NumberLong("1403681704148"),
    "_modified" : NumberLong("1403681704148")
}

1. Updating using Regular expression

When we went live initially, there were a bunch of terms starting with "Authentic", such as "Authentic Italian" and "Authentic Chinese". After a while, we didn't see much uptake for those terms, so we decided to drop the word "Authentic". The good news is Mongo supports regex, so updating them is quite easy: find all terms starting with "Authentic" and strip that prefix.
function updateTermWithAuthentic() {
    //drop the word Authentic in front of "Authentic Some Cuisine"; "Authentic " is 10 characters including the space, hence the substring of 10.
    db.terms.find({
        text: { $regex: '^Authentic' }
    }).forEach(function(doc) {
        db.terms.update(
            { _id: doc._id }, 
            { $set: { text: doc.text.substring(10) } }
        )
    })
}
We can also update a term's text to a new value, with a bit of print logging:
function updateTermText(oldTerm, newTerm) {
    var cursor = db.terms.find({text: oldTerm});
    if (cursor.hasNext()) {
        db.terms.update(
            { text: oldTerm },
            { $set: { text: newTerm } }
        )
        print("Done - Processing '" + oldTerm + "' - replaced with new term: '" + newTerm + "'.");
    } else {
        print(">>> ERROR - Processing '" + oldTerm + "'... Don't have term '" + oldTerm + "' in the database, maybe typo?.");
    }
}

updateTermText("Good, High Quality", "Good or High Quality");

2. Count distinct

How many users submitted endorsements:
testx:PRIMARY> db.endorsements.distinct('userId').length;

3. Run query on data after a certain date/timestamp

How many endorsements after a certain date, for example after 14-Aug-2014:
testx:PRIMARY> db.endorsements.find({
    _created: {
        $gte: new Date(2014,7,15).getTime()
    }
}).count();  //month is 0-11 
//Sample output: 441
How many unique users submitted endorsements after a certain date, for example after 14-Aug-2014:
testx:PRIMARY> db.runCommand({
    distinct: 'endorsements', 
    key: 'userId', 
    query: {
        _created: {
            $gte: new Date(2014,7,15).getTime()
        }
    }
}).values.length;
//Sample output: 113

4. Aggregation

Find the popular endorsement terms in each listing:
testx:PRIMARY> db.endorsements.aggregate(
    [
        {
            $group: {
                _id: { listingId: "$listingId", term: "$term" },
                count: { $sum: 1 }
            }
        },
        {
            $sort: { count: -1 }
        }
    ]
).forEach(function(myDoc) {
    print(myDoc._id.listingId + "," + myDoc._id.term + "," + myDoc.count);
});

/*
Sample output:
14813910,DBRef("terms", ObjectId("53aa7ae5d0adeb95abec7602")),10
13347379,DBRef("terms", ObjectId("53aa7ace2ecea3e746beaf34")),10
13501322,DBRef("terms", ObjectId("53aa7ae9d0adeb95abec761e")),7
13347379,DBRef("terms", ObjectId("53aa7ae1d0adeb95abec75e2")),6
*/
The DBRef and ID are a bit ugly; it would be nice to output the term text there. Using the DBRef, we can modify the forEach() function above to perform another query to get the term text:
testx:PRIMARY> db.endorsements.aggregate(
//similar to the above
//...
).forEach(function(myDoc) {
    var termRef = myDoc._id.term
    var termText = db[termRef.$ref].findOne( { _id: (termRef.$id) } )
    print(myDoc._id.listingId + "," + termText.text + "," + myDoc.count);
});

/*
Sample output:
15495422,Cosy Venue,10
14524676,A La Carte,10
14813910,Healthy,7
13407783,Elderly Friendly,6
*/

5. Aggregation, count and sorting

Which user contributed the most endorsements? We need to aggregate on userId.
testx:PRIMARY> db.endorsements.aggregate(
    [
        {
            $group: {
                _id: { userId: "$userId" },
                count: { $sum: 1 }
            }
        },
        {
            $sort: {count: -1}
        }
    ]
).forEach(function(myDoc) {
    print(myDoc._id.userId + "," + myDoc.count);
});

/*
Sample output: 
361259bc32c1bc79b8dffa4130f22a83ea7d1c,80
55b5ae81-000c-4a0e-87ee-b6a7cada5a01,38
4e31ccd3-371a-49a6-bfc7-e1f9c3e99672,24
b82e9b60-5e1b-42c2-8131-7ddc67cfd525,23
*/

6. Update the DBRef

Scenario: an existing term becomes obsolete, so we want to update all existing endorsements pointing to that obsolete term to point to a different term. For example, the term "Great Service" is obsolete and should be deleted; all existing endorsements should be updated to "Great or Professional Service".
First, find the term "Great Service" and the term "Great or Professional Service", and note down the ObjectId of each.
// find the ID of "Great Service"
testx:PRIMARY> db.terms.find({text: "Great Service"}).pretty();
{
     "_id" : ObjectId("54a2123744c96256390bee87"),
     "className" : "models.endorsement.Term",
     "text" : "Great Service",
     "origin" : "TERM_DB",
     "_created" : NumberLong("1419907639466"),
     "_modified" : NumberLong("1419907639466")
}
// find the ID of "Great or Professional Service"
testx:PRIMARY> db.terms.find({text: "Great or Professional Service"}).pretty();
{
    "_id" : ObjectId("53d99b0b44c96256390bec17"),
    "className" : "models.endorsement.Term",
    "text" : "Great or Professional Service",
    "origin" : "TERM_DB",
    "_created" : NumberLong("1406769931603"),
    "_modified" : NumberLong("1406769931603")
}
Perform the update for all endorsements having the term ID of "Great Service", pointing them to the ID of "Great or Professional Service" (note the multi option so every matching endorsement is updated, not just the first):
testx:PRIMARY> db.endorsements.update(
    { 
        "term.$id": ObjectId("54a2123744c96256390bee87")
    },
    { 
        $set: {
            term: { $ref: "terms", $id: ObjectId("53d99b0b44c96256390bec17") }
        }
    },
    { multi: true }
)
Finally, delete the term "Great Service":
// remove the term "Great Service"
testx:PRIMARY> db.terms.remove({text: "Great Service"})

7. Unique grouping from derived values

Scenario: we want to find the unique users contributing on each day. We could do a distinct query based on the userId and _created fields; however, _created is a timestamp, not a date. From the _created field, we need to derive a yyyy-MM-dd value and then do the distinct grouping. Luckily, MongoDB's group() function and its keyf option support exactly that.
testx:PRIMARY> db.endorsements.group({
    keyf: function(doc) {
        var createdDate = new Date(doc._created);
        return { id: doc.userId, year: createdDate.getFullYear(), month: createdDate.getMonth() + 1, date: createdDate.getDate() };
    },
    reduce: function ( curr, result ) { },
    initial: {}
}).forEach(function(doc) {
    print(doc.id + ", " + doc.year + "/" + doc.month + "/" + doc.date);
});

/*
Sample output:
55b5ae81-000c-4a0e-87ee-b6a7cada5a01, 2014/8/14
f2ded4d1-ff70-4b5e-88e4-e5e51d8a3780, 2014/8/14
a8b8a640-0c2d-443b-8cdd-a541d610cfe2, 2014/8/14
d589f10e-a97c-473d-aecf-ab847f5b42db, 2014/8/14
5d830945-9665-4298-aff1-6ab68b27c8af, 2014/8/14
*/



July 11, 2014

Automated JMeter Browser Test and Report in New Relic

As part of my work, we have New Relic set up to record and monitor back-end server response time and front-end browser response time. Recently I noticed that some of our JMeter browser tests did not record any traffic or response time in New Relic. This post documents my findings and the solution.

Firstly, I needed to work out whether it was a New Relic setting error, an error in my application's JavaScript, or something else. So I started all my browsers and hit the website URL, then waited a few minutes before looking into New Relic. This manual test reported correctly in New Relic, where I could see data for all browsers: Chrome, Firefox, IE. That means both my New Relic settings and my application's New Relic JavaScript setup are correct.

Manual browser test reported correctly in New Relic

So... what's wrong? I looked into how my JMeter browser test is set up. I use the JMeter Chrome Driver plugin, and the JMeter file looks like this:

JMeter browser test set up

The full script is here:

var javaUtilConcurrent = JavaImporter(java.util.concurrent);
var websampler = JavaImporter(org.openqa.selenium, org.openqa.selenium.support.ui, org.openqa.selenium.interactions.touch);

//open the browser before timing starts 
WDS.browser.get("about:blank")

with(websampler) {
    WDS.browser.manage().timeouts().implicitlyWait(10, javaUtilConcurrent.TimeUnit.SECONDS);
    var wait = new WebDriverWait(WDS.browser, 15);

    WDS.sampleResult.sampleStart(); 
    WDS.browser.get("http://"+WDS.args[0]);

    wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(".home-page"))); 

    //end timing
    WDS.sampleResult.sampleEnd();
}

// verify on homepage
if (!WDS.browser.getTitle().contains("Our website home page")) {
    WDS.sampleResult.setSuccessful(false);
    WDS.sampleResult.setResponseMessage("Homepage failed: Title does not match the expected value.")
}

I started the JMeter script and watched closely how it ran. I noticed the browser window might be closing too soon, maybe even before the New Relic reporting script kicked in. To confirm that the New Relic script was loaded correctly in the test browser, I put in a wait for an invalid element, so that when the test ran, the test browser would pop up and I would have enough time to inspect elements on the page. And YES, I could see the New Relic script at the bottom of the page in Chrome's element inspector.

<script type="text/javascript" src="http://js-agent.newrelic.com/nr-100.js"></script>
<script type="text/javascript" src="http://beacon-3.newrelic.com/1/somethingsomething"></script>

That confirms the page loads the script correctly, and also confirms my theory that JMeter closes the Chrome window too early, before the New Relic script has enough time to report back to New Relic. I then added a wait for the New Relic beacon script element right after waiting for the page to load, and before the "end timing" block.

wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("script[src*='beacon']")));
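In context, the tail of the timed block in the WebDriver Sampler script ends up looking something like this (a placement sketch based on the script above):

    wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(".home-page")));
    // give the New Relic beacon script time to appear before stopping the timer
    wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("script[src*='beacon']")));

    //end timing
    WDS.sampleResult.sampleEnd();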

Voila! It works, and the report in New Relic is as expected.

JMeter browser test reported correctly

June 18, 2014

Installing Karma behind proxy


As part of working with AngularJS, I tried to install the Karma test runner using npm on one of our new build slaves. I got this error while running the installation:
$ npm install karma

npm http 200 http://registry.npmjs.org/abbrev
> phantomjs@1.9.7-6 install /usr/lib/node_modules/karma-phantomjs-launcher/node_modules/phantomjs
> node install.js
Downloading http://cdn.bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
Saving to /usr/lib/node_modules/karma-phantomjs-launcher/node_modules/phantomjs/phantomjs/phantomjs-1.9.7-linux-x86_64.tar.bz2
Using proxy http://company-proxy.com:8080/
Receiving...
Error requesting archive.
Status: 403
Request options: {
  "protocol": "http:",
  "slashes": true,
  "auth": null,
  "host": "company-proxy.com:8080",
  "port": "8080",
  "hostname": "company-proxy.com",
  "hash": null,
  "search": null,
  "query": null,
  "pathname": "/",
  "path": "http://cdn.bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2",
  "href": "http://company-proxy.com:8080/",
  "headers": {
    "Host": "cdn.bitbucket.org",
    "User-Agent": "curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5"
  }
}
Response headers: {
  "cache-control": "no-cache",
  "pragma": "no-cache",
  "content-type": "text/html; charset=utf-8",
  "proxy-connection": "Keep-Alive",
  "connection": "Keep-Alive",
  "content-length": "606"
}
Make sure your network and proxy settings are correct.
npm ERR! phantomjs@1.9.7-6 install: `node install.js`
npm ERR! Exit status 1

Solution: it turns out that I needed to set the proxy for npm:
$ npm config set proxy "http://company-proxy.com:8080"

Then I hit another problem, where the output had the error return code 407 (proxy authentication required), which means we need to add credentials to the proxy config:
$ npm config set proxy "http://username:password@company-proxy.com:8080"
That's it. Installation is successful!

Side notes:

Sometimes, you might need to set the registry and set https-proxy:
npm config set registry http://registry.npmjs.org/
npm config set https-proxy https://username:password@proxy.company.com:8080
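These npm config set commands just write entries to your ~/.npmrc, so the file ends up containing something along these lines (values here are placeholders):

proxy=http://username:password@company-proxy.com:8080
https-proxy=https://username:password@proxy.company.com:8080
registry=http://registry.npmjs.org/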


June 8, 2014

Reasons I Start Blogging



So... I started blogging after many years. Well, better late than never. You might wonder what made me change my mind.

For a start, I have always wanted to document what I learn, the problems I face at work, and their solutions. Normally I use Evernote, which is a great tool for keeping notes; but it would be much better if I could share those notes with the wider developer community, to which I owe so much. Many times I have found a solution to my current problem in someone's blog, so I hope I can return that favour and someone will find my blog useful.

In his keynote Jackstones: the Journey to Mastery, Dan North mentioned that a great way to learn is to teach. Blogging is a form of teaching, so by blogging I can reinforce what I learn.

I have been introduced to SEO (Search Engine Optimisation) in my work, and have learned the power of a good domain and content. That is an investment. An investment that grows with time.

Finally, this blog post, 15 reasons I think you should blog, really summarised it all and made me push the "Start Blogging" button.

I sincerely hope you will find something valuable from my blog.  :-)