August 9, 2015

Groovy script structure done right



Recently, I was tasked with writing a relative complex data migration script. The script involves connecting to a MySQL database, querying existing data and then inserting to a destination schema. Doing this in Bash would be quite hard to test and error prone. Some modern functional language would provide a better solution, e.g. Ruby, Scala, or Groovy. We opt to use Groovy as some of team members have Java background so there is less friction when doing maintenance. This blog post is to show you how to set up basic structure of Groovy scripting with Spock for Unit Testing and Gradle for building.

Groovy CLI

Firstly, we set up a basic script structure with Groovy CLI. Script: data-fix.groovy
#!/usr/bin/env groovy

def cli = new CliBuilder(usage:'data-fix')
cli.with {
    u longOpt: 'user', args: 1, argName: 'user', required: true, 'DB user'
    p longOpt: 'password', args: 1, argName: 'password', required: true, 'DB password'
    s longOpt: 'sourceSchema', args: 1, argName: 'sourceDbSchema', required: true, 'staging DB schema'
    d longOpt: 'destinationSchema', args: 1, argName: 'destDbSchema', required: true, 'production DB schema'
    h longOpt: 'host', args: 1, argName: 'dbHost', 'DB host, default to be localhost'
}

def opts = cli.parse(args)
if (!opts) {
    System.exit(1)
}

new Processor(opts).run()

Basic Processor class:
class Processor {
    def opts

    Processor(opts) {
        this.opts = opts
    }

    void run() {
        println "Running..."
    }
}

The above code can be viewed in this Github commit. Next up, we will set up Unit Testing.

Unit Testing with Spock and Gradle

Spock provides a nice testing framework. I am a fan of its easy mocking syntax and BDD (Behavioural Driven Development) syntax "given, when, then". One way to setup Spock in Groovy is using Gradle build and dependencies management.

By default, Gradle assumes certain directory structures: src/main/groovy, and src/test/groovy. (You can change the above structure as described here). We will move our code into the above directory structure, and will create an empty test file ProcessorSpec.groovy under src/test/groovy directory.
.
├── README.md
└── src
    ├── main
    │   └── groovy
    │       ├── data-fix.groovy
    │       └── Processor.groovy
    └── test
        └── groovy
            └── ProcessorSpec.groovy
Setting up build.gradle in the top directory:
apply plugin: "groovy"

version = "1.0"
description = "Spock Framework - Data fix Project"

// Spock works with Java 1.5 and above
//sourceCompatibility = 1.5

repositories {
  // Spock releases are available from Maven Central
  mavenCentral()
  // Spock snapshots are available from the Sonatype OSS snapshot repository
  maven { url "http://oss.sonatype.org/content/repositories/snapshots/" }
}

dependencies {
  // mandatory dependencies for using Spock
  compile "org.codehaus.groovy:groovy-all:2.4.1"
  testCompile "org.spockframework:spock-core:1.0-groovy-2.4"
  testCompile "cglib:cglib:2.2"
  testCompile "org.objenesis:objenesis:1.2"
}
Let's modify the file ProcessorSpec.groovy to have a failed test, so that we can confirm that test is actually run and everything is setup correctly.
import spock.lang.*

class ProcessSpec extends Specification {
    def "#first test"() {
        when:
        def a = true

        then:
        a == false
    }
}
Executing Gradle build to see the test failed:
$ gradle --info clean test
...
Gradle Test Executor 2 finished executing tests.

ProcessSpec > #first test FAILED
    Condition not satisfied:

    a == false
    | |
    | false
    true
        at ProcessSpec.#first test(ProcessorSpec.groovy:9)

1 test completed, 1 failed
The above changes can be viewed in this Github commit.
Gradle wrapper is great to ensure build is run the same way across different machine. On a machine that does not have Gradle installed, it will first download Gradle and execute the build task. We can setup Gradle wrapper with this easy command:
$ gradle wrapper
# The above command will generate wrapper script and we can execute our build with this command:
$ ./gradlew --info clean test

Adding libraries

We got the basic skeleton done. The next step is to add logic into our script. The script will connect to MySQL database, so we will add mysql-connector to the script. In addition, to debug script, I'm a fan of adding logging statements to the flow. We will use @Grab to add dependencies into the script data-fix.groovy.
file: data-fix.groovy
#!/usr/bin/env groovy

@GrabConfig(systemClassLoader=true)
@Grab('mysql:mysql-connector-java:5.1.27')
@Grab('log4j:log4j:1.2.17')
...

file: Processor.groovy
import groovy.sql.Sql
import org.apache.log4j.*
import groovy.util.logging.*

@Log4j
class Processor {
    def opts

    Processor(opts) {
        log.level = Level.DEBUG
        this.opts = opts
    }

    void run() {
        log.info "Running..."
    }
}
Running the script gives the expected log statement. However, running build now failed with this exception: Execution failed for task ':compileGroovy'.> org/apache/ivy/core/report/ResolveReport
[src/main/groovy] $ ./data-fix.groovy -h localhost -u root -p somepassword -s staging -d prod
INFO - Running...
[  top level dir] $ ./gradlew --info clean test
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':compileGroovy'.
> org/apache/ivy/core/report/ResolveReport
So what went wrong? @Grab is using Grape to manage dependencies, while Gradle has its own dependencies management. At this point, we have 2 options: use Gradle to manage all dependencies and execute script via Gradle, or mix and match between Gradle and Grape (Grape is for runtime, Gradle is only for testing). Both options have its own merits. For me, I prefer the simplicity of Grape at runtime, so I will continue with the later. 
We will need to configure build.gradle to ignore Grape:
test {                                        
  systemProperty 'groovy.grape.enable', 'false'  
}

compileGroovy {
  groovyOptions.forkOptions.jvmArgs = [ '-Dgroovy.grape.enable=false' ]
}
compileTestGroovy {
  groovyOptions.forkOptions.jvmArgs = [ '-Dgroovy.grape.enable=false' ]
}
The above change can be viewed in this Github commit.

Using this method will violate DRY (Don't Repeat Yourself), as dependencies are defined in 2 places: @Grab and in Gradle dependencies. You can have a look at mrhaki blog post if you want to invoke Groovy script from Gradle task. I found passing script command line options as Gradle run properties is a bit awkward. 

Adding more logic and tests

Simple logic - default localhost if host is not provided

Now that we have a structure going, we can add more logic into our script. The first easy one is set host to the parameter provided, otherwise default to 'localhost'.
file: ProcessorSpec.groovy
    def "#new set host to parameter, or default to localhost"() {
        expect:
        new Processor([]).host == 'localhost'
        new Processor([h: 'myserver']).host == 'myserver'
    }

file: Processor.groovy
    Processor(opts) {
        log.level = Level.DEBUG
        this.opts = opts
        this.host = opts.h ?: 'localhost'
    }

    void run() {
        log.info "Host               : $host"
        log.info "User               : ${opts.u}"
        log.info "Password           : ${opts.p}"
        log.info "Source schema      : ${opts.s}"
        log.info "Destination schema : ${opts.d}"
    }
Running test:
[  top level dir] $ ./gradlew --info clean test
BUILD SUCCESSFUL
[src/main/groovy] $ ./data-fix.groovy -h myserver -u root -p somepassword -s staging -d prod
INFO - Host               : myserver
INFO - User               : root
INFO - Password           : somepassword
INFO - Source schema      : staging
INFO - Destination schema : prod
[src/main/groovy]$ ./data-fix.groovy -u root -p somepassword -s staging -d prod
INFO - Host               : localhost
INFO - User               : root
INFO - Password           : somepassword
INFO - Source schema      : staging
INFO - Destination schema : prod
The above changes can be viewed in this Github commit.

Summary

As you can see, Groovy language is very easy to work with and powerful as a scripting language. Together with unit testing, you have confidence in your script doing the right thing and production ready. I truly believe you should Unit Test everything, including scripts; and the above is the setup to achieve just that.

References

January 21, 2015

Query with MongoDB


As part of my work, we have the pleasure of working with MongoDB. Coming from the SQL world where queries start with WHERE clause or cryptic JOIN, the MongoDB JavaScript file is a pleasure to work with. I will share with you some example queries in this post.
As a background, we implemented a feature where user can endorse terms, similar to the LinkedIn endorsement. We have concept of term, and endorsement. A term is just a representation of a piece of text. An endorsement is an association between a term, user, and listing that a user's currently endorsing.  Here's what they look like:
testx:PRIMARY> db.terms.findOne();
{
    "_id" : ObjectId("53aa7ac62ecea3e746beaef2"),
    "text" : "Locally Owned",
    "origin" : "TERM_DB",
    "_created" : NumberLong("1403681478027"),
    "_modified" : NumberLong("1403681478027")
}

testx:PRIMARY> db.endorsements.findOne();
{
    "_id" : ObjectId("53aa7ba8d0adeb95abec766f"),
    "listingId" : "14938311",
    "userId" : "1",
    "headingCode" : "33650",
    "term" : DBRef("terms", ObjectId("53aa7acc2ecea3e746beaf20")),
    "_created" : NumberLong("1403681704148"),
    "_modified" : NumberLong("1403681704148")
}

1. Updating using Regular expression

When we went live initially, there were a bunch of terms starting with "Authentic", such as "Authentic Italian", "Authentic Chinese". After a while, we didn't see much uptake for those terms. We decided to drop the word "Authentic". Good news is Mongo supports regex, so updating them is quite easy: searching and replace all terms starting with "Authentic" with an empty string.
function updateTermWithAuthentic() {
    //drop the word Authentic in front of "Authentic Some Cuisine", Authentic is 10 character including the space, hence the substring of 10.
    db.terms.find({
        text: { $regex: '^Authentic' }
    }).forEach(function(doc) {
        db.terms.update(
            { _id: doc._id }, 
            { $set: { text: doc.text.substring(10) } }
        )
    })
}
We could update a term text with a new text, with a bit of print logging:
function updateTermText(oldTerm, newTerm) {
    var cursor = db.terms.find({text: oldTerm});
    if (cursor.hasNext()) {
        db.terms.update(
            { text: oldTerm },
            { $set: { text: newTerm } }
        )
        print("Done - Processing '" + oldTerm + "' - replaced with new term: '" + newTerm + "'.");
    } else {
        print(">>> ERROR - Processing '" + oldTerm + "'... Don't have term '" + oldTerm + "' in the database, maybe typo?.");
    }
}

updateTermText("Good, High Quality", "Good or High Quality");

2. Count distinct

How many users submitted endorsements:
testx:PRIMARY> db.endorsements.distinct('userId').length;

3. Run query on data after a certain date / timestamp

How many endorsements after a certain date, for example after 14-Aug-2014:
testx:PRIMARY> db.endorsements.find({
    _created: {
        $gte: new Date(2014,7,15).getTime()
    }
}).count();  //month is 0-11 
//Sample output: 441
How many unique users submitting endorsements after a certain date, for example 14-Aug-2014:
testx:PRIMARY> db.runCommand({
    distinct: 'endorsements', 
    key: 'userId', 
    query: {
        _created: {
            $gte: new Date(2014,7,15).getTime()
        }
    }
}).values.length;
//Sample output: 113

4. Aggreation

Find the popular endorsement terms in each listing:
testx:PRIMARY> db.endorsements.aggregate(
    [
        {
            $group: {
                _id: { listingId: "$listingId", term: "$term" },
                count: { $sum: 1 }
            },
            {
                $sort: { count: -1 }
            }
        }
    ]
).forEach(function(myDoc) {
    print(myDoc._id.listingId + "," + myDoc._id.term + "," + myDoc.count);
}

/*
Sample output:
14813910,DBRef("terms", ObjectId("53aa7ae5d0adeb95abec7602")),10
13347379,DBRef("terms", ObjectId("53aa7ace2ecea3e746beaf34")),10
13501322,DBRef("terms", ObjectId("53aa7ae9d0adeb95abec761e")),7
13347379,DBRef("terms", ObjectId("53aa7ae1d0adeb95abec75e2")),6
*/
The DBRef and ID are a bit ugly. It would be nice to output the Term text there. With DBRef, we can modify the forEach function() above to perform another query to get the term text:
testx:PRIMARY> db.endorsements.aggregate(
//similar to the above
//...
).forEach(function(myDoc) {
    var termRef = myDoc._id.term
    var termText = db[termRef.$ref].findOne( { _id: (termRef.$id) } )
    print(myDoc._id.listingId + "," + termText.text + "," + myDoc.count);
});

/*
Sample output:
15495422,Cosy Venue,10
14524676,A La Carte,10
14813910,Healthy,7
13407783,Elderly Friendly,6
*/

5. Aggregation, count and sorting

Which user contributed the most number of endorsements? We need to aggregate based on userId.
testx:PRIMARY> db.endorsements.aggregate(
    [
        {
            $group: {
                _id: { userId: "$userId" },
                count: { $sum: 1 }
            }
        },
        {
            $sort: {count: -1}
        }
    ]
).forEach(function(myDoc) {
    print(myDoc._id.userId + "," + myDoc.count);
});

/*
Sample output: 
361259bc32c1bc79b8dffa4130f22a83ea7d1c,80
55b5ae81-000c-4a0e-87ee-b6a7cada5a01,38
4e31ccd3-371a-49a6-bfc7-e1f9c3e99672,24
b82e9b60-5e1b-42c2-8131-7ddc67cfd525,23
*/

6. Update the DBRef

Scenario: An existing term becomes obsolete, so we want to update all existing endorsements pointing to that obsolete terms to a different term. For example, term "Great Service" is obsolete and should be deleted, all existing endorsements should be updated to "Great or Professional Service".
First, find the term "Great Service", and the term "Great or Professional Service", note down the ObjectId for both.
// find the ID of "Great Service"
testx:PRIMARY> db.terms.find({text: "Great Service"}).pretty();
{
     "_id" : ObjectId("54a2123744c96256390bee87"),
     "className" : "models.endorsement.Term",
     "text" : "Great Service",
     "origin" : "TERM_DB",
     "_created" : NumberLong("1419907639466"),
     "_modified" : NumberLong("1419907639466")
}
// find the ID of “Great or Professional Service"
testx:PRIMARY> db.terms.find({text: "Great or Professional Service"}).pretty();
{
    "_id" : ObjectId("53d99b0b44c96256390bec17"),
    "className" : "models.endorsement.Term",
    "text" : "Great or Professional Service",
    "origin" : "TERM_DB",
    "_created" : NumberLong("1406769931603"),
    "_modified" : NumberLong("1406769931603")
}
Perform update for all the endorsements having term ID of "Great Service" to the ID of "Great or Professional Service":
testx:PRIMARY> db.endorsements.update(
    { 
        "term.$id": ObjectId("54a2123744c96256390bee87")
    },
    { 
        $set: {
            term: { $ref: "terms", $id: "53d99b0b44c96256390bec17" }
        }
    }
)
Finally, delete the term "Great Service":
// remove the term “Great Service"
testx:PRIMARY> db.terms.remove({text: "Great Service"})

7. Unique grouping from derived values

Scenario: we want to find out unique users contributing for each day. We can do distinct query based on userId and _created fields. However, _created is a timestamp, not a date. From the _created field, we need to derived a yyyy-MM-dd and perform distinct query. Luckily, MongoDB group() function and its keyf does support that.
testx:PRIMARY> db.endorsements.group({
    keyf: function(doc) {
        var createdDate = new Date(doc._created);
        return { id: doc.userId, year: createdDate.getFullYear(), month: createdDate.getMonth() + 1, date: createdDate.getDate() };
    },
    reduce: function ( curr, result ) { },
    initial: {}
}).forEach(function(doc) {
    print(doc.id + ", " + doc.year + "/" + doc.month + "/" + doc.date);
});

/*
Sample output:
55b5ae81-000c-4a0e-87ee-b6a7cada5a01, 2014/8/14
f2ded4d1-ff70-4b5e-88e4-e5e51d8a3780, 2014/8/14
a8b8a640-0c2d-443b-8cdd-a541d610cfe2, 2014/8/14
d589f10e-a97c-473d-aecf-ab847f5b42db, 2014/8/14
5d830945-9665-4298-aff1-6ab68b27c8af, 2014/8/14
*/

References