January 21, 2015

Query with MongoDB


As part of my work, we have been working with MongoDB. Coming from the SQL world where queries start with WHERE clause or cryptic JOIN, the MongoDB JavaScript file is easy to work with. I will share with you some example queries in this post.
As a background, we implemented a feature where user can endorse terms, similar to the LinkedIn endorsement. We have the concept of term, and endorsement. A term is just a representation of a piece of text. An endorsement is an association between a term, user, and listing that a user's currently endorsing.  Here's what they look like:
testx:PRIMARY> db.terms.findOne();
{
    "_id" : ObjectId("53aa7ac62ecea3e746beaef2"),
    "text" : "Locally Owned",
    "origin" : "TERM_DB",
    "_created" : NumberLong("1403681478027"),
    "_modified" : NumberLong("1403681478027")
}

testx:PRIMARY> db.endorsements.findOne();
{
    "_id" : ObjectId("53aa7ba8d0adeb95abec766f"),
    "listingId" : "14938311",
    "userId" : "1",
    "headingCode" : "33650",
    "term" : DBRef("terms", ObjectId("53aa7acc2ecea3e746beaf20")),
    "_created" : NumberLong("1403681704148"),
    "_modified" : NumberLong("1403681704148")
}

1. Updating using Regular expression

When we went live initially, there were a bunch of terms starting with "Authentic", such as "Authentic Italian", "Authentic Chinese". After a while, we didn't see much uptake for those terms. We decided to drop the word "Authentic". Good news is Mongo supports regex, so updating them is quite easy: searching and replace all terms starting with "Authentic" with an empty string.
function updateTermWithAuthentic() {
    //drop the word Authentic in front of "Authentic Some Cuisine", Authentic is 10 character including the space, hence the substring of 10.
    db.terms.find({
        text: { $regex: '^Authentic' }
    }).forEach(function(doc) {
        db.terms.update(
            { _id: doc._id }, 
            { $set: { text: doc.text.substring(10) } }
        )
    })
}
We could update a term text with a new text, with a bit of print logging:
function updateTermText(oldTerm, newTerm) {
    var cursor = db.terms.find({text: oldTerm});
    if (cursor.hasNext()) {
        db.terms.update(
            { text: oldTerm },
            { $set: { text: newTerm } }
        )
        print("Done - Processing '" + oldTerm + "' - replaced with new term: '" + newTerm + "'.");
    } else {
        print(">>> ERROR - Processing '" + oldTerm + "'... Don't have term '" + oldTerm + "' in the database, maybe typo?.");
    }
}

updateTermText("Good, High Quality", "Good or High Quality");

2. Count distinct

How many users submitted endorsements:
testx:PRIMARY> db.endorsements.distinct('userId').length;

3. Run query on data after a certain date/timestamp

How many endorsements after a certain date, for example after 14-Aug-2014:
testx:PRIMARY> db.endorsements.find({
    _created: {
        $gte: new Date(2014,7,15).getTime()
    }
}).count();  //month is 0-11 
//Sample output: 441
How many unique users submitting endorsements after a certain date, for example, 14-Aug-2014:
testx:PRIMARY> db.runCommand({
    distinct: 'endorsements', 
    key: 'userId', 
    query: {
        _created: {
            $gte: new Date(2014,7,15).getTime()
        }
    }
}).values.length;
//Sample output: 113

4. Aggregation

Find the popular endorsement terms in each listing:
testx:PRIMARY> db.endorsements.aggregate(
    [
        {
            $group: {
                _id: { listingId: "$listingId", term: "$term" },
                count: { $sum: 1 }
            },
            {
                $sort: { count: -1 }
            }
        }
    ]
).forEach(function(myDoc) {
    print(myDoc._id.listingId + "," + myDoc._id.term + "," + myDoc.count);
}

/*
Sample output:
14813910,DBRef("terms", ObjectId("53aa7ae5d0adeb95abec7602")),10
13347379,DBRef("terms", ObjectId("53aa7ace2ecea3e746beaf34")),10
13501322,DBRef("terms", ObjectId("53aa7ae9d0adeb95abec761e")),7
13347379,DBRef("terms", ObjectId("53aa7ae1d0adeb95abec75e2")),6
*/
The DBRef and ID are a bit ugly. It would be nice to output the Term text there. With DBRef, we can modify the forEach function() above to perform another query to get the term text:
testx:PRIMARY> db.endorsements.aggregate(
//similar to the above
//...
).forEach(function(myDoc) {
    var termRef = myDoc._id.term
    var termText = db[termRef.$ref].findOne( { _id: (termRef.$id) } )
    print(myDoc._id.listingId + "," + termText.text + "," + myDoc.count);
});

/*
Sample output:
15495422,Cosy Venue,10
14524676,A La Carte,10
14813910,Healthy,7
13407783,Elderly Friendly,6
*/

5. Aggregation, count and sorting

Which user contributed the most number of endorsements? We need to aggregate based on userId.
testx:PRIMARY> db.endorsements.aggregate(
    [
        {
            $group: {
                _id: { userId: "$userId" },
                count: { $sum: 1 }
            }
        },
        {
            $sort: {count: -1}
        }
    ]
).forEach(function(myDoc) {
    print(myDoc._id.userId + "," + myDoc.count);
});

/*
Sample output: 
361259bc32c1bc79b8dffa4130f22a83ea7d1c,80
55b5ae81-000c-4a0e-87ee-b6a7cada5a01,38
4e31ccd3-371a-49a6-bfc7-e1f9c3e99672,24
b82e9b60-5e1b-42c2-8131-7ddc67cfd525,23
*/

6. Update the DBRef

Scenario: An existing term becomes obsolete, so we want to update all existing endorsements pointing to that obsolete terms to a different term. For example, the term "Great Service" is obsolete and should be deleted, all existing endorsements should be updated to "Great or Professional Service".
First, find the term "Great Service", and the term "Great or Professional Service", note down the ObjectId for both.
// find the ID of "Great Service"
testx:PRIMARY> db.terms.find({text: "Great Service"}).pretty();
{
     "_id" : ObjectId("54a2123744c96256390bee87"),
     "className" : "models.endorsement.Term",
     "text" : "Great Service",
     "origin" : "TERM_DB",
     "_created" : NumberLong("1419907639466"),
     "_modified" : NumberLong("1419907639466")
}
// find the ID of “Great or Professional Service"
testx:PRIMARY> db.terms.find({text: "Great or Professional Service"}).pretty();
{
    "_id" : ObjectId("53d99b0b44c96256390bec17"),
    "className" : "models.endorsement.Term",
    "text" : "Great or Professional Service",
    "origin" : "TERM_DB",
    "_created" : NumberLong("1406769931603"),
    "_modified" : NumberLong("1406769931603")
}
Perform update for all the endorsements having term ID of "Great Service" to the ID of "Great or Professional Service":
testx:PRIMARY> db.endorsements.update(
    { 
        "term.$id": ObjectId("54a2123744c96256390bee87")
    },
    { 
        $set: {
            term: { $ref: "terms", $id: "53d99b0b44c96256390bec17" }
        }
    }
)
Finally, delete the term "Great Service":
// remove the term “Great Service"
testx:PRIMARY> db.terms.remove({text: "Great Service"})

7. Unique grouping from derived values

Scenario: we want to find out unique users contributing for each day. We can do a distinct query based on userId and _created fields. However, _created is a timestamp, not a date. From the _created field, we need to derive a yyyy-MM-dd and perform a distinct query. Luckily, MongoDB group() function and its keyf does support that.
testx:PRIMARY> db.endorsements.group({
    keyf: function(doc) {
        var createdDate = new Date(doc._created);
        return { id: doc.userId, year: createdDate.getFullYear(), month: createdDate.getMonth() + 1, date: createdDate.getDate() };
    },
    reduce: function ( curr, result ) { },
    initial: {}
}).forEach(function(doc) {
    print(doc.id + ", " + doc.year + "/" + doc.month + "/" + doc.date);
});

/*
Sample output:
55b5ae81-000c-4a0e-87ee-b6a7cada5a01, 2014/8/14
f2ded4d1-ff70-4b5e-88e4-e5e51d8a3780, 2014/8/14
a8b8a640-0c2d-443b-8cdd-a541d610cfe2, 2014/8/14
d589f10e-a97c-473d-aecf-ab847f5b42db, 2014/8/14
5d830945-9665-4298-aff1-6ab68b27c8af, 2014/8/14
*/

References