tag:blogger.com,1999:blog-58837047407962068052024-02-19T23:08:11.146+11:00Dev Daily HashA hash of my experience.linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.comBlogger8125tag:blogger.com,1999:blog-5883704740796206805.post-88261570411143929562017-09-11T13:39:00.003+10:002021-01-15T15:09:08.355+11:00Using AWS Athena to query CSV files in S3<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0sN_q1HgJ_T71mpS5H4aHSE-lKhli0a5h6-xx14V9GlvqOSudH0X5rbtbzGIR0IKuA_2N8NGkUDP5Rx_O69m-KYoR9xZJtGPkv3dr-Mw1x4MqZEGh_M6hiIx2X5z87W_benMhtnjUAPo_/s1600/athena.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="322" data-original-width="866" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0sN_q1HgJ_T71mpS5H4aHSE-lKhli0a5h6-xx14V9GlvqOSudH0X5rbtbzGIR0IKuA_2N8NGkUDP5Rx_O69m-KYoR9xZJtGPkv3dr-Mw1x4MqZEGh_M6hiIx2X5z87W_benMhtnjUAPo_/s320/athena.png" width="320" /></a></div>
<br />
<br />
Have you thought of trying AWS Athena to query your CSV files in S3? This post outlines the steps you need to take to get Athena parsing your files correctly. Let's walk through it step by step.<br />
<br />
<h2>
Pet data</h2>
Let's start with some simple data about our pets.<br />
<br />
<pre name="code">$ cat pet.csv
date_of_birth,pet_type,pet_name,weight,age
"July 10, 2016",cat,"Bella, The Great",10.5,1
"January 1, 2015",cat,Tiger,4.2,2
"January 10, 2007",cat,Chole,6.5,3
"March 10, 2007",dog,"Max, The Big",20.1,2
"June 2, 2007",dog,Molly,22.8,3
"August 1, 2011",dog,Lulu,40.4,2
"September 11, 2007",dog,Coco,35.6,5
</pre>
<br />
Our pet data has some dates, doubles, ints, and strings with commas in them. It is a good representation of a typical CSV file you might have. I uploaded this pet.csv to an S3 location, <code>s3://test-athena-linh/pet/</code>, then headed to AWS Athena to create the table. (Note: drop any existing table before running the CREATE statement.)<br />
<br />
<pre class="sql" name="code">CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
`date_of_birth` date,
`pet_type` string,
`pet_name` string,
`weight` double,
`age` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');
</pre>
<br />
Our DDL is quite ambitious with double, int, and date as the data types. Let's see if Athena can parse our data correctly. Viewing the data is interesting: with the above table definition, Athena (using LazySimpleSerDe) doesn't parse quoted commas correctly. Because of the comma in the middle of a field, the columns are shifted.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXg-i9MxfleOo5I9jCjFWe0D-MPmgESw7WH-tF6bkckeshxmaNcj1mTXiCpbAB4AkVmnIEGmVMadXF-W2uR90n4S8t1yjDReRU5mFMU_0DsADqAJrLBGdq00TnfXasDBL7TUtdvD1T7klg/s1600/athena+-+comma+is+broken.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1120" data-original-width="1600" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXg-i9MxfleOo5I9jCjFWe0D-MPmgESw7WH-tF6bkckeshxmaNcj1mTXiCpbAB4AkVmnIEGmVMadXF-W2uR90n4S8t1yjDReRU5mFMU_0DsADqAJrLBGdq00TnfXasDBL7TUtdvD1T7klg/s320/athena+-+comma+is+broken.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Columns are shifted with quoted comma</td></tr>
</tbody></table>
<br />
<b>Quirk #1:</b> with all default settings, Athena doesn't deal with quoted commas in CSV very well.<br />
<br />
LazySimpleSerDe doesn't handle quoted commas, but I discovered that OpenCSVSerde can, by specifying quoteChar = '"'. Let's give it a try:<br />
<br />
<pre class="sql" name="code">CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
`date_of_birth` date,
`pet_type` string,
`pet_name` string,
`weight` double,
`age` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'quoteChar' = '"',
'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');
</pre>
<br />
<br />
The CREATE TABLE statement ran fine, but I hit an error when viewing the data. The error message wasn't very helpful; it just mentioned the column rather than the actual value causing the error.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ4PcyRmc0xQq6njkGYfuyg-wfwqmQmBuM92qacqCBhZ63P8RucS0AY2fTZUTst6uyGpxDfqxF0IWj8c9o_i5AHAoRkdAdFBlF1IhgxOQbhzbHN6mzTixmtmveGR0cS1hjrW0AMS_F50zg/s1600/athena+quoteChar+error.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="885" data-original-width="1600" height="177" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ4PcyRmc0xQq6njkGYfuyg-wfwqmQmBuM92qacqCBhZ63P8RucS0AY2fTZUTst6uyGpxDfqxF0IWj8c9o_i5AHAoRkdAdFBlF1IhgxOQbhzbHN6mzTixmtmveGR0cS1hjrW0AMS_F50zg/s320/athena+quoteChar+error.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Athena error message is not very detailed</td></tr>
</tbody></table>
<br />
<b>Quirk #2:</b> When there is an error, there is little extra information about why it occurs.<br />
<br />
I am guessing the error relates to parsing the data types. Let's change all types to string and see what the data looks like.<br />
<br />
<pre class="sql" name="code">CREATE EXTERNAL TABLE IF NOT EXISTS athena_test.pet_data (
`date_of_birth` string,
`pet_type` string,
`pet_name` string,
`weight` string,
`age` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'quoteChar' = '"',
'field.delim' = ','
) LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('has_encrypted_data'='false');
</pre>
<br />
Now, with all string data types, our CSV is read correctly, quoted commas and all.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRa5Dj75suNYofh7dMFP9QUhIGx2LzWCgBt9GC04yI-G0R_wRqtBov_m9CE5l-X_8uQLflPMcbERkOoaGW8i1TFWJDGsBcB0O1PgBHnZAJ9GqkPr2CKNqvZIN33VtclH6vbC_jWaMhhcar/s1600/athena+quoteChar+data+is+read.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1125" data-original-width="1600" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRa5Dj75suNYofh7dMFP9QUhIGx2LzWCgBt9GC04yI-G0R_wRqtBov_m9CE5l-X_8uQLflPMcbERkOoaGW8i1TFWJDGsBcB0O1PgBHnZAJ9GqkPr2CKNqvZIN33VtclH6vbC_jWaMhhcar/s320/athena+quoteChar+data+is+read.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">No error when parsing String data type</td></tr>
</tbody></table>
<br />
From the output, we can see the header row is included, which breaks type parsing. Searching the Internet suggested OpenCSVSerde has a TBLPROPERTIES config, 'skip.header.line.count'='1', which could be useful. However, trying it out in Athena didn't lead to the expected outcome. AWS Support confirmed: "it's a known issue in Athena that property "skip.header.line.count"="1" does not work because of the following Presto issue: <a href="https://github.com/prestodb/presto/issues/1848" rel="nofollow" target="_blank">https://github.com/prestodb/presto/issues/1848</a>. Hive does honour the skip.header.line property and skips header while querying the table. However, Presto displays the header record when querying the same table. The Athena Product team is aware of this issue and is planning to fix it."<br />
<br />
<b>Quirk #3:</b> the header row is included in the result set when using OpenCSVSerde.<br />
<br />
We can certainly exclude the header row with a query condition, but we still can't do arithmetic operations (SUM, AVG) on strings.<br />
<br />
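For illustration, the failing query looked something like the following (a reconstruction; the screenshot below shows the actual error Athena returned):<br />
<br />
<pre class="sql" name="code">-- weight is declared as a string, so SUM() fails
SELECT SUM(weight) FROM athena_test."pet_data";
</pre>
<br />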
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-gZGMSl_i6Y9dVWU47Z8ZUSPG1ELVsXKmApPYiBaIBX8xYUK3je6tHJ0ZwbQwuICfc-kjw1FEepVcnpq_pd-OULqxiLqCWEeYpXn0MCnoQCzfYiSXypjn6ix97yDSuKULJnpeonriE658/s1600/athena+sum+on+strings.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="862" data-original-width="1600" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-gZGMSl_i6Y9dVWU47Z8ZUSPG1ELVsXKmApPYiBaIBX8xYUK3je6tHJ0ZwbQwuICfc-kjw1FEepVcnpq_pd-OULqxiLqCWEeYpXn0MCnoQCzfYiSXypjn6ix97yDSuKULJnpeonriE658/s320/athena+sum+on+strings.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Arithmetic on Strings doesn't work</td></tr>
</tbody></table>
<br />
To move forward with our data while accommodating all the Athena quirks so far, we will need to create the table with every column as a string and do type conversion on the fly.<br />
<br />
<pre class="sql" name="code">SELECT SUM(weight)
FROM
(
SELECT
date_of_birth,
pet_type,
pet_name,
cast(weight AS DOUBLE) as weight,
cast(age AS INTEGER) as age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'
)
</pre>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4yHEhZp02fxUMZT1qtnIwZon4WiTEDzSOwfi3UKzfwHSJUidTf3PtCsad02d-Yd490A-IQP51XGzUx22chPH8Koyv0arbug-qdJhLyRre0OLMH_IiQSKm_TIx4e7bbEb4jeumrtvAihhg/s1600/athena+cast+type.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="848" data-original-width="1600" height="169" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4yHEhZp02fxUMZT1qtnIwZon4WiTEDzSOwfi3UKzfwHSJUidTf3PtCsad02d-Yd490A-IQP51XGzUx22chPH8Koyv0arbug-qdJhLyRre0OLMH_IiQSKm_TIx4e7bbEb4jeumrtvAihhg/s320/athena+cast+type.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Typecasting in an inner query allows an outer query to do arithmetic</td></tr>
</tbody></table>
<br />
Yayyy, we got the expected result.<br />
<br />
Let's attempt to parse the date in the format "MMMM dd, yyyy" (e.g. "July 10, 2017"), as we might want to do arithmetic on dates rather than leaving them as strings. Presto has a <code>date_parse</code> function taking a string and a pattern; the date patterns are defined here: <a href="https://prestodb.io/docs/current/functions/datetime.html" rel="nofollow" target="_blank">https://prestodb.io/docs/current/functions/datetime.html</a>. The above date pattern is represented as <code>'%M %d, %Y'</code>.<br />
<br />
<pre class="sql" name="code">select date_parse('July 10, 2017', '%M %d, %Y');
-- result: 2017-07-10 00:00:00.000
</pre>
<br />
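Once parsed, the dates support arithmetic via Presto's datetime functions, for example <code>date_diff</code> (a sketch; the result depends on when you run it, so no output is shown):<br />
<br />
<pre class="sql" name="code">-- days between a pet's date of birth and now
SELECT date_diff('day', date_parse('July 10, 2016', '%M %d, %Y'), now());
</pre>
<br />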
Our complete casting of pet_data to its right data types looks like this:<br />
<br />
<pre class="sql" name="code">SELECT
date_parse(date_of_birth, '%M %d, %Y') as date_of_birth,
pet_type,
pet_name,
cast(weight AS DOUBLE) as weight,
cast(age AS INTEGER) as age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'
</pre>
<br />
Unfortunately, we will need to wrap this as a subquery in every query we run on our pet_data table. It would be better to create a view that encapsulates the inner query, but as far as I know, Athena doesn't support views. I tried a version of Presto's CREATE TABLE AS but hit an error; I think Athena SQL is not the full set of Presto SQL: <a href="https://prestodb.io/docs/current/sql/create-table-as.html" rel="nofollow" target="_blank">https://prestodb.io/docs/current/sql/create-table-as.html</a><br />
<br />
<pre class="sql" name="code">-- This query has error
CREATE TABLE athena_test.pet_data_typed(date_of_birth, pet_type, pet_name, weight, age)
AS
SELECT
date_of_birth,
pet_type,
pet_name,
cast(weight AS DOUBLE) as weight,
cast(age AS INTEGER) as age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'
</pre>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-_30XLjDoW8PY4pyNZIbbCFx8t7GVhF5TAH0kYjI6qWQCweYjIjGrkvLC3feBHKLxy034sf7GVOkSwNtmzvlcnA0JzTMCLQnYRy5PFEFQeMKxZcPFEOvJTr2szcQK6lJvCer_2t1jIKDj/s1600/athena+create+table+as+error.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="697" data-original-width="1600" height="139" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-_30XLjDoW8PY4pyNZIbbCFx8t7GVhF5TAH0kYjI6qWQCweYjIjGrkvLC3feBHKLxy034sf7GVOkSwNtmzvlcnA0JzTMCLQnYRy5PFEFQeMKxZcPFEOvJTr2szcQK6lJvCer_2t1jIKDj/s320/athena+create+table+as+error.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Athena doesn't support CREATE TABLE AS</td></tr>
</tbody></table>
<br />
<b>Quirk #4:</b> Athena doesn't support views (or CREATE TABLE AS)<br />
<br />
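One workaround, until views are supported, is a common table expression (WITH clause), which Presto does support. You still have to repeat the casting block in each query, but it keeps the query readable (a sketch):<br />
<br />
<pre class="sql" name="code">WITH pet_typed AS (
  SELECT
    date_parse(date_of_birth, '%M %d, %Y') as date_of_birth,
    pet_type,
    pet_name,
    cast(weight AS DOUBLE) as weight,
    cast(age AS INTEGER) as age
  FROM athena_test."pet_data"
  WHERE date_of_birth &lt;&gt; 'date_of_birth'
)
SELECT pet_type, AVG(weight) AS avg_weight
FROM pet_typed
GROUP BY pet_type
</pre>
<br />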
From my trials with Athena so far, I am quite disappointed with how Athena handles CSV files; there is a lot of fiddling around with typecasting. I'm not sure what I did wrong, so if you have a better way, please point out how I could improve on the above, and thanks in advance. I've heard Athena might work better with JSON or Avro files, so I will try that next. I will also write about my experience with Google BigQuery in future blog posts. Stay tuned!<br />
<br />linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com3tag:blogger.com,1999:blog-5883704740796206805.post-37100761146500975732017-02-25T22:07:00.003+11:002021-01-15T15:10:26.776+11:00Sampling all data in tables in MS SQL Server<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmuWnx5IO5PnqtUOne61MA7XHcbfGAvh019ciXopHKt4JSC052YUpe_DM1yCaEjg4j6T9cfZP4nN-5GgX_D7PdVvt6lY1NGBRatjxU3Pk2kIP20mt4y-R1A9KWJHMop4Wtgv-QkE9RutkO/s1600/How+To+Use+Operator+Precedence+In+MSSQL+Database+Server+Tutorial+Example.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmuWnx5IO5PnqtUOne61MA7XHcbfGAvh019ciXopHKt4JSC052YUpe_DM1yCaEjg4j6T9cfZP4nN-5GgX_D7PdVvt6lY1NGBRatjxU3Pk2kIP20mt4y-R1A9KWJHMop4Wtgv-QkE9RutkO/s320/How+To+Use+Operator+Precedence+In+MSSQL+Database+Server+Tutorial+Example.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sampling data</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Data warehouses come in different shapes and sizes; our legacy data warehouse is Microsoft SQL Server 2008 with many tables inside. Recently we had to export a sample of every table in our database for auditing purposes, specifically to verify there is no PII (Personally Identifiable Information) in our database.</div>
<div>
<br /></div>
<div>
A simple approach is running SELECT TOP 100 * against each table. This can be a little tedious if you have hundreds of tables, so one way is to loop through each of the tables.</div>
<div>
<br /></div>
<div>
This query will return the names of all the tables in the current database.</div>
<pre class="brush:sql">SELECT name FROM sysobjects WHERE xtype='U' order by 1;
</pre>
<div>
Or, if your database is organised into schemas, you will need the schema name together with the table name.<br />
<pre class="brush:sql">SELECT
t.name AS table_name,
SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;
</pre>
<br />
With that principle, we can loop through each table and apply SELECT TOP 100 * on each of the tables.<br />
<pre class="brush:sql">DECLARE @table_name nvarchar(200)
DECLARE @schema_name nvarchar(200)
DECLARE @sql nvarchar(max)
DECLARE db_cursor CURSOR FOR
SELECT
t.name AS table_name,
SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @table_name, @schema_name
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT @schema_name + '.' + @table_name
SELECT
@sql = N'SELECT TOP 100 * FROM [' + @schema_name + '].[' + @table_name + ']'
EXEC sp_executesql @sql;
FETCH NEXT FROM db_cursor INTO @table_name, @schema_name
END
CLOSE db_cursor
DEALLOCATE db_cursor
</pre>
<br />
Well, it turns out that MSSQL has a built-in stored procedure to loop through each of the tables.<br />
<pre class="brush:sql">exec sp_MSforeachtable @command1='SELECT TOP 100 * FROM ?'
</pre>
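The '?' placeholder is replaced with the qualified table name. If you also want the table name printed above each result set, you can add a PRINT to the command (a small variation on the above):<br />
<pre class="brush:sql">exec sp_MSforeachtable @command1='PRINT ''?''; SELECT TOP 100 * FROM ?'
</pre>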
<br />
Another improvement: the statement 'SELECT TOP 100 *' does just that: it returns the top 100 rows of the table, which might not give a good idea of the data in it. The top 100 rows could all be bad data and not representative of the table as a whole. We can use 'TABLESAMPLE' to get more random rows, a true sampling of the data.<br />
<pre class="brush:sql">SELECT * FROM [TABLENAME] TABLESAMPLE(100 ROWS);
-- or
SELECT * FROM [TABLENAME] TABLESAMPLE(1 PERCENT);
</pre>
</div>
So sampling the data in all tables can be achieved with a one-liner:
<br />
<pre class="brush:sql">exec sp_MSforeachtable @command1='SELECT * FROM ? TABLESAMPLE(100 ROWS)'
</pre>
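One caveat worth noting: TABLESAMPLE works at the page level, so the number of rows returned is approximate and can even be zero for very small tables. If you need a repeatable sample, T-SQL also accepts a REPEATABLE clause with a seed (a sketch):<br />
<pre class="brush:sql">SELECT * FROM [TABLENAME] TABLESAMPLE(100 ROWS) REPEATABLE(42);
</pre>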
linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com1tag:blogger.com,1999:blog-5883704740796206805.post-28767950433393569352016-01-28T23:36:00.002+11:002021-01-15T15:11:06.150+11:00How to reduce AWS Windows server creation time<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsrG46Wf2oiD8JXX0NlWyCZ7pkbaDTIW-OVrNWmEU-Gy_AtuVDAMmend7UoS0HJDuyws7gdD0wigsP9t9QE9FV91hYzNrWjQrXlf428QQt4NoAE2h6vRVvEnUMkps4jsrFLB0UyFVOepW4/s1600/Amazon-Web-Services-AWS.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsrG46Wf2oiD8JXX0NlWyCZ7pkbaDTIW-OVrNWmEU-Gy_AtuVDAMmend7UoS0HJDuyws7gdD0wigsP9t9QE9FV91hYzNrWjQrXlf428QQt4NoAE2h6vRVvEnUMkps4jsrFLB0UyFVOepW4/s320/Amazon-Web-Services-AWS.jpg" width="320" /></a></div>
<span style="font-family: inherit;"><br /></span><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;">So I've been working with AWS cloud to host some Windows servers. Embracing the concept "infrastructure as code", I coded the server set up as <a href="https://aws.amazon.com/cloudformation/aws-cloudformation-templates/" rel="nofollow" target="_blank">AWS CloudFormation template</a>. As part of a server creation, some custom scripts might need to be run to initialise the server. Those scripts can be specified in an <span style="background-color: white; font-size: 15px;"><a href="http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-init.html" rel="nofollow" target="_blank">AWS::CloudFormation::Init</a> block. A snippet of that is below:</span></span><br />
<pre class="brush:plain"> "WindowsInstance": {
"Metadata": {
"AWS::CloudFormation::Init": {
"config": {
"commands": {
"00-configEnvAndCopyScripts": {
"command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\config_env_and_copy_scripts.ps1"
},
"01-settingTime": {
"command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\set_time.ps1"
},
...
</pre>
<span style="font-family: inherit;"><span style="background-color: white; font-size: 15px;"><br /></span></span><span style="font-family: inherit;"><span style="background-color: white; font-size: 15px;">All well and good so far. But I can tell you how much I hate Windows servers... In the AWS cloud, it can take up to 30 minutes to spin up a Windows server. Luckily, AWS outputs a log of the initialisation process to the file <code>cfn-init.log</code>, so we can trace through what is going on. The output looks something like this:</span></span><br />
<pre class="brush:plain">2016-01-14 05:08:50,957 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.ap-southeast-2.amazonaws.com
2016-01-14 05:08:50,957 [DEBUG] Describing resource WindowsInstance in stack arn:aws:cloudformation:ap-southeast-2:4623423112354:stack/tableau-a/abcdef-123456
2016-01-14 05:08:51,535 [INFO] -----------------------Starting build-----------------------
2016-01-14 05:08:51,737 [DEBUG] Creating Scheduled Task for cfn-init resume
2016-01-14 05:08:51,816 [DEBUG] Scheduled Task created
2016-01-14 05:08:51,832 [INFO] Running configSets: default
2016-01-14 05:08:51,832 [INFO] Running configSet default
2016-01-14 05:08:51,832 [INFO] Running config config
2016-01-14 05:08:51,832 [DEBUG] No packages specified
2016-01-14 05:08:51,832 [DEBUG] No groups specified
2016-01-14 05:08:51,832 [DEBUG] No users specified
2016-01-14 05:08:51,832 [DEBUG] No sources specified
2016-01-14 05:08:51,878 [DEBUG] Running command 00-configEnvAndCopyScripts
2016-01-14 05:08:51,878 [DEBUG] No test for command 00-configEnvAndCopyScripts
2016-01-14 05:10:47,171 [INFO] Command 00-configEnvAndCopyScripts succeeded
2016-01-14 05:10:47,171 [DEBUG] Command 00-configEnvAndCopyScripts output: The operation completed successfully.
2016-01-14 05:10:47,173 [INFO] Waiting 60 seconds for reboot
2016-01-14 05:10:48,187 [DEBUG] Running command 01-settingTime
2016-01-14 05:10:48,187 [DEBUG] No test for command 01-settingTime
2016-01-14 05:10:48,437 [INFO] Command 01-settingTime succeeded
2016-01-14 05:10:48,437 [DEBUG] Command 01-settingTime output: The operation completed successfully.
2016-01-14 05:10:48,437 [INFO] Waiting 60 seconds for reboot
</pre>
<span style="font-family: inherit;">Looking at the output, I was curious about the line <code>Waiting 60 seconds for reboot</code>. What was it doing? Why does it wait for 60 seconds? I searched around and finally got an answer: by default, </span>every command waits for 60 seconds in case it causes a restart. But if you are sure that your command does NOT cause a restart, you shouldn't have to wait. In that case, you can put <code>"waitAfterCompletion": "0"</code> after the command. It saves one minute per command, so if you have 60 commands, that is 60 minutes saved.<br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">The new CloudFormation template looks like this:</span><br />
<pre class="brush:plain">
"WindowsInstance": {
"Metadata": {
"AWS::CloudFormation::Init": {
"config": {
"commands": {
"00-configEnvAndCopyScripts": {
"command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\config_env_and_copy_scripts.ps1",
"waitAfterCompletion": "0"
},
"01-settingTime": {
"command": "powershell.exe -ExecutionPolicy Unrestricted c:\\cfn\\scripts\\set_time.ps1",
"waitAfterCompletion": "0"
},
...
</pre>
<span style="font-family: inherit;">Unfortunately, you need to repeat that <code>waitAfterCompletion</code> block after every command. As far as I know, there is no way to set the default wait to 0 seconds. If you find a better way to achieve this, please leave a comment below. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">So you might say, "why don't you set everything up, then build an AMI from it, rather than doing the initialisation every time?". Good question! I like to build everything from scratch, as I know exactly what goes into the server, and those changes are version controlled. An AMI is like a black box, so you want to keep it to the minimum; if you lose that AMI, you might not know how to rebuild it.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Thanks and good luck with your AWS adventure.</span>linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com2tag:blogger.com,1999:blog-5883704740796206805.post-39174979767431089592015-08-09T23:39:00.003+10:002021-01-15T15:14:12.769+11:00Groovy script structure done right<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivndmPjFMCKrv33Rf8h8sfsH_Q5x5OnBfOnwQpKq9HnZl7JSTdqm2yKmJnJDmYWG3c7HswP3chYdjXJuCe5E-6HqMxgmctCRI6sOzjXm3LOanO6wPI_WRJXNL4vA7MqTXHBRpoEB6_pSfi/s1600/Groovy-logo.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivndmPjFMCKrv33Rf8h8sfsH_Q5x5OnBfOnwQpKq9HnZl7JSTdqm2yKmJnJDmYWG3c7HswP3chYdjXJuCe5E-6HqMxgmctCRI6sOzjXm3LOanO6wPI_WRJXNL4vA7MqTXHBRpoEB6_pSfi/s320/Groovy-logo.png" width="320" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Recently, I was tasked with writing a relatively complex data migration script. The script involves connecting to a MySQL database, querying existing data, and then inserting it into a destination schema. Doing this in Bash would be quite hard to test and error-prone; a modern language such as Ruby, Scala, or Groovy provides a better solution. We opted for Groovy, as some of the team members have a Java background, so there is less friction when doing maintenance. This blog post shows you how to set up the basic structure of Groovy scripting, with Spock for unit testing and Gradle for building.</div>
<div>
<br /></div>
<h3>
Groovy CLI</h3>
<div>
Firstly, we set up a basic script structure with Groovy CLI. Script: <code>data-fix.groovy</code>
</div>
<div>
<pre class="brush:groovy">#!/usr/bin/env groovy
def cli = new CliBuilder(usage:'data-fix')
cli.with {
u longOpt: 'user', args: 1, argName: 'user', required: true, 'DB user'
p longOpt: 'password', args: 1, argName: 'password', required: true, 'DB password'
s longOpt: 'sourceSchema', args: 1, argName: 'sourceDbSchema', required: true, 'staging DB schema'
d longOpt: 'destinationSchema', args: 1, argName: 'destDbSchema', required: true, 'production DB schema'
h longOpt: 'host', args: 1, argName: 'dbHost', 'DB host, default to be localhost'
}
def opts = cli.parse(args)
if (!opts) {
System.exit(1)
}
new Processor(opts).run()
</pre>
<br /></div>
<div>
Basic <code>Processor</code> class:</div>
<div>
<pre class="brush:groovy">class Processor {
def opts
Processor(opts) {
this.opts = opts
}
void run() {
println "Running..."
}
}
</pre>
</div>
<div>
<br />
The above code can be viewed in this <a href="https://github.com/linhpham/groovy-script-structure/commit/f938aa786fb44907d66fb8dec2b51c20236eb0b4" rel="nofollow" target="_blank">Github commit</a>. Next up, we will set up Unit Testing.<br />
<br />
<h3>
Unit Testing with Spock and Gradle</h3>
<a href="https://code.google.com/p/spock/" rel="nofollow" target="_blank">Spock</a> provides a nice testing framework. I am a fan of its easy mocking syntax and its BDD (Behaviour-Driven Development) "given, when, then" syntax. One way to set up Spock in Groovy is by using <a href="https://gradle.org/" rel="nofollow" target="_blank">Gradle</a> for build and dependency management.<br />
<br />
By default, Gradle assumes certain directory structures: <code>src/main/groovy</code>, and <code>src/test/groovy</code>. (You can change the above structure as described <a href="https://docs.gradle.org/current/userguide/groovy_plugin.html" rel="nofollow" target="_blank">here</a>). We will move our code into the above directory structure and will create an empty test file <code>ProcessorSpec.groovy</code> under <code>src/test/groovy</code> directory.</div>
<div>
<pre>.
├── README.md
└── src
├── main
│ └── groovy
│ ├── data-fix.groovy
│ └── Processor.groovy
└── test
└── groovy
└── ProcessorSpec.groovy
</pre>
</div>
<div>
Setting up <code>build.gradle</code> in the top directory:</div>
<div>
<pre class="brush:groovy">apply plugin: "groovy"
version = "1.0"
description = "Spock Framework - Data fix Project"
// Spock works with Java 1.5 and above
//sourceCompatibility = 1.5
repositories {
// Spock releases are available from Maven Central
mavenCentral()
// Spock snapshots are available from the Sonatype OSS snapshot repository
maven { url "http://oss.sonatype.org/content/repositories/snapshots/" }
}
dependencies {
// mandatory dependencies for using Spock
compile "org.codehaus.groovy:groovy-all:2.4.1"
testCompile "org.spockframework:spock-core:1.0-groovy-2.4"
testCompile "cglib:cglib:2.2"
testCompile "org.objenesis:objenesis:1.2"
}
</pre>
</div>
<div>
Let's modify the file <code>ProcessorSpec.groovy</code> to have a failing test, so that we can confirm the test is actually run and everything is set up correctly.</div>
<div>
<pre class="brush:groovy">import spock.lang.*
class ProcessSpec extends Specification {
def "#first test"() {
when:
def a = true
then:
a == false
}
}
</pre>
</div>
<div>
Execute the Gradle build and see the test fail:
</div>
<div>
<pre class="brush:plain">$ gradle --info clean test
...
Gradle Test Executor 2 finished executing tests.
ProcessSpec > #first test FAILED
Condition not satisfied:
a == false
| |
| false
true
at ProcessSpec.#first test(ProcessorSpec.groovy:9)
1 test completed, 1 failed
</pre>
</div>
<div>
The above changes can be viewed in this <a href="https://github.com/linhpham/groovy-script-structure/commit/8f76a055c5b802b681252b22033035adcb7d282f" rel="nofollow" target="_blank">Github commit</a>.</div>
<div>
The Gradle wrapper is great for ensuring the build runs the same way across different machines. On a machine that does not have Gradle installed, it will first download Gradle and then execute the build task. We can set up the Gradle wrapper with this easy command:
</div>
<div>
<pre class="brush:plain">$ gradle wrapper
# The above command will generate wrapper script and we can execute our build with this command:
$ ./gradlew --info clean test
</pre>
</div>
<h3>
Adding libraries</h3>
<div>
We have the basic skeleton done. The next step is to add logic to our script. The script will connect to the MySQL database, so we will add the <code>mysql-connector</code> dependency. In addition, to debug the script, I'm a fan of adding logging statements to the flow. We will use @Grab to add the dependencies to the script <code>data-fix.groovy</code>.</div>
<div>
<pre class="brush:groovy">file: data-fix.groovy
#!/usr/bin/env groovy
@GrabConfig(systemClassLoader=true)
@Grab('mysql:mysql-connector-java:5.1.27')
@Grab('log4j:log4j:1.2.17')
...
file: Processor.groovy
import groovy.sql.Sql
import org.apache.log4j.*
import groovy.util.logging.*
@Log4j
class Processor {
def opts
Processor(opts) {
log.level = Level.DEBUG
this.opts = opts
}
void run() {
log.info "Running..."
}
}
</pre>
</div>
<div>
Running the script gives the expected log statement. However, the Gradle build now fails with this exception: <code>Execution failed for task ':compileGroovy'.> org/apache/ivy/core/report/ResolveReport</code>
</div>
<div>
<pre class="brush:plain">[src/main/groovy] $ ./data-fix.groovy -h localhost -u root -p somepassword -s staging -d prod
INFO - Running...
[ top level dir] $ ./gradlew --info clean test
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':compileGroovy'.
> org/apache/ivy/core/report/ResolveReport
</pre>
</div>
<div>
So what went wrong? @Grab uses Grape to manage dependencies, while Gradle has its own dependency management. At this point, we have two options: use Gradle to manage all dependencies and execute the script via Gradle, or mix and match Gradle and Grape (Grape for runtime, Gradle only for testing). Both options have their merits. I prefer the simplicity of Grape at runtime, so I will continue with the latter. </div>
<div>
We will need to configure <code>build.gradle</code> to ignore Grape:
</div>
<div>
<pre class="brush:groovy">test {
systemProperty 'groovy.grape.enable', 'false'
}
compileGroovy {
groovyOptions.forkOptions.jvmArgs = [ '-Dgroovy.grape.enable=false' ]
}
compileTestGroovy {
groovyOptions.forkOptions.jvmArgs = [ '-Dgroovy.grape.enable=false' ]
}
</pre>
</div>
<div>
The above change can be viewed in this <a href="https://github.com/linhpham/groovy-script-structure/commit/ffaf1f67ab0a88785fcee444f01600875e755a76" rel="nofollow" target="_blank">Github commit</a>.
<br />
<br />
Using this method violates DRY (Don't Repeat Yourself), as dependencies are defined in two places: in the @Grab annotations and in the Gradle dependencies block. Have a look at <a href="http://mrhaki.blogspot.com.au/2014/09/gradle-goodness-running-groovy-scripts.html" target="_blank">mrhaki's blog post</a> if you want to invoke the Groovy script from a Gradle task instead; I found passing script command-line options as Gradle run properties a bit awkward. </div>
<h3>
Adding more logic and tests</h3>
<h4>
Simple logic - default localhost if the host is not provided</h4>
<div>
Now that we have a structure going, we can add more logic to our script. The first easy one is to set the host to the parameter provided, defaulting to 'localhost' otherwise.
</div>
<div>
<pre class="brush:groovy">file: ProcessorSpec.groovy
def "#new set host to parameter, or default to localhost"() {
expect:
new Processor([]).host == 'localhost'
new Processor([h: 'myserver']).host == 'myserver'
}
file: Processor.groovy
Processor(opts) {
log.level = Level.DEBUG
this.opts = opts
this.host = opts.h ?: 'localhost'
}
void run() {
log.info "Host : $host"
log.info "User : ${opts.u}"
log.info "Password : ${opts.p}"
log.info "Source schema : ${opts.s}"
log.info "Destination schema : ${opts.d}"
}
</pre>
<pre class="brush:plain">Running test:
[ top level dir] $ ./gradlew --info clean test
BUILD SUCCESSFUL
[src/main/groovy] $ ./data-fix.groovy -h myserver -u root -p somepassword -s staging -d prod
INFO - Host : myserver
INFO - User : root
INFO - Password : somepassword
INFO - Source schema : staging
INFO - Destination schema : prod
[src/main/groovy]$ ./data-fix.groovy -u root -p somepassword -s staging -d prod
INFO - Host : localhost
INFO - User : root
INFO - Password : somepassword
INFO - Source schema : staging
INFO - Destination schema : prod
</pre>
</div>
<div>
The above changes can be viewed in this <a href="https://github.com/linhpham/groovy-script-structure/commit/c00184acf213d46d953add12fdceb44432e7dcf9" rel="nofollow" target="_blank">Github commit</a>.
</div>
<h3>
Summary</h3>
<div>
As you can see, Groovy is very easy to work with and powerful as a scripting language. Together with unit testing, you can be confident that your script does the right thing and is production-ready. I truly believe you should unit test everything, including scripts; the above is the setup to achieve just that.
</div>
<h3>
References</h3>
<div>
<ul>
<li><a href="http://stackoverflow.com/questions/18173908/error-compiling-a-groovy-project-using-grab-annotation" rel="nofollow" target="_blank">http://stackoverflow.com/questions/18173908/error-compiling-a-groovy-project-using-grab-annotation</a></li>
<li><a href="http://stackoverflow.com/questions/5886401/general-error-during-conversion-no-suitable-classloader-found-for-grab" rel="nofollow" target="_blank">http://stackoverflow.com/questions/5886401/general-error-during-conversion-no-suitable-classloader-found-for-grab</a></li>
<li><a href="http://stackoverflow.com/questions/16471096/any-alternative-to-grabconfig" rel="nofollow" target="_blank">http://stackoverflow.com/questions/16471096/any-alternative-to-grabconfig</a></li>
<li><a href="http://stackoverflow.com/questions/17360719/running-groovy-scripts-from-gradle" rel="nofollow" target="_blank">http://stackoverflow.com/questions/17360719/running-groovy-scripts-from-gradle</a></li>
</ul>
</div>
linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com1tag:blogger.com,1999:blog-5883704740796206805.post-73577380764753456802015-01-21T10:52:00.001+11:002021-01-15T15:16:26.199+11:00Query with MongoDB<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh98wpKgmuz5uWZX7iWEnPwTyOPRIxPNjfTGo5ffP9wsSfSY1ELSlLYx2IBrSi0c8xDyDwJeLYwGwnD4xom-7WP-59x8Q3BpH1CXzu3nDN1uSMCKsDPoCVrhvqq6zQNLqYlvYdslvWe0YCS/s1600/mongodb.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh98wpKgmuz5uWZX7iWEnPwTyOPRIxPNjfTGo5ffP9wsSfSY1ELSlLYx2IBrSi0c8xDyDwJeLYwGwnD4xom-7WP-59x8Q3BpH1CXzu3nDN1uSMCKsDPoCVrhvqq6zQNLqYlvYdslvWe0YCS/s320/mongodb.png" width="320" /></a></div>
<br />
As part of my work, we have been working with <a href="http://www.mongodb.org/" target="_blank" title="MongoDB">MongoDB</a>. Coming from the SQL world of WHERE clauses and cryptic JOINs, I find the MongoDB JavaScript file easy to work with. I will share some example queries in this post.<br />
As background, we implemented a feature where users can endorse terms, similar to LinkedIn endorsements. We have the concepts of <code>term</code> and <code>endorsement</code>. A <code>term</code> is just a representation of a piece of text. An <code>endorsement</code> is an association between a term, a user, and the listing that the user is currently endorsing. Here's what they look like:<br />
<pre class="brush:plain">testx:PRIMARY> db.terms.findOne();
{
"_id" : ObjectId("53aa7ac62ecea3e746beaef2"),
"text" : "Locally Owned",
"origin" : "TERM_DB",
"_created" : NumberLong("1403681478027"),
"_modified" : NumberLong("1403681478027")
}
testx:PRIMARY> db.endorsements.findOne();
{
"_id" : ObjectId("53aa7ba8d0adeb95abec766f"),
"listingId" : "14938311",
"userId" : "1",
"headingCode" : "33650",
"term" : DBRef("terms", ObjectId("53aa7acc2ecea3e746beaf20")),
"_created" : NumberLong("1403681704148"),
"_modified" : NumberLong("1403681704148")
}
</pre>
<h3>
1. Updating using Regular expression</h3>
When we went live initially, there were a bunch of terms starting with "Authentic", such as "Authentic Italian" and "Authentic Chinese". After a while, we didn't see much uptake for those terms, so we decided to drop the word "Authentic". The good news is Mongo supports regex, so updating them is quite easy: find all terms starting with "Authentic" and strip that prefix.<br />
<pre class="brush:plain">function updateTermWithAuthentic() {
//drop the word Authentic in front of "Authentic Some Cuisine", Authentic is 10 character including the space, hence the substring of 10.
db.terms.find({
text: { $regex: '^Authentic' }
}).forEach(function(doc) {
db.terms.update(
{ _id: doc._id },
{ $set: { text: doc.text.substring(10) } }
)
})
}
</pre>
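The prefix-stripping logic is easy to check on its own. Here is a minimal plain-JavaScript sketch, outside the mongo shell, with made-up term texts:

```javascript
// "Authentic " is 10 characters (including the trailing space),
// so substring(10) drops exactly that prefix
var terms = ["Authentic Italian", "Authentic Chinese"];
var stripped = terms.map(function (text) {
    return text.substring("Authentic ".length);
});
// stripped is ["Italian", "Chinese"]
```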
We can also update a term's text to a new value, with a bit of print logging:<br />
<pre class="brush:plain">function updateTermText(oldTerm, newTerm) {
var cursor = db.terms.find({text: oldTerm});
if (cursor.hasNext()) {
db.terms.update(
{ text: oldTerm },
{ $set: { text: newTerm } }
)
print("Done - Processing '" + oldTerm + "' - replaced with new term: '" + newTerm + "'.");
} else {
print(">>> ERROR - Processing '" + oldTerm + "'... Don't have term '" + oldTerm + "' in the database, maybe typo?.");
}
}
updateTermText("Good, High Quality", "Good or High Quality");
</pre>
<h3>
2. Count distinct</h3>
How many users submitted endorsements:<br />
<pre class="brush:plain">testx:PRIMARY> db.endorsements.distinct('userId').length;
</pre>
<h3>
3. Run query on data after a certain date/timestamp</h3>
How many endorsements after a certain date, for example after 14-Aug-2014:<br />
<pre class="brush:plain">testx:PRIMARY> db.endorsements.find({
_created: {
$gte: new Date(2014,7,15).getTime()
}
}).count(); //month is 0-11
//Sample output: 441
</pre>
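The <code>month is 0-11</code> comment deserves emphasis: JavaScript <code>Date</code> months are zero-based, so an off-by-one here silently queries the wrong month. A quick standalone check:

```javascript
// month argument 7 means August, not July
var cutoff = new Date(2014, 7, 15);
var humanMonth = cutoff.getMonth() + 1; // convert back to the familiar 1-12 range
// humanMonth is 8 (August); getTime() yields the millisecond
// timestamp that is compared against the _created field
var timestamp = cutoff.getTime();
```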
How many unique users submitted endorsements after a certain date, for example after 14-Aug-2014:<br />
<pre class="brush:plain">testx:PRIMARY> db.runCommand({
distinct: 'endorsements',
key: 'userId',
query: {
_created: {
$gte: new Date(2014,7,15).getTime()
}
}
}).values.length;
//Sample output: 113
</pre>
<h3>
4. Aggregation</h3>
Find the popular endorsement terms in each listing:<br />
<pre class="brush:plain">testx:PRIMARY> db.endorsements.aggregate(
[
{
$group: {
_id: { listingId: "$listingId", term: "$term" },
count: { $sum: 1 }
},
{
$sort: { count: -1 }
}
}
]
).forEach(function(myDoc) {
print(myDoc._id.listingId + "," + myDoc._id.term + "," + myDoc.count);
}
/*
Sample output:
14813910,DBRef("terms", ObjectId("53aa7ae5d0adeb95abec7602")),10
13347379,DBRef("terms", ObjectId("53aa7ace2ecea3e746beaf34")),10
13501322,DBRef("terms", ObjectId("53aa7ae9d0adeb95abec761e")),7
13347379,DBRef("terms", ObjectId("53aa7ae1d0adeb95abec75e2")),6
*/
</pre>
The DBRef and ID are a bit ugly; it would be nicer to output the term text there. With the DBRef, we can modify the forEach() function above to perform another query to get the term text:<br />
<pre class="brush:plain">testx:PRIMARY> db.endorsements.aggregate(
//similar to the above
//...
).forEach(function(myDoc) {
var termRef = myDoc._id.term
var termText = db[termRef.$ref].findOne( { _id: (termRef.$id) } )
print(myDoc._id.listingId + "," + termText.text + "," + myDoc.count);
});
/*
Sample output:
15495422,Cosy Venue,10
14524676,A La Carte,10
14813910,Healthy,7
13407783,Elderly Friendly,6
*/
</pre>
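Dereferencing a DBRef like this is conceptually just a keyed lookup. A plain-JavaScript sketch of the same idea, with hypothetical in-memory data and no database involved:

```javascript
// a tiny stand-in for the terms collection, keyed by _id
var termsById = {
    t1: { _id: "t1", text: "Cosy Venue" },
    t2: { _id: "t2", text: "Healthy" }
};

// each aggregated row carries a reference to a term instead of its text
var rows = [
    { listingId: "15495422", termId: "t1", count: 10 },
    { listingId: "14813910", termId: "t2", count: 7 }
];

// resolve each reference to the human-readable text, as the forEach above does
var lines = rows.map(function (row) {
    return row.listingId + "," + termsById[row.termId].text + "," + row.count;
});
// lines[0] is "15495422,Cosy Venue,10"
```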
<h3>
5. Aggregation, count and sorting</h3>
Which user contributed the greatest number of endorsements? We need to aggregate on userId.<br />
<pre class="brush:plain">testx:PRIMARY> db.endorsements.aggregate(
[
{
$group: {
_id: { userId: "$userId" },
count: { $sum: 1 }
}
},
{
$sort: {count: -1}
}
]
).forEach(function(myDoc) {
print(myDoc._id.userId + "," + myDoc.count);
});
/*
Sample output:
361259bc32c1bc79b8dffa4130f22a83ea7d1c,80
55b5ae81-000c-4a0e-87ee-b6a7cada5a01,38
4e31ccd3-371a-49a6-bfc7-e1f9c3e99672,24
b82e9b60-5e1b-42c2-8131-7ddc67cfd525,23
*/
</pre>
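Conceptually, this pipeline is a count-by-key followed by a descending sort. The same computation in plain JavaScript, over some made-up endorsement documents:

```javascript
var endorsements = [
    { userId: "u1" }, { userId: "u2" }, { userId: "u1" },
    { userId: "u1" }, { userId: "u2" }, { userId: "u3" }
];

// $group equivalent: count documents per userId
var counts = {};
endorsements.forEach(function (doc) {
    counts[doc.userId] = (counts[doc.userId] || 0) + 1;
});

// $sort equivalent: order by count, descending
var ranked = Object.keys(counts).map(function (userId) {
    return { userId: userId, count: counts[userId] };
}).sort(function (a, b) {
    return b.count - a.count;
});
// ranked[0] is { userId: "u1", count: 3 }
```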
<h3>
6. Update the DBRef</h3>
Scenario: an existing term becomes obsolete, so we want to update all existing endorsements pointing to that obsolete term to a different term. For example, the term "Great Service" is obsolete and should be deleted; all existing endorsements should be updated to "Great or Professional Service".<br />
First, find the term "Great Service" and the term "Great or Professional Service", and note down the ObjectId of each.<br />
<pre class="brush:plain">// find the ID of "Great Service"
testx:PRIMARY> db.terms.find({text: "Great Service"}).pretty();
{
"_id" : ObjectId("54a2123744c96256390bee87"),
"className" : "models.endorsement.Term",
"text" : "Great Service",
"origin" : "TERM_DB",
"_created" : NumberLong("1419907639466"),
"_modified" : NumberLong("1419907639466")
}
// find the ID of “Great or Professional Service"
testx:PRIMARY> db.terms.find({text: "Great or Professional Service"}).pretty();
{
"_id" : ObjectId("53d99b0b44c96256390bec17"),
"className" : "models.endorsement.Term",
"text" : "Great or Professional Service",
"origin" : "TERM_DB",
"_created" : NumberLong("1406769931603"),
"_modified" : NumberLong("1406769931603")
}
</pre>
Perform the update for all endorsements having the term ID of "Great Service", pointing them to the ID of "Great or Professional Service":<br />
<pre class="brush:plain">testx:PRIMARY> db.endorsements.update(
{
"term.$id": ObjectId("54a2123744c96256390bee87")
},
{
$set: {
term: { $ref: "terms", $id: "53d99b0b44c96256390bec17" }
}
}
)
</pre>
Finally, delete the term "Great Service":<br />
<pre class="brush:plain">// remove the term “Great Service"
testx:PRIMARY> db.terms.remove({text: "Great Service"})
</pre>
<h3>
7. Unique grouping from derived values</h3>
Scenario: we want to find out the unique users contributing each day. We could do a distinct query on the <code>userId</code> and <code>_created</code> fields; however, <code>_created</code> is a timestamp, not a date. We need to derive a yyyy-MM-dd value from the <code>_created</code> field and perform a distinct query on that. Luckily, the MongoDB <code>group()</code> function and its <code>keyf</code> option support exactly this.<br />
<pre class="brush:plain" name="code">testx:PRIMARY> db.endorsements.group({
keyf: function(doc) {
var createdDate = new Date(doc._created);
return { id: doc.userId, year: createdDate.getFullYear(), month: createdDate.getMonth() + 1, date: createdDate.getDate() };
},
reduce: function ( curr, result ) { },
initial: {}
}).forEach(function(doc) {
print(doc.id + ", " + doc.year + "/" + doc.month + "/" + doc.date);
});
/*
Sample output:
55b5ae81-000c-4a0e-87ee-b6a7cada5a01, 2014/8/14
f2ded4d1-ff70-4b5e-88e4-e5e51d8a3780, 2014/8/14
a8b8a640-0c2d-443b-8cdd-a541d610cfe2, 2014/8/14
d589f10e-a97c-473d-aecf-ab847f5b42db, 2014/8/14
5d830945-9665-4298-aff1-6ab68b27c8af, 2014/8/14
*/
</pre>
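The essence of the <code>keyf</code> trick is deriving a calendar date from the millisecond timestamp, so that all of one user's endorsements on the same day collapse into a single group. The key derivation on its own, with a hypothetical document:

```javascript
// build a (userId, year, month, date) grouping key from a millisecond timestamp
function dayKey(doc) {
    var created = new Date(doc._created);
    return {
        id: doc.userId,
        year: created.getFullYear(),
        month: created.getMonth() + 1, // Date months are 0-based
        date: created.getDate()
    };
}

// _created here is a made-up timestamp for 14-Aug-2014
var key = dayKey({ userId: "abc", _created: new Date(2014, 7, 14).getTime() });
// key is { id: "abc", year: 2014, month: 8, date: 14 }
```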
<h3>
References</h3>
<ul>
<li><a href="http://www.tutorialspoint.com/mongodb/mongodb_database_references.htm">http://www.tutorialspoint.com/mongodb/mongodb_database_references.htm</a></li>
<li><a href="http://stackoverflow.com/questions/7613801/mongodb-sort-by-subdocument-count">http://stackoverflow.com/questions/7613801/mongodb-sort-by-subdocument-count</a></li>
<li><a href="http://docs.mongodb.org/manual/reference/command/distinct/#dbcmd.distinct">http://docs.mongodb.org/manual/reference/command/distinct/#dbcmd.distinct</a></li>
<li><a href="http://stackoverflow.com/questions/13361278/mongodb-query-by-dbref">http://stackoverflow.com/questions/13361278/mongodb-query-by-dbref</a></li>
</ul>
<br />linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com0tag:blogger.com,1999:blog-5883704740796206805.post-83256613702291206252014-07-11T09:16:00.001+10:002021-01-15T15:17:01.988+11:00Automated JMeter Browser Test and Report in New RelicAs part of my work, we have <a href="https://rpm.newrelic.com/" target="_blank" title="New Relic">New Relic</a> set up to record and monitor back-end server response time and front-end browser response time. Recently I noticed some of our JMeter browser tests did not record any traffic or response time in New Relic. This post documents my findings and the solution.<br /><br />First, I need to work out whether it is a New Relic configuration error, an application JavaScript error, or something else. So I start all my browsers and hit the website URL, then wait a few minutes before looking into New Relic. This manual test reports correctly in New Relic, where I can see data for all browsers: Chrome, Firefox, and IE. That means both my New Relic settings and my application's New Relic JavaScript snippet are correct.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVhd-JUZ6Aem64IVPozZz8DsPgyKa8D1a44LzrWxnAt_8iINad6NQEhtmnp7oquQBL01YF8PBHtWpJOIpEzPQvrYTi-RLsq3F5hl-6WpRpQ2cVATtQpoTb2deTcvN0VCppzEWRkmZJ-Mn2/s1600/manual-browser-test-report.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVhd-JUZ6Aem64IVPozZz8DsPgyKa8D1a44LzrWxnAt_8iINad6NQEhtmnp7oquQBL01YF8PBHtWpJOIpEzPQvrYTi-RLsq3F5hl-6WpRpQ2cVATtQpoTb2deTcvN0VCppzEWRkmZJ-Mn2/s640/manual-browser-test-report.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Manual browser test reported correctly in New Relic</td></tr>
</tbody></table>
<br />
So... what's wrong? I look into how my JMeter browser test is set up. I use the <a href="http://jmeter-plugins.org/wiki/ChromeDriverConfig/" target="_blank" title="JMeter Chrome Driver plugin">JMeter Chrome Driver plugin</a>, and the JMeter file looks like this:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWnufMkR8IroEk9dNVkDUQ_4l_pJAEaTaBalrNBO9o7snwq3JYxJl4iFBdYsxaLPUWxfTHZ_Tu1ySvSCxBQmP9vUE5KMei6A-IN6ixcPHZtJ_OSYM-bf50uHCSUl72UgYjmjVLzsWicChV/s1600/jmeter-browser-test-set-up.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWnufMkR8IroEk9dNVkDUQ_4l_pJAEaTaBalrNBO9o7snwq3JYxJl4iFBdYsxaLPUWxfTHZ_Tu1ySvSCxBQmP9vUE5KMei6A-IN6ixcPHZtJ_OSYM-bf50uHCSUl72UgYjmjVLzsWicChV/s640/jmeter-browser-test-set-up.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">JMeter browser test set up</td></tr>
</tbody></table>
<br />
The full script is here:<br />
<br />
<pre class="brush:js" name="code">var javaUtilConcurrent = JavaImporter(java.util.concurrent);
var websampler = JavaImporter(org.openqa.selenium, org.openqa.selenium.support.ui, org.openqa.selenium.interactions.touch);
//open the browser before timing starts
WDS.browser.get("about:blank")
with(websampler) {
WDS.browser.manage().timeouts().implicitlyWait(10, javaUtilConcurrent.TimeUnit.SECONDS);
var wait = new WebDriverWait(WDS.browser, 15);
WDS.sampleResult.sampleStart();
WDS.browser.get("http://"+WDS.args[0]);
wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(".home-page")));
//end timing
WDS.sampleResult.sampleEnd();
}
// verify on homepage
if (!WDS.browser.getTitle().contains("Our website home page")) {
WDS.sampleResult.setSuccessful(false);
WDS.sampleResult.setResponseMessage("Homepage failed: Title does not match the expected value.")
}</pre>
<br />
I start the JMeter script and watch closely how it runs. I notice the browser window might be closing too soon, maybe even before the New Relic report script kicks in. To confirm that the New Relic script is loaded correctly in the test browser, I put in a wait for an invalid element, so that when the test runs, the test browser pops up and stays open long enough for me to inspect elements on the page. And YES, I can see the New Relic script at the bottom of the page in Chrome's element inspector.<br />
<br />
<pre class="brush:js" name="code"><script type="text/javascript" src="http://js-agent.newrelic.com/nr-100.js"></script>
<script type="text/javascript" src="http://beacon-3.newrelic.com/1/somethingsomething"></script></pre>
<br />
That confirms the page can load the script correctly, and also confirms my theory that JMeter closes the Chrome window too early, before the New Relic script has enough time to report back. I then add a wait for the New Relic beacon script element, right after waiting for the page to load and before the "end timing" block.<br />
<br />
<pre class="brush:js" name="code">wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("script[src*='beacon']")));</pre>
<br />
Voila! It works, and the report in New Relic is as expected.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl3zJwb3aKj1QSfiahcOswzsH-XDzNKbAVvSMxG099WkW2V91CGJRk4drGtNGWnuAy0EP9CACajJsS-9tHaEBj6GCYFvwNTTzb-LsgNkjpl0h7tpk1lQVAaA3lgX5vmUUAbnQ-NUdh3I7G/s1600/jmeter-browser-test-report-correctly.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl3zJwb3aKj1QSfiahcOswzsH-XDzNKbAVvSMxG099WkW2V91CGJRk4drGtNGWnuAy0EP9CACajJsS-9tHaEBj6GCYFvwNTTzb-LsgNkjpl0h7tpk1lQVAaA3lgX5vmUUAbnQ-NUdh3I7G/s640/jmeter-browser-test-report-correctly.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">JMeter browser test report correctly</td></tr>
</tbody></table>
linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com0tag:blogger.com,1999:blog-5883704740796206805.post-69093204410701596772014-06-18T09:02:00.001+10:002021-01-15T15:17:50.096+11:00Installing Karma behind proxy<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm7B2CdkO8VXDe-XbQ_Nch_1KEWlcNGOXuEdIW6yVSeKrGz8ivA0BMJ_3nV-NwTkbuA6Zvnuf6gE1i364JJaSRUTyV9i8BaIV2taYVMt-gFZ4ygInNK4o4w3oLSu9YboWU7jRtf8QjsbLX/s1600/karma-banner.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm7B2CdkO8VXDe-XbQ_Nch_1KEWlcNGOXuEdIW6yVSeKrGz8ivA0BMJ_3nV-NwTkbuA6Zvnuf6gE1i364JJaSRUTyV9i8BaIV2taYVMt-gFZ4ygInNK4o4w3oLSu9YboWU7jRtf8QjsbLX/s1600/karma-banner.png" /></a></div>
<br />
As part of working with <a href="https://angularjs.org/" target="_blank" title="AngularJS">AngularJS</a>, I tried to install <a href="http://karma-runner.github.io/" target="_blank" title="Karma">Karma</a> test runner by using npm on one of our new Build Slaves. I got this error while executing the installation.<br />
<pre class="brush:plain" title="Error log">$ npm install karma
npm http 200 http://registry.npmjs.org/abbrev
> phantomjs@1.9.7-6 install /usr/lib/node_modules/karma-phantomjs-launcher/node_modules/phantomjs
> node install.js
Downloading http://cdn.bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
Saving to /usr/lib/node_modules/karma-phantomjs-launcher/node_modules/phantomjs/phantomjs/phantomjs-1.9.7-linux-x86_64.tar.bz2
Using proxy http://company-proxy.com:8080/
Receiving...
Error requesting archive.
Status: 403
Request options: {
"protocol": "http:",
"slashes": true,
"auth": null,
"host": "company-proxy.com:8080",
"port": "8080",
"hostname": "company-proxy.com",
"hash": null,
"search": null,
"query": null,
"pathname": "/",
"path": "http://cdn.bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2",
"href": "http://company-proxy.com:8080/",
"headers": {
"Host": "cdn.bitbucket.org",
"User-Agent": "curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5"
}
}
Response headers: {
"cache-control": "no-cache",
"pragma": "no-cache",
"content-type": "text/html; charset=utf-8",
"proxy-connection": "Keep-Alive",
"connection": "Keep-Alive",
"content-length": "606"
}
Make sure your network and proxy settings are correct.
npm ERR! phantomjs@1.9.7-6 install: `node install.js`
npm ERR! Exit status 1</pre>
<br />
<strong>Solution:</strong> Turns out that I need to set the proxy for npm:<br />
<pre class="brush:plain">$ npm config set proxy "http://company-proxy.com:8080"</pre>
<br />
Then I hit another problem, where the output has this error: <code>return code 407: Unauthorized</code>. This means we need to add credentials to the proxy config:<br />
<pre class="brush:plain">$ npm config set proxy "http://username:password@company-proxy.com:8080"</pre>
That's it. Installation is successful!<br />
<h2>
Side notes:</h2>
Sometimes, you might need to set the registry and set https-proxy:<br />
<pre class="brush:plain">npm config set registry http://registry.npmjs.org/
npm config set https-proxy https://username:password@proxy.company.com:8080</pre>
<h2>
References:</h2>
<ul>
<li><a href="https://github.com/Medium/phantomjs/issues/54" target="_blank" title="https://github.com/Medium/phantomjs/issues/54">https://github.com/Medium/phantomjs/issues/54</a></li>
<li><a href="http://ubuntuforums.org/showthread.php?t=1573206" target="_blank" title="http://ubuntuforums.org/showthread.php?t=1573206">http://ubuntuforums.org/showthread.php?t=1573206</a></li>
</ul>
linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com1tag:blogger.com,1999:blog-5883704740796206805.post-52185558347757140532014-06-08T23:21:00.001+10:002021-01-15T15:18:39.603+11:00Reasons I Start Blogging<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc2KonQRwP_0Uv-vxyPCGmKBMZVKAaah2FifKz4_-SvncwiFIAl-MjDfwXXQOcJvT-_r3d0wfucDecRv0yk-Qin0Wy-hYCEuN29cd5c134H9Oit8r28XV9quf-j8HmPX1-X4TGKEq8xak3/s1600/bg-writer.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc2KonQRwP_0Uv-vxyPCGmKBMZVKAaah2FifKz4_-SvncwiFIAl-MjDfwXXQOcJvT-_r3d0wfucDecRv0yk-Qin0Wy-hYCEuN29cd5c134H9Oit8r28XV9quf-j8HmPX1-X4TGKEq8xak3/s400/bg-writer.jpg" width="400" /></a></div>
<br />
<br />
So... I started blogging after many years. Well, better late than never. You might wonder what made me change my mind.<br />
<br />
For a start, I always wanted to document what I learned, the problems I faced at work, and their solutions. Normally I use <a href="https://evernote.com/" target="_blank" title="Evernote">Evernote</a>; it is a great tool for keeping notes, but it would be much better if I could share those notes with the wider developer community, which I owe so much to. Many times I have found the solution to my current problem in someone's blog, so I hope I can return that favour and someone will find my blog useful.<br />
<br />
In his keynote <a href="http://vimeo.com/86396740" target="_blank" title="Jackstones: the Journey to Mastery">Jackstones: the Journey to Mastery</a>, Dan North mentioned that a great way to learn is to teach. Blogging is a form of teaching, so by blogging I can reinforce what I learn.<br />
<br />
I have been introduced to SEO (Search Engine Optimisation) through my work, and have learned the power of a good domain and good content. That is an investment. An investment that grows with time.<br />
<br />
Finally, this blog post really summarised it all, <a href="http://www.becomingminimalist.com/15-reasons-i-think-you-should-blog/" target="_blank" title="15 reasons I think you should blog">15 reasons I think you should blog</a>, and made me push the "Start Blogging" button.<br />
<br />
I sincerely hope you will find something valuable from my blog. :-)linhhttp://www.blogger.com/profile/00903052857385350518noreply@blogger.com0