Thursday, September 15, 2011

Speedup Sqoop


Sqoop [1] (sql to hadoop) lets easy connect RDBMS into a hadoop infrastructure. Newest plugin comes from Microsoft and let us connect MS-SQL Server and hadoop each together. As a cool feature you can create a jar-file from your job, its pretty easy, just here a line:

sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME> --table<TABLENAME> --username<USERNAME> --password<PASSWORD> --export-dir <HDFS DIR WHICH CONTAINS DATA> --direct --fields-terminated-by '<TERMINATOR (Java)>' --package-name <JOBNAME>.<IDENTIFIER> --outdir <WHERE THE JAR SHOULD WRITTEN> --bindir <BIN_DIR>

After you fired up you'll find a jar-package in --outdir, unzip it and you find your java-code and the precompiled class,so you can start to tune them.

Now lets start the job again, but use the precompiled class:

sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME> --table<TABLENAME> --username<USERNAME> --password<PASSWORD> --export-dir <HDFS DIR WHICH CONTAINS DATA> --direct --fields-terminated-by '<TERMINATOR (Java)>' --jar-file <PATH/TO/JAR> 
--class-name <JOBNAME>.<IDENTIFIER>.<CLASSNAME>

The step above let you increase the export of large datasets dramatically. So I speedup a export of 100k records from hdfs into a oracle-DB from 16sec into 8sec.