Showing posts from January, 2012

Use the Snappy codec with Hive

[1] Snappy is a compression and decompression library, originally developed by Google and now integrated into Hadoop. Snappy is about 10% faster than LZO. The biggest differences are the packaging and that Snappy only provides a codec and has no container spec, whereas LZO has both a file-format container and a compression codec. Snappy ships with CDH3u2 (Cloudera's Distribution), included in the hadoop-0.20 package, and with [2] Apache Hadoop from version 0.21.0 onward.

The example I explain here was originally created by Esteban, a Cloudera Customer Operations Engineer.

Create a test file from a number sequence
$ seq 1 1000 | awk '{OFS="\001";print $1, $1 % 10}' > test_input.hive
$ head -5 test_input.hive
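
The two columns are separated by the control character \001 (Ctrl-A), which a terminal does not show, so the plain output of head can look as if the fields were glued together. As a small sanity check, and only as a sketch that assumes a cat supporting the -v option (GNU and BSD cat both do), the separator can be made visible like this:

```shell
# Regenerate the sample data with the same command as above.
seq 1 1000 | awk '{OFS="\001"; print $1, $1 % 10}' > test_input.hive

# Count the rows; there should be one per input number (1000).
wc -l < test_input.hive

# cat -v renders the invisible \001 separator as ^A,
# so the first row is shown as 1^A1.
head -5 test_input.hive | cat -v
```

Each row holds the number itself and the number modulo 10, so the second column only ever takes the values 0 to 9.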

Upload into HDFS
$ hadoop dfs -mkdir /tmp/hivetest
$ hadoop dfs -put /home/hdfs/test_input.hive /tmp/hivetest

$ hadoop dfs -ls /tmp/hivetest
Found 1 items
-rw-r--r--   3 hdfs supergroup       5893 2012-01-19 09:58 /tmp/hivetest/test_input.hive
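
The size in the listing (5893 bytes) can be cross-checked locally before or after the upload. The following is only a sketch; the variable name size is chosen here for illustration, and the data is regenerated in a pipeline so the check does not depend on the file being present:

```shell
# Recompute the byte count of the generated data without touching HDFS.
# tr strips the padding that some wc implementations add.
size=$(seq 1 1000 | awk '{OFS="\001"; print $1, $1 % 10}' | wc -c | tr -d ' ')

# Should match the size column of the hadoop dfs -ls output: 5893
echo "$size"
```

The number adds up by hand as well: rows 1 to 9 take 4 bytes each, rows 10 to 99 take 5, rows 100 to 999 take 6, and row 1000 takes 7, which gives 36 + 450 + 5400 + 7 = 5893.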

Process the plain file in hive wi…