Showing posts from May, 2013

Get all extended Hive tables with location in HDFS

for file in $(hive -e "show table extended like \`*\`" | grep location: | awk 'BEGIN { FS = ":" }; {printf("hdfs:%s:%s\n",$3,$4)}'); do hdfs dfs -du -h $file; done;

Output:

Time taken: 2.494 seconds
12.6m  hdfs://hadoop1:8020/hive/tpcds/customer/customer.dat
5.2m  hdfs://hadoop1:8020/hive/tpcds/customer_address/customer_address.dat
76.9m  hdfs://hadoop1:8020/hive/tpcds/customer_demographics/customer_demographics.dat
9.8m  hdfs://hadoop1:8020/hive/tpcds/date_dim/date_dim.dat
148.1k  hdfs://hadoop1:8020/hive/tpcds/household_demographics/household_demographics.dat
4.8m  hdfs://hadoop1:8020/hive/tpcds/item/item.dat
36.4k  hdfs://hadoop1:8020/hive/tpcds/promotion/promotion.dat
3.1k  hdfs://hadoop1:8020/hive/tpcds/store/store.dat
370.5m  hdfs://hadoop1:8020/hive/tpcds/store_sales/store_sales.dat
4.9m  hdfs://hadoop1:8020/hive/tpcds/time_dim/time_dim.dat
0      hdfs://hadoop1:8020/user/alexander/transactions/_SUCCESS
95.0k  hd
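The one-liner above can also be written as a small script, which makes the parsing step easier to check on its own. This is only a sketch: `extract_locations` is a hypothetical helper name, and it assumes that `show table extended` prints each table's path on a line beginning with `location:`. It strips that prefix with `sed` instead of splitting on `:` with awk, so a path that happens to contain further colons is kept whole, and it reads the URIs line by line so paths with spaces are not split.

```shell
#!/bin/sh
# Hypothetical helper: reads "show table extended" output on stdin and
# prints one storage URI per table.
extract_locations() {
    # Assumed input shape: location:hdfs://host:port/path
    # Removing only the "location:" prefix keeps the rest of the line intact.
    sed -n 's/^[[:space:]]*location://p'
}

# Only call Hive and HDFS when both command-line tools are installed.
if command -v hive >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
    # -S asks Hive for silent mode; the query is the same as in the one-liner.
    hive -S -e 'show table extended like `*`' | extract_locations |
    while IFS= read -r uri; do
        hdfs dfs -du -h "$uri"
    done
fi
```

The helper can be tried without a cluster by piping sample text into it, for example `printf 'location:hdfs://hadoop1:8020/hive/tpcds/customer\n' | extract_locations`.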

Query HBase tables with Impala

As described in other blog posts, Impala uses the Hive Metastore Service to query the underlying data. In this post I use the Hive-HBase handler to connect Hive and HBase, and then query the data with Impala. I have previously written a tutorial ( http://mapredit.blogspot.de/2012/12/using-hives-hbase-handler.html ) on how to connect HBase and Hive; please follow the instructions there.

This approach gives data scientists a wide range of ways to work with data stored in HDFS and/or HBase. You can run queries against your stored data regardless of which technology and database you use, simply by querying the different data sources in a fast and easy way.

I use the officially available census data gathered in 2000 by the US government. The goal is to push this data as CSV into HBase and query the table with Impala. I've made a demonstration script, which is available in my git repository.

Demonstration scenario

The dataset looks pretty simple:

cat DEC_00_SF