
Showing posts from 2013

How HDFS protects your data

Often we get asked how HDFS protects data and what mechanisms exist to prevent data corruption. Eric Sammer explains this in detail in Hadoop Operations.

In addition to the points below, you can also maintain a second cluster to sync the files to, simply to guard against human error, such as someone deleting a subset of data. If you have enough space in your cluster, enabling the trash in core-site.xml and setting it to a value higher than one day helps too.

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes after which the checkpoint
  gets deleted. If zero, the trash feature is disabled. 1440 means one day.
  </description>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>15</value>
  <description>Number of minutes between trash checkpoints.
  Should be smaller or equal to fs.trash.interval.
  Every time the checkpointer runs it creates a new checkpoint
  out of curr…
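With the trash enabled, a delete moves data into the user's .Trash directory instead of removing it immediately, which is what makes recovery from human error possible. A quick sketch of the behavior (paths and the user name alice are illustrative):

```shell
# With fs.trash.interval > 0, -rm moves the data into the trash
# instead of deleting it immediately:
hdfs dfs -rm -r /data/important
# the data now lives under /user/alice/.Trash/Current/data/important

# Restore it by simply moving it back out of the trash:
hdfs dfs -mv /user/alice/.Trash/Current/data/important /data/important

# -skipTrash bypasses the trash entirely, so use it with care:
hdfs dfs -rm -r -skipTrash /tmp/scratch
```

Note that the trash only catches deletes issued through the HDFS shell; programmatic deletes via the FileSystem API bypass it.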

Export more than 10k records with Sqoop

If you want to export more than 10k records out of an RDBMS via Sqoop, there are some settings and properties you can tweak.

Parameter --num-mappers
The number of simultaneous connections that will be opened against the database. Sqoop will use that many processes to export data (each process exports a slice of the data). Take care about the maximum number of open connections your RDBMS allows, since too many mappers can easily overwhelm it.

Parameter --batch
Enables batch mode on the JDBC driver, which queues the statements and delivers them in batches.

Property sqoop.export.records.per.statement
Number of rows that will be packed into a single insert statement, e.g. INSERT INTO xxx VALUES (), (), (), ... Here you have to know which VALUES you want to catch, but this

Property sqoop.export.statements.per.transaction
Number of insert statements per single transaction, e.g. BEGIN; INSERT, INSERT, ...; COMMIT
You can specify the properties (even both at the same time) in the HADOOP_ARGS secti…
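Put together, an export invocation combining these knobs might look like the following sketch (the connect string, credentials, table, and export directory are made up for illustration; the -D properties must come right after the tool name):

```shell
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://db1.example.com/sales \
  --username sqoop -P \
  --table orders \
  --export-dir /user/etl/orders \
  --num-mappers 8 \
  --batch
```

With 8 mappers, 100 rows per statement, and 100 statements per transaction, each of the 8 database connections commits in chunks of 10,000 rows.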

Connect to HiveServer2 with a kerberized JDBC client (Squirrel)

Squirrel works with Kerberos; however, if you don't use Kerberos, you don't need the JAVA_OPTS changes at the end. My colleague, Chris Conner, has created a maven project that pulls down all of the dependencies for a JDBC program:
https://github.com/cmconner156/hiveserver2-jdbc-kerberos

Note that in a Kerberos environment, you need to kinit before using Squirrel. The above program handles kinit for you. If you are not using Kerberos and you want to use the above program, comment out the following lines:

System.setProperty("java.security.auth.login.config","gss-jaas.conf");
System.setProperty("javax.security.auth.useSubjectCredsOnly","false");
System.setProperty("java.security.krb5.conf","krb5.conf");

Then make sure to change the JDBC URI so it does not contain the principal. Also, it's worth mentioning that with Kerberos I did have some issues with differing Java versions, so try matching your client's…
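For reference, the two URI shapes differ only in the principal part. A sketch using beeline for brevity (Squirrel takes the same URI string; host, port, principal, and credentials below are placeholders):

```shell
# Kerberized HiveServer2: the URI carries the server principal
beeline -u "jdbc:hive2://hs2.example.com:10000/default;principal=hive/hs2.example.com@EXAMPLE.COM"

# Without Kerberos: drop the principal and pass a user/password instead
beeline -u "jdbc:hive2://hs2.example.com:10000/default" -n alice -p secret
```

Note the principal is the HiveServer2 service principal, not your own user principal; your own identity comes from the kinit ticket cache.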

Enable Replication in HBase

HBase has support for multi-site replication for disaster recovery. It is not an HA solution; the application and solution architecture will need to implement HA. Replication means that data from one cluster is automatically copied to a backup cluster, either within the same data center or across data centers. There are three ways to configure this: master-slave, master-master, and cyclic replication. Master-slave is the simplest solution for DR, as data is written to the master and replicated to the configured slave(s). Master-master means that the two clusters cross-replicate edits, but prevent replication from going into an infinite loop by tracking mutations using the HBase cluster ID. Cyclic replication is also supported, which means you can have multiple clusters replicating to each other in combinations of master-master and master-slave.
Replication relies on the WAL, the WAL edits are replayed from a source region server to a target region server.

A few important…
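The basic wiring looks roughly like this sketch (peer ID, ZooKeeper quorum, table, and column family names are placeholders; exact steps vary by HBase version):

```shell
# 1. Enable replication in hbase-site.xml on both clusters
#    (set hbase.replication to true), then restart HBase.

# 2. On the source cluster, register the peer's ZooKeeper ensemble
#    and mark the column family for replication via the HBase shell:
hbase shell <<'EOF'
add_peer '1', 'backupzk1,backupzk2,backupzk3:2181:/hbase'
disable 'mytable'
alter 'mytable', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'mytable'
EOF
```

REPLICATION_SCOPE is set per column family, so you can replicate only a subset of a table's families if you wish.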

Get all extended Hive tables with location in HDFS

for file in $(hive -e "show table extended like \`*\`" \
  | grep location: \
  | awk 'BEGIN { FS = ":" }; {printf("hdfs:%s:%s\n", $3, $4)}'); do
  hdfs dfs -du -h $file
done

Output:

Time taken: 2.494 seconds
12.6m   hdfs://hadoop1:8020/hive/tpcds/customer/customer.dat
5.2m    hdfs://hadoop1:8020/hive/tpcds/customer_address/customer_address.dat
76.9m   hdfs://hadoop1:8020/hive/tpcds/customer_demographics/customer_demographics.dat
9.8m    hdfs://hadoop1:8020/hive/tpcds/date_dim/date_dim.dat
148.1k  hdfs://hadoop1:8020/hive/tpcds/household_demographics/household_demographics.dat
4.8m    hdfs://hadoop1:8020/hive/tpcds/item/item.dat
36.4k   hdfs://hadoop1:8020/hive/tpcds/promotion/promotion.dat
3.1k    hdfs://hadoop1:8020/hive/tpcds/store/store.dat
370.5m  hdfs://hadoop1:8020/hive/tpcds/store_sales/store_sales.dat
4.9m    hdfs://hadoop1:8020/hive/tpcds/time_dim/time_dim.dat
0       hdfs://hadoop1:8020/user/alexander/transactions/_SUCCESS
95.0k   hdfs://hadoop1:8020/user/alexander/transacti…
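The awk step is the only non-obvious part of the one-liner: it splits Hive's "location:hdfs://host:port/path" lines on colons and stitches the URI back together. A quick isolated check with a canned sample line instead of hive -e:

```shell
# $1 = "location", $2 = "hdfs", $3 = "//hadoop1", $4 = "8020/hive/..."
echo "location:hdfs://hadoop1:8020/hive/tpcds/customer" \
  | awk 'BEGIN { FS = ":" }; {printf("hdfs:%s:%s\n", $3, $4)}'
# → hdfs://hadoop1:8020/hive/tpcds/customer
```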

Query HBase tables with Impala

As described in other blog posts, Impala uses the Hive Metastore Service to query the underlying data. In this post I use the Hive-HBase handler to connect Hive and HBase and later query the data with Impala. In the past I've written a tutorial (http://mapredit.blogspot.de/2012/12/using-hives-hbase-handler.html) about how to connect HBase and Hive; please follow the instructions there.

This approach offers data scientists a wide field of work with data stored in HDFS and/or HBase. You can run queries against your stored data regardless of which technology or database it lives in, simply by querying the different data sources in a fast and easy way.

I use the officially available census data gathered in 2000 by the US government. The goal is to push this data as CSV into HBase and query the table with Impala. I've made a demonstration script which is available in my git repository.

Demonstration scenario

The dataset looks pretty simple:

cat DEC_00_SF3_P077_with_an…
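Following the linked tutorial, the wiring boils down to a Hive table backed by HBase, which Impala then sees through the shared metastore. A sketch under those assumptions (table, column, and family names here are made up, not the ones from the demo script):

```shell
# Hive: create an HBase-backed table via the storage handler
hive -e "CREATE EXTERNAL TABLE census_hbase(rowkey STRING, population INT)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:population')
  TBLPROPERTIES ('hbase.table.name' = 'census');"

# Impala: pick up the new metastore entry, then query it like any table
impala-shell -q "INVALIDATE METADATA"
impala-shell -q "SELECT count(*) FROM census_hbase"
```

The hbase.columns.mapping ties the HBase row key and the d:population column to the Hive schema; Impala reuses that mapping unchanged.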

HBase: MSLAB and CMS vs. ParallelGC

Tuning the Java options for HBase is a necessary step to get the best performance and stability in large installations. A common recommendation looks like: HBASE_OPTS="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled"
But you can also achieve great success with: HBASE_OPTS="-server -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8"
What are the differences between ParallelGC and CMS?
CMS uses more CPU, but runs concurrently. If a concurrent collection fails, CMS falls back to a stop-the-world collection and pauses the VM for the entire time it is collecting. This risk can be minimized by enabling MSLAB in your HBase configuration. ParallelGC has better throughput but longer pause times, and stops the VM on every collection. For HBase this means a pause (around 1 second per GB of heap), which on highly loaded clusters can lead to outages of an unacceptable length.
MSLAB (MemStore-Local Allocatio…
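MSLAB is switched on in hbase-site.xml. A sketch of the relevant properties (the chunk size value shown is the commonly cited 2 MB default; defaults differ between HBase versions, so treat these numbers as illustrative):

```
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<!-- optional tuning knob, illustrative value: 2 MB chunks -->
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value>
</property>
```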

Flume 1.3.1 Windows binary release online

Andy Blozhou, a Chinese Flume enthusiast, provides precompiled Windows binaries of Flume 1.3.1, including a startup bat and an Avro client bat.
You can grab this build on his website http://abloz.com/flume/windows_download.html:

======== snip ========

This is the flume-ng 1.3.1 windows version for download: apache-flume-1.3.1-bin.zip

Simple usage:
1. Download the Windows version of Flume 1.3.1 (apache-flume-1.3.1-bin.zip) from http://abloz.com/flume/windows_download.html
2. Unzip apache-flume-1.3.1-bin.zip to a directory.
3. Install JDK 1.6 from Oracle and set JAVA_HOME in the environment. Download from http://www.oracle.com/technetwork/java/javase…

Run bin/flume.bat for the agent and bin/flume-avroclient.bat for the avro-client; modify them for your own environment. Detail: to compile flume-ng on Windows, please reference http://mapredit.blogspot.com/2012/07/run-flume-13x-on-windows.html or my Chinese version http://abloz.com/2013/02/18/compile-under-windows-flume-1-3-1.html

LZO Compression with Oozie

It can happen, when one of the compression codecs is switched to LZO, that Oozie can't start any MR job successfully. Usually the codec is configured in core-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Oozie then reports a ClassNotFound error (java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found). To get the jobs running, copy or link hadoop-lzo.jar into /var/lib/oozie/ and restart the Oozie server.
The second and most common issue is that people forget to set up the shared lib directory:
[root@hadoop2 ~]# sudo -u hdfs hadoop fs -mkdir /user/oozie
[root@hadoop2 …
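The remaining sharelib steps are typically along these lines (a sketch only: the namenode address, local sharelib path, and user names are illustrative and vary by distribution and Oozie version):

```shell
# create the oozie home in HDFS and hand it over to the oozie user
sudo -u hdfs hadoop fs -mkdir /user/oozie
sudo -u hdfs hadoop fs -chown oozie:oozie /user/oozie

# upload the unpacked Oozie sharelib as the oozie user;
# the local path depends on how Oozie was installed
sudo -u oozie hadoop fs -put /usr/lib/oozie/oozie-sharelib/share /user/oozie/share
```

After the upload, restart the Oozie server so the workflows pick up the shared libraries.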