Apache Flume 1.2.x and HBase

By Anonymous - June 13, 2012

The first HBase sink for Apache Flume was committed to trunk a week ago, and it was my topic at the HBase workshop at Berlin Buzzwords. The slides are available on my SlideShare channel.

Let me explain how it works and how to get an Apache Flume to HBase flow running. First, you have to check out trunk and build the project (you need git and Maven installed on your system):

git clone git://git.apache.org/flume.git && cd flume && git checkout trunk && mvn package -DskipTests && cd flume-ng-dist/target

Within trunk, the HBase sink is available in the sinks directory (ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/).

Please note a few particulars:

The sink currently controls only the HBase flush(), transaction and rollback.
Apache Flume reads the $CLASSPATH variable and uses the first hbase-site.xml it finds. Keep that in mind if you run different versions of HBase on your system.
The HBase table and column family have to be created beforehand (column qualifiers are created by HBase on write). That's all.
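Because the first hbase-site.xml on the classpath wins, it can help to check which one that is before starting the agent. The following is a minimal sketch and not part of Flume: the function name first_hbase_site is made up for illustration, and it assumes that the classpath entries are separated by colons and that only plain directory entries need to be inspected (jar files and wildcard entries are skipped).

```shell
#!/bin/sh
# Hypothetical helper (not part of Flume): print the first hbase-site.xml
# found in the directories of a colon-separated classpath.
first_hbase_site() {
    cp="${1:-$CLASSPATH}"
    old_ifs=$IFS
    IFS=:
    for entry in $cp; do
        # Only plain directories are checked; jars and wildcards are skipped.
        if [ -f "$entry/hbase-site.xml" ]; then
            IFS=$old_ifs
            printf '%s\n' "$entry/hbase-site.xml"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

first_hbase_site "$CLASSPATH" || echo "no hbase-site.xml found on CLASSPATH" >&2
```

If the printed path belongs to a different HBase installation than the one you want to write to, reorder the classpath before starting Flume.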

Using the HBase sink is pretty simple; a valid configuration could look like this:

host1.sources = src1
host1.sinks = sink1
host1.channels = ch1
host1.sources.src1.type = seq
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.column = foo
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol
host1.channels.ch1.type = memory

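In this example we define a seq source on localhost with a listening port, set the sink type to the HBase sink class and define the event serializer. Why a serializer? HBase stores data as byte arrays addressed by row key, column family and column, so each incoming event has to be transformed into that form. With SimpleHbaseEventSerializer, payloadColumn names the column that receives the event body and incrementColumn names a column that is incremented as an event counter. Apache Flume's HBase sink currently uses the synchronous (blocking) client; asynchronous support will follow (FLUME-1252).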
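Putting the pieces together, a run could look like the sketch below. It assumes that the configuration above is saved as conf/hbase-flow.conf (a file name chosen here for illustration), that the commands are run from the unpacked Flume distribution directory, and that the hbase launcher is on the PATH. The table and column family names match the configuration; the helper script name create_test3.hbase is likewise made up.

```shell
#!/bin/sh
# Sketch only: the file names and paths below are assumptions, not Flume defaults.
AGENT_NAME=host1                   # must match the prefix used in the config
CONF_FILE=conf/hbase-flow.conf     # hypothetical file holding the config above
TABLE_SCRIPT=create_test3.hbase    # hypothetical HBase shell script

# The sink does not create the table, so create it with its column family first.
cat > "$TABLE_SCRIPT" <<'EOF'
create 'test3', 'testing'
exit
EOF

if command -v hbase >/dev/null 2>&1; then
    hbase shell "$TABLE_SCRIPT"
else
    echo "hbase not found on PATH; run $TABLE_SCRIPT in the HBase shell manually" >&2
fi

# Start the agent; --name selects the "host1.*" properties from the file.
if [ -x bin/flume-ng ] && [ -f "$CONF_FILE" ]; then
    bin/flume-ng agent --conf conf --conf-file "$CONF_FILE" --name "$AGENT_NAME" \
        -Dflume.root.logger=INFO,console
else
    echo "bin/flume-ng or $CONF_FILE not found; run this from the Flume distribution directory" >&2
fi
```

Once the agent is running, a scan 'test3' in the HBase shell can be used to check whether rows are arriving.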

Links:

[1] API documentation: http://archive.cloudera.com/cdh4/cdh/4/flume-ng/apidocs/org/apache/flume/sink/hbase/
[2] Flume 1252: https://issues.apache.org/jira/browse/FLUME-1252

Comments

Sébastien, 13 June 2012
Hi!
Thanks for the information. I tried that and got the following result: http://pastebin.com/0YZBw8YL . Flume and ZK don't seem to communicate with each other :(
Do you have any idea?
Anonymous, 13 June 2012
Hi,

I guess it didn't work against a SASL ZooKeeper. Are the certificates readable? If yes, please file a JIRA about it.
http://hbase.apache.org/configuration.html#zk.sasl.auth

Thanks,
Alex
Sébastien, 13 June 2012
Sorry, but I don't use SASL ZK. :/
Tariq, 14 June 2012
Thank you so much, alo, for the wonderful work. Using this write-up I was able to use the HBase sink, but being a newbie I am left with some questions. Could you please tell me something about the following two lines:

host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol

Many thanks