
Memory consumption in Flume

Memory required by each Source or Sink

The heap memory used by a single event is dominated by the event body, with some incremental usage from any headers added. In general, a source or a sink allocates roughly the size of the event body plus perhaps 100 bytes for headers (this is affected by headers added by the txn agent). To estimate the memory used by a single batch, multiply your average (or 90th percentile) event size, plus some additional buffer, by the maximum batch size.
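As a quick sketch of that calculation (the event size, header allowance, and batch size below are hypothetical; substitute your own measurements):

```shell
# Hypothetical sizing inputs -- adjust to your own measurements.
EVENT_BODY_BYTES=2048      # average (or 90th percentile) event body size
HEADER_BYTES=100           # rough allowance for headers
BATCH_SIZE=1000            # maximum batch size configured on the sink/source

# Heap needed to hold one full batch in flight
BATCH_MEM=$(( (EVENT_BODY_BYTES + HEADER_BYTES) * BATCH_SIZE ))
echo "per-batch memory: ${BATCH_MEM} bytes"
```

With these numbers a single batch costs about 2.1 MB of heap.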

The memory required by each Sink is the memory needed for a single batch. The memory required by each Source is the memory needed for a single batch, multiplied by the number of clients simultaneously connected to it. Keep this in mind and plan your event delivery according to your expected throughput.
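The source-side scaling can be sketched the same way (the per-batch figure and client count below are hypothetical examples, not recommendations):

```shell
# Hypothetical values: one batch costs ~2.1 MB, and 50 clients
# are connected to the source at the same time.
BATCH_MEM=2148000
CLIENTS=50

# Each connected client can have a batch in flight concurrently
SOURCE_MEM=$(( BATCH_MEM * CLIENTS ))
echo "source heap needed: ${SOURCE_MEM} bytes"
```

In this example the source alone needs roughly 107 MB of heap, which is why high fan-in sources dominate agent sizing.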

Memory required by each File Channel

Under normal operation, each File Channel uses some heap memory and some direct memory. Give …

HBase major compaction per cronjob

Sometimes I get asked how an admin can run a major compaction on a particular table at a time when the cluster isn't usually used.

This can be done via cron or at. The HBase shell needs a ruby script, which is very simple:

# cat m_compact.rb
major_compact 't1'
exit

A working shell script for cron, as example:
# cat daily_compact
#!/bin/bash
USER=hbase
PWD=`echo ~$USER`
TABLE=t1
# kerberos enabled 
KEYTAB=/etc/hbase/conf/hbase.keytab
HOST=`hostname`
REALM=ALO.ALT
LOG=/var/log/daily_compact

# get a new ticket
sudo -u $USER kinit -k -t $KEYTAB $USER/$HOST@$REALM
# start compaction
sudo -u $USER hbase shell $PWD/m_compact.rb 2>&1 | tee -a $LOG

All messages are appended to /var/log/daily_compact:
11/15/13 06:49:26 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12 row(s) in 0.7800 seconds
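To schedule the script, a crontab entry like the following would run it daily during an off-peak window (the path and the 03:00 schedule are assumptions; adjust both to your environment):

```shell
# Hypothetical crontab entry (edit with `crontab -e` as root):
# run the compaction script every day at 03:00.
0 3 * * * /usr/local/bin/daily_compact
```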