Posts

Showing posts from 2012

Impala and Kerberos

First, Impala is beta software and has some limitations. Stay tuned and test this, you'll see it can be change your BI world dramatically.

What is Impala? Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
(https://ccp.cloudera.com/display/IMPALA10BETADOC/Introducing+Cloudera+Impala) You can build Impala by source (https://github.com/cloudera/impala) or you can grab them by using yum on a RHEL / CentOS 6x server. Imapla doesn't support RHEL / CentOS prior 6, since the most part of Impala is written in C++.

I choose the rpm-version for this article, but the compiled version will work in the same manner. To grab impala directly per yum setup a new repository:

#> ca…

Using Hive's HBase handler

Hive supports per HIVE-705 HBase integration for SELECT and write INSERT both and is well described Hive's wiki. Note, as of Hive 0.9x the integration requires HBase 0.92x. In this article I'll show how to use existing HBase tables with Hive.

To use Hive in conjunction with HBase, a storage-handler is needed. Per default, the storage handler comes along with your Hive installation and should be available in Hive's lib directory ($HIVE_HOME/lib/hive-hbase-handler*). The handler requires hadoop-0.20x and later as well as zookeeper 3.3.4 and up.

To get Hive and HBase working together, add HBase's config directory into hive-site.xml:

<property>
<name>hive.aux.jars.path</name>
<value>file:///etc/hbase/conf</value>
</property>

and sync the configs (hbase-site.xml as well as hive-site.xml) to your clients. Add a table in Hive using the HBase handler:

CREATE TABLE hbase_test
(
key1 string,
col1 string
)
STORED BY 'org.apache.hadoop.hive.hba…

Hive "drop table" hangs (Postgres Metastore)

By using postgres as a metastore database it could be happen that "drop table xyz" hangs and Postgres is showing LOCKS with UPDATE. This happen since some tables are missing and can be fixed by using:

create index "IDXS_FK1" on "IDXS" using btree ("SD_ID");
create index "IDXS_FK2" on "IDXS" using btree ("INDEX_TBL_ID");
create index "IDXS_FK3" on "IDXS" using btree ("ORIG_TBL_ID");

CREATE TABLE "ROLES" (
"ROLE_ID" bigint NOT NULL,
"CREATE_TIME" int NOT NULL,
"OWNER_NAME" varchar(128) DEFAULT NULL,
"ROLE_NAME" varchar(128) DEFAULT NULL,
PRIMARY KEY ("ROLE_ID"),
CONSTRAINT "ROLEENTITYINDEX" UNIQUE ("ROLE_NAME")
) ;

CREATE TABLE "ROLE_MAP" (
"ROLE_GRANT_ID" bigint NOT NULL,
"ADD_TIME" int NOT NULL,
"GRANT_OPTION" smallint NOT NULL,
"GRANTOR" varchar(128) DEFAULT NULL,
"GRAN…

Hive query shows ERROR "too many counters"

A hive job face the odd "Too many counters:" like

Ended Job = job_xxxxxx with exception 'org.apache.hadoop.mapreduce.counters.LimitExceededException(Too many counters: 201 max=200)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
Intercepting System.exit(1)

These happens when operators are used in queries (Hive Operators). Hive creates 4 counters per operator, max upto 1000, plus a few additional counters like file read/write, partitions and tables. Hence the number of counter required is going to be dependent upon the query. 
To avoid such exception, configure "mapreduce.job.counters.max" in mapreduce-site.xml to a value above 1000. Hive will fail when he is hitting the 1k counts, but other MR jobs not. A number around 1120 should be a good choice.

Using "EXPLAIN EXTENDED" and "grep -ri operators | wc -l" print out the used numbers of operators. Use this value to tweak the MR settings carefully.

Memory consumption in Flume

Memory required by each Source or Sink

The heap memory used by a single event is dominated by the data in the event body with some incremental usage by any headers added. So in general, a source or a sink will allocate roughly the size of the event body + maybe 100 bytes of headers (this is affected by headers added by the txn agent). To get the total memory used by a single batch, multiply your average (or 90th percentile) event
size (plus some additional buffer) by the maximum batch size. This will give you the memory needed by a single batch.

The memory required for each Sink is the memory needed for a single batch and the memory required for each Source is the memory needed for a single batch, multiplied by the number of clients simultaneously connected to the Source. Keep this in mind, and plan your event delivery according to your expected throughput.

Memory required by each File Channel

Under normal operation, each File Channel uses some heap memory and some direct memory. Give …

HBase major compaction per cronjob

Sometimes I get asked how a admin can run a major compaction on a particular table at a time when the cluster isn't usually used.

This can be done per cron, or at. HBase shell needs a ruby script, which is very simple:

# cat m_compact.rb
major_compact 't1'
exit

A working shell script for cron, as example:
# cat daily_compact #!/bin/bash
USER=hbase
PWD=`echo ~$USER`
TABLE=t1
# kerberos enabled 
KEYTAB=/etc/hbase/conf/hbase.keytab
HOST=`hostname`
REALM=ALO.ALT
LOG=/var/log/daily_compact

# get a new ticket
sudo -u $USER kinit -k -t $KEYTAB $USER/$HOST@$REALM
# start compaction
sudo -u $USER hbase shell $PWD/m_compact.rb 2>&1 |tee -a $LOG

All messages will be redirected to /var/log/daily_compact:
11/15/13 06:49:26 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12 row(s) in 0.7800 seconds

Secure your cluster with kerberos

Many times I get questions about a safe and fast way to secure a cluster without big steps like integrate AD structures, simply to prevent unauthorized access. I created this writeup to let you know the steps you need. I used CentOS 6.3 and CDH 4.0.1, but you can use other distributions as well.

Setup KDC on a Linux Box

Install kerberos5 related packages as well as kadmin, too. First thing you have to do is to replace EXAMPLE.COM, which is delivered per default, with your own realm. I used ALO.ALT here.

Example config:

# hadoop1> cat /etc/krb5.conf 
[libdefaults]
 default_realm = ALO.ALT
 dns_lookup_realm = false
 dns_lookup_kdc = false
[realms]
  ALO.ALT = {
  kdc = HADOOP1.ALO.ALT:88
  admin_server = HADOOP1.ALO.ALT:749
  default_domain = HADOOP1.ALO.ALT
 }
[domain_realm]
 .alo.alt = ALO.ALT
 alo.alt = ALO.ALT

[logging]
  kdc = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmin.log
  default = FILE:/var/log/krb5lib.log


Now tweak your DNS or /etc/hosts to reflect the settings, if you us…

Enable JMX Metrics in Flume 1.3x

Image
As you know, Flume supports Ganglia (version 3 and above) to collect report metrics. The details are described in our documentation (1).

Now I'll describe how do you use JMX reporting (to integrate metrics into other monitoring systems or monitor flume directly via Java's builtin JMX console (2)). That's pretty easy - choose a port which is free on your server and not firewall blocked.

First, enable JMX via flume's env.sh ($FLUME_HOME/conf/flume-env.sh) and uncomment / add / edit the line starting with JAVA-OPTS:

JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

Of course, restart all running agents. 
Now point jconsole to Flume:

jconsole YOUR_HOSTNAME: 54321

will start a X11 Java window:

which monitors your Flume installation. 
Links (1) http://flume.apache.org/FlumeUserGuide.html#ganglia-reporting (2) http://docs.orac…

BigData - eine Übersicht

(Dieser Artikel ist auch als Slideshow verfügbar: http://www.slideshare.net/mapredit/big-data-mit-apache-hadoop)

Mehr und mehr drängt sich BigData als nebulöser Begriff in die Fachpresse. Klar ist, wer mithalten will im Business und innovativ zukünftige Projekte erfolgreich zum Abschluss führen will, kommt um das Thema nicht herum. Doch warum kommt man nicht darum herum? Was ist der Beweggrund für das Sammeln riesiger Datenmengen?

Der Weg dahin ist recht einfach und wird von vielen Unternehmen bereits seit Jahren betrieben, nur mit ungleich höherem Aufwand an Manpower und finanziellen Investments.

Ein Beispiel:
Es werden Logfiles durch riesige Datenfarmen zusammengeführt; wochenlange Jobs laufen über Terrabyte an den gewonnen und aufbereiteten Daten. Tritt in der Kette ein Fehler auf, beginnt der Lauf im Idealfall an der unterbrochenen Stelle - oder von vorn. Doch bis dahin muss eine lange Prozesskette eingehalten werden, um brauchbare Daten für eben diesen einen Job zu erhalten. Und e…

Flume 1.2.0 released

The Apache Flume Team released yesterday the next large release with number 1.2.0. Here a overview about the fixes and additions (thanks Mike, I copy your overview):

Apache Flume 1.2.0 is the third release under the auspices of Apache of the so-called "NG" codeline, and our first release as a top-level Apache project! Flume 1.2.0 has been put through many stress and regression tests, is stable, production-ready software, and is backwards-compatible with Flume 1.1.0. Four months of very active development went into this release: a whopping 192 patches were committed since 1.1.0, representing many features, enhancements, and bug fixes. While the full change log can be found in the link below, here are a few new feature highlights:

* New durable file channel  * New client API  * New HBase sinks (two different implementations)  * New Interceptor interface (a plugin processing API)  * New JMX-based monitoring support

With this release - the first after evolving into a tier 1 Apache …

Get Apache Flume 1.3.x running on Windows

Since we found an increasing interest in the flume community to get Apache Flume running on Windows systems again, I spent some time to figure out how we can reach that. Finally, the good news - Apache Flume runs on Windows. You need some tweaks to get them running.

Prerequisites
Build system:
maven 3x, git, jdk1.6.x, WinRAR (or similar program)

Apache Flume agent:
jdk1.6.x, WinRAR (or similar program), Ultraedit++ or similar texteditor

Tweak the Windows build box
1. Download and install JDK 1.6x from Oracle
2. Set the environment variables
   => Start - type "env" into the search box, select "Edit system environment variables", click Environment Variables, Select "New" from the "Systems variables" box, type "JAVA_HOME" into "variable name" and the path to your JDK installation into "Variable value" (Example: C:\Program Files (x86)\Java\jdk1.6.0_33)
3. Download maven from Apache
4. Set the environment variables
   =&g…

Apache Flume 1.2.x and HBase

The newest (and first) HBase sink was committed into trunk one week ago and was my point at the HBase workshop @Berlin Buzzwords. The slides are available in my slideshare channel.

Let me explain how it works and how you get an Apache Flume - HBase flow running. First, you've got to checkout trunk and build the project (you need git and maven installed on your system):

git clone git://git.apache.org/flume.git && cd flume && git checkout trunk && mvn package -DskipTests && cd flume-ng-dist/target

Within trunk, the HBase sink is available in the sinks - directory (ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/)

Please note a few specialities:
The sink controls atm only HBase flush (), transaction and rollback. Apache Flume reads out the $CLASSPATH variable and uses the first available hbase-site.xml. If you use different versions of HBase on your system please keep that in mind. The HBase table, columns and column …

Using filters in HBase to match certain columns

HBase is a column oriented database which stores the content by column rather than by row. To limit the output of an scan you can use filters, so far so good.

But how it'll work when you want to filter more as one matching column, let's say 2 or more certain columns?
The trick here is to use an SingleColumnValueFilter (SCVF) in conjunction with a boolean arithmetic operation. The idea behind is to include all columns which have "X" and NOT the value DOESNOTEXIST; the filter would look like:


List list = new ArrayList<Filter>(2);
Filter filter1 = new SingleColumnValueFilter(Bytes.toBytes("fam1"),
 Bytes.toBytes("VALUE1"), CompareOp.DOES_NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
filter1.setFilterIfMissing(true);
list.addFilter(filter1);
Filter filter2 = new SingleColumnValueFilter(Bytes.toBytes("fam2"),
 Bytes.toBytes("VALUE2"), CompareOp.DOES_NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
filter2.setFilterIfMissin…

Stop accepting new jobs in a hadoop cluster (ACL)

To stop accepting new MR jobs in a hadoop cluster you have to enable ACL's first. If you've done that, you can specify a single character queue ACL (' ' = a space!). Since mapred-queue-acls.xml is polled regularly you can dynamically change the queue in a running system . Useful for ops related work (setting into maintenance, extending / decommission nodes and such things).

Enable ACL's

Edit the config file ($HADOOP/conf/mapred-queue-acls.xml) to fit your needs:

<configuration>
 <property>
   <name>mapred.queue.default.acl-submit-job</name>
   <value>user1,user2,group1,group2,admins</value>
 </property>

 <property>
   <name>mapred.queue.default.acl-administer-jobs</name>
   <value>admins</value>
 </property>

</configuration>

Enable an ACL driven cluster by editing the value of mapred.acls.enabled in conf/mapred-site.xml and setting to true.

Now edit simply the value of mapred.queue.default.…

Access Oozie WebUI after securing your hadoop cluster

To access a Kerberized hadoop environment you need a SPNEGO supporting browser (FF, IE, Chrome) and the client, who runs the browser, need network connectivity to the KDC.

IBM has written up a good tutorial about.

Here some tips:

For Firefox, access the low level configuration page by loading the about:config page. Then go to the network.negotiate-auth.trusted-uris preference and add the hostname or the domain of the web server that is HTTP Kerberos SPNEGO protected (if using multiple domains and hostname use comma to separate them).

For Chrome on Windows try: C:\Users\username\AppData\Local\Google\Chrome\Application\chrome.exe --args --auth-server whitelist="*domain.com" --auto-ssl-client-auth

For IE:
Simply add the Oozie-URL to Intranet Sites. It appears you not only have to make sure ‘Windows Integrated Authentication’ is enabled, but you also have to add the site to the ‘Local Intranet’ sites in IE.

FlumeNG - the evolution

Flume, the decentralized log collector, makes some great progress. Since the project has reached an Apache incubating tier the development on the next generation (NG) has reached a significant level.

Now, what's the advantage of a new flume? Simply the architecture. FlumeNG doesn't need zookeeper anymore, has no master / client concept nor nodes / collectors. Simply run a bunch of agents, connect each of them together and create your own flow. Flume now can run with 20MB heapsize, uses inMemory Channel for flows, can have multiflows, different sinks to one channel and will support a few more sinks as flume =< 1.0.0. But, flumeNG will not longer support tail and tailDir, here a general exec sink is available, which lets the user the freedom to use everything. 
Requirements On the build host we need jre 1.6.x, maven 3.x and git or svn. 
Installation To check out the code we use git and maven in a simple one-line command: git clone git://git.apache.org/flume.git; cd flume; git che…