Monday, February 16, 2015

Hadoop and trusted MITv5 Kerberos with Active Directory

Here is a current example of how to enable an MITv5 Kerberos <=> Active Directory trust from scratch. It should work out of the box; just replace the realms:

HADOOP1.INTERNAL = local server (KDC)
ALO.LOCAL = local kerberos realm
AD.REMOTE = AD realm

with your servers. The KDC should be inside your Hadoop network; the remote AD can be anywhere.

1. Install the bits

At the KDC server (CentOS/RHEL; other OS's ship nearly the same bits):
yum install krb5-server krb5-libs krb5-workstation -y

At the clients (hadoop nodes):
yum install krb5-libs krb5-workstation -y

Install Java's JCE unlimited-strength policy files (see the Oracle documentation) on all Hadoop nodes.

2. Configure your local KDC


Edit /etc/krb5.conf on the KDC and distribute it to all nodes:

[libdefaults]
 default_realm = ALO.LOCAL
 dns_lookup_realm = false
 dns_lookup_kdc = false
 kdc_timesync = 1
 ccache_type = 4
 forwardable = true
 proxiable = true
 fcc-mit-ticketflags = true
 max_life = 1d
 max_renewable_life = 7d
 renew_lifetime = 7d
 default_tgs_enctypes = aes128-cts arcfour-hmac
 default_tkt_enctypes = aes128-cts arcfour-hmac

[realms]
 ALO.LOCAL = {
  kdc = hadoop1.internal:88
  admin_server = hadoop1.internal:749
  max_life = 1d
  max_renewable_life = 7d
 }
 AD.REMOTE = {
  kdc = ad.remote.internal:88
  admin_server = ad.remote.internal:749
  max_life = 1d
  max_renewable_life = 7d
 }

[domain_realm]
 alo.local = ALO.LOCAL
 .alo.local = ALO.LOCAL
 ad.remote = AD.REMOTE
 .ad.remote = AD.REMOTE

[logging]
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmin.log
 default = FILE:/var/log/krb5lib.log

In /var/kerberos/krb5kdc/kdc.conf:

[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 ALO.LOCAL = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }

And in /var/kerberos/krb5kdc/kadm5.acl:

*/admin@ALO.LOCAL *

Create the realm on your local KDC and start the services
kdb5_util create -s -r ALO.LOCAL
service kadmin restart
service krb5kdc restart
chkconfig kadmin on
chkconfig krb5kdc on

Create the admin principal
kadmin.local -q "addprinc root/admin"

3. Create the MiTv5 trust in AD

Using the Windows shell:
netdom trust ALO.LOCAL /Domain:AD.REMOTE /add /realm /passwordt:passw0rd

=> On Windows 2003 this works, too:
ktpass /ALO.LOCAL /DOMAIN:AD.REMOTE /TrustEncryp aes128-cts arcfour-hmac

=> On Windows 2008 you additionally have to set the supported encryption types (Windows uses its own enctype names):
ksetup /SetEncTypeAttr ALO.LOCAL AES128-CTS-HMAC-SHA1-96 RC4-HMAC-MD5

4. Create the AD trust in MiTv5
kadmin.local: addprinc krbtgt/ALO.LOCAL@AD.REMOTE
password: passw0rd (must be the same password as used in the netdom command above)

5. Configure Hadoop's mapping rules
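The mapping is done per the hadoop.security.auth_to_local property in core-site.xml, which translates principals from the trusted AD realm into local short names. A minimal sketch, assuming the realms above (the exact rules depend on your naming conventions):

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@AD\.REMOTE)s/@.*//
    RULE:[2:$1@$0](.*@AD\.REMOTE)s/@.*//
    DEFAULT
  </value>
</property>
```

The first rule maps user principals (alo.alt@AD.REMOTE => alo.alt), the second maps service principals with two components; DEFAULT handles the local realm.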



Done. Now you should be able to get a ticket from your AD which lets you work with your Hadoop installation:

#> kinit alo.alt@AD.REMOTE
#> klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: alo.alt@AD.REMOTE

Monday, February 9, 2015

Hadoop based SQL engines

Apache Hadoop is moving more and more into the focus of business-critical architectures and applications. Naturally, SQL-based solutions are the first to be considered, but the market is evolving and new tools are coming up that often go unnoticed.

Listed below is an overview of currently available Hadoop-based SQL technologies. The must-haves are:
open source (various contributors), low-latency querying possible, supporting CRUD (mostly!) and statements like CREATE, INSERT INTO, SELECT * FROM (LIMIT ...), UPDATE table SET a1=2 WHERE ..., DELETE, and DROP TABLE.

Apache Hive (SQL-like, with interactive SQL via Stinger)
Apache Drill (ANSI SQL support)
Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Parquet)
Apache Phoenix (built atop Apache HBase, lacks full transaction support, relational operators and some built-in functions)
Presto from Facebook (can query Hive, Cassandra, relational DBs etc. It doesn't seem to be designed for low-latency responses across small clusters or to support UPDATE operations; it is optimized for data warehousing and analytics¹)
VoltDB (ACID compatible, ANSI SQL near 92, low-latency multi query engine)
SQL-Hadoop via MapR community edition (seems to be a packaging of Hive, HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
Apache Kylin from eBay (provides an SQL interface and multi-dimensional analysis [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase, and seems targeted at very large data sets while maintaining low query latency)
Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support [benchmarks against Hive and Impala])
Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")

Non Open Source, but also interesting
Splice Machine (Standard ANSI SQL, Transactional Integrity)
Pivotal Hawq (via Pivotal HD, ANSI SQL 92, 99 and OLAP)
Cloudera Impala (SQL-like, ANSI 92 compliant, MPP and low-latency)
Impala does not use MapReduce, but leverages the cached data of HDFS on each node to return results quickly (without running MapReduce jobs). Thus the overhead of starting a MapReduce job is cut out, and one gains runtime improvements.
Conclusion: Impala does not replace Hive. However, it is good for a different kind of job, such as the small ad-hoc queries business analysts use to analyze data. Robust jobs such as typical ETL tasks, on the other hand, require Hive, due to the fact that the failure of a single job can be very costly.

Thanks to Samuel Marks, who posted this originally on the Hive user mailing list.

Friday, January 9, 2015

Major compact a row key in HBase

Getting a row key via the hbase shell per scan:
hbase(main):001:0> scan 'your_table', {LIMIT => 5}

See what the row contains:
hbase(main):002:0> get 'your_table', "\x00\x01"

To start the compaction based on the row key, use these few lines in the hbase shell and replace <row_key> and <your_table> with the findings above:
hbase(main):003:0> configuration = org.apache.hadoop.hbase.HBaseConfiguration.create
hbase(main):004:0> table = org.apache.hadoop.hbase.client.HTable.new(configuration, '<your_table>')
hbase(main):005:0> regionLocation = table.getRegionLocation("<row_key>")
hbase(main):006:0> regionLocation.getRegionInfo().getRegionName()
hbase(main):007:0> admin = org.apache.hadoop.hbase.client.HBaseAdmin.new(configuration)
hbase(main):008:0> admin.majorCompact(regionLocation.getRegionInfo().getRegionName())

Wednesday, December 3, 2014

The Machine and BigData

I have been following HP's project "The Machine" since its first announcement. The project is supposed to be ready for industrial use by 2020; the first edge devices should be available from 2018. Whether it revolutionizes the world of computing remains to be seen. In any case, HP's approach is extremely interesting, especially with regard to BigData and the further industrialization of analytical approaches.

Memristor technology is used. A memristor is non-volatile, so far slower than DRAM, but up to 100 times faster than flash. And an extreme amount of storage can be provided in relatively small rack farms (4TB currently come in the 3.5-inch form factor, but up to 100TB per 3.5 inch could be feasible).
The fundamentally new part is that a memristor can store up to 10 states (trinity memristor). Here the integer is computed on a base of 10, which, in contrast to the conventional base of 8 (64 bit), allows storage in 20 bits (previously 64 bit). Consequently, with this technology and current fabrication processes, memory would have three times the capacity. Another interesting approach is to replace volatile caches in processors with non-volatile memory, which in turn could revolutionize computing, for example by reusing already-computed subsets in further processing steps, say pattern recognition on existing data and handing the patterns to another thread for MCMC analysis.

Thinking of Spark, The Machine becomes quite attractive, especially since Spark mostly computes only volatile subsets (which are written to a storage medium when needed, to persist the computed spills). And a system like The Machine would also make distributed file systems like HDFS or Ceph largely obsolete, since the overall system (storage, RAM, persistence layer, fast caching) works as one homogeneous, non-volatile block of memory and can keep every already-computed state available per se. These sub-blocks could then be reused, integrated, or extrapolated at will, without losing parts to volatile caches.


Sunday, November 16, 2014

Hadoop server performance tuning

Tuning a Hadoop cluster from a DevOps perspective requires an understanding of kernel principles and Linux. The following article describes the most important parameters, together with tricks for optimal tuning.

Memory and swapping

Typically, modern Linux systems (kernel 2.6+) use swapping to avoid OOM (out of memory) kills and to protect the system from kernel freezes. But Hadoop uses Java, and typically each Java service (HDFS, HBase, ZooKeeper etc.) is configured with a maximum heap size. The configuration has to match the available memory in the system. A common formula for MapReduce v1:
TOTAL_MEMORY = (Mappers + Reducers) * CHILD_TASK_HEAP + TT_HEAP + DN_HEAP + RS_HEAP + OTHER_SERVICES_HEAP + 3GB (for OS and caches)
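As a sketch, the formula works out like this; the heap sizes below are made-up example values, not recommendations:

```shell
# Example sizing for one worker node, all values in GB (assumptions, not defaults)
MAPPERS=10; REDUCERS=6
CHILD_TASK_HEAP=1     # per-task -Xmx
TT_HEAP=1             # TaskTracker
DN_HEAP=1             # DataNode
RS_HEAP=8             # HBase RegionServer
OTHER_SERVICES_HEAP=2
OS_AND_CACHES=3       # reserve for OS and caches

TOTAL_MEMORY=$(( (MAPPERS + REDUCERS) * CHILD_TASK_HEAP + TT_HEAP + DN_HEAP + RS_HEAP + OTHER_SERVICES_HEAP + OS_AND_CACHES ))
echo "${TOTAL_MEMORY} GB RAM needed"   # 31 GB for this example
```

If the sum exceeds the physical RAM of the node, reduce the mapper/reducer slots or the per-task heap.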

For MapReduce v2, YARN takes care of the resources, but only for services running as YARN applications. [1], [2]

Disabling swappiness is done on the fly per
echo 0 > /proc/sys/vm/swappiness

and persistent after reboots per sysctl.conf:
echo "vm.swappiness = 0" >> /etc/sysctl.conf

Additionally, Red Hat implemented THP (transparent huge pages) in kernel 2.6.39. Disabling THP can reduce the I/O overhead of I/O-bound applications on Linux systems by up to 30%, so it's highly recommended to disable it:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
(On RHEL/CentOS the sysfs path is /sys/kernel/mm/redhat_transparent_hugepage instead.)

To do that at boot time automatically, I used /etc/rc.local:
if test -f /sys/kernel/mm/redhat_transparent_hugepage/enabled; then
  echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/redhat_transparent_hugepage/defrag; then
  echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
fi

Another nice tuning trick is the use of vm.overcommit_memory. This switch enables overcommitting of virtual memory. Mostly, virtual memory consists of sparse arrays of zero pages, as when the memory for a JVM is allocated. In most circumstances these pages contain no data, and the allocated memory can be reused (overcommitted) by other pages. With this switch, the OS assumes that enough memory is always available to back the virtual pages.

This feature can be configured at runtime per:
sysctl -w vm.overcommit_memory=1
sysctl -w vm.overcommit_ratio=50

and permanently per /etc/sysctl.conf.
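For persistence, the corresponding /etc/sysctl.conf entries would be:

```
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
```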

Network / Sockets

On highly used and large Linux-based clusters the default socket and network configuration can slow down some operations. This section covers some of the possibilities I figured out over the years. But please use them with care, since they affect the network communication.

First of all, enable the whole available port range of max available sockets:
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

Additionally, enabling fast recycling of sockets avoids large TIME_WAIT queues. Reusing the sockets for new connections can speed up the network communication. This is to be used with caution and depends highly on the network stack and the jobs running within your cluster. The performance can grow, but can also drop dramatically, since we now fast-recycle connections in the TIME_WAIT state. Typically used in clusters with a high ingest rate, like HBase, Storm, Kafka etc.
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1

Likewise, the network buffer backlog can overfill. In this case new connections get dropped or deleted, which leads to performance issues. Raising the buffers up to 16MB per socket is mostly enough, together with raising the number of outstanding SYN requests and backlog sockets.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.core.somaxconn=1024

=> Remember, this is not a generic tuning trick. On general-purpose clusters, playing around with the network stack is not safe at all.

Disk / Filesystem / File descriptors

Linux tracks the file access time, and that means a lot more disk seeks. But HDFS writes once and reads many times, and the NameNode tracks the access time itself. Hadoop doesn't need to track the access time on the OS level, so it's safe to disable it per data disk via the mount options:
/dev/sdc /data01 ext3 defaults,noatime 0 0

Eliminate the root-reserved space on partitions. By default, EXT3/4 reserves 5% of each disk for root, which means the system has a lot of unused space. Disable the root-reserved space on Hadoop disks:
mkfs.ext3 -m 0 /dev/sdc

If the filesystem already exists, this can be done afterwards per
tune2fs -m 0 /dev/sdc

An optimal server has one HDFS mount point per disk, and one or two dedicated disks for logs and the operating system.

File handler and processes

Typically, a Linux system ships with very conservative file handle limits. Mostly these limits are enough for small application servers, but not for Hadoop. When the file handles run short, Hadoop reports it per "Too many open files". To avoid that, raise the limits:
echo "hdfs - nofile 32768" >> /etc/security/limits.conf
echo "mapred - nofile 32768" >> /etc/security/limits.conf
echo "hbase - nofile 32768" >> /etc/security/limits.conf

Additionally, raise the max number of processes, too:
echo "hdfs - nproc 32768" >> /etc/security/limits.conf
echo "mapred - nproc 32768" >> /etc/security/limits.conf
echo "hbase - nproc 32768" >> /etc/security/limits.conf
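After a re-login, the effective limits can be double-checked from a shell running as the service user; a quick sketch:

```shell
# Show the limits the current shell actually got
open_files=$(ulimit -n)
max_procs=$(ulimit -u)
echo "open files: ${open_files}, max user processes: ${max_procs}"
```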

DNS / Name resolution

The communication of Hadoop's ecosystem depends highly on a correct DNS resolution. Typically, name resolution is configured per /etc/hosts. It is important that the canonical name (the first hostname after the IP address) is the FQDN of the server.
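A minimal /etc/hosts sketch with one namenode and two datanodes (the IPs and hostnames are made-up placeholders); the canonical FQDN comes first, aliases after:

```
127.0.0.1     localhost
192.168.1.10  namenode1.alo.local   namenode1
192.168.1.11  datanode1.alo.local   datanode1
192.168.1.12  datanode2.alo.local   datanode2
```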

If DNS is used, the system's hostname must match the FQDN in forward as well as reverse name resolution. To reduce the latency of DNS lookups, use the name service caching daemon (nscd), but don't cache passwd, group, or netgroup information.
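In /etc/nscd.conf that means enabling only the hosts cache; a sketch (defaults vary per distribution):

```
enable-cache  passwd  no
enable-cache  group   no
enable-cache  hosts   yes
```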

There are also a lot of specific tuning tricks within the Hadoop Ecosystem, which will be discussed in one of the following articles.