Posts

Showing posts from 2014

The Machine und BigData

HP's Projekt "The Machine" verfolge ich nun schon seit der ersten Veröffentlichung. Bis 2020 soll das Projekt industriereif, ab 2018 sollen erste Edge Devices verfügbar sein. Ob es die Welt der Informatik revolutioniert, bleibt abzuwarten. Extrem interessant ist der Ansatz von HP auf jeden Fall, vor allem im Hinblick auf BigData und die weitere Industrialisierung analytischer Ansätze.

Genutzt wird die Memristor-Technologie (http://en.wikipedia.org/wiki/Memristor). Ein Memristor ist nicht flüchtig, bisher langsamer als DRAM, aber bis zu 100 mal schneller als Flash. Und es kann in relativ kleinen Rackfarmen extrem viel Speicherplatz bereitgestellt werden (4TB haben derzeit den 3.5 Zoll Formfaktor, es könnten aber auch bis zu 100TB per 3.5 Zoll machbar sein).
Das grundlegend neue dabei ist – ein Memristor kann bis zu 10 Zustände speichern (Trinity Memristor). Hierbei wird der Integer auf Basis von 10 berechnet, was im Gegensatz zur herkömmlichen Basis von 8 (64Bit) ein Spei…

Hadoop server performance tuning

To tune a Hadoop cluster from a DevOps perspective needs an understanding of the kernel principles and linux. The following article will describe the most important parameters together with tricks for an optimal tuning.

Memory Typically modern Linux systems (Linux 2.6 +) use swapping to avoid OOM (Out of Memory) to protect the system from kernel freezes. But Hadoop uses Java, and typically Java is configured with MAXHEAPSIZE per service (HDFS, HBase, Zookeeper etc). The configuration has to match the available memory in the system. A common formula for MapReduce1:
TOTAL_MEMORY = (Mappers + Reducers) * CHILD_TASK_HEAP + TT_HEAP + DN_HEAP + RS_HEAP + OTHER_SERVICES_HEAP + 3GB (for OS and caches)

For MapReduce2 YARN takes care about the resources, but only for services which are running as YARN Applications. [1], [2]

Disable swappiness is done one the fly per
echo 0 > /proc/sys/vm/swappiness

and persistent after reboots per sysctl.conf:
echo “vm.swappiness = 0” >> /etc/sysctl.conf…

Switch to HiveServer2 and Beeline

In Hive 0.11 HiveServer2 [2] was introduced, its time to switch from the old Hive CLI to the modern version. Why?
First, security [1]. Hive CLI bypasses the Apache HiveServer2 and calls a MR job directly. This behavior compromises any security projects like Apache Sentry [3]. With HiveServer2 the Kerberos impersonation brings fine granulated security down to HiveSQL. Its possible to enable a strong security layer with Kerberos, Apache Sentry [3] and Apache HDFS ACL [4], like other DWHs have.
Second, HiveServer2 brings connection concurrency to Hive. This allows multiple connections from different users and clients per JDBC (remote and per Beeline) over Thrift.
Third, the Hive CLI command could be deprecated in the future, this is discussed within the Hive Developer Community.

For the first steps a beeline connection can be established per

beeline -u jdbc:hive2://<SERVER>:<PORT>/<DB> -n USERNAME -p PASSWORD

The URI describes the JDBC connection string, followed by the …

XAttr are coming to HDFS

HDFS 2006 [1] describes the use of Extended Attributes. XAttr, known from *NIX Operating Systems, connects physically stored data with describing metadata above the strictly defined attributes by the filesystem. Mostly used to provide additional information, like hash, checksum, encoding or security relevant information like signature or author / creator.
According to the source code [2] the use of xattr can be configured by dfs.namenode.fs-limits.max-xattrs-per-inode and dfs.namenode.fs-limits.max-xattr-size in hdfs-default.xml. The default for dfs.namenode.fs-limits.max-xattrs-per-inode is 32, for dfs.namenode.fs-limits.max-xattr-size the default is 16384.
Within HDFS, the extended user attributes will be stored in the user namespace as an identifier.The identifier has four namespaces, like the Linux FS kernel implementation has: security, system, trusted and user. Only the superuser can access the trusted namespaces (system and security). The xattr definitions are free and can be i…

Cloudera + Intel + Dell = ?

Wie Clouderain einer Pressemitteilung [1] veröffentlichte, kommt nach dem Intel-Investment [2] nun der Schulterschluss mit Dell. Hier meine Meinung dazu.

Seit Jahren versprechen Analysten Wachstumsraten im hohen zweistelligen Prozentbereich bis 2020 [3], schlussendlich ist es nur logisch das Intel über den augenblicklichen Platzhirsch Cloudera in das "BigData Business" investiert, nachdem augenscheinlich die eigene Distribution nicht so erfolgreich war als gehofft. Zudem erkauft sich Intel hier einen bedeutenden Einfluss auf das Hadoop Projekt. Neben Hortonworks ist Cloudera einer der bedeutendsten Committer des gesamten Ecosystems.
Der Einfluss Intels beginnt bei Kryptographie (Rhino) [4], weitere Möglichkeiten wären optimierter Bytecode für Intel CPU's in Impala / Spark, Advanced Networking Features im Hadoop Core (IPv6) oder die Unterstützung proprietärer Lösungen Intels, die nur in CDH verfügbar sein werden. Da Cloudera in nahezu allen relevanten Projekten des Apache…

Remove HDP and Ambari completely

Its a bit hard to remove HDP and Ambari completely - so I share my removal script here. Works for me perfect, just adjust the HDFS directory. In my case it was /hadoop
#!/bin/bash
echo "==> Stop Ambari and Hue"
ambari-server stop && ambari-agent stop
/etc/init.d/hue stop
sleep 10
echo "==> Erase HDP and Ambari completely"
yum -y erase ambari-agent ambari-server ambari-log4j hadoop libconfuse nagios ganglia sqoop hcatalog\* hive\* hbase\* zookeeper\* oozie\* pig\* snappy\* hadoop-lzo\* knox\* hadoop\* storm\* hue\*
# remove configs
rm -rf /var/lib/ambari-*/keys /etc/hadoop/ /etc/hive /etc/hbase/ /etc/oozie/ /etc/zookeeper/ /etc/falcon/ /etc/ambari-* /etc/hue/
# remove ambaris default hdfs dir
rm -rf /hadoop
# remove the repos
echo "==> Remove HDP and Ambari Repo"
rm -rf /etc/yum.repos.d/HDP.repo /etc/yum.repos.d/ambari.repo
# delete all HDP related users
echo "==> Delete the user accounts"
userdel -f hdfs && userdel -f sqoop &&…

Facebook's Presto

In November 2013 Facebook publishedtheir Presto engine as Open Source, available at GitHub. Presto is a distributed interactive SQL query engine, able to run over dozens of modern BigData stores, based on Apache Hive or Cassandra. Presto comes with a limited JDBC Connector, supports Hive 0.13 with Parquet and Views.

Installation Just a few specialties. Presto runs only with Java7, does not support Kerberos and does not have built-in user authentication, neither. To protect data a user should not be able to read, the use of HDFS Acl's / POSIX permissions should be considered. The setup of Presto is pretty easy and well documented. Just follow the documentation, use "uuidgen" to generate a unique ID for your Presto Node (node.id in node.properties) and add "hive" as datasource (config.properties: datasources=jmx,hive). I used user "hive" to start the server with:
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && presto-server-0.68/bin/launcher star…

Cloudera Manager fails to upgrade Sqoop2 when parcels are enabled

Cloudera Manager fails to update the generic Sqoop2 connectors when parcels are enabled, and the Sqoop2 server won't start anymore. In the logs a error like:

Caused by: org.apache.sqoop.common.SqoopException: JDBCREPO_0026:Upgrade required but not allowed - Connector: generic-jdbc-connector

is shown.
This issue can be fixed by adding two properties into the service safety valve of sqoop:

org.apache.sqoop.connector.autoupgrade=true
org.apache.sqoop.framework.autoupgrade=true

This happen trough the missing autoupdate of the default sqoop connectors in Cloudera Manager. After the properties are added, SqoopServer should be able to update the drivers and will start sucessfully.

Test: HDP 2.1 und Ambari 1.5.1

Image
Im Rahmen einiger Analysen stelle ich hier die verschiedenen Distributionen in einem recht einfachen Verfahren gegenüber. Es kommt mir hierbei vor allem auf die Einfachheit und Schnelligkeit der Installation eines Clusters an, auf technischen Differenzen und Besonderheiten gehe ich jeweils kurz ein.

Vorbereitungen
Als Basis dient ein frisches CentOS 6.5 in einem Oracle VirtualBox VM Container, 6GB Memory, 4 CPU und 100 GB HDD. Als Gastsystem kommt Windows zum Einsatz - einfach weil Windows üblicherweise auf Bürorechnern installiert ist.
Da Ambari erst vor 2 Wochen die Version 1.5.1 veröffentlicht hat, starte ich mit hiermit. Das Einspielen der entprechenden Pakete ist hinlänglich und ausführlich in der Dokumentation beschrieben. Nachdem der Ambari Server gestartet wurde ist ein problemloses Einloggen auf der Webkonsole per http://FQHN:8080 möglich.
Wichtig ist hierbei, das die zu installierenden Server per DNS lookup erreichbar sind. Im Falle der VM stellte dies ein geringfügiges P…

The Forrester Wave (Or: We're all the leaders)

Image
Forrester Research, an independent market research firm, released in February 2014 the quarterly Forrester Wave Big Data Hadoop Solutions, Q1 2014 Report [1]. The report shows this graphic, and it looks like that all major, minor and non-hadoop Vendors think they lead. It looks really funny when you follow the mainstream press news.

IBM [5] think they lead, Hortonworks [4] claim the leadership too, MapR [3]leads too, Teradata is the true leader (they say) [6]. Cloudera [2] ignores the report. The metapher is - all of the named companies are in the leader area, but nobody leads.


Anyway, let us do a quick overview about the "Big Three" - Cloudera, MapR, Hortonworks.

The 3 major Hadoop firms (Horton, MapR, Cloudera) are nearly in the same position. All distributions have the sweet piece, which lets the customer decide which one fits most. And that is the most important point - the customer wins. Not the marketing noise.

Cloudera [2] depends on Apache Hadoop, has Cloudera Man…