Posts

Showing posts from May, 2014

Facebook's Presto

In November 2013 Facebook published their Presto engine as open source, available on GitHub. Presto is a distributed, interactive SQL query engine that can run over dozens of modern big data stores based on Apache Hive or Cassandra. Presto comes with a limited JDBC connector and supports Hive 0.13 with Parquet and views.

Installation
Just a few particulars. Presto runs only with Java 7, does not support Kerberos, and has no built-in user authentication either. To protect data that a user should not be able to read, consider using HDFS ACLs / POSIX permissions. The setup of Presto is pretty easy and well documented. Just follow the documentation, use "uuidgen" to generate a unique ID for your Presto node (node.id in node.properties) and add "hive" as a datasource (config.properties: datasources=jmx,hive). I used the user "hive" to start the server with:
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && presto-server-0.68/bin/launcher star…
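
The two configuration settings mentioned above (node.id and datasources) could be written with a few lines of shell. This is only a minimal sketch, not the documented procedure: the directory name presto-etc and the ETC_DIR variable are assumptions made for illustration, and a real setup needs the further properties described in the Presto documentation.

```shell
#!/bin/sh
# Sketch only: writes the two settings named in the text into a scratch directory.
# ETC_DIR is a hypothetical location; point it at your Presto etc/ directory.
ETC_DIR="${ETC_DIR:-./presto-etc}"
mkdir -p "$ETC_DIR"

# Generate a unique ID for this node. If uuidgen is not installed,
# fall back to the Linux kernel's UUID source.
NODE_ID="$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)"

# node.properties: set node.id only if it is not already present,
# so an existing node keeps its ID across runs.
grep -q '^node\.id=' "$ETC_DIR/node.properties" 2>/dev/null \
  || printf 'node.id=%s\n' "$NODE_ID" >> "$ETC_DIR/node.properties"

# config.properties: register JMX and Hive as datasources (value taken from the text).
grep -q '^datasources=' "$ETC_DIR/config.properties" 2>/dev/null \
  || printf 'datasources=jmx,hive\n' >> "$ETC_DIR/config.properties"
```

Running the script twice leaves both files unchanged, which matters because the node ID should stay the same for the lifetime of a node.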

Cloudera Manager fails to upgrade Sqoop2 when parcels are enabled

Cloudera Manager fails to update the generic Sqoop2 connectors when parcels are enabled, and the Sqoop2 server no longer starts. The logs show an error like this:

Caused by: org.apache.sqoop.common.SqoopException: JDBCREPO_0026:Upgrade required but not allowed - Connector: generic-jdbc-connector

This issue can be fixed by adding two properties to the service safety valve of Sqoop:

org.apache.sqoop.connector.autoupgrade=true
org.apache.sqoop.framework.autoupgrade=true

This happens because Cloudera Manager does not automatically upgrade the default Sqoop connectors. After the properties are added, the Sqoop server should be able to update the connectors and will start successfully.
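
To confirm that a failed start is really caused by this error before changing the safety valve, the server log can be searched for the error code. The following is a rough sketch: the function name needs_autoupgrade and the SQOOP_LOG variable with its default path are hypothetical, so point it at wherever your Sqoop2 server actually logs.

```shell
#!/bin/sh
# needs_autoupgrade LOGFILE
# Succeeds (exit 0) if the log contains the JDBCREPO_0026 upgrade error.
needs_autoupgrade() {
  grep -q 'JDBCREPO_0026' "$1"
}

# Hypothetical log location; adjust to your installation.
LOG="${SQOOP_LOG:-sqoop.log}"

if [ -f "$LOG" ] && needs_autoupgrade "$LOG"; then
  echo "Upgrade error found. Add these lines to the Sqoop2 safety valve:"
  echo "org.apache.sqoop.connector.autoupgrade=true"
  echo "org.apache.sqoop.framework.autoupgrade=true"
fi
```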

Test: HDP 2.1 and Ambari 1.5.1

As part of a series of analyses, I compare the different distributions here using a fairly simple procedure. My main focus is on how simple and fast it is to install a cluster; I briefly cover technical differences and peculiarities for each distribution.

Preparations
The base is a fresh CentOS 6.5 in an Oracle VirtualBox VM with 6 GB memory, 4 CPUs and a 100 GB HDD. Windows serves as the host system, simply because Windows is what is usually installed on office machines.
Since Ambari released version 1.5.1 only two weeks ago, I start with it. Installing the corresponding packages is described thoroughly in the documentation. Once the Ambari server has been started, logging in to the web console at http://FQHN:8080 works without problems.
It is important that the servers to be installed are reachable via DNS lookup. In the case of the VM this posed a minor p…

The Forrester Wave (Or: We're all the leaders)

Image
In February 2014 Forrester Research, an independent market research firm, released its quarterly report, The Forrester Wave: Big Data Hadoop Solutions, Q1 2014 [1]. The report shows this graphic, and it looks as if all major, minor and non-Hadoop vendors think they lead. That is rather funny when you follow the mainstream press news.

IBM [5] thinks it leads, Hortonworks [4] claims the leadership too, MapR [3] leads as well, and Teradata is the true leader (they say) [6]. Cloudera [2] ignores the report. The moral: all of the named companies are in the leader area, but nobody leads.


Anyway, let us take a quick look at the "Big Three": Cloudera, MapR, Hortonworks.

The three major Hadoop firms (Hortonworks, MapR, Cloudera) are in nearly the same position. Each distribution has its own sweet spot, which lets customers decide which one fits best. And that is the most important point: the customer wins, not the marketing noise.

Cloudera [2] depends on Apache Hadoop, has Cloudera Man…