Posts

Showing posts from 2016

Hue 3.11 with HDP 2.5

This works fine with CentOS / RHEL; I used 6.8 in this case. EPEL has to be available; if it is not, install the repo.
And I ask myself why Hortonworks didn't integrate Hue v3 into their HDP release. Hue v2 is older than old and lacks a lot of functionality.
Anyhow, let's get to work.

sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo

sudo yum install ant gcc krb5-devel mysql mysql-devel openssl-devel cyrus-sasl-devel cyrus-sasl-gssapi sqlite-devel libtidy libxml2-devel libxslt-devel openldap-devel python-devel python-simplejson python-setuptools rsync gcc-c++ saslwrapper-devel libffi-devel gmp-devel apache-maven

sudo mkdir /software && sudo chown hue: /software && cd /software
wget https://github.com/cloudera/hue/archive/master.zip -O hue.zip && unzip hue.zip && cd hue-master && sudo mkdir -p /usr/local/hue && sudo chown -R hue: /usr/local/hue && make install


HDP config changes:
Oozie => …

Erase HDP 2.x and Ambari

Since I now often hack on Hortonworks HDP, I also often need to completely clean out my lab environments to get fresh boxes. I figured that writing an ugly shell script is more comfortable than bothering my infra guys to reset the VMs in Azure, which would also reset all my modifications. Bad!
Anyhow, here's the script in case anyone else has some use for it.

https://github.com/alo-alt/shell/blob/master/rmhdp.bash

As usual, first stop all Ambari-managed services. I remove Postgres too, since letting the Ambari installer set up a new database is much faster than dealing with inconsistencies later.
Side note: the script is made for RHEL-based distributions ;)

FreeIPA and Hadoop Distributions (HDP / CDH)

FreeIPA is the tool of choice when it comes to implementing a security architecture from scratch today. I don't need to praise the advantages of FreeIPA, it speaks for itself. It's the Swiss Army knife of user authentication, authorization and compliance.

To integrate FreeIPA into Hadoop distributions like Hortonworks' HDP and Cloudera's CDH, some tweaks are necessary, but the outcome is worth it. I assume that the FreeIPA server setup is done and the client tools are distributed. If not, the guide from Hortonworks includes those steps, too.

For Hortonworks, nothing more than the link to the documentation is necessary:
https://community.hortonworks.com/articles/59645/ambari-24-kerberos-with-freeipa.html

Ambari 2.4.x includes FreeIPA support (AMBARI-6432); it is experimental, but it works as promised. The setup and rollout are pretty simple and run smoothly via the wizard.

For Cloudera it takes a bit more manual work, but in the end it also works perfectly and is well integrated, but not…

Shifting paradigms in the world of BigData

In building the next generation of applications, companies and stakeholders need to adopt new paradigms. The need for this shift is predicated on the fundamental belief that building a new application at scale requires tailored solutions to that application’s unique challenges, business model and ROI. Some things change, and I’d like to point to some of those changes.

Event Driven vs. CRUD
Software development is traditionally driven by entity-relationship modeling and CRUD operations on that data. The modern world isn’t about data at rest, it’s about being responsive to events in flight. This doesn’t mean that you don’t have data at rest, but that this data shouldn’t be organized in silos.
The traditional CRUD model is neither expressive nor responsive, given the countless data sources available. All data is structured somehow, but an RDBMS isn't able to store and work with data whose schema isn't known in advance (schema on write). That makes the use of additional free av…

Cloudera Manager and Slack

Most of us are getting bored of receiving hundreds of monitoring emails every day. To master the flood, mail rules come into play, and with those rules the interest in email communication drops.
To master the internal information flood, business messaging networks like Slack are gaining more and more ground.

To make CM work with Slack, a custom alert script from my GitHub will do the trick:

https://github.com/alo-alt/Slack/blob/master/cm2slack.py

The use is pretty straightforward: create a channel in Slack, enable webhooks, place the token into the script, store the script on your Cloudera Manager host, make it executable for cloudera-scm, and enable outgoing firewall / proxy rules to let the script talk to Slack's API. The script can handle proxy connections, too.

In Cloudera Manager, the script path needs to be added into Cloudera-Management-Service => Configuration => Alert Publisher => Custom Script.
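
To give an idea of what such a custom alert script does, here is a minimal sketch in Python. It is not the cm2slack.py script linked above. It assumes that the Alert Publisher calls the script with one argument, the path to a JSON file holding a list of alerts, and that each alert carries its text under body => alert => content. The webhook URL is a placeholder, and the names build_payload and post_to_slack are made up for this example.

```python
import json
import sys
import urllib.request

# Placeholder: replace with the incoming webhook URL of your own Slack channel.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_payload(alerts):
    """Turn a list of alerts into a Slack message with one line per alert.

    Assumption: the alert text lives under body -> alert -> content.
    """
    lines = []
    for entry in alerts:
        alert = entry.get("body", {}).get("alert", {})
        lines.append(alert.get("content", "(no content)"))
    return {"text": "\n".join(lines)}


def post_to_slack(payload, url=WEBHOOK_URL):
    """POST the payload as JSON to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def main(alert_file):
    # Read the alert file handed over as the first argument and forward it.
    with open(alert_file, encoding="utf-8") as handle:
        alerts = json.load(handle)
    post_to_slack(build_payload(alerts))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Proxy handling, retries and filtering by alert severity are left out here to keep the sketch short.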



Manage rights in OpenStack

Most users think that OpenStack lacks sophisticated rights management. But that's not the case; role management is available in OpenStack.
First, users and groups need to be added to projects; this can be done via CLI or GUI [1]. Let's say a group called devops shall have full control over OpenStack, while others not in that group get dedicated operational access such as creating snapshots, stopping / starting / restarting an instance or looking at the floating IP pool.

Users, Groups and Policies
OpenStack handles the rights in a policy file, /etc/nova/policy.json, using role definitions per group that are assigned to all tasks OpenStack provides. It looks like:

{
"context_is_admin": "role:admin",
"admin_or_owner": "is_admin:True or project_id:%(project_id)s",
"default": "rule:admin_or_owner",...
}

It describes the default: a member of a project is the admin of that project. Additional rules have to be defined here.
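
To make the rule syntax a bit more tangible, here is a small Python sketch that evaluates the three rules shown above. It is a strongly simplified illustration and not OpenStack's real policy engine: it only understands "or", "rule:", "role:" and plain key:value checks, and the function names and the creds / target dictionaries are assumptions made for this example.

```python
RULES = {
    "context_is_admin": "role:admin",
    "admin_or_owner": "is_admin:True or project_id:%(project_id)s",
    "default": "rule:admin_or_owner",
}


def evaluate(expression, rules, creds, target):
    """Return True if any 'or'-separated check in the expression matches."""
    for check in expression.split(" or "):
        kind, _, match = check.strip().partition(":")
        if kind == "rule":
            # Reference to another named rule: evaluate it recursively.
            if evaluate(rules[match], rules, creds, target):
                return True
        elif kind == "role":
            # The caller must hold the named role.
            if match.lower() in [role.lower() for role in creds.get("roles", [])]:
                return True
        else:
            # Generic check: compare a credential with a value, where
            # %(name)s placeholders are filled in from the target object.
            if str(creds.get(kind)) == match % target:
                return True
    return False


def is_allowed(action, rules, creds, target):
    """Look up the rule for an action, falling back to the 'default' rule."""
    return evaluate(rules.get(action, rules["default"]), rules, creds, target)


member = {"roles": ["member"], "project_id": "p1", "is_admin": False}
print(is_allowed("default", RULES, member, {"project_id": "p1"}))  # own project
print(is_allowed("default", RULES, member, {"project_id": "p2"}))  # foreign project
```

The first call prints True because the member's project matches the target's project, the second prints False because neither part of admin_or_owner matches.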
In my …

Deal with corrupted messages in Apache Kafka

Under some strange circumstances a message in a Kafka topic can become corrupted. This often happens when 3rd-party frameworks are used together with Kafka. Additionally, Kafka < 0.9 takes no lock in Log.read() on the consumer read path, but does take a lock in Log.write(). This can cause a rare race condition, as described in KAFKA-2477 [1]. A log entry then typically looks like:

ERROR Error processing message, stopping consumer: (kafka.tools.ConsoleConsumer$) kafka.message.InvalidMessageException: Message is corrupt (stored crc = xxxxxxxxxx, computed crc = yyyyyyyyyy)
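
The "stored crc" and "computed crc" in that log line come from a checksum comparison: a CRC is stored alongside the message, and the reader recomputes it over the bytes it actually received. The following Python sketch shows the principle only. The layout used here (a 4-byte CRC32 followed by the payload) is an assumption for illustration and not Kafka's real message format, and the function names are made up.

```python
import struct
import zlib


def pack_message(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 as a 4-byte big-endian integer."""
    return struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF) + payload


def is_corrupt(message: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored one."""
    (stored_crc,) = struct.unpack(">I", message[:4])
    computed_crc = zlib.crc32(message[4:]) & 0xFFFFFFFF
    return stored_crc != computed_crc


good = pack_message(b"hello kafka")
# Flip a single bit in the last payload byte to simulate corruption.
bad = good[:-1] + bytes([good[-1] ^ 0x01])

print(is_corrupt(good))  # False
print(is_corrupt(bad))   # True
```
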
Kafka-Tools
Kafka stores the offset of every consumer in Zookeeper. To read out the offsets, Kafka provides handy tools [2]. But zkCli.sh can also be used, at least to display the consumers and the stored offsets. First we need to find the consumer for a topic (Kafka 0.9 and later):

bin/kafka-consumer-groups.sh --zookeeper management01:2181 --describe --group test

Prior to Kafka 0.9 the only possibility to get this information wa…

Encryption in HDFS

Encryption of data was and is the hottest topic in terms of data protection and theft prevention. Hadoop HDFS supports fully transparent encryption in transit and at rest [1], based on Kerberos implementations [2], often used within multiple trusted Kerberos domains.

Technology
Hadoop KMS provides a REST API with built-in SPNEGO and HTTPS support, and mostly comes bundled with a pre-configured Apache Tomcat within your preferred Hadoop distribution. To make encryption transparent for the user and the system, each encryption zone is associated with a SEZK (single encryption zone key), created when the zone is defined as an encryption zone through interaction between the NN and the KMS. Each file within that zone has its own DEK (data encryption key). This behavior is fully transparent, since the NN directly asks the KMS for a new EDEK (encrypted data encryption key), encrypted with the zone's key, and adds it to the file’s metadata when a new file is created.
When a client wants to re…

Open Source based Hyper-Converged Infrastructures and Hadoop

According to a report from SimpliVity [1], Hyper-Converged Infrastructures are used by more than 50% of the interviewed businesses, and the trend is rising. But what does this mean for BigData solutions, and Hadoop especially? What tools and technologies can be used, and what are the limitations and the gains of such a solution?

To build a production-ready and reliable private cloud that supports both on-demand and static Hadoop clusters, I have had great experiences with OpenStack, SaltStack and the Sahara plugin for OpenStack.
OpenStack supports Hadoop on demand via Sahara; it is also convenient to use VMs and install a Hadoop distribution within them, especially for static clusters with special setups. The OpenStack project provides ready-to-go images at [2], for example for Vanilla 2.7.1 based Hadoop installations. As an additional benefit, OpenStack supports Docker [3], which adds another layer of flexibility for additional services like Kafka [4] or SolR [5].

Costs and I…

SolR, NiFi, Twitter and CDH 5.7

Since most of the interesting Apache NiFi material comes from the ASF [1] or Hortonworks [2], I thought I'd use CDH 5.7 and do the same, just out of curiosity. Here is my 30-minute playground, currently running in Google Compute Engine.

On one of my playground nodes I installed Apache NiFi with:
mkdir /software && cd /software && wget http://mirror.23media.de/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz && tar xvfz nifi-0.6.1-bin.tar.gz

Then I set only the nifi.sensitive.props.key property in conf/nifi.properties to an easy-to-remember secret. Next, bash /software/nifi-0.6.1/bin/nifi.sh install installs Apache NiFi as a service. After logging in to Apache NiFi's web UI, download and add the template [3] to Apache NiFi, move the template icon to the drawer, open it and edit the Twitter credentials to fit your developer account.
To use a schema-less SolR index (or Cloudera Search in CDH) I copied some example files over into a local directory: cp -r /opt/cloudera/parcel…

Apache Tez on CDH 5.4.x

Since Cloudera doesn't support Tez in their distribution right now (but it'll come, I'm pretty confident), we experimented a bit with Apache Tez and CDH 5.4.
Using Tez with CDH isn't hard, and it works quite well. Our ETL and Hive jobs finished around 30 - 50% faster.

Anyway, here is the blueprint. We use CentOS 6.7 with the EPEL repo.

1. Install Maven 3.2.5
wget http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
tar xvfz apache-maven-3.2.5-bin.tar.gz -C /usr/local/
cd /usr/local/
ln -s apache-maven-3.2.5 maven

=> Compiling Tez with protobuf worked only with 3.2.5 in my case

1.1 Install JDK 8u40
mkdir development && cd development  # that's my dev root

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u40-b26/jdk-8u40-linux-x64.tar.gz"
tar xvfz jdk-8u40-linux-x64.tar.gz
ex…