Posts

Showing posts from 2015

Build Maven-based RPMs

In a day-to-day DevOps world, you need an easy-to-use mechanism for versioned software deployment, especially once continuous integration comes into play and software has to be installed, upgraded and removed in a simple, proven way.
Why not use RPM for that? The great thing is that Maven can build RPMs easily.

Prerequisites:
Eclipse (or IntelliJ or any other editor)
Maven (the "mvn" command has to work on the command line)
git (the "git" command should work on the command line)
rpm-build (sudo yum install rpm-build)
Building an RPM works on Linux systems such as Red Hat or CentOS.
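The prerequisites above can be checked in one go before starting the build. The following is only a sketch: the helper name missing_tools is made up for illustration, and it assumes that the rpm-build package provides the rpmbuild command.

```shell
# Print every tool from the argument list that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of Maven or RPM.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (prints nothing when everything is installed):
#   missing_tools mvn git rpmbuild
```

If the call prints any names, install those tools first.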
Guide: Build the project so that the targets are available locally (skip the tests, e.g. with -DskipTests, if they fail on your PC because of a missing MongoDB, Tomcat or ...).
The following params are necessary to get it working properly:
directory = where the code should be placed after the RPM is rolled out
filemode = permissions for the installed code
username = UID
groupname = GID
location = local location of the project …

Hive on Spark at CDH 5.3

Since Hive on Spark is not (yet) officially supported by Cloudera, some manual steps are required to get Hive on Spark working within CDH 5.3. Please note that there are four important requirements in addition to the hands-on work:
Spark Gateway nodes need to be Hive Gateway nodes as well
If the client configurations are redeployed, you need to copy hive-site.xml again
If CDH is upgraded (including minor patches, which are often applied without you noticing), you need to adjust the class paths
Hive libraries need to be present on all executors (CM should take care of this automatically)
Log in to your Spark server(s) and copy the running hive-site.xml to Spark:

cp /etc/hive/conf/hive-site.xml /etc/spark/conf/
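Because the copy has to be repeated after every redeployment of the client configuration, it can help to wrap it in a small function. This is a minimal sketch that assumes the default paths shown above; the function name sync_hive_site and its optional arguments are illustrative only.

```shell
# Copy the active Hive client configuration into Spark's conf directory.
# sync_hive_site is a hypothetical helper; both paths default to the ones
# used in this post and can be overridden with arguments 1 and 2.
sync_hive_site() {
    src="${1:-/etc/hive/conf/hive-site.xml}"
    dest_dir="${2:-/etc/spark/conf}"

    if [ ! -f "$src" ]; then
        echo "missing source: $src" >&2
        return 1
    fi

    # Skip the copy when Spark already has an identical file.
    if [ -f "$dest_dir/hive-site.xml" ] && cmp -s "$src" "$dest_dir/hive-site.xml"; then
        echo "up to date"
        return 0
    fi

    cp "$src" "$dest_dir/hive-site.xml" && echo "copied"
}
```

Calling sync_hive_site without arguments on a gateway node performs the same copy as the command above, but only when the files differ.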

Start spark-shell with the following command (replace <CDH_VERSION> with your parcel version, e.g. 5.3.2-1.cdh5.3.2.p0.10) and load the Hive context within spark-shell:

spark-shell --master yarn-client --driver-class-path "/opt/cloudera/parcels/CDH-<CDH_VERSION>/lib/hive/l…

Hadoop and trusted MITv5 Kerberos with Active Directory

For reference, here is an example of how to set up a trust between MITv5 Kerberos and Active Directory from scratch. It should work out of the box; just replace the following names:
HADOOP1.INTERNAL = local server (KDC)
ALO.LOCAL = local Kerberos realm
AD.REMOTE = AD realm
with those of your servers. The KDC should be inside your Hadoop network; the remote AD can be anywhere.
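Replacing the three placeholder names by hand in every configuration file is error-prone, so a small filter can do it. This is only a sketch: replace_realms is a hypothetical helper, and it assumes the realm and host names you pass in contain no characters that are special to sed (such as "/" or "&").

```shell
# Read a configuration file on stdin and write it to stdout with the
# placeholder names from this post replaced by your own values.
# Arguments: <local realm> <AD realm> <KDC host>
replace_realms() {
    local_realm="$1"
    ad_realm="$2"
    kdc_host="$3"
    sed -e "s/ALO\.LOCAL/$local_realm/g" \
        -e "s/AD\.REMOTE/$ad_realm/g" \
        -e "s/HADOOP1\.INTERNAL/$kdc_host/g"
}

# Usage (file names and values are placeholders):
#   replace_realms EXAMPLE.COM CORP.EXAMPLE.COM kdc1.example.com \
#       < krb5.conf.template > krb5.conf
```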
1. Install the bits
At the KDC server (CentOS, RHEL; other OSes should have nearly the same bits):
yum install krb5-server krb5-libs krb5-workstation -y

At the clients (Hadoop nodes):
yum install krb5-libs krb5-workstation -y

Install Java's JCE policy files (see the Oracle documentation) on all Hadoop nodes.
2. Configure your local KDC

/etc/krb5.conf

[libdefaults]
default_realm = ALO.LOCAL
dns_lookup_realm = false
dns_lookup_kdc = false
kdc_timesync = 1
ccache_type = 4
forwardable = true
proxiable = true
fcc-mit-ticketflags = true
max_life = 1d
max_renewable_life = 7d
renew_lifetime = 7d
default_tgs_enctypes = aes128-cts arcfour-hmac
default_tkt_…

Hadoop based SQL engines

Apache Hadoop is moving more and more into the focus of business-critical architectures and applications. Naturally, SQL-based solutions are the first to be considered, but the market is evolving and new tools keep appearing, often unnoticed.

Listed below is an overview of the currently available Hadoop-based SQL technologies. The must-haves are:
Open source (various contributors), low-latency querying possible, and support for CRUD (mostly!) with statements like CREATE, INSERT INTO, SELECT * FROM (limit..), UPDATE Table SET A1=2 WHERE, DELETE, and DROP TABLE.

Apache Hive (SQL-like, with interactive SQL (Stinger))
Apache Drill (ANSI SQL support)
Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Parquet)
Apache Phoenix (built atop Apache HBase, lacks full transaction support, relational operators and some built-in functions)
Presto from Facebook (can query Hive, Cassandra, relational DBs, etc. Doesn't seem to be designed for low-latency responses across small clusters, or suppor…

Major compact a row key in HBase

Get a row key via a scan in hbase-shell:
hbase (main):0001:0 > scan 'your_table', {LIMIT => 5}
ROW  ....
See what the row contains:
hbase (main):0002:0 > get 'your_table', "\x00\x01"
COLUMN  ....

To start the compaction based on the row key, use these few lines and replace <row_key> and <your_table> with the findings above:
hbase (main):0003:0 > configuration = org.apache.hadoop.hbase.HBaseConfiguration.create
table = org.apache.hadoop.hbase.client.HTable.new(configuration, '<your_table>')
regionLocation = table.getRegionLocation("<row_key>")
regionLocation.getRegionInfo().getRegionName()
admin = org.apache.hadoop.hbase.client.HBaseAdmin.new(configuration)
admin.majorCompact(regionLocation.getRegionInfo().getRegionName())