Monday, June 27, 2016

Encryption in HDFS

Encryption of data was, and still is, one of the hottest topics in data protection and theft prevention. Hadoop HDFS supports full transparent encryption in transit and at rest [1], based on a Kerberos implementation [2], and it is often used across multiple trusted Kerberos domains.

Technology

Hadoop KMS provides a REST API with built-in SPNEGO and HTTPS support; in most Hadoop distributions it comes bundled with a pre-configured Apache Tomcat.
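
A quick way to verify that SPNEGO authentication against the KMS works is to query that REST API directly with curl. This is only a sketch: the host name is a placeholder, 16000 is the default KMS port, and a valid Kerberos ticket (kinit) is assumed:

curl --negotiate -u : "http://kms-host.main.domain:16000/kms/v1/keys/names" #(list all key names)
curl --negotiate -u : "http://kms-host.main.domain:16000/kms/v1/key/KEYNAME/_metadata" #(metadata of one key)
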
To keep encryption transparent for the user and the system, each encryption zone is associated with an SEZK (single encryption zone key), created when the zone is defined as an encryption zone through an interaction between NN and KMS. Each file within that zone gets its own DEK (data encryption key). This behavior is fully transparent: when a new file is created, the NN asks the KMS for a new EDEK (encrypted data encryption key), that is, the DEK encrypted with the zone key, and adds it to the file's metadata.

When a client wants to read a file in an encrypted zone, the NN provides the EDEK together with the zone key version, and the client asks the KMS to decrypt the EDEK. If the client has permission to read that zone (POSIX), it uses the resulting DEK to read the file. Seen from the perspective of a DataNode, the data stream is encrypted and the nodes only ever see encrypted data.
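
This transparency is easy to see once an encryption zone exists (the zone /enc_zones/data used below is created in the setup section that follows). A plain put and cat just work for an authorized user, and newer releases can additionally display the per-file encryption info; the file name here is just an example:

hadoop fs -put test.csv /enc_zones/data/ #(NN attaches a fresh EDEK, client encrypts while writing)
hadoop fs -cat /enc_zones/data/test.csv #(client lets the KMS decrypt the EDEK, then reads plaintext)
hdfs crypto -getFileEncryptionInfo -path /enc_zones/data/test.csv #(shows key name, key version and EDEK)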

Setup and Use

I use Cloudera's CDH as the example here, but the same works with other distributions and, of course, with the official Apache Hadoop distribution. Enabling KMS in CDH (5.3.x and up) is pretty easy and doesn't need to be explained here, since Cloudera has great articles online about that process [3]. The important thing to know is that KMS doesn't work without a working Kerberos implementation. Additionally, there are a few configuration parameters that need to be known, especially in a multi-domain Kerberos environment.
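
Besides Kerberos, the NameNode and the HDFS clients have to know where the KMS lives. Cloudera Manager takes care of this when the KMS service is added; done by hand it boils down to a key provider URI roughly like the one below (the host name is a placeholder, https@ instead of http@ when TLS is enabled), set as dfs.encryption.key.provider.uri in hdfs-site.xml and as hadoop.security.key.provider.path in core-site.xml:

<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@kms-host.main.domain:16000/kms</value>
</property>
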
First, when a trusted Kerberos environment is used, KMS relies on the same rule-based principal mapping mechanism as HDFS. That means the same filtering rules that already exist in core-site.xml (hadoop.security.auth_to_local) need to be added to kms-site.xml to get encryption working for all trusted domains:

<property>
 <name>hadoop.kms.authentication.kerberos.name.rules</name>
  <value>RULE:[1:$1@$0](.*@\QTRUSTED.DOMAIN\E$)s/@\QTRUSTED.DOMAIN\E$//
RULE:[2:$1@$0](.*@\QTRUSTED.DOMAIN\E$)s/@\QTRUSTED.DOMAIN\E$//
RULE:[1:$1@$0](.*@\QMAIN.DOMAIN\E$)s/@\QMAIN.DOMAIN\E$//
RULE:[2:$1@$0](.*@\QMAIN.DOMAIN\E$)s/@\QMAIN.DOMAIN\E$//
DEFAULT</value>
</property>


The terms TRUSTED.DOMAIN / MAIN.DOMAIN are placeholders for the trusted and the main (original) Kerberos domain. From an administrative standpoint, the use is straightforward:
hadoop key create KEYNAME #(one time key creation)
hadoop fs -mkdir /enc_zones/data
hdfs crypto -createZone -keyName KEYNAME -path /enc_zones/data
hdfs crypto -listZones


First I create a key, then I create the directory I want to encrypt in HDFS and turn it into an encryption zone with the key created in the first step.
This directory is now only accessible to me and to users I grant access via HDFS POSIX permissions; others can neither read nor change the files. To give superusers the possibility to create backups without decrypting and re-encrypting the data, a virtual path prefix for distCp (/.reserved/raw) [4] is available. This prefix allows a block-wise copy of encrypted files for backup and DR purposes.
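
A short sketch of how such a zone is typically locked down, and of the raw view available to the hdfs superuser (user and group names are just examples):

hadoop fs -chown etl_user:etl_group /enc_zones/data
hadoop fs -chmod 700 /enc_zones/data #(only the owner can list, read or write)
hadoop fs -ls /.reserved/raw/enc_zones/data #(superuser view of the still encrypted files)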

The use of distCp with encrypted zones can cause some mishaps, so it is highly recommended to have identically configured encryption zones on both sides to avoid problems later. A potential distCp command for encrypted zones could look like the following; the -px option preserves the extended attributes, which hold the per-file encryption metadata:

hadoop distcp -px hdfs://source-cluster-namenode:8020/.reserved/raw/enc_zones/data hdfs://target-cluster-namenode:8020/.reserved/raw/enc_zones/data