Monday, October 10, 2011

Secure your hadoop cluster, Part I

Use mapred with Active Directory (basic auth)

Most of the cases I handled in the past weeks concerned Hadoop security: firewalling, SELinux, authentication and user management. Securing a cluster is usually a difficult process that depends on the company's security profile and processes.

So I start with the most interesting part: authentication. In all cases I worked on, the main authentication system was a Windows Active Directory (AD) forest. Hadoop ships with several task controller classes, and for this setup we can use LinuxTaskController. I use RHEL5 servers, but the steps can be adapted to other installations.

To enable the UNIX services in Windows Server versions newer than 2003, you have to extend the existing schema with the UNIX templates delivered by Microsoft. After that, install "Identity Management for UNIX"; in Windows Server 2008 it is located in Server Manager => Roles => AD DS => common tasks => Add Role => Select Role Services. Install the software and restart your server. Now create a default bind account, configure the AD server and create a test user with the group hdfs.

For testing we use these settings:
Binding-Acc: main
Group: hdfs

Now we take a closer look at the Red Hat box. Since Kerberos 5 is fully supported, the task is fairly simple. Three problems can occur: time drift, DNS, and a wrong schema on the AD server(s).
Set up the LDAP authentication with:
# authconfig-tui

Authentication => Use LDAP (use MD5 Password + Use shadow Password + Use Kerberos) => Next.

LDAP Settings => Server (FQDN or IP of the AD server) + Base DN (dc=hadoop,dc=company,dc=local) => Next.

Kerberos Settings => REALM in uppercase + KDC and admin server (FQDN or IP of AD Server) + Use DNS to resolve hosts to realms + Use DNS to locate KDCs for realms => OK
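
The Base DN and the Kerberos realm entered in the dialogs above both follow from the AD DNS domain. A minimal sketch of that mapping, assuming the domain is hadoop.company.local (inferred from the Base DN used in this post); the function names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: derive the values typed into authconfig-tui from the AD DNS domain.
# The domain name used in the example call is an assumption.

# hadoop.company.local -> dc=hadoop,dc=company,dc=local
domain_to_base_dn() {
    printf 'dc=%s\n' "$1" | sed 's/\./,dc=/g'
}

# hadoop.company.local -> HADOOP.COMPANY.LOCAL (AD realms are the upper-cased domain)
domain_to_realm() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

domain_to_base_dn hadoop.company.local
domain_to_realm hadoop.company.local
```

If the realm is typed in lowercase, Kerberos treats it as a different realm, which is why the dialog insists on uppercase.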

AD does not allow anonymous connections, so you have to configure the bind account (see above) in /etc/ldap.conf.

In /etc/nsswitch.conf, add the ldap service after files:
passwd:     files ldap
shadow:     files ldap
group:      files ldap
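
A quick way to confirm the three entries are in place is to grep for them. This is only a sketch; the function name is hypothetical, and the file path defaults to /etc/nsswitch.conf:

```shell
#!/bin/sh
# Sketch: verify that passwd, shadow and group all list "ldap" as a source.
# Usage: nsswitch_has_ldap [file]   (defaults to /etc/nsswitch.conf)
nsswitch_has_ldap() {
    file=${1:-/etc/nsswitch.conf}
    for db in passwd shadow group; do
        # The database line must contain "ldap" as a separate word after the colon.
        if ! grep -Eq "^${db}:.*[[:space:]]ldap([[:space:]]|\$)" "$file"; then
            echo "missing ldap source for: $db" >&2
            return 1
        fi
    done
    return 0
}
```

Calling nsswitch_has_ldap without arguments checks the live file and returns non-zero for the first database that lacks the ldap source.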

Now edit the /etc/ldap.conf:

base dc=hadoop,dc=company,dc=local
uri ldap://
bindpw <PASSWORD>
scope sub
ssl no
nss_base_passwd dc=hadoop,dc=company,dc=local?sub
nss_base_shadow dc=hadoop,dc=company,dc=local?sub
nss_base_group dc=hadoop,dc=company,dc=local?sub? \
nss_map_objectclass posixAccount user
nss_map_objectclass shadowAccount user
nss_map_objectclass posixGroup group
nss_map_attribute gecos cn
nss_map_attribute homeDirectory unixHomeDirectory
nss_map_attribute uniqueMember member
nss_map_objectclass posixGroup Group
tls_cacertdir /etc/openldap/cacerts
pam_password md5
pam_login_attribute sAMAccountName
pam_filter objectclass=User
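
Before relying on NSS, it can help to test the bind account directly against AD with ldapsearch (from the openldap-clients package). The following is a minimal sketch: the server URI, the bind DN and the user name are placeholders the caller has to supply, since this post does not spell them out, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: look up one AD user with the bind account.
# BASE_DN is taken from the ldap.conf in this post; all arguments are placeholders.
BASE_DN='dc=hadoop,dc=company,dc=local'

# Build a filter that mirrors pam_filter / pam_login_attribute in ldap.conf.
ad_user_filter() {
    printf '(&(objectClass=user)(sAMAccountName=%s))\n' "$1"
}

# Usage: ad_lookup <ldap-uri> <bind-dn> <user>
# -x uses a simple bind, -W prompts for the bind password instead of
# putting it on the command line.
ad_lookup() {
    ldapsearch -x -H "$1" -D "$2" -W -b "$BASE_DN" \
        "$(ad_user_filter "$3")" \
        sAMAccountName uidNumber gidNumber unixHomeDirectory
}
```

If the entry comes back without uidNumber, gidNumber or unixHomeDirectory, the UNIX attributes are probably not set on the account, which is one form of the schema problem mentioned earlier.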

After you have written the file, test your configuration:
# getent passwd
If the command lists your AD users without errors, everything works as expected.
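
On a large directory, listing every account can take a while, so a narrower check for a single user and group may be more practical. A sketch, with a hypothetical function name; pass the test user you created in AD and the hdfs group:

```shell
#!/bin/sh
# Sketch: check that one user and one group resolve through NSS.
# Usage: check_nss_user <user> <group>
check_nss_user() {
    getent passwd "$1" || { echo "user $1 not resolved" >&2; return 1; }
    getent group "$2" || { echo "group $2 not resolved" >&2; return 1; }
    # id prints the uid, the primary gid and all group memberships NSS reports.
    id "$1"
}
```

A non-zero return code means either the user or the group is not visible to the system, in which case /etc/ldap.conf and /etc/nsswitch.conf are the first places to look.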

Add to /etc/pam.d/system-auth:
session      required      pam_mkhomedir.so skel=/etc/skel umask=0022

Now it is time to use mapred with your AD. For that we use the shipped class org.apache.hadoop.mapred.LinuxTaskController, which is configured in mapred-site.xml:
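
A minimal sketch of such a configuration, written as a small shell helper that prints the properties for review. The property names mapred.task.tracker.task-controller and mapreduce.tasktracker.group are assumed from Hadoop 0.20.20x/1.x and may differ in other releases; the helper name is made up, and the group value hdfs is the test group from this post.

```shell
#!/bin/sh
# Sketch: print the task controller properties; paste them inside
# <configuration> in mapred-site.xml after checking them against your release.
# Property names are an assumption (Hadoop 0.20.20x / 1.x naming).
# Usage: task_controller_props <group>
task_controller_props() {
    cat <<EOF
<property>
  <name>mapred.task.tracker.task-controller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>
<property>
  <name>mapreduce.tasktracker.group</name>
  <value>$1</value>
</property>
EOF
}

task_controller_props hdfs
```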


Now jobs will be submitted in the context of the given user via PAM. Keep in mind that the group must be set to the group you set up in your AD.

Known issues mostly depend on your setup. Make sure that time is synchronized in your network (usually done with ntpd), that the DNS infrastructure works, and that the users and groups are known in AD.
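
The time and DNS checks can be scripted roughly as follows. This is a sketch under assumptions: the domain hadoop.company.local is inferred from the Base DN used in this post, the function names are hypothetical, and ntpq and dig must be installed.

```shell
#!/bin/sh
# Sketch: pre-flight checks for time sync and DNS.

# Print the SRV record names that "Use DNS to locate KDCs" depends on.
srv_names() {
    printf '_kerberos._tcp.%s\n_ldap._tcp.%s\n' "$1" "$1"
}

# Usage: preflight <dns-domain>
preflight() {
    # Time: ntpq -p lists the peers ntpd is using.
    ntpq -p || echo "ntpd not reachable" >&2
    # DNS: each SRV lookup should return at least one AD server.
    for name in $(srv_names "$1"); do
        echo "== $name"
        dig +short -t SRV "$name"
    done
}

# Example call (domain is an assumption):
# preflight hadoop.company.local
```

Kerberos rejects tickets when the clocks differ by more than a few minutes (five by default), so an empty peer list from ntpq is worth fixing before debugging anything else.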