Wednesday, September 28, 2011

hadoop log retention

Some people have asked me about an "issue" in the MapReduce job history (/jobhistory.jsp): the history page takes a while to load on high-traffic clusters. To explain why, here is the mechanism:

The history files are kept for 30 days (hardcoded before h21). That produces a lot of logs and also wastes space on the Hadoop JobTracker. I have some installations which hold 20 GB of logs in history; as a consequence, an audit of long-running jobs isn't really usable.

Starting with h21, the cleanup is configurable:

Key: mapreduce.jobtracker.jobhistory.maxage
Default: 7 * 24 * 60 * 60 * 1000L (one week)

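To keep the history for a 3-day period use:

Key: mapreduce.jobtracker.jobhistory.maxage
Value: 259200000

That is 3 * 24 * 60 * 60 * 1000: 3 days, 24 hours, 60 minutes, 60 seconds and 1000 milliseconds, so the value is a duration in milliseconds. The trailing L in the default only marks a Java long literal. As far as I know the configuration value is parsed as a plain number, so write the computed result rather than the expression.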
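The arithmetic behind the max-age value is easy to get wrong by a factor of 1000, so here is a small sketch that converts a retention period in days to milliseconds and prints it in the usual Hadoop site-file property layout. The function names days_to_ms and print_maxage_property are made up for this example, and placing the property in mapred-site.xml is an assumption about your setup.

```shell
#!/bin/sh
# Convert a retention period in days to milliseconds.
# days_to_ms is a hypothetical helper name, not part of Hadoop.
days_to_ms() {
    echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Print a property entry in the common Hadoop *-site.xml layout.
# Assumption: the key is read from mapred-site.xml on the JobTracker.
print_maxage_property() {
    cat <<EOF
<property>
  <name>mapreduce.jobtracker.jobhistory.maxage</name>
  <value>$(days_to_ms "$1")</value>
</property>
EOF
}

print_maxage_property 3
```

Changing the argument to 7 gives the value that matches the one-week default.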

Another way, though more of a hack, is via cron.d:
find /var/log/hadoop-0.20/history/done/ -type f -mtime +1 |xargs rm -f
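A slightly more defensive variant of the same hack, as a sketch: it checks that the directory exists and lets find call rm directly, which also copes with odd file names. The function name clean_history, the script path and the user in the cron comment are assumptions for illustration only. Note that -mtime +N matches files whose age in whole days is greater than N, so +1 removes files that are at least two days old.

```shell
#!/bin/sh
# Delete job history files older than a given number of days.
# clean_history is a hypothetical helper; adjust the path to your install.
clean_history() {
    dir="$1"
    days="$2"
    # Refuse to run against a missing directory.
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # -mtime +N: age in whole days is greater than N.
    # -exec ... {} + batches file names and handles spaces safely.
    find "$dir" -type f -mtime +"$days" -exec rm -f -- {} +
}

# Example call (commented out so sourcing this file deletes nothing):
# clean_history /var/log/hadoop-0.20/history/done 1
#
# Example /etc/cron.d entry, assuming the script is installed at a
# hypothetical path and runs as a hypothetical "mapred" user:
# 0 3 * * * mapred /usr/local/bin/clean_history.sh
```

Before enabling it in cron, you can replace the -exec part with -print to see which files would be removed.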