Some people have asked me about an "issue" with the MapReduce job history page (/jobhistory.jsp): on high-traffic clusters the page takes a while to load. To explain why, here is the mechanism:
The history files are kept for 30 days (hardcoded before h21). That produces a lot of logs and also wastes space on the Hadoop JobTracker. I have some installations that hold 20 GB of logs in the history directory; as a consequence, an audit of long-running jobs isn't really usable.
Beginning with h21, the cleanup age is configurable:
Default: 7 * 24 * 60 * 60 * 1000L (one week)
To shorten the retention to a 3-day period, use:
3 * 24 * 60 * 60 * 1000L
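That means 3 days, 24 hours per day, 60 minutes per hour, 60 seconds per minute and 1000 milliseconds per second. The value is expressed in milliseconds (3 days = 259,200,000 ms); the trailing L only marks the number as a Java long.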
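To double-check the arithmetic, the same product can be evaluated in a shell. This is only a small sketch: the helper name days_to_ms is made up for illustration, and the L suffix is dropped because shell arithmetic has no long literal.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): convert a retention
# period given in days into milliseconds.
days_to_ms() {
    # days * hours/day * minutes/hour * seconds/minute * ms/second
    echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

days_to_ms 7   # the default, one week
days_to_ms 3   # the three-day setting
```

Run as written, this prints 604800000 for the one-week default and 259200000 for three days.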
Another way, though more of a hack, is a cron job via cron.d:
find /var/log/hadoop-0.20/history/done/ -type f -mtime +1 |xargs rm -f
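If you want to see what the cron job would remove before letting it delete anything, a slightly more careful variant could look like the sketch below. The function name purge_history and the --delete switch are my own invention for illustration; only find, its -type and -mtime tests, and rm are standard. Keep in mind that find counts age in whole 24-hour periods, so -mtime +1 matches files that are at least two days old.

```shell
#!/bin/sh
# Hypothetical helper (illustrative only): list, or with --delete remove,
# regular files under DIR that find's "-mtime +DAYS" test matches.
# Usage: purge_history DIR DAYS [--delete]
purge_history() {
    dir=$1
    days=$2
    mode=${3:-}

    # Refuse to run against a directory that does not exist.
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }

    if [ "$mode" = "--delete" ]; then
        # -exec ... {} + batches file names like xargs does, but is
        # safe for names containing spaces or newlines.
        find "$dir" -type f -mtime +"$days" -exec rm -f {} +
    else
        # Dry run: only print the files that would be removed.
        find "$dir" -type f -mtime +"$days" -print
    fi
}
```

Calling purge_history /var/log/hadoop-0.20/history/done/ 1 prints the candidates; adding --delete as a third argument removes them, which is the form you would put into the cron entry.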