Tuesday, October 18, 2011
It is usually a good idea to test new code on a reference cluster with a nearly live dataset. To sync files from one cluster to another, use Hadoop's built-in tool distcp. With a small script I "rebase" a development cluster with the logfiles we collected over the past day.
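# SNAMENODE (source), TNAMENODE (target), HDFSPATH (HDFS directory) and LOG (logfile) must be set beforehand.
# Do not name the directory variable PATH: that would clobber the shell's command search path.
COPYDATE=$(date -d '-1 day' +"%Y-%m-%d")
DELDATE=$(date -d '-3 day' +"%Y-%m-%d")
# Append all further output (stdout and stderr) to the logfile
exec >> "$LOG" 2>&1
echo -e "\n ------- sync $COPYDATE ------- \n"
# -i ignores failures, -m 100 caps the number of simultaneous copies (maps)
/usr/bin/hadoop distcp -i -m 100 "hdfs://$SNAMENODE:9000/$HDFSPATH/$COPYDATE" "hdfs://$TNAMENODE:9000/$HDFSPATH/$COPYDATE/"
echo -e "\n ------- delete $DELDATE ------- \n"
/usr/bin/hadoop dfs -rmr "/$HDFSPATH/$DELDATE"
# Quoted so that Hadoop, not the local shell, expands the wildcard
/usr/bin/hadoop dfs -rmr "/$HDFSPATH/_distcp_logs*"
/usr/bin/hadoop dfs -chmod -R 777 "/$HDFSPATH/"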
The script copies the logfiles of the previous day from the given path to the target's HDFS and deletes the dataset from three days ago, so with a daily run nothing older than that remains. distcp leaves its own logs (_distcp_logs*) in that directory; I didn't want or need them, so the script deletes them too. We didn't have the user flume in our development cluster, so I set permissions to 777 for the whole directory.
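The date arithmetic behind the copy and delete steps can be tried on its own. The following is only a sketch, assuming GNU date (the -d option); the function name sync_dates and its optional reference-day argument are made up for illustration and are not part of the script above.

```shell
#!/bin/bash
# Sketch: compute the copy date (1 day back) and the delete date (3 days back).
# sync_dates and its argument are hypothetical; GNU date is assumed.

sync_dates() {
    local ref="${1:-today}"   # reference day, e.g. 2011-10-18; defaults to today
    local copydate deldate
    copydate=$(date -d "$ref -1 day" +"%Y-%m-%d") || return 1
    deldate=$(date -d "$ref -3 day" +"%Y-%m-%d") || return 1
    echo "$copydate $deldate"
}

# Example: a run on 2011-10-18 copies the 17th and deletes the 15th
sync_dates 2011-10-18   # prints: 2011-10-17 2011-10-15
```

Passing an explicit day makes it easy to check which directories a run would touch before letting the real script loose on a cluster.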
To help debug failures, the script writes all output to the given logfile. If you want to rotate the file, add a logrotate definition to /etc/logrotate.d/. To reduce the load and network impact on our live cluster I use only 100 maps. The script runs every day via cron at 02:00 pm and takes around 1 hour for 1 TB. Here is a Ganglia chart for a 300 GB sync.
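For the cron and logrotate setup, something along these lines could work. It is a rough sketch: the logfile path, the script location and the rotation settings (weekly, four compressed copies) are assumptions chosen for illustration, and logrotate_conf is a hypothetical helper that only prints a definition.

```shell
#!/bin/bash
# Sketch only: all paths and the script name below are hypothetical.
LOGFILE="/var/log/hdfs-sync.log"

# Crontab entry for a daily run at 14:00 (02:00 pm), added with `crontab -e`:
#   0 14 * * * /usr/local/bin/hdfs-sync.sh

# Print a logrotate definition for the logfile given as first argument.
logrotate_conf() {
    cat <<EOF
$1 {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
EOF
}

# Review the output, then save it as root, e.g. to /etc/logrotate.d/hdfs-sync
logrotate_conf "$LOGFILE"
```

Because the sync script reopens the logfile in append mode on every run, a plain rename-based rotation like this should be enough as long as it does not happen while a sync is running.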