Posted to hdfs-user@hadoop.apache.org by "Thomas Palka (TripAdvisor)" <tp...@tripadvisor.com> on 2012/05/15 22:50:01 UTC

Hadoop HDFS / Hive backup solution, open-sourced by TripAdvisor

At TripAdvisor we use Hadoop and Hive for our warehousing needs. Processing the daily logs takes a long time, and re-processing them would be prohibitively expensive.  As we couldn't find an existing backup solution, we put one together and open-sourced it in the hope that it might be useful to others as well.  You can find it on GitHub:

https://github.com/TAwarehouse/backup-hadoop-and-hive

The backup app traverses the HDFS filesystem looking for all files with an mtime in a given range, then copies them (a la copyToLocal) to a local directory.  If HDFS were to crash, you can use "hadoop fs -copyFromLocal" to restore the filesystem contents.  The backup can be invoked incrementally to keep updating the local copy.  Files that would be overwritten get copied to a "preserved" area first, so that older versions remain available.
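
To make the mechanics concrete, here is a minimal sketch of that traversal using the standard Hadoop FileSystem API.  This is illustrative only, not the project's actual code; the paths and time window are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MtimeBackup {
        // Recursively copy every file under 'dir' whose modification time
        // falls in [fromMs, toMs) to a mirror tree under 'localRoot'.
        static void backup(FileSystem fs, Path dir, long fromMs, long toMs,
                           Path localRoot) throws Exception {
            for (FileStatus st : fs.listStatus(dir)) {
                if (st.isDir()) {
                    backup(fs, st.getPath(), fromMs, toMs, localRoot);
                } else if (st.getModificationTime() >= fromMs
                           && st.getModificationTime() < toMs) {
                    // Recreate the HDFS path under the local backup root,
                    // the same shape "hadoop fs -copyFromLocal" expects back.
                    String rel = st.getPath().toUri().getPath().substring(1);
                    fs.copyToLocalFile(st.getPath(), new Path(localRoot, rel));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            long now = System.currentTimeMillis();
            // Example: back up everything modified in the last 24 hours.
            backup(fs, new Path("/"), now - 24L * 3600 * 1000, now,
                   new Path("/backup/hdfs"));
        }
    }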

This project also includes a dump of the Hive schema, along with HQL statements to reassociate the tables with their HDFS partitions.  This portion came in very handy when we migrated our Hive backing database from Derby to MySQL.
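
As a rough illustration (the table, partition, and server names below are hypothetical, not taken from the project), reassociating a restored table with its HDFS data boils down to replaying statements like the ALTER TABLE shown here, driven over Hive's JDBC interface:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ReattachPartitions {
        public static void main(String[] args) throws Exception {
            // HiveServer JDBC driver of that era; host and db are placeholders.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // Point the (hypothetical) daily_logs table back at the HDFS
            // directory holding the restored partition data.
            stmt.execute("ALTER TABLE daily_logs "
                    + "ADD IF NOT EXISTS PARTITION (ds='2012-05-15') "
                    + "LOCATION '/warehouse/daily_logs/ds=2012-05-15'");
            stmt.close();
            conn.close();
        }
    }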

Thanks go to Josh Patterson, Edward Capriolo, and Rapleaf for letting us use their HDFS-style checksum, Hive show-create-table, and HDFS traversal code.

For more info, see the README:

https://github.com/TAwarehouse/backup-hadoop-and-hive/blob/master/README.txt


tom.
tpalka<at>tripadvisor<dot>com