Posted to dev@hbase.apache.org by "Dan Zinngrabe (JIRA)" <ji...@apache.org> on 2008/09/23 01:01:44 UTC

[jira] Created: (HBASE-897) Backup/Export/Import Tool

Backup/Export/Import Tool
-------------------------

                 Key: HBASE-897
                 URL: https://issues.apache.org/jira/browse/HBASE-897
             Project: Hadoop HBase
          Issue Type: New Feature
    Affects Versions: 0.1.3, 0.1.2
         Environment: MacOS 10.5.4, CentOS 5.1
            Reporter: Dan Zinngrabe
            Priority: Minor


Attached is a simple import, export, and backup utility. Mahalo.com has been using this in production for several months to back up our HBase clusters as well as to migrate data from production to development clusters, etc.

Documentation included below is from the readme.

HBase Backup
author: Dan Zinngrabe dan@mahalo.com

------------------
Summary:
A simple MapReduce job for exporting data from an HBase table. The exported data is in a simple, flat format that can then be imported using another MapReduce job. This gives you both a backup capability and a straightforward way to import and export data from tables.

Backup File Format
------------------

The output of a backup job is a flat text file, or a series of flat text files. Each row is represented by a single line, with each item tab-delimited. Column names are plain text, while column values are base64-encoded; this handles tabs and line breaks in the data. Generally you should not have to worry about this at all.
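
For illustration, here is a minimal Java sketch of how one line of such a backup file could be decoded. The exact field layout (row key first, then alternating column name and base64 value) and the use of java.util.Base64 are assumptions made for this example; they are not necessarily what the tool itself does internally:

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

public class BackupLineDecoder {

    // Assumed layout: rowKey <TAB> column1 <TAB> base64(value1) <TAB> column2 <TAB> base64(value2) ...
    public static Map<String, byte[]> decode(String line) {
        String[] fields = line.split("\t");
        Map<String, byte[]> columns = new LinkedHashMap<String, byte[]>();
        // fields[0] is assumed to be the row key; column name/value pairs follow
        for (int i = 1; i + 1 < fields.length; i += 2) {
            columns.put(fields[i], Base64.getDecoder().decode(fields[i + 1]));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Build a sample line whose value contains a tab, to show why base64 is used
        String value = Base64.getEncoder()
            .encodeToString("hello\tworld".getBytes(StandardCharsets.UTF_8));
        String sample = "row1\ttext_data:\t" + value;
        for (Map.Entry<String, byte[]> e : decode(sample).entrySet()) {
            System.out.println(e.getKey() + " = " + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}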

Setup and installation
------------------

First, make sure your Hadoop installation is properly configured to load the HBase classes. This is easily done by editing the hadoop-env.sh file to include HBase's jar libraries; for example, add the following to have it load the HBase classes:

export HBASE_HOME=/Users/quellish/Desktop/hadoop/hbase-0.1.2
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.1.2.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.1.2-test.jar

Second, make sure the hbase-backup.jar file is on the classpath for Hadoop as well. While you can put it into a system-wide classpath directory such as ${JAVA_HOME}/lib, it's much easier to just put it into

${HADOOP_HOME}/lib

With that done, you are ready to go. Start up Hadoop and HBase normally and you will be able to run backups and restores.

Backing up
------------------

Backups are run using the Exporter class. From ${HADOOP_HOME}:

bin/hadoop com.mahalo.hadoop.hbase.Exporter -output backup -table text -columns text_flags: text_data:

This will write the backup into the new directory "backup" in the Hadoop file system, and will back up the columns "text_flags" and "text_data" along with each row's identifier (the row key). The colons are required in the column names. The job will produce multiple files in the output directory; simply 'cat' them together to form a single file. Note that if the output directory already exists, the job will stop. This may be changed in a future version. The output directory can also be any file system path or URL that Hadoop can understand, such as an S3 URL.
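
Because the job stops when the output directory already exists, it can be handy to clear out a previous backup directory before re-running the export. Below is a minimal sketch using the Hadoop FileSystem API; the directory name and the recursive delete are illustrative assumptions, not part of the tool itself - be careful, since this permanently removes the old backup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClearBackupDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same output directory that is passed to the Exporter via -output
        Path backupDir = new Path(args.length > 0 ? args[0] : "backup");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(backupDir)) {
            // true = delete recursively, removing the old part-00000 files as well
            fs.delete(backupDir, true);
            System.out.println("Removed old backup directory: " + backupDir);
        }
    }
}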

Restoring from a backup
------------------

From ${HADOOP_HOME}:

bin/hadoop com.mahalo.hadoop.hbase.Importer backup/backup.tsv text

This will load a single file, backup/backup.tsv (the one you 'cat'd together from parts), into the table "text". Note that the table must already exist, and it can have data in it - existing values may be overwritten by the restore process. You can create the table easily using HBase's shell. The backup file can be loaded from any URL that Hadoop understands, such as a file URL or an S3 URL. A path not formatted as a URL (as shown above) is resolved relative to your user directory in the Hadoop file system.
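
If you prefer to create the destination table programmatically rather than from the shell, a minimal sketch using the HBase client API of that era is shown below. The table and column family names match the example above; treat the exact class locations and constructor signatures as assumptions, since they shifted between early HBase releases:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTextTable {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("text");
        // Column family names; early HBase releases wrote these with a trailing colon
        desc.addFamily(new HColumnDescriptor("text_flags:"));
        desc.addFamily(new HColumnDescriptor("text_data:"));
        if (!admin.tableExists("text")) {
            admin.createTable(desc);
        }
    }
}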

Combining a file from pieces using cat
------------------

As mentioned above, a MapReduce job will typically produce several output files that must be assembled into a single file. On a Unix system this is fairly easy to do with cat and find. First, copy your data from the Hadoop file system to the local file system:

bin/hadoop dfs -copyToLocal backup ~/mybackups

Then:

cd ~/
find mybackups/. -name "part-00*" | xargs cat >> backup.tsv

This will take all the files in the "mybackups" directory matching the pattern "part-00*" and combine them into a single file, "backup.tsv".

Troubleshooting
------------------

During a restore/import, it is normal for region servers to split or become temporarily unavailable; the application will recover from this. You may see errors in the logs as a result, but they are expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "Dan Zinngrabe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634279#action_12634279 ] 

Dan Zinngrabe commented on HBASE-897:
-------------------------------------

That's correct: look at www.mahalo.com. All the markup that powers the wiki is stored in HBase and backed up using this tool every hour. It's been in use for a few months now. MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's in-house editors produce a *lot* of revisions per day, which was not working well in an RDBMS. An HBase-based solution for this was built and tested, and the data was migrated out of MySQL and into HBase using this tool (and a few Python scripts). Right now it's at something like 6 million items in HBase. The tool runs every hour from a shell script to back up that data, and on 6 nodes it takes about 5-10 minutes to run - and does not slow down production at all. So it's not just a backup, it's a hot backup.

Mahalo has no problem with being added to the powered-by page :).



[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732921#action_12732921 ] 

Jonathan Gray commented on HBASE-897:
-------------------------------------

Good stuff, atppp. FYI, there was another issue opened for a 0.19 version: HBASE-974.

I'm working on a new one for 0.20 soon, will have it up next week in a new issue and will post here.



[jira] Resolved: (HBASE-897) Backup/Export/Import Tool

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray resolved HBASE-897.
---------------------------------

    Resolution: Won't Fix

This issue contains tools that work on really old versions and also on 0.19. There are no plans to commit any of this into branches.

Other implementations for 0.18/0.19 are available in HBASE-974.

Closing issue as Won't Fix. A 0.20 backup is now being worked on in HBASE-1684.



[jira] Updated: (HBASE-897) Backup/Export/Import Tool

Posted by "Dan Zinngrabe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Zinngrabe updated HBASE-897:
--------------------------------

    Attachment: hbase_backup_release.tar.gz

Unzip and run 'ant build' to create the binary. Documentation is included in the readme. Note that while this has primarily been used with 0.1.2 and 0.1.3, it should be usable on newer versions with little or no modification.



[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "Dan Zinngrabe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633774#action_12633774 ] 

Dan Zinngrabe commented on HBASE-897:
-------------------------------------

Yes, you need the HBase and Hadoop jars either in the lib directory or on your classpath for it to build properly.

This hasn't been tested with the most recent HBase and Hadoop releases, but I can find no reason it would not work other than class name changes. I think including it in HBase may be a good idea - being able to export and import data, even just for testing purposes, is valuable to developers, and the backup capability is something people have asked for quite a bit. Until there is a more robust backup tool like what has been suggested for HBASE-50, this would certainly be a reasonable stopgap.

Since for backup purposes the tool is likely to be deployed and used by systems administrators, the README should probably remain separate for now - it makes it easier to get it into their hands.



[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634286#action_12634286 ] 

stack commented on HBASE-897:
-----------------------------

Great stuff Dan. The info should make Lars George happy -- the fellow asking. His dataset size sounds about the same as yours.

I'll make a first cut at adding a Mahalo entry to the powered-by page. Take a cut at it if I misrepresent. Thanks.



[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633613#action_12633613 ] 

stack commented on HBASE-897:
-----------------------------

This looks excellent Dan.

I tried to build it, but the lib dir is empty; my guess is it's supposed to be populated with some subset of the Hadoop and HBase jars and lib content.

More importantly, do you think we should bundle this tool with HBase itself? What would you suggest? Perhaps add it as a subpackage under hbase mapred? The README could be redone as package-level documentation? Or do you think it better that it remain its own self-contained thing? If so, where should it live other than as a JIRA attachment?

Good stuff (just back from Hawaii, so I know what Mahalo means now).



[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "sishen.freecity (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644246#action_12644246 ] 

sishen.freecity commented on HBASE-897:
---------------------------------------

Dan Zinngrabe,

Thanks for your wonderful tool. It's really very helpful. :)

In JIRA this is recorded as affecting HBase 0.1.2 and 0.1.3. HBase 0.19.1 has now been released, and I found that the tool is not compatible with it.

Do you have a new version that fixes these problems? Thanks.



[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "Dan Zinngrabe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633955#action_12633955 ] 

Dan Zinngrabe commented on HBASE-897:
-------------------------------------

I'll give that a shot; it shouldn't present any problems that I can see.
I'll put most of the readme into the package docs, and I'll see if I can do a version of it targeted at sysadmins for the wiki.




[jira] Updated: (HBASE-897) Backup/Export/Import Tool

Posted by "atppp (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

atppp updated HBASE-897:
------------------------

    Attachment: hbase_backup_with_hbase_0.19.x.tar.gz

Spent a little bit of time on this and made it work with Hadoop 0.19.1 and HBase 0.19.2. Fixed a handful of bugs.

> Backup/Export/Import Tool
> -------------------------
>
>                 Key: HBASE-897
>                 URL: https://issues.apache.org/jira/browse/HBASE-897
>             Project: Hadoop HBase
>          Issue Type: New Feature
>    Affects Versions: 0.1.2, 0.1.3
>         Environment: MacOS 10.5.4, CentOS 5.1
>            Reporter: Dan Zinngrabe
>            Priority: Minor
>         Attachments: hbase_backup_release.tar.gz, hbase_backup_with_hbase_0.19.x.tar.gz
>
>
> Attached is a simple import, export, and backup utility. Mahalo.com has been using this in production for several months to back up our HBase clusters as well as to migrate data from production to development clusters, etc.
> Documentation included below is from the readme.
> HBase Backup
> author: Dan Zinngrabe dan@mahalo.com
> ------------------
> Summary:
> Simple MapReduce job for exporting data from an HBase table. The exported data is in a simple, flat format that can then be imported using another MapReduce job. This gives you both a backup capability, and a simple way to import and export data from tables.
> Backup File Format
> ------------------
> The output of a backup job is a flat text file, or series of flat text files. Each row is represented by a single line, with each item tab delimited. Column names are plain text, while column values are base 64 encoded. This helps us deal with tabs and line breaks in the data. Generally you should not have to worry about this at all.
> Setup and installation
> ------------------
> First, make sure your Hadoop installation is properly configured to load the HBase classes. This can easily be done by editing the hadoop-env.sh file to include HBase's jar libraries. You can add the following to hadoop-env.sh to have it load HBase classes:
> export HBASE_HOME=/Users/quellish/Desktop/hadoop/hbase-0.1.2
> export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.1.2.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.1.2-test.jar
> Second, make sure the hbase-backup.jar file is on the classpath for Hadoop as well. While you can put this into a system-wide class path directory such as ${JAVA_HOME}/lib , it's much easier to just put it into
> ${HADOOP_HOME}/lib
> With that done, you are ready to go. Start up hadoop and HBase normally and you will be able to run a backup and restore.
> Backing up
> ------------------
> Backups are run using the Exporter class. From  ${HADOOP_HOME} :
> bin/hadoop com.mahalo.hadoop.hbase.Exporter -output backup -table text -columns text_flags: text_data:
> This will output the backup into the new directory "backup" in the Hadoop File System, and will back up the columns "old_flags" and "old_text", with whatever the table's row identifier is. Colons are required in the column names, and this will produce multiple files in the output directory (simply 'cat' them together to form a single file). Note that if the backup directory exists it will stop. This may be changed in a future version. The output directory can also be any file system path or URL that Hadoop can understand, such as an S3 URL.
> Restoring from a backup
> ------------------
> From  ${HADOOP_HOME} :
> bin/hadoop com.mahalo.hadoop.hbase.Importer backup/backup.tsv text
> This will load a single file (that you 'cat'd together from parts), backup/backup.tsv, into the table "text". Note that the table must already exist; it can have data in it, but existing values may be overwritten by the restore process. You can create the table easily using HBase's shell. The backup file can be loaded from any URL that Hadoop understands, such as a file URL or an S3 URL. A path not formatted as a URL (as shown above) is resolved relative to your user directory in the Hadoop filesystem.
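> As a sketch only (the column family names are taken from the export example above, and the exact syntax depends on your HBase version), creating the target table in a recent HBase shell might look like:
> create 'text', 'text_flags', 'text_data'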
> Combining a file from pieces using cat
> ------------------
> As mentioned above, a MapReduce job will typically produce several output files that must be assembled into a single file. On a Unix system this is fairly easy to do with cat and find. First, copy your data from the Hadoop filesystem to the local filesystem:
> bin/hadoop dfs -copyToLocal backup ~/mybackups
> Then:
> cd ~/
> find mybackups/. -name "part-00*" | xargs cat >> backup.tsv
> This will take all the files in the "mybackups" directory matching the pattern "part-00*" and combine them into a single file, "backup.tsv".
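> As an alternative worth knowing about (a sketch, not something this tool requires), Hadoop's FsShell can do the copy and the concatenation in one step:
> bin/hadoop dfs -getmerge backup ~/backup.tsv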
> Troubleshooting
> ------------------
> During a restore/import it is normal for region servers to split or become temporarily unavailable; the application will recover from this. You may see errors in the logs, but they are expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634272#action_12634272 ] 

stack commented on HBASE-897:
-----------------------------

Dan: Thanks.  One minor thing: rather than put the doc in the wiki, if it's in the javadoc it can evolve along with the hbase versions.  Also, to be clear, have you run this MR job against a 'live' instance?  (Just asking.  Someone off-the-mailing-list was looking for such a thing and I pointed them here.)  Finally, any chance of adding Mahalo to the powered-by page?  I'm making the rounds trying to get fellas to add their names; it's empty now and that gives off a bad impression.  Good stuff.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-897) Backup/Export/Import Tool

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633833#action_12633833 ] 

stack commented on HBASE-897:
-----------------------------

Mind making a patch then?  Add it to the mapred package or into a subpackage of mapred.

Regarding the README, where do you think it should live?   If it's package-level documentation, it sits beside the code and (hopefully) evolves with it.

Thanks.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.