Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2018/11/05 01:52:28 UTC

[GitHub] EdColeman commented on issue #742: How to move Accumulo table data to HDFS which has different structure

URL: https://github.com/apache/accumulo/issues/742#issuecomment-435732114
 
 
   I believe that you are correct. The Accumulo example uses distcp because that is the most common way of transferring files between clusters, but however you get the files to the destination system is really up to you.  If you can get the files listed in distcp.txt, plus the exportMetadata.zip file, into a single directory on the destination system, the import command does not care how they got there, and you should be good to go.
   
   The Accumulo importtable command takes a table name and an HDFS directory on the destination system; the only HDFS URL needed is the URL / path to the directory with the files.
   
   > importtable table_name hdfs_path_to_files
   
   The import uses the exportMetadata.zip file to recreate the table metadata and settings (including the splits), and then moves the rfiles in the provided directory into / under the Accumulo directory structure (i.e. xxx/accumulo/tables/new_table_id/...rf)
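   For context, the exportMetadata.zip and distcp.txt files mentioned above come from Accumulo's exporttable command. A minimal sketch of the whole sequence, assuming a hypothetical table named my_table (the Accumulo shell commands are shown as comments because they need live clusters):

```shell
# End-to-end sketch; table names and paths are placeholders.
#
# On the source cluster, in the Accumulo shell (the table must be
# offline before export):
#   offline -t my_table
#   exporttable -t my_table /tmp/export_my_table
#   # writes distcp.txt and exportMetadata.zip into /tmp/export_my_table
#
# Copy the listed rfiles plus exportMetadata.zip into one directory on
# the destination cluster (distcp or otherwise), then in the Accumulo
# shell there:
#   importtable my_table_copy /bar/foo
SEQUENCE="exporttable -> copy -> importtable"
echo "$SEQUENCE"
```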
   
   The paths in the distcp.txt file are a convenience for using the distcp command -f option  (see https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html):
   
   > bash$ hadoop distcp -f hdfs://nn1:8020/distcp.txt hdfs://nn2:8020/bar/foo
   
   distcp reads the paths listed in the provided file (distcp.txt in this case) and copies them to the foo directory on the destination system.  The result on the destination system is a single directory containing the Accumulo rfiles and the exportMetadata.zip file.
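   A runnable sketch of the -f form: build a local stand-in for distcp.txt (the rfile names and namenode hosts are placeholders carried over from the examples here); the distcp call itself is commented out since it needs live clusters:

```shell
# Build a stand-in file list; each line is one source path to copy.
# All paths below are placeholders, not real files.
cat > /tmp/distcp.txt <<'EOF'
hdfs://nn1:8020/accumulo/tables/yy/Axxxx1.rf
hdfs://nn1:8020/accumulo/tables/yy/Axxxxx2.rf
hdfs://nn1:8020/path_to_exportMetadata.zip
EOF
# With real clusters, hand the list to distcp with -f:
# hadoop distcp -f hdfs://nn1:8020/distcp.txt hdfs://nn2:8020/bar/foo
wc -l < /tmp/distcp.txt
```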
   
   The alternate form of the distcp command takes a list of source paths and the destination system directory, so you could do something like:
   
   > hadoop distcp hdfs://nn1/accumulo/tables/yy/Axxxx1.rf hdfs://nn1/accumulo/tables/yy/Axxxxx2.rf hdfs://nn1/path_to_exportMetadata.zip  hdfs://nn2/foo/bar
   
   Passing all of the paths from distcp.txt, including the exportMetadata.zip file, directly to distcp gives the same result - a single directory on the destination system with the Accumulo rfiles and the exportMetadata.zip file; it's just much easier in this case to hand distcp the file list with -f.
   
   You don't need to use distcp; you just need to end up with the same expected directory and all of the files. Hope this helps.
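   As one hypothetical no-distcp alternative, you could stage the files on a local disk, move them between clusters however you like, and then put them into a single HDFS directory on the destination. The hdfs commands are commented out (they need live clusters) and all names are placeholders:

```shell
# Stage the rfiles and exportMetadata.zip locally, ship them, re-upload.
STAGE=/tmp/accumulo_stage
mkdir -p "$STAGE"
# hdfs dfs -get hdfs://nn1:8020/accumulo/tables/yy/Axxxx1.rf "$STAGE"/
# hdfs dfs -get hdfs://nn1:8020/path_to_exportMetadata.zip "$STAGE"/
# ...transfer $STAGE to a destination host (rsync, scp, etc.)...
# hdfs dfs -mkdir -p hdfs://nn2:8020/bar/foo
# hdfs dfs -put "$STAGE"/* hdfs://nn2:8020/bar/foo/
```

   The only requirement the import places on you is the end state: one directory holding every rfile plus exportMetadata.zip.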

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services