Posted to common-issues@hadoop.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2018/10/13 14:25:00 UTC

[jira] [Created] (HADOOP-15850) Allow CopyCommitter to skip concatenating source files specified by DistCpConstants.CONF_LABEL_LISTING_FILE_PATH

Ted Yu created HADOOP-15850:
-------------------------------

             Summary: Allow CopyCommitter to skip concatenating source files specified by DistCpConstants.CONF_LABEL_LISTING_FILE_PATH
                 Key: HADOOP-15850
                 URL: https://issues.apache.org/jira/browse/HADOOP-15850
             Project: Hadoop Common
          Issue Type: Task
            Reporter: Ted Yu


I was investigating a test failure of TestIncrementalBackupWithBulkLoad from hbase against hadoop 3.1.1.

hbase's MapReduceBackupCopyJob$BackupDistCp creates the listing file:
{code}
        LOG.debug("creating input listing " + listing + " , totalRecords=" + totalRecords);
        cfg.set(DistCpConstants.CONF_LABEL_LISTING_FILE_PATH, listing);
        cfg.setLong(DistCpConstants.CONF_LABEL_TOTAL_NUMBER_OF_RECORDS, totalRecords);
{code}
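For context, the listing BackupDistCp hands to DistCp is a SequenceFile of Text keys and CopyListingFileStatus values. Here is a minimal sketch of writing such a listing (simplified: real DistCp keys are source-root-relative paths, and writeListing is an illustrative helper, not BackupDistCp's actual code):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.tools.CopyListingFileStatus;
import org.apache.hadoop.tools.DistCpConstants;

public class ListingSketch {
  // Illustrative only: writes one CopyListingFileStatus entry per source file,
  // then points DistCp at the listing via the same labels used above.
  static void writeListing(Configuration cfg, Path listing, Path[] sources)
      throws Exception {
    FileSystem fs = listing.getFileSystem(cfg);
    long totalRecords = 0;
    try (SequenceFile.Writer writer = SequenceFile.createWriter(cfg,
        SequenceFile.Writer.file(listing),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(CopyListingFileStatus.class))) {
      for (Path src : sources) {
        FileStatus status = fs.getFileStatus(src);
        // Real DistCp uses the path relative to the source root as the key.
        writer.append(new Text(src.toString()),
            new CopyListingFileStatus(status));
        totalRecords++;
      }
    }
    cfg.set(DistCpConstants.CONF_LABEL_LISTING_FILE_PATH, listing.toString());
    cfg.setLong(DistCpConstants.CONF_LABEL_TOTAL_NUMBER_OF_RECORDS,
        totalRecords);
  }
}
{code}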
For the test case, two bulk-loaded hfiles are in the listing:
{code}
2018-10-13 14:09:24,123 DEBUG [Time-limited test] mapreduce.MapReduceBackupCopyJob$BackupDistCp(195): BackupDistCp : hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_
2018-10-13 14:09:24,125 DEBUG [Time-limited test] mapreduce.MapReduceBackupCopyJob$BackupDistCp(195): BackupDistCp : hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_
2018-10-13 14:09:24,125 DEBUG [Time-limited test] mapreduce.MapReduceBackupCopyJob$BackupDistCp(197): BackupDistCp execute for 2 files of 10242
{code}
Later, CopyCommitter#concatFileChunks throws the following exception:
{code}
2018-10-13 14:09:25,351 WARN  [Thread-936] mapred.LocalJobRunner$Job(590): job_local1795473782_0004
java.io.IOException: Inconsistent sequence file: current chunk file org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ length = 5142 aclEntries = null, xAttrs = null}
  at org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276)
  at org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567)
{code}
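The exception comes from CopyCommitter#concatFileChunks, which walks the listing and expects consecutive entries to be chunks of one split file. A simplified sketch of that expectation (isChunkOf() is a hypothetical stand-in for the real comparison, not Hadoop's exact code):
{code}
import java.io.IOException;
import org.apache.hadoop.tools.CopyListingFileStatus;

public class ConcatCheckSketch {
  // Consecutive listing entries are expected to be chunks of the same file;
  // two independent bulk-loaded hfiles fail this expectation even though
  // neither of them was ever split.
  static void checkChunkContinuity(CopyListingFileStatus prior,
      CopyListingFileStatus current) throws IOException {
    if (prior != null && !isChunkOf(current, prior)) {
      throw new IOException("Inconsistent sequence file: current chunk file "
          + current + " doesnt match prior entry " + prior);
    }
  }

  // Hypothetical helper: treat two entries as chunks of one file only when
  // they share a path and their offsets line up back-to-back.
  static boolean isChunkOf(CopyListingFileStatus current,
      CopyListingFileStatus prior) {
    return current.getPath().equals(prior.getPath())
        && current.getChunkOffset()
            == prior.getChunkOffset() + prior.getChunkLength();
  }
}
{code}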
The above warning shouldn't happen: the two bulk-loaded hfiles are independent files, not chunks of a single split file.

From the contents of the two CopyListingFileStatus instances, we can see that isSplit() returns false for both. Otherwise the following from toString() would have been logged:
{code}
    if (isSplit()) {
      sb.append(", chunkOffset = ").append(this.getChunkOffset());
      sb.append(", chunkLength = ").append(this.getChunkLength());
    }
{code}
On the hbase side, we could specify one bulk-loaded hfile per job, but that defeats the purpose of using DistCp.

There should be a way to tell DistCp to skip the concatenation of source files; one possible shape is sketched below.
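The property name here is invented for illustration (it is not an existing DistCp/Hadoop configuration key); the idea is just a boolean that callers set and the committer consults before merging chunks:
{code}
import org.apache.hadoop.conf.Configuration;

public class SkipConcatSketch {
  // Hypothetical key, not an existing DistCp constant.
  static final String CONF_LABEL_SKIP_CONCAT = "distcp.skip.concat.source.files";

  // Callers such as BackupDistCp would set the flag when building the job:
  static void requestSkip(Configuration cfg) {
    cfg.setBoolean(CONF_LABEL_SKIP_CONCAT, true);
  }

  // CopyCommitter#commitJob could then bail out of chunk concatenation:
  static boolean shouldConcat(Configuration cfg) {
    return !cfg.getBoolean(CONF_LABEL_SKIP_CONCAT, false);
  }
}
{code}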



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org