Posted to dev@crunch.apache.org by "Andrew Olson (JIRA)" <ji...@apache.org> on 2019/02/26 18:16:00 UTC

[jira] [Created] (CRUNCH-679) Improvements for usage of DistCp

Andrew Olson created CRUNCH-679:
-----------------------------------

             Summary: Improvements for usage of DistCp
                 Key: CRUNCH-679
                 URL: https://issues.apache.org/jira/browse/CRUNCH-679
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Andrew Olson
            Assignee: Josh Wills


As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and improvements have been identified during testing.

* We need to preserve preferred part names, e.g. part-m-00000. Currently the DistCp support in Crunch does not use the FileTargetImpl#getDestFile method, so it creates destination file names like out0-m-00000, which are problematic when multiple map-only jobs write to the same target path. Preserving the preferred names can be achieved by providing a custom CopyListing implementation that dynamically renames target paths based on a given mapping. Unfortunately, a substantial amount of code must currently be duplicated from the original SimpleCopyListing class in order to inject the logic that modifies the sequence file entry keys. HADOOP-16147 has been opened so that this can be simplified in the future.

* The handleOutputs implementation in HFileTarget is essentially identical to the one in FileTargetImpl that it overrides. We can remove it and just share the same code.

* It could be useful to add a property for configuring the maximum DistCp bandwidth per task, as the default (100 MB/s per task) may be too high for certain environments.
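
To illustrate the renaming idea from the first item, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the copy listing keys are relative paths and that only the final path component needs to change. The class name TargetPathRenamer and its rename method are hypothetical and are not part of Crunch or DistCp; a real CopyListing implementation would apply something like this to each key before writing the sequence file entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: rewrites the file name at the
// end of a relative path using a source-name -> preferred-name mapping.
final class TargetPathRenamer {
    private final Map<String, String> renames;

    TargetPathRenamer(Map<String, String> renames) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.renames = new HashMap<>(renames);
    }

    // Returns the path with its last component replaced when a mapping
    // exists; paths without a mapping are returned unchanged.
    String rename(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        String parent = relativePath.substring(0, slash + 1);
        String name = relativePath.substring(slash + 1);
        return parent + renames.getOrDefault(name, name);
    }
}
```

With a mapping of out0-m-00000 to part-m-00000, the key /out0-m-00000 would become /part-m-00000, while a key such as /_SUCCESS would pass through untouched.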

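For the bandwidth item, the property lookup could look roughly like the sketch below. The property name crunch.distcp.max.task.bandwidth.mb is an assumption made up for this example, and a plain Map stands in for the Hadoop Configuration so the snippet stays self-contained. The fallback of 100 matches the default mentioned above.

```java
import java.util.Map;

// Hypothetical helper, for illustration only.
final class DistCpBandwidth {
    // Assumed property name; not taken from Crunch or from this issue.
    static final String MAX_BANDWIDTH_MB_KEY = "crunch.distcp.max.task.bandwidth.mb";

    // Default per-task bandwidth in MB/s, as stated in the issue text.
    static final int DEFAULT_BANDWIDTH_MB = 100;

    // Returns the configured per-task limit, or the default when unset.
    static int resolve(Map<String, String> conf) {
        String raw = conf.get(MAX_BANDWIDTH_MB_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_BANDWIDTH_MB;
        }
        int mb = Integer.parseInt(raw.trim());
        if (mb <= 0) {
            throw new IllegalArgumentException(
                MAX_BANDWIDTH_MB_KEY + " must be positive, got " + mb);
        }
        return mb;
    }
}
```

The resolved value would then be passed to DistCp's per-map bandwidth option when the copy job is set up.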


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)