Posted to dev@crunch.apache.org by "Micah Whitacre (JIRA)" <ji...@apache.org> on 2017/12/11 15:03:00 UTC

[jira] [Commented] (CRUNCH-660) FileTargetImpl uses Distcp vs FileUtils.copy

    [ https://issues.apache.org/jira/browse/CRUNCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286010#comment-16286010 ] 

Micah Whitacre commented on CRUNCH-660:
---------------------------------------

The other idea I came up with is a more significant change, but it makes more sense in the context of Crunch and doesn't tie us to MR: support creating the "working path" on the destination FS instead of always locally.  Yes, it might not be as performant, but if they want the output on another cluster they've likely already given up on data locality.

This would make "cleanup" more complicated because you'd have to clean up the working directory on each filesystem but that seems trivial compared to the other design implications.
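
To make the idea concrete, here is a minimal sketch in plain Java (no Hadoop or Crunch dependencies) of choosing a working directory on the same filesystem as each target, and of collecting the distinct directories that cleanup would then have to visit. The class name, the method names, and the "/tmp/crunch-<jobId>" layout are all hypothetical and chosen only for illustration; this is not how Crunch currently picks its working path.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class WorkingPathSketch {

    // Hypothetical helper: place the working directory on the target's
    // filesystem (same scheme and authority). A target without a scheme
    // is assumed to live on the default filesystem.
    static URI workingDirFor(URI target, URI defaultFs, String jobId) {
        URI fs = (target.getScheme() == null) ? defaultFs : target;
        String authority = (fs.getAuthority() == null) ? "" : fs.getAuthority();
        // The "/tmp/crunch-<jobId>" layout is an assumption for this sketch.
        return URI.create(fs.getScheme() + "://" + authority + "/tmp/crunch-" + jobId);
    }

    // Hypothetical helper: cleanup now has to cover one working directory
    // per distinct filesystem rather than a single local one.
    static Set<URI> dirsToCleanUp(List<URI> targets, URI defaultFs, String jobId) {
        Set<URI> dirs = new LinkedHashSet<>();
        for (URI target : targets) {
            dirs.add(workingDirFor(target, defaultFs, jobId));
        }
        return dirs;
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://cluster-a:8020");
        List<URI> targets = Arrays.asList(
                URI.create("/data/out"),                            // default cluster
                URI.create("hdfs://cluster-b:8020/hbase/hfiles"),   // remote cluster
                URI.create("hdfs://cluster-b:8020/hbase/other"));   // same remote cluster
        for (URI dir : dirsToCleanUp(targets, defaultFs, "job1")) {
            System.out.println("clean up: " + dir);
        }
    }
}
```

With the working directory on the destination filesystem, the final move into the target can be a rename within that filesystem instead of a copy across clusters run from the driver process.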

> FileTargetImpl uses Distcp vs FileUtils.copy
> --------------------------------------------
>
>                 Key: CRUNCH-660
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-660
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>
> So for handling multiple runtimes I'm not sure there is a way to solve this, but I'm documenting it as a JIRA regardless.
> If you are running in a multi-cluster environment where you might want to read data from one cluster and then write the output on another cluster (e.g. generating HFiles to be loaded into a separate HBase cluster), the time spent moving files is noticeable.  Specifically, this seems to be because the files are moved in the launcher/driver process rather than as part of node execution.[1]
> An efficient option would be to kick off a DistCp instead, but that would tie the target directly to a runtime, which is not a great approach.
> [1] - https://github.com/apache/crunch/blob/5609b014378d3460a55ce25522f0c00659872807/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java#L157



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)