
Posted to issues@flink.apache.org by detonator413 <gi...@git.apache.org> on 2015/09/03 16:20:15 UTC

[GitHub] flink pull request: Implementation of distributed copying utility ...

GitHub user detonator413 opened a pull request:

    https://github.com/apache/flink/pull/1090

    Implementation of distributed copying utility using Flink

    Uses a "dynamic" input format in which faster nodes are assigned more files to copy.
    The finest level of granularity is a file.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/detonator413/flink distcp-example-20150903

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1090.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1090
    
----
commit 05fc8ec319780eddbb67269e52f3bc1d090df41f
Author: Vyacheslav Zholudev <vy...@researchgate.com>
Date:   2015-09-03T14:09:18Z

    initial Flink DistCp example

commit c4f2b447e89a4ef80ee6ab171d04c519f7498d0e
Author: Vyacheslav Zholudev <vy...@researchgate.com>
Date:   2015-09-03T14:13:28Z

    a bit more comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by mxm <gi...@git.apache.org>.

Github user mxm commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-138942223
  
    Should we bundle the utility into a JAR like the other examples? If so, we need to adjust the `pom.xml` file in flink-examples.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by mxm <gi...@git.apache.org>.

Github user mxm commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137760319
  
    Yes, I guess it is a better fit for the examples.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by detonator413 <gi...@git.apache.org>.

Github user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137516063
  
    It could be faster because files are assigned to workers dynamically, as opposed to the default distcp method, where sets of files are preassigned to mappers in advance.
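To make the difference concrete, here is a minimal sketch in plain Java. It is not the code from this pull request and it uses no Flink or Hadoop API: the class name `DynamicAssignmentSketch`, the methods `preassign` and `pullFromQueue`, and the simulated per-file delays are all assumptions made for illustration. The first method fixes the plan before any work starts; the second lets each worker pull the next file from a shared queue whenever it becomes idle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DynamicAssignmentSketch {

    /** Static split: file i goes to worker i % workers, decided before copying starts. */
    public static List<List<String>> preassign(List<String> files, int workers) {
        List<List<String>> plan = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            plan.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            plan.get(i % workers).add(files.get(i));
        }
        return plan;
    }

    /**
     * Dynamic split: every worker polls a shared queue for its next file.
     * delayMillisPerFile[w] simulates how long worker w needs per file
     * (a hypothetical stand-in for the real copy).
     */
    public static Map<String, List<String>> pullFromQueue(List<String> files, long[] delayMillisPerFile)
            throws InterruptedException {
        Queue<String> pending = new ConcurrentLinkedQueue<>(files);
        Map<String, List<String>> done = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < delayMillisPerFile.length; w++) {
            final long delay = delayMillisPerFile[w];
            final List<String> mine = new ArrayList<>(); // touched only by this worker until join()
            done.put("worker-" + w, mine);
            Thread t = new Thread(() -> {
                String file;
                while ((file = pending.poll()) != null) {
                    try {
                        Thread.sleep(delay); // pretend to copy the file
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    mine.add(file);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            files.add("file-" + i);
        }
        // Static plan: both workers get 20 files regardless of their speed.
        System.out.println("preassigned sizes: " + preassign(files, 2).stream().map(List::size).toList());
        // Dynamic plan: worker-0 is simulated as 10x faster than worker-1.
        pullFromQueue(files, new long[] {2L, 20L})
                .forEach((name, copied) -> System.out.println(name + " copied " + copied.size()));
    }
}
```

With a preassigned plan the job finishes only when the slowest worker has worked through its fixed share. With the queue, the faster worker will typically end up with more files, so the total time depends less on the slowest node.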



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by mxm <gi...@git.apache.org>.

Github user mxm commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137477152
  
    Thanks for your pull request! I'm assuming you would use this utility to copy files from your local to a remote file system, right? Your utility starts a Flink job to copy the files to the remote file systems. This only works if you execute it locally because otherwise the task managers need to have the files available and that might defeat the utility's purpose. Also, imagine someone embedding the tool in a Flink program. The person might wonder why his/her program actually executes two jobs (one for the utility, one for the actual job). 
    
    I think this would be more useful as a utility function, e.g. in a `FileUtils` class in `flink-core`. The method there would receive a list of files and then upload the files like you did using Flink's `FileSystem` abstraction. We could still parallelize the method by starting multiple threads to upload the files.
    
    Correct me if I'm wrong or misunderstood your pull request :)
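A rough sketch of what such a thread-based utility method could look like. Everything here is an assumption for illustration: the class `ParallelCopySketch` and the method `copyAll` are hypothetical, and the sketch uses `java.nio.file` rather than Flink's `FileSystem` abstraction so that it stays self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCopySketch {

    /**
     * Copies every source file into targetDir using a fixed pool of threads.
     * Returns the number of files copied; a failed copy surfaces as an ExecutionException.
     */
    public static int copyAll(List<Path> sources, Path targetDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(targetDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path source : sources) {
                Path target = targetDir.resolve(source.getFileName());
                // One task per file; idle threads pick up the next queued task.
                Callable<Path> task = () -> Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(task));
            }
            int copied = 0;
            for (Future<Path> result : pending) {
                result.get(); // wait for completion and rethrow any failure
                copied++;
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path sourceDir = Files.createTempDirectory("copy-src");
        Path targetDir = Files.createTempDirectory("copy-dst");
        List<Path> sources = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            sources.add(Files.write(sourceDir.resolve("part-" + i + ".txt"), ("data " + i).getBytes()));
        }
        System.out.println("copied " + copyAll(sources, targetDir, 3) + " files to " + targetDir);
    }
}
```

Because the executor queues one task per file, a thread that finishes early simply takes the next file. All copying still happens on the single machine that calls the method, which is the main difference from running the copy as a distributed job.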



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by detonator413 <gi...@git.apache.org>.

Github user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-138241466
  
    Sure, will push some changes soon



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by detonator413 <gi...@git.apache.org>.

Github user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137680835
  
    Actually, Hadoop distcp also has an implementation of a dynamic input format, which for my taste is a bit overcomplicated. So I'm not sure this Flink tool will bring much benefit in real life (it also lacks elasticity, unlike Hadoop distcp), but it can be a good example of how to implement your own input format for a slightly unusual use case.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by StephanEwen <gi...@git.apache.org>.

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-138240660
  
    Okay, let's merge it to the examples.
    
    @detonator413 Can you add some class-level comments to the files that explain what they do?
    Also, we need to remove the author tags. It is an Apache policy that code is not author tagged.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by mxm <gi...@git.apache.org>.

Github user mxm commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137499875
  
    Thanks for pointing me to the `distcp` page. Until now, I was unaware of this tool :) The performance of Hadoop and Flink should not differ much because copying files is mostly IO-bound work. Still, it is 1.5 minutes faster.
    
    Not sure if we can include your code in the Flink examples but definitely under `flink-contrib` where we usually put external tools that are not directly part of Flink.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by mxm <gi...@git.apache.org>.

Github user mxm commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-138940821
  
    Yes, this failed check is unrelated to your changes.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by detonator413 <gi...@git.apache.org>.

Github user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137480178
  
    Hi Max,
    
    Look at the distcp utility (http://hadoop.apache.org/docs/r1.2.1/distcp.html). Its purpose is to copy large numbers of files within one cluster or between clusters. In local mode the tool also works for the local FS, whereas in distributed mode only HDFS paths are supposed to be used. I ran a simple benchmark copying 800GB of data within one cluster, running Hadoop distcp (using the default distcp input format) and Flink distcp in parallel. The Flink job was 1.5 minutes faster (it took approximately 35 minutes in our setup).
    
    Slava
    
    




[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by StephanEwen <gi...@git.apache.org>.

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137533466
  
    @detonator413 Good point with the dynamic assignment.
    
    What do you think, would `flink-contrib` or `flink-examples` be a better place? Is it rather a nice tool, or is it also educational code?



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by detonator413 <gi...@git.apache.org>.

Github user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-138547980
  
    One profile check mysteriously fails, which seems unrelated to the changes I introduced. The code should now comply with the guidelines.



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by StephanEwen <gi...@git.apache.org>.

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137757590
  
    Okay, why not add it to the examples then?



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by StephanEwen <gi...@git.apache.org>.

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-141406457
  
    Will merge this and add a JAR file entry to the pom file...



[GitHub] flink pull request: Implementation of distributed copying utility ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1090

