You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2014/12/10 17:02:12 UTC

[jira] [Commented] (FLINK-1287) Improve File Input Split assignment

    [ https://issues.apache.org/jira/browse/FLINK-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241300#comment-14241300 ] 

ASF GitHub Bot commented on FLINK-1287:
---------------------------------------

GitHub user fhueske opened a pull request:

    https://github.com/apache/incubator-flink/pull/258

    [FLINK-1287] LocalizableSplitAssigner prefers splits with less degrees of freedom

    The current LocalizableSplitAssigner assigns remote and local splits without priorities, i.e., each remote (local) split has the same probability of being assigned.
    With this change, splits are prioritised by the number of hosts they can be locally read from. Splits with fewer options to be locally read are preferred over others that can be locally accessed from many hosts.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fhueske/incubator-flink splitAssignment

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-flink/pull/258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #258
    
----
commit 8900b91dc50a55d1149482c3f9c10dc3eaa5048c
Author: Fabian Hueske <fh...@apache.org>
Date:   2014-12-07T21:28:22Z

    [FLINK-1287] LocalizableSplitAssigner prefers splits with less degrees of freedom

----


> Improve File Input Split assignment
> -----------------------------------
>
>                 Key: FLINK-1287
>                 URL: https://issues.apache.org/jira/browse/FLINK-1287
>             Project: Flink
>          Issue Type: Improvement
>          Components: Local Runtime
>            Reporter: Robert Metzger
>            Assignee: Fabian Hueske
>
> While running some DFS read-intensive benchmarks, I found that the assignment of input splits is not optimal. In particular in cases where the numWorker != numDataNodes and when the replication factor is low (in my case it was 1).
> In the particular example, the input had 40960 splits, of which 4694 were read remotely.  Spark did only 2056 remote reads for the same dataset.
> With the replication factor increased to 2, Flink did only 290 remote reads. So usually, users shouldn't be affected by this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)