You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Vladimir Ozerov (JIRA)" <ji...@apache.org> on 2016/12/05 13:51:58 UTC

[jira] [Updated] (IGNITE-4270) Hadoop: optionally stripe mapper output for every partition.

     [ https://issues.apache.org/jira/browse/IGNITE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Ozerov updated IGNITE-4270:
------------------------------------
    Description: 
Currently we have R maps for M mappers, where R is number of reducers. For this reason many mappers writes to concurrent offheap data structure, loosing time on concurrency burden.
Let's add an option to create R * M maps, so that every mapper has dedicated map for every reducer. This will eliminate almost all concurrency overhead.

Design:
1) Every mapper works with it's own set of "remote" output maps;
2) These maps are essentially not "maps", but IO messages, which we fill up to certain threshold;
3) Once filled, message is sent to remote node.
4) Async shuffle thread is no longer need in this architecture.

As a result we decrease concurrency, removes slowdown from a single shuffle thread which is not able to send messages fast enough, and removes unnecessary intermediate sorting.

NB! Be careful with "combiner" case and with "external" execution.

  was:
Currently we have R maps for M mappers, where R is number of reducers. For this reason many mappers writes to concurrent offheap data structure, loosing time on concurrency burden.

Let's add an option to create R * M maps, so that every mapper has dedicated map for every reducer. This will eliminate almost all concurrency overhead.


> Hadoop: optionally stripe mapper output for every partition.
> ------------------------------------------------------------
>
>                 Key: IGNITE-4270
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4270
>             Project: Ignite
>          Issue Type: Sub-task
>          Components: hadoop
>    Affects Versions: 1.8
>            Reporter: Vladimir Ozerov
>            Assignee: Vladimir Ozerov
>             Fix For: 2.0
>
>
> Currently we have R maps for M mappers, where R is number of reducers. For this reason many mappers writes to concurrent offheap data structure, loosing time on concurrency burden.
> Let's add an option to create R * M maps, so that every mapper has dedicated map for every reducer. This will eliminate almost all concurrency overhead.
> Design:
> 1) Every mapper works with it's own set of "remote" output maps;
> 2) These maps are essentially not "maps", but IO messages, which we fill up to certain threshold;
> 3) Once filled, message is sent to remote node.
> 4) Async shuffle thread is no longer need in this architecture.
> As a result we decrease concurrency, removes slowdown from a single shuffle thread which is not able to send messages fast enough, and removes unnecessary intermediate sorting.
> NB! Be careful with "combiner" case and with "external" execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)