Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2015/04/11 00:31:12 UTC

[jira] [Comment Edited] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes

    [ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490468#comment-14490468 ] 

Bikas Saha edited comment on TEZ-145 at 4/10/15 10:30 PM:
----------------------------------------------------------

Taking a step back, let's figure out the scenarios for this.
Do we agree that:
1) Small jobs (small data) - this is not going to be helpful because we would be adding the latency of an extra stage for a small combiner benefit.
2) Large job (large data) with no data reduction in the map-side combiner - this is not going to be helpful because the extra combiner will not reduce the data further.
3) Large job (large data) with high data reduction in the map-side combiner - this is going to be useful because the extra combiner will reduce the data further and also decrease the number of data shards by aggregating small outputs from the map tasks into a smaller number of combiner tasks.
4) Large job (large data) with a lot of filtering (no combiner) - this may be useful, not because there is a combine operation, but because the combiner tasks would consolidate the large number of small outputs produced by the map tasks into a smaller number of shards.

For 3/4 this may be useful if we can run aggregation combiner tasks at the rack level to coalesce the data within a rack (cheap) rather than having to pull that data across racks in the final reducer. Even in these cases, given better networks, we need to understand the trade-off between pulling the data across racks to the final reducer and the cost of running the extra combiner stage. Essentially, what is the killer scenario for this?
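
One rough way to frame the trade-off above is a back-of-envelope model. The sketch below is Python for illustration only, not Tez code, and it is not taken from either attached patch. Every name in it (Job, rack_reduction, stage_overhead_s, the bandwidth fields) is hypothetical, and it assumes a fixed cost for the extra stage, a single aggregate bandwidth figure within racks and across racks, and a shuffle in which every reducer fetches one partition from every upstream task.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job description; none of these fields come from Tez."""
    map_output_bytes: float   # bytes leaving the map stage (after any map-side combiner)
    rack_reduction: float     # fraction of bytes left after a rack-level combine (1.0 = no reduction)
    intra_rack_bw: float      # assumed aggregate bytes/sec within a rack
    cross_rack_bw: float      # assumed aggregate bytes/sec across racks
    stage_overhead_s: float   # assumed fixed latency of one extra stage


def direct_shuffle_seconds(job: Job) -> float:
    """Final reducers pull all map output across racks."""
    return job.map_output_bytes / job.cross_rack_bw


def rack_combiner_seconds(job: Job) -> float:
    """Gather within the rack, combine, then ship the reduced data across racks."""
    gather = job.map_output_bytes / job.intra_rack_bw
    ship = job.map_output_bytes * job.rack_reduction / job.cross_rack_bw
    return job.stage_overhead_s + gather + ship


def combiner_helps(job: Job) -> bool:
    """True when the extra stage is estimated to pay for itself."""
    return rack_combiner_seconds(job) < direct_shuffle_seconds(job)


def shuffle_fetches(num_producers: int, num_reducers: int) -> int:
    """Number of fetches if every reducer pulls one partition from every producer."""
    return num_producers * num_reducers
```

With made-up numbers, Job(1e8, 0.5, 1e9, 1e8, 10.0) behaves like scenario 1 and Job(1e12, 1.0, 1e9, 1e8, 10.0) like scenario 2: combiner_helps returns False for both. Job(1e12, 0.1, 1e9, 1e8, 10.0) behaves like scenario 3 and returns True. Scenario 4 is about shard count rather than bytes: shuffle_fetches(10000, 100) compared with shuffle_fetches(200, 100) shows how a smaller combiner stage cuts the number of fetches even when rack_reduction is 1.0. The model also shows the "better networks" concern: as cross_rack_bw approaches intra_rack_bw, the gather step costs about as much as the direct shuffle and the benefit disappears.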


was (Author: bikassaha):
Taking a step back, let's figure out the scenarios for this.
Do we agree that for small jobs (small data) - this is not going to be helpful because we would be adding the latency of an extra stage for a small combiner benefit.
Large job (large data) with no data reduction in the map-side combiner - this is not going to be helpful because the extra combiner will not reduce the data further.
Large job (large data) with high data reduction in the map-side combiner - this is going to be useful because the extra combiner will reduce the data further and also decrease the number of data shards by aggregating small outputs from the map tasks into a smaller number of combiner tasks.
Large job (large data) with a lot of filtering (no combiner) - this may be useful, not because there is a combine operation, but because the combiner tasks would consolidate the large number of small outputs produced by the map tasks into a smaller number of shards.

> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
>                 Key: TEZ-145
>                 URL: https://issues.apache.org/jira/browse/TEZ-145
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tsuyoshi Ozawa
>         Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit from running in multi-level trees, supporting a combiner that runs in a non-local mode would allow performance gains from running the combiner at the rack level.


