Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2017/12/28 21:38:00 UTC

[jira] [Updated] (PIG-5324) Stream the first input in Tez for skewed, bloom and replicate joins

     [ https://issues.apache.org/jira/browse/PIG-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-5324:
------------------------------------
    Description: 
 In PIG-5323, the code streams the last input, as in MR. In MR, only the last input can be streamed because it is read last (key values are sorted by index). In Tez, however, each input is separate (except in the case of a self-join), so it is possible to stream either the first or the last input. For skewed, bloom and replicate joins, the leftmost dataset is always the big one, so we should stream that instead of the rightmost dataset.

If we actually had the size of each input (from counters), we could choose which input to stream dynamically at runtime. This would help with regular joins where people have not placed the bigger dataset rightmost.

  was:
 In PIG-5323, the code streams the last input similar to MR. In MR only last input can be streamed because it is read last (key values are sorted by index). But in Tez since each input (except for cases of self-join) is separate. So it is possible to either stream the first or the last input. For the cases of skewed join, bloom join and replicate join it is the left most dataset that is always big. So we should stream that instead of the right most dataset. 

If we actually had sizes of each input (from counters), we can choose which input to stream dynamically during runtime. This can help with cases of regular join where people have not placed the bigger dataset as the right most one.


> Stream the first input in Tez for skewed, bloom and replicate joins
> -------------------------------------------------------------------
>
>                 Key: PIG-5324
>                 URL: https://issues.apache.org/jira/browse/PIG-5324
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>
>  In PIG-5323, the code streams the last input, as in MR. In MR, only the last input can be streamed because it is read last (key values are sorted by index). In Tez, however, each input is separate (except in the case of a self-join), so it is possible to stream either the first or the last input. For skewed, bloom and replicate joins, the leftmost dataset is always the big one, so we should stream that instead of the rightmost dataset.
> If we actually had the size of each input (from counters), we could choose which input to stream dynamically at runtime. This would help with regular joins where people have not placed the bigger dataset rightmost.
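
The selection rule described in the issue can be sketched roughly as follows. This is only an illustration, not Pig's actual code: the class StreamedInputChooser, the JoinType enum and the chooseStreamedInput method are hypothetical names, and the inputSizes array stands in for sizes that would have to come from counters, which the issue notes are not currently available.

```java
// Hypothetical sketch; none of these names exist in Pig or Tez.
final class StreamedInputChooser {

    // Join flavours mentioned in the issue (REPLICATED = "replicate join").
    enum JoinType { REGULAR, SKEWED, BLOOM, REPLICATED }

    /**
     * Returns the index of the input to stream.
     *
     * @param type       kind of join being executed
     * @param numInputs  number of join inputs (leftmost is index 0)
     * @param inputSizes assumed per-input sizes, or null if unknown
     */
    static int chooseStreamedInput(JoinType type, int numInputs, long[] inputSizes) {
        if (numInputs < 1) {
            throw new IllegalArgumentException("need at least one input");
        }
        // Skewed, bloom and replicate joins: the leftmost dataset is the big one.
        if (type != JoinType.REGULAR) {
            return 0;
        }
        // Regular join with known sizes: stream the largest input.
        if (inputSizes != null && inputSizes.length == numInputs) {
            int largest = 0;
            for (int i = 1; i < numInputs; i++) {
                // ">=" prefers the later input on ties, matching the default below.
                if (inputSizes[i] >= inputSizes[largest]) {
                    largest = i;
                }
            }
            return largest;
        }
        // Sizes unknown: fall back to streaming the last input, as in PIG-5323.
        return numInputs - 1;
    }

    public static void main(String[] args) {
        System.out.println(chooseStreamedInput(JoinType.SKEWED, 2, null));
        System.out.println(chooseStreamedInput(JoinType.REGULAR, 3, new long[] {10, 500, 20}));
        System.out.println(chooseStreamedInput(JoinType.REGULAR, 3, null));
    }
}
```

With sizes unknown, the sketch keeps the PIG-5323 behaviour for regular joins and only changes the choice for the three join types where the leftmost input is known to be the large one.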



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)