Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/07/24 21:18:20 UTC

[jira] [Commented] (PIG-4958) Tez autoparallelism estimation for order by is higher than mapreduce

    [ https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391176#comment-15391176 ] 

Rohini Palaniswamy commented on PIG-4958:
-----------------------------------------

Solving this by reading the OUTPUT_BYTES counter of the sampler vertices and using that to estimate the input size. If the sampler vertex has multiple outputs, the OUTPUT_BYTES counter value might be too high, since the counter is not available per output. So the estimate is min(OUTPUT_BYTES, sizeEstimatedFromSamples).
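A minimal sketch of the estimate described above, written in Python for illustration only (this is not the actual Pig patch). The function names, the bytes-per-reducer value, and the reducer cap are all assumptions made for the example.

```python
import math

def estimate_input_bytes(output_bytes, size_estimated_from_samples):
    # OUTPUT_BYTES can overcount when the sampler vertex has several
    # outputs, and the in-memory sample size overcounts relative to the
    # serialized size, so take the smaller of the two.
    return min(output_bytes, size_estimated_from_samples)

def estimate_parallelism(input_bytes, bytes_per_reducer, max_reducers):
    # Hypothetical helper: one reducer per bytes_per_reducer of input,
    # clamped to the range [1, max_reducers].
    reducers = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(reducers, max_reducers))

GB = 10**9

# Single-output sampler: the counter reflects the serialized size (10 GB),
# while the sample-based estimate is inflated to 40 GB.
single = estimate_input_bytes(10 * GB, 40 * GB)

# Multi-output sampler: the counter covers all outputs (100 GB),
# so the sample-based estimate (40 GB) is the smaller value.
multi = estimate_input_bytes(100 * GB, 40 * GB)

print(estimate_parallelism(single, GB, 999))
print(estimate_parallelism(multi, GB, 999))
```

With these made-up sizes, the single-output case ends up with the parallelism implied by the serialized size rather than the 4x larger in-memory size.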

Skewed join has the same issue, but the fix is slightly more complicated as we need to include the OUTPUT_BYTES of the right input as well. Currently the estimate is based only on the left input size. That is working out OK, as the in-memory size estimate, which is 4x or more of the left input, will always be greater than (left input + right input OUTPUT_BYTES). Will create a separate jira for that.
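The comparison for skewed join can be written out as a small check. This is only a sketch: the function name is hypothetical, the sizes are made up, and the memory factor is fixed at 4 here even though the comment says "4x or more".

```python
def memory_estimate_covers_join(left_bytes, right_output_bytes, memory_factor=4):
    # Current estimate: in-memory size of the left input only,
    # modeled as memory_factor times its serialized size.
    current_estimate = memory_factor * left_bytes
    # Size the estimate would ideally cover: left input plus the
    # OUTPUT_BYTES of the right input.
    combined = left_bytes + right_output_bytes
    return current_estimate >= combined

GB = 10**9

# Left 10 GB, right 20 GB: 40 GB estimate vs 30 GB combined.
print(memory_estimate_covers_join(10 * GB, 20 * GB))
```

For a given pair of sizes, the check shows whether the left-only estimate is still at least as large as the combined serialized size under the assumed factor.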

> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>
>   The input size is calculated from the size of the samples in memory. The in-memory size is usually 4x or more of the serialized size, whereas MapReduce estimates the number of reducers based on the serialized size.


