Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/09/11 01:33:20 UTC

[jira] [Commented] (PIG-4958) Tez autoparallelism estimation for order by is higher than mapreduce

    [ https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15480794#comment-15480794 ] 

Rohini Palaniswamy commented on PIG-4958:
-----------------------------------------

bq. If we want to use counter, why not get NUM_RECORDS as well? Then we can remove NUMROWS_TUPLE_MARKER row and simplify the code.
That is not an accurate indicator. It includes the records that were sent to the sampler and to any other split output, and there is no way to separate out the NUM_RECORDS of the other outputs. Even subtracting the number of records sent to the sampler complicates things a lot, as it is not always 100 * number of tasks. We may have fewer than 100 samples if a task does not have that many rows, and there is no way to determine that. The number of samples is also configurable, and that has to be taken into account.
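
To illustrate why the sampler count cannot simply be assumed, here is a minimal sketch in Python. The function name, the per-task row counts, and the default of 100 samples per task are hypothetical values chosen for illustration; they are not taken from the Pig code.

```python
def sampled_records(rows_per_task, samples_per_task=100):
    """Total records sent to the sampler across all tasks.

    Assumption for this sketch: a task emits at most `samples_per_task`
    samples, and fewer if it does not have that many rows.
    """
    return sum(min(rows, samples_per_task) for rows in rows_per_task)


# Hypothetical row counts for four tasks; two of them have fewer than 100 rows.
rows_per_task = [5000, 40, 7, 1200]

naive_guess = 100 * len(rows_per_task)          # assumes every task sent 100 samples
actual_count = sampled_records(rows_per_task)   # what the sampler really received

print(naive_guess, actual_count)
```

Subtracting the naive guess from NUM_RECORDS would give the wrong row count whenever some task has fewer rows than the sample limit, and the per-task row counts are not known at that point.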

bq. On the other hand, we can also write a GetDiskNumRows instead of GetMemNumRows to estimate the serialized size.
Started with that. GetDiskNumRows gets very complex when you want to accurately estimate the size of a tuple and bag. You have to totally duplicate the logic of BinSedesTuple to get the exact serialized size on disk. Even then you are only measuring the sample records, which does not help when record sizes are not closely equal. The current approach is simpler, faster, and more accurate.

bq. The DAGClientImpl + RM token approach sounds a little scary to me.
Have been running with it for more than a month and have not seen any issues. We don't fetch the token for Oozie jobs as they already have it, only for command line ones. I think Tez itself plans to add the RM token in Tez 0.9, if I remember correctly what [~hitesh] mentioned in an offline conversation.

> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4958-1.patch, PIG-4958-2.patch, PIG-4958-withoutsecurity.patch
>
>
>   The input size is calculated from the size of the samples in memory. Size in memory is usually 4x the serialized size or more. Mapreduce estimates the number of reducers based on serialized size.
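
The effect described in the issue can be sketched with some rough arithmetic. This is a simplified model, not Pig's actual estimator: the function name is hypothetical, the 50 GB input is an invented figure, and the 1 GB per reducer and 999 reducer cap are assumed settings used only for illustration. The 4x factor is the one quoted in the description.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Simplified estimate: one reducer per `bytes_per_reducer` of input, capped."""
    needed = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, needed))


serialized_bytes = 50 * 10**9          # hypothetical serialized input size
in_memory_bytes = 4 * serialized_bytes  # in-memory size, using the 4x factor from the issue

from_serialized = estimate_reducers(serialized_bytes)  # estimate based on serialized size
from_memory = estimate_reducers(in_memory_bytes)       # estimate based on in-memory size

print(from_serialized, from_memory)
```

With the same bytes-per-reducer setting, an estimate based on in-memory size comes out roughly 4x higher than one based on serialized size, which matches the discrepancy the issue title describes.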



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)