Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/07/25 19:30:20 UTC

[jira] [Comment Edited] (PIG-4958) Tez autoparallelism estimation for order by is higher than mapreduce

    [ https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392540#comment-15392540 ] 

Rohini Palaniswamy edited comment on PIG-4958 at 7/25/16 7:30 PM:
------------------------------------------------------------------

The approach in the patch above, which makes a DAGClient call from the task, requires getting an RM token and passing it to the job. I talked with [~jlowe] and obviously he doesn't like the idea of talking to the RM from a task.

A task in the target vertex needs to get the OUTPUT_BYTES counter of all the input vertices. The target vertex is a sample aggregator vertex and only receives the samples. What is required is the OUTPUT_BYTES size of the source vertex's other output (the actual data being ordered), which the target vertex cannot access from its VertexManagerPlugin.

Problem 1 - Get counter value
Problem 2 - Pass it to the task

I had already looked at other options. There does not seem to be a good way to do it with VertexManagerPlugin. There is no API to get counters or to send events to another VertexManagerPlugin class in the AM. 

[~bikassaha]/[~hitesh]/[~sseth],
    Is there any other cleaner and simpler way to do it and avoid DAGClientImplRPC? 
      


was (Author: rohini):
The approach in the patch above, which makes a DAGClient call from the task, requires getting an RM token and passing it to the job. I talked with [~jlowe] and obviously he doesn't like the idea of talking to the RM from a task.

A task in the target vertex needs to get the OUTPUT_BYTES counter of all the input vertices. 
Problem 1 - Get counter value
Problem 2 - Pass it to the task

I had already looked at other options. There does not seem to be a good way to do it with VertexManagerPlugin. There is no API to get counters or to send events to another VertexManagerPlugin class in the AM. 

[~bikassaha]/[~hitesh]/[~sseth],
    Is there any other cleaner and simpler way to do it and avoid DAGClientImplRPC? 
      

> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4958-withoutsecurity.patch
>
>
>   In Tez, the input size is calculated from the size of the samples in memory. The in-memory size is usually 4x or more the serialized size, whereas MapReduce estimates the number of reducers from the serialized size, so the Tez estimate comes out higher.
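
To make the effect in the description concrete, here is a minimal sketch of a size-based reducer estimate. It is not Pig's actual estimator: the class name, the 1 GB per reducer target, the cap of 999, and the 10 GB input are all assumptions chosen for illustration. Only the 4x in-memory inflation comes from the description above.

```java
// Hypothetical sketch; names and constants are illustrative, not Pig's code.
class ReducerEstimateSketch {
    // Assumed target bytes per reducer and upper bound on reducers.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    // Ceiling of inputBytes / BYTES_PER_REDUCER, clamped to [1, MAX_REDUCERS].
    static int estimateReducers(long inputBytes) {
        long n = (inputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.max(1, Math.min(MAX_REDUCERS, n));
    }

    public static void main(String[] args) {
        long serializedBytes = 10_000_000_000L;   // made-up 10 GB serialized input
        long inMemoryBytes = 4 * serializedBytes; // 4x inflation from the description

        // Estimate driven by serialized size (what MapReduce uses).
        System.out.println(estimateReducers(serializedBytes));
        // Estimate driven by in-memory sample size (what the Tez path uses).
        System.out.println(estimateReducers(inMemoryBytes));
    }
}
```

With these assumed numbers the same data yields 10 reducers from the serialized size and 40 from the in-memory size, which is the kind of gap the issue title refers to.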


