Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2015/04/07 01:48:12 UTC
[jira] [Commented] (PIG-4495) Better multi-query planning in case of multiple edges

    [ https://issues.apache.org/jira/browse/PIG-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482215#comment-14482215 ] 

Rohini Palaniswamy commented on PIG-4495:
-----------------------------------------

This patch essentially removes the need for the request in TEZ-1190 (Allow multiple edges between two vertexes).

Changes done:
   1) Case of Self join/cross/cogroup
        - Multiple sub-plans of the split write to the same output. POShuffleTezLoad is now capable of splitting the input into the correct bags based on the index in the key.
        - Do not allow cases like self-replicate/self-skewed join
   2) Case of union
        - Multiple sub-plans of the split write to the same output and connect to the vertex group. If only sub-plans of the split are members of the union, then no vertex group is created and the split is directly connected to the union's successors.
        - For cases like nightly.conf Union_16.pig (moved to multiquery.conf now) which has multiple levels of union all from same split, even the vertex group created is removed and all the split sub-plans write directly to the successor.
   3) Other optimizations done
        - Previously, a union followed by a replicate join was not optimized (PIG-3856). If the union is within the same split, we now broadcast the replicate join once to the split operator.
   4) Refactored code in UnionOptimizer into methods for readability.
   5) Not very related, but cleaned up TestMultiQueryLocal, because while testing this patch I had to search the logs for the logged exception instead of being able to look at the JUnit test failure stack trace.
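
As a rough illustration of change 1 (splitting one shuffled input into the correct bags based on the index in the key), here is a minimal Python sketch. It is not the POShuffleTezLoad code; the record layout ((key, index), value) and the name split_into_bags are assumptions made only for this example.

```python
from collections import defaultdict


def split_into_bags(records, num_inputs):
    """Group records by key and place each value in the bag for its input index.

    records: iterable of ((key, index), value) tuples (hypothetical layout).
    Returns {key: [bag_0, ..., bag_{num_inputs-1}]}.
    """
    bags_by_key = defaultdict(lambda: [[] for _ in range(num_inputs)])
    for (key, index), value in records:
        # The index carried in the key says which logical input the value belongs to.
        bags_by_key[key][index].append(value)
    return dict(bags_by_key)


# Self join: both sides arrive over a single output, tagged with index 0 or 1.
records = [
    (("a", 0), "x1"),
    (("a", 1), "x1"),
    (("b", 0), "y1"),
    (("a", 0), "x2"),
    (("a", 1), "x2"),
    (("b", 1), "y1"),
]
bags = split_into_bags(records, num_inputs=2)
print(bags["a"])
print(bags["b"])
```

Each key ends up with one bag per logical input, even though all values were written to the same output.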

For one of the Pig scripts, which had 72 vertices and vertex groups (due to lots of splits and unions), the new plan has just 18. The performance difference was small, improving by only a minute (from 9 mins to 8 mins); I expected it to be better and need to investigate. It used far fewer resources (from 3561 tasks to 1011 tasks; MR uses 999 tasks in total), which is good, and there was also a good difference in the file bytes read (2.7G less) and written (fG less) counters for the 64G input data, as more vertices were merged into the Split vertex. Writing to the same output without using a vertex group to combine it later also helps, as io.sort.mb is not divided between multiple outputs, making sorting faster.
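
To put the numbers above side by side, the relative reductions can be worked out with a few lines of Python. The helper percent_reduction is a name introduced only for this example; the inputs are the figures quoted in this comment.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return 100.0 * (before - after) / before


vertices = percent_reduction(72, 18)    # vertices + vertex groups in the plan
tasks = percent_reduction(3561, 1011)   # total tasks
runtime = percent_reduction(9, 8)       # minutes

print(f"vertices/vertex groups: {vertices:.1f}% fewer")
print(f"tasks: {tasks:.1f}% fewer")
print(f"runtime: {runtime:.1f}% shorter")
```

The plan and task counts shrink by roughly three quarters, while the runtime improves by only about a ninth, which is the gap that still needs investigating.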

> Better multi-query planning in case of multiple edges
> -----------------------------------------------------
>
>                 Key: PIG-4495
>                 URL: https://issues.apache.org/jira/browse/PIG-4495
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>    Affects Versions: 0.14.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.15.0
>
>         Attachments: PIG-4495-1.patch
>
>
> Details in https://issues.apache.org/jira/browse/TEZ-1190?focusedCommentId=14393033&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14393033
> People split the data, perform some foreach transformations/filters, union them, and then do some operation like group by or join with other data. In those cases this creates multiple edges from the same Split, so we do not merge them but instead write out the data to another dummy vertex to avoid multiple edges; this adds overhead and affects performance. Vertex groups accept multiple edges from the same vertex. So if the multiple edges end up in a vertex group (and not a vertex, which is the case in self join) we can avoid the dummy vertex.
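
The rule in the description above (multiple edges from the same vertex are fine when they end in a vertex group, but not when they end in a plain vertex) can be sketched in Python as follows. This is only an illustration under assumed names: destinations_needing_dummy_vertex, the edge tuples, and the vertex names are hypothetical and are not part of Pig's planner or the Tez API.

```python
from collections import Counter


def destinations_needing_dummy_vertex(edges, vertex_groups):
    """Return destinations that get more than one edge from the same source
    and are plain vertices rather than vertex groups.

    edges: list of (source, destination) pairs, one per edge.
    vertex_groups: set of destination names that are vertex groups.
    """
    edge_counts = Counter(edges)  # (source, destination) -> number of edges
    return sorted(
        {
            dst
            for (src, dst), count in edge_counts.items()
            if count > 1 and dst not in vertex_groups
        }
    )


edges = [
    ("split", "union_group"),
    ("split", "union_group"),  # two sub-plans of the split feed the union
    ("split", "self_join"),
    ("split", "self_join"),    # both sides of a self join come from the split
    ("load", "self_join"),
]
result = destinations_needing_dummy_vertex(edges, vertex_groups={"union_group"})
print(result)
```

In this toy plan only the self join would still need the dummy vertex under the old approach; the union case is already covered by the vertex group.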



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)