Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2014/03/28 00:01:16 UTC

[jira] [Commented] (PIG-3631) Improve performance of replicate-join

    [ https://issues.apache.org/jira/browse/PIG-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950074#comment-13950074 ] 

Rohini Palaniswamy commented on PIG-3631:
-----------------------------------------

bq. Verify no performance regression with MR: The above approach is good when the replicate join is not the first vertex of the DAG (i.e., in the case of MR, the replicate join is part of a reduce). If it is the first vertex of the DAG, we need to compare and confirm that this approach does not regress against MR's map-only replicate join using the distributed cache.
   Broadcasting and using the Tez vertex cache for replicate join performs substantially better than the distributed cache approach. Joining 1 TB of data with a 100 MB table using replicate join was 2.5-3x faster on a 25-node cluster, due to high container reuse plus vertex caching.
  

> Improve performance of replicate-join
> -------------------------------------
>
>                 Key: PIG-3631
>                 URL: https://issues.apache.org/jira/browse/PIG-3631
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>             Fix For: tez-branch
>
>
> Replicated join is implemented in Tez as follows:
> - POFRJoinTez extends POFRJoin. The difference between the two is that the replication hash table is constructed from broadcast edges in Tez instead of from files on the distributed cache in MR.
> - TezCompiler adds a vertex per replicated table and connects it to the POFRJoin vertex via a broadcast edge.
> Verify no performance regression with MR:
>   - The above approach is good when the replicate join is not the first vertex of the DAG (i.e., in the case of MR, the replicate join is part of a reduce). If it is the first vertex of the DAG, we need to compare and confirm that this approach does not regress against MR's map-only replicate join using the distributed cache.
> Evaluate:
>    - Instead of broadcasting key-value pairs and constructing the hashmap, evaluate broadcasting a serialized hashmap (or distributing it via the cache, depending on performance) and loading it as-is, similar to Hive.
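
For readers unfamiliar with the technique, the replicated join described above can be sketched roughly as follows. This is a minimal illustration only, not Pig's POFRJoin or POFRJoinTez code: the class name, method names, and the two-column row layout (join key, then payload) are all hypothetical. It builds an in-memory hash table from the small (replicated) relation, then streams the large relation past it, which is the general shape of a hash-based replicated join regardless of whether the small table arrives over a broadcast edge or from the distributed cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a replicated (hash) join; not Pig's actual implementation.
public class ReplicatedJoinSketch {

    // Build phase: index the small, replicated relation by join key.
    // Assumed row layout: row[0] = join key, row[1] = payload.
    public static Map<String, List<String>> buildHashTable(List<String[]> smallRelation) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : smallRelation) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return table;
    }

    // Probe phase: stream the large relation and emit one output row per match.
    public static List<String> probe(Map<String, List<String>> table,
                                     List<String[]> largeRelation) {
        List<String> out = new ArrayList<>();
        for (String[] row : largeRelation) {
            List<String> matches = table.get(row[0]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String payload : matches) {
                out.add(row[0] + "," + row[1] + "," + payload);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> small = Arrays.asList(
                new String[] {"1", "a"},
                new String[] {"2", "b"});
        List<String[]> large = Arrays.asList(
                new String[] {"1", "x"},
                new String[] {"3", "y"},
                new String[] {"2", "z"});
        for (String line : probe(buildHashTable(small), large)) {
            System.out.println(line);
        }
    }
}
```

In this sketch the build phase runs once per task; the container reuse and vertex caching mentioned in the comment would, in effect, let later tasks skip that build step, which the sketch does not model.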



--
This message was sent by Atlassian JIRA
(v6.2#6252)