Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2015/02/13 17:03:11 UTC
[jira] [Comment Edited] (PIG-4420) Support for map side cross similar to replicate join

    [ https://issues.apache.org/jira/browse/PIG-4420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320325#comment-14320325 ] 

Rohini Palaniswamy edited comment on PIG-4420 at 2/13/15 4:02 PM:
------------------------------------------------------------------

Thanks [~brian@brianjohnson.cc]. That is a very nice workaround. Still, this feature is worth having: maintaining a list (avoiding the cost of constructing a hashmap) and not rearranging tuples to get the key and value for replicate join will cut down a lot of overhead and boost performance considerably. Replicate join itself needs a revisit for performance, as the SchemaTuple stuff seems to be adding more memory overhead. I recently found PIG-865, which is another source of waste for replicate join. Also, Hive folks were saying that they have reduced the data structures used in their map-side join with Tez and that it is far more efficient, but I haven't got around to looking into it.

The replicate join workaround will run with parallelism equal to the number of splits in A. To speed up the CROSS, we also set the settings below to values less than 128MB, which increases the parallelism by increasing the number of splits in A.

mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
pig.maxCombinedSplitSize
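
For illustration, a minimal sketch of how the settings above might be combined with a replicate-join-based cross. The workaround itself is not quoted in this message, so the constant-key join below is an assumption about its general shape, not a copy of it. The input paths, schemas, alias names, and the 64MB value are all hypothetical; the only constraint taken from the text above is that the split sizes are below 128MB.

```pig
-- Assumed value: 64MB (67108864 bytes), chosen only because it is below 128MB.
SET mapreduce.input.fileinputformat.split.minsize 67108864;
SET mapreduce.input.fileinputformat.split.maxsize 67108864;
SET pig.maxCombinedSplitSize 67108864;

-- Hypothetical inputs: A is the big relation, B is small enough to fit in memory.
A = LOAD 'big_input' AS (a1:chararray, a2:int);
B = LOAD 'small_input' AS (b1:chararray, b2:int);

-- Add the same constant key to every record so that each record of A
-- matches each record of B.
A1 = FOREACH A GENERATE 1 AS k, *;
B1 = FOREACH B GENERATE 1 AS k, *;

-- The last relation listed (B1) is the one loaded into memory on the map side.
J = JOIN A1 BY k, B1 BY k USING 'replicated';

-- Drop the constant key to leave the cross product.
C = FOREACH J GENERATE A1::a1 AS a1, A1::a2 AS a2, B1::b1 AS b1, B1::b2 AS b2;
```

Smaller splits on A mean more map tasks, each of which loads its own copy of B, so the split size is a trade-off between parallelism and repeated loading of the small relation.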


> Support for map side cross similar to replicate join
> ----------------------------------------------------
>
>                 Key: PIG-4420
>                 URL: https://issues.apache.org/jira/browse/PIG-4420
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>
>    Our CROSS implementation is very costly. We recently had a case where a user was doing a CROSS of 30 million records against 3K records, and it caused a lot of disk error exceptions during the shuffle phase. We need to add support for a map-side cross syntax:
> C = CROSS A, B using 'replicate';
> The smaller table can be loaded into a list (replicate join uses a hashmap) and iterated through for each record in the bigger table. This should give a major performance boost and drastically reduce resource usage.


