Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2018/09/26 20:36:00 UTC

[jira] [Comment Edited] (PIG-5342) Add setting to turn off bloom join combiner

    [ https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629371#comment-16629371 ] 

Rohini Palaniswamy edited comment on PIG-5342 at 9/26/18 8:35 PM:
------------------------------------------------------------------

1) For the reduce case, we can optimize by always making the keys NullableBytesWritable and calling DataType.toBytes(key, keyType) in POBuildBloomRearrangeTez itself, on the map side. The comparator also needs to be set to PigBytesRawBytesComparator. Can you make that change?
2) Can you remove these lines, since we have a distinct combiner now?
 // In case of reduce, not adding a combiner and doing the distinct during reduce itself.
 // If needed one can be added later
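
To illustrate the idea in 1), here is a rough plain-Java sketch, not Pig code. It assumes keys are serialized to bytes once on the map side and afterwards compared only as raw bytes, so the distinct step needs to remember just the previous key rather than hold a bag of values. The class name RawBytesDistinctSketch and the helpers toBytes, compareRaw and distinctSorted are hypothetical stand-ins; toBytes only handles String keys and is not DataType.toBytes, and compareRaw only mimics what a raw bytes comparator might do.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RawBytesDistinctSketch {

    // Hypothetical stand-in for serializing a key on the map side.
    // Only String keys are handled here; this is NOT Pig's DataType.toBytes.
    static byte[] toBytes(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // Assumed raw comparison: unsigned lexicographic order over the
    // serialized bytes, with no deserialization of the key.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Distinct over input that is already sorted by compareRaw:
    // only the previous key is kept in memory while scanning.
    static List<byte[]> distinctSorted(List<byte[]> sortedKeys) {
        List<byte[]> out = new ArrayList<>();
        byte[] prev = null;
        for (byte[] key : sortedKeys) {
            if (prev == null || compareRaw(prev, key) != 0) {
                out.add(key);
            }
            prev = key;
        }
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>();
        for (String s : new String[] {"b", "a", "b", "c", "a"}) {
            keys.add(toBytes(s));
        }
        // Stands in for the shuffle sort driven by the raw comparator.
        keys.sort(RawBytesDistinctSketch::compareRaw);
        for (byte[] key : distinctSorted(keys)) {
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }
    }
}
```

In the real operator the sort would come from the shuffle and the keys would be wrapped in NullableBytesWritable; the sketch only shows why byte-level ordering is enough for the distinct.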

Another optimization would be to use IntWritable instead of NullableTuple for the value type. But that needs more work. We can do that later in another jira.


was (Author: rohini):
 For the reduce case, we can optimize by making the keys always NullableBytesWritable and doing the DataType.toBytes(key, keyType) in the POBuildBloomRearrangeTez itself on the map side. Comparator also needs to be set to PigBytesRawBytesComparator. Can you make that change?

Another optimization would be to use IntWritable instead of NullableTuple for the value type. But that needs more work. We can do that later in another jira.

> Add setting to turn off bloom join combiner
> -------------------------------------------
>
>                 Key: PIG-5342
>                 URL: https://issues.apache.org/jira/browse/PIG-5342
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>         Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, PIG-5342-4.patch, PIG-5342-5.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were the join key. Combining involved doing a distinct on the bag of values, which has memory issues for more than 10 million records. That needs to be flipped, and the distinct combiner used, to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for right outer joins with a smaller dataset on the right. Replicate join only supports left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)