Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2018/07/02 17:02:00 UTC

[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

    [ https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530184#comment-16530184 ] 

Rohini Palaniswamy commented on PIG-5342:
-----------------------------------------

1) Can you also add pig.bloomjoin.num.filters to the e2e tests for the reduce type?
2) You still need the combiner for the map type.
3) Use return (int) t.get(0); in BloomPartitioner.

> Add setting to turn off bloom join combiner
> -------------------------------------------
>
>                 Key: PIG-5342
>                 URL: https://issues.apache.org/jira/browse/PIG-5342
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>         Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Previously, the keys were the bloom filter index and the values were the join key. Combining involved doing a distinct on the bag of values, which has memory issues for more than 10 million records. That needs to be flipped and the distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for right outer joins with the smaller dataset on the right. Replicated join only supports left outer join.
>  
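
For readers unfamiliar with the "flip" described in item 2 of the issue, here is a minimal sketch in plain Java with no Pig or Hadoop dependencies. It is not the code in the attached patches. The class BloomDistinctSketch, the FilterKey type, and the helpers combineDistinct and getPartition are hypothetical names used only for illustration. The idea shown: once the bloom filter index and the join key together form the record key, the shuffle sort places duplicates next to each other, so a distinct-style combiner only needs to remember the previous key instead of holding a bag of values in memory. The partitioner then routes on the first field alone, in the spirit of the t.get(0) suggestion above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BloomDistinctSketch {

    /** Hypothetical composite key: (bloom filter index, join key). */
    static final class FilterKey {
        final int filterIndex;
        final String joinKey;

        FilterKey(int filterIndex, String joinKey) {
            this.filterIndex = filterIndex;
            this.joinKey = joinKey;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FilterKey)) {
                return false;
            }
            FilterKey other = (FilterKey) o;
            return filterIndex == other.filterIndex && joinKey.equals(other.joinKey);
        }

        @Override
        public int hashCode() {
            return 31 * filterIndex + joinKey.hashCode();
        }
    }

    /** Sort order assumed for the shuffle: filter index first, then join key. */
    static final Comparator<FilterKey> ORDER =
            Comparator.<FilterKey>comparingInt(k -> k.filterIndex)
                    .thenComparing(k -> k.joinKey);

    /**
     * Distinct-style combine over sorted input. Only the previous key is
     * retained, so memory use does not grow with the number of records.
     */
    static List<FilterKey> combineDistinct(List<FilterKey> sortedKeys) {
        List<FilterKey> out = new ArrayList<>();
        FilterKey prev = null;
        for (FilterKey k : sortedKeys) {
            if (!k.equals(prev)) {
                out.add(k);
                prev = k;
            }
        }
        return out;
    }

    /** Route every record for one filter to the same partition (first field only). */
    static int getPartition(FilterKey key, int numPartitions) {
        return Math.floorMod(key.filterIndex, numPartitions);
    }

    public static void main(String[] args) {
        List<FilterKey> keys = new ArrayList<>();
        keys.add(new FilterKey(1, "b"));
        keys.add(new FilterKey(0, "a"));
        keys.add(new FilterKey(1, "b")); // duplicate, dropped by the combine step
        keys.add(new FilterKey(0, "c"));
        keys.sort(ORDER);
        for (FilterKey k : combineDistinct(keys)) {
            System.out.println(k.filterIndex + "\t" + k.joinKey
                    + "\tpartition=" + getPartition(k, 2));
        }
    }
}
```

When the join keys are already unique, the combine step above removes nothing, which is the case the proposed pig.bloomjoin.nocombiner setting is meant to skip.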



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)