Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2011/09/19 23:13:08 UTC

[jira] [Assigned] (PIG-2293) Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.

     [ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-2293:
-------------------------------

    Assignee: Aaron Klish

> Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>            Assignee: Aaron Klish
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing Pig merge join has the following limitations:
>    1. It assumes the right-side table must be accessed sequentially, record by record.
>    2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface IndexableLoadFunc.  This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next record).  
> The merge join physical operator only calls 'seekNear' for the first key in each split (effectively eliminating splits
> where the first and subsequent keys will not be found).  Subsequent matches are found by reading sequentially through
> the records of the right table, looking for keys that match the left table.
> While this method works well for dense join tables, it performs poorly against large, sparse tables or data sources that support
> point lookups natively (HBase, for example).
> The proposed enhancement is to add a new join type, 'merge-sparse', to Pig Latin.  When specified in the Pig script, this join type
> will cause the merge join operator to call seekNear on each and every key (rather than just the first in each split).
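
To make the trade-off described above concrete, here is a minimal, self-contained sketch in Java. It is not Pig code: the class MergeJoinSketch, the in-memory RightTable, and the read counters are all hypothetical, and only the name seekNear is taken from the description. The sketch assumes both sides are sorted ascending and that right-side keys are unique. It contrasts one seek followed by a sequential scan (the behavior described for the current merge join) with a seek on every left key (the behavior described for the proposed 'merge-sparse' join).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison of the two strategies; every name here is hypothetical. */
final class MergeJoinSketch {

    /** In-memory stand-in for a sorted right-side source (not Pig's IndexableLoadFunc). */
    static final class RightTable {
        private final long[] keys;  // sorted ascending, assumed unique
        private int pos = 0;        // index of the next record to return
        long seeks = 0;             // number of seekNear calls
        long recordsRead = 0;       // number of records returned by next()

        RightTable(long[] sortedKeys) { this.keys = sortedKeys.clone(); }

        /** Moves the cursor to the first record whose key is >= target. */
        void seekNear(long target) {
            int lo = 0, hi = keys.length;
            while (lo < hi) {                 // lower-bound binary search
                int mid = (lo + hi) >>> 1;
                if (keys[mid] < target) lo = mid + 1; else hi = mid;
            }
            pos = lo;
            seeks++;
        }

        /** Returns the next key, or null when the table is exhausted. */
        Long next() {
            if (pos >= keys.length) return null;
            recordsRead++;
            return keys[pos++];
        }
    }

    /** Current behavior as described: one seek, then a sequential scan. */
    static List<Long> joinSequential(long[] left, RightTable right) {
        List<Long> matches = new ArrayList<>();
        if (left.length == 0) return matches;
        right.seekNear(left[0]);              // single seek for the whole split
        Long cur = right.next();
        for (long k : left) {
            while (cur != null && cur < k) cur = right.next();  // skip non-matching records
            if (cur == null) break;
            if (cur == k) matches.add(k);
        }
        return matches;
    }

    /** Proposed behavior as described: seek on every left key. */
    static List<Long> joinSparse(long[] left, RightTable right) {
        List<Long> matches = new ArrayList<>();
        for (long k : left) {
            right.seekNear(k);                // point lookup per key
            Long cur = right.next();
            if (cur == null) break;           // nothing >= k remains; later left keys are larger
            if (cur == k) matches.add(k);
        }
        return matches;
    }
}
```

As a usage example, take a right table holding the 10,000 even keys 0, 2, ..., 19998 and the left keys 10, 5001, 19990. Both methods return the matches 10 and 19990, but joinSequential reads 9,991 right-side records after its single seek, while joinSparse performs 3 seeks and reads 3 records. In a Pig script, the new join type would presumably be selected the same way as the existing merge join, for example with USING 'merge-sparse' on the JOIN statement; that syntax is an assumption based on the description, not a confirmed design.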
