Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2011/09/19 23:13:08 UTC
[jira] [Assigned] (PIG-2293) Pig should support a more efficient
merge join against data sources that natively support point lookups or
where the join is against large, sparse tables.
[ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates reassigned PIG-2293:
-------------------------------
Assignee: Aaron Klish
> Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PIG-2293
> URL: https://issues.apache.org/jira/browse/PIG-2293
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Aaron Klish
> Assignee: Aaron Klish
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> The existing Pig merge join has the following limitations:
> 1. It assumes the right-side table must be accessed sequentially, record by record.
> 2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface IndexableLoadFunc. This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next record).
> The merge join physical operator only calls 'seekNear' for the first key in each split (effectively eliminating splits
> where the first and subsequent keys will not be found). Subsequent matches are found by reading sequentially through
> the records of the right table, looking for keys that match those from the left table.
> While this method works well for dense join tables, it performs poorly against large, sparse tables or data sources that support
> point lookups natively (HBase, for example).
> The proposed enhancement is to add a new join type, 'merge-sparse', to Pig Latin. When specified in a Pig script, this join type
> will cause the merge join operator to call seekNear on each and every key (rather than just the first in each split).
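
The difference between the two strategies described above can be sketched with a small simulation. This is only an illustration under stated assumptions, not Pig's implementation: SortedTable, seek_near, merge_join and merge_sparse_join are hypothetical names, the right table is an in-memory sorted list of unique keys, and the counters exist only to show how many records each strategy reads.

```python
from bisect import bisect_left


class SortedTable:
    """Hypothetical stand-in for a sorted right-side table (not Pig's API)."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.pos = 0
        self.seeks = 0          # number of seek_near calls
        self.records_read = 0   # number of records returned by next()

    def seek_near(self, key):
        # Position the cursor at the first record whose key is >= key.
        self.pos = bisect_left(self.keys, key)
        self.seeks += 1

    def next(self):
        # Return the record at the cursor and advance, or None at the end.
        if self.pos >= len(self.keys):
            return None
        key = self.keys[self.pos]
        self.pos += 1
        self.records_read += 1
        return key


def merge_join(left_keys, right):
    """Existing strategy: seek once for the first key, then scan sequentially."""
    matches = []
    if not left_keys:
        return matches
    right.seek_near(left_keys[0])
    current = right.next()
    for key in left_keys:  # left_keys is assumed sorted and unique
        while current is not None and current < key:
            current = right.next()
        if current is not None and current == key:
            matches.append(key)
    return matches


def merge_sparse_join(left_keys, right):
    """Proposed strategy: seek on each and every left key."""
    matches = []
    for key in left_keys:
        right.seek_near(key)
        if right.next() == key:
            matches.append(key)
    return matches


left = [10, 500, 990]

dense_scan = SortedTable(range(1000))
sparse_seek = SortedTable(range(1000))

print(merge_join(left, dense_scan), dense_scan.seeks, dense_scan.records_read)
print(merge_sparse_join(left, sparse_seek), sparse_seek.seeks, sparse_seek.records_read)
```

With three left keys spread across a 1000-record right table, the sequential strategy in this sketch performs 1 seek and reads 981 records, while the per-key strategy performs 3 seeks and reads 3 records. Both return the same matches. Whether the extra seeks pay off in practice depends on how expensive a seek is for the underlying data source, which the sketch does not model.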
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira