You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2010/01/15 07:10:54 UTC

[jira] Resolved: (PIG-241) Sharding and joins

     [ https://issues.apache.org/jira/browse/PIG-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-241.
----------------------------

    Resolution: Won't Fix

We have chosen a different approach to this.  Our merge join does take advantage of sort order, but does not require that data be partitioned in the same way in order to do the join, as the this suggested sharding approach does.

> Sharding and joins
> ------------------
>
>                 Key: PIG-241
>                 URL: https://issues.apache.org/jira/browse/PIG-241
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
>
> Many large distributed systems for storage and computing over tables divide these tables into smaller _shards,_ such that all rows with the same (primary) key will appear in the same shard. If two tables are consistently sharded, then they can be joined shard-by-shard. If corresponding shards are stored on the same hosts (or racks), then joins can be performed locally on those hosts without copying the rows of the tables over the network; this can produce significant speedups.
> Pig does not currently provide application-controlled sharding and the associated shard placement and computation placement. The performance of joins therefore suffers in many scenarios; rows are passed over the network multiple times when performing a join. If Pig (and Hadoop) could provide the ability for the application to shard tables consistently, according to an application-controlled policy, joins could be completely local operations and could in many cases perform much better.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.