Posted to dev@pig.apache.org by "John DeTreville (JIRA)" <ji...@apache.org> on 2008/05/17 00:31:55 UTC

[jira] Created: (PIG-241) Sharding and joins

Sharding and joins
------------------

                 Key: PIG-241
                 URL: https://issues.apache.org/jira/browse/PIG-241
             Project: Pig
          Issue Type: New Feature
          Components: data
            Reporter: John DeTreville


Many large distributed systems for storing and computing over tables divide these tables into smaller _shards_, such that all rows with the same (primary) key appear in the same shard. If two tables are consistently sharded, then they can be joined shard-by-shard. If corresponding shards are stored on the same hosts (or racks), then joins can be performed locally on those hosts without copying the rows of the tables over the network; this can produce significant speedups.

Pig does not currently provide application-controlled sharding and the associated shard placement and computation placement. The performance of joins therefore suffers in many scenarios; rows are passed over the network multiple times when performing a join. If Pig (and Hadoop) could provide the ability for the application to shard tables consistently, according to an application-controlled policy, joins could be completely local operations and could in many cases perform much better.
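To make the idea concrete, here is a minimal sketch (not Pig or Hadoop API; all names are hypothetical) of why consistent sharding makes a join embarrassingly local: if both tables are partitioned by the same hash of the join key, matching keys always land in shards with the same index, so the full join is just the union of independent per-shard joins.

```python
# Hypothetical sketch: two tables sharded with the same policy can be
# joined shard pair by shard pair, with no cross-shard data movement.

def shard(table, num_shards, key=lambda row: row[0]):
    """Partition rows into num_shards buckets by hash of the join key."""
    shards = [[] for _ in range(num_shards)]
    for row in table:
        shards[hash(key(row)) % num_shards].append(row)
    return shards

def local_join(left_shard, right_shard):
    """Hash-join one pair of co-located shards on the first column."""
    index = {}
    for row in left_shard:
        index.setdefault(row[0], []).append(row)
    return [l + r[1:] for r in right_shard for l in index.get(r[0], [])]

def sharded_join(left, right, num_shards=4):
    """Because both inputs use the same sharding policy, the join is the
    union of the per-shard joins; each could run on the host holding
    that shard pair."""
    out = []
    for l_shard, r_shard in zip(shard(left, num_shards),
                                shard(right, num_shards)):
        out.extend(local_join(l_shard, r_shard))
    return out
```

In a real deployment the interesting part is not the join itself but the placement contract: the storage layer would have to guarantee that shard i of both tables lives on the same host, which is exactly the control this issue asks Pig (and Hadoop) to expose.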



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-241) Sharding and joins

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601122#action_12601122 ] 

Pi Song commented on PIG-241:
-----------------------------

I think we would have to work below the abstraction provided by Hadoop in order to achieve such an optimization.  This would mean Hadoop would have to support direct control over physical file placement in its APIs.

My suggestion:-

One possible optimization from distributed database textbooks is fragment-aware relational algebra. Files on HDFS are already split into small chunks, which are natural fragments. If we could:-
 - cluster identical or nearby keys in the same set of chunks, and
 - map sets of chunks to sets of key ranges using metadata,
then we should be able to avoid a fair amount of unnecessary processing.
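A small sketch of the pruning this would enable (the metadata shape here is an assumption, not an HDFS or Pig structure): if each chunk is annotated with the (min, max) key range it covers, a join only needs to read chunk pairs whose ranges actually intersect, and every other pair can be skipped without touching the data.

```python
# Hypothetical sketch of fragment-aware join pruning. Metadata maps
# chunk id -> (min_key, max_key) for the keys stored in that chunk.

def overlapping_chunk_pairs(left_meta, right_meta):
    """Return only the chunk pairs a join has to read: those whose key
    ranges intersect. Disjoint pairs cannot produce join output."""
    pairs = []
    for l_id, (l_lo, l_hi) in left_meta.items():
        for r_id, (r_lo, r_hi) in right_meta.items():
            if l_lo <= r_hi and r_lo <= l_hi:  # ranges intersect
                pairs.append((l_id, r_id))
    return pairs
```

The payoff grows with how well step 1 (clustering nearby keys into the same chunks) works: tight, non-overlapping ranges per chunk mean most pairs are pruned, while randomly distributed keys make every range span everything and nothing is saved.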
