You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2009/12/01 02:13:21 UTC

[jira] Updated: (PIG-1116) Remove redundant map-reduce job for merge join

     [ https://issues.apache.org/jira/browse/PIG-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1116:
----------------------------

    Description: 
In merge join, when we convert right hand side file into a side file, we didn't remove it from the map-reduce plan, we only disconnect it from the plan. When we run the query, the redundant load will load the data but doing nothing. This operation should be removed entirely. 

Eg: 
a = load '/user/pig/tests/data/zebra/singlefile/studentsortedtab10k' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, gpa);
b = load '/user/pig/tests/data/zebra/singlefile/votersortedtab10k' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, registration, contributions);
c = join a by name, b by name using "merge";
explain c;

{code}
#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node 1-21
Map Plan
Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/votersortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-13--------
Global sort: false
----------------

MapReduce node 1-20
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-19
|
|---MergeJoin[tuple] - 1-16
    |
    |---Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/studentsortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-12--------
Global sort: false
----------------
{code}

1-21 should be removed.

  was:
In merge join, when we convert right hand side file into a side file, we didn't remove it from the map-reduce plan, we only disconnect it from the plan. When we run the query, the redundant load will load the data but doing nothing. This operation should be removed entirely. 

Eg: 
a = load '/user/pig/tests/data/zebra/singlefile/studentsortedtab10k' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, gpa);
b = load '/user/pig/tests/data/zebra/singlefile/votersortedtab10k' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, registration, contributions);
c = join a by name, b by name using "merge";
explain c;

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node 1-21
Map Plan
Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/votersortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-13--------
Global sort: false
----------------

MapReduce node 1-20
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-19
|
|---MergeJoin[tuple] - 1-16
    |
    |---Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/studentsortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-12--------
Global sort: false
----------------

1-21 should be removed.


> Remove redundant map-reduce job for merge join
> ----------------------------------------------
>
>                 Key: PIG-1116
>                 URL: https://issues.apache.org/jira/browse/PIG-1116
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Daniel Dai
>
> In merge join, when we convert right hand side file into a side file, we didn't remove it from the map-reduce plan, we only disconnect it from the plan. When we run the query, the redundant load will load the data but doing nothing. This operation should be removed entirely. 
> Eg: 
> a = load '/user/pig/tests/data/zebra/singlefile/studentsortedtab10k' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, gpa);
> b = load '/user/pig/tests/data/zebra/singlefile/votersortedtab10k' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (name, age, registration, contributions);
> c = join a by name, b by name using "merge";
> explain c;
> {code}
> #--------------------------------------------------
> # Map Reduce Plan                                  
> #--------------------------------------------------
> MapReduce node 1-21
> Map Plan
> Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/votersortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-13--------
> Global sort: false
> ----------------
> MapReduce node 1-20
> Map Plan
> Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-19
> |
> |---MergeJoin[tuple] - 1-16
>     |
>     |---Load(hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/user/pig/tests/data/zebra/singlefile/studentsortedtab10k:org.apache.hadoop.zebra.pig.TableLoader('','sorted')) - 1-12--------
> Global sort: false
> ----------------
> {code}
> 1-21 should be removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.