Posted to dev@pig.apache.org by "Zhijie Shen (JIRA)" <ji...@apache.org> on 2011/07/04 08:51:22 UTC

[jira] [Commented] (PIG-1916) Nested cross

    [ https://issues.apache.org/jira/browse/PIG-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059338#comment-13059338 ] 

Zhijie Shen commented on PIG-1916:
----------------------------------

The nested cross functionality basically works. I have tested crossing one, two and three bags, both on a local machine and on a single-node Hadoop setup. I have also fixed the bug in crossing a null bag: if any input bag is null, POCross returns a null result whatever the other inputs are, because the cross product of null and anything is null. In addition, I refactored POCross a bit and implemented the illustratorMarkup() function, which is the part I am still not sure about.

Next, I'll come up with comprehensive test cases and investigate illustratorMarkup() further.
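
To make the null rule above concrete, here is a minimal Python sketch of the intended semantics. It is not the POCross code from the patch; the function name nested_cross and the use of None to model a null bag are assumptions made only for illustration.

```python
from itertools import product


def nested_cross(*bags):
    """Cross product of several bags; None stands in for a null bag.

    Illustrative model only, not the actual POCross implementation.
    """
    # Rule from the comment above: a null input makes the whole
    # result null, whatever the other inputs are.
    if any(bag is None for bag in bags):
        return None
    # Each output tuple concatenates one tuple from every input bag.
    return [sum(combo, ()) for combo in product(*bags)]


# Hypothetical sample bags.
user = [("u1", "east")]
session = [("u1", "east", 30), ("u1", "west", 45)]

two_bags = nested_cross(user, session)   # two tuples of five fields each
with_null = nested_cross(user, None)     # None, because one input is null
with_empty = nested_cross(user, [])      # empty list: empty is not null
```

In this model an empty bag and a null bag behave differently: an empty input yields an empty result, while a null input yields null.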


Below is an example. For the following sample commands,

#-----------------------------------------------
user = load 'user.txt' using PigStorage('\t') as (uid, region);
session = load 'session.txt' using PigStorage('\t') as (uid, region, duration);
C = cogroup user by uid, session by uid;
D = foreach C {
    crossed = cross user, session;
    generate crossed;
}
store D into 'test.out';
#-----------------------------------------------
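
As a rough illustration of what the script above computes, the following Python sketch simulates the cogroup and the nested cross on a few made-up rows. The rows are hypothetical (the contents of user.txt and session.txt are not shown in this comment), the helper names are invented for the sketch, and it assumes that a key missing from one input produces an empty bag for that input.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample rows; the real input files are not shown.
user = [("u1", "east"), ("u2", "west")]
session = [("u1", "east", 30), ("u1", "west", 45)]


def cogroup_by_first_field(*relations):
    """Group every relation by its first field: one bag per relation per key."""
    groups = defaultdict(lambda: [[] for _ in relations])
    for index, relation in enumerate(relations):
        for row in relation:
            groups[row[0]][index].append(row)
    return dict(groups)


def run_script(user_rows, session_rows):
    """Model of: C = cogroup ...; D = foreach C { crossed = cross user, session; ... }"""
    result = {}
    grouped = cogroup_by_first_field(user_rows, session_rows)
    for key, (user_bag, session_bag) in sorted(grouped.items()):
        # The nested cross pairs every user tuple with every session
        # tuple that shares the same group key.
        result[key] = [u + s for u, s in product(user_bag, session_bag)]
    return result


D = run_script(user, session)
for key, crossed in D.items():
    print(key, crossed)
```

With these rows, the group for u1 holds two crossed tuples of five fields each, and the group for u2 holds an empty bag because u2 has no sessions.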

I got the logical, physical and map/reduce plans shown below.

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
D: (Name: LOStore Schema: crossed#22:bag{#23:tuple(user::uid#16:bytearray,user::region#17:bytearray,session::uid#18:bytearray,session::region#19:bytearray,session::duration#20:bytearray)})
|
|---D: (Name: LOForEach Schema: crossed#22:bag{#23:tuple(user::uid#16:bytearray,user::region#17:bytearray,session::uid#18:bytearray,session::region#19:bytearray,session::duration#20:bytearray)})
    |   |
    |   (Name: LOGenerate[false] Schema: crossed#22:bag{#23:tuple(user::uid#16:bytearray,user::region#17:bytearray,session::uid#18:bytearray,session::region#19:bytearray,session::duration#20:bytearray)})
    |   |   |
    |   |   crossed:(Name: Project Type: bag Uid: 22 Input: 0 Column: (*))
    |   |
    |   |---crossed: (Name: LOCross Schema: user::uid#16:bytearray,user::region#17:bytearray,session::uid#18:bytearray,session::region#19:bytearray,session::duration#20:bytearray)
    |       |
    |       |---user: (Name: LOInnerLoad[1] Schema: uid#16:bytearray,region#17:bytearray)
    |       |
    |       |---session: (Name: LOInnerLoad[2] Schema: uid#18:bytearray,region#19:bytearray,duration#20:bytearray)
    |
    |---C: (Name: LOCogroup Schema: group#21:bytearray,user#22:bag{#28:tuple(uid#16:bytearray,region#17:bytearray)},session#24:bag{#29:tuple(uid#18:bytearray,region#19:bytearray,duration#20:bytearray)})
        |   |
        |   uid:(Name: Project Type: bytearray Uid: 16 Input: 0 Column: 0)
        |   |
        |   uid:(Name: Project Type: bytearray Uid: 18 Input: 1 Column: 0)
        |
        |---user: (Name: LOLoad Schema: uid#16:bytearray,region#17:bytearray)RequiredFields:null
        |
        |---session: (Name: LOLoad Schema: uid#18:bytearray,region#19:bytearray,duration#20:bytearray)RequiredFields:null

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
D: Store(test.out:org.apache.pig.builtin.PigStorage) - scope-25
|
|---D: New For Each(false)[bag] - scope-24
    |   |
    |   RelationToExpressionProject[bag][*] - scope-20
    |   |
    |   |---POCross[bag] - scope-23
    |       |
    |       |---Project[bag][1] - scope-21
    |       |
    |       |---Project[bag][2] - scope-22
    |
    |---C: Package[tuple]{bytearray} - scope-15
        |
        |---C: Global Rearrange[tuple] - scope-14
            |
            |---C: Local Rearrange[tuple]{bytearray}(false) - scope-16
            |   |   |
            |   |   Project[bytearray][0] - scope-17
            |   |
            |   |---user: New For Each(false,false)[bag] - scope-5
            |       |   |
            |       |   Project[bytearray][0] - scope-1
            |       |   |
            |       |   Project[bytearray][1] - scope-3
            |       |
            |       |---user: Load(file:///home/zjshen/Workspace/eclipse/pig_test/user.txt:PigStorage('	')) - scope-0
            |
            |---C: Local Rearrange[tuple]{bytearray}(false) - scope-18
                |   |
                |   Project[bytearray][0] - scope-19
                |
                |---session: New For Each(false,false,false)[bag] - scope-13
                    |   |
                    |   Project[bytearray][0] - scope-7
                    |   |
                    |   Project[bytearray][1] - scope-9
                    |   |
                    |   Project[bytearray][2] - scope-11
                    |
                    |---session: Load(file:///home/zjshen/Workspace/eclipse/pig_test/session.txt:PigStorage('	')) - scope-6

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-28
Map Plan
Union[tuple] - scope-29
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-16
|   |   |
|   |   Project[bytearray][0] - scope-17
|   |
|   |---user: New For Each(false,false)[bag] - scope-5
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |
|       |---user: Load(file:///home/zjshen/Workspace/eclipse/pig_test/user.txt:PigStorage('	')) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-18
    |   |
    |   Project[bytearray][0] - scope-19
    |
    |---session: New For Each(false,false,false)[bag] - scope-13
        |   |
        |   Project[bytearray][0] - scope-7
        |   |
        |   Project[bytearray][1] - scope-9
        |   |
        |   Project[bytearray][2] - scope-11
        |
        |---session: Load(file:///home/zjshen/Workspace/eclipse/pig_test/session.txt:PigStorage('	')) - scope-6--------
Reduce Plan
D: Store(test.out:org.apache.pig.builtin.PigStorage) - scope-25
|
|---D: New For Each(false)[bag] - scope-24
    |   |
    |   RelationToExpressionProject[bag][*] - scope-20
    |   |
    |   |---POCross[bag] - scope-23
    |       |
    |       |---Project[bag][1] - scope-21
    |       |
    |       |---Project[bag][2] - scope-22
    |
    |---C: Package[tuple]{bytearray} - scope-15--------
Global sort: false
----------------



> Nested cross
> ------------
>
>                 Key: PIG-1916
>                 URL: https://issues.apache.org/jira/browse/PIG-1916
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1916_1.patch, PIG-1916_2.patch, PIG-1916_3.patch
>
>
> It is useful to have cross inside a nested foreach statement. One typical use case for nested foreach is that, after cogrouping two relations, we want to flatten the records with the same key and do some processing. This is naturally achieved by cross. E.g.:
> {code}
> C = cogroup user by uid, session by uid;
> D = foreach C {
>     crossed = cross user, session; -- To flatten two input bags
>     filtered = filter crossed by user::region == session::region;
>     result = foreach filtered generate processSession(user::age, user::gender, session::ip);  -- Nested foreach Jira: PIG-1631
>     generate result;
> }
> {code}
> If we don't have cross, the user has to write a UDF that processes the bags user and session. That is much harder than writing a UDF that processes flattened tuples. This is especially true when we have nested foreach statements (PIG-1631).
> This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira