Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2011/07/14 20:47:59 UTC

[jira] [Created] (PIG-2163) Improve nested cross to stream one relation

Improve nested cross to stream one relation
-------------------------------------------

                 Key: PIG-2163
                 URL: https://issues.apache.org/jira/browse/PIG-2163
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.10
            Reporter: Daniel Dai
            Assignee: Zhijie Shen
             Fix For: 0.10


PIG-1916 added nested cross support to Pig. One optimization: instead of materializing all bags before producing the result, we can stream one of the inputs to save memory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2163) Improve nested cross to stream one relation

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093990#comment-13093990 ] 

Daniel Dai commented on PIG-2163:
---------------------------------

Thanks Zhijie, the patch works well. One comment: as I mentioned in my last comment, Pig usually streams the rightmost relation. We want that here as well; it will improve the user experience and simplify the documentation.


[jira] [Commented] (PIG-2163) Improve nested cross to stream one relation

Posted by "Zhijie Shen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092680#comment-13092680 ] 

Zhijie Shen commented on PIG-2163:
----------------------------------

Sorry, I thought about this issue again and found I was wrong. The memory for one bag can be saved.

Whenever a new tuple of the first relation is emitted, iterate over all the combinations of the tuples in the other n-1 relations.


[jira] [Commented] (PIG-2163) Improve nested cross to stream one relation

Posted by "Zhijie Shen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092675#comment-13092675 ] 

Zhijie Shen commented on PIG-2163:
----------------------------------

Hi Daniel,

If I understand your suggestion correctly, you mean that when crossing n relations, the first n-1 relations are stored temporarily in n-1 bags, and the last relation emits its tuples one at a time (through getNext()), each of which is crossed with the stored bags.

However, the problem is that the tuples in the last relation will be iterated not once but k1*k2*...*k(n-1) times, where ki is the number of tuples in the i-th relation. For example, if there are three relations:

bag1: {(a, 1)}

bag2: {(a, x), (a, y)}
1st     ^
2nd             ^

bag3: {(a, true), (a, false)}

bag3 will be iterated twice: first to cross with (a, x) and second to cross with (a, y).

On the other hand, getNext() can only go through the last relation once. Hence I think the n bags are inevitable. What do you think about this? Correct me if I'm wrong.
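
The single-pass concern can be illustrated with a small sketch. It is written in Python purely for illustration (Pig's operators are Java), and `stream_relation`, `cross_stream_inner` and `cross_stream_outer` are hypothetical names, not Pig APIs. The generator stands in for a relation that can only be read once through getNext(). With the streamed relation in the inner loop it is exhausted after the first buffered tuple; with it in the outer loop a single pass is enough.

```python
def stream_relation(tuples):
    """Hypothetical stand-in for a relation that can be read only once."""
    for t in tuples:
        yield t


def cross_stream_inner(buffered, stream):
    # Streamed relation in the inner loop: it would have to be re-read for
    # every buffered tuple, which a single-pass source cannot do.
    return [b + s for b in buffered for s in stream]


def cross_stream_outer(buffered, stream):
    # Streamed relation in the outer loop: each streamed tuple is read once
    # and combined with every buffered tuple.
    return [b + s for s in stream for b in buffered]


bag2 = [("a", "x"), ("a", "y")]
bag3 = [("a", True), ("a", False)]

inner = cross_stream_inner(bag2, stream_relation(bag3))
outer = cross_stream_outer(bag2, stream_relation(bag3))

print(len(inner))  # incomplete: the stream ran dry after the first buffered tuple
print(len(outer))  # complete cross product of bag2 and bag3
```

The two variants produce the same set of tuples only when the streamed input can be rewound; the outer-loop form does not need that.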

By the way, this issue reminds me that computing the cross product is expensive, especially when the number of relations is large. I'm not a database specialist. Does anybody know of smarter algorithms that reduce the number of scans over the relations?



[jira] [Updated] (PIG-2163) Improve nested cross to stream one relation

Posted by "Zhijie Shen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated PIG-2163:
-----------------------------

    Attachment: PIG-2163.patch

Attached is the patch for this issue. Assume there are n bags as input. Now POCross only creates n - 1 temporary bags.

The general logic is to iterate over the tuples of the first bag (the left-most one in the bag list) and merge each of them with all the combinations of the tuples stored in the n - 1 temporary bags. The first bag is chosen as the one to iterate separately so that the order of the cross product stays the same as with n temporary bags.
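
As a rough illustration of the logic described above (not the actual POCross code, which is Java), here is a minimal Python sketch. The function names are hypothetical. It reads the first bag once and re-iterates the n - 1 buffered bags, and it can be compared against a version that buffers all n bags to check that the output order is unchanged.

```python
from itertools import product


def flatten(tuples):
    """Concatenate the fields of several tuples into one output tuple."""
    return tuple(field for t in tuples for field in t)


def cross_stream_first(first, *buffered):
    """Cross product that reads `first` once; the other bags are buffered."""
    for head in first:                    # single pass over the streamed bag
        for rest in product(*buffered):   # all combinations of the n - 1 bags
            yield flatten((head,) + rest)


def cross_all_buffered(*bags):
    """Reference version: all n bags are held in memory."""
    for combo in product(*bags):
        yield flatten(combo)


bag1 = [("a", 1), ("a", 2)]
bag2 = [("a", 11), ("a", 22)]
bag3 = [("a", 111), ("a", 222)]

# The first bag is passed as a one-shot iterator to mimic streaming.
streamed = list(cross_stream_first(iter(bag1), bag2, bag3))
reference = list(cross_all_buffered(bag1, bag2, bag3))
print(streamed == reference)
```

The orders match because the streamed bag plays the role of the outermost loop, which is also the slowest-varying position in the fully buffered version.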




[jira] [Updated] (PIG-2163) Improve nested cross to stream one relation

Posted by "Zhijie Shen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated PIG-2163:
-----------------------------

    Attachment: PIG-2163_1.patch

Hi Daniel,

I modified the patch according to your comments.

Now the right-most bag will be streamed while the cross product is generated on the fly. Additionally, to make the order of the generated tuples reasonable, I reversed the iteration order of the n bags (using the order n, n - 1, ..., 2, 1 and avoiding the strange order 2, 3, ..., n - 1, n, 1). For example, if there are three bags from left to right:

bag #1 {(a, 1), (a, 2)}
bag #2 {(a, 11), (a, 22)}
bag #3 {(a, 111), (a, 222)}

the generated bag will be:
{
(a, 1, a, 11, a, 111),
(a, 2, a, 11, a, 111),
(a, 1, a, 22, a, 111),
(a, 2, a, 22, a, 111),
(a, 1, a, 11, a, 222),
(a, 2, a, 11, a, 222),
(a, 1, a, 22, a, 222),
(a, 2, a, 22, a, 222)
}
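
The ordering in this example can be reproduced with a short Python sketch. This is only an illustration under my own assumptions, not the patch itself; `cross_stream_last` is a hypothetical name. The right-most bag is read once in the outer loop, and the buffered bags are iterated so that the left-most bag varies fastest.

```python
from itertools import product


def cross_stream_last(*bags):
    """Cross product that streams the right-most bag and buffers the rest."""
    *buffered, streamed = bags
    for last in streamed:  # single pass over the right-most bag
        # product() varies its last argument fastest, so feed the buffered
        # bags in reverse to make the left-most bag vary fastest.
        for combo in product(*reversed(buffered)):
            ordered = tuple(reversed(combo)) + (last,)
            yield tuple(field for t in ordered for field in t)


bag1 = [("a", 1), ("a", 2)]
bag2 = [("a", 11), ("a", 22)]
bag3 = [("a", 111), ("a", 222)]

# The right-most bag is passed as a one-shot iterator to mimic streaming.
result = list(cross_stream_last(bag1, bag2, iter(bag3)))
for row in result:
    print(row)
```

Run on the three bags above, the rows come out in the same order as the generated bag listed in this comment.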


[jira] [Resolved] (PIG-2163) Improve nested cross to stream one relation

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-2163.
-----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

+1 for PIG-2163_1.patch, works as expected. 

test-patch:
     [exec] -1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

This is an enhancement of PIG-1916. The tests in PIG-1916 are enough to cover the patch, so having no new tests is fine for this patch.

All unit tests pass.

Patch committed to trunk. Thanks Zhijie!


[jira] [Commented] (PIG-2163) Improve nested cross to stream one relation

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093212#comment-13093212 ] 

Daniel Dai commented on PIG-2163:
---------------------------------

Yes, and usually we stream the last relation, as we do in join. So if the last relation is a large one, the user can take advantage of this to optimize.
