Posted to dev@crunch.apache.org by "Gabriel Reid (JIRA)" <ji...@apache.org> on 2013/07/18 08:48:49 UTC

[jira] [Commented] (CRUNCH-174) Add support for join3 and cogroup3

    [ https://issues.apache.org/jira/browse/CRUNCH-174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712074#comment-13712074 ] 

Gabriel Reid commented on CRUNCH-174:
-------------------------------------

Nice. I have been working with Pig quite a bit lately, and the ability to do stuff like this (at least in terms of joins) was making me wonder why we didn't have it in Crunch yet :-)

One idea around the implementation: as I understand it right now based on an initial readthrough, the initial mapper maps values into sparse arrays, and then the second phase combines those sparse arrays, so for triples it's like this:

   A = { 1: '1A' }
   B = { 1: '1B' }
   C = { 1: '1C' }
   D = { 2: '2D' }
   
   // pcollection after first phase
   UNION = [
                   1: ('1A', null, null), 
                   1: (null, '1B', null),
                   1: (null, null, '1C'),
                   2: ('2D', null, null)]

And then the final value is made by looping through all tupleNs under the same key and combining their non-null values into a collection that's at the same tuple index as the non-null value.
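
As a rough illustration of that second phase, here is a minimal in-memory sketch in plain Java. It does not use the Crunch API or the code in the attached patch: the class name SparseTupleCombiner, the use of String[] for a sparse tuple, and the combine method are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the PostGroupFn from the patch.
final class SparseTupleCombiner {

    /**
     * Combines the sparse tuples grouped under a single key into `arity`
     * collections, where collection i holds every non-null value found at
     * tuple index i.
     */
    static List<List<String>> combine(Iterable<String[]> sparseTuples, int arity) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            result.add(new ArrayList<>());
        }
        for (String[] tuple : sparseTuples) {
            // Every slot is scanned even though only one is populated.
            for (int i = 0; i < arity; i++) {
                if (tuple[i] != null) {
                    result.get(i).add(tuple[i]);
                }
            }
        }
        return result;
    }
}
```

For key 1 in the example above, passing the three sparse tuples with an arity of 3 would yield one single-element collection per index. Note that each input value costs an array of `arity` slots plus an inner loop of `arity` iterations.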

Seeing as there's only ever one value per sparse array after the first phase, I was thinking it could probably be more efficient for larger tuples (particularly tupleNs) to just work with a pair of (index, value) instead of using sparse tuples. Using this method, the union after the first phase would look like this:

    UNION = [
                    1: (0, '1A'),
                    1: (1, '1B'),
                    1: (2, '1C'),
                    2: (0, '2D')]

I think this'll make it a bit more efficient in terms of not needing to allocate arrays that are mostly not going to be used, as well as removing the need for the loop over the sparse tuples in PostGroupFn.
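
A minimal sketch of what the second phase could look like with (index, value) pairs, again in plain Java rather than Crunch. The names IndexedValueCombiner, Indexed, and combine are hypothetical and chosen only for this example; a real implementation would presumably use a Crunch Pair and the appropriate PTypes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch.
final class IndexedValueCombiner {

    /** Assumed shape of what the first phase emits: (tuple index, value). */
    static final class Indexed {
        final int index;
        final String value;

        Indexed(int index, String value) {
            this.index = index;
            this.value = value;
        }
    }

    /**
     * Routes each value grouped under a single key straight to the
     * collection at its index; no per-value array and no inner scan.
     */
    static List<List<String>> combine(Iterable<Indexed> pairs, int arity) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < arity; i++) {
            result.add(new ArrayList<>());
        }
        for (Indexed pair : pairs) {
            result.get(pair.index).add(pair.value);
        }
        return result;
    }
}
```

The output has the same shape as with sparse tuples, but the work per input value no longer depends on the tuple size, which is where the gain for larger tupleNs would come from.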

What do you think?
                
> Add support for join3 and cogroup3
> ----------------------------------
>
>                 Key: CRUNCH-174
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-174
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-174.patch
>
>
> This seemed like a nice starter JIRA: it would be great to have the three (and even four!) argument analogues of Join.join() and Cogroup.cogroup(), something like:
> PTable<K, Tuple3<V1, V2, V3>> j = Join.join(PTable<K, V1> a, PTable<K, V2> b, PTable<K, V3> c);
> ... and similar for co-groups.
