Posted to user@pig.apache.org by "F. Jerrell Schivers" <je...@bordercore.com> on 2013/09/05 01:39:13 UTC

Join Question

Howdy folks,

Let's say I have a set of data that looks like this:

X, (X1, X2)
Y, (Y1, Y2, Y3)

So the tuple in each row can have an arbitrary number of members.

I also have a second set of data that looks like this:

X1, 4, 5, 6
X2, 3, 7, 3

I'd like to join these such that I get:

X, (X1, 4, 5, 6), (X2, 3, 7, 3)
Y, (Y1, etc), (Y2, etc), (Y3, etc)

Is this possible with Pig?

Thanks,
Jerrell

Re: Join Question

Posted by "F. Jerrell Schivers" <je...@bordercore.com>.
Hi Pradeep,

This is exactly what I'm looking for.  I was going to process this data 
inside a UDF anyway, so it's easy for me to pick out what I need.  Many 
thanks.

--Jerrell


Re: Join Question

Posted by Pradeep Gollakota <pr...@gmail.com>.
I think there's probably some convoluted way to do this. First thing you'll
have to do is flatten your data.

data1 = A, B
_____
X, X1
X, X2
Y, Y1
Y, Y2
Y, Y3

Then do a join by "B" onto your second dataset. This should produce the
following

data2 = data1::A, data1::B, data2::A, data2::B, data2::C, data2::D (I'm
assuming the second data set has exactly 4 columns).
_______________
X, X1, X1, 4, 5, 6
X, X2, X2, 3, 7, 3

Now do a group by data1::A to get
{X, {(X, X1, X1, 4, 5, 6), (X, X2, X2, 3, 7, 3), ...}}
{Y, {(Y, Y1, Y1, ...), (Y, Y2, Y2, ...), ...}}

This is as far as I got. I'm not sure if there's a built-in UDF to
transform that into what you're looking for. I thought maybe BagToTuple,
but it will return a single tuple with all elements of all tuples in the
bag. If the above data format supports your use cases, you're done. If not,
you can write a UDF to transform it into the required format.
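
The steps above could be sketched in Pig Latin roughly as follows. This is
only a sketch under several assumptions that are not from the thread: the
file names and all relation and field names are hypothetical, both files are
tab-separated, the first data set stores its members as a bag (FLATTEN only
produces one row per member for a bag, not for a tuple), and the second data
set has exactly four columns with integer values.

```pig
-- Hypothetical input, one row per key:  X<TAB>{(X1),(X2)}
parents = LOAD 'parents.tsv'
          AS (parent:chararray, members:bag{t:tuple(member:chararray)});

-- Hypothetical input, one row per member:  X1<TAB>4<TAB>5<TAB>6
details = LOAD 'details.tsv'
          AS (member:chararray, a:int, b:int, c:int);

-- Step 1: flatten the bag so there is one (parent, member) row per member.
flat = FOREACH parents GENERATE parent, FLATTEN(members) AS member;

-- Step 2: join each member onto its row in the second data set.
joined = JOIN flat BY member, details BY member;

-- Drop the duplicated join key and give the fields unambiguous names.
slim = FOREACH joined GENERATE
           flat::parent    AS parent,
           details::member AS member,
           details::a      AS a,
           details::b      AS b,
           details::c      AS c;

-- Step 3: group by the original key to collect the members again.
grouped = GROUP slim BY parent;

-- Keep only the member columns inside each group's bag.
result = FOREACH grouped GENERATE group AS parent, slim.(member, a, b, c);

DUMP result;
```

Each output row is the key plus a bag of (member, a, b, c) tuples, so the
members still sit inside a bag rather than as separate top-level tuples.
Note that this is an inner join, so members with no match in the second data
set are dropped, and the order of tuples inside the bag is not guaranteed.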

