Posted to user@pig.apache.org by Mehmet Tepedelenlioglu <me...@yahoo.com> on 2013/05/23 20:22:18 UTC

Synthetic keys

Hi,

I am using this: 

x = join a by 1, b by 1  using 'replicated';

with the hope that it generates a synthetic key '1' on both a and b and
joins on that key, thereby doing a clean map-side cross of a and b with no
schema changes (exactly the way a cross would work). It seems to work, but
since I have only just tried it, I am not sure whether there is anything
I should be aware of. Does anyone know?
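
For anyone who wants to check this on their own data, a minimal sketch along
these lines puts the two forms side by side. The input paths and the schemas
(id, val, param) are made up for illustration and are not from this thread,
and whether a literal is accepted as a join key may depend on the Pig version:

```pig
-- hypothetical inputs: paths and schemas are assumptions
a = LOAD 'a.tsv' AS (id:long, val:chararray);
b = LOAD 'b.tsv' AS (param:int);

-- constant-key join: every row of a matches every row of b;
-- with 'replicated', the last relation (b) has to fit in memory
x = JOIN a BY 1, b BY 1 USING 'replicated';

-- the built-in cross, for comparison
y = CROSS a, b;

-- print both schemas to see whether they line up
DESCRIBE x;
DESCRIBE y;
```

Comparing the two DESCRIBE outputs, and the row counts on a small sample,
is a cheap way to confirm the join really behaves like a cross.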

Thanks,

Mehmet

Re: Synthetic keys

Posted by Sergey Goder <se...@gmail.com>.

One reason you might prefer the JOIN a BY 1, b BY 1 syntax is that it lets
you specify a join strategy, such as the replicated join, which can improve
performance when one relation is small enough to fit in memory.


On Fri, May 24, 2013 at 9:15 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> You can do this, but pig has a CROSS keyword that you can use.

Re: Synthetic keys

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.

0.10.0-cdh4.1.2

On 5/28/13 11:07 AM, "Pradeep Gollakota" <pr...@gmail.com> wrote:

>Oh I see... I don't remember if I tried to do it your way or not. I'm using
>the CDH3 version (0.8.1) of Pig. I'm not sure if explicit literals in
>joins are supported in that version. I'll give it a shot and see, since it
>will simplify my script.
>What version of Pig are you using?

Re: Synthetic keys

Posted by Pradeep Gollakota <pr...@gmail.com>.

Oh I see... I don't remember if I tried to do it your way or not. I'm using
the CDH3 version (0.8.1) of Pig. I'm not sure if explicit literals in
joins are supported in that version. I'll give it a shot and see, since it
will simplify my script.
What version of Pig are you using?


On Tue, May 28, 2013 at 2:04 PM, Mehmet Tepedelenlioglu <
mehmetsino@yahoo.com> wrote:

> So, the example I gave before: x = join a by 1, b by 1  using
> 'replicated'; does a replicated cross, and it creates the synthetic keys
> implicitly, which is great because the tuple it returns does not have the
> synthetic keys in it. An explicit replicated cross would be good though,
> since the implementation probably is pretty simple.

Re: Synthetic keys

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.

So, the example I gave before: x = join a by 1, b by 1  using
'replicated'; does a replicated cross, and it creates the synthetic keys
implicitly, which is great because the tuple it returns does not have the
synthetic keys in it. An explicit replicated cross would be good though,
since the implementation probably is pretty simple.
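
To illustrate the difference: with an explicit key column, the key ends up
in the joined tuple and has to be projected away afterwards, whereas the
literal form reportedly leaves the schema untouched. A rough sketch, assuming
made-up schemas (id, val and param are hypothetical names):

```pig
-- hypothetical schemas, for illustration only
data1 = LOAD 'data1' AS (id:long, val:chararray);
data2 = LOAD 'data2' AS (param:int);

-- explicit synthetic key: fake_key becomes part of each tuple
A = FOREACH data1 GENERATE *, 1 AS fake_key;
B = FOREACH data2 GENERATE *, 1 AS fake_key;
C = JOIN A BY fake_key, B BY fake_key USING 'replicated';

-- C carries A::fake_key and B::fake_key, so drop them explicitly
D = FOREACH C GENERATE A::id AS id, A::val AS val, B::param AS param;

-- literal key: no extra columns are added to the inputs
x = JOIN data1 BY 1, data2 BY 1 USING 'replicated';
DESCRIBE D;
DESCRIBE x;
```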
  

On 5/28/13 10:30 AM, "Pradeep Gollakota" <pr...@gmail.com> wrote:

>I ran into a similar problem where I had a relation (A) which was massive
>and another relation (B) which had exactly 1 record. I needed to do a cross
>product of these two relations, and the default implementation was very
>slow. I worked around it by generating a synthetic key myself and then using
>a replicated join to cross the two relations. It looked something like the
>following:
>
>data1 = load 'data1'; -- billions of records
>data2 = load 'data2'; -- 1 record
>A = foreach data1 generate *, 1 as fake_key;
>B = foreach data2 generate *, 1 as fake_key;
>-- a replicated join loads every relation after the first into memory,
>-- so the large relation (A) is listed first and the small one (B) last
>C = join A by fake_key, B by fake_key using 'replicated';
>
>I looked around to see if Pig supported this out of the box, but I didn't
>find anything.
>
>Perhaps a replicated cross operator would be helpful for these types of
>problems.
>From the O'Reilly book, this is what is said about the cross operator: "Pig
>does implement cross in a parallel fashion. It does this by generating a
>synthetic join key, replicating rows, and then doing the cross as a join."
>Since the cross product operator is already being performed as join under
>the hood, I wonder how difficult it would be to support different join
>strategies for cross.

Re: Synthetic keys

Posted by Pradeep Gollakota <pr...@gmail.com>.

I ran into a similar problem where I had a relation (A) which was massive
and another relation (B) which had exactly 1 record. I needed to do a cross
product of these two relations, and the default implementation was very
slow. I worked around it by generating a synthetic key myself and then using
a replicated join to cross the two relations. It looked something like the
following:

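data1 = load 'data1'; -- billions of records
data2 = load 'data2'; -- 1 record
A = foreach data1 generate *, 1 as fake_key;
B = foreach data2 generate *, 1 as fake_key;
-- a replicated join loads every relation after the first into memory,
-- so the large relation (A) is listed first and the small one (B) last
C = join A by fake_key, B by fake_key using 'replicated';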

I looked around to see if Pig supported this out of the box, but I didn't
find anything.

Perhaps a replicated cross operator would be helpful for these types of
problems.
From the O'Reilly book, this is what is said about the cross operator: "Pig
does implement cross in a parallel fashion. It does this by generating a
synthetic join key, replicating rows, and then doing the cross as a join."
Since the cross product operator is already being performed as join under
the hood, I wonder how difficult it would be to support different join
strategies for cross.


On Fri, May 24, 2013 at 12:21 PM, Mehmet Tepedelenlioglu <
mehmetsino@yahoo.com> wrote:

> Thanks, but is there a map-side cross? The usual cross seems to have a
> bug. I sent an example of how to replicate this bug.

Re: Synthetic keys

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.

Thanks, but is there a map-side cross? The usual cross seems to have a
bug. I sent an example of how to replicate this bug.

On 5/24/13 9:15 AM, "Jonathan Coveney" <jc...@gmail.com> wrote:

>You can do this, but pig has a CROSS keyword that you can use.

Re: Synthetic keys

Posted by Jonathan Coveney <jc...@gmail.com>.

You can do this, but pig has a CROSS keyword that you can use.


2013/5/23 Mehmet Tepedelenlioglu <me...@yahoo.com>

> Hi,
>
> I am using this:
>
> x = join a by 1, b by 1  using 'replicated';
>
> with the hope that it generates a synthetic key '1' on both a and b and
> joins on that key, thereby doing a clean map-side cross of a and b with no
> schema changes (exactly the way a cross would work). It seems to work, but
> since I have only just tried it, I am not sure whether there is anything
> I should be aware of. Does anyone know?
>
> Thanks,
>
> Mehmet
>
>
>