Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/08/20 07:51:07 UTC

Does the pig optimizer keep track of relations that are already sorted when doing a JOIN?

I was reading about USING 'merge' with JOIN when relations are already
sorted.

I actually was just looking through some code and realized that one of my
JOINs was on two relations that were *already* sorted due to a DISTINCT and
GROUP operation.

I just added USING 'merge' and the initial results look the same.

I haven't benchmarked it though.

Does/would the existing optimizer be able to detect this and just use merge
without manual intervention?

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: Does the pig optimizer keep track of relations that are already sorted when doing a JOIN?

Posted by Ashutosh Chauhan <ha...@apache.org>.
@Andrew,
You can take a look at the conditions for merge-join here:
http://pig.apache.org/docs/r0.8.1/piglatin_ref1.html#Merge+Joins

@Kevin,
If you want to improve merge-join, the way to go is
https://issues.apache.org/jira/browse/PIG-959

Ashutosh

On Sun, Aug 21, 2011 at 04:27, Andrew Clegg
<an...@gmail.com> wrote:

> I'd never thought about this before, but some of my scripts could
> probably be made much quicker by taking advantage of this. Which
> operations guarantee that their output relation is sorted? Distinct,
> group by, order by, a previous merge join, I guess? Any others?

Re: Does the pig optimizer keep track of relations that are already sorted when doing a JOIN?

Posted by Andrew Clegg <an...@gmail.com>.
I'd never thought about this before, but some of my scripts could
probably be made much quicker by taking advantage of this. Which
operations guarantee that their output relation is sorted? Distinct,
group by, order by, a previous merge join, I guess? Any others?

On 20 August 2011 07:12, Ashutosh Chauhan <ha...@apache.org> wrote:
> Hey Kevin,
>
> No, Pig currently doesn't auto-detect that data has already been
> sorted by earlier steps of the script, so you need to tell it
> explicitly with USING 'merge'.
>
> Hope it helps,
> Ashutosh



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

Re: Does the pig optimizer keep track of relations that are already sorted when doing a JOIN?

Posted by Kevin Burton <bu...@spinn3r.com>.
Seems like something to put on the TODO if it isn't already there. I might
look at the bug list to see what is in store for the future :)

The good news is that I just made the changes to my script to use merge, so
I'll benchmark again and see how much faster it is... probably significantly
faster :)

On Fri, Aug 19, 2011 at 11:12 PM, Ashutosh Chauhan <ha...@apache.org> wrote:

> Hey Kevin,
>
> No, Pig currently doesn't auto-detect that data has already been
> sorted by earlier steps of the script, so you need to tell it
> explicitly with USING 'merge'.
>
> Hope it helps,
> Ashutosh




Re: Does the pig optimizer keep track of relations that are already sorted when doing a JOIN?

Posted by Ashutosh Chauhan <ha...@apache.org>.
Hey Kevin,

No, Pig currently doesn't auto-detect that data has already been
sorted by earlier steps of the script, so you need to tell it
explicitly with USING 'merge'.
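
For intuition about why a merge join needs both inputs to be sorted
already, here is a minimal Python sketch of a generic sort-merge join. It
is not Pig's implementation: the function name merge_join and the sample
data are made up for illustration, and it assumes both inputs are lists
of (key, value) pairs sorted ascending by key.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs.

    Assumption: both lists are already sorted ascending by key.
    Nothing here checks that; unsorted input silently drops matches.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


if __name__ == "__main__":
    a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    b = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
    for row in merge_join(a, b):
        print(row)
```

Each input is walked once from front to back, with no re-sorting or
shuffling of rows, which is where the saving comes from. The sketch also
shows why the sortedness claim has to be correct: if either input is not
actually sorted, matching rows are skipped without any error.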

Hope it helps,
Ashutosh

On Fri, Aug 19, 2011 at 22:51, Kevin Burton <bu...@spinn3r.com> wrote:

> I was reading about USING 'merge' with JOIN when relations are already
> sorted.
>
> I actually was just looking through some code and realized that one of my
> JOINs was on two relations that were *already* sorted due to a DISTINCT and
> GROUP operation.
>
> I just added USING 'merge' and the initial results look the same.
>
> I haven't benchmarked it though.
>
> Does/would the existing optimizer be able to detect this and just use merge
> without manual intervention?
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
>