Posted to dev@pig.apache.org by Jeff Zhang <zj...@gmail.com> on 2015/03/05 10:36:05 UTC
Optimization opportunity for group by followed by join on the same
key ?
Hi folks,
Here's my pig script:
a = load 'pig/input' as (x:int, y:chararray);
b = load 'pig/input1' as (x:int, y:chararray);
c = group a by x;
d = foreach c generate group as x, COUNT($1) as cnt;
d = join d by x, b by x;
store d into 'pig/output';
I use Tez as the execution engine and notice that Pig converts this script
into one DAG with 4 vertices, as shown below. I think 3 vertices should be
sufficient, because the group by and the join use the same key.
So vertex (scope_39) is not necessary: we don't need to repartition
the data again. The only impact of going from 4 vertices to 3 may
be on the parallelism of vertex (scope_41). I'm not sure how large the
performance difference between these 2 approaches is, but I think this
could be a potential optimization.
[image: Inline image 1]
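The reasoning above can be illustrated with a toy simulation in plain Python (not Pig or Tez code). It is only a sketch under assumptions: `partition_of`, `NUM_PARTITIONS` and the sample rows are made up for illustration, and it assumes the group by and the join both hash-partition on x with the same partitioner and the same number of partitions. Under those assumptions the grouped rows already sit in the partition where the join expects them, so only b needs its own shuffle.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 3  # hypothetical parallelism shared by group-by and join


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Toy hash partitioner; stands in for whatever the engine uses."""
    return key % num_partitions


def shuffle(rows, num_partitions=NUM_PARTITIONS):
    """Route each (x, y) row to the partition chosen by its key x."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[0], num_partitions)].append(row)
    return parts


def group_count(partition_rows):
    """Per-partition equivalent of: group a by x; generate group, COUNT($1)."""
    counts = Counter(x for x, _ in partition_rows)
    return sorted(counts.items())


def join_in_place(grouped_parts, b_parts):
    """Join grouped rows with b rows inside each partition, without reshuffling d."""
    out = []
    for p, grouped in grouped_parts.items():
        b_by_key = defaultdict(list)
        for x, y in b_parts.get(p, []):
            b_by_key[x].append(y)
        for x, cnt in grouped:
            for y in b_by_key.get(x, []):
                out.append((x, cnt, x, y))
    return sorted(out)


# Made-up sample data for relations a and b.
a = [(1, "p"), (1, "q"), (2, "r"), (4, "s"), (4, "t"), (4, "u")]
b = [(1, "one"), (2, "two"), (4, "four"), (5, "five")]

a_parts = shuffle(a)  # shuffle needed for the group-by
grouped_parts = {p: group_count(rows) for p, rows in a_parts.items()}
b_parts = shuffle(b)  # shuffle needed for the join input b

# Check that every grouped row is already in the partition the join would pick.
already_placed = all(
    partition_of(x) == p
    for p, rows in grouped_parts.items()
    for x, _ in rows
)
result = join_in_place(grouped_parts, b_parts)
print(already_placed, result)
```

If the two operators used different parallelism or different partitioners, the check would not hold in general, which is consistent with the remark that the change may affect the parallelism of the downstream vertex.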
--
Best Regards
Jeff Zhang
Re: Optimization opportunity for group by followed by join on the
same key ?
Posted by Jeff Zhang <zj...@gmail.com>.
Thanks Daniel & Rohini, I have updated PIG-3839, changed its title to
"Integrate YSmart into Pig on tez", and added more comments to it.
On Fri, Mar 6, 2015 at 8:53 AM, Rohini Palaniswamy <ro...@gmail.com>
wrote:
> Jeff,
> https://issues.apache.org/jira/browse/PIG-3839 is the umbrella JIRA
> for Tez performance. Please file anything you identify in it if it is
> not already there.
>
> Regards,
> Rohini
--
Best Regards
Jeff Zhang
Re: Optimization opportunity for group by followed by join on the
same key ?
Posted by Rohini Palaniswamy <ro...@gmail.com>.
Jeff,
There is already a JIRA - https://issues.apache.org/jira/browse/PIG-3849.
You can update it with the details/diagrams.
Regards,
Rohini
Re: Optimization opportunity for group by followed by join on the
same key ?
Posted by Daniel Dai <da...@hortonworks.com>.
Thanks Jeff. I think the mailing list does not allow attachments, but I get your point.
Yes, and there are actually a couple more patterns like this: rank -> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and it can be done in a more general way similar to YSmart (HIVE-2206). The question is the amount of work involved. Can you open a ticket to track it? I don't think there is one yet.
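A minimal sketch of what such a general rule might look like, written as toy Python rather than Pig's optimizer code. Everything in it is hypothetical: the `Stage` class, the `drop_redundant_shuffles` function and the linear plan model are assumptions for illustration, and it does not claim to reflect how YSmart or HIVE-2206 is implemented. The idea is only that a stage does not need its own shuffle when its input is already partitioned on the same key with the same scheme. A join's other input would still need its own shuffle, which this simplified model ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    """One step of a linear toy plan (hypothetical model, not a Pig class)."""
    op: str                      # e.g. "group", "join", "sort", "distinct"
    key: Optional[str]           # key the stage needs its input partitioned on
    partitioning: str = "hash"   # "hash" or "range"
    shuffle: bool = True         # whether a shuffle precedes this stage


def drop_redundant_shuffles(plan: List[Stage]) -> List[Stage]:
    """Turn off a stage's shuffle when the previous stage already left the
    data partitioned on the same key with the same scheme."""
    result = []
    prev = None
    for stage in plan:
        same_layout = (
            prev is not None
            and stage.key is not None
            and stage.key == prev.key
            and stage.partitioning == prev.partitioning
        )
        result.append(
            Stage(stage.op, stage.key, stage.partitioning,
                  shuffle=stage.shuffle and not same_layout)
        )
        prev = stage
    return result


# Toy plan: group by x, join on x, then a range-partitioned sort on cnt.
plan = [
    Stage("group", "x"),
    Stage("join", "x"),
    Stage("sort", "cnt", partitioning="range"),
]
optimized = drop_redundant_shuffles(plan)
shuffles = [(s.op, s.shuffle) for s in optimized]
print(shuffles)
```

In this toy plan only the join loses its shuffle, because it follows a group by on the same key; the sort keeps its shuffle since both its key and its partitioning scheme differ.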
Daniel
Re: Optimization opportunity for group by followed by join on the
same key ?
Posted by Jeff Zhang <zj...@gmail.com>.
Uploading the DAG diagram again (someone told me it is not visible).
[image: Inline image 1]
On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <zj...@gmail.com> wrote:
> Thanks Rajesh, will upload it to dev mail list again.
>
> On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <
> rajesh.balamohan@gmail.com> wrote:
>
>> Works fine. Thank you. Not sure if it got trimmed by dev mailing list.
>> I didn't see this diagram from the mailing list and thought of informing
>> you.
>>
>> ~Rajesh.B
>>
>> On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>
>>> upload the dag diagram again, hope it works this time
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <
>>> rajesh.balamohan@gmail.com> wrote:
>>>
>>>> Hey Jeff,
>>>>
>>>> The diagram isn't visible. Can you please reattach the diagram?
>>>>
>>>> ~Rajesh.B
>>>>
--
Best Regards
Jeff Zhang