Posted to dev@pig.apache.org by Jeff Zhang <zj...@gmail.com> on 2015/03/05 10:36:05 UTC

Optimization opportunity for group by followed by join on the same key ?

Hi folks,

Here's my pig script:

    a = load 'pig/input' as (x:int, y:chararray);
    b = load 'pig/input1' as (x:int, y:chararray);
    c = group a by x;
    d = foreach c generate group as x, COUNT($1) as cnt;
    d = join d by x, b by x;
    store d into 'pig/output';


I use Tez as the execution engine and noticed that Pig converts the script
into one DAG with 4 vertices, as shown below. I think 3 vertices should be
sufficient, because the group by and the join use the same key. So vertex
(scope_39) is not necessary: the data is already partitioned on x, and we
don't need to repartition it again. The only impact of going from 4 vertices
to 3 may be on the parallelism of vertex (scope_41). I am not sure how large
the performance difference between the two plans is, but I think this could
be a potential optimization.
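
To make the argument concrete, here is a rough Python simulation of the two plan shapes. It is not Pig or Tez code: all function names are hypothetical, and it assumes both inputs are hash-partitioned on x with the same hash function and the same number of partitions. That shared partition count is also why the parallelism of the join vertex becomes tied to that of the group by.

```python
from collections import Counter, defaultdict


def partition(rows, key_index, num_partitions):
    """Hash-partition rows on one field, the way a shuffle edge would."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts


def group_count(rows):
    """GROUP BY x followed by COUNT, run locally inside one partition."""
    counts = Counter(x for x, _ in rows)
    return sorted(counts.items())


def local_join(left, right):
    """Join (x, cnt) rows with (x, y) rows inside one partition."""
    by_key = defaultdict(list)
    for x, y in right:
        by_key[x].append(y)
    return [(x, cnt, x, y) for x, cnt in left for y in by_key.get(x, [])]


def plan_with_reshuffle(a, b, n):
    # 4-vertex shape: the grouped output is shuffled again before the join.
    grouped = [row for part in partition(a, 0, n) for row in group_count(part)]
    d_parts = partition(grouped, 0, n)
    b_parts = partition(b, 0, n)
    return sorted(r for dp, bp in zip(d_parts, b_parts)
                  for r in local_join(dp, bp))


def plan_without_reshuffle(a, b, n):
    # 3-vertex shape: a and b are each shuffled once on x, and the
    # group by and the join run in the same task for each partition.
    a_parts = partition(a, 0, n)
    b_parts = partition(b, 0, n)
    return sorted(r for ap, bp in zip(a_parts, b_parts)
                  for r in local_join(group_count(ap), bp))


# Made-up sample data with the same schema as the script: (x:int, y:chararray).
a = [(1, "p"), (1, "q"), (2, "r"), (3, "s")]
b = [(1, "u"), (2, "v"), (2, "w"), (4, "z")]

print(plan_with_reshuffle(a, b, 3) == plan_without_reshuffle(a, b, 3))
```

Both functions return the same joined rows; the second simply skips the extra shuffle of the grouped output, which is the role scope_39 appears to play in the 4-vertex DAG.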





[image: Inline image 1]



-- 
Best Regards

Jeff Zhang

Re: Optimization opportunity for group by followed by join on the same key ?

Posted by Jeff Zhang <zj...@gmail.com>.
Thanks Daniel & Rohini. I have updated PIG-3839, changed its title to
"Integrate YSmart into Pig on tez", and added more comments to it.


On Fri, Mar 6, 2015 at 8:53 AM, Rohini Palaniswamy <ro...@gmail.com>
wrote:

> Jeff,
>    https://issues.apache.org/jira/browse/PIG-3839 is the umbrella jira
> for Tez performance. Please file anything you identify in it if it is
> not already there.
>
> Regards,
> Rohini


-- 
Best Regards

Jeff Zhang

Re: Optimization opportunity for group by followed by join on the same key ?

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Jeff,
   There is already a JIRA - https://issues.apache.org/jira/browse/PIG-3849.
You can update it with the details/diagrams.

Regards,
Rohini

Re: Optimization opportunity for group by followed by join on the same key ?

Posted by Daniel Dai <da...@hortonworks.com>.
Thanks Jeff. I think the mailing list does not allow attachments, but I get your point.

Yes, and there are actually a couple more patterns like this: rank -> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and it can be done in a more general way, similar to YSmart (HIVE-2206). The question is the amount of work involved. Can you open a ticket to track it? I don't think there is one yet.

Daniel

Re: Optimization opportunity for group by followed by join on the same key ?

Posted by Jeff Zhang <zj...@gmail.com>.
Uploading the DAG diagram again (someone told me it is not visible).
[image: Inline image 1]

On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Thanks Rajesh, will upload it to the dev mailing list again.
>
> On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <
> rajesh.balamohan@gmail.com> wrote:
>
>> Works fine. Thank you. Not sure if it got trimmed by the dev mailing
>> list. I didn't see this diagram on the mailing list and thought of
>> informing you.
>>
>> ~Rajesh.B
>>
>> On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>
>>> upload the dag diagram again, hope it works this time
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <
>>> rajesh.balamohan@gmail.com> wrote:
>>>
>>>> Hey Jeff,
>>>>
>>>> The diagram isn't visible.  Can you please reattach the diagram?
>>>>
>>>> ~Rajesh.B



-- 
Best Regards

Jeff Zhang