You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by mingda li <li...@gmail.com> on 2016/12/07 00:17:43 UTC

How to test the efficiency of multiple join

Dear all,

I want to test the different multiple join orders' efficiency. However,
since the pig query is executed lazily, I need to use dump or store to let
the query be executed.

Now, I use the following query to test the efficiency.

*Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY cs_item_sk;*
*Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
catalog_returns BY (cr_item_sk, cr_order_number);*
*limit_data = LIMIT Bad_OrderRes 4; *
*Dump limit_data;*

Do you think this is OK to just show 4 of results? Could this query
execution time represent the efficiency of multilpe join? I am not sure if
it will just get 4 items and stop without executing other items.

Bests,
Mingda

Re: How to test the efficiency of multiple join

Posted by mingda li <li...@gmail.com>.
So I need to test something like count (*)?  To use count, I need the
following query:

Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY cs_item_sk;
Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
catalog    _returns BY (cr_item_sk, cr_order_number);
b = foreach Bad_OrderRes generate cr_returned_date_sk, cr_returned_time_sk,
    cr_item_sk, cr_refunded_customer_sk, cs_sold_date_sk, cs_sold_time_sk,
cs_it    em_sk, cs_order_number, inv_date_sk, inv_item_sk,
inv_warehouse_sk, inv_quan    tity_on_hand;

b_group=GROUP b ALL;
b_count=FOREACH b_group GENERATE COUNT(b);
Dump b_count;

or use

ones = FOREACH b GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count

I think the second method can save more time.



On Wed, Dec 7, 2016 at 9:49 AM, Rohini Palaniswamy <ro...@gmail.com>
wrote:

> Limit 4 would make processing of join stop after 4 records. It is not a
> good idea to add it if you are testing performance of join.
>
> On Tue, Dec 6, 2016 at 8:13 PM mingda li <li...@gmail.com> wrote:
>
> > Thanks for your quick reply. If so, I can use the limit operator to
> compare
> >
> > good and bad join plan. It takes time to dump all.
> >
> >
> >
> > Bests,
> >
> > Mingda
> >
> >
> >
> >
> >
> > On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <li...@intel.com>
> > wrote:
> >
> >
> >
> > > Hi:
> >
> > >    I think the query time about multiple join part is not related with
> > the
> >
> > > number of limit operator(in your case the number is 4). When the query
> is
> >
> > > executed, limit_data is executed after Bad_OrderRes, after join
> >
> > > (Bad_OrderRes) is finished, limit(limit_data) starts.
> >
> > > If I have missed something, please tell me.
> >
> > >
> >
> > >
> >
> > > Best Regards
> >
> > > Kelly Zhang/Zhang,Liyun
> >
> > >
> >
> > >
> >
> > >
> >
> > > -----Original Message-----
> >
> > > From: mingda li [mailto:limingda1993@gmail.com]
> >
> > > Sent: Wednesday, December 7, 2016 8:18 AM
> >
> > > To: dev@pig.apache.org; user@pig.apache.org
> >
> > > Subject: How to test the efficiency of multiple join
> >
> > >
> >
> > > Dear all,
> >
> > >
> >
> > > I want to test the different multiple join orders' efficiency. However,
> >
> > > since the pig query is executed lazily, I need to use dump or store to
> > let
> >
> > > the query be executed.
> >
> > >
> >
> > > Now, I use the following query to test the efficiency.
> >
> > >
> >
> > > *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
> >
> > > cs_item_sk;*
> >
> > > *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
> >
> > > catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
> >
> > > Bad_OrderRes 4; * *Dump limit_data;*
> >
> > >
> >
> > > Do you think this is OK to just show 4 of results? Could this query
> >
> > > execution time represent the efficiency of multilpe join? I am not sure
> > if
> >
> > > it will just get 4 items and stop without executing other items.
> >
> > >
> >
> > > Bests,
> >
> > > Mingda
> >
> > >
> >
> >
>

Re: How to test the efficiency of multiple join

Posted by mingda li <li...@gmail.com>.
So I need to test something like count (*)?  To use count, I need the
following query:

Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY cs_item_sk;
Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
catalog    _returns BY (cr_item_sk, cr_order_number);
b = foreach Bad_OrderRes generate cr_returned_date_sk, cr_returned_time_sk,
    cr_item_sk, cr_refunded_customer_sk, cs_sold_date_sk, cs_sold_time_sk,
cs_it    em_sk, cs_order_number, inv_date_sk, inv_item_sk,
inv_warehouse_sk, inv_quan    tity_on_hand;

b_group=GROUP b ALL;
b_count=FOREACH b_group GENERATE COUNT(b);
Dump b_count;

or use

ones = FOREACH b GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count

I think the second method can save more time.



On Wed, Dec 7, 2016 at 9:49 AM, Rohini Palaniswamy <ro...@gmail.com>
wrote:

> Limit 4 would make processing of join stop after 4 records. It is not a
> good idea to add it if you are testing performance of join.
>
> On Tue, Dec 6, 2016 at 8:13 PM mingda li <li...@gmail.com> wrote:
>
> > Thanks for your quick reply. If so, I can use the limit operator to
> compare
> >
> > good and bad join plan. It takes time to dump all.
> >
> >
> >
> > Bests,
> >
> > Mingda
> >
> >
> >
> >
> >
> > On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <li...@intel.com>
> > wrote:
> >
> >
> >
> > > Hi:
> >
> > >    I think the query time about multiple join part is not related with
> > the
> >
> > > number of limit operator(in your case the number is 4). When the query
> is
> >
> > > executed, limit_data is executed after Bad_OrderRes, after join
> >
> > > (Bad_OrderRes) is finished, limit(limit_data) starts.
> >
> > > If I have missed something, please tell me.
> >
> > >
> >
> > >
> >
> > > Best Regards
> >
> > > Kelly Zhang/Zhang,Liyun
> >
> > >
> >
> > >
> >
> > >
> >
> > > -----Original Message-----
> >
> > > From: mingda li [mailto:limingda1993@gmail.com]
> >
> > > Sent: Wednesday, December 7, 2016 8:18 AM
> >
> > > To: dev@pig.apache.org; user@pig.apache.org
> >
> > > Subject: How to test the efficiency of multiple join
> >
> > >
> >
> > > Dear all,
> >
> > >
> >
> > > I want to test the different multiple join orders' efficiency. However,
> >
> > > since the pig query is executed lazily, I need to use dump or store to
> > let
> >
> > > the query be executed.
> >
> > >
> >
> > > Now, I use the following query to test the efficiency.
> >
> > >
> >
> > > *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
> >
> > > cs_item_sk;*
> >
> > > *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
> >
> > > catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
> >
> > > Bad_OrderRes 4; * *Dump limit_data;*
> >
> > >
> >
> > > Do you think this is OK to just show 4 of results? Could this query
> >
> > > execution time represent the efficiency of multilpe join? I am not sure
> > if
> >
> > > it will just get 4 items and stop without executing other items.
> >
> > >
> >
> > > Bests,
> >
> > > Mingda
> >
> > >
> >
> >
>

Re: How to test the efficiency of multiple join

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Limit 4 would make processing of join stop after 4 records. It is not a
good idea to add it if you are testing performance of join.

On Tue, Dec 6, 2016 at 8:13 PM mingda li <li...@gmail.com> wrote:

> Thanks for your quick reply. If so, I can use the limit operator to compare
>
> good and bad join plan. It takes time to dump all.
>
>
>
> Bests,
>
> Mingda
>
>
>
>
>
> On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <li...@intel.com>
> wrote:
>
>
>
> > Hi:
>
> >    I think the query time about multiple join part is not related with
> the
>
> > number of limit operator(in your case the number is 4). When the query is
>
> > executed, limit_data is executed after Bad_OrderRes, after join
>
> > (Bad_OrderRes) is finished, limit(limit_data) starts.
>
> > If I have missed something, please tell me.
>
> >
>
> >
>
> > Best Regards
>
> > Kelly Zhang/Zhang,Liyun
>
> >
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: mingda li [mailto:limingda1993@gmail.com]
>
> > Sent: Wednesday, December 7, 2016 8:18 AM
>
> > To: dev@pig.apache.org; user@pig.apache.org
>
> > Subject: How to test the efficiency of multiple join
>
> >
>
> > Dear all,
>
> >
>
> > I want to test the different multiple join orders' efficiency. However,
>
> > since the pig query is executed lazily, I need to use dump or store to
> let
>
> > the query be executed.
>
> >
>
> > Now, I use the following query to test the efficiency.
>
> >
>
> > *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
>
> > cs_item_sk;*
>
> > *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
>
> > catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
>
> > Bad_OrderRes 4; * *Dump limit_data;*
>
> >
>
> > Do you think this is OK to just show 4 of results? Could this query
>
> > execution time represent the efficiency of multilpe join? I am not sure
> if
>
> > it will just get 4 items and stop without executing other items.
>
> >
>
> > Bests,
>
> > Mingda
>
> >
>
>

Re: How to test the efficiency of multiple join

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Limit 4 would make processing of join stop after 4 records. It is not a
good idea to add it if you are testing performance of join.

On Tue, Dec 6, 2016 at 8:13 PM mingda li <li...@gmail.com> wrote:

> Thanks for your quick reply. If so, I can use the limit operator to compare
>
> good and bad join plan. It takes time to dump all.
>
>
>
> Bests,
>
> Mingda
>
>
>
>
>
> On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <li...@intel.com>
> wrote:
>
>
>
> > Hi:
>
> >    I think the query time about multiple join part is not related with
> the
>
> > number of limit operator(in your case the number is 4). When the query is
>
> > executed, limit_data is executed after Bad_OrderRes, after join
>
> > (Bad_OrderRes) is finished, limit(limit_data) starts.
>
> > If I have missed something, please tell me.
>
> >
>
> >
>
> > Best Regards
>
> > Kelly Zhang/Zhang,Liyun
>
> >
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: mingda li [mailto:limingda1993@gmail.com]
>
> > Sent: Wednesday, December 7, 2016 8:18 AM
>
> > To: dev@pig.apache.org; user@pig.apache.org
>
> > Subject: How to test the efficiency of multiple join
>
> >
>
> > Dear all,
>
> >
>
> > I want to test the different multiple join orders' efficiency. However,
>
> > since the pig query is executed lazily, I need to use dump or store to
> let
>
> > the query be executed.
>
> >
>
> > Now, I use the following query to test the efficiency.
>
> >
>
> > *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
>
> > cs_item_sk;*
>
> > *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
>
> > catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
>
> > Bad_OrderRes 4; * *Dump limit_data;*
>
> >
>
> > Do you think this is OK to just show 4 of results? Could this query
>
> > execution time represent the efficiency of multilpe join? I am not sure
> if
>
> > it will just get 4 items and stop without executing other items.
>
> >
>
> > Bests,
>
> > Mingda
>
> >
>
>

Re: How to test the efficiency of multiple join

Posted by mingda li <li...@gmail.com>.
Thanks for your quick reply. If so, I can use the limit operator to compare
good and bad join plan. It takes time to dump all.

Bests,
Mingda


On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <li...@intel.com> wrote:

> Hi:
>    I think the query time about multiple join part is not related with the
> number of limit operator(in your case the number is 4). When the query is
> executed, limit_data is executed after Bad_OrderRes, after join
> (Bad_OrderRes) is finished, limit(limit_data) starts.
> If I have missed something, please tell me.
>
>
> Best Regards
> Kelly Zhang/Zhang,Liyun
>
>
>
> -----Original Message-----
> From: mingda li [mailto:limingda1993@gmail.com]
> Sent: Wednesday, December 7, 2016 8:18 AM
> To: dev@pig.apache.org; user@pig.apache.org
> Subject: How to test the efficiency of multiple join
>
> Dear all,
>
> I want to test the different multiple join orders' efficiency. However,
> since the pig query is executed lazily, I need to use dump or store to let
> the query be executed.
>
> Now, I use the following query to test the efficiency.
>
> *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
> cs_item_sk;*
> *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
> catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
> Bad_OrderRes 4; * *Dump limit_data;*
>
> Do you think this is OK to just show 4 of results? Could this query
> execution time represent the efficiency of multilpe join? I am not sure if
> it will just get 4 items and stop without executing other items.
>
> Bests,
> Mingda
>

Re: How to test the efficiency of multiple join

Posted by mingda li <li...@gmail.com>.
Thanks for your quick reply. If so, I can use the limit operator to compare
good and bad join plan. It takes time to dump all.

Bests,
Mingda


On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <li...@intel.com> wrote:

> Hi:
>    I think the query time about multiple join part is not related with the
> number of limit operator(in your case the number is 4). When the query is
> executed, limit_data is executed after Bad_OrderRes, after join
> (Bad_OrderRes) is finished, limit(limit_data) starts.
> If I have missed something, please tell me.
>
>
> Best Regards
> Kelly Zhang/Zhang,Liyun
>
>
>
> -----Original Message-----
> From: mingda li [mailto:limingda1993@gmail.com]
> Sent: Wednesday, December 7, 2016 8:18 AM
> To: dev@pig.apache.org; user@pig.apache.org
> Subject: How to test the efficiency of multiple join
>
> Dear all,
>
> I want to test the different multiple join orders' efficiency. However,
> since the pig query is executed lazily, I need to use dump or store to let
> the query be executed.
>
> Now, I use the following query to test the efficiency.
>
> *Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY
> cs_item_sk;*
> *Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
> catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT
> Bad_OrderRes 4; * *Dump limit_data;*
>
> Do you think this is OK to just show 4 of results? Could this query
> execution time represent the efficiency of multilpe join? I am not sure if
> it will just get 4 items and stop without executing other items.
>
> Bests,
> Mingda
>

RE: How to test the efficiency of multiple join

Posted by "Zhang, Liyun" <li...@intel.com>.
Hi:
   I think the query time about multiple join part is not related with the number of limit operator(in your case the number is 4). When the query is executed, limit_data is executed after Bad_OrderRes, after join (Bad_OrderRes) is finished, limit(limit_data) starts.
If I have missed something, please tell me.


Best Regards
Kelly Zhang/Zhang,Liyun



-----Original Message-----
From: mingda li [mailto:limingda1993@gmail.com] 
Sent: Wednesday, December 7, 2016 8:18 AM
To: dev@pig.apache.org; user@pig.apache.org
Subject: How to test the efficiency of multiple join

Dear all,

I want to test the different multiple join orders' efficiency. However, since the pig query is executed lazily, I need to use dump or store to let the query be executed.

Now, I use the following query to test the efficiency.

*Bad_OrderIn = JOIN inventory BY  inv_item_sk, catalog_sales BY cs_item_sk;*
*Bad_OrderRes = JOIN Bad_OrderIn  BY   (cs_item_sk, cs_order_number),
catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT Bad_OrderRes 4; * *Dump limit_data;*

Do you think this is OK to just show 4 of results? Could this query execution time represent the efficiency of multilpe join? I am not sure if it will just get 4 items and stop without executing other items.

Bests,
Mingda