Posted to user@pig.apache.org by sonia gehlot <so...@gmail.com> on 2012/04/03 02:27:12 UTC

Improve Performance of Pig script

Hi,

I have a really large data set of about 10 to 15 billion rows. I want to do
some aggregates like sum, count distinct, and max, but the script is taking
forever to run.

What hints or properties should I set to improve performance?

Please let me know.

Thanks,
Sonia

Re: Improve Performance of Pig script

Posted by sonia gehlot <so...@gmail.com>.
Actually, I don't expect a lot of rows; there should be only one row in the
output. I will try groups rather than the nested DISTINCT.

On Tue, Apr 3, 2012 at 12:09 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> Whoops, hit enter. Just to see: how long does it take if you just store h?
>
> 2012/4/3 Jonathan Coveney <jc...@gmail.com>
>
> > Point 1: doing a dump is dangerous, depending on how many rows you expect
> > in the relation; you're going to serialize every row of the output to
> > your console.
> > Point 2: the issue is that you're doing a nested DISTINCT. This is done
> > in memory, and for large data sets it can be quite slow. The scalable
> > solution is to do two groups.

Re: Improve Performance of Pig script

Posted by Jonathan Coveney <jc...@gmail.com>.
Whoops, hit enter. Just to see: how long does it take if you just store h?
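For reference, that timing check could be as simple as the single line below;
the output path is a made-up placeholder, and only the alias h comes from the
script posted in this thread:

STORE h INTO '/tmp/h_timing_check' USING PigStorage();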

Re: Improve Performance of Pig script

Posted by Jonathan Coveney <jc...@gmail.com>.
Point 1: doing a dump is dangerous, depending on how many rows you expect in
the relation; you're going to serialize every row of the output to your
console.
Point 2: the issue is that you're doing a nested DISTINCT. This is done in
memory, and for large data sets it can be quite slow. The scalable solution
is to do two groups.
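A rough sketch of the two-group pattern, written against the h relation from
the script posted further down the thread; the aliases (t1, a1, daily, etc.)
and the final three-way join are illustrative, not something suggested
verbatim in the thread:

-- Distinct titles per day: collapse to one row per (dateint, title_id)
-- first, then count rows per dateint, so de-duplication happens in the
-- shuffle rather than in reducer memory.
t1 = FOREACH h GENERATE dateint, title_id;
t2 = GROUP t1 BY (dateint, title_id);
t3 = FOREACH t2 GENERATE group.dateint AS dateint;
t4 = GROUP t3 BY dateint;
title_cnt = FOREACH t4 GENERATE group AS dateint, COUNT(t3) AS dist_titles;

-- Same two-group pattern for distinct accounts per day.
a1 = FOREACH h GENERATE dateint, account_id;
a2 = GROUP a1 BY (dateint, account_id);
a3 = FOREACH a2 GENERATE group.dateint AS dateint;
a4 = GROUP a3 BY dateint;
acct_cnt = FOREACH a4 GENERATE group AS dateint, COUNT(a3) AS dist_acct;

-- The SUM stays a plain Algebraic (combiner-friendly) aggregate.
s1 = GROUP h BY dateint;
imp_sum = FOREACH s1 GENERATE group AS dateint, SUM(h.imp_cnt) AS imp_cnt;

-- Stitch the three daily results back together on dateint.
daily = JOIN title_cnt BY dateint, acct_cnt BY dateint, imp_sum BY dateint;
result = FOREACH daily GENERATE title_cnt::dateint AS dateint,
                                title_cnt::dist_titles AS dist_titles,
                                acct_cnt::dist_acct AS dist_acct,
                                imp_sum::imp_cnt AS imp_cnt;

Each GROUP here runs as an ordinary shuffle, so the de-duplication scales
with the cluster instead of with reducer memory.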

Re: Improve Performance of Pig script

Posted by sonia gehlot <so...@gmail.com>.
Thanks, guys,

This is the Pig script I am running. The dataset is also small for the
filtered date, around 2 million rows, but I am targeting this script at a
larger scope. Here, titles is an array of JSON objects stored as a string, so
I am using a Python UDF to split it into columns/rows.

register 'dse_pig-1.0.0.jar' ;
register 'udfs.py' using jython as udfs;
SET default_parallel 100;

-- Load the impression log and keep a single day.
a = load 'default.merchimpressions_client_log' using DataovenLoader();
b = filter a by (dateint == 20120331);

-- Explode the titles JSON string into one map per title.
d = foreach b generate
    dateint,
    account_id,
    device_type,
    esn,
    country,
    flatten(udfs.to_map(titles)) as titles_map;

-- Pull typed columns out of each map.
e = foreach d generate
    dateint,
    account_id,
    device_type,
    esn,
    country,
    (chararray)titles_map#'location' as location,
    (long)titles_map#'time' as time,
    (int)titles_map#'count' as count,
    (int)titles_map#'video_id' as title_id,
    (int)titles_map#'track_id' as location_id,
    (chararray)titles_map#'request_id' as request_id,
    (int)titles_map#'row' as pres_row,
    (int)titles_map#'rank' as pres_rank;

-- Collapse to one row per impression key, with per-key aggregates.
f = group e by (dateint,
        account_id,
        device_type,
        esn,
        country,
        title_id,
        location_id,
        request_id,
        pres_row,
        pres_rank);
g = foreach f generate group,
              COUNT(e) as row_cnt,
              MIN(e.time) as min_time,
              SUM(e.count) as imp_cnt;

-- Flatten the group key back into named columns.
h = foreach g generate group.dateint as dateint,
              group.account_id as account_id,
              group.country as country,
              group.device_type as device_type,
              group.esn as esn,
              group.title_id as title_id,
              group.location_id as location_id,
              group.request_id as request_id,
              group.pres_row as pres_row,
              group.pres_rank as pres_rank,
              row_cnt as row_cnt,
              min_time as min_time,
              imp_cnt as imp_cnt;

-- Daily rollup with nested DISTINCTs.
i = group h by dateint;
j = foreach i {
            dist_titles = DISTINCT h.title_id;
            dist_acct = DISTINCT h.account_id;
            generate
            group as dateint,
                (int)COUNT(dist_titles) as dist_titles,
                (int)COUNT(dist_acct) as dist_acct,
                (int)SUM(h.imp_cnt) as imp_cnt;
        };
dump j;

If I run this script for 1-2 account_ids, it's really fast and comes back in
minutes, but for 2 million rows it takes forever. It kicks off 3 MapReduce
jobs, and the map phase of one MR job alone takes 2.5 hours.

Please let me know how I can improve performance.

Thanks,
Sonia

Re: Improve Performance of Pig script

Posted by Jonathan Coveney <jc...@gmail.com>.
One of these things is not like the other :) DISTINCTs are dangerous.
Prashant is right: post the script and we can help you dig in. SUM, COUNT,
and MAX should all be super fast, and if they aren't, it's because their
Algebraic nature isn't kicking in.
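As a rough illustration of that point, using the e relation and its fields
from the script elsewhere in this thread (the grp/fast/slow aliases are made
up):

-- Combiner-friendly: the FOREACH after the GROUP calls only Algebraic
-- functions, so COUNT/SUM/MIN are partially computed map-side and much less
-- data crosses the shuffle.
grp = GROUP e BY dateint;
fast = FOREACH grp GENERATE group, COUNT(e) AS row_cnt,
                            SUM(e.count) AS imp_cnt, MIN(e.time) AS min_time;

-- Not combiner-friendly: the nested DISTINCT pulls each group's entire bag
-- into reducer memory before anything can be counted.
slow = FOREACH grp {
           uniq = DISTINCT e.account_id;
           GENERATE group, COUNT(uniq) AS dist_acct;
       };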

Re: Improve Performance of Pig script

Posted by Prashant Kommireddi <pr...@gmail.com>.
Can you please forward the script and Job Counters? Cluster size (# of
MapReduce slots) would be good too.

Thanks,
Prashant
