Posted to user@pig.apache.org by Sujee Maniyam <su...@sujee.net> on 2011/06/18 00:37:47 UTC

pig script takes much longer than java MR job

I have log files like this:
   #timestamp (ms), server, user, action, domain, x, y, z
   1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
   1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
   1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar

I have the following Pig script to count how many times each domain appears
in the logs (for example, facebook.com was seen 10 times, etc.).

Here is the pig script:

--------------------------------
records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
server:int, user:int, action_id:int, domain:chararray, price:int);

-- DUMP records;
grouped_by_domain = GROUP records BY domain;
-- DUMP grouped_by_domain;
-- DESCRIBE grouped_by_domain;

freq = FOREACH grouped_by_domain GENERATE group as domain, COUNT(records) as
mycount;
-- DESCRIBE freq;
-- DUMP freq;

sorted = ORDER freq BY mycount DESC;
DUMP sorted;
--------------------------------

This script takes an hour to run. I also wrote a simple Java MR job to
count the domains; it takes about 15 minutes. So the Pig script takes 4x
longer to complete.

Any suggestions on what I am doing wrong in Pig?

thanks
Sujee
http://sujee.net

Re: pig script takes much longer than java MR job

Posted by Dexin Wang <wa...@gmail.com>.
Yeah, that sounds like a lot to dump if the job takes 15 minutes to run. The dump alone can take a long time.

I once forgot to comment out a debug line in my UDF. When it ran on production data, not only was it slow, it blew up the cluster: it simply ran out of log space :)

On Jun 17, 2011, at 5:06 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> A couple of possibilities that I'm kicking around off the top of my head...
> 
> 1) Does your MR job also sort afterwards? That's going to kick off another
> MR job
> 2) Does your MR job compile all the results into one job?
> 
> My guess is the Order+Dump are making it take longer.
> 

Re: pig script takes much longer than java MR job

Posted by Jonathan Coveney <jc...@gmail.com>.
A couple of possibilities that I'm kicking around off the top of my head...

1) Does your MR job also sort afterwards? That's going to kick off another
MR job
2) Does your MR job compile all the results into one job?

My guess is the Order+Dump are making it take longer.
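
To test that guess, one option is to run a variant of the script that does only the counting, so it is closer to what the Java MR job is described as doing. The following is a minimal sketch, not a confirmed fix: it drops the ORDER step and replaces DUMP with STORE. The output path '/logs-out/domain-counts' is a made-up placeholder; everything else is taken from the original script.

```pig
-- Sketch only: same load and count as the original script.
records = LOAD '/logs-in/*.log' USING PigStorage(',') AS (ts:long,
    server:int, user:int, action_id:int, domain:chararray, price:int);

grouped_by_domain = GROUP records BY domain;

freq = FOREACH grouped_by_domain GENERATE group AS domain,
    COUNT(records) AS mycount;

-- No ORDER here, so the extra sort job is not launched.
-- STORE writes the result to HDFS instead of printing it to the console.
-- '/logs-out/domain-counts' is a hypothetical output directory.
STORE freq INTO '/logs-out/domain-counts' USING PigStorage(',');
```

If this variant runs in roughly the same time as the Java job, the difference is likely coming from the sort and the dump. Running EXPLAIN on the original 'sorted' alias should also show how many MapReduce jobs Pig plans for the full script.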
