You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Yen SYU <ye...@gmail.com> on 2012/03/13 20:00:08 UTC

Accumulator is not fired

Hi all,

I just test a very simple pig script as following:

records = LOAD '$input' AS (hash:chararray, domain:chararray,
host:chararray, page:chararray, freq:int);
grpd = GROUP records BY (domain, host);
stats = FOREACH grpd {
                                  hashes = records.hash;
                                  uniq_hashes = DISTINCT hashes;
                                  pages = records.page;
                                  GENERATE group.$1 AS host, group.$0 AS
domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS
page_count:long, SUM(freq) AS freq:long);
};
STORE stats INTO '$output';

where PAGE_COUNT is a customized UDF implementing Accumulator. I add an
EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the
accumulate method is never called. Even I tried to remove all other
built-in UDFs and keep the NESTED FOREACH as simple as:

stats = FOREACH grpd {
                                  pages = records.page;
                                  GENERATE group.$1 AS host, group.$0 AS
domain, PAGE_COUNT(pages) AS page_count:long;
};

Anyone idea what's going on behind the scenes?

Thanks,
Yen

Re: Accumulator is not fired

Posted by Yen SYU <ye...@gmail.com>.
Hi Thejas,

Thank you for your reply!

I agree that accumulator mode can be used if you only use built-in UDFs. :)

I noticed what you mentioned in your reply in the past. In my script,
PAGE_COUNT is an evaluation function which accumulator is the only
interface implemented. I also check built-in UDF COUNT and SUM's source
code - it implements both algebraic and accumulator. Theoretically pig
should use accumulator mode in this situation.

I also notice sometimes if my pig script is VERY SIMPLE, such as without
extracting a lot of tuples/fields in nested foreach, the PAGE_COUNT can be
called in accumulator mode, but not in all situations.

I'm really curious if there are other situations that can break accumulator
call - or do I nedd to manually turn on some optimization for pig to fire
accumulator? Cannot find a lot of relative resource online...

Best,
Yen

On Thu, Mar 15, 2012 at 9:34 PM, Thejas Nair <th...@hortonworks.com> wrote:

> Hi Yen,
> Does the function also implement Algebraic ? In that case it might end up
> using the algebraic interface of the udf.
> If your foreach statement has functions that don't implement Accumulator
> interface, then reduce task won't run in accumulative mode. This is because
> you are anyway going to load the whole bag into memory.
>
> If the query is using accumulator mode, you would see this log message -
> INFO org.apache.pig.backend.hadoop.**executionengine.**mapReduceLayer.**AccumulatorOptimizer
> - Reducer is to run in accumulative mode.
>
>
> I tried modifying your query to -
>
>
> stats = FOREACH grpd {
>                                  pages = records.page;
>                                  GENERATE group.$1 AS host, group.$0 AS
> domain, COUNT(pages) AS page_count:long;
> };
>
> and ran it disabling the combiner -
> bin/pig -Dpig.exec.nocombiner=true -x local -e 'explain -script
> /tmp/t.pig;'
>
> I was able to verify that it would run Accumulator mode, using the above
> log message.
>
> Thanks,
> Thejas
>
>
>
>
>
> On 3/13/12 12:22 PM, Yen SYU wrote:
>
>> Hi Jon,
>>
>> Thanks for your reponse! I use pig 0.9.1-snapshot.
>>
>> I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired.
>> Also tried to remove generic type in accumulator but it did not help. :(
>>
>> Is it easy for you to fire accumulator?
>>
>> Yen
>>
>> On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney<jc...@gmail.com>**
>> wrote:
>>
>>  What version of pig are you using?
>>>
>>> just as an experiment in the simple case, can you try doing
>>>
>>> GENERATE flatten(group) as (domain,host), ...(the rest)...
>>>
>>> shouldn't make a difference, but I think I remember that in some older
>>> versions it did
>>>
>>> 2012/3/13 Yen SYU<ye...@gmail.com>
>>>
>>>  Hi all,
>>>>
>>>> I just test a very simple pig script as following:
>>>>
>>>> records = LOAD '$input' AS (hash:chararray, domain:chararray,
>>>> host:chararray, page:chararray, freq:int);
>>>> grpd = GROUP records BY (domain, host);
>>>> stats = FOREACH grpd {
>>>>                                  hashes = records.hash;
>>>>                                  uniq_hashes = DISTINCT hashes;
>>>>                                  pages = records.page;
>>>>                                  GENERATE group.$1 AS host, group.$0 AS
>>>> domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS
>>>> page_count:long, SUM(freq) AS freq:long);
>>>> };
>>>> STORE stats INTO '$output';
>>>>
>>>> where PAGE_COUNT is a customized UDF implementing Accumulator. I add an
>>>> EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the
>>>> accumulate method is never called. Even I tried to remove all other
>>>> built-in UDFs and keep the NESTED FOREACH as simple as:
>>>>
>>>> stats = FOREACH grpd {
>>>>                                  pages = records.page;
>>>>                                  GENERATE group.$1 AS host, group.$0 AS
>>>> domain, PAGE_COUNT(pages) AS page_count:long;
>>>> };
>>>>
>>>> Anyone idea what's going on behind the scenes?
>>>>
>>>> Thanks,
>>>> Yen
>>>>
>>>>
>>>
>>
>

Re: Accumulator is not fired

Posted by Thejas Nair <th...@hortonworks.com>.
Hi Yen,
Does the function also implement Algebraic ? In that case it might end 
up using the algebraic interface of the udf.
If your foreach statement has functions that don't implement Accumulator 
interface, then reduce task won't run in accumulative mode. This is 
because you are anyway going to load the whole bag into memory.

If the query is using accumulator mode, you would see this log message - 
INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AccumulatorOptimizer 
- Reducer is to run in accumulative mode.


I tried modifying your query to -

stats = FOREACH grpd {
                                   pages = records.page;
                                   GENERATE group.$1 AS host, group.$0 AS
domain, COUNT(pages) AS page_count:long;
};

and ran it disabling the combiner -
bin/pig -Dpig.exec.nocombiner=true -x local -e 'explain -script /tmp/t.pig;'

I was able to verify that it would run Accumulator mode, using the above 
log message.

Thanks,
Thejas




On 3/13/12 12:22 PM, Yen SYU wrote:
> Hi Jon,
>
> Thanks for your reponse! I use pig 0.9.1-snapshot.
>
> I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired.
> Also tried to remove generic type in accumulator but it did not help. :(
>
> Is it easy for you to fire accumulator?
>
> Yen
>
> On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney<jc...@gmail.com>wrote:
>
>> What version of pig are you using?
>>
>> just as an experiment in the simple case, can you try doing
>>
>> GENERATE flatten(group) as (domain,host), ...(the rest)...
>>
>> shouldn't make a difference, but I think I remember that in some older
>> versions it did
>>
>> 2012/3/13 Yen SYU<ye...@gmail.com>
>>
>>> Hi all,
>>>
>>> I just test a very simple pig script as following:
>>>
>>> records = LOAD '$input' AS (hash:chararray, domain:chararray,
>>> host:chararray, page:chararray, freq:int);
>>> grpd = GROUP records BY (domain, host);
>>> stats = FOREACH grpd {
>>>                                   hashes = records.hash;
>>>                                   uniq_hashes = DISTINCT hashes;
>>>                                   pages = records.page;
>>>                                   GENERATE group.$1 AS host, group.$0 AS
>>> domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS
>>> page_count:long, SUM(freq) AS freq:long);
>>> };
>>> STORE stats INTO '$output';
>>>
>>> where PAGE_COUNT is a customized UDF implementing Accumulator. I add an
>>> EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the
>>> accumulate method is never called. Even I tried to remove all other
>>> built-in UDFs and keep the NESTED FOREACH as simple as:
>>>
>>> stats = FOREACH grpd {
>>>                                   pages = records.page;
>>>                                   GENERATE group.$1 AS host, group.$0 AS
>>> domain, PAGE_COUNT(pages) AS page_count:long;
>>> };
>>>
>>> Anyone idea what's going on behind the scenes?
>>>
>>> Thanks,
>>> Yen
>>>
>>
>


Re: Accumulator is not fired

Posted by Yen SYU <ye...@gmail.com>.
Hi Jon,

Thanks for your reponse! I use pig 0.9.1-snapshot.

I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired.
Also tried to remove generic type in accumulator but it did not help. :(

Is it easy for you to fire accumulator?

Yen

On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney <jc...@gmail.com>wrote:

> What version of pig are you using?
>
> just as an experiment in the simple case, can you try doing
>
> GENERATE flatten(group) as (domain,host), ...(the rest)...
>
> shouldn't make a difference, but I think I remember that in some older
> versions it did
>
> 2012/3/13 Yen SYU <ye...@gmail.com>
>
> > Hi all,
> >
> > I just test a very simple pig script as following:
> >
> > records = LOAD '$input' AS (hash:chararray, domain:chararray,
> > host:chararray, page:chararray, freq:int);
> > grpd = GROUP records BY (domain, host);
> > stats = FOREACH grpd {
> >                                  hashes = records.hash;
> >                                  uniq_hashes = DISTINCT hashes;
> >                                  pages = records.page;
> >                                  GENERATE group.$1 AS host, group.$0 AS
> > domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS
> > page_count:long, SUM(freq) AS freq:long);
> > };
> > STORE stats INTO '$output';
> >
> > where PAGE_COUNT is a customized UDF implementing Accumulator. I add an
> > EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the
> > accumulate method is never called. Even I tried to remove all other
> > built-in UDFs and keep the NESTED FOREACH as simple as:
> >
> > stats = FOREACH grpd {
> >                                  pages = records.page;
> >                                  GENERATE group.$1 AS host, group.$0 AS
> > domain, PAGE_COUNT(pages) AS page_count:long;
> > };
> >
> > Anyone idea what's going on behind the scenes?
> >
> > Thanks,
> > Yen
> >
>

Re: Accumulator is not fired

Posted by Jonathan Coveney <jc...@gmail.com>.
What version of pig are you using?

just as an experiment in the simple case, can you try doing

GENERATE flatten(group) as (domain,host), ...(the rest)...

shouldn't make a difference, but I think I remember that in some older
versions it did

2012/3/13 Yen SYU <ye...@gmail.com>

> Hi all,
>
> I just test a very simple pig script as following:
>
> records = LOAD '$input' AS (hash:chararray, domain:chararray,
> host:chararray, page:chararray, freq:int);
> grpd = GROUP records BY (domain, host);
> stats = FOREACH grpd {
>                                  hashes = records.hash;
>                                  uniq_hashes = DISTINCT hashes;
>                                  pages = records.page;
>                                  GENERATE group.$1 AS host, group.$0 AS
> domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS
> page_count:long, SUM(freq) AS freq:long);
> };
> STORE stats INTO '$output';
>
> where PAGE_COUNT is a customized UDF implementing Accumulator. I add an
> EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the
> accumulate method is never called. Even I tried to remove all other
> built-in UDFs and keep the NESTED FOREACH as simple as:
>
> stats = FOREACH grpd {
>                                  pages = records.page;
>                                  GENERATE group.$1 AS host, group.$0 AS
> domain, PAGE_COUNT(pages) AS page_count:long;
> };
>
> Anyone idea what's going on behind the scenes?
>
> Thanks,
> Yen
>