Posted to user@pig.apache.org by Josh Devins <in...@joshdevins.net> on 2010/10/16 23:28:35 UTC

Built-in counters

I've seen a few threads about counters, PigStats, Elephant-Bird's stats
utility class, etc.

http://www.mail-archive.com/pig-user@hadoop.apache.org/msg00900.html
http://www.mail-archive.com/user%40pig.apache.org/msg00034.html

Has any progress been made on this, or on providing a comprehensive
stats/counter mechanism?

What I'm looking to do is three-fold:

1) Get stats on the number of records that are filtered out when using the
FILTER operation
2) Get stats on the number of records dropped/not loaded in a LOAD function
(and actual copies of the records/rows from the file for later evaluation)
3) Output my own stats from a Pig job (without resorting to writing my own
UDF and pushing things into PigStats using the Elephant-Bird utility)

If any of this is possible, it would be great to see some examples or
documentation. I would hate to go to raw Hadoop MR code just to get to
counters.

Thanks,

Josh

Re: Built-in counters

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

The code snippet I wrote was for use inside a UDF; it is not part of Pig Latin.
To get at things like counters when running Pig code, you would have to
write a Java driver program that uses the new API in
https://issues.apache.org/jira/browse/PIG-1478 and
https://issues.apache.org/jira/browse/PIG-1333

-Dmitriy

On Mon, Oct 18, 2010 at 2:57 AM, Josh Devins <in...@joshdevins.net> wrote:
> Ah, sorry, just saw that this should read:
>
> PigStatusReporter.getInstance() and there is no special counters
> keyword/variable. However is this common for Pig, being able to access
> static methods directly from within a Pig script?
>
> Thanks,
>
> Josh

Re: Built-in counters

Posted by Josh Devins <in...@joshdevins.net>.

Ah, sorry, just saw that this should read:

PigStatusReporter.getInstance(), and there is no special counters
keyword/variable. However, is it common in Pig to be able to call
static methods directly from within a Pig script?

Thanks,

Josh


On 18 October 2010 11:56, Josh Devins <in...@joshdevins.net> wrote:

> Thanks, I will explore the stats in MR mode a bit once I'm on 0.8/trunk.
>
> I will also have a look at wrapping some of the standard loaders to get
> better stats out of them. Is this of interest to anyone else? Should I
> submit back to PiggyBank?
>
> This syntax of counters.PigStatusReporter, is that documented somewhere? Is
> it only on 0.8/trunk? What other variables do we have access to in the
> "native" Pig script other than "counters"?
>
> Josh

Re: Built-in counters

Posted by Josh Devins <in...@joshdevins.net>.

Thanks, I will explore the stats in MR mode a bit once I'm on 0.8/trunk.

I will also have a look at wrapping some of the standard loaders to get
better stats out of them. Is this of interest to anyone else? Should I
submit back to PiggyBank?
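
As a rough illustration of the loader-wrapping idea, here is a minimal sketch in plain Java. It is not a Pig LoadFunc: CountingParser and its methods are hypothetical names, and the wrapped "loader" is just a function from a raw line to a parsed value. The sketch assumes a failed parse shows up as a RuntimeException or a null result. It counts loaded and rejected lines and keeps a copy of each rejected line for later evaluation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical wrapper around a line parser; it is not a Pig LoadFunc. */
final class CountingParser<T> {
    private final Function<String, T> delegate;
    private final List<String> rejected = new ArrayList<>();
    private long loaded;

    CountingParser(Function<String, T> delegate) {
        this.delegate = delegate;
    }

    /** Parses one raw line; a failed parse is counted and the raw line is kept. */
    Optional<T> parse(String rawLine) {
        T value;
        try {
            value = delegate.apply(rawLine);
        } catch (RuntimeException e) {
            value = null; // assumption: any runtime failure means a malformed record
        }
        if (value == null) {
            rejected.add(rawLine);
            return Optional.empty();
        }
        loaded++;
        return Optional.of(value);
    }

    long loadedCount() {
        return loaded;
    }

    long rejectedCount() {
        return rejected.size();
    }

    /** Read-only view of the raw lines that could not be parsed. */
    List<String> rejectedLines() {
        return Collections.unmodifiableList(rejected);
    }
}
```

For example, new CountingParser<Integer>(Integer::parseInt) would count "not-a-number" as rejected and keep the raw string. In a real job the rejected lines would more likely be written to a side file than held in memory, and the two counts reported through Hadoop counters.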

This syntax of counters.PigStatusReporter, is that documented somewhere? Is
it only on 0.8/trunk? What other variables do we have access to in the
"native" Pig script other than "counters"?

Josh


On 17 October 2010 19:44, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> No on Filters (though every MR job tells you the number of records
> ingested,
> and the number returned, and as of 0.8 it also tells you which relations
> were being produced in the job -- so you can sort of back into that).
> EB sort of gives you 2), most of the loaders in there give you number of
> malformed records, though they do not store the bad records anywhere.
> I am not sure what you mean by 3) -- you can just increment
> counters. PigStatusReporter.getInstance().getCounter(myEnum).increment(1L);
>
> (watch out for a null reporter when you are still in the client-side).
>
> -D

Re: Built-in counters

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

No on filters (though every MR job tells you the number of records ingested
and the number returned, and as of 0.8 it also tells you which relations
were being produced in the job, so you can sort of back into that).
Elephant-Bird (EB) sort of gives you 2): most of the loaders in there give
you the number of malformed records, though they do not store the bad
records anywhere.
I am not sure what you mean by 3). You can just increment counters:

PigStatusReporter.getInstance().getCounter(myEnum).increment(1L);

(Watch out for a null reporter while you are still on the client side.)
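
The null-reporter caveat can be sketched as a small guard. This is plain Java with hypothetical stand-ins, not Pig's classes: Reporter, InMemoryReporter, SafeCounters and the MyCounters enum are all assumptions made for illustration. In a real UDF the lookup would presumably be PigStatusReporter.getInstance(), and the increment would go through getCounter(...) as in the one-liner above.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical counter names, standing in for "myEnum". */
enum MyCounters { BAD_RECORDS, FILTERED_OUT }

/** Hypothetical stand-in for a status reporter; this is not Pig's class. */
interface Reporter {
    void increment(MyCounters counter, long delta);
}

/** In-memory reporter so the sketch runs without a cluster. */
final class InMemoryReporter implements Reporter {
    private final Map<MyCounters, Long> totals = new EnumMap<>(MyCounters.class);

    @Override
    public void increment(MyCounters counter, long delta) {
        totals.merge(counter, delta, Long::sum);
    }

    long get(MyCounters counter) {
        return totals.getOrDefault(counter, 0L);
    }
}

/** Guards counter updates against a reporter lookup that may return null. */
final class SafeCounters {
    private final Supplier<Reporter> lookup;

    SafeCounters(Supplier<Reporter> lookup) {
        this.lookup = lookup;
    }

    /** Increments the counter if a reporter is available; returns whether it did. */
    boolean increment(MyCounters counter, long delta) {
        Reporter reporter = lookup.get();
        if (reporter == null) {
            return false; // no reporter yet: skip rather than throw a NullPointerException
        }
        reporter.increment(counter, delta);
        return true;
    }
}
```

Looking the reporter up on every call, rather than caching it in a field, is a deliberate choice here: the same code may run first where no reporter exists and later where one does.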

-D

On Sat, Oct 16, 2010 at 2:28 PM, Josh Devins <in...@joshdevins.net> wrote:

> I've seen a few threads about counters, PigStats, Elephant-Bird's stats
> utility class, etc.
>
> http://www.mail-archive.com/pig-user@hadoop.apache.org/msg00900.html
> http://www.mail-archive.com/user%40pig.apache.org/msg00034.html
>
> Has any progress been made on this or to provide a comprehensive
> stats/counter mechanism?
>
> What I'm looking to do is three-fold:
>
> 1) Get stats on the number of records that are filtered out when using the
> FILTER operation
> 2) Get stats on the number of records dropped/not loaded in a LOAD function
> (and actual copies of the records/rows from the file for later evaluation)
> 3) Output my own stats from a Pig job (without resorting to writing my own
> UDF and pushing things into PigStats using the Elephant-Bird utility)
>
> If any of this is possible, it would be great to see some examples or
> documentation. I would hate to go to raw Hadoop MR code just to get to
> counters.
>
> Thanks,
>
> Josh
>