Posted to user@pig.apache.org by Cameron Gandevia <cg...@gmail.com> on 2011/11/02 18:17:26 UTC

Reducing pig operations in script

Hey

I am trying to extract performance metrics from some of my logs using Pig
and have come up with the following. I feel like I might be performing one
too many steps and was wondering if there is a way to reduce the number of
FILTER/FOREACH operations I need to run. Still trying to learn the proper
syntax.

uniqLogs = FOREACH logs GENERATE host AS host:CHARARRAY, body AS body:CHARARRAY;
metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
metricLogData = FOREACH metricLogLine GENERATE host,
    REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
fltrdMetricLogData = FILTER metricLogData BY regex IS NOT NULL;
eventCategories = FOREACH fltrdMetricLogData GENERATE host,
    FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);

Thanks
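
For readers who want to check what the two capture groups in the script above pull out, here is a minimal Python sketch. It is an analogy, not Pig: the pattern is the one from the REGEX_EXTRACT_ALL call with the redundant escapes removed, the log lines are invented for illustration, and re.fullmatch is used on the assumption that REGEX_EXTRACT_ALL only returns a tuple when the whole string matches.

```python
import re

# Pattern from the REGEX_EXTRACT_ALL call in the script, with the
# redundant escapes inside the character classes dropped.
PATTERN = re.compile(
    r".*gr.perf.metrics.Category\s*-\s*([A-Za-z._]+)\s+([A-Za-z_.]+)"
)


def extract(body):
    """Return (category, event), or None when the whole line does not match."""
    m = PATTERN.fullmatch(body)
    return m.groups() if m else None


# Hypothetical log lines, invented for illustration only.
ok = "10:17:26 INFO gr.perf.metrics.Category - db.query slow_read"
trailing = ok + " 42ms"

print(extract(ok))        # both groups captured
print(extract(trailing))  # None: text after the event breaks the full match
```

Under that assumption, a line can pass the MATCHES filter and still yield a null tuple (the second line here), which would be why the later null filter in the script is not redundant.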

Re: Reducing pig operations in script

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Let's just say that section of the docs is overly optimistic about what
actually takes time in a Pig job.

D

On Wed, Nov 2, 2011 at 1:45 PM, Cameron Gandevia <cg...@gmail.com> wrote:

> In the Pig documentation there is a section titled "Reduce Your Operator
> Pipeline" that talks about combining FOREACH statements as an optimization.
> It also says you should do the same for FILTER statements. Is this
> incorrect?

Re: Reducing pig operations in script

Posted by Cameron Gandevia <cg...@gmail.com>.
In the Pig documentation there is a section titled "Reduce Your Operator
Pipeline" that talks about combining FOREACH statements as an optimization.
It also says you should do the same for FILTER statements. Is this
incorrect?

-- 
Thanks

Cameron Gandevia

Re: Reducing pig operations in script

Posted by Cameron Gandevia <cg...@gmail.com>.
Cool, thanks.

-- 
Thanks

Cameron Gandevia

Re: Reducing pig operations in script

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Just to be explicit:

This:

x = FILTER something by num1 > 10 AND num2 < 12;

is equivalent to this:

x = FILTER something by num1 > 10;
x = FILTER x by num2 < 12;

All non-blocking operators are evaluated in a streaming fashion, so you
don't need to worry about combining them into a single operator.
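
A rough way to picture the streaming evaluation described above is with Python generators. This is only an analogy and not how Pig is implemented; the function names and the sample rows are made up for illustration.

```python
def source(rows, reads):
    """Yield rows one at a time, logging every read from the input."""
    for row in rows:
        reads.append(row)
        yield row


def chained(records):
    # Two separate filters, like two consecutive FILTER statements.
    step1 = (r for r in records if r["num1"] > 10)
    step2 = (r for r in step1 if r["num2"] < 12)
    return step2


def combined(records):
    # One filter with both conditions joined by AND.
    return (r for r in records if r["num1"] > 10 and r["num2"] < 12)


# Made-up sample data.
rows = [
    {"num1": 5, "num2": 3},
    {"num1": 11, "num2": 11},
    {"num1": 20, "num2": 30},
]

reads_chained, reads_combined = [], []
out_chained = list(chained(source(rows, reads_chained)))
out_combined = list(combined(source(rows, reads_combined)))

print(out_chained == out_combined)               # same rows survive
print(len(reads_chained), len(reads_combined))   # input read once either way
```

In both versions each row is pulled from the input once and passed through the conditions before the next row is read, so splitting the filter does not add a pass over the data.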

Re: Reducing pig operations in script

Posted by Ashutosh Chauhan <ha...@apache.org>.
Hi Cameron,

Your script looks all right. Each of your steps processes the data in a
different way. Instead of cramming them together into a single statement
(possibly via some custom UDF), it makes sense to keep them as a series of
steps, as you have done, for better readability and debuggability. Are you
worried about performance? You need not be. As long as your operations don't
introduce an unnecessary map-reduce boundary (which your script doesn't), you
are good.

Hope it helps,
Ashutosh
