Posted to user@pig.apache.org by Mike Subelsky <mi...@subelsky.com> on 2010/06/29 20:13:36 UTC

filters with lots of match clauses

Hello,

Does this make sense?  I'm generating reports using Pig where I only want to
report on rows matching a set of regular expressions, but those regular
expressions are pretty numerous. Some reports have 500 matching clauses and
others 6000 matching clauses.

Pig fails with an internal error when I run FILTER with all 500 terms, so I
split them into two chunks of 250 terms and UNION the results. It works
great, but is that the sensible thing to do, or am I missing something
obvious?

I haven't tried the 6000 term report yet.  I don't know what percentage of
the data that represents, but I'm tempted to get rid of the FILTER statement
and generate my report for the whole data set, then use a quick script to
select out the 6000 terms, but somehow that seems like "cheating".
 Otherwise I'll repeat the above UNION technique.
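
For a 6000-term report, writing the chunked FILTER and UNION statements by hand would mean 24 chunks of 250 terms, so the script text could be generated instead. Below is a minimal sketch in Python, not something taken from the thread. It assumes each clause has the form `<field> MATCHES '<regex>'` and that backslashes and single quotes must be escaped inside a Pig string literal. The function names, the `field` argument, and the default chunk size of 250 are hypothetical choices for illustration.

```python
def pig_quote(regex):
    """Wrap a regex in a Pig single-quoted string literal.

    Assumption: backslashes and single quotes need a backslash escape
    inside the literal. Check this against your Pig version.
    """
    return "'" + regex.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_filter_script(patterns, field, chunk_size=250, source="f1"):
    """Return Pig Latin text that filters `source` in chunks and UNIONs them.

    `patterns` is a list of regex strings, `field` is the column reference
    to match against (for example "$0"). Both are hypothetical inputs.
    """
    if not patterns:
        raise ValueError("at least one pattern is required")
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")

    # Split the pattern list into consecutive chunks of at most chunk_size.
    chunks = [patterns[i:i + chunk_size]
              for i in range(0, len(patterns), chunk_size)]

    def condition(chunk):
        # One OR-ed boolean expression per chunk.
        return " OR ".join(
            "({} MATCHES {})".format(field, pig_quote(p)) for p in chunk
        )

    lines = ["L = load '{}';".format(source)]
    if len(chunks) == 1:
        # UNION needs at least two relations, so emit a single FILTER.
        lines.append("COMBINED = filter L by {};".format(condition(chunks[0])))
    else:
        aliases = []
        for n, chunk in enumerate(chunks, start=1):
            alias = "F{}".format(n)
            lines.append("{} = filter L by {};".format(alias, condition(chunk)))
            aliases.append(alias)
        lines.append("COMBINED = UNION {};".format(", ".join(aliases)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Tiny demonstration with three made-up patterns and a chunk size of 2.
    print(build_filter_script(["a.*", "b.*", "c.*"], field="$0", chunk_size=2))
```

One caveat with this approach: Pig's UNION does not remove duplicates, so a row that matches patterns in more than one chunk would appear more than once in the combined relation and could inflate later counts.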

Using Hadoop 0.20.2 and Pig 0.6 on Amazon Elastic MR.

thanks!

-Mike

-- 
Mike Subelsky
oib.com // ignitebaltimore.com // subelsky.com
@subelsky

Re: filters with lots of match clauses

Posted by Mike Subelsky <mi...@subelsky.com>.

Hi Thejas,

Ticket created: https://issues.apache.org/jira/browse/PIG-1475

The exact error was:
2010-06-29 15:46:04,579 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2998: Unhandled internal error. null

Your example is close to what I'm doing, but I have to do one grouping step
after the union, like so:

L = load 'f1';
F1 = filter L by exp1 OR exp2 ... exp250 ;
F2 = filter L by exp251 OR exp252... ;
COMBINED = UNION F1,F2;
GROUPED = GROUP COMBINED BY (....) PARALLEL ###;
COUNTS = FOREACH GROUPED GENERATE FLATTEN($0), COUNT($1) AS COUNT;
STORE COUNTS INTO '$OUTPUT';

Not sure if that makes a difference.  Also, the error only happens in script
mode, not when I'm testing in local mode.

-Mike

On Tue, Jun 29, 2010 at 2:54 PM, Thejas Nair <te...@yahoo-inc.com> wrote:

> What internal error are you getting? (Details might be in the log file.)
> This does not sound like a known issue. (A new JIRA would be even more
> useful!)
>
> Your workaround of using UNION should do what you want. I am assuming that
> you have a filter with an OR of the regular expression matches,
> i.e.:
> L = load 'f1';
> Filter1 = filter L by exp1 OR exp2 ... exp250 ;
> Filter2 = filter L by exp251 OR exp252... ;
> OUTPUT = UNION Filter1, Filter2;
>
> Only a single MR job is needed for the above query, so there should not be
> much performance degradation due to the workaround.
>
> If you generate the report for the whole dataset and then use a filter
> script, you would end up doing an additional read/write of the larger dataset.
>
>
> Thanks,
> Thejas



Re: filters with lots of match clauses

Posted by Thejas Nair <te...@yahoo-inc.com>.

What internal error are you getting? (Details might be in the log file.)
This does not sound like a known issue. (A new JIRA would be even more
useful!)

Your workaround of using UNION should do what you want. I am assuming that
you have a filter with an OR of the regular expression matches,
i.e.:
L = load 'f1';
Filter1 = filter L by exp1 OR exp2 ... exp250 ;
Filter2 = filter L by exp251 OR exp252... ;
OUTPUT = UNION Filter1, Filter2;

Only a single MR job is needed for the above query, so there should not be
much performance degradation due to the workaround.

If you generate the report for the whole dataset and then use a filter
script, you would end up doing an additional read/write of the larger dataset.

Thanks,
Thejas
