Posted to user@pig.apache.org by zaki rahaman <za...@gmail.com> on 2009/09/04 20:34:58 UTC

Multiquery Optimization Issues

So right off the bat, I fixed the regex patterns in my split, but I kept
getting an error from the multiquery optimizer. Specifically, the
following:

ERROR 2146: Internal Error. Inconsistency in key index found during
optimization. + stacktrace

As a temporary fix, I re-ran the script without multiquery optimization.
As a result, it obviously runs much slower. My question, then, is what
exactly is causing this issue? How can I fix my script so that I can run
my queries and still take advantage of the optimizer?
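
For reference, a minimal sketch of what the corrected SPLIT might look like. It assumes the intent was "userid contains BROKEN" or "userid contains NOPERM"; the exact patterns actually used are not shown in this thread, so treat them as an assumption.

```pig
-- Sketch only: the patterns below are assumed, not taken from the fixed script.
-- Pig's matches operator uses Java regular expressions and has to match the
-- whole string, so a bare leading '*' is not a valid wildcard; '.*' is.
SPLIT dailyraw INTO
    broken IF (userid matches '.*BROKEN.*'),
    noperm IF (userid matches '.*NOPERM.*'),
    daily  IF (NOT ((userid matches '.*BROKEN.*') OR (userid matches '.*NOPERM.*')));
```

Multiquery optimization can be switched off for a single run with Pig's -M (or -no_multiquery) command-line option, for example "pig -M myscript.pig", where myscript.pig is a placeholder file name.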

On Thu, Sep 3, 2009 at 4:03 PM, zaki rahaman <za...@gmail.com> wrote:

> Hi all,
>
> I'm becoming a bit more comfortable writing scripts, but I'm still not
> always sure of the best way to structure my statements to optimize
> performance. When it comes to SPLIT and FILTER, for example, one could
> filter the raw data multiple times or condense the filters into a single
> SPLIT statement, but it's not clear from the docs which is the best
> practice. Below is my script as it stands. Your input would be greatly
> appreciated.
>
> -- Queries for August by Day/Month/Week
>
> REGISTER mypigudfs.jar;
>
> raw = LOAD 'data' AS (timestamp:chararray, ip:chararray, userid:chararray);
>
>
> dailyraw = FOREACH raw GENERATE userid, mypigudfs.ExtractDay(timestamp) AS
> day;
> SPLIT dailyraw INTO broken IF (userid matches '*BROKEN*'), noperm IF
> (userid matches '*NOPERM*'), daily IF (NOT ((userid matches '*BROKEN*') OR
> (userid matches '*NOPERM*')));
>
>
> -- Daily Count(s)
>
> daygrp = GROUP daily BY day PARALLEL 36;
> daycnts = FOREACH daygrp GENERATE group, COUNT(daily);
>
>
> -- NoPerm
> npgrp = GROUP noperm BY day;
> npcnts = FOREACH npgrp GENERATE group, COUNT(noperm);
>
> --Broken
> brkgrp = GROUP broken BY day;
> brkcnts = FOREACH brkgrp GENERATE group, COUNT(broken);
>
>
> -- Weekly Count(s)
>
> weekly = FOREACH daily GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
> wkgrp = GROUP weekly BY week PARALLEL 36;
> wkcnts = FOREACH wkgrp GENERATE group, COUNT(weekly);
>
> --Broken
> broken2 = FOREACH broken GENERATE userid, mypigudfs.ExtractWeek(day) AS
> week;
> brkgrp2 = GROUP broken2 BY week;
> brkcnts2 = FOREACH brkgrp2 GENERATE group, COUNT(broken2);
>
>
> --NoPerm
> noperm2 = FOREACH noperm GENERATE userid, mypigudfs.ExtractWeek(day) AS
> week;
> npgrp2 = GROUP noperm2 BY week;
> npcnts2 = FOREACH npgrp2 GENERATE group, COUNT(noperm2);
>
>
> -- Monthly Count
>
> month = GROUP weekly ALL;
> mcnt = FOREACH month GENERATE COUNT(weekly);
>
> npmonth = GROUP noperm2 ALL;
> npmcnt = FOREACH npmonth GENERATE COUNT(noperm2);
>
> brkmonth = GROUP broken2 ALL;
> brkmcnt = FOREACH brkmonth GENERATE COUNT(broken2);
>
> -- Store Output
>
> --
> Zaki Rahaman
>
>


-- 
Zaki Rahaman