Posted to dev@pig.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2013/10/08 23:28:44 UTC

[jira] [Updated] (PIG-3492) ColumnPrune dropping used column due to LogicalRelationalOperator.fixDuplicateUids changes not propagating

     [ https://issues.apache.org/jira/browse/PIG-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-3492:
------------------------------

    Attachment: pig-3492-v0.12_01.patch

I see three Jiras that added LogicalRelationalOperator.fixDuplicateUids.

* PIG-3020 (LOJoin) "'Duplicate uid in schema' error when joining two relations derived from the same load statement"
* PIG-3144 (LOGenerate) "Erroneous map entry alias resolution leading to 'Duplicate schema alias' errors"
* PIG-3292 (LOCross) "Logical plan invalid state: duplicate uid in schema during self-join to get cross product"

I'm skipping PIG-3292 since Daniel reviewed it with the comment
  "Interplay with ColumnPruner is fine here since nested plan will include entire required plan branch"


PIG-3020 (LOJoin) actually talks about two separate problems.
(i-1) PigParser failing with 'Duplicate schema alias: age'. This only happened in 0.11.
   It was actually caused by ImplicitSplitInserter's new uid not propagating to the top foreach.
I believe this issue was fixed later by PIG-3310 ("ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations", fixed only in 0.12). I confirmed this by running a simple test without LogicalRelationalOperator.fixDuplicateUids.

(i-2) 'describe' showing an incorrect schema due to a duplicate UID. This happened on 0.10 and 0.11.
   It was due to 'describe' being called without LogicalPlanOptimizer.optimize(), which includes some important rules like ImplicitSplitInserter and DuplicateForEachColumnRewrite.

(ii) The PIG-3144 (LOGenerate) issue seems to have started with a completely unrelated Jira,

    PIG-2710 "Implement Naive CUBE operator" in 0.11.

{noformat}
src/org/apache/pig/parser/LogicalPlanBuilder.java

+    private void expandAndResetVisitor(SourceLocation loc,
+      LogicalRelationalOperator lrop) throws ParserValidationException {
+          try {
+              (new ProjectStarExpander(lrop.getPlan())).visit();
+              (new ProjStarInUdfExpander(lrop.getPlan())).visit();
+              new SchemaResetter(lrop.getPlan(), true).visit();
+          } catch (FrontendException e) {
+              throw new ParserValidationException(intStream, loc, e);
+          }
+    }

     String buildForeachOp(SourceLocation loc, LOForEach op, String alias, String inputAlias, LogicalPlan innerPlan)
     throws ParserValidationException {
         op.setInnerPlan( innerPlan );
         alias = buildOp( loc, op, alias, inputAlias, null );

-             (new ProjectStarExpander(op.getPlan())).visit(op);
-             (new ProjStarInUdfExpander(op.getPlan())).visit(op);
-             new SchemaResetter(op.getPlan(), true).visit(op);

+        expandAndResetVisitor(loc, op);
         return alias;
     }
{noformat}

So basically, since PIG-2710 every operator build traverses the entire plan (visit()) instead of just the operator it is working on (visit(op)).
As a result, the 'alias' gets updated before LogicalPlanOptimizer.optimize() -> DuplicateForEachColumnRewrite runs, which causes the
"Duplicate schema alias" error.  Rolling back this change seems to bring back the pre-0.11 behavior.
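
To make the visit() vs. visit(op) difference concrete, here is a minimal toy sketch. It is only an illustration under my own assumptions: ToyPlan, ToyResetter and build() are hypothetical stand-ins and not Pig's LogicalPlan or SchemaResetter. It just counts how often each operator is touched while a plan is built one operator at a time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Pig's real classes.
final class PlanVisitSketch {

    /** A toy plan: just an ordered list of operator names. */
    static final class ToyPlan {
        final List<String> operators = new ArrayList<>();
    }

    /** Counts how often each operator has its schema "reset". */
    static final class ToyResetter {
        private final ToyPlan plan;
        private final Map<String, Integer> resetCounts;

        ToyResetter(ToyPlan plan, Map<String, Integer> resetCounts) {
            this.plan = plan;
            this.resetCounts = resetCounts;
        }

        /** Whole-plan traversal: touches every operator added so far. */
        void visit() {
            for (String op : plan.operators) {
                visit(op);
            }
        }

        /** Single-operator visit: touches only the given operator. */
        void visit(String op) {
            resetCounts.merge(op, 1, Integer::sum);
        }
    }

    /** Builds a three-operator plan, running the resetter after each build step. */
    static Map<String, Integer> build(boolean wholePlan) {
        ToyPlan plan = new ToyPlan();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String op : new String[] {"load", "foreach", "store"}) {
            plan.operators.add(op);
            ToyResetter resetter = new ToyResetter(plan, counts);
            if (wholePlan) {
                resetter.visit();      // analogous to the post-PIG-2710 code path
            } else {
                resetter.visit(op);    // analogous to the pre-0.11 code path
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("visit():   " + build(true));
        System.out.println("visit(op): " + build(false));
    }
}
```

With the whole-plan variant, operators built earlier get revisited on every later build step, which is the kind of early, repeated update described above. With visit(op), each operator is touched exactly once.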


Uploading an initial patch. The goal is to take out LogicalRelationalOperator.fixDuplicateUids from both PIG-3020 (LOJoin) and PIG-3144 (LOGenerate).

(i-1) For release-0.12: No-op.  For release-0.11: Backport PIG-3310.
(i-2) We can fix it either by forcing compilePp() before describe or by moving ImplicitSplitInserter/DuplicateForEachColumnRewrite to PigServer.compile().
      There is a comment in PigServer.compile() that says

{noformat}
./src/org/apache/pig/PigServer.java
        private void compile(LogicalPlan lp) throws FrontendException  {
....

            // TODO: move optimizer here from HExecuteEngine.
            // TODO: input/output validation visitor

{noformat}

       For now, I'm taking the easy approach of calling compilePp() for describe.

(ii) I'm rolling back the small section of PIG-2710 in src/org/apache/pig/parser/LogicalPlanBuilder.java shown above. Hopefully that change was only meant to shorten the code and the change in behavior was unintended.

For now, the patch only applies to release 0.12, since it seems the location of LogicalPlanOptimizer.optimize() may change in the near future (PIG-3508).

> ColumnPrune dropping used column due to LogicalRelationalOperator.fixDuplicateUids changes not propagating
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-3492
>                 URL: https://issues.apache.org/jira/browse/PIG-3492
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1, 0.12.1, 0.13.0
>            Reporter: Koji Noguchi
>         Attachments: pig-3492-v0.12_01.patch
>
>
> I don't have a testcase I can upload at the moment, but here's my observation.
> SplitFilter -> schemaResetter -> LOGenerate.getSchema -> LogicalRelationalOperator.fixDuplicateUids() creates a new UID, but that UID is not propagated to the entire plan (since SplitFilter.reportChanges only returns the subplan).
> As a result, I am seeing ColumnPruning cut off those used columns.
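
The failure mode in the description above can be sketched with a toy model. This is only an illustration under my own assumptions: prune() and run() are hypothetical helpers, the uid values are made up, and none of this is Pig's actual ColumnPruner logic. The idea is that a pruner keeps a column only if its uid is in the set that downstream operators say they need, so a uid regenerated upstream without updating downstream makes a used column look unused.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical helpers, not Pig's ColumnPruner.
final class UidPruneSketch {

    /** Keeps only the columns whose uid is required by downstream operators. */
    static List<String> prune(Map<String, Long> producedUids, Set<Long> requiredUids) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Long> column : producedUids.entrySet()) {
            if (requiredUids.contains(column.getValue())) {
                kept.add(column.getKey());
            }
        }
        return kept;
    }

    /** Simulates a uid being regenerated upstream, with or without propagation. */
    static List<String> run(boolean propagateNewUid) {
        Map<String, Long> produced = new LinkedHashMap<>();
        produced.put("name", 1L);
        produced.put("age", 2L);

        // Downstream uses both columns, identified by their original uids.
        Set<Long> required = new HashSet<>(produced.values());

        // Upstream regenerates the uid for "age" (made-up value).
        long newUid = 7L;
        produced.put("age", newUid);

        if (propagateNewUid) {
            // The whole plan learns about the new uid.
            required.remove(2L);
            required.add(newUid);
        }
        return prune(produced, required);
    }

    public static void main(String[] args) {
        System.out.println("not propagated: " + run(false));
        System.out.println("propagated:     " + run(true));
    }
}
```

In the non-propagated case the "age" column is dropped even though a downstream operator still uses it, which matches the symptom reported here.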


