Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2017/10/13 22:47:00 UTC

[jira] [Comment Edited] (PIG-5309) Problem with tez + union + replicated join

    [ https://issues.apache.org/jira/browse/PIG-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204295#comment-16204295 ] 

Rohini Palaniswamy edited comment on PIG-5309 at 10/13/17 10:46 PM:
--------------------------------------------------------------------

One of our users ran into this issue as well. This is not related to PIG-3856, which is an optimization for the case where the same replicated join data has to be sent to multiple different vertices. Here, the same replicated join data is being sent to a single vertex twice, which causes the error (there can be only one edge between two vertices). oldAFeatures, newAFeatures, and BFeatures all join with the replicated table. The UnionOptimizer ensures there is a single edge for oldAFeatures + newAFeatures (MultiQuery_Union_3/4 e2e testcases), but another edge gets added for BFeatures, which is the issue.
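
To make the "only one edge between two vertices" constraint concrete, here is a minimal toy sketch in Python. It is not Tez or Pig code: ToyDag, build, the vertex names, and the dedupe flag are all hypothetical, and the dedupe branch is only one conceivable way to avoid the duplicate, not necessarily how PIG-5309 is fixed. It assumes, as described above, that the union branches have already been reduced to one edge request and that the BFeatures join asks for a second edge between the same pair of vertices.

```python
class ToyDag:
    """Toy stand-in for a DAG that allows one edge per (source, target) pair."""

    def __init__(self):
        self.edges = set()

    def add_edge(self, src, dst):
        # Mimics the failure in the report: a second edge between the same
        # two vertices is rejected instead of being silently accepted.
        if (src, dst) in self.edges:
            raise ValueError(f"Edge [{src}] -> [{dst}] already defined!")
        self.edges.add((src, dst))


# Hypothetical vertex names. Both requests connect the same two vertices.
REQUESTS = [
    ("featureToExtraData", "consumer"),  # for oldAFeatures + newAFeatures
    ("featureToExtraData", "consumer"),  # for BFeatures
]


def build(requests, dedupe):
    """Add each requested broadcast edge, optionally skipping duplicates."""
    dag = ToyDag()
    for src, dst in requests:
        if dedupe and (src, dst) in dag.edges:
            continue  # assumption: reuse the existing edge instead of adding one
        dag.add_edge(src, dst)
    return dag
```

Calling build(REQUESTS, dedupe=False) raises the "already defined" error on the second request, while build(REQUESTS, dedupe=True) ends up with a single edge.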




was (Author: rohini):
Had one of our users run into this as well. This is not related to PIG-3856, which is an optimization for the case where the same replicated join data has to be sent to multiple different vertices. Here, the same replicated join data is being sent to a single vertex twice, which causes the error (there can be only one edge between two vertices). oldAFeatures, newAFeatures, and BFeatures all join with the replicated table. The UnionOptimizer ensures there is a single edge for oldAFeatures + newAFeatures (MultiQuery_Union_3/4 e2e testcases), but another edge gets added for BFeatures, which is the issue.

> Problem with tez + union + replicated join
> ------------------------------------------
>
>                 Key: PIG-5309
>                 URL: https://issues.apache.org/jira/browse/PIG-5309
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>    Affects Versions: 0.17.0
>            Reporter: Will Oberman
>            Assignee: Rohini Palaniswamy
>            Priority: Minor
>             Fix For: 0.18.0, 0.17.1
>
>
> I've been using Pig 0.12.1 for quite some time and am finally upgrading to 0.17.  One of my existing scripts failed.  I have a workaround (SET pig.tez.opt.union false), but I thought I'd pass on the problem I observed.  
> In stdout: 
> {noformat}
> ERROR 2017: Internal error creating job configuration.
> {noformat}
> In the Pig log:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Edge [scope-93 : org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor] -> [scope-83 : org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor] ({ BROADCAST : org.apache.tez.runtime.library.input.UnorderedKVInput >> PERSISTED >> org.apache.tez.runtime.library.output.UnorderedKVOutput >> NullEdgeManager }) already defined!
> 	at org.apache.tez.dag.api.DAG.addEdge(DAG.java:272)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.visitTezOp(TezDagBuilder.java:404)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:259)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:56)
> 	at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:87)
> 	at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.buildDAG(TezJobCompiler.java:69)
> 	at org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:120)
> 	... 20 more
> {noformat}
> I reduced the failing script to a minimal test script that reproduces the failure:
> {noformat}
> weblogs = LOAD '/tmp/in/weblogInfo' as (path:chararray, queryMap:map[chararray]); 
> featureToExtraData = LOAD '/tmp/in/featureToExtraData' as (feature:chararray, extraData:chararray); 
> oldA = FILTER weblogs BY path == '/A';
> newA = FILTER weblogs BY path == '/somethingElse';
> B = FILTER weblogs BY path == '/B';
> oldAFeatures = FOREACH oldA GENERATE queryMap#'feature1' as feature1, queryMap#'feature2' as feature2;
> newAFeatures = FOREACH newA GENERATE queryMap#'different1' as feature1, queryMap#'different2' as feature2;
> AFeatures = UNION oldAFeatures, newAFeatures;
> AFeaturesPlusMore = JOIN AFeatures BY feature1 LEFT, featureToExtraData BY feature USING 'replicated';
> BFeatures = FOREACH B GENERATE queryMap#'somethingElseEntirely1' as feature1, queryMap#'somethingElseEntirely2' as feature2;
> BFeaturesPlusMore = JOIN BFeatures BY feature1 LEFT, featureToExtraData BY feature USING 'replicated';
> STORE AFeaturesPlusMore INTO '/tmp/out/1/AFeaturesPlusMore';
> STORE BFeaturesPlusMore INTO '/tmp/out/1/BFeaturesPlusMore';
> {noformat}


