Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2017/07/12 04:40:00 UTC

[jira] [Comment Edited] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased

    [ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083434#comment-16083434 ] 

Dongjoon Hyun edited comment on SPARK-21380 at 7/12/17 4:39 AM:
----------------------------------------------------------------

One row in a real (non-constant) table is fine.
Your example uses only constant literals, so `FoldablePropagation` and `ConstantFolding` are applied. See the optimized result:
{code}
l_name#11 = r_name
{code}
becomes
{code}
bob = bob
{code}
and, finally,
{code}
true
{code}.

So the `Join` ends up with no usable condition after optimization, and it becomes a Cartesian join.
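A minimal sketch of how one might observe this (not part of the original comment; it reuses the `left`/`right` Datasets from the reporter's example in the description below and the standard `spark.sql.crossJoin.enabled` flag so the query can actually be planned):

{code}
// Sketch only: assumes the left/right Datasets from the reporter's example below.
// Allowing cartesian products lets the optimizer finish, and explain(true)
// prints the optimized plan, where the original l_name = r_name equality has
// been folded down to a trivial constant and no longer constrains the Join.
spark.conf().set("spark.sql.crossJoin.enabled", "true");

Dataset<Row> result = left.join(
    right,
    left.col("l_name").equalTo(right.col("r_name")),
    "inner");

result.explain(true);  // inspect the optimized logical plan
result.show();         // still one row here, since both inputs are single-row
{code}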


was (Author: dongjoon):
One row in a real (non-constant) table is fine.
Your example uses only constant literals, so `FoldablePropagation` and `ConstantFolding` are applied. See the optimized result.

> Join with Columns thinks inner join is cross join even when aliased
> -------------------------------------------------------------------
>
>                 Key: SPARK-21380
>                 URL: https://issues.apache.org/jira/browse/SPARK-21380
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Everett Anderson
>              Labels: correctness
>
> While this seemed to work in Spark 2.0.2, it fails in 2.1.0 and 2.1.1.
> Even after aliasing both the table names and all the columns, joining Datasets using criteria assembled from Columns, rather than with the join(.... usingColumns) method variants, fails with an error complaining that the join is a cross join / Cartesian product even when it isn't.
> Example:
> {noformat}
>     Dataset<Row> left = spark.sql("select 'bob' as name, 23 as age");
>     left = left
>         .alias("l")
>         .select(
>             left.col("name").as("l_name"),
>             left.col("age").as("l_age"));
>     Dataset<Row> right = spark.sql("select 'bob' as name, 'bobco' as company");
>     right = right
>         .alias("r")
>         .select(
>             right.col("name").as("r_name"),
>             right.col("company").as("r_age"));
>     Dataset<Row> result = left.join(
>         right,
>         left.col("l_name").equalTo(right.col("r_name")),
>         "inner");
>     result.show();
> {noformat}
> Results in
> {noformat}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
> Project [bob AS l_name#22, 23 AS l_age#23]
> +- OneRowRelation$
> and
> Project [bob AS r_name#33, bobco AS r_age#34]
> +- OneRowRelation$
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these relations.;
> 	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1067)
> 	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1064)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
> 	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1064)
> 	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1049)
> 	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> 	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> 	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> 	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> 	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
> 	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
> 	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
> 	at scala.collection.immutable.List.foreach(List.scala:381)
> 	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
> 	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
> 	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
> 	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
> 	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
> 	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
> 	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
> 	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2814)
> 	at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
> 	at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
> 	at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
> 	at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
> 	at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
> 	at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
> 	at com.nuna.platform.common.spark.util.JoinBuilderIntegrationTest.testSimpleJoin(JoinBuilderIntegrationTest.java:129)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
> 	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> 	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
> 	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
> 	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
> 	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {noformat}
> This is related to other issues like SPARK-14854.
> I feel like in many of these cases, Spark shouldn't be considering these joins as Cartesian products. Usually, I run across this when one table is derived from another, but in this case it happens even when the two tables have fully distinct lineages.
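A possible workaround sketch (not from the original report), under the assumption that columns backed by locally materialized Rows are ordinary attributes rather than foldable literals: building the single-row inputs with `createDataFrame` instead of constant-only SQL should keep the join condition intact through optimization.

{code}
// Workaround sketch (not part of the original report). Build the single-row
// inputs from local Rows instead of constant-only SQL: the columns are then
// plain attributes, not foldable literals, so FoldablePropagation/ConstantFolding
// cannot reduce the join condition to a constant.
import java.util.Arrays;
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType leftSchema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("l_name", DataTypes.StringType, false),
    DataTypes.createStructField("l_age", DataTypes.IntegerType, false)));
Dataset<Row> left = spark.createDataFrame(
    Collections.singletonList(RowFactory.create("bob", 23)), leftSchema);

StructType rightSchema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("r_name", DataTypes.StringType, false),
    DataTypes.createStructField("r_company", DataTypes.StringType, false)));
Dataset<Row> right = spark.createDataFrame(
    Collections.singletonList(RowFactory.create("bob", "bobco")), rightSchema);

Dataset<Row> result = left.join(
    right,
    left.col("l_name").equalTo(right.col("r_name")),
    "inner");
result.show();  // expected: the single joined row, without the cross-join error
{code}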



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
