Posted to issues@spark.apache.org by "Vincent Doba (Jira)" <ji...@apache.org> on 2021/09/28 08:23:00 UTC

[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

     [ https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Doba updated SPARK-36874:
---------------------------------
    Description: 
When joining two dataframes that share the same lineage, where one dataframe is a transformation of the other, Ambiguous Self-Join detection only works when the transformed dataframe is on the right side of the join.

For instance, given {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous Self-Join detection only triggers when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with an Ambiguous Self-Join error
- {{df2.join(df1, ...)}} does not fail and silently returns a result (an empty dataframe in the example below)

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"), (3, 4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

// Transformed dataframe (df2) on the left: the ambiguity is not detected
df2.join(df1, df1("key1") === df2("key2")).show()
{code}
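
For comparison, a minimal sketch of the reversed join order (same {{df1}} and {{df2}} as above), which, as described above, does trigger the check:

{code:scala}
// With the transformed dataframe (df2) on the right side, the Ambiguous
// Self-Join check fires and the join fails with an AnalysisException.
df1.join(df2, df1("key1") === df2("key2")).show()
{code}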
h2. Expected Result

The {{df2.join(df1, ...)}} call in the reproducible example should throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column key2#11 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
	at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
	at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual Result

Instead of failing, the join returns an empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}
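
h2. Possible Workaround

A minimal sketch of the aliasing approach suggested by the exception message above (not verified against this exact case): alias both datasets with {{Dataset.as}} and refer to the join columns through the aliases.

{code:scala}
// Alias both sides and use qualified column names, as the AnalysisException
// message recommends (sketch only, not verified on this exact case).
val joined = df2.as("b").join(df1.as("a"), $"a.key1" === $"b.key2")
joined.show()
{code}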

h2. Related issue

https://issues.apache.org/jira/browse/SPARK-28344

> Ambiguous Self-Join detected only on right dataframe
> ----------------------------------------------------
>
>                 Key: SPARK-36874
>                 URL: https://issues.apache.org/jira/browse/SPARK-36874
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Vincent Doba
>            Priority: Major
>              Labels: correctness
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
