Posted to issues@spark.apache.org by "Alexey Dmitriev (Jira)" <ji...@apache.org> on 2023/10/30 11:28:00 UTC

[jira] [Comment Edited] (SPARK-45722) False positive when checking for ambiguous columns

    [ https://issues.apache.org/jira/browse/SPARK-45722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780965#comment-17780965 ] 

Alexey Dmitriev edited comment on SPARK-45722 at 10/30/23 11:27 AM:
--------------------------------------------------------------------

Turning off spark.sql.analyzer.failAmbiguousSelfJoin doesn't help, so the issue is probably not exactly where I thought it was:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
session = SparkSession.Builder().getOrCreate()
session.conf.set('spark.sql.analyzer.failAmbiguousSelfJoin', False)
A = session.createDataFrame([(1,)], ['a'])
B = session.createDataFrame([(1,)], ['b'])
A.join(B).select(B.b)
C = A.join(A.join(B), on=F.lit(False), how='leftanti')
C.join(B).select(B.b) {code}
{code:java}
AnalysisException: Resolved attribute(s) b#2L missing from a#0L,b#12L in operator !Project [b#2L]. Attribute(s) with the same name appear in the operation: b. Please check if the right attribute(s) are used.;
!Project [b#2L]
+- Join Inner
   :- Join LeftAnti, false
   :  :- LogicalRDD [a#0L], false
   :  +- Join Inner
   :     :- LogicalRDD [a#9L], false
   :     +- LogicalRDD [b#2L], false
   +- LogicalRDD [b#12L], false
{code}

For some reason, B.b resolves to b#2L, but in C.join(B) the column coming from B is actually b#12L.
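One way to see the ID mismatch directly (a minimal sketch; the concrete #N expression IDs vary from session to session) is to print the analyzed plans:
{code:python}
# Sketch: compare the attribute IDs that the analyzer assigns.
# B's own plan exposes its column with one ID (e.g. b#2L), while the
# re-planned right side of C.join(B) carries a regenerated ID (e.g. b#12L).
B.explain(extended=True)
C.join(B).explain(extended=True)
{code}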


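A possible workaround, assuming the problem is the stale attribute reference carried by B.b, is to alias the DataFrames and address the column by its qualified string name instead (a sketch continuing from the snippet above; the aliases 'c' and 'b2' are arbitrary, and this is not verified against 3.4.0):
{code:python}
# Sketch: avoid the DataFrame-attribute reference B.b and use an
# alias-qualified name, which is resolved against the joined plan itself.
result = C.alias('c').join(B.alias('b2')).select(F.col('b2.b'))
result.show()
{code}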

> False positive when checking for ambiguous columns
> --------------------------------------------------
>
>                 Key: SPARK-45722
>                 URL: https://issues.apache.org/jira/browse/SPARK-45722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0
>         Environment: py3.11 
> pyspark 3.4.0
>            Reporter: Alexey Dmitriev
>            Priority: Major
>
> I have the following code, which I expect to work:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> session = SparkSession.Builder().getOrCreate()
> A = session.createDataFrame([(1,)], ['a'])
> B = session.createDataFrame([(1,)], ['b'])
> A.join(B).select(B.b) # works fine
> C = A.join(A.join(B), on=F.lit(False), how='leftanti') # C has the same columns as A (the actual columns, not just the names)
> C.join(B).select(B.b) # doesn't work, says B.b is ambiguous
> {code}
> Exception below:
> {code:java}
> AnalysisException: Column b#11L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org