Posted to issues@spark.apache.org by "Allison Wang (Jira)" <ji...@apache.org> on 2023/10/12 01:08:00 UTC
[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join
[ https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allison Wang updated SPARK-45509:
---------------------------------
Description:
SPARK-45220 discovered a behavior difference in a self-join scenario between classic Spark and Spark Connect.
For instance, here is a query that works without Spark Connect:
{code:python}
from pyspark.sql import functions as sf  # `sf` is the alias used below

joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
joined.show(){code}
But in Spark Connect, it throws this exception:
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
:- LocalRelation [name#64, age#65L]
+- LocalRelation [name#78, height#79L]
{code}
On the other hand, this query fails in classic Spark:
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code}
but the same query works with Spark Connect.
We need to investigate the behavior difference and fix it.
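As a reading aid, the identifiers printed in the two messages above can be put into a small toy model. This is a sketch only: `Attr`, `by_name`, and `by_id` are hypothetical helpers written for this note, not Spark classes or analyzer code, and `age#1` in the self-join is an assumed ID. It illustrates just two observations: the df/df2 join output contains two columns called `name` but only one `name#64`, and in the self-join the same `name#0` appears on both sides.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    """A column as printed in a plan, e.g. name#64 (toy model, not a Spark class)."""
    name: str
    expr_id: int


def by_name(output, name):
    """Return every output column whose name matches."""
    return [a for a in output if a.name == name]


def by_id(output, attr):
    """Return every output column whose name and expression ID both match."""
    return [a for a in output if a == attr]


# Output of the df/df2 join, copied from the plan in the exception above.
join_output = [Attr("name", 64), Attr("age", 65), Attr("name", 78), Attr("height", 79)]

# Output of the self-join: both sides carry name#0 (from the second message).
# age#1 is an assumed ID, used only to fill out the example.
self_join_output = [Attr("name", 0), Attr("age", 1), Attr("name", 0), Attr("age", 1)]

print(len(by_name(join_output, "name")))              # columns called `name`
print(len(by_id(join_output, Attr("name", 64))))      # columns that are name#64
print(len(by_id(self_join_output, Attr("name", 0))))  # columns that are name#0
```

The model does not explain why Spark Connect accepts the second query, nor why it reports the first as unresolved rather than ambiguous; those are the open questions of this ticket.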
> Investigate the behavior difference in self-join
> ------------------------------------------------
>
> Key: SPARK-45509
> URL: https://issues.apache.org/jira/browse/SPARK-45509
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Allison Wang
> Priority: Major