You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "zzzzming95 (Jira)" <ji...@apache.org> on 2023/02/20 16:00:00 UTC

[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields

     [ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zzzzming95 updated SPARK-42503:
-------------------------------
    Description: 
In Spark SQL, the conditions for the join use fields that are allowed to be fields from a non-left table or a non-right table. In this case, the join will degenerate into a cross join. 

Suppose you have two tables, test1 and test2, which have the same table schema:
{code:java}
```
CREATE TABLE `default`.`test1` (
  `id` INT,
  `name` STRING,
  `age` INT,
  `dt` STRING)
USING parquet
PARTITIONED BY (dt)
```{code}
The following SQL has three joins, but in the last left join, the conditions is `t1.name=t2.name`, and t3.name is not used. So the last left join will be cross join. 
{code:java}
```
select *
from 
(select * from test1 where dt="20230215"  and age=1 ) t1 
left join
(select * from test1 where dt=="20230215"  and age=2) t2
on t1.name=t2.name
left join
(select * from test2 where dt="20230215") t3
on
t1.name=t2.name;
```{code}
So i think Spark SQL should do further validation on join condition, the fields of join condition must be a left table or right table field , otherwise it is thrown `AnalysisException`.

 

  was:
In Spark SQL, the conditions for the join use fields that are allowed to be fields from a non-left table or a non-right table. In this case, the join will degenerate into a cross join. 

Suppose you have two tables, test1 and test2, which have the same table schema:

```
CREATE TABLE `default`.`test1` (
  `id` INT,
  `name` STRING,
  `age` INT,
  `dt` STRING)
USING parquet
PARTITIONED BY (dt)
```

The following SQL has three joins, but in the last left join, the conditions is `t1.name=t2.name`, and t3.name is not used. So the last left join will be cross join. 
```
select *
from 
(select * from test1 where dt="20230215"  and age=1 ) t1 
left join
(select * from test1 where dt=="20230215"  and age=2) t2
on t1.name=t2.name
left join
(select * from test2 where dt="20230215") t3
on
t1.name=t2.name;
```

So i think Spark SQL should do further validation on join condition, the fields of join condition must be a left table or right table field , otherwise it is thrown `AnalysisException`.

 


> Spark SQL should do further validation on join condition fields
> ---------------------------------------------------------------
>
>                 Key: SPARK-42503
>                 URL: https://issues.apache.org/jira/browse/SPARK-42503
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: zzzzming95
>            Priority: Major
>             Fix For: 3.4.0
>
>
> In Spark SQL, the conditions for the join use fields that are allowed to be fields from a non-left table or a non-right table. In this case, the join will degenerate into a cross join. 
> Suppose you have two tables, test1 and test2, which have the same table schema:
> {code:java}
> ```
> CREATE TABLE `default`.`test1` (
>   `id` INT,
>   `name` STRING,
>   `age` INT,
>   `dt` STRING)
> USING parquet
> PARTITIONED BY (dt)
> ```{code}
> The following SQL has three joins, but in the last left join, the conditions is `t1.name=t2.name`, and t3.name is not used. So the last left join will be cross join. 
> {code:java}
> ```
> select *
> from 
> (select * from test1 where dt="20230215"  and age=1 ) t1 
> left join
> (select * from test1 where dt=="20230215"  and age=2) t2
> on t1.name=t2.name
> left join
> (select * from test2 where dt="20230215") t3
> on
> t1.name=t2.name;
> ```{code}
> So i think Spark SQL should do further validation on join condition, the fields of join condition must be a left table or right table field , otherwise it is thrown `AnalysisException`.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org