Posted to issues@spark.apache.org by "chendihao (Jira)" <ji...@apache.org> on 2020/10/30 09:15:00 UTC

[jira] [Updated] (SPARK-33300) Rule SimplifyCasts will not work for nested columns

     [ https://issues.apache.org/jira/browse/SPARK-33300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chendihao updated SPARK-33300:
------------------------------
    Description: 
We use Spark SQL and Catalyst to optimize our Spark jobs. We read the source code and tested the SimplifyCasts rule, which works for simple SQL without nested casts.

The SQL "select cast(string_date as string) from t1" is optimized: the redundant cast is removed from the optimized plan.

{code:java}
== Analyzed Logical Plan ==
string_date: string
Project [cast(string_date#12 as string) AS string_date#24]
+- SubqueryAlias t1
   +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false
{code}

However, the rule fails to optimize a nested cast such as "select cast(cast(string_date as string) as string) from t1".

{code:java}
== Analyzed Logical Plan ==
CAST(CAST(string_date AS STRING) AS STRING): string
Project [cast(cast(string_date#12 as string) as string) AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- SubqueryAlias t1
   +- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false

== Optimized Logical Plan ==
Project [string_date#12 AS CAST(CAST(string_date AS STRING) AS STRING)#24]
+- LogicalRDD [name#8, c1#9, c2#10, c5#11L, string_date#12, string_timestamp#13, timestamp_field#14, bool_field#15], false
{code}
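
The result we expect for both queries can be written down as a small toy model. This is only a sketch of the expectation, not Catalyst's code: the names `Attr`, `Cast` and `simplify_casts` are hypothetical, types are plain strings, and the model simplifies the inner expression first. The real rule covers more cases.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Attr:
    """A column reference with a known type (toy stand-in, not a Spark class)."""
    name: str
    data_type: str


@dataclass(frozen=True)
class Cast:
    """A cast of `child` to `data_type` (toy stand-in, not a Spark class)."""
    child: Union["Attr", "Cast"]
    data_type: str


def simplify_casts(expr):
    """Drop every cast whose target type equals the type of its simplified child."""
    if not isinstance(expr, Cast):
        return expr
    child = simplify_casts(expr.child)  # simplify the inner expression first
    if child.data_type == expr.data_type:
        return child  # casting to the same type is a no-op
    return Cast(child, expr.data_type)


col = Attr("string_date", "string")
single = Cast(col, "string")                  # cast(string_date as string)
nested = Cast(Cast(col, "string"), "string")  # cast(cast(string_date as string) as string)

simplified_single = simplify_casts(single)
simplified_nested = simplify_casts(nested)
```

In this model both `simplified_single` and `simplified_nested` are the bare column, while a cast that changes the type, such as `Cast(col, "date")`, is kept.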



> Rule SimplifyCasts will not work for nested columns
> ---------------------------------------------------
>
>                 Key: SPARK-33300
>                 URL: https://issues.apache.org/jira/browse/SPARK-33300
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.0.0
>            Reporter: chendihao
>            Priority: Minor


