You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/01/13 19:12:00 UTC

[jira] [Assigned] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

     [ https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41982:
------------------------------------

    Assignee:     (was: Apache Spark)

> When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41982
>                 URL: https://issues.apache.org/jira/browse/SPARK-41982
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: jingxiong zhong
>            Priority: Critical
>
> At present, during the process of upgrading Spark2.4 to Spark3.2, we carefully read the migration documentwe and found a kind of situation not involved:
> {code:java}
> create table if not exists test_90(a string, b string) partitioned by (dt string);
> desc formatted test_90;
> // case1
> insert into table test_90 partition (dt=05) values("1","2");
> // case2
> insert into table test_90 partition (dt='05') values("1","2");
> drop table test_90;{code}
> in spark2.4.3, it will generate such a path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> //result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       05
> Time taken: 1.316 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90; 
> dt=05 
> Time taken: 0.201 seconds, Fetched 1 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1       2       05
> 1       2       05
> Time taken: 0.212 seconds, Fetched 2 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
> == Physical Plan ==
> Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, [a, b]
> +- LocalTableScan [a#116, b#117]
> Time taken: 1.145 seconds, Fetched 1 row(s){code}
> in spark3.2.0, it will generate two path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> hdfs://test5/user/hive/db1/test_90/dt=5 
> // result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       5
> Time taken: 2.119 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90;
> dt=05
> dt=5
> Time taken: 0.161 seconds, Fetched 2 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1       2       05
> Time taken: 0.252 seconds, Fetched 1 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
> plan
> == Physical Plan ==
> Execute InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
> +- LocalTableScan [a#109, b#110]{code}
> This will cause problems in reading data after the user switches to spark3. The root cause is that in the process of partition field resolution, Spark3 has a process of strongly converting this string type, which will cause partition `05` to lose the previous `0`
> So I think we have two solutions:
> one is to record the risk clearly in the migration document, and the other is to repair this case, because we internally keep the partition of string type as string type, regardless of whether single or double quotation marks are added.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org