Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/02/14 04:20:00 UTC

[jira] [Commented] (SPARK-30769) insertInto() with existing column as partition key causes weird partition results

    [ https://issues.apache.org/jira/browse/SPARK-30769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036686#comment-17036686 ] 

Hyukjin Kwon commented on SPARK-30769:
--------------------------------------

Please avoid setting Critical+; it is reserved for committers. Are you able to show a full and self-contained reproducer?
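For reference, a minimal self-contained reproducer along the following lines may hit the same behavior. Table and column names here are hypothetical, and it assumes Spark 2.4.x with Hive support; the suspected cause is that insertInto() resolves columns by position while saveAsTable() moves the partition column to the end of the stored schema.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

val spark = SparkSession.builder()
  .appName("SPARK-30769-repro")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Source data in which the eventual partition column (date_ymd) is NOT the last column.
val df = Seq((1, "2020-01-01", 174), (2, "2020-01-01", 62))
  .toDF("id", "date_ymd", "count")

// First run: partitionBy + saveAsTable stores the schema with the partition
// column moved to the end: (id, count, date_ymd).
df.write.partitionBy("date_ymd").saveAsTable("repro_partitioned")

// Second run: withColumn replaces date_ymd in place, so the DataFrame order is
// still (id, date_ymd, count). insertInto resolves columns by position, not by
// name, so the values of `count` land in the date_ymd slot, producing
// partitions such as date_ymd=174.
df.withColumn("date_ymd", typedLit[String]("2020-01-02"))
  .write.insertInto("repro_partitioned")
{code}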

> insertInto() with existing column as partition key causes weird partition results
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-30769
>                 URL: https://issues.apache.org/jira/browse/SPARK-30769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: EMR 5.29.0 with Spark 2.4.4
>            Reporter: Woong Seok Kang
>            Priority: Major
>
> {code:java}
> val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
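> // Add the run's date as a constant column; if config.dateColumn already exists in the schema, withColumn replaces it in place.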
> val writer = TableWriter.getWriter(tableDF.withColumn(config.dateColumn, typedLit[String](date.toString))) 
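> // Append a new partition if the table already exists; otherwise create it, partitioned by the date column.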
> if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned")) writer.insertInto(tableName)
> else writer.partitionBy(config.dateColumn).saveAsTable(tableName){code}
> This code checks whether the table exists at the desired path (somewhere in S3 in this case). If the table already exists there, a new partition is inserted with the insertInto() function.
> If config.dateColumn does not exist in the table schema, there is no problem (the new column is simply added). But if it already exists in the schema, Spark does not use the given column as the partition key; instead it creates hundreds of partitions. Below is a part of the Spark logs:
> (Note that the partition column is named date_ymd, which already exists in the source table. The original value is a date string like '2020-01-01'.)
> 20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
> 20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
> 20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
> rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
> When I use a different partition key that is not in the table schema, such as 'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I'm just filing the report. (I think it is related to Hive...)
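> If the cause is insertInto()'s position-based column resolution, one possible workaround (just a sketch; it assumes a SparkSession named spark and bypasses TableWriter) is to reorder the DataFrame columns to match the table's stored schema, partition column last, before inserting:
> {code:java}
> import org.apache.spark.sql.functions.{col, typedLit}
>
> // Align the DataFrame column order with the table's stored schema, since
> // insertInto() matches columns by position rather than by name.
> val aligned = tableDF
>   .withColumn(config.dateColumn, typedLit[String](date.toString))
>   .select(spark.table(tableName).columns.map(col): _*)
> aligned.write.insertInto(tableName){code}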
> Thanks for reading!


