Posted to issues@spark.apache.org by "Filimonov Valentin (Jira)" <ji...@apache.org> on 2022/06/04 16:20:00 UTC
[jira] [Updated] (SPARK-39379) FileAlreadyExistsException while insertInto() DF to hive table or directly write().parquet()
[ https://issues.apache.org/jira/browse/SPARK-39379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Filimonov Valentin updated SPARK-39379:
---------------------------------------
Environment:
java.version = 1.8
spark.version = 2.4.8
hadoop.version = 3.1.3
File Output Committer Algorithm version is 2
FileOutputCommitter skip cleanup _temporary folders under output directory: false, ignore cleanup failures: false
was:
java.version = 1.8
spark.version = 2.4.8
hadoop.version = 3.1.3
Labels: FileOutputCommitter spark-sql (was: )
> FileAlreadyExistsException while insertInto() DF to hive table or directly write().parquet()
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-39379
> URL: https://issues.apache.org/jira/browse/SPARK-39379
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8
> Environment: java.version = 1.8
> spark.version = 2.4.8
> hadoop.version = 3.1.3
> File Output Committer Algorithm version is 2
> FileOutputCommitter skip cleanup _temporary folders under output directory: false, ignore cleanup failures: false
> Reporter: Filimonov Valentin
> Priority: Major
> Labels: FileOutputCommitter, spark-sql
>
> I have such structure of table where I want to write DF:
>
> {code:java}
> CREATE EXTERNAL TABLE `usl_rdm_idl_spark_stg.okogu_h`(
> `ctl_loading` bigint,
> `ctl_validfrom` timestamp,
> `end_dt` date,
> `okogu_accept_dt` date)
> PARTITIONED BY (
> `p1day` string)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'hdfs://FESS-DEV/data/usl/rdm_idl_spark/stg/okogu_h'
> TBLPROPERTIES (
> 'bucketing_version'='2',
> 'spark.sql.partitionProvider'='catalog',
> 'transient_lastDdlTime'='1654082666')
> {code}
>
> The final DF has the same structure as the table above. The issue happens when the partition column "p1day" contains only *null* values. When I try to write it with either option
>
> {code:java}
> finalDF.write().mode(SaveMode.Append).partitionBy("p1day").parquet("somepath");{code}
>
> or
>
> {code:java}
> finalDF.write().mode(SaveMode.Append).insertInto(String.format("%s.%s", tgtSchema, tgtTable));{code}
> I get this error:
>
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.fs.FileAlreadyExistsException: /data/usl/rdm_idl_spark/stg/okogu_h/.hive-staging_hive_2022-06-01_16-59-37_442_6329951430234699240-1/-ext-10000/_temporary/0/_temporary/attempt_20220601165937_0116_m_000001_586/p1day=__HIVE_DEFAULT_PARTITION__/part-00001-05999af9-8a25-406e-a307-f97781547db2.c000 for client 10.106.105.11 already exists{code}
>
>
> It works correctly only when I replace the null values in the "p1day" column with some non-null value (e.g. "1"):
>
> {code:java}
> finalDF = finalDF.withColumn("p1day", lit("1"));{code}
>
>
> Is this a bug in the spark-sql code? I am using org.apache.spark:spark-sql_2.11:2.4.8
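> The workaround above overwrites every value in "p1day". A narrower variant, sketched here and not verified against this environment, is to substitute the placeholder only where the column is null, so rows that already carry a real partition value keep it. The variable name finalDF, the placeholder "1", and the output path "somepath" are taken from the report; coalesce, col, and lit are standard Spark SQL functions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

// Sketch: replace only the null partition values with a placeholder, so no
// row is routed into the __HIVE_DEFAULT_PARTITION__ directory that appears
// in the FileAlreadyExistsException stack trace. "finalDF", the placeholder
// "1", and "somepath" come from the report itself.
Dataset<Row> safeDF = finalDF.withColumn(
        "p1day", coalesce(col("p1day"), lit("1")));
safeDF.write().mode(SaveMode.Append).partitionBy("p1day").parquet("somepath");
```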
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)