Posted to issues@spark.apache.org by "Filimonov Valentin (Jira)" <ji...@apache.org> on 2022/06/04 16:20:00 UTC

[jira] [Updated] (SPARK-39379) FileAlreadyExistsException while insertInto() DF to hive table or directly write().parquet()

     [ https://issues.apache.org/jira/browse/SPARK-39379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filimonov Valentin updated SPARK-39379:
---------------------------------------
    Environment: 
java.version = 1.8
spark.version = 2.4.8
hadoop.version = 3.1.3

File Output Committer Algorithm version is 2

FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false

  was:
java.version = 1.8
spark.version = 2.4.8
hadoop.version = 3.1.3

         Labels: FileOutputCommitter spark-sql  (was: )

> FileAlreadyExistsException while insertInto() DF to hive table or directly write().parquet()
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39379
>                 URL: https://issues.apache.org/jira/browse/SPARK-39379
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8
>         Environment: java.version = 1.8
> spark.version = 2.4.8
> hadoop.version = 3.1.3
> File Output Committer Algorithm version is 2
> FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
>            Reporter: Filimonov Valentin
>            Priority: Major
>              Labels: FileOutputCommitter, spark-sql
>
> I have the following table structure, into which I want to write a DataFrame:
>  
> {code:java}
> CREATE EXTERNAL TABLE `usl_rdm_idl_spark_stg.okogu_h`(
>   `ctl_loading` bigint,
>   `ctl_validfrom` timestamp,
>   `end_dt` date,
>   `okogu_accept_dt` date)
> PARTITIONED BY (
>   `p1day` string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'hdfs://FESS-DEV/data/usl/rdm_idl_spark/stg/okogu_h'
> TBLPROPERTIES (
>   'bucketing_version'='2',
>   'spark.sql.partitionProvider'='catalog',
>   'transient_lastDdlTime'='1654082666')
> {code}
>  
> The final DataFrame has the same structure as the table above. The issue occurs when the "p1day" column (the table's partition column) contains only *null* values. When I try to write it with either of the following options
>  
> {code:java}
> finalDF.write().mode(SaveMode.Append).partitionBy("p1day").parquet("somepath");{code}
>  
>  or
>  
> {code:java}
> finalDF.write().mode(SaveMode.Append).insertInto(String.format("%s.%s", tgtSchema, tgtTable));{code}
> I get the following error:
>  
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.fs.FileAlreadyExistsException: /data/usl/rdm_idl_spark/stg/okogu_h/.hive-staging_hive_2022-06-01_16-59-37_442_6329951430234699240-1/-ext-10000/_temporary/0/_temporary/attempt_20220601165937_0116_m_000001_586/p1day=__HIVE_DEFAULT_PARTITION__/part-00001-05999af9-8a25-406e-a307-f97781547db2.c000 for client 10.106.105.11 already exists{code}
>  
>  
> It works correctly for me only when I replace the null values in the "p1day" column with some other value (for example, "1"):
>  
> {code:java}
> finalDF = finalDF.withColumn("p1day", lit("1"));{code}
>  
>  
> Is this a bug in the spark-sql code? I am using org.apache.spark:spark-sql_2.11:2.4.8
>  
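> The `p1day=__HIVE_DEFAULT_PARTITION__` component in the path above is Hive's sentinel directory name for a null partition value, which is presumably why the failure only shows up when "p1day" is entirely null. A rough sketch of that mapping in plain Java (the class and helper below are illustrative, not Spark or Hive API):
>  
> {code:java}
> public class PartitionDirName {
>     // Sentinel Hive uses for a null partition value (visible in the stack trace).
>     static final String HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__";
>
>     // Illustrative helper: the directory-name component for one partition value.
>     static String dirName(String column, String value) {
>         return column + "=" + (value == null ? HIVE_DEFAULT : value);
>     }
>
>     public static void main(String[] args) {
>         System.out.println(dirName("p1day", null)); // p1day=__HIVE_DEFAULT_PARTITION__
>         System.out.println(dirName("p1day", "1"));  // p1day=1
>     }
> }
> {code}
>  
> Replacing the nulls before the write, as in the workaround above, keeps the write out of that sentinel directory entirely.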



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org