Posted to issues@spark.apache.org by "Bimalendu Choudhary (Jira)" <ji...@apache.org> on 2021/03/23 15:25:00 UTC

[jira] [Created] (SPARK-34839) FileNotFoundException on _temporary when multiple app write to same table

Bimalendu Choudhary created SPARK-34839:
-------------------------------------------

             Summary: FileNotFoundException on _temporary when multiple app write to same table
                 Key: SPARK-34839
                 URL: https://issues.apache.org/jira/browse/SPARK-34839
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
         Environment: CDH 6.2.1 Hadoop 3.0.0 
            Reporter: Bimalendu Choudhary


When multiple Spark applications are writing to the same Hive table (but to different partitions, so they do not interfere with each other in any way), the application that finishes first ends up deleting the parent _temporary directory that is still in use by the other applications.
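
For illustration, a minimal sketch of the scenario (the table and partition names are hypothetical): two independently submitted applications append to different partitions of the same table, so both of their FileOutputCommitters stage output under the same &lt;table root&gt;/_temporary directory.
{quote}
// Hypothetical reproduction: run this job via two separate spark-submit
// invocations with different values of `part` (e.g. "2021-03-01" and
// "2021-03-02"). Both applications stage output under the same
// <warehouse>/db.events/_temporary, and whichever commits first deletes
// that directory out from under the other.
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.SparkSession;

public class ConcurrentPartitionWrite {
  public static void main(String[] args) {
    String part = args[0]; // partition value, different per application
    SparkSession spark = SparkSession.builder()
        .appName("concurrent-write-" + part)
        .enableHiveSupport()
        .getOrCreate();
    spark.range(0, 1_000_000)
        .withColumn("dt", lit(part))
        .write()
        .mode("append")
        .partitionBy("dt")          // different dt per application
        .saveAsTable("db.events");  // same table for all applications
  }
}
{quote}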

I think the temporary directory used by FileOutputCommitter should be made configurable, so that each caller can supply its own unique value as required, without having to worry about some other application deleting it unknowingly. Something like:
{quote}
 public static final String PENDING_DIR_NAME =
 "mapreduce.fileoutputcommitter.tempdir";
 public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
{quote}
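
If such a key existed, the committer could resolve the pending directory name from the job configuration rather than from a hard-coded constant. A rough sketch, assuming the constants above (this method signature is illustrative, not existing Hadoop API):
{quote}
// Sketch only: inside FileOutputCommitter, look the pending directory name
// up in the job configuration, falling back to today's "_temporary".
private static Path getPendingJobAttemptsPath(Configuration conf, Path out) {
  String pendingDirName = conf.get(PENDING_DIR_NAME, PENDING_DIR_NAME_DEFAULT);
  return new Path(out, pendingDirName);
}
{quote}
Each application could then set mapreduce.fileoutputcommitter.tempdir to a value unique to itself (for example, one incorporating its application ID), so a faster job would never delete a slower job's staging directory.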

We cannot use mapreduce.fileoutputcommitter.algorithm.version = 2 because of its data-loss issue, https://issues.apache.org/jira/browse/MAPREDUCE-7282.

There is a similar Jira, https://issues.apache.org/jira/browse/SPARK-18883, which was not resolved. This is a very generic case of one Spark application breaking another Spark application that is working on the same table, and it can be avoided by making the temporary directory unique or configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org