Posted to issues@spark.apache.org by "Felix Kizhakkel Jose (Jira)" <ji...@apache.org> on 2020/03/06 15:49:00 UTC

[jira] [Comment Edited] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

    [ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053543#comment-17053543 ] 

Felix Kizhakkel Jose edited comment on SPARK-31072 at 3/6/20, 3:48 PM:
-----------------------------------------------------------------------

[~steve_l], 
 I have seen some issues you have addressed in this area (the zero-rename committers for s3a, etc.); could you please give me some insights?

All,
 Please provide some help on this issue.


was (Author: felixkjose):
[~steve_l], 
I have seen some issues you have addressed in this area, could you please give me some insights?

All,
Please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.5
>            Reporter: Felix Kizhakkel Jose
>            Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I write _*Parquet*_, even after I configure the committer as "partitioned" (PartitionedStagingCommitter) with the following settings:
>  * sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);
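> For reference, here is the same configuration as a spark-defaults.conf fragment (a sketch, not taken from my run). The last two spark.sql.* entries are NOT among the settings above; the Hadoop "Committing work to S3 with the S3A Committers" documentation indicates Parquet needs them, and the classes come from the optional spark-hadoop-cloud module (whether that module is on my classpath is an assumption):
>
>     spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a  org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
>     spark.hadoop.fs.s3a.committer.name                          partitioned
>     spark.hadoop.fs.s3a.committer.staging.conflict-mode         append
>     # Not set in my configuration; per the S3A committer docs, Parquet
>     # needs these bindings from spark-hadoop-cloud (assumption):
>     spark.sql.sources.commitProtocolClass      org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
>     spark.sql.parquet.output.committer.class   org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter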
> Application log output when writing Parquet:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as above, it correctly picks the "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:************
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter**********
> So I am wondering why Parquet and ORC have different behavior?
> How can I make Parquet use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started investigating this because, when saving data directly to S3 with partitionBy() on two columns, I was intermittently getting file-not-found exceptions.
> So how can I avoid this issue when writing *Parquet from Spark to S3 via s3a, without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org