Posted to user@spark.apache.org by Alexander Czech <al...@googlemail.com.INVALID> on 2019/07/14 22:10:45 UTC

How to use HDFS >= 3.1.1 with Spark 2.3.3 to output Parquet files to S3?

As the subject suggests, I want to write Parquet output to S3. I know this was
rather troublesome in the past because S3 has no rename operation, so committing
output required a copy+delete.
This issue has been discussed before; see:
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-files-to-s3-with-out-temporary-directory-tc28088.html

Now HADOOP-13786 <https://issues.apache.org/jira/browse/HADOOP-13786> fixes
this problem in Hadoop 3.1.0 and later. How can I use that with Spark 2.3.3?
I usually orchestrate my cluster on EC2 with flintrock
<https://github.com/nchammas/flintrock>. Do I just set HDFS to 3.1.1 in the
flintrock config and everything "just works"? Or do I also have to set a
committer algorithm like this when I create my Spark context in PySpark:

.set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version','some_kind_of_Version')
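
Something along these lines is what I have in mind in PySpark. This is only a
sketch based on the Hadoop S3A committer docs; the committer name, the committer
factory and the spark-hadoop-cloud classes are my assumptions, I have not
verified any of this against Spark 2.3.3, and the bucket/prefix is just a
placeholder:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    # route s3a:// output through the S3A committer factory (Hadoop 3.1+)
    .set('spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a',
         'org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory')
    # pick one of the new committers: "directory", "partitioned" or "magic"
    .set('spark.hadoop.fs.s3a.committer.name', 'directory')
    # these two classes come from the spark-hadoop-cloud module, which may
    # need to be added to the classpath separately for a stock Spark 2.3.3
    .set('spark.sql.sources.commitProtocolClass',
         'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')
    .set('spark.sql.parquet.output.committer.class',
         'org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter')
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# placeholder bucket/prefix, just to show the s3a:// write path
spark.range(10).write.parquet('s3a://my-bucket/output/')

(As far as I understand, mapreduce.fileoutputcommitter.algorithm.version only
switches between the classic v1/v2 file output commit algorithms and is a
separate thing from the new S3A committers, but please correct me if I've got
that wrong.)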

Thanks for the help!