Posted to user@spark.apache.org by "joffe.tal" <jo...@gmail.com> on 2016/09/29 12:28:13 UTC

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

You can partition explicitly by appending "/<col_name>=<partition value>" to
the end of the path you are writing to, and then use overwrite mode.
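A minimal sketch of that idea in Python (the helper name, bucket, and partition column are illustrative assumptions, not from the thread):

```python
def partitioned_path(base: str, **partitions: str) -> str:
    """Append Hive-style "/col=value" segments to a base output path."""
    segments = "".join(f"/{col}={value}" for col, value in partitions.items())
    return base.rstrip("/") + segments

# Build the partition directory yourself rather than using partitionBy:
out = partitioned_path("s3a://my-bucket/events", dt="2016-09-29")
# out -> "s3a://my-bucket/events/dt=2016-09-29"

# With a DataFrame `df` already filtered down to this one partition, the
# write would then look something like (hypothetical, needs a Spark setup):
#   df.write.mode("overwrite").parquet(out)
```

Because the overwrite is scoped to one explicit partition directory, sibling partitions under the base path are left untouched, which is how this stands in for SaveMode.Append.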

BTW, in Spark 2.0 you just need to use:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")
and use s3a://

and you can work with the regular output committer (in fact,
DirectParquetOutputCommitter is no longer available in Spark 2.0),

so if you are planning on upgrading, this could be another motivation.
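Putting that Spark 2.0 advice together as a sketch (assuming PySpark; the "spark.hadoop." prefix is Spark's documented way of forwarding properties into the Hadoop Configuration, and the bucket/path is a placeholder):

```python
# Equivalent to the Scala one-liner above:
#   sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
# Spark copies any "spark.hadoop."-prefixed property into the underlying
# Hadoop Configuration.
committer_conf = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}

# Hypothetical session setup and write (requires a Spark installation):
#   builder = SparkSession.builder.appName("s3a-writes")
#   for key, value in committer_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
#   df.write.mode("append").parquet("s3a://my-bucket/out")  # note s3a://
```

The v2 commit algorithm moves task output into its final location at task commit instead of renaming everything again at job commit, which is what makes it noticeably faster against S3.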



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3-DirectParquetOutputCommitter-PartitionBy-SaveMode-Append-tp26398p27810.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

Posted by Takeshi Yamamuro <li...@gmail.com>.
I got this info. from a hadoop jira ticket:
https://issues.apache.org/jira/browse/MAPREDUCE-5485

// maropu

On Sat, Oct 1, 2016 at 7:14 PM, Igor Berman <ig...@gmail.com> wrote:

> Takeshi, why are you saying this? How have you checked that it's only
> available from 2.7.3?
> We use Spark 2.0, which ships with a Hadoop 2.7.2 dependency, and we
> use this setting.
> We've sort of "verified" it's used by configuring logging for the file
> output committer.
>
> On 30 September 2016 at 03:12, Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> Hi,
>>
>> FYI: It seems `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")`
>> is only available in hadoop-2.7.3+.
>>
>> // maropu
>>
>>
>> On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal <jo...@gmail.com> wrote:
>>
>>> You can partition explicitly by appending "/<col_name>=<partition value>"
>>> to the end of the path you are writing to, and then use overwrite mode.
>>>
>>> BTW, in Spark 2.0 you just need to use:
>>>
>>> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")
>>> and use s3a://
>>>
>>> and you can work with the regular output committer (in fact,
>>> DirectParquetOutputCommitter is no longer available in Spark 2.0),
>>>
>>> so if you are planning on upgrading, this could be another motivation.
>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

Posted by Igor Berman <ig...@gmail.com>.
Takeshi, why are you saying this? How have you checked that it's only
available from 2.7.3?
We use Spark 2.0, which ships with a Hadoop 2.7.2 dependency, and we
use this setting.
We've sort of "verified" it's used by configuring logging for the file
output committer.
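For reference, one way that verification could be set up (a sketch: it assumes a log4j.properties on the Spark classpath; the logger name is the Hadoop committer class):

```
# Surface the committer's activity (including the commit algorithm it
# chooses) in the driver/executor logs:
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=DEBUG
```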

On 30 September 2016 at 03:12, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Hi,
>
> FYI: It seems `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")`
> is only available in hadoop-2.7.3+.
>
> // maropu
>
>
> On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal <jo...@gmail.com> wrote:
>
>> You can partition explicitly by appending "/<col_name>=<partition value>"
>> to the end of the path you are writing to, and then use overwrite mode.
>>
>> BTW, in Spark 2.0 you just need to use:
>>
>> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")
>> and use s3a://
>>
>> and you can work with the regular output committer (in fact,
>> DirectParquetOutputCommitter is no longer available in Spark 2.0),
>>
>> so if you are planning on upgrading, this could be another motivation.
>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

FYI: It seems `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")`
is only available in hadoop-2.7.3+.

// maropu


On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal <jo...@gmail.com> wrote:

> You can partition explicitly by appending "/<col_name>=<partition value>"
> to the end of the path you are writing to, and then use overwrite mode.
>
> BTW, in Spark 2.0 you just need to use:
>
> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")
> and use s3a://
>
> and you can work with the regular output committer (in fact,
> DirectParquetOutputCommitter is no longer available in Spark 2.0),
>
> so if you are planning on upgrading, this could be another motivation.
>
>
>


-- 
---
Takeshi Yamamuro