Posted to user@spark.apache.org by Anubhav Agarwal <an...@gmail.com> on 2016/10/06 15:32:39 UTC

Best Savemode option to write Parquet file

Hi all,
I have searched a bit before posting this query.

Using Spark 1.6.1
Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")

Note: the data in that folder can be deleted, and most of the time the
folder doesn't even exist.

Which SaveMode is best, if one is necessary at all?

I am using SaveMode.Append, which seems to cause a huge amount of shuffle,
as only one executor is doing the actual write. (I may be wrong.)

Would using Overwrite cause all the executors to write to that folder at
once, or would this also send the data to a single executor before writing?

Or should I not use any of the modes at all and just do a write?


Thank You,
Anu
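
(For reference, a minimal Scala sketch of the four SaveMode options; the input and output paths are hypothetical. The mode only controls how existing data at the target path is handled, not how many executors do the writing.)

import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(sc)              // Spark 1.6-style entry point
val df = sqlContext.read.parquet("/tmp/input")   // hypothetical input

// keep whatever is already at the path and add new part files (creates the folder if missing)
df.write.format("parquet").mode(SaveMode.Append).save("/tmp/output")
// delete whatever is at the path first, then write
df.write.format("parquet").mode(SaveMode.Overwrite).save("/tmp/output")
// default: fail if the path already exists
df.write.format("parquet").mode(SaveMode.ErrorIfExists).save("/tmp/output")
// silently skip the write if the path already exists
df.write.format("parquet").mode(SaveMode.Ignore).save("/tmp/output")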

Re: Best Savemode option to write Parquet file

Posted by Chanh Le <gi...@gmail.com>.
Hi,
It depends on your case, but a shuffle is an expensive operation; avoid it unless you want to reduce the number of files. In that case the write is not parallel, so it may cost you a lot of time to write the data.

Regards,
Chanh
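
(A small sketch of the trade-off described above, assuming a DataFrame df is already loaded; paths are hypothetical. Writing as-is is fully parallel, one part file per partition; funnelling the data into fewer partitions reduces the number of files but also the write parallelism.)

// fully parallel: every partition's task writes its own part file, no extra shuffle
df.write.format("parquet").mode(SaveMode.Append).save("/tmp/output")

// fewer output files: coalesce(1) routes everything through a single task, so the
// final write is no longer parallel (repartition(1) would do the same via a shuffle)
df.coalesce(1).write.format("parquet").mode(SaveMode.Append).save("/tmp/output")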



> On Oct 7, 2016, at 1:25 AM, Anubhav Agarwal <an...@gmail.com> wrote:
> 
> Hi,
> I already had the following set:
> sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> 
> Will add the other setting too.
> 
> But my question is: am I correct in assuming Append mode shuffles all the data to one node before writing?
> And do the other modes do the same, or do all executors write to the folder in parallel?
> 
> Thank You,
> Anu
> 
> On Thu, Oct 6, 2016 at 11:36 AM, Chanh Le <giaosudau@gmail.com> wrote:
> Hi Anubhav,
> The best way to store Parquet is to partition it by time, or by a specific field that you will use to mark data for deletion later.
> In my case I partition my data by time, so I can easily delete the data after 30 days.
> Use Append mode and disable the summary information:
> 
> sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
> 
> Regards,
> Chanh
> 
> 
>> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <anubhav33@gmail.com> wrote:
>> 
>> Hi all,
>> I have searched a bit before posting this query.
>> 
>> Using Spark 1.6.1
>> Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
>> 
>> Note: the data in that folder can be deleted, and most of the time the folder doesn't even exist.
>> 
>> Which SaveMode is best, if one is necessary at all?
>> 
>> I am using SaveMode.Append, which seems to cause a huge amount of shuffle, as only one executor is doing the actual write. (I may be wrong.)
>> 
>> Would using Overwrite cause all the executors to write to that folder at once, or would this also send the data to a single executor before writing?
>> 
>> Or should I not use any of the modes at all and just do a write?
>> 
>> 
>> Thank You,
>> Anu
> 
> 


Re: Best Savemode option to write Parquet file

Posted by Anubhav Agarwal <an...@gmail.com>.
Hi,
I already had the following set:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Will add the other setting too.

But my question is: am I correct in assuming Append mode shuffles all the
data to one node before writing?
And do the other modes do the same, or do all executors write to the folder
in parallel?

Thank You,
Anu

On Thu, Oct 6, 2016 at 11:36 AM, Chanh Le <gi...@gmail.com> wrote:

> Hi Anubhav,
> The best way to store Parquet is to partition it by time, or by a specific
> field that you will use to mark data for deletion later.
> In my case I partition my data by time, so I can easily delete the data
> after 30 days.
> Use Append mode and disable the summary information:
>
> sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
>
>
> Regards,
> Chanh
>
>
> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <an...@gmail.com> wrote:
>
> Hi all,
> I have searched a bit before posting this query.
>
> Using Spark 1.6.1
> Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
>
> Note: the data in that folder can be deleted, and most of the time the
> folder doesn't even exist.
>
> Which SaveMode is best, if one is necessary at all?
>
> I am using SaveMode.Append, which seems to cause a huge amount of shuffle,
> as only one executor is doing the actual write. (I may be wrong.)
>
> Would using Overwrite cause all the executors to write to that folder at
> once, or would this also send the data to a single executor before writing?
>
> Or should I not use any of the modes at all and just do a write?
>
>
> Thank You,
> Anu
>
>
>

Re: Best Savemode option to write Parquet file

Posted by Chanh Le <gi...@gmail.com>.
Hi Anubhav,
The best way to store Parquet is to partition it by time, or by a specific field that you will use to mark data for deletion later.
In my case I partition my data by time, so I can easily delete the data after 30 days.
Use Append mode and disable the summary information:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Regards,
Chanh
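
(A sketch of the partition-by-time pattern described above, assuming a DataFrame df with a date column; the column name and base path are illustrative.)

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

df.write
  .format("parquet")
  .mode(SaveMode.Append)
  .partitionBy("date")    // creates date=.../ subdirectories under the base path
  .save("/data/events")   // hypothetical base path

// retention is then just deleting the date=... directories older than 30 days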


> On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal <an...@gmail.com> wrote:
> 
> Hi all,
> I have searched a bit before posting this query.
> 
> Using Spark 1.6.1
> Dataframe.write().format("parquet").mode(SaveMode.Append).save("location")
> 
> Note: the data in that folder can be deleted, and most of the time the folder doesn't even exist.
> 
> Which SaveMode is best, if one is necessary at all?
> 
> I am using SaveMode.Append, which seems to cause a huge amount of shuffle, as only one executor is doing the actual write. (I may be wrong.)
> 
> Would using Overwrite cause all the executors to write to that folder at once, or would this also send the data to a single executor before writing?
> 
> Or should I not use any of the modes at all and just do a write?
> 
> 
> Thank You,
> Anu