Posted to user@spark.apache.org by Philip Weaver <ph...@gmail.com> on 2015/08/04 04:37:16 UTC

Safe to write to parquet at the same time?

I think this question applies regardless of whether I have two completely
separate Spark jobs or tasks on different machines, or two cores that are
part of the same task on the same machine.

If two jobs/tasks/cores/stages both save to the same parquet directory in
parallel like this:

df1.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)

df2.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)
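
For concreteness, here is a minimal sketch of the concurrent case I mean,
assuming df1, df2 and dir are already defined (the Futures are only there
to force the two write jobs to overlap in time):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SaveMode

    // Kick off both append jobs at once so they run concurrently.
    val job1 = Future { df1.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir) }
    val job2 = Future { df2.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir) }

    // Wait for both writes to finish.
    Await.result(job1, Duration.Inf)
    Await.result(job2, Duration.Inf)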


Will the result be equivalent to this?

df1.unionAll(df2).write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)


What if we ensure that 'dir' does not exist first?

- Philip

Re: Safe to write to parquet at the same time?

Posted by Cheng Lian <li...@gmail.com>.
It should be safe for Spark 1.4.1 and later versions.

Spark SQL now adds a job-wise UUID to output file names to distinguish
files written by different write jobs, so the two write jobs you gave
should play well with each other. The job that commits later will also
generate a summary file covering all the Parquet data files it sees.
(However, Parquet summary file generation can fail for various reasons
and is generally not reliable.)
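
For example, after both jobs commit, each partition directory contains
files from both jobs side by side; the names below are only illustrative
(task numbers, UUIDs and the compression suffix will vary):

    dir/a=1/b=x/part-r-00000-<uuid-of-job-1>.gz.parquet
    dir/a=1/b=x/part-r-00000-<uuid-of-job-2>.gz.parquet
    ...

A quick sanity check is to read the directory back and compare it with
the union (a sketch, assuming the df1, df2 and dir from your mail and an
existing sqlContext):

    val combined = sqlContext.read.parquet(dir)
    // The row count should match the union of the two inputs.
    assert(combined.count() == df1.unionAll(df2).count())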

Cheng
