Posted to dev@spark.apache.org by "Antonin Delpeuch (lists)" <li...@antonin.delpeuch.eu> on 2020/08/07 13:57:56 UTC

Async RDD saves

Hi all,

Following my request on the user mailing list [1], there does not seem
to be any simple way to save RDDs to the file system in an asynchronous
way. I am looking into implementing this, so I am first checking whether
there is consensus around the idea.

The goal would be to add methods such as `saveAsTextFileAsync` and
`saveAsObjectFileAsync` to the RDD API.

I am thinking about doing this by:

- refactoring SparkHadoopWriter to allow for submitting jobs
asynchronously (with `submitJob` rather than `runJob`)

- add a `saveAsHadoopFileAsync` method in `PairRDDFunctions`,
counterpart to the existing `saveAsHadoopFile`

- add a `saveAsTextFileAsync` (and other formats) in `AsyncRDDActions`.

Because SparkHadoopWriter is private, it is complicated to reimplement
this functionality outside of Spark as a user, so I think this would be
an API worth offering. Hopefully it can be implemented without too much
code duplication.
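
To make the proposal concrete, here is a rough sketch of the shape such a
method could take. Everything in it is an assumption except
`SparkContext.submitJob` and `FutureAction`, which exist today; the real
implementation would go through SparkHadoopWriter's commit protocol rather
than writing part files directly, and would not collect partition contents
to the driver as this simplified version does.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.FutureAction
import org.apache.spark.rdd.RDD

// Hypothetical saveAsTextFileAsync, heavily simplified: each partition is
// rendered as text on the executors and written to a part file from the
// driver as its result arrives. Fine for a sketch, not for large data.
def saveAsTextFileAsync(rdd: RDD[String], path: String): FutureAction[Unit] = {
  Files.createDirectories(Paths.get(path))
  rdd.context.submitJob(
    rdd,
    // Runs on the executors: render one partition as text.
    (iter: Iterator[String]) => iter.mkString("\n"),
    0 until rdd.getNumPartitions,
    // Runs on the driver for each finished partition.
    (index: Int, data: String) =>
      Files.write(Paths.get(path, f"part-$index%05d"), data.getBytes("UTF-8")),
    ()
  )
}
```

The returned `FutureAction` is what makes the save asynchronous: the caller
gets it back immediately and can block, poll, or cancel it.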

Cheers,

Antonin

[1]:
http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html



---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Async RDD saves

Posted by "Antonin Delpeuch (lists)" <li...@antonin.delpeuch.eu>.
Hi both,

Thanks for your replies!

Sean, your proposal to use a driver-side future wrapping the blocking
call sounds a lot easier indeed.

But I want to ensure that canceling the future in the driver code kills
the corresponding tasks on all executors. If I wrap the driver-side call
in a standard Scala or Java future it will not be cancelable, will it? I
think I would need to interrupt the thread that executes the future somehow.
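
For what it's worth, one way to get cancellation from a plain driver-side
future is to tie the job to a job group and cancel the group, which kills
the group's running tasks on the executors. A hedged sketch follows; the
helper name and group-id scheme are made up, but `setJobGroup`,
`clearJobGroup` and `cancelJobGroup` are existing SparkContext APIs:

```scala
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A driver-side future around the blocking save, made cancelable through
// a job group: cancelling the group kills its running tasks, and
// interruptOnCancel = true additionally interrupts the task threads.
def cancellableSave(sc: SparkContext, rdd: RDD[String], path: String)
                   (implicit ec: ExecutionContext): (Future[Unit], () => Unit) = {
  val groupId = s"async-save-${System.nanoTime()}"
  val future = Future {
    // Job groups are tracked per thread, so set the group from inside
    // the thread that actually runs the job.
    sc.setJobGroup(groupId, s"asynchronous save to $path", interruptOnCancel = true)
    try rdd.saveAsTextFile(path)
    finally sc.clearJobGroup()
  }
  (future, () => sc.cancelJobGroup(groupId))
}
```

So rather than interrupting the future's thread directly, the returned
cancel function asks Spark to cancel the whole group.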

As you can see I am far from an expert on this topic, sorry if I
misunderstood your proposal.

Cheers,
Antonin


On 07/08/2020 19:53, Edward Mitchell wrote:
> I will agree that the side effects of using Futures in driver code tend
> to be tricky to track down.
> 
> If you forget to clear the job description and job group information,
> when the LocalProperties on the SparkContext remain intact -
> SparkContext#submitJob makes sure to pass down the localProperties.
> 
> This has led to us doing this hack:
> 
> [image: image.png]
> 
> This can also cause problems with Spark Streaming where the Streaming UI
> can get messed up from the various streaming related properties set
> getting cleared or re-used.
> 
> On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <srowen@gmail.com
> <ma...@gmail.com>> wrote:
> 
>     Why do you need to do it, and can you just use a future in your
>     driver code?
> 
>     On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
>     <lists@antonin.delpeuch.eu <ma...@antonin.delpeuch.eu>> wrote:
>     >
>     > Hi all,
>     >
>     > Following my request on the user mailing list [1], there does not seem
>     > to be any simple way to save RDDs to the file system in an
>     asynchronous
>     > way. I am looking into implementing this, so I am first checking
>     whether
>     > there is consensus around the idea.
>     >
>     > The goal would be to add methods such as `saveAsTextFileAsync` and
>     > `saveAsObjectFileAsync` to the RDD API.
>     >
>     > I am thinking about doing this by:
>     >
>     > - refactoring SparkHadoopWriter to allow for submitting jobs
>     > asynchronously (with `submitJob` rather than `runJob`)
>     >
>     > - add a `saveAsHadoopFileAsync` method in `PairRDDFunctions`,
>     > counterpart to the existing `saveAsHadoopFile`
>     >
>     > - add a `saveAsTextFileAsync` (and other formats) in
>     `AsyncRDDActions`.
>     >
>     > Because SparkHadoopWriter is private, it is complicated to reimplement
>     > this functionality outside of Spark as a user, so I think this
>     would be
>     > an API worth offering. It should be possible to implement this without
>     > too much code duplication hopefully.
>     >
>     > Cheers,
>     >
>     > Antonin
>     >
>     > [1]:
>     >
>     http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
>     >
>     >
>     >
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Async RDD saves

Posted by kalyan <ju...@gmail.com>.
This looks interesting. Anyway, it would be good if you could elaborate
more on the expectations and on the other approaches you tried before
deciding on this one.

Regards,
Kalyan.

On Fri, Aug 7, 2020, 11:24 PM Edward Mitchell <ed...@gmail.com> wrote:

> I will agree that the side effects of using Futures in driver code tend to
> be tricky to track down.
>
> If you forget to clear the job description and job group information, when
> the LocalProperties on the SparkContext remain intact -
> SparkContext#submitJob makes sure to pass down the localProperties.
>
> This has led to us doing this hack:
>
> [image: image.png]
>
> This can also cause problems with Spark Streaming where the Streaming UI
> can get messed up from the various streaming related properties set getting
> cleared or re-used.
>
> On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <sr...@gmail.com> wrote:
>
>> Why do you need to do it, and can you just use a future in your driver
>> code?
>>
>> On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
>> <li...@antonin.delpeuch.eu> wrote:
>> >
>> > Hi all,
>> >
>> > Following my request on the user mailing list [1], there does not seem
>> > to be any simple way to save RDDs to the file system in an asynchronous
>> > way. I am looking into implementing this, so I am first checking whether
>> > there is consensus around the idea.
>> >
>> > The goal would be to add methods such as `saveAsTextFileAsync` and
>> > `saveAsObjectFileAsync` to the RDD API.
>> >
>> > I am thinking about doing this by:
>> >
>> > - refactoring SparkHadoopWriter to allow for submitting jobs
>> > asynchronously (with `submitJob` rather than `runJob`)
>> >
>> > - add a `saveAsHadoopFileAsync` method in `PairRDDFunctions`,
>> > counterpart to the existing `saveAsHadoopFile`
>> >
>> > - add a `saveAsTextFileAsync` (and other formats) in `AsyncRDDActions`.
>> >
>> > Because SparkHadoopWriter is private, it is complicated to reimplement
>> > this functionality outside of Spark as a user, so I think this would be
>> > an API worth offering. It should be possible to implement this without
>> > too much code duplication hopefully.
>> >
>> > Cheers,
>> >
>> > Antonin
>> >
>> > [1]:
>> >
>> http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
>> >
>> >
>> >
>>

Re: Async RDD saves

Posted by Edward Mitchell <ed...@gmail.com>.
I will agree that the side effects of using Futures in driver code tend to
be tricky to track down.

If you forget to clear the job description and job group information, the
LocalProperties on the SparkContext remain intact, and
SparkContext#submitJob passes those localProperties down to subsequent jobs.

This has led to us doing this hack:

[image: image.png]
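
The screenshot did not survive the archive, so the following is only a
guess at the shape of such a hack: snapshot the thread-local job
properties around the asynchronous block and restore them afterwards. The
helper name and the list of tracked keys are assumptions;
`getLocalProperty` and `setLocalProperty` are real SparkContext methods.

```scala
import org.apache.spark.SparkContext

// Keys that setJobGroup/setJobDescription store as thread-local
// properties; the exact list here is an assumption.
val trackedKeys = Seq(
  "spark.jobGroup.id", "spark.job.description",
  "spark.job.interruptOnCancel", "spark.scheduler.pool")

// Snapshot the properties before the block and restore them afterwards,
// so jobs launched from a Future do not leak their description or group
// into later jobs on the same thread. setLocalProperty(key, null) clears.
def withIsolatedJobProperties[T](sc: SparkContext)(body: => T): T = {
  val saved = trackedKeys.map(k => k -> sc.getLocalProperty(k))
  try body
  finally saved.foreach { case (k, v) => sc.setLocalProperty(k, v) }
}
```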

This can also cause problems with Spark Streaming, where the Streaming UI
can get messed up by various streaming-related properties being cleared
or re-used.

On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <sr...@gmail.com> wrote:

> Why do you need to do it, and can you just use a future in your driver
> code?
>
> On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
> <li...@antonin.delpeuch.eu> wrote:
> >
> > Hi all,
> >
> > Following my request on the user mailing list [1], there does not seem
> > to be any simple way to save RDDs to the file system in an asynchronous
> > way. I am looking into implementing this, so I am first checking whether
> > there is consensus around the idea.
> >
> > The goal would be to add methods such as `saveAsTextFileAsync` and
> > `saveAsObjectFileAsync` to the RDD API.
> >
> > I am thinking about doing this by:
> >
> > - refactoring SparkHadoopWriter to allow for submitting jobs
> > asynchronously (with `submitJob` rather than `runJob`)
> >
> > - add a `saveAsHadoopFileAsync` method in `PairRDDFunctions`,
> > counterpart to the existing `saveAsHadoopFile`
> >
> > - add a `saveAsTextFileAsync` (and other formats) in `AsyncRDDActions`.
> >
> > Because SparkHadoopWriter is private, it is complicated to reimplement
> > this functionality outside of Spark as a user, so I think this would be
> > an API worth offering. It should be possible to implement this without
> > too much code duplication hopefully.
> >
> > Cheers,
> >
> > Antonin
> >
> > [1]:
> >
> http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
> >
> >
> >
>
>

Re: Async RDD saves

Posted by Sean Owen <sr...@gmail.com>.
Why do you need to do it, and can you just use a future in your driver code?
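
Spelled out, the suggestion is just to run the blocking save on a
separate driver thread; the name and signature here are placeholders:

```scala
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.rdd.RDD

// The save itself stays blocking, but it runs on a driver-side thread
// from the given ExecutionContext, so the caller returns immediately.
def saveInBackground(rdd: RDD[String], path: String)
                    (implicit ec: ExecutionContext): Future[Unit] =
  Future { rdd.saveAsTextFile(path) }
```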

On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
<li...@antonin.delpeuch.eu> wrote:
>
> Hi all,
>
> Following my request on the user mailing list [1], there does not seem
> to be any simple way to save RDDs to the file system in an asynchronous
> way. I am looking into implementing this, so I am first checking whether
> there is consensus around the idea.
>
> The goal would be to add methods such as `saveAsTextFileAsync` and
> `saveAsObjectFileAsync` to the RDD API.
>
> I am thinking about doing this by:
>
> - refactoring SparkHadoopWriter to allow for submitting jobs
> asynchronously (with `submitJob` rather than `runJob`)
>
> - add a `saveAsHadoopFileAsync` method in `PairRDDFunctions`,
> counterpart to the existing `saveAsHadoopFile`
>
> - add a `saveAsTextFileAsync` (and other formats) in `AsyncRDDActions`.
>
> Because SparkHadoopWriter is private, it is complicated to reimplement
> this functionality outside of Spark as a user, so I think this would be
> an API worth offering. It should be possible to implement this without
> too much code duplication hopefully.
>
> Cheers,
>
> Antonin
>
> [1]:
> http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
>
>
>
>
