Posted to user@spark.apache.org by Matthew Dailey <ma...@gmail.com> on 2016/10/05 16:23:38 UTC

Are Task Closures guaranteed to be accessed by only one Thread?

Looking at the programming guide
<http://spark.apache.org/docs/1.6.1/programming-guide.html#local-vs-cluster-modes>
for Spark 1.6.1, it states
> Prior to execution, Spark computes the task’s closure. The closure is
> those variables and methods which must be visible for the executor to
> perform its computations on the RDD.
> The variables within the closure sent to each executor are now copies
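
For illustration, the guide's nearby counter example shows that copying
(adapted here into a self-contained sketch: each executor increments its
own deserialized copy of counter, so the driver's value can stay 0 in
cluster mode):

import org.apache.spark.{SparkConf, SparkContext}

object ClosureCopyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-copies"))
    var counter = 0
    val rdd = sc.parallelize(1 to 10)
    rdd.foreach(x => counter += x)       // mutates per-executor copies only
    println("Counter value: " + counter) // not the expected sum on a cluster
    sc.stop()
  }
}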

So my question is, will an executor access a single copy of the closure
with more than one thread?  I ask because I want to know if I can ignore
thread-safety in a function I write.  Take a look at this gist as a
simplified example with a thread-unsafe operation being passed to map():
https://gist.github.com/matthew-dailey/4e1ab0aac580151dcfd7fbe6beab84dc
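
In outline, the unsafe pattern looks something like this (a hypothetical
sketch, not the actual gist contents): a non-thread-safe object such as
SimpleDateFormat captured by the closure that map() ships to executors.

import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}

object UnsafeMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unsafe-map"))
    val format = new SimpleDateFormat("yyyy-MM-dd") // not thread-safe
    val days = sc.parallelize(Seq("2016-10-04", "2016-10-05"))
    // This is only safe if no two threads ever touch the same
    // deserialized copy of `format` -- which is exactly the question.
    val parsed = days.map(s => format.parse(s).getTime)
    parsed.collect().foreach(println)
    sc.stop()
  }
}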

This is for Spark Streaming, but I suspect the answer is the same between
batch and streaming.

Thanks for any help,
Matt

Re: Are Task Closures guaranteed to be accessed by only one Thread?

Posted by Denis Bolshakov <bo...@gmail.com>.
In a few words, you cannot ignore thread safety if you use more than 1 core
per executor. A year ago I faced a race condition issue with
SimpleDateFormat, and I solved it using ThreadLocal.
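
Something like this (a sketch; the formatter and date pattern are just
illustrative):

import java.text.SimpleDateFormat

// One SimpleDateFormat per thread. ThreadLocal avoids the race because
// SimpleDateFormat keeps mutable parse state and is not thread-safe.
object SafeFormat {
  private val format = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat =
      new SimpleDateFormat("yyyy-MM-dd")
  }
  def parse(s: String): java.util.Date = format.get().parse(s)
}

// Used inside a transformation:
//   rdd.map(s => SafeFormat.parse(s).getTime)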


Re: Are Task Closures guaranteed to be accessed by only one Thread?

Posted by Sean Owen <so...@cloudera.com>.
I don't think this is guaranteed and don't think I'd rely on it. Ideally
your functions here aren't even stateful, because they could be
reinstantiated and/or re-executed many times due to, say, failures. Not
being stateful dodges a lot of thread-safety issues. If you're doing this
because you have some expensive shared resource, and you're mapping,
consider mapPartitions, and setting up the resource at the start.
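
For example, a rough sketch (ExpensiveClient is a made-up stand-in for
whatever resource is costly to construct):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical expensive, possibly non-thread-safe resource.
class ExpensiveClient {
  def process(s: String): String = s.toUpperCase
}

object MapPartitionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-partitions"))
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    // Each task handles one partition: build the resource once at the
    // start, then reuse it for every record in that partition.
    val out = rdd.mapPartitions { iter =>
      val client = new ExpensiveClient()
      iter.map(client.process)
    }
    out.collect().foreach(println)
    sc.stop()
  }
}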
