Posted to user@spark.apache.org by jelmer <jk...@gmail.com> on 2021/01/22 13:58:00 UTC

Using same rdd from two threads

Hi,

I have a piece of code in which an RDD is created from a main method.
It then does work on this RDD from 2 different threads running in parallel.

When running this code as part of a test with a local master it will
sometimes make Spark hang (1 task never completes).

If I make a copy of the RDD, the job completes fine.
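
Roughly what the code does, as a simplified sketch (not the actual code; the
names and numbers are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object SharedRddTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[4]").setAppName("shared-rdd-test"))

        // one RDD instance, shared by both threads
        val rdd = sc.parallelize(1 to 1000)

        // two jobs submitted concurrently against the same RDD instance
        val jobA = Future { rdd.map(_ * 2).count() }
        val jobB = Future { rdd.filter(_ % 2 == 0).count() }

        println(Await.result(jobA, Duration.Inf))
        println(Await.result(jobB, Duration.Inf))
        sc.stop()
      }
    }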

I suspect it's a bad idea to use the same RDD from two threads, but I could
not find any documentation on the subject.

Should it be possible to do this, and if not, can anyone point me to
documentation stating that this is not supported?

--jelmer

Re: Using same rdd from two threads

Posted by jelmer <jk...@gmail.com>.
Well it is now...

The RDD had a repartition call on it.

When I removed the repartition call, it would work.
When I did not remove the repartition but called
rdd.partitions.length on the RDD first, it would also work!
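
So, continuing the simplified sketch from my first mail, this is the shape of
the workaround (again not the real code):

    // same setup as before, but now with the repartition call, and with
    // the partitions forced on the main thread before the other threads run
    val rdd = sc.parallelize(1 to 1000).repartition(8)

    // forcing the lazily computed partition metadata here, before any
    // other thread touches the RDD, makes the hang go away
    rdd.partitions.length

    val jobA = Future { rdd.map(_ * 2).count() }
    val jobB = Future { rdd.filter(_ % 2 == 0).count() }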

I looked into the partitions method, and in it some instance variables get
initialized on first use, so saying RDDs are immutable is only true on a
"logical" level.
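
To be clear, this is not Spark's actual code, just an illustration of the
general kind of race you can get with an unsynchronized, lazily initialized
field:

    // NOT Spark's code -- an unsynchronized lazy field can be computed
    // twice, or observed half-built, when two threads hit it at once.
    class LogicallyImmutable {
      private var cached: Array[Int] = _

      def partitions: Array[Int] = {
        if (cached == null) {          // two threads can both see null here...
          cached = expensiveCompute()  // ...and both run the initialization
        }
        cached
      }

      private def expensiveCompute(): Array[Int] = Array.tabulate(8)(i => i)
    }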

It seems I ran into https://issues.apache.org/jira/browse/SPARK-28917

And it looks like this change fixed it

https://github.com/apache/spark/blame/485145326a9c97ede260b0e267ee116f182cfd56/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L298

But since we're using an old version, that does not really help us.


On Fri, 22 Jan 2021 at 15:34, Sean Owen <sr...@gmail.com> wrote:

> RDDs are immutable, and Spark itself is thread-safe. This should be fine.
> Something else is going on in your code.
>
> On Fri, Jan 22, 2021 at 7:59 AM jelmer <jk...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a piece of code in which an RDD is created from a main method.
>> It then does work on this RDD from 2 different threads running in
>> parallel.
>>
>> When running this code as part of a test with a local master it will
>> sometimes make Spark hang (1 task never completes).
>>
>> If I make a copy of the RDD, the job completes fine.
>>
>> I suspect it's a bad idea to use the same RDD from two threads, but I
>> could not find any documentation on the subject.
>>
>> Should it be possible to do this, and if not, can anyone point me to
>> documentation stating that this is not supported?
>>
>> --jelmer
>>
>

Re: Using same rdd from two threads

Posted by Sean Owen <sr...@gmail.com>.
RDDs are immutable, and Spark itself is thread-safe. This should be fine.
Something else is going on in your code.

On Fri, Jan 22, 2021 at 7:59 AM jelmer <jk...@gmail.com> wrote:

> Hi,
>
> I have a piece of code in which an RDD is created from a main method.
> It then does work on this RDD from 2 different threads running in parallel.
>
> When running this code as part of a test with a local master it will
> sometimes make Spark hang (1 task never completes).
>
> If I make a copy of the RDD, the job completes fine.
>
> I suspect it's a bad idea to use the same RDD from two threads, but I could
> not find any documentation on the subject.
>
> Should it be possible to do this, and if not, can anyone point me to
> documentation stating that this is not supported?
>
> --jelmer
>