Posted to user@spark.apache.org by jelmer <jk...@gmail.com> on 2021/01/22 13:58:00 UTC
Using same rdd from two threads
Hi,

I have a piece of code in which an RDD is created in a main method.
It then does work on this RDD from two different threads running in parallel.

When running this code as part of a test with a local master, it will
sometimes make Spark hang (one task never completes).

If I make a copy of the RDD, the job completes fine.

I suspect it's a bad idea to use the same RDD from two threads, but I could
not find any documentation on the subject.

Should this be possible, and if not, can anyone point me to documentation
stating that it is not supported?

--jelmer
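The original test code is not shown in the thread; the following is a minimal sketch of the setup described above. The object name, dataset, partition count, and `local[2]` master are illustrative assumptions, not taken from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SharedRddSketch {
  def main(args: Array[String]): Unit = {
    // Local master, as in the test setup described in the post.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shared-rdd"))

    // One RDD instance, shared by both jobs below.
    val rdd = sc.parallelize(1 to 1000).repartition(4)

    // Two jobs submitted in parallel against the same RDD instance.
    val jobs = Seq(
      Future(rdd.map(_ * 2).count()),
      Future(rdd.filter(_ % 2 == 0).count())
    )
    jobs.foreach(Await.result(_, Duration.Inf))
    sc.stop()
  }
}
```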
Re: Using same rdd from two threads
Posted by jelmer <jk...@gmail.com>.
Well, it is now...

The RDD had a repartition call on it. When I removed the repartition, the
job would complete. When I did not remove the repartition but first called
rdd.partitions.length on the RDD, it would also complete!

I looked into the partitions method: it initializes some instance variables,
so saying RDDs are immutable is only true on a "logical" level.

It seems I ran into https://issues.apache.org/jira/browse/SPARK-28917

And it looks like this change fixed it:
https://github.com/apache/spark/blame/485145326a9c97ede260b0e267ee116f182cfd56/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L298

But since we're using an old version, that does not really help us.
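Until an upgrade is possible, one workaround consistent with the observation above is to force the lazily initialized partition metadata to be computed on a single thread before the RDD is shared. This is a sketch based on the behavior reported in this thread and in SPARK-28917, not an officially documented guarantee; `rdd` stands for the repartitioned RDD from the scenario.

```scala
// Workaround sketch for Spark versions without the SPARK-28917 fix:
// eagerly evaluate the lazy partitions array on one thread, before any
// other thread runs a job on this RDD.
rdd.partitions        // forces initialization of the partition metadata
// ... only now hand `rdd` to the two worker threads
```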
On Fri, 22 Jan 2021 at 15:34, Sean Owen <sr...@gmail.com> wrote:
> RDDs are immutable, and Spark itself is thread-safe. This should be fine.
> Something else is going on in your code.
>
> On Fri, Jan 22, 2021 at 7:59 AM jelmer <jk...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a piece of code in which an RDD is created in a main method.
>> It then does work on this RDD from two different threads running in
>> parallel.
>>
>> When running this code as part of a test with a local master, it will
>> sometimes make Spark hang (one task never completes).
>>
>> If I make a copy of the RDD, the job completes fine.
>>
>> I suspect it's a bad idea to use the same RDD from two threads, but I
>> could not find any documentation on the subject.
>>
>> Should this be possible, and if not, can anyone point me to
>> documentation stating that it is not supported?
>>
>> --jelmer
>>
>>
>
Re: Using same rdd from two threads
Posted by Sean Owen <sr...@gmail.com>.
RDDs are immutable, and Spark itself is thread-safe. This should be fine.
Something else is going on in your code.
On Fri, Jan 22, 2021 at 7:59 AM jelmer <jk...@gmail.com> wrote:
> Hi,
>
> I have a piece of code in which an RDD is created in a main method.
> It then does work on this RDD from two different threads running in parallel.
>
> When running this code as part of a test with a local master, it will
> sometimes make Spark hang (one task never completes).
>
> If I make a copy of the RDD, the job completes fine.
>
> I suspect it's a bad idea to use the same RDD from two threads, but I could
> not find any documentation on the subject.
>
> Should this be possible, and if not, can anyone point me to documentation
> stating that it is not supported?
>
> --jelmer
>