Posted to user@spark.apache.org by Ruijing Li <li...@gmail.com> on 2020/05/03 16:31:41 UTC

Good idea to do multi-threading in spark job?

Hi all,

We have a Spark job (Spark 2.4.4, Hadoop 2.7, Scala 2.11.12) that uses
semaphores and parallel collections internally. We see a huge speedup
from doing this, but we were wondering whether it could cause any
unintended side effects. In particular, I'm worried about deadlocks,
and whether it could interfere with the fixes for issues such as
SPARK-26961:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961

We do run with multiple cores.
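
For context, the pattern looks roughly like this (a simplified
sketch: the table names, paths, and pool sizes are made up, and
spark is our SparkSession):

import java.util.concurrent.Semaphore
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Hypothetical: read and write several tables concurrently from the
// driver. Each foreach body triggers its own Spark actions.
val tables = Vector("a", "b", "c", "d", "e", "f").par
tables.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))

val writeGate = new Semaphore(2) // allow at most 2 concurrent writes

tables.foreach { name =>
  val df = spark.read.parquet(s"/in/$name")
  writeGate.acquire()
  try df.write.mode("overwrite").parquet(s"/out/$name")
  finally writeGate.release()
}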

Thanks!
-- 
Cheers,
Ruijing Li

Re: Good idea to do multi-threading in spark job?

Posted by Ruijing Li <li...@gmail.com>.
Thanks for the answer, Sean!

-- 
Cheers,
Ruijing Li

Re: Good idea to do multi-threading in spark job?

Posted by Sean Owen <sr...@gmail.com>.
Spark will by default assume each task needs 1 CPU. On an executor
with 16 cores and 16 slots, you'd schedule 16 tasks. If each task is
itself running 4 threads, then 64 threads are trying to run. If
you're CPU-bound, that could slow things down. But to the extent some
of the tasks spend time blocking on I/O, it could increase overall
utilization. You shouldn't have to worry about Spark itself there,
but you do have to consider that N tasks, each with its own
concurrency, may be executing your code in the same JVM, and whatever
synchronization that implies.
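
If you want the scheduler to account for the extra threads, you can
tell Spark that each task claims more than one core. A rough sketch
(the numbers are illustrative):

import org.apache.spark.sql.SparkSession

// With 16 executor cores and spark.task.cpus=4, Spark schedules
// 16 / 4 = 4 concurrent tasks per executor, so 4 tasks x 4 threads
// each is ~16 busy threads, matching the cores.
val spark = SparkSession.builder()
  .appName("tasks-with-internal-threads")
  .config("spark.executor.cores", "16")
  .config("spark.task.cpus", "4")
  .getOrCreate()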

