Posted to user@spark.apache.org by Elango Cheran <el...@gmail.com> on 2016/01/26 06:59:47 UTC

multi-threaded Spark jobs

Hi everyone,
I've gone through the effort of figuring out how to modify a Spark job to
have an operation become multi-threaded inside an executor.  I've written
up an explanation of what worked, what didn't work, and why:

http://www.elangocheran.com/blog/2016/01/using-clojure-to-create-multi-threaded-spark-jobs/

I think the ideas there should be applicable generally -- which would
include Scala and Java, since the JVM is genuinely multi-threaded -- and
therefore may be of interest to others.  I will need to convert this code
to Scala for my own purposes in the near future anyway.
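To make the idea concrete, here is a rough sketch in plain Java (the post
itself uses Clojure) of running an I/O-bound operation over one partition's
elements with a small, partition-local thread pool.  Note this is my own
illustrative translation, not code from the post: fetchRecord is a hypothetical
stand-in for a blocking call (HTTP request, DB read, etc.), and in a real Spark
job processPartition would be the body of a function passed to
rdd.mapPartitions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionPool {
    // Hypothetical I/O-bound call; stands in for an HTTP request, DB read, etc.
    static String fetchRecord(int id) {
        return "record-" + id;
    }

    // Process one partition's elements with a small, partition-local thread
    // pool.  In a Spark job, this would be the function given to mapPartitions.
    static List<String> processPartition(Iterator<Integer> ids, int poolSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<String>> futures = new ArrayList<>();
        while (ids.hasNext()) {
            final int id = ids.next();
            // Submit eagerly so up to poolSize calls run concurrently.
            futures.add(pool.submit(() -> fetchRecord(id)));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());  // blocks; preserves input order
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Prints: [record-1, record-2, record-3]
        System.out.println(processPartition(List.of(1, 2, 3).iterator(), 3));
    }
}
```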

I hope this helps.

-- Elango

Re: multi-threaded Spark jobs

Posted by Elango Cheran <el...@gmail.com>.
I think I understand what you're saying, but whether you're
"over-provisioning" or not depends on the nature of your workload, your
system's resources, and how Spark spawns task threads inside executor
processes.

As I concluded in the post: if you're doing CPU-bound work, stick with
.map and don't bother with .mapPartitions.  If you have an asynchronous
API, you might be able to use the .mapPartitions strategy to parallelize
(in combination with core.async/FRP/etc.).  And if you're doing I/O-bound
work, as I was, then a thread pool is a good fit, and it shouldn't cause
much trouble, since threads blocked on I/O aren't contending with other
threads for CPU time.
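A toy, non-Spark demonstration of this last point, with Thread.sleep standing
in for blocking I/O (names and numbers here are purely illustrative): because
each call spends its time blocked rather than burning CPU, three threads
overlap the waits and cut the wall time roughly threefold.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class IoBoundDemo {
    // Simulated blocking I/O call: sleeps, then returns a result.
    static int slowCall(int x) throws InterruptedException {
        Thread.sleep(100);
        return x * 2;
    }

    // Run slowCall over all inputs on a fixed-size pool; return wall time (ms).
    static long timePooled(List<Integer> inputs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        List<Future<Integer>> futures = inputs.stream()
                .map(x -> pool.submit(() -> slowCall(x)))
                .collect(Collectors.toList());
        for (Future<Integer> f : futures) {
            f.get();  // wait for every simulated I/O call to finish
        }
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> inputs = List.of(1, 2, 3, 4, 5, 6);
        // 6 blocking calls of ~100 ms each: ~600 ms sequentially, ~200 ms
        // with 3 threads, since the waits overlap.
        long seq = timePooled(inputs, 1);
        long par = timePooled(inputs, 3);
        System.out.println("1 thread: " + seq + " ms, 3 threads: " + par + " ms");
    }
}
```

The same reasoning is why this approach would be a poor idea for CPU-bound
work: with no blocked time to overlap, extra threads would only contend for
the cores Spark has already assigned to tasks.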

Unless there's something that I'm missing, or something about Spark's
provisioning of threads inside an executor that could be a problem here,
I'm not sure how other tasks would get pre-empted by threads doing blocking
I/O (meaning they're I/O-bound, not CPU-bound).

What I can attest to is that an ETL job whose runtime was dominated by
I/O work used to take a consistent 2h 45m with .map, and after switching to
this multi-threaded approach with a thread pool of 3 threads, it consistently
finished in approx. 45-50 mins.  That matches the back-of-the-envelope
estimate: 3 threads => roughly 1/3 the runtime (165 min / 3 = 55 min).
Increasing the size of the thread pool also led to efficient, successful
runs without any performance concerns.

-- Elango

On Mon, Jan 25, 2016 at 11:12 PM, Igor Berman <ig...@gmail.com> wrote:

> IMHO, you are making a mistake.
> Spark manages tasks and cores internally.  When you open new threads inside
> an executor, you are "over-provisioning" the executor (e.g. tasks on other
> cores will be preempted).
>
>
>
> On 26 January 2016 at 07:59, Elango Cheran <el...@gmail.com>
> wrote:
>
>> Hi everyone,
>> I've gone through the effort of figuring out how to modify a Spark job to
>> have an operation become multi-threaded inside an executor.  I've written
>> up an explanation of what worked, what didn't work, and why:
>>
>>
>> http://www.elangocheran.com/blog/2016/01/using-clojure-to-create-multi-threaded-spark-jobs/
>>
>> I think the ideas there should be applicable generally -- which would
>> include Scala and Java since the JVM is genuinely multi-threaded -- and
>> therefore may be of interest to others.  I will need to convert this code
>> to Scala for personal requirements in the near future, anyways.
>>
>> I hope this helps.
>>
>> -- Elango
>>
>
>

Re: multi-threaded Spark jobs

Posted by Igor Berman <ig...@gmail.com>.
IMHO, you are making a mistake.
Spark manages tasks and cores internally.  When you open new threads inside
an executor, you are "over-provisioning" the executor (e.g. tasks on other
cores will be preempted).



On 26 January 2016 at 07:59, Elango Cheran <el...@gmail.com> wrote:

> Hi everyone,
> I've gone through the effort of figuring out how to modify a Spark job to
> have an operation become multi-threaded inside an executor.  I've written
> up an explanation of what worked, what didn't work, and why:
>
>
> http://www.elangocheran.com/blog/2016/01/using-clojure-to-create-multi-threaded-spark-jobs/
>
> I think the ideas there should be applicable generally -- which would
> include Scala and Java since the JVM is genuinely multi-threaded -- and
> therefore may be of interest to others.  I will need to convert this code
> to Scala for personal requirements in the near future, anyways.
>
> I hope this helps.
>
> -- Elango
>