Posted to user@spark.apache.org by Brian Parker <as...@gmail.com> on 2015/08/31 15:51:53 UTC

Parallel execution of RDDs

Hi, I have a large number of RDDs that I need to process separately.
Instead of submitting these jobs to the Spark scheduler one by one, I'd
like to submit them in parallel in order to maximize cluster utilization.

I've tried to process the RDDs as Futures, but the number of active jobs
maxes out at 8 and the runtime is no faster than serial processing (even
with a 15-node cluster). What limits the number of active jobs in the
Spark scheduler?

What are some strategies to maximize cluster utilization with many
(possibly small) RDDs? Is this a good use case for Spark Streaming?
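
To illustrate, here is a minimal sketch of what I'm doing (the real
per-RDD processing is elided and the names are placeholders):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// One Future per RDD; each action (count) submits one Spark job.
// Note: the default global ExecutionContext is sized to the number of
// available processors on the driver, which would cap concurrent
// submissions at 8 on an 8-core machine.
def submitAll(rdds: Seq[RDD[Int]]): Unit = {
  val futures = rdds.map(rdd => Future { rdd.count() })
  futures.foreach(f => Await.result(f, Duration.Inf))
}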

Re: Parallel execution of RDDs

Posted by Brian Parker <as...@gmail.com>.
Thank you for the comments. As you suggested, increasing the thread pool
size allowed more jobs to run in parallel, and decreasing the number of
partitions per RDD allowed more RDDs to execute concurrently. Much
appreciated.
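
For the archives, roughly what the fix looks like (the pool size and
partition count below are illustrative, not the exact values I used):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// A fixed pool larger than the default lets more than 8 jobs be
// submitted concurrently.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))

def processAll(rdds: Seq[RDD[Int]]): Unit = {
  val futures = rdds.map { rdd =>
    // Fewer partitions per RDD means one job no longer fills every
    // core, so several jobs can run side by side.
    Future { rdd.coalesce(8).count() }
  }
  futures.foreach(f => Await.result(f, Duration.Inf))
}
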
On Aug 31, 2015 7:07 AM, "Igor Berman" <ig...@gmail.com> wrote:

> What is the size of the pool you're submitting Spark jobs from (the
> futures you mentioned)? Is it 8? If you have a fixed thread pool of 8,
> there can't be more than 8 parallel jobs running, so try increasing it.
> How many partitions does each of your RDDs have?
> How many cores do your worker machines have (the 15 you mentioned)?
> E.g. if you have 15 * 8 = 120 cores but an RDD with 1000 partitions,
> there's no way you'll get parallel job execution, since a single job
> already fills every core with tasks (unless you manage resources per
> submit/job).
>
>
>
> On 31 August 2015 at 16:51, Brian Parker <as...@gmail.com> wrote:
>
>> Hi, I have a large number of RDDs that I need to process separately.
>> Instead of submitting these jobs to the Spark scheduler one by one, I'd
>> like to submit them in parallel in order to maximize cluster utilization.
>>
>> I've tried to process the RDDs as Futures, but the number of active jobs
>> maxes out at 8 and the runtime is no faster than serial processing (even
>> with a 15-node cluster). What limits the number of active jobs in the
>> Spark scheduler?
>>
>> What are some strategies to maximize cluster utilization with many
>> (possibly small) RDDs? Is this a good use case for Spark Streaming?
>>
>
>

Re: Parallel execution of RDDs

Posted by Igor Berman <ig...@gmail.com>.
What is the size of the pool you're submitting Spark jobs from (the
futures you mentioned)? Is it 8? If you have a fixed thread pool of 8,
there can't be more than 8 parallel jobs running, so try increasing it.
How many partitions does each of your RDDs have?
How many cores do your worker machines have (the 15 you mentioned)?
E.g. if you have 15 * 8 = 120 cores but an RDD with 1000 partitions,
there's no way you'll get parallel job execution, since a single job
already fills every core with tasks (unless you manage resources per
submit/job).
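
E.g. if you do want per-job resource management, the FAIR scheduler is
one option; a rough sketch (the pool and app names are just examples, and
named pools are normally defined in fairscheduler.xml):

import org.apache.spark.{SparkConf, SparkContext}

// FAIR mode lets concurrently submitted jobs share executor cores
// instead of running in the default FIFO order.
val conf = new SparkConf()
  .setAppName("parallel-rdd-jobs")  // app name is illustrative
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Set per thread, before submitting that thread's jobs:
sc.setLocalProperty("spark.scheduler.pool", "smallJobs")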



On 31 August 2015 at 16:51, Brian Parker <as...@gmail.com> wrote:

> Hi, I have a large number of RDDs that I need to process separately.
> Instead of submitting these jobs to the Spark scheduler one by one, I'd
> like to submit them in parallel in order to maximize cluster utilization.
>
> I've tried to process the RDDs as Futures, but the number of active jobs
> maxes out at 8 and the runtime is no faster than serial processing (even
> with a 15-node cluster). What limits the number of active jobs in the
> Spark scheduler?
>
> What are some strategies to maximize cluster utilization with many
> (possibly small) RDDs? Is this a good use case for Spark Streaming?
>