Posted to user@spark.apache.org by shahab <sh...@gmail.com> on 2014/11/03 10:57:14 UTC

How does the number of partitions affect performance?

Hi,

I am wondering how the number of partitions affects performance in Spark.

Is it just the parallelism (more partitions, more parallel sub-tasks) that
improves performance, or are there other considerations?

In my case, I ran a couple of map/reduce jobs on the same dataset twice, with
two different partition counts, 7 and 9. I used a standalone cluster,
with two workers on each, where the master resides on the same machine as
one of the workers.

Surprisingly, the map/reduce jobs ran almost 4X-5X faster with 9 partitions
than with 7. Does this mean that choosing the right number of partitions is
the key factor in Spark performance?

best,
/Shahab

Re: How does the number of partitions affect performance?

Posted by shahab <sh...@gmail.com>.
Thanks, Sean, for the very useful comments. I now understand better what
could have messed up my evaluations.

best,
/Shahab

On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes partitions matter. Usually you can use the default, which will
> make a partition per input split, and that's usually good, to let one
> task process one block of data, which will all be on one machine.
>
> Reasons I could imagine why 9 partitions is faster than 7:
>
> Probably: Your cluster can execute at least 9 tasks concurrently. It
> will finish faster since each partition is smaller when split into 9
> partitions. This just means you weren't using your cluster's full
> parallelism at 7.
>
> 9 partitions lets tasks execute entirely locally to the data, whereas
> 7 is too few compared to how the data blocks are distributed on HDFS.
> That is, maybe 7 is inducing a shuffle whereas 9 is not for some
> reason in your code.
>
> Your executors are running near their memory limit and are thrashing
> in GC. With less data to process each, you may avoid thrashing and so
> go a lot faster.
>
> (Or there's some other factor that messed up your measurements :))
>
>
> There can be instances where more partitions is slower too.

Re: How does the number of partitions affect performance?

Posted by Sean Owen <so...@cloudera.com>.
Yes, partitions matter. Usually you can use the default, which makes one
partition per input split. That is usually good: it lets one task process
one block of data, all of which is on one machine.
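
As a rough illustration of that default, here is a minimal sketch that
estimates how many partitions a file would get if each partition maps to one
block-sized input split. The function name, the 1000 MB file and the 128 MB
block size are assumptions made up for the example, and the real split
calculation has more inputs than this.

```python
import math


def estimated_partitions(file_size_bytes, block_size_bytes):
    """Estimate the partition count as one partition per block-sized split.

    Simplified on purpose: it ignores minimum-partition hints and
    unsplittable (e.g. compressed) files.
    """
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)


MB = 1024 * 1024

# Hypothetical input: a 1000 MB file stored in 128 MB blocks.
print(estimated_partitions(1000 * MB, 128 * MB))
```

If the estimate is well below the number of tasks the cluster can run at
once, asking for more partitions explicitly may be worth trying.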

Reasons I could imagine why 9 partitions is faster than 7:

Probably: Your cluster can execute at least 9 tasks concurrently. It
will finish faster since each partition is smaller when split into 9
partitions. This just means you weren't using your cluster's full
parallelism at 7.

9 partitions lets tasks execute entirely locally to the data, whereas
7 is too few compared to how the data blocks are distributed on HDFS.
That is, maybe 7 is inducing a shuffle whereas 9 is not for some
reason in your code.

Your executors are running near their memory limit and are thrashing
in GC. With less data for each task to process, you may avoid thrashing
and so go a lot faster.

(Or there's some other factor that messed up your measurements :))


There can also be cases where more partitions are slower.
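
To put rough numbers on the parallelism explanation, here is a minimal
sketch of an idealized model. It assumes evenly sized partitions, identical
task slots, tasks that run in full "waves", and no scheduling, shuffle or GC
overhead. The function names and the slot counts (9 and 8) are hypothetical
and are not taken from the cluster described in this thread.

```python
import math


def estimated_stage_time(total_work, num_partitions, concurrent_slots):
    """Idealized stage time: tasks run in waves of at most `concurrent_slots`.

    Each task handles total_work / num_partitions units, and each wave takes
    as long as one task.
    """
    waves = math.ceil(num_partitions / concurrent_slots)
    work_per_task = total_work / num_partitions
    return waves * work_per_task


def speedup(total_work, old_partitions, new_partitions, concurrent_slots):
    """How many times faster the new partition count is under this model."""
    old_time = estimated_stage_time(total_work, old_partitions, concurrent_slots)
    new_time = estimated_stage_time(total_work, new_partitions, concurrent_slots)
    return old_time / new_time


# Assumed cluster with 9 task slots: both runs finish in a single wave.
print(round(speedup(100.0, 7, 9, concurrent_slots=9), 2))

# Assumed cluster with 8 task slots: 9 partitions need a second wave.
print(round(speedup(100.0, 7, 9, concurrent_slots=8), 2))
```

Under these assumptions, going from 7 to 9 partitions on 9 slots gives a
speedup of only 9/7, about 1.29x, so parallelism alone would not account for
a 4X-5X difference. With 8 slots the same change is slower (about 0.64x),
which is one way more partitions can hurt.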

