Posted to user@spark.apache.org by Manoj Samel <ma...@gmail.com> on 2014/01/23 22:18:14 UTC

Too many RDD partitions ???

Hi,

On some RDD actions, I noticed ~500 tasks being executed. In the task
details, most of the tasks look too small to me, and task
startup/shutdown/coordination overhead may be dominating. The task
durations are:

Min: 5 ms
25th %ile: 9 ms
Median: 10 ms
75th %ile: 13 ms
Max: 40 ms

The RDDs have 428 partitions each, across many RDDs built on top of
one another. The base RDD may benefit from a large number of partitions,
but the RDDs derived from it should have far fewer.

How do I control the number of partitions at the RDD level?

Re: Too many RDD partitions ???

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
You can use the RDD.coalesce method to adjust the partitioning of a
single RDD. Note that you'll need to pass shuffle = true when
increasing the number of partitions.

See: http://spark.incubator.apache.org/docs/latest/api/core/org/apache/spark/rdd/RDD.html#coalesce(Int,Boolean):RDD[T]
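
For example, here is a minimal sketch (assuming a SparkContext named sc;
the RDD names and partition counts are illustrative, not from the
original thread):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "coalesce-example")

    // Base RDD with many partitions, e.g. built from a large input.
    val base = sc.parallelize(1 to 1000000, numSlices = 428)

    // Decreasing the partition count needs no shuffle: each new
    // partition is a union of existing ones, so this is cheap.
    val fewer = base.coalesce(32)

    // Increasing the partition count requires a shuffle, so pass
    // shuffle = true (without it, the count stays unchanged).
    val more = fewer.coalesce(428, shuffle = true)

    println(base.partitions.length)   // 428
    println(fewer.partitions.length)  // 32
    println(more.partitions.length)   // 428

Since narrow transformations such as map and filter preserve their
parent's partitioning, a single coalesce after the wide base stage
should propagate the smaller partition count to every RDD built on
top of it.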
