You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Walrus theCat <wa...@gmail.com> on 2014/03/24 21:28:18 UTC

question about partitions

Hi,

Quick question about partitions.  If my RDD is partitioned into 5
partitions, does that mean that I am constraining it to exist on at most 5
machines?

Thanks

Re: question about partitions

Posted by Walrus theCat <wa...@gmail.com>.
Syed,

Thanks for the tip.  I'm not sure if coalesce is doing what I'm intending
to do, which is, in effect, to subdivide the RDD into N parts (by calling
coalesce and doing operations on the partitions.)  It sounds like, however,
this won't bottleneck my processing power.  If this sets off any alarms for
anyone, feel free to chime in.


On Mon, Mar 24, 2014 at 2:50 PM, Syed A. Hashmi <sh...@cloudera.com>wrote:

> RDD.coalesce should be fine for rebalancing data across all RDD
> partitions. Coalesce is pretty handy in situations where you have sparse
> data and want to compact it (e.g. data after applying a strict filter) OR
> you know the magic number of partitions according to your cluster which
> will be optimal.
>
> One point to watch out though is that if N is greater than your current
> partitions, you need to pass shuffle=true to coalesce. If N is less than
> your current partitions (i.e. you are shrinking partitions, do not set
> shuffle=true, otherwise it will cause additional unnecessary shuffle
> overhead.
>
>
> On Mon, Mar 24, 2014 at 2:32 PM, Walrus theCat <wa...@gmail.com>wrote:
>
>> For instance, I need to work with an RDD in terms of N parts.  Will
>> calling RDD.coalesce(N) possibly cause processing bottlenecks?
>>
>>
>> On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat <wa...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> Quick question about partitions.  If my RDD is partitioned into 5
>>> partitions, does that mean that I am constraining it to exist on at most 5
>>> machines?
>>>
>>> Thanks
>>>
>>
>>
>

Re: question about partitions

Posted by "Syed A. Hashmi" <sh...@cloudera.com>.
RDD.coalesce should be fine for rebalancing data across all RDD partitions.
Coalesce is pretty handy in situations where you have sparse data and want
to compact it (e.g. data after applying a strict filter) OR you know the
magic number of partitions according to your cluster which will be optimal.

One point to watch out though is that if N is greater than your current
partitions, you need to pass shuffle=true to coalesce. If N is less than
your current partitions (i.e. you are shrinking partitions, do not set
shuffle=true, otherwise it will cause additional unnecessary shuffle
overhead.


On Mon, Mar 24, 2014 at 2:32 PM, Walrus theCat <wa...@gmail.com>wrote:

> For instance, I need to work with an RDD in terms of N parts.  Will
> calling RDD.coalesce(N) possibly cause processing bottlenecks?
>
>
> On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat <wa...@gmail.com>wrote:
>
>> Hi,
>>
>> Quick question about partitions.  If my RDD is partitioned into 5
>> partitions, does that mean that I am constraining it to exist on at most 5
>> machines?
>>
>> Thanks
>>
>
>

Re: question about partitions

Posted by Walrus theCat <wa...@gmail.com>.
For instance, I need to work with an RDD in terms of N parts.  Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?


On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat <wa...@gmail.com>wrote:

> Hi,
>
> Quick question about partitions.  If my RDD is partitioned into 5
> partitions, does that mean that I am constraining it to exist on at most 5
> machines?
>
> Thanks
>