Posted to user@spark.apache.org by Daniel Haviv <da...@veracity-group.com> on 2015/07/20 11:04:55 UTC

Local Repartition

Hi,
My data is constructed from a lot of small files, which results in a lot of
partitions per RDD.
Is there some way to locally repartition the RDD without shuffling so that
all of the partitions that reside on a specific node will become X
partitions on the same node?

Thank you.
Daniel

Re: Local Repartition

Posted by Doug Balog <do...@dugos.com>.
Hi Daniel,
Take a look at .coalesce()
I’ve seen good results by coalescing to num executors * 10, but I’m still trying to figure out the 
optimal number of partitions per executor. 
To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1)
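
A rough sketch of what that looks like in practice, assuming a SparkContext named sc and an already-loaded RDD named rdd (both placeholder names); the -1 fallback covers setups where spark.executor.instances is not set, e.g. dynamic allocation:

    // Executor count from the conf; -1 when the property is not set.
    val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)

    // Coalesce to roughly 10 partitions per executor. With the default
    // shuffle = false, Spark merges existing partitions without a shuffle
    // rather than redistributing records across the network.
    val coalesced =
      if (numExecutors > 0) rdd.coalesce(numExecutors * 10)
      else rdd // executor count unknown, leave the RDD as is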


Cheers,

Doug

> On Jul 20, 2015, at 5:04 AM, Daniel Haviv <da...@veracity-group.com> wrote:
> 
> Hi,
> My data is constructed from a lot of small files, which results in a lot of partitions per RDD.
> Is there some way to locally repartition the RDD without shuffling so that all of the partitions that reside on a specific node will become X partitions on the same node?
> 
> Thank you.
> Daniel




Re: Local Repartition

Posted by Daniel Haviv <da...@veracity-group.com>.
Great explanation.

Thanks guys!

Daniel

> On Jul 20, 2015, at 18:12, Silvio Fiorito <si...@granturing.com> wrote:
> 
> Hi Daniel,
> 
> Coalesce, by default, will not cause a shuffle. The second parameter, when set to true, will cause a full shuffle; this is actually what repartition does (it calls coalesce with shuffle=true).
> 
> It will attempt to keep colocated partitions together (as you describe) on the same executor. What may happen is that you lose data locality if you reduce the partitions to fewer than the number of executors. You obviously also reduce parallelism, so you need to be aware of that as you decide when to call coalesce.
> 
> Thanks,
> Silvio
> 
> From: Daniel Haviv
> Date: Monday, July 20, 2015 at 4:59 PM
> To: Doug Balog
> Cc: user
> Subject: Re: Local Repartition
> 
> Thanks Doug,
> coalesce might invoke a shuffle as well. 
> I don't think what I'm suggesting exists as a feature, but it definitely should.
> 
> Daniel
> 
>> On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <do...@balog.net> wrote:
>> Hi Daniel,
>> Take a look at .coalesce()
>> I’ve seen good results by coalescing to num executors * 10, but I’m still trying to figure out the
>> optimal number of partitions per executor.
>> To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1)
>> 
>> 
>> Cheers,
>> 
>> Doug
>> 
>> > On Jul 20, 2015, at 5:04 AM, Daniel Haviv <da...@veracity-group.com> wrote:
>> >
>> > Hi,
>> > My data is constructed from a lot of small files, which results in a lot of partitions per RDD.
>> > Is there some way to locally repartition the RDD without shuffling so that all of the partitions that reside on a specific node will become X partitions on the same node?
>> >
>> > Thank you.
>> > Daniel
> 

Re: Local Repartition

Posted by Silvio Fiorito <si...@granturing.com>.
Hi Daniel,

Coalesce, by default, will not cause a shuffle. The second parameter, when set to true, will cause a full shuffle; this is actually what repartition does (it calls coalesce with shuffle=true).

It will attempt to keep colocated partitions together (as you describe) on the same executor. What may happen is that you lose data locality if you reduce the partitions to fewer than the number of executors. You obviously also reduce parallelism, so you need to be aware of that as you decide when to call coalesce.
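
A quick side-by-side sketch of the difference, with rdd standing in for any existing RDD (a placeholder name):

    // Default coalesce: narrow dependency, no shuffle.
    val merged = rdd.coalesce(100)

    // Setting the second parameter to true forces a full shuffle...
    val reshuffled = rdd.coalesce(100, shuffle = true)

    // ...which is what repartition does under the hood.
    val repartitioned = rdd.repartition(100)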

Thanks,
Silvio

From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition

Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting exists as a feature, but it definitely should.

Daniel

On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <do...@balog.net> wrote:
Hi Daniel,
Take a look at .coalesce()
I’ve seen good results by coalescing to num executors * 10, but I’m still trying to figure out the
optimal number of partitions per executor.
To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1)


Cheers,

Doug

> On Jul 20, 2015, at 5:04 AM, Daniel Haviv <da...@veracity-group.com> wrote:
>
> Hi,
> My data is constructed from a lot of small files, which results in a lot of partitions per RDD.
> Is there some way to locally repartition the RDD without shuffling so that all of the partitions that reside on a specific node will become X partitions on the same node?
>
> Thank you.
> Daniel



Re: Local Repartition

Posted by Daniel Haviv <da...@veracity-group.com>.
Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting exists as a feature, but it definitely should.

Daniel

On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <do...@balog.net> wrote:

> Hi Daniel,
> Take a look at .coalesce()
> I’ve seen good results by coalescing to num executors * 10, but I’m still
> trying to figure out the
> optimal number of partitions per executor.
> To get the number of executors,
> sc.getConf.getInt("spark.executor.instances", -1)
>
>
> Cheers,
>
> Doug
>
> > On Jul 20, 2015, at 5:04 AM, Daniel Haviv <daniel.haviv@veracity-group.com> wrote:
> >
> > Hi,
> > My data is constructed from a lot of small files, which results in a lot
> of partitions per RDD.
> > Is there some way to locally repartition the RDD without shuffling so
> that all of the partitions that reside on a specific node will become X
> partitions on the same node?
> >
> > Thank you.
> > Daniel
>
>