Posted to user@spark.apache.org by Ted Yu <yu...@gmail.com> on 2016/01/11 20:11:38 UTC

Re: partitioning RDD

Hi,
Please use a proper subject line when sending email to user@.

In your example below, what do the values inside the curly braces represent?
I assume they are not the keys, since values for the same key should go to
the same partition.

Cheers

On Mon, Jan 11, 2016 at 10:51 AM, Daniel Imberman
<daniel.imberman@gmail.com> wrote:

> Hi all,
>
> I'm looking for a way to efficiently partition an RDD, but allow the same
> data to exist on multiple partitions.
>
>
> Let's say I have a key-value RDD with keys {1,2,3,4}
>
> I want to be able to repartition the RDD so that the partitions look
> like
>
> p1 = {1,2}
> p2 = {2,3}
> p3 = {3,4}
>
> Locality is important in this situation, as I would be comparing values
> within each partition.
>
> Does anyone have any thoughts as to how I could go about doing this?
>
> Thank you
>

Re: partitioning RDD

Posted by Daniel Imberman <da...@gmail.com>.

Hi Ted,

Sorry about that. I will be more careful in future emails.

The values represent buckets (1 represents all values with key 1, 2
represents all values with key 2, etc.)

The issue is that I want all values of key 1 and key 2 in partition 1, and
all values of key 2 and key 3 in partition 2. The idea is that I can
pre-determine which buckets I want to compare to one another and then use
mapPartitions to handle those comparisons individually.

The ideas I've had in the past seem like they would be very inefficient. My
original thought was to have a List of RDDs. Something along the lines of
this:

(this is purely pseudocode so there may be errors)

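import org.apache.spark.rdd.RDD

// Assumes a Vector type (e.g. org.apache.spark.mllib.linalg.Vector) is in scope
val r: RDD[(Int, Vector)] = ???
val a = (1 to 5).toList
// One filtered RDD per pair of adjacent buckets: i and (i + 1) % 5
val pairs = a.map(i => r.filter { case (key, _) => key == i || key == (i + 1) % 5 })
pairs.foreach { rdd => /* whatever mapping I need to do */ }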

but this solution seems like it would potentially require a lot of shuffles
once I complete whatever comparisons I need to perform and would not have
the benefit of locality between buckets.

Any help that might point me in the correct direction would be greatly
appreciated.

Thank you!

Daniel
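
[Editor's note] One possible direction for the overlapping layout described above, offered as a minimal sketch rather than an answer from the thread: copy each record once per partition that needs it with flatMap, tag each copy with its target partition, and let a custom Partitioner route on that tag. Partitioner, partitionBy and mapPartitionsWithIndex are part of Spark's RDD API. The names OverlapPartitioner, OverlapExample and targets are hypothetical, as are the sample data and the local[2] master. Partitions are numbered from 0 here, so p1, p2, p3 in the original question correspond to partitions 0, 1, 2.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: keys have the form (targetPartition, bucket),
// and each record is routed to the partition named by the first component.
class OverlapPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (p: Int, _) => p
    case other => throw new IllegalArgumentException(s"unexpected key: $other")
  }
}

object OverlapExample {
  // Bucket b is needed in partition b - 1 (paired with bucket b + 1) and in
  // partition b - 2 (paired with bucket b - 1). Targets outside the valid
  // partition range are dropped.
  def targets(bucket: Int, numPartitions: Int): Seq[Int] =
    Seq(bucket - 2, bucket - 1).filter(p => p >= 0 && p < numPartitions)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and toy data, for illustration only.
    val conf = new SparkConf().setAppName("overlap").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val r = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
    val numPartitions = 3

    // Duplicate each record once per target partition, then shuffle once.
    val overlapped = r
      .flatMap { case (bucket, value) =>
        targets(bucket, numPartitions).map(p => ((p, bucket), value))
      }
      .partitionBy(new OverlapPartitioner(numPartitions))

    // List the buckets that ended up in each partition.
    overlapped.mapPartitionsWithIndex { (idx, it) =>
      Iterator(idx -> it.map { case ((_, bucket), _) => bucket }.toList.sorted)
    }.collect().foreach(println)

    sc.stop()
  }
}
```

With the four sample keys, the intent is that partition 0 holds buckets 1 and 2, partition 1 holds buckets 2 and 3, and partition 2 holds buckets 3 and 4, so the comparisons can then run inside mapPartitions. The trade-off is that every interior bucket is stored twice, in exchange for a single shuffle instead of one filter pass per bucket pair. A production partitioner would normally also override equals and hashCode so Spark can recognize two instances as the same partitioning.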
