Posted to user@spark.apache.org by Akshat Aranya <aa...@gmail.com> on 2014/09/17 01:27:13 UTC

partitioned groupBy

I have a use case where my RDD is set up as follows:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle.  It doesn't matter if the partitions
of the inverted RDD have non-unique keys across partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire dataset?
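In plain Python (outside Spark; the function name is made up), the per-partition inversion I have in mind looks like this, applied to each partition independently:

```python
from collections import defaultdict

def invert_partition(records):
    """Invert (key, list-of-values) pairs within one partition only."""
    inverted = defaultdict(list)
    for key, values in records:
        for value in values:
            inverted[value].append(key)
    return sorted(inverted.items())

# The two partitions from the example above, processed independently:
part0 = [("K1", ["V1", "V2"]), ("K2", ["V2"])]
part1 = [("K3", ["V1"]), ("K4", ["V3"])]
print(invert_partition(part0))  # [('V1', ['K1']), ('V2', ['K1', 'K2'])]
print(invert_partition(part1))  # [('V1', ['K3']), ('V3', ['K4'])]
```

Note that V1 shows up as a key in both outputs, which is fine for my purposes.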

Re: partitioned groupBy

Posted by Patrick Wendell <pw...@gmail.com>.
If you'd like to re-use the resulting inverted map, you can persist the result:

x = myRdd.mapPartitions(create inverted map).persist()

Your function would create the reverse map and then return an iterator
over the entries in that map.
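A minimal sketch of such a function, in plain Python rather than runnable Spark code (invert_partition is a hypothetical name): it builds the reverse map eagerly and then returns an iterator, matching the iterator-in/iterator-out contract of mapPartitions.

```python
def invert_partition(records):
    # Build the reverse map eagerly for this partition...
    inverted = {}
    for key, values in records:
        for value in values:
            inverted.setdefault(value, []).append(key)
    # ...then hand back an iterator over its entries,
    # which is what mapPartitions expects you to return.
    return iter(inverted.items())

# In Spark this function would be the argument to
# myRdd.mapPartitions(invert_partition).persist()
result = list(invert_partition([("K1", ["V1", "V2"]), ("K2", ["V2"])]))
print(result)  # [('V1', ['K1']), ('V2', ['K1', 'K2'])]
```

Persisting the resulting RDD means the reverse maps are materialized once and reused across actions, instead of being rebuilt on every compute() call.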



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: partitioned groupBy

Posted by Akshat Aranya <aa...@gmail.com>.
Patrick,

If I understand this correctly, I won't be able to do this in the closure
provided to mapPartitions() because that's going to be stateless, in the
sense that a hash map that I create within the closure would only be useful
for one call of MapPartitionsRDD.compute().  I guess I would need to
override mapPartitions() directly within my RDD.  Right?


Re: partitioned groupBy

Posted by Patrick Wendell <pw...@gmail.com>.
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
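That hand-built hash map could be sketched like this for a single partition, in plain Python (the function name is made up; in Spark the equivalent would run inside mapPartitions):

```python
def build_inverse_map(pairs):
    # Hash map built by hand, local to one partition; no shuffle involved.
    reverse = {}
    for key, values in pairs:
        for value in values:
            reverse.setdefault(value, []).append(key)
    return reverse

print(build_inverse_map([("K3", ["V1"]), ("K4", ["V3"])]))
# {'V1': ['K3'], 'V3': ['K4']}
```

Since each partition must hold its whole inverse map at once, this only works when individual partitions fit in memory.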

