Posted to user@spark.apache.org by guojc <gu...@gmail.com> on 2013/11/15 12:54:15 UTC

Does Spark RDD have a partitionedByKey?

Hi,
  I'm wondering whether Spark RDD can have a partitionedByKey function. The
purpose of this function would be to have an RDD distributed according to a
certain partitioner and then cached, so that later joins between RDDs with
the same partitioner get a great speedup. Currently we only have groupByKey,
which generates a Seq of the desired type and is not very convenient.
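
For concreteness, here is a minimal sketch of the pattern being asked for,
using the partitionBy function on PairRDDFunctions that the replies below
converge on (the RDD pairs and the partition count are illustrative
assumptions):

    import org.apache.spark.HashPartitioner

    // Distribute the RDD by an explicit partitioner and cache the result,
    // so later joins against RDDs with the same partitioner avoid a shuffle.
    val part = new HashPartitioner(64)           // illustrative partition count
    val byKey = pairs.partitionBy(part).cache()  // pairs: RDD[(K, V)], assumed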

Btw, sorry for the last empty-body email. I mistakenly hit the send shortcut.


Best Regards,
Jiacheng Guo

Re: Does Spark RDD have a partitionedByKey?

Posted by Koert Kuipers <ko...@tresata.com>.
In fact, co-partitioning was one of the main reasons we started using Spark;
in MapReduce it is a giant pain to implement.



Re: Does Spark RDD have a partitionedByKey?

Posted by Koert Kuipers <ko...@tresata.com>.
We use partitionBy a lot to keep multiple datasets co-partitioned before
caching. It works well.
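
A minimal sketch of that pattern, assuming two pair RDDs (leftRaw and
rightRaw are illustrative names) and a single shared partitioner:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(128)  // one partitioner shared by both datasets

    // Co-partition both datasets once, then cache the partitioned layout.
    val left  = leftRaw.partitionBy(part).cache()   // leftRaw:  RDD[(K, A)]
    val right = rightRaw.partitionBy(part).cache()  // rightRaw: RDD[(K, B)]

    // Both sides now report the same partitioner, so repeated joins can
    // reuse the cached layout instead of re-shuffling either side.
    val joined = left.join(right)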



Re: Does Spark RDD have a partitionedByKey?

Posted by guojc <gu...@gmail.com>.
After looking at the API more carefully, I just found that I had overlooked
the partitionBy function on PairRDDFunctions. It's the function I need. Sorry
for the confusion.

Best Regards,
Jiacheng Guo



Re: Does Spark RDD have a partitionedByKey?

Posted by Christopher Nguyen <ct...@adatao.com>.
Jiacheng, if you're OK with using the Shark layer above Spark (and I think
for many use cases the answer would be "yes"), then you can take advantage
of Shark's co-partitioning. Or do something like
https://github.com/amplab/shark/pull/100/commits

Sent while mobile. Pls excuse typos etc.

Re: Does Spark RDD have a partitionedByKey?

Posted by guojc <gu...@gmail.com>.
Hi Meisam,
     What I want to achieve here is a bit tricky. Basically, I'm trying to
implement per-split semi-join in an iterative algorithm with Spark. It's a
very efficient join strategy for highly imbalanced data sets and provides a
huge gain over a normal join in that situation.

     Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y,Z)], both
loaded directly from HDFS, so both of them will have a partitioner of None.
X is a large, complicated struct containing a set of join keys Y. First, for
each partition of a, I extract the join keys Y from every instance of X in
that partition and build a hash set pairing each join key Y with the
partition ID. Now I have a new RDD c: RDD[(Y,PartitionID)]; I join it with b
on Y, then construct an RDD d: RDD[(PartitionID,Map[Y,Z])] by grouping on
PartitionID and building a map from Y to Z. Then, for each partition of a, I
repartition it according to its partition ID, giving an RDD
e: RDD[(PartitionID,X)]. As d and e have the same partitioner and the same
keys, they can be joined very efficiently.
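
Sketched in code, the steps above look roughly like this (a: RDD[X] and
b: RDD[(Y, Z)] are assumed to already exist; the placeholder types and the
extractKeys helper are made up for illustration):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Illustrative placeholder types; substitute the real ones.
    type Y = String                  // join key
    type Z = Int                     // payload carried by b
    case class X(keys: Seq[Y])       // large struct holding its join keys
    def extractKeys(x: X): Seq[Y] = x.keys

    val part = new HashPartitioner(a.partitions.length)

    // c: for each partition of a, the distinct (join key, partition id) pairs.
    val c: RDD[(Y, Int)] = a.mapPartitionsWithIndex { (pid, xs) =>
      xs.flatMap(extractKeys).toSet.iterator.map(y => (y, pid))
    }

    // d: join c with b on Y, then group the matched (Y, Z) pairs by partition id.
    val d: RDD[(Int, Map[Y, Z])] = c.join(b)
      .map { case (y, (pid, z)) => (pid, (y, z)) }
      .groupByKey(part)
      .mapValues(_.toMap)

    // e: a keyed by its own partition id, laid out with the same partitioner as d.
    val e: RDD[(Int, X)] = a
      .mapPartitionsWithIndex((pid, xs) => xs.map(x => (pid, x)))
      .partitionBy(part)

    // d and e share a partitioner and key space, so this join is cheap.
    val result = e.join(d)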

    The key ability I want here is to cache RDD c with the same partitioner
as RDD b, and to cache e, so that the later joins with b and d stay
efficient: the values in b will be updated from time to time, and d's
content will change accordingly. It would also be nice to be able to
repartition a by its original partition ID without actually shuffling across
the network.

You can refer to
http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
for the details of per-split semi-join.

Best Regards,
Jiacheng Guo



Re: Does Spark RDD have a partitionedByKey?

Posted by Meisam Fathi <me...@gmail.com>.
Hi guojc,

It is not clear to me what problem you are trying to solve. What do you
want to do with the result of your
groupByKey(myPartitioner).flatMapValues(x => x)? Do you want to use it in a
join? Do you want to save it to your file system? Or do you want to do
something else with it?

Thanks,
Meisam


Re: Does Spark RDD have a partitionedByKey?

Posted by guojc <gu...@gmail.com>.
Hi Meisam,
    Thank you for the response. I know each RDD has a partitioner. What I
want to achieve here is to re-partition a piece of data according to my
custom partitioner. Currently I do that with
groupByKey(myPartitioner).flatMapValues(x => x). But I'm a bit worried that
this will create additional temporary object collections, as the result is
first made into a Seq and then into a collection of tuples. Any suggestions?
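
For comparison, a minimal sketch of the two approaches (pairs: RDD[(K, V)]
and myPartitioner are assumed):

    // The workaround above: builds a Seq per key and then flattens it again,
    // materializing an intermediate collection for every key.
    val viaGroup = pairs.groupByKey(myPartitioner).flatMapValues(x => x)

    // partitionBy routes each (key, value) pair straight to its target
    // partition, with no per-key Seq in between.
    val viaPartition = pairs.partitionBy(myPartitioner)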

Best Regards,
Jiacheng Guo



Re: Does Spark RDD have a partitionedByKey?

Posted by Meisam Fathi <me...@gmail.com>.
Hi Jiacheng,

Each RDD has a partitioner. You can define your own partitioner if the
default partitioner does not suit your purpose.
You can take a look at this:
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
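
A minimal sketch of a custom partitioner, assuming integer keys (the class
name and routing rule are illustrative):

    import org.apache.spark.Partitioner

    class ModPartitioner(val numParts: Int) extends Partitioner {
      def numPartitions: Int = numParts
      def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Int]
        ((k % numParts) + numParts) % numParts  // non-negative modulo
      }
      // Equal partitioners let Spark recognize co-partitioned RDDs
      // and skip the shuffle when joining them.
      override def equals(other: Any): Boolean = other match {
        case p: ModPartitioner => p.numParts == numParts
        case _                 => false
      }
      override def hashCode: Int = numParts
    }

    // Usage (pairs: RDD[(Int, V)] is assumed):
    val repartitioned = pairs.partitionBy(new ModPartitioner(32))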

Thanks,
Meisam
