Posted to user@spark.apache.org by Punit Naik <na...@gmail.com> on 2016/07/14 17:09:30 UTC

repartitionAndSortWithinPartitions HELP

Hi guys

In my Spark/Scala code I am implementing a secondary sort. When I call the
"repartitionAndSortWithinPartitions" method, will the whole RDD be sorted,
or only the individual partitions?
If it's the latter, will applying "sortByKey" after
"repartitionAndSortWithinPartitions" be faster now that the individual
partitions are sorted?

-- 
Thank You

Regards

Punit Naik

Re: repartitionAndSortWithinPartitions HELP

Posted by Koert Kuipers <ko...@tresata.com>.

sortByKey needs to use a range partitioner, which is a very particular
partitioner, so you cannot supply your own partitioner to it.

you should not have to shuffle twice to do a secondary sort.
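
Editor's sketch, not part of the original message: sortByKey takes no
partitioner argument, but repartitionAndSortWithinPartitions does, and
passing it a RangePartitioner should give a globally sorted result in a
single shuffle. The object and method names below are made up for
illustration, and the code assumes Spark is on the classpath and runs in
local mode.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object RangeSortDemo {
  // Hypothetical helper: returns the keys in global order using one
  // repartitionAndSortWithinPartitions call with a range partitioner.
  def sortedWithRangePartitioner(sc: SparkContext, keys: Seq[Int], parts: Int): Array[Int] = {
    val pairs = sc.parallelize(keys.map(k => (k, ())))
    // RangePartitioner samples the RDD to choose key ranges, so partition i
    // only holds keys that sort before the keys in partition i + 1.
    val byRange = new RangePartitioner(parts, pairs)
    pairs
      .repartitionAndSortWithinPartitions(byRange)
      .keys
      .collect() // collect() concatenates the partitions in partition order
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("range-sort-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(sortedWithRangePartitioner(sc, Seq(5, 1, 4, 2, 3, 6), 2).mkString(", "))
    } finally sc.stop()
  }
}
```

With a hash partitioner in place of the range partitioner, the same call
would only guarantee order inside each partition.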


On Thu, Jul 14, 2016 at 2:22 PM, Punit Naik <na...@gmail.com> wrote:

> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?

Re: repartitionAndSortWithinPartitions HELP

Posted by Punit Naik <na...@gmail.com>.

Okay that clears my doubt! Thanks a lot.

On 15-Jul-2016 7:43 PM, "Koert Kuipers" <ko...@tresata.com> wrote:

spark's shuffle mechanism takes care of this kind of optimization
internally when you use the sort-based shuffle (which is the default).


Re: repartitionAndSortWithinPartitions HELP

Posted by Koert Kuipers <ko...@tresata.com>.

spark's shuffle mechanism takes care of this kind of optimization
internally when you use the sort-based shuffle (which is the default).

On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik <na...@gmail.com> wrote:

> I meant to say that first we can sort the individual partitions and then
> sort them again by merging. Sort of a divide and conquer mechanism.
> Does sortByKey take care of all this internally?

Re: repartitionAndSortWithinPartitions HELP

Posted by Punit Naik <na...@gmail.com>.

I meant to say that we could first sort the individual partitions and then
merge the sorted partitions into a fully sorted RDD, as a sort of
divide-and-conquer mechanism.
Does sortByKey take care of all this internally?

On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik <na...@gmail.com> wrote:

> Can we increase the sorting speed of RDD by doing a secondary sort first?

Re: repartitionAndSortWithinPartitions HELP

Posted by Punit Naik <na...@gmail.com>.

Can we increase the sorting speed of an RDD by doing a secondary sort first?

On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <na...@gmail.com> wrote:

> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?

Re: repartitionAndSortWithinPartitions HELP

Posted by Punit Naik <na...@gmail.com>.

Okay. Can't I supply the same partitioner I used for
"repartitionAndSortWithinPartitions" as an argument to "sortByKey"?

On 14-Jul-2016 11:38 PM, "Koert Kuipers" <ko...@tresata.com> wrote:

> repartitionAndSortWithinPartitions partitions the rdd and sorts within
> each partition. so each partition is fully sorted, but the rdd is not
> sorted.
>
> sortByKey is basically the same as repartitionAndSortWithinPartitions
> except it uses a range partitioner so that the entire rdd is sorted.
> however since sortByKey uses a different partitioner than
> repartitionAndSortWithinPartitions you do not get much benefit from running
> sortByKey after repartitionAndSortWithinPartitions (because all the data
> will get shuffled again)

Re: repartitionAndSortWithinPartitions HELP

Posted by Koert Kuipers <ko...@tresata.com>.

repartitionAndSortWithinPartitions partitions the rdd and sorts within each
partition. so each partition is fully sorted, but the rdd as a whole is not
sorted.

sortByKey is basically the same as repartitionAndSortWithinPartitions,
except it uses a range partitioner so that the entire rdd is sorted.
however, since sortByKey uses a different partitioner than
repartitionAndSortWithinPartitions, you do not get much benefit from running
sortByKey after repartitionAndSortWithinPartitions (because all the data
will get shuffled again).
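
Editor's sketch, not part of the original message: a small way to see the
difference described above. It keeps the partition boundaries visible with
glom() so the per-partition order and the overall order can be compared.
The object and method names are hypothetical, and the code assumes Spark
is on the classpath and runs in local mode.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionSortDemo {
  // Keys per partition after hash-partitioning and sorting inside each partition.
  def locallySortedKeys(sc: SparkContext, keys: Seq[Int], parts: Int): Array[Array[Int]] =
    sc.parallelize(keys.map(k => (k, ())))
      .repartitionAndSortWithinPartitions(new HashPartitioner(parts))
      .keys
      .glom() // one array per partition
      .collect()

  // Keys per partition after sortByKey, which range-partitions before sorting.
  def globallySortedKeys(sc: SparkContext, keys: Seq[Int], parts: Int): Array[Array[Int]] =
    sc.parallelize(keys.map(k => (k, ())))
      .sortByKey(ascending = true, numPartitions = parts)
      .keys
      .glom()
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sort-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val keys = Seq(5, 1, 4, 2, 3, 6)
      // Each printed line is sorted, but the lines read in sequence need not be.
      locallySortedKeys(sc, keys, 2).foreach(p => println(p.mkString(", ")))
      // Each printed line is sorted, and so are the lines read in sequence.
      globallySortedKeys(sc, keys, 2).foreach(p => println(p.mkString(", ")))
    } finally sc.stop()
  }
}
```

Both calls shuffle the data once. Running the second after the first would
shuffle everything again, because the two calls use different partitioners.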

On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <na...@gmail.com> wrote:

> Hi Koert
>
> I have already used "repartitionAndSortWithinPartitions" for secondary
> sorting and it works fine. Just wanted to know whether it will sort the
> entire RDD or not.

Re: repartitionAndSortWithinPartitions HELP

Posted by Punit Naik <na...@gmail.com>.

Hi Koert

I have already used "repartitionAndSortWithinPartitions" for secondary
sorting and it works fine. Just wanted to know whether it will sort the
entire RDD or not.

On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:

> repartitionAndSortWithinPartitions sorts by key, not by the values within
> each key, so it is not really a secondary sort by itself.
>
> for secondary sort also check out:
> https://github.com/tresata/spark-sorted

Re: repartitionAndSortWithinPartitions HELP

Posted by Koert Kuipers <ko...@tresata.com>.

repartitionAndSortWithinPartitions sorts by key, not by the values within
each key, so it is not really a secondary sort by itself.

for secondary sort also check out:
https://github.com/tresata/spark-sorted
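
Editor's sketch, not part of the original message: one common way to get a
secondary sort out of repartitionAndSortWithinPartitions is to move the
value into a composite key and partition on the primary part only. This is
a generic pattern, not the spark-sorted API. PrimaryKeyPartitioner,
SecondarySortDemo and the (user, timestamp) data are hypothetical names
chosen for illustration, and the code assumes Spark is on the classpath
and runs in local mode.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: routes a composite (primary, secondary) key by
// its primary part only, so all records of one primary key stay together.
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (primary, _) => Math.floorMod(primary.hashCode, partitions)
    case other => throw new IllegalArgumentException(s"expected a pair key, got $other")
  }
}

object SecondarySortDemo {
  // Returns one array per partition, each ordered by user and then by timestamp.
  def sortByUserThenTime(
      sc: SparkContext,
      events: Seq[(String, Long)],
      parts: Int): Array[Array[(String, Long)]] =
    sc.parallelize(events)
      // Move the value into the key. The implicit tuple Ordering compares
      // the user first and the timestamp second.
      .map { case (user, ts) => ((user, ts), ()) }
      .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(parts))
      .keys
      .glom()
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("secondary-sort-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val events = Seq(("u1", 3L), ("u2", 1L), ("u1", 1L), ("u2", 2L), ("u1", 2L))
      sortByUserThenTime(sc, events, 2).foreach(p => println(p.mkString(", ")))
    } finally sc.stop()
  }
}
```

This needs only one shuffle. Within a partition, each user's records are
contiguous and ordered by timestamp, so they can be consumed as a stream
with mapPartitions.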


On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <na...@gmail.com> wrote:

> Hi guys
>
> In my Spark/Scala code I am implementing a secondary sort. When I call the
> "repartitionAndSortWithinPartitions" method, will the whole RDD be sorted,
> or only the individual partitions?
> If it's the latter, will applying "sortByKey" after
> "repartitionAndSortWithinPartitions" be faster now that the individual
> partitions are sorted?
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>