You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2018/02/01 07:13:43 UTC
Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions
for RDD
You can just do that with mapPartitions pretty easily can’t you?
On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng <ru...@foxmail.com> wrote:
> HI all:
>
>
>
> 1, Dataset API supports operation “sortWithinPartitions”, but in
> RDD API there is no counterpart (I know there is
> “repartitionAndSortWithinPartitions”, but I don’t want to repartition the
> RDD), I have to convert RDD to Dataset for this function. Would it make
> sense to add a “sortWithinPartitions” for RDD?
>
>
>
> 2, In “aggregateByKey”/”reduceByKey”, I want to do some special
> operation (like aggregator compression) after local aggregation on each
> partitions. A similar case may be: compute ‘ApproximatePercentile’ for
> different keys by ”reduceByKey”, it may be helpful if
> ‘QuantileSummaries#compress’ is called before network communication. So I
> wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?
>
>
>
> Regards,
>
> Ruifeng
>
>
>
>
>
>
>
>
>
Re: [Core][Suggestion] sortWithinPartitions and
aggregateWithinPartitions for RDD
Posted by Ruifeng Zheng <ru...@foxmail.com>.
Do you mean in-memory processing? It works fine if all partitions are small. But when some partition don’t fit in memory, it will cause OOM.
发件人: Reynold Xin <rx...@databricks.com>
日期: 2018年2月1日 星期四 下午3:14
收件人: Ruifeng Zheng <ru...@foxmail.com>
抄送: <de...@spark.apache.org>
主题: Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD
You can just do that with mapPartitions pretty easily can’t you?
On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng <ru...@foxmail.com> wrote:
HI all:
1, Dataset API supports operation “sortWithinPartitions”, but in RDD API there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, but I don’t want to repartition the RDD), I have to convert RDD to Dataset for this function. Would it make sense to add a “sortWithinPartitions” for RDD?
2, In “aggregateByKey”/”reduceByKey”, I want to do some special operation (like aggregator compression) after local aggregation on each partitions. A similar case may be: compute ‘ApproximatePercentile’ for different keys by ”reduceByKey”, it may be helpful if ‘QuantileSummaries#compress’ is called before network communication. So I wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?
Regards,
Ruifeng