Posted to user@spark.apache.org by 周浥尘 <zh...@gmail.com> on 2018/08/20 12:52:57 UTC

Why repartitionAndSortWithinPartitions is slower than MapReduce

Hi team,

I found that the Spark method `repartitionAndSortWithinPartitions` spends
twice as much time as the equivalent MapReduce job in some cases.
I want to repartition the dataset according to split keys and save each
partition to a file in ascending key order. As the doc says,
repartitionAndSortWithinPartitions “is more efficient than calling
`repartition` and then sorting within each partition because it can push
the sorting down into the shuffle machinery.” I expected it to be faster
than MapReduce, but in practice it is much slower. I also adjusted several
Spark configurations, but that didn't help. (Both Spark and MapReduce run
on a three-node cluster and use the same number of partitions.)
Can this behavior be explained, and is there any way to improve Spark's
performance here?
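
For reference, the job follows roughly this pattern (a simplified sketch,
not my actual code; the partitioner, split keys, and paths are illustrative):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Illustrative range partitioner: routes each key to the range defined by
// pre-computed split keys, analogous to MapReduce's TotalOrderPartitioner.
class SplitKeyPartitioner(splits: Array[String]) extends Partitioner {
  override def numPartitions: Int = splits.length + 1
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[String]
    var i = 0
    while (i < splits.length && k >= splits(i)) i += 1
    i  // keys below splits(0) go to partition 0, and so on
  }
}

object RepartitionAndSortJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-and-sort"))

    val splits = Array("g", "n", "t")  // hypothetical split keys

    // Key each record by its first tab-separated field.
    val records = sc.textFile("hdfs:///input/path")
      .map(line => (line.split('\t')(0), line))

    // Partition by split key; the sort happens inside the shuffle itself,
    // then each partition is written out as one ascending-sorted file.
    records
      .repartitionAndSortWithinPartitions(new SplitKeyPartitioner(splits))
      .values  // keep the full record, drop the key
      .saveAsTextFile("hdfs:///output/path")

    sc.stop()
  }
}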

Thanks & Regards,
Yichen

Re: Why repartitionAndSortWithinPartitions is slower than MapReduce

Posted by Koert Kuipers <ko...@tresata.com>.
I assume you are using RDDs? What are you doing after the repartitioning +
sorting, if anything?
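
The reason I ask: anything downstream that shuffles again throws away the
in-partition ordering you just paid for in the first shuffle. Roughly (an
illustrative sketch, with hypothetical paths):

import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

def writeOut(rdd: RDD[(String, String)], partitioner: Partitioner): Unit = {
  // Sort survives: saving immediately keeps each output file in key order.
  rdd.repartitionAndSortWithinPartitions(partitioner)
    .saveAsTextFile("hdfs:///out/sorted")

  // Sort wasted: a second shuffle with a different partitioner discards
  // the ordering established by the first one.
  rdd.repartitionAndSortWithinPartitions(partitioner)
    .reduceByKey(new HashPartitioner(64), _ + _)
    .saveAsTextFile("hdfs:///out/unsorted")
}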


On Aug 20, 2018 11:22, "周浥尘" <zh...@gmail.com> wrote:

In addition to my previous email,
Environment: Spark 2.1.2, Hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6

周浥尘 <zh...@gmail.com> wrote on Mon, Aug 20, 2018 at 8:52 PM:

> Hi team,
>
> I found that the Spark method `repartitionAndSortWithinPartitions`
> spends twice as much time as the equivalent MapReduce job in some cases.
> I want to repartition the dataset according to split keys and save each
> partition to a file in ascending key order. As the doc says,
> repartitionAndSortWithinPartitions “is more efficient than calling
> `repartition` and then sorting within each partition because it can push
> the sorting down into the shuffle machinery.” I expected it to be faster
> than MapReduce, but in practice it is much slower. I also adjusted
> several Spark configurations, but that didn't help. (Both Spark and
> MapReduce run on a three-node cluster and use the same number of
> partitions.)
> Can this behavior be explained, and is there any way to improve Spark's
> performance here?
>
> Thanks & Regards,
> Yichen
>

Re: Why repartitionAndSortWithinPartitions is slower than MapReduce

Posted by 周浥尘 <zh...@gmail.com>.
In addition to my previous email,
Environment: Spark 2.1.2, Hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6

周浥尘 <zh...@gmail.com> wrote on Mon, Aug 20, 2018 at 8:52 PM:

> Hi team,
>
> I found that the Spark method `repartitionAndSortWithinPartitions`
> spends twice as much time as the equivalent MapReduce job in some cases.
> I want to repartition the dataset according to split keys and save each
> partition to a file in ascending key order. As the doc says,
> repartitionAndSortWithinPartitions “is more efficient than calling
> `repartition` and then sorting within each partition because it can push
> the sorting down into the shuffle machinery.” I expected it to be faster
> than MapReduce, but in practice it is much slower. I also adjusted
> several Spark configurations, but that didn't help. (Both Spark and
> MapReduce run on a three-node cluster and use the same number of
> partitions.)
> Can this behavior be explained, and is there any way to improve Spark's
> performance here?
>
> Thanks & Regards,
> Yichen
>
