Posted to user@spark.apache.org by ansriniv <an...@gmail.com> on 2014/06/20 15:57:59 UTC

parallel Reduce within a key

Hi,

I am on Spark 0.9.0

I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
cores in the cluster).
I have an input rdd with 64 partitions.

I am running "rdd.mapPartitions(...).reduce(...)" on it.

I can see that I get full parallelism in the map stage (all 32 of my cores are
busy simultaneously). However, when it comes to reduce(), the outputs of the
mappers are all reduced SERIALLY. Further, all of the reduce processing happens
on only one of the workers.

I was expecting that the outputs of the 16 mappers on node 1 would be
reduced in parallel on node 1, while the outputs of the 16 mappers on node 2
would be reduced in parallel on node 2, with one final inter-node
reduce (of node 1's reduced result and node 2's reduced result).

Isn't parallel reduce supported WITHIN a key (in this case I have no key)?
(I know that there is parallelism in reduce across keys.)
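
To make this concrete, here is a stripped-down version of what I am running
(the master URL, the data and the sum are just placeholders for my real job):

  import org.apache.spark.SparkContext

  // Placeholder setup: 2 workers x 16 cores, input split into 64 partitions.
  val sc  = new SparkContext("spark://master:7077", "parallel-reduce-test")
  val rdd = sc.parallelize(1 to 1000000, 64)

  val result = rdd
    .mapPartitions(iter => Iterator(iter.sum)) // map stage: all 32 cores busy
    .reduce(_ + _)                             // this combine appears to run serially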

Best Regards
Anand



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallel-Reduce-within-a-key-tp7998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: parallel Reduce within a key

Posted by Michael Malak <mi...@yahoo.com>.
How about a treeReduceByKey? :-)



Re: parallel Reduce within a key

Posted by DB Tsai <db...@stanford.edu>.
Currently, the reduce operation combines the results from the mappers
sequentially, so it's O(n).

Xiangrui is working on treeReduce, which is O(log(n)). Based on the
benchmark, it dramatically increases performance. You can test the
code in his branch:
https://github.com/apache/spark/pull/1110
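
Roughly, the idea is to first reduce each partition locally and then merge the
partial results in a few parallel rounds, so only a handful of values ever
reach the driver. A sketch of the approach (illustrative only, not the actual
code in the PR):

  import scala.reflect.ClassTag
  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD implicits
  import org.apache.spark.rdd.RDD

  def treeReduceSketch[T: ClassTag](rdd: RDD[T], f: (T, T) => T, depth: Int = 2): T = {
    require(depth >= 1, "depth must be at least 1")
    // 1) Reduce each partition locally: one partial result per partition.
    var partials: RDD[T] = rdd.mapPartitions { iter =>
      if (iter.hasNext) Iterator(iter.reduceLeft(f)) else Iterator.empty
    }
    // 2) Merge the partials in rounds, shrinking the partition count each
    //    round so the merges happen in parallel across the cluster.
    var numPartitions = partials.partitions.length
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
    while (numPartitions > scale) {
      numPartitions = math.max(numPartitions / scale, 1)
      partials = partials
        .mapPartitionsWithIndex { (i, iter) => iter.map(x => (i % numPartitions, x)) }
        .reduceByKey(new HashPartitioner(numPartitions), f)
        .values
    }
    // 3) Only a few values remain; combining them on the driver is cheap.
    partials.reduce(f)
  }

With depth = 2 and 64 partitions, the 64 partial results are merged into 8
partitions in one parallel round, and only those 8 values are combined on the
driver at the end.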

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

