You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2014/04/23 12:02:15 UTC

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/506

    [SPARK-1485][MLLIB] Implement Butterfly AllReduce

    The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This may create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce can help free up the driver. This PR contains a simple butterfly AllReduce implementation. Compared it with reduce + broadcast (http) on a 16-node EC2 cluster (with slow connection), and saw 2x speed-up on vectors of size 1k to 10m.
    
    Possible improvements:
    
    1. Each executor only needs one copy.
    2. Better handling when the number of partitions is not a power of two?  

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark butterfly

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #506
    
----
commit 76f4bb7b29e0b605561dac048535380b645839fd
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-18T04:43:32Z

    init impl of allReduce

commit d14300540da65900a2a19f8edc5adc5ce9c3e72d
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-22T21:11:25Z

    move allReduce to mllib

commit 98c329d5ac12ec341a9cf6355cd67ea031189a24
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-23T09:44:51Z

    allow arbitrary number of partitions

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark issue #506: [WIP][SPARK-1485][MLLIB] Implement Butterfly AllReduce

Posted by sidps <gi...@git.apache.org>.
Github user sidps commented on the issue:

    https://github.com/apache/spark/pull/506
  
    I've been curious about underlying implementations of such operations, has the ring all-reduce technique been considered? http://research.baidu.com/bringing-hpc-techniques-deep-learning/


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41144634
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP][SPARK-1485][MLLIB] Implement Butterfly A...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-46380865
  
    Thanks all for reviewing this PR! I found the butterfly pattern introduces complex dependency that slows down the computation. In my tests, a good approach for Spark is tree reduce + bt broadcast. So I'm closing this one now in favor of https://github.com/apache/spark/pull/1110 . 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11921347
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/PartitionSlicingRDD.scala ---
    @@ -0,0 +1,43 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.rdd
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.{TaskContext, Partition}
    +
    +/**
    + * Represents an RDD obtained from partition slicing of its parent RDD.
    --- End diff --
    
    Isn't this the same as PartitionPruningRDD ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11921440
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala ---
    @@ -44,6 +44,54 @@ class RDDFunctions[T: ClassTag](self: RDD[T]) {
           new SlidingRDD[T](self, windowSize)
         }
       }
    +
    +  /**
    +   * Returns an RDD with the specified slice of partitions.
    +   */
    +  def slicePartitions(slice: Seq[Int]): RDD[T] = {
    +    new PartitionSlicingRDD(self, slice)
    +  }
    +
    +  /**
    +   * Computes the all-reduced RDD of the parent RDD, which has the same number of partitions and
    +   * locality information as its parent RDD. Each partition contains only one record, which is the
    +   * same as calling `RDD#reduce` on its parent RDD.
    +   *
    +   * @param f reducer
    +   * @return all-reduced RDD
    +   */
    +  def allReduce(f: (T, T) => T): RDD[T] = {
    +    val numPartitions = self.partitions.size
    +    require(numPartitions > 0, "Parent RDD does not have any partitions.")
    +    val nextPowerOfTwo = {
    +      var i = 0
    +      while ((numPartitions >> i) > 0) {
    +        i += 1
    +      }
    +      1 << i
    +    }
    +    var butterfly = self.mapPartitions( (iter) =>
    +      Iterator(iter.reduce(f)),
    +      preservesPartitioning = true
    +    ).cache()
    +
    +    if (nextPowerOfTwo > numPartitions) {
    +      val padding = self.context.parallelize(Seq.empty[T], nextPowerOfTwo - numPartitions)
    +      butterfly = butterfly.union(padding)
    +    }
    +
    +    var offset = nextPowerOfTwo >> 1
    +    while (offset > 0) {
    +      butterfly = new ButterflyReducedRDD[T](butterfly, f, offset).cache()
    --- End diff --
    
    IMHO its a little risky to cache all the iterations of this loop in terms of memory usage. The right thing to do is to probably hold references to them and unpersist at the end ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr closed the pull request at:

    https://github.com/apache/spark/pull/506


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11921949
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala ---
    @@ -44,6 +44,54 @@ class RDDFunctions[T: ClassTag](self: RDD[T]) {
           new SlidingRDD[T](self, windowSize)
         }
       }
    +
    +  /**
    +   * Returns an RDD with the specified slice of partitions.
    +   */
    +  def slicePartitions(slice: Seq[Int]): RDD[T] = {
    +    new PartitionSlicingRDD(self, slice)
    +  }
    +
    +  /**
    +   * Computes the all-reduced RDD of the parent RDD, which has the same number of partitions and
    +   * locality information as its parent RDD. Each partition contains only one record, which is the
    +   * same as calling `RDD#reduce` on its parent RDD.
    +   *
    +   * @param f reducer
    +   * @return all-reduced RDD
    +   */
    +  def allReduce(f: (T, T) => T): RDD[T] = {
    +    val numPartitions = self.partitions.size
    +    require(numPartitions > 0, "Parent RDD does not have any partitions.")
    +    val nextPowerOfTwo = {
    +      var i = 0
    +      while ((numPartitions >> i) > 0) {
    +        i += 1
    +      }
    +      1 << i
    +    }
    +    var butterfly = self.mapPartitions( (iter) =>
    +      Iterator(iter.reduce(f)),
    +      preservesPartitioning = true
    +    ).cache()
    +
    +    if (nextPowerOfTwo > numPartitions) {
    +      val padding = self.context.parallelize(Seq.empty[T], nextPowerOfTwo - numPartitions)
    +      butterfly = butterfly.union(padding)
    +    }
    +
    +    var offset = nextPowerOfTwo >> 1
    +    while (offset > 0) {
    +      butterfly = new ButterflyReducedRDD[T](butterfly, f, offset).cache()
    --- End diff --
    
    Each partition will be visited twice in a butterfly step. If the previous stage is not cached or falls out cache, the cost is huge. I'm looking at the `RangeDependency` now. Maybe it can help.
    
    Btw, I don't quite understand what do you mean by `hold references to them`. Could you elaborate?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11922137
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala ---
    @@ -44,6 +44,54 @@ class RDDFunctions[T: ClassTag](self: RDD[T]) {
           new SlidingRDD[T](self, windowSize)
         }
       }
    +
    +  /**
    +   * Returns an RDD with the specified slice of partitions.
    +   */
    +  def slicePartitions(slice: Seq[Int]): RDD[T] = {
    +    new PartitionSlicingRDD(self, slice)
    +  }
    +
    +  /**
    +   * Computes the all-reduced RDD of the parent RDD, which has the same number of partitions and
    +   * locality information as its parent RDD. Each partition contains only one record, which is the
    +   * same as calling `RDD#reduce` on its parent RDD.
    +   *
    +   * @param f reducer
    +   * @return all-reduced RDD
    +   */
    +  def allReduce(f: (T, T) => T): RDD[T] = {
    +    val numPartitions = self.partitions.size
    +    require(numPartitions > 0, "Parent RDD does not have any partitions.")
    +    val nextPowerOfTwo = {
    +      var i = 0
    +      while ((numPartitions >> i) > 0) {
    +        i += 1
    +      }
    +      1 << i
    +    }
    +    var butterfly = self.mapPartitions( (iter) =>
    +      Iterator(iter.reduce(f)),
    +      preservesPartitioning = true
    +    ).cache()
    +
    +    if (nextPowerOfTwo > numPartitions) {
    +      val padding = self.context.parallelize(Seq.empty[T], nextPowerOfTwo - numPartitions)
    +      butterfly = butterfly.union(padding)
    +    }
    +
    +    var offset = nextPowerOfTwo >> 1
    +    while (offset > 0) {
    +      butterfly = new ButterflyReducedRDD[T](butterfly, f, offset).cache()
    --- End diff --
    
    When we create a new RDD at each step we store the RDD references in say a ArrayBuffer. After the loop exits, we call unpersist on all the older RDDs. This doesn't work very well with lazy transformations, though allReduce doesn't need to be lazy ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41147298
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11924002
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala ---
    @@ -44,6 +44,54 @@ class RDDFunctions[T: ClassTag](self: RDD[T]) {
           new SlidingRDD[T](self, windowSize)
         }
       }
    +
    +  /**
    +   * Returns an RDD with the specified slice of partitions.
    +   */
    +  def slicePartitions(slice: Seq[Int]): RDD[T] = {
    +    new PartitionSlicingRDD(self, slice)
    +  }
    +
    +  /**
    +   * Computes the all-reduced RDD of the parent RDD, which has the same number of partitions and
    +   * locality information as its parent RDD. Each partition contains only one record, which is the
    +   * same as calling `RDD#reduce` on its parent RDD.
    +   *
    +   * @param f reducer
    +   * @return all-reduced RDD
    +   */
    +  def allReduce(f: (T, T) => T): RDD[T] = {
    +    val numPartitions = self.partitions.size
    +    require(numPartitions > 0, "Parent RDD does not have any partitions.")
    +    val nextPowerOfTwo = {
    +      var i = 0
    +      while ((numPartitions >> i) > 0) {
    +        i += 1
    +      }
    +      1 << i
    +    }
    +    var butterfly = self.mapPartitions( (iter) =>
    +      Iterator(iter.reduce(f)),
    +      preservesPartitioning = true
    +    ).cache()
    +
    +    if (nextPowerOfTwo > numPartitions) {
    +      val padding = self.context.parallelize(Seq.empty[T], nextPowerOfTwo - numPartitions)
    +      butterfly = butterfly.union(padding)
    +    }
    +
    +    var offset = nextPowerOfTwo >> 1
    +    while (offset > 0) {
    +      butterfly = new ButterflyReducedRDD[T](butterfly, f, offset).cache()
    --- End diff --
    
    Actually, I thought about doing that. I prefer lazy transformations, given the fact that old cached RDDs will be cleared from memory for new ones. But I am not sure whether cleaning is reliable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41147299
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14371/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-42990397
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-42990253
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41209849
  
    @mengxr This is really cool and the performance wins look awesome. Apart from the inline comments, I just one more idea: Instead of using cache + rdd re-partitioning in each step, how expensive is it to do a reduceByKey at each iteration and adjust the keys appropriately ? I think some serialization + de-serialization overheads might add up, but it'll simplify the clean up / caching etc. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41206844
  
    Agree, I was not suggesting that this specific change per-se makes it into core.
    Just that there are a lot of applications for all-reduce support in spark : and if it were available out of the box, it will make porting a lot of algos quite trivial !
    
    And in that context, all reduce support should go into core.
    If we feel that this specific PR is not what we want due to limitations/design choices, that is fine by me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41144433
  
    Might be a good idea to move this out of mllib and push this into core itself.
    The utility of this PR seems more fundamental than just for ML (assuming it does something analogous to all reduce in mpi - note, I am yet to go through this in detail :-) ).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
GitHub user mengxr reopened a pull request:

    https://github.com/apache/spark/pull/506

    [SPARK-1485][MLLIB] Implement Butterfly AllReduce

    The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This may create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce can help free up the driver. This PR contains a simple butterfly AllReduce implementation. Compared it with reduce + broadcast (http) on a 16-node EC2 cluster (with slow connection), and saw 2x speed-up on vectors of size 1k to 10m.
    
    Possible improvements:
    
    1. Each executor only needs one copy.
    2. Better handling when the number of partitions is not a power of two?  

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark butterfly

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #506
    
----
commit 76f4bb7b29e0b605561dac048535380b645839fd
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-18T04:43:32Z

    init impl of allReduce

commit d14300540da65900a2a19f8edc5adc5ce9c3e72d
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-22T21:11:25Z

    move allReduce to mllib

commit 98c329d5ac12ec341a9cf6355cd67ea031189a24
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-23T09:44:51Z

    allow arbitrary number of partitions

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11921644
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/PartitionSlicingRDD.scala ---
    @@ -0,0 +1,43 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.rdd
    +
    +import scala.reflect.ClassTag
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.{TaskContext, Partition}
    +
    +/**
    + * Represents an RDD obtained from partition slicing of its parent RDD.
    --- End diff --
    
    Yes, I didn't know there is one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-42990283
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP][SPARK-1485][MLLIB] Implement Butterfly A...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr closed the pull request at:

    https://github.com/apache/spark/pull/506


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41144180
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-42990400
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14943/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41195545
  
    @mridulm this is not a very general solution yet, and can be very bad consequences (e.g. when data are not cached in memory). If we want a more reliable allReduce, we should probably look into some sort of shuffle dependency that is not all to all (the main problem modeling this using shuffle I see is having to send a bunch of 0s back to the driver for shuffle block size estimation; we might be able to just use run-length-encoding to make that transmission cheap).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/506#discussion_r11924192
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala ---
    @@ -44,6 +44,54 @@ class RDDFunctions[T: ClassTag](self: RDD[T]) {
           new SlidingRDD[T](self, windowSize)
         }
       }
    +
    +  /**
    +   * Returns an RDD with the specified slice of partitions.
    +   */
    +  def slicePartitions(slice: Seq[Int]): RDD[T] = {
    +    new PartitionSlicingRDD(self, slice)
    +  }
    +
    +  /**
    +   * Computes the all-reduced RDD of the parent RDD, which has the same number of partitions and
    +   * locality information as its parent RDD. Each partition contains only one record, which is the
    +   * same as calling `RDD#reduce` on its parent RDD.
    +   *
    +   * @param f reducer
    +   * @return all-reduced RDD
    +   */
    +  def allReduce(f: (T, T) => T): RDD[T] = {
    +    val numPartitions = self.partitions.size
    +    require(numPartitions > 0, "Parent RDD does not have any partitions.")
    +    val nextPowerOfTwo = {
    +      var i = 0
    +      while ((numPartitions >> i) > 0) {
    +        i += 1
    +      }
    +      1 << i
    +    }
    +    var butterfly = self.mapPartitions( (iter) =>
    +      Iterator(iter.reduce(f)),
    +      preservesPartitioning = true
    +    ).cache()
    +
    +    if (nextPowerOfTwo > numPartitions) {
    +      val padding = self.context.parallelize(Seq.empty[T], nextPowerOfTwo - numPartitions)
    +      butterfly = butterfly.union(padding)
    +    }
    +
    +    var offset = nextPowerOfTwo >> 1
    +    while (offset > 0) {
    +      butterfly = new ButterflyReducedRDD[T](butterfly, f, offset).cache()
    --- End diff --
    
    Yeah, the default clean up policy is still LRU as far as I know. In that case you could see weird things like RDDs cached before the ButterflyRDD getting evicted first.
    
    What we need is an interface to say unpersist some RDDs after they have been computed upon, but I don't think we have that yet.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41208879
  
    @mridulm This implementation is experimental, and I'm looking for comments and suggestions to make it better. @etrain @shivaram ?
    
    Since it already outperforms reduce + broadcast, it is interesting to see how far we can go.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1485][MLLIB] Implement Butterfly AllRed...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/506#issuecomment-41144760
  
    I'm a little worried about misuse. Calling AllReduce with many small partitions can be very slow, at least with this implementation. Even in MLlib, this is package private.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---