You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by darabos <gi...@git.apache.org> on 2014/03/31 13:06:08 UTC

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

GitHub user darabos opened a pull request:

    https://github.com/apache/spark/pull/276

    Do not re-use objects in the EdgePartition/EdgeTriplet iterators.

    This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact by my measurements. It also simplifies the code. As far as I can tell the object re-use was nothing but premature optimization.
    
    I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?
    
    This is an example benchmark I did:
    
    test("benchmark") {
      val builder = new EdgePartitionBuilder[Int]
      for (i <- (1 to 10000000)) {
        builder.add(i.toLong, i.toLong, i)
      }
      val p = builder.toEdgePartition
      p.map(_.attr + 1).iterator.toList
    }
    
    It ran for 10 seconds both before and after this change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/darabos/spark spark-1188

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #276
    
----
commit c55f52fffa79f0ee227367a555172f6cb4ce5cee
Author: Daniel Darabos <da...@gmail.com>
Date:   2014-03-31T10:58:05Z

    Tests that reproduce the problems from SPARK-1188.

commit 0182f2b329b2bb6e6ca8c41245f09db83b71908b
Author: Daniel Darabos <da...@gmail.com>
Date:   2014-03-31T10:58:37Z

    Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by darabos <gi...@git.apache.org>.
Github user darabos commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39077046
  
    Sorry, the new JIRA link is https://issues.apache.org/jira/browse/SPARK-1188. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by darabos <gi...@git.apache.org>.
Github user darabos commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39199208
  
    Thanks for the comments! The description of the GC effects was very educational. I made the suggested changes. Let me know if you'd like to see something else changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/276#discussion_r11172504
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala ---
    @@ -84,19 +87,13 @@ class EdgePartition[@specialized(Char, Int, Boolean, Byte, Long, Float, Double)
        * order of the edges returned by `EdgePartition.iterator` and
        * should return attributes equal to the number of edges.
        *
    -   * @param f a function from an edge to a new attribute
    +   * @param iter an iterator for the new attribute values
        * @tparam ED2 the type of the new attribute
    -   * @return a new edge partition with the result of the function `f`
    -   *         applied to each edge
    +   * @return a new edge partition with the attribute values replaced
        */
       def map[ED2: ClassTag](iter: Iterator[ED2]): EdgePartition[ED2] = {
    -    val newData = new Array[ED2](data.size)
    -    var i = 0
    -    while (iter.hasNext) {
    -      newData(i) = iter.next()
    -      i += 1
    -    }
    -    assert(newData.size == i)
    +    val newData = iter.toArray
    --- End diff --
    
    Can we revert this change? This is where Scala fails us. The while loop is ~ 2x faster than iter.toArray in my micro benchmark. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39237641
  
    Thanks @darabos. The change looks good to me other than the one place I pointed out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/276


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39372866
  
    Merged. Thanks! This is a very important usability improvement. 
    
    cc @jegonzal @ankurdave


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39076605
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39150913
  
    Thanks for sending this in. It's something that is extremely confusing in graphx and we need to fix it. However, I am not sure if taking out the object reuse in edges is the way to fix this problem. 
    
    This is actually hard to test in micro benchmarks, because you rarely get GCs in micro benchmarks. In these cases where an edge/triplet is returned by an iterator, JVM's escape analysis cannot capture the scope and cannot do on-stack allocation. As a result, there are lots of temporary objects allocated in the heap.
    
    Allocations of short-lived objects are supposed to be cheap. The most expensive objects to gc are medium-lived objects. However, the more temporary objects we allocate, the more frequent a young gen gc happens. The more frequent young gen gc happens, the more likely for random objects to become medium lived. 
    
    Maybe a better way to fix this is to leave the object reuse as is, but in all places where we return the object to the user, we should make sure it copies it.
    
    I just looked at the code, and I think we can accomplish that by just adding a copy to EdgeRDD.compute, i.e.
    ```scala
      override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
        firstParent[(PartitionID, EdgePartition[ED])].iterator(part, context).next._2.iterator.map(_.copy())
      }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39150971
  
    And for EdgeTripletIterator we should just keep your change (i.e. always create a new row) since that is not used internally at all. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by darabos <gi...@git.apache.org>.
Github user darabos commented on a diff in the pull request:

    https://github.com/apache/spark/pull/276#discussion_r11199524
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala ---
    @@ -84,19 +87,13 @@ class EdgePartition[@specialized(Char, Int, Boolean, Byte, Long, Float, Double)
        * order of the edges returned by `EdgePartition.iterator` and
        * should return attributes equal to the number of edges.
        *
    -   * @param f a function from an edge to a new attribute
    +   * @param iter an iterator for the new attribute values
        * @tparam ED2 the type of the new attribute
    -   * @return a new edge partition with the result of the function `f`
    -   *         applied to each edge
    +   * @return a new edge partition with the attribute values replaced
        */
       def map[ED2: ClassTag](iter: Iterator[ED2]): EdgePartition[ED2] = {
    -    val newData = new Array[ED2](data.size)
    -    var i = 0
    -    while (iter.hasNext) {
    -      newData(i) = iter.next()
    -      i += 1
    -    }
    -    assert(newData.size == i)
    +    val newData = iter.toArray
    --- End diff --
    
    Done. Yeah, I didn't benchmark this, sorry.
    
    I guess toArray must work with a dynamic array, since it does not know the size in advance. There is an Iterator.copyToArray, but regrettably it does not return the number of elements copied, so we would lose the important assertion. I filed a feature request for this. (https://issues.scala-lang.org/browse/SI-8469) Hopefully I'm not being obtuse.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---