
Posted to reviews@spark.apache.org by darabos <gi...@git.apache.org> on 2014/03/31 13:06:08 UTC

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

GitHub user darabos opened a pull request:

    https://github.com/apache/spark/pull/276

    Do not re-use objects in the EdgePartition/EdgeTriplet iterators.

    This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact in my measurements. It also simplifies the code. As far as I can tell, the object re-use was nothing but premature optimization.
    
    I benchmarked each of the included changes and found no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?
    
    This is one of the benchmarks I ran:
    
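    test("benchmark") {
      // Build a single partition holding 10 million edges with Int attributes.
      val builder = new EdgePartitionBuilder[Int]
      for (i <- 1 to 10000000) {
        builder.add(i.toLong, i.toLong, i)
      }
      val p = builder.toEdgePartition
      // Map over every edge attribute and force a full pass over the iterator.
      p.map(_.attr + 1).iterator.toList
    }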
    
    It ran for 10 seconds both before and after this change.
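
The kind of hazard described above can be illustrated with a minimal, self-contained sketch. It does not use the GraphX classes: `MutableEdge`, `ReuseDemo`, `reusing`, and `fresh` are hypothetical names introduced only for this illustration, and the sketch assumes nothing about how `EdgePartition` is actually implemented. It only shows why an iterator that hands out one shared mutable object gives wrong results as soon as a caller buffers the elements.

```scala
// Hypothetical stand-in for an edge object; not the GraphX Edge class.
final class MutableEdge(var srcId: Long, var dstId: Long, var attr: Int)

object ReuseDemo {
  // Iterator that mutates and returns one shared instance on every next().
  def reusing(n: Int): Iterator[MutableEdge] = new Iterator[MutableEdge] {
    private val shared = new MutableEdge(0L, 0L, 0)
    private var i = 0
    def hasNext: Boolean = i < n
    def next(): MutableEdge = {
      i += 1
      shared.srcId = i.toLong
      shared.dstId = i.toLong
      shared.attr = i
      shared
    }
  }

  // Iterator that allocates a fresh object for each element.
  def fresh(n: Int): Iterator[MutableEdge] =
    Iterator.tabulate(n)(k => new MutableEdge(k + 1L, k + 1L, k + 1))

  def main(args: Array[String]): Unit = {
    // Buffering the re-used object yields n references to the same instance,
    // so every element shows the last value written.
    println(reusing(3).toList.map(_.attr)) // List(3, 3, 3)
    // Fresh objects keep their own values.
    println(fresh(3).toList.map(_.attr))   // List(1, 2, 3)
    // Reading each element before advancing hides the problem.
    println(reusing(3).map(_.attr).toList) // List(1, 2, 3)
  }
}
```

The last line is what makes this class of bug silent: code that consumes each element immediately behaves correctly, and only code that collects, caches, or sorts the objects sees corrupted data.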

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/darabos/spark spark-1188

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #276
    
----
commit c55f52fffa79f0ee227367a555172f6cb4ce5cee
Author: Daniel Darabos <da...@gmail.com>
Date:   2014-03-31T10:58:05Z

    Tests that reproduce the problems from SPARK-1188.

commit 0182f2b329b2bb6e6ca8c41245f09db83b71908b
Author: Daniel Darabos <da...@gmail.com>
Date:   2014-03-31T10:58:37Z

    Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Do not re-use objects in the EdgePartition/Edg...

Posted by darabos <gi...@git.apache.org>.

GitHub user darabos commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39077046
  
    Sorry, the new JIRA link is https://issues.apache.org/jira/browse/SPARK-1188. Thanks!

