You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by insidedctm <gi...@git.apache.org> on 2016/02/21 12:54:46 UTC

[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

GitHub user insidedctm opened a pull request:

    https://github.com/apache/spark/pull/11290

    [SPARK-3650][GraphX] Triangle Count handles reverse edges incorrectly

    ## What changes were proposed in this pull request?
    
    Reworking of @jegonzal PR #2495 to address the issue identified in SPARK-3650. Code amended to use the convertToCanonicalEdges method. 
    
    
    ## How was the this patch tested?
    
    Patch was tested using the unit tests created in PR #2495
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/insidedctm/spark spark-3650

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11290
    
----
commit 428fa26880bb32f04d0799d2c227e52defb99428
Author: Robin East <ro...@xense.co.uk>
Date:   2015-09-14T21:09:26Z

    Change bytes to bits in RoutingTablePartition.toMessage

commit cf66402fb77855711ffd17ddb3efa58c7d44296e
Author: Robin East <ro...@xense.co.uk>
Date:   2016-02-18T18:38:37Z

    Merge remote-tracking branch 'upstream/master'

commit 96fcc0aae84450d6cc3edf046807048b2d8c2db1
Author: Joseph E. Gonzalez <jo...@gmail.com>
Date:   2014-09-22T21:57:28Z

    Improving Triangle Count

commit 1edc09df8e32b6717aa300fe62636a9613bcbc27
Author: Joseph E. Gonzalez <jo...@gmail.com>
Date:   2014-09-22T22:16:46Z

    fixing bug in unit tests where bi-directed edges lead to duplicate triangles.

commit 47673cadc957eb35dbab01cdcbbe21382987e691
Author: Joseph E. Gonzalez <jo...@gmail.com>
Date:   2014-11-13T07:18:58Z

    factored out code for canonicalization

commit c6cd74792d4f82e562d1c792d322f17b1877d4af
Author: Robin East <ro...@xense.co.uk>
Date:   2016-02-20T21:46:49Z

    SPARK-3650 updates to PR 2495 to work with current master

commit c8ad0bd4ed998b86a465bc36ec59ddc5dcceef5e
Author: Robin East <ro...@xense.co.uk>
Date:   2016-02-21T11:27:10Z

    revert unexpected changes to R/pkg/DESCRIPTION

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186957610
  
    I'm going to merge this in master. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11290


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186809524
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186856257
  
    **[Test build #51640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51640/consoleFull)** for PR 11290 at commit [`1e6f5d2`](https://github.com/apache/spark/commit/1e6f5d2e01ea7902a1c7ec5261e21e855ed8b073).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by insidedctm <gi...@git.apache.org>.
Github user insidedctm commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186838624
  
    @srowen good points, I've updated and pushed changes in line with your comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186855315
  
    Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186860315
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51640/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11290#discussion_r53566037
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala ---
    @@ -27,25 +28,47 @@ import org.apache.spark.graphx._
      * The algorithm is relatively straightforward and can be computed in three steps:
      *
      * <ul>
    - * <li>Compute the set of neighbors for each vertex
    - * <li>For each edge compute the intersection of the sets and send the count to both vertices.
    + * <li> Compute the set of neighbors for each vertex
    + * <li> For each edge compute the intersection of the sets and send the count to both vertices.
      * <li> Compute the sum at each vertex and divide by two since each triangle is counted twice.
      * </ul>
      *
    - * Note that the input graph should have its edges in canonical direction
    - * (i.e. the `sourceId` less than `destId`). Also the graph must have been partitioned
    - * using [[org.apache.spark.graphx.Graph#partitionBy]].
    + * There are two implementations.  The default `TriangleCount.run` implementation first removes
    + * self cycles and canonicalizes the graph to ensure that the following conditions hold:
    + * <ul>
    + * <li> There are no self edges
    + * <li> All edges are oriented src > dst
    + * <li> There are no duplicate edges
    + * </ul>
    + * However, the canonicalization procedure is costly as it requires repartitioning the graph.
    + * If the input data is already in "canonical form" with self cycles removed then the
    + * `TriangleCount.runPreCanonicalized` should be used instead.
    + *
    + * {{{
    + * val canonicalGraph = graph.mapEdges(e => 1).removeSelfEdges().canonicalizeEdges()
    + * val counts = TriangleCount.runPreCanonicalized(canonicalGraph).vertices
    + * }}}
    + *
      */
     object TriangleCount {
     
       def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[Int, ED] = {
    -    // Remove redundant edges
    -    val g = graph.groupEdges((a, b) => a).cache()
    +    // Transform the edge data something cheap to shuffle and then canonicalize
    +    val canonicalGraph = graph.mapEdges(e => true).removeSelfEdges().convertToCanonicalEdges()
    +    // Get the triangle counts
    +    val counters = runPreCanonicalized(canonicalGraph).vertices
    +    // Join them bath with the original graph
    +    graph.outerJoinVertices(counters) { (vid, _, optCounter: Option[Int]) =>
    --- End diff --
    
    Nit: just `{ (_, _, optCounter) => ` right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11290#discussion_r53566028
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala ---
    @@ -80,15 +103,18 @@ object TriangleCount {
           ctx.sendToSrc(counter)
           ctx.sendToDst(counter)
         }
    +
         // compute the intersection along edges
         val counters: VertexRDD[Int] = setGraph.aggregateMessages(edgeFunc, _ + _)
         // Merge counters with the graph and divide by two since each triangle is counted twice
    -    g.outerJoinVertices(counters) {
    -      (vid, _, optCounter: Option[Int]) =>
    -        val dblCount = optCounter.getOrElse(0)
    -        // double count should be even (divisible by two)
    -        assert((dblCount & 1) == 0)
    -        dblCount / 2
    +    graph.outerJoinVertices(counters) { (vid, _, optCounter: Option[Int]) =>
    +      val dblCount = optCounter.getOrElse(0)
    +      // This algorithm double counts each triangle so the final count should be even
    +      val isEven = (dblCount & 1) == 0
    +      if (!isEven) {
    +        throw new Exception("Triangle count resulted in an invalid number of triangles.")
    --- End diff --
    
    I would just `require(...)` this condition. `(dblCount % 2) == 0` would be a little more canonical


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11290#discussion_r53566042
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala ---
    @@ -27,25 +28,47 @@ import org.apache.spark.graphx._
      * The algorithm is relatively straightforward and can be computed in three steps:
      *
      * <ul>
    - * <li>Compute the set of neighbors for each vertex
    - * <li>For each edge compute the intersection of the sets and send the count to both vertices.
    + * <li> Compute the set of neighbors for each vertex
    --- End diff --
    
    Nit: the `<li>` should be closed if you're writing HTML tags


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186860282
  
    **[Test build #51640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51640/consoleFull)** for PR 11290 at commit [`1e6f5d2`](https://github.com/apache/spark/commit/1e6f5d2e01ea7902a1c7ec5261e21e855ed8b073).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186860314
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by jegonzal <gi...@git.apache.org>.
Github user jegonzal commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186951853
  
    This looks good to me. @insidedctm thanks for reviving the PR and @srowen thanks for taking a look at this!  My only minor concern is that it will change the results for people that are using triangle count so we should note this in the change log on the next release. 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11290#issuecomment-186957539
  
    This is great timing given the next version is 2.0.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org