You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tinkerpop.apache.org by sp...@apache.org on 2018/08/09 18:12:48 UTC

[06/12] tinkerpop git commit: Extended the connected-components recipe

Extended the connected-components recipe


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/a0119640
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/a0119640
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/a0119640

Branch: refs/heads/master
Commit: a011964030f36c87b1ecb22eefb1e7deeffed3b9
Parents: 002560a
Author: HadoopMarc <vt...@xs4all.nl>
Authored: Sun Jun 10 15:17:17 2018 +0200
Committer: Stephen Mallette <sp...@genoprime.com>
Committed: Thu Aug 9 10:54:41 2018 -0400

----------------------------------------------------------------------
 docs/src/recipes/connected-components.asciidoc |  94 ++++++++++++--------
 docs/static/images/cc-scale-ratio.png          | Bin 0 -> 14393 bytes
 docs/static/images/cc-scale-size.png           | Bin 0 -> 12220 bytes
 3 files changed, 58 insertions(+), 36 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/a0119640/docs/src/recipes/connected-components.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/connected-components.asciidoc b/docs/src/recipes/connected-components.asciidoc
index edbeec5..850c31f 100644
--- a/docs/src/recipes/connected-components.asciidoc
+++ b/docs/src/recipes/connected-components.asciidoc
@@ -31,11 +31,11 @@ Depending on the size of the graph, three solution regimes can be discriminated:
 
 1. Small graphs that fit in the memory of a single machine
 
-2. Medium graphs backed by storage for which a linear scan is still feasible. This regime is left to third party
+2. Medium-sized graphs backed by storage for which an OLTP linear scan is still feasible. This regime is left to third party
 TinkerPop implementations, since TinkerPop itself has no storage-backed reference implementations. The idea is that
 component membership is stored in the graph, rather than in memory.
 
-3. Large graphs requiring an OLAP approach to yield results in a reasonable time.
+3. Large graphs requiring an approach with `HadoopGraph` and `SparkGraphComputer` to yield results in a reasonable time.
 
 
 These regimes are discussed separately using the following graph with three weakly connected components:
@@ -55,16 +55,21 @@ g.addV().property(id, "A").as("a").
   addE("link").from("d").to("e").iterate()
 ----
 
+==== Small graph traversals
 
-===== Small graphs
-
-Connected components in a small graph can be determined with both an OLTP traversal and the OLAP
+Connected components in a small graph can be determined with either an OLTP traversal or the OLAP
 `connectedComponent()`-step. The `connectedComponent()`-step is available as of TinkerPop 3.4.0 and is
 described in more detail in the
 link:http://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference Documentation].
+The traversal looks like:
+[gremlin-groovy,existing]
+----
+g.withComputer().V().connectedComponent().
+    group().by('gremlin.connectedComponentVertexProgram.component').
+    select(values).unfold()
+----
 
 A straightforward way to detect the various subgraphs with an OLTP traversal is to do this:
-
 [gremlin-groovy,existing]
 ----
 g.V().emit(cyclicPath().or().not(both())).                                    <1>
@@ -73,7 +78,6 @@ g.V().emit(cyclicPath().or().not(both())).                                    <1
     by(path().unfold().dedup().fold()).                                       <4>
     select(values).unfold()                                                   <5>
 ----
-
 <1> The initial emit() step allows for output of isolated vertices, in addition to the discovery of
 components as described in (2).
 
@@ -83,7 +87,7 @@ path.  Collection `'a'` is used to keep track of visited vertices, for both subt
 and new traversals resulting from the `g.V()` linear scan.
 
 <3> While `'a'` nicely keeps track of vertices already visited, the actual components need to be extracted from the
-path information of surviving traversers. The `path().unfold().limit(1)` closure provides the starting vertex
+path information. The `path().unfold().limit(1)` closure provides the starting vertex
 of surviving traversers, which can be used to group the components.
 
 <4> This clause collects the unique vertices from all paths with the same starting vertex, thus from the same
@@ -91,39 +95,57 @@ weak component.
 
 <5> The values of the groupby map contain the lists of vertices making up the requested components.
 
-This algorithm completes in linear time with the number of vertices and edges, because a traversal is started for each
-vertex and each edge with its associated out-vertex is visited exactly once.
-
 
-==== Large graphs
 
-Large graphs require an OLAP solution with a custom VertexProgram that can be run using a graph implementation's
-GraphComputer, in particular `SparkGraphComputer` on a `HadoopGraph`. The OLAP solution also runs faster for most
-medium-sized graphs, that is when these graph have a 'natural' structure with a limited maximum path length.
+==== Small graph scalability
 
-The TinkerPop library of vertex programs contains the `WeakComponentsVertexProgram` which can be run in the same
-way as the link:http://tinkerpop.apache.org/docs/x.y.z/reference/#peerpressurevertexprogram[PeerPressureVertexProgram]:
+The scalability of the OLTP traversal and the `connectedComponent()`-step for in-memory graphs is shown in the figures
+below.
 
-[gremlin-groovy,existing]
-----
-result = g.getGraph().compute().
-    program(WeakComponentsVertexProgram.build().maxIterations(100).create()).
-    mapReduce(ClusterPopulationMapReduce.build().create()).
-    mapReduce(ClusterCountMapReduce.build().create()).
-    submit().get()
-result.memory().clusterPopulation
-gResult = result.graph().traversal()
-gResult.V().valueMap(true)
-----
+[[cc-scale-size]]
+.Run times for finding connected components in a randomly generated graph with 10 components of equal size and with an edge/vertex ratio of 6
+image::cc-scale-size.png[width=600, side=bottom]
 
-The vertex program has interconnected vertices exchange id's and store the lowest id until no vertex receives a
-lower id. This algorithm is commonly applied in
+In general, the `connectedComponent()`-step is almost a factor two faster than the OLTP traversal. Only, for very
+small graphs the overhead of running the ConnectedComponentVertexProgram is larger than that of the OLTP traversal.
+The vertex program works by having interconnected vertices exchange id's and store the lowest id until no vertex
+receives a lower id. This algorithm is commonly applied in
 link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous parallel] systems, e.g. in
-link:https://spark.apache.org/graphx[Apache Spark GraphX].
+link:https://spark.apache.org/graphx[Apache Spark GraphX]. Overhead for the vertex program arises because it has to run
+as many cycles as the largest length of the shortest paths between any two vertices in a component of the graph. In
+every cycle each vertex has to be checked for being
+"halted". Overhead of the OLTP traversal consists of each traverser having to carry complete path information. For
+pure depth-first-search or breadth-first-search implementations, connected-component algotithms should scale
+as [.big]##O##(V+E). For the traversals in the figure above this is almost the case.
+
+
+[[cc-scale-ratio]]
+.Run times for finding connected components in a randomly generated graph with 10 components, each consisting of 6400 vertices
+image::cc-scale-ratio.png[width=600]
+
+The random graphs used for the scalability tests can be modulated with the edge/vertex ratio. For small ratios the
+components generated are more lint-like and harder to process by the `connectedComponent()`-step. For high ratios
+the components are more mesh-like and the ConnectedComponentVertexProgram needs few cycles to process the graph. These
+characteristics show clearly from the graph. Indeed, for a given number of vertices, the run time of the
+`connectedComponent()`-step does not depend on the number of edges, but rather on the maximum shortest path length in
+the graph.
+
+
+==== Large graphs
+
+Large graphs in TinkerPop require distributed processing by `SparkGraphComputer` to get results in a reasonable time (OLAP
+approach). This means that the graph must be available as `HadoopGraph` (third party TinkerPop implementations often
+allow to make a graph available as an `HadoopGraph` by providing an Hadoop `InputFormat`). Running the
+`connectedComponent()`-step on
+an `HadoopGraph` works the same as for a small graph, provided that `SparkGraphComputer` is specified as the graph computer,
+either with the `gremlin.hadoop.defaultGraphComputer` property or as part of the `withComputer()`-step.
+
+Scalability of the the `connectedComponent()`-step with `SparkGraphComputer` is high, but note that:
+
+* the graph should fit in the memory of the Spark cluster to allow the VertexProgram to run its cycles without spilling
+intermediate results to disk and loosing most of the gains from the distributed processing
 
-==== Scalability
+* as discussed for small graphs, the BSP algorithm does not play well with graphs having a large shortest path between
+any pair of vertices. Overcoming this limitation is still a
+link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[subject of academic research].
 
-ToDo:
- - limits and run time regime 1
- - test of friendster graph regime 3
- - discuss: link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[http://www.vldb.org/pvldb/vol7/p1821-yan.pdf]
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/a0119640/docs/static/images/cc-scale-ratio.png
----------------------------------------------------------------------
diff --git a/docs/static/images/cc-scale-ratio.png b/docs/static/images/cc-scale-ratio.png
new file mode 100644
index 0000000..33a842d
Binary files /dev/null and b/docs/static/images/cc-scale-ratio.png differ

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/a0119640/docs/static/images/cc-scale-size.png
----------------------------------------------------------------------
diff --git a/docs/static/images/cc-scale-size.png b/docs/static/images/cc-scale-size.png
new file mode 100644
index 0000000..2b08a89
Binary files /dev/null and b/docs/static/images/cc-scale-size.png differ