You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tinkerpop.apache.org by sp...@apache.org on 2017/01/10 13:59:34 UTC

[1/2] tinkerpop git commit: Added duplicate edge detection recipe

Repository: tinkerpop
Updated Branches:
  refs/heads/more-recipes 043d1a3aa -> 3e54d897e


Added duplicate edge detection recipe


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/7783c838
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/7783c838
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/7783c838

Branch: refs/heads/more-recipes
Commit: 7783c838e15cfdf3ee567790c657f34403766936
Parents: 043d1a3
Author: Stephen Mallette <sp...@genoprime.com>
Authored: Tue Jan 10 08:49:02 2017 -0500
Committer: Stephen Mallette <sp...@genoprime.com>
Committed: Tue Jan 10 08:49:02 2017 -0500

----------------------------------------------------------------------
 docs/src/recipes/duplicate-edge.asciidoc | 142 ++++++++++++++++++++++++++
 docs/src/recipes/index.asciidoc          |  18 ++++
 2 files changed, 160 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/7783c838/docs/src/recipes/duplicate-edge.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/duplicate-edge.asciidoc b/docs/src/recipes/duplicate-edge.asciidoc
new file mode 100644
index 0000000..ed56106
--- /dev/null
+++ b/docs/src/recipes/duplicate-edge.asciidoc
@@ -0,0 +1,142 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+[[duplicate-edge]]
+Duplicate Edge Detection
+------------------------
+
+Whether part of a graph maintenance process or for some other analysis need, it is sometimes necessary to detect
+if there is more than one edge between two vertices. The following examples will assume that an edge with the same
+label and direction will be considered "duplicate".
+
+The "modern" graph does not have any duplicate edges that fit that definition, so the following example adds one
+that is duplicative of the "created" edge between vertex "1" and "3".
+
+[gremlin-groovy,modern]
+----
+g.V(1).as("a").V(3).addE("created").from("a").iterate()
+g.V(1).outE("created")
+----
+
+One way to find the duplicate edges would be to do something like this:
+
+[gremlin-groovy,existing]
+----
+g.V().outE().
+  project("a","b").                         <1>
+    by().by(inV().path().by().by(label)).
+  group().                                  <2>
+    by(select("b")).
+    by(select("a").fold()).
+  unfold().                                 <3>
+  select(values).                           <4>
+  where(count(local).is(gt(1)))
+----
+
+<1> The "a" and "b" from the `project` contain the edge and the path respectively. The path consists of a the outgoing
+vertex, an edge, and the incoming vertex. The use of `by().by(label))` converts the edge to its label (recall that `by`
+are applied in round-robin fashion), so the path will look something like: `[v[1],created,v[3]]`.
+<2> Group by the path from "b" and construct a list of edges from "a". Any value in this `Map` that has a list of edges
+greater than one means that there is more than one edge for that edge label between those two vertices (i.e. the `Map`
+key).
+<3> Unroll the key-value pairs in the `Map` of paths-edges.
+<4> Only the values from the `Map` are needed and as mentioned earlier, those lists with more than one edge would
+containa  duplicate.
+
+This method find the duplicates, but does require more memory than other approaches. A slightly more complex approach
+that uses less memory might look like this:
+
+[gremlin-groovy,existing]
+----
+g.V().as("ov").
+  outE().as("e").
+  inV().as("iv").
+  inE().                                          <1>
+  where(neq("e")).                                <2>
+  where(eq("e")).by(label).
+  where(outV().as("ov")).
+  group().
+    by(select("ov","e","iv").by().by(label)).     <3>
+  unfold().                                       <4>
+  select(values).
+  where(count(local).is(gt(1)))
+----
+
+<1> To this point in the traversal, the outgoing edges of a vertex are being iterated with the current edge labelled
+as "e". For "e", Gremlin traverses to the incoming vertex and back on in edges of that vertex.
+<2> Those incoming edges are filtered with the following `where` steps. The first ensures that it does not traverse
+back over "e" (i.e. the current edge). The second determines if the edge label is equivalent (i.e. the test for the
+working definition of "duplicate"). The third determines if the outgoing vertex matches the one that started the
+path labelled as "ov".
+<3> This line is quite similar to the output achieved in the previous example at step 2. A `Map` is produced that uses
+the outgoing vertex, the edge label, and the incoming vertex as the key, with the list of edges for that path as the
+value.
+<4> The rest of the traversal is the same as the previous one.
+
+A third way to approach this problem would be to force a link:https://en.wikipedia.org/wiki/Depth-first_search[depth-first search].
+The previous examples invoke traversal strategies that force a link:https://en.wikipedia.org/wiki/Breadth-first_search[breadth first search]
+as a performance optimization.
+
+[gremlin-groovy,existing]
+----
+g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov").   <1>
+  outE().as("e1").
+  inV().as("iv").
+  inE().
+  where(neq("e1")).
+  where(outV().as("ov")).as("e2").                                               <2>
+  filter(select("e1","e2").by(label).where("e1", eq("e2")))                      <3>
+----
+
+<1> Remove strategies that will optimize for breadth first searches and thus allow Gremlin to go depth first.
+<2> To this point, the traversal is very much like the previous one. Review step 2 in the previous example to see the
+parallels here.
+<3> The final `filter` simply looks for edges that match on label, which would then meet the working definition of
+"duplicate".
+
+The basic pattern at play here is to compare the path of the outgoing vertex, its outgoing edge label and the incoming
+vertex. This model can obviously be contracted or expanded as needed to fit different definitions of "duplicate" For
+example, if the definition of "duplicate" was just two vertices with more than one edge of any label between them, then
+the approach might look something like this:
+
+[gremlin-groovy,existing]
+----
+----
+
+Alternatively, a "duplicate" definition could extended to the label and properties of the edge. For purposes of
+demonstration, an additional edge is added to the "modern" graph:
+
+[gremlin-groovy]
+----
+g = TinkerFactory.createModern().traversal()
+g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate()
+g.V(1).as("a").V(3).addE("created").property("weight",0.5d).from("a").iterate()
+g.V(1).outE("created").valueMap(true)
+----
+
+To identify the duplicate with this revised definition, the previous traversal can be modified to:
+
+[gremlin-groovy,existing]
+----
+g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov").
+  outE().as("e1").
+  inV().as("iv").
+  inE().
+  where(neq("e1")).
+  where(outV().as("ov")).as("e2").
+  filter(select("e1","e2").by(label).where("e1", eq("e2"))).
+  filter(select("e1","e2").by("weight").where("e1", eq("e2"))).valueMap(true)
+----
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/7783c838/docs/src/recipes/index.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/index.asciidoc b/docs/src/recipes/index.asciidoc
index 2e88ac5..9c42f0f 100644
--- a/docs/src/recipes/index.asciidoc
+++ b/docs/src/recipes/index.asciidoc
@@ -44,6 +44,8 @@ include::connected-components.asciidoc[]
 
 include::cycle-detection.asciidoc[]
 
+include::duplicate-edge.asciidoc[]
+
 include::if-then-based-grouping.asciidoc[]
 
 include::pagination.asciidoc[]
@@ -212,4 +214,20 @@ g.V().hasLabel("student").as("s").
     by(select("s").by("name")).
     by(group().by(select("t").by(valueMap("fromTime","toTime"))).
                by(select("c").dedup().values("name").fold())).next()
+----
+
+[[appendix-d]]
+_In the "modern" graph, with a duplicate edge added, find the vertex pairs that have more than one edge between them._
+
+[gremlin-groovy,modern]
+----
+g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate()
+g.V(1).outE("created")
+g.V().as("a").
+  out().as("b").
+  groupCount().
+    by(select("a","b")).
+  unfold().
+  filter(select(values).is(gt(1))).
+  select(keys)
 ----
\ No newline at end of file


[2/2] tinkerpop git commit: Move appendix to its own doc and out of index.

Posted by sp...@apache.org.
Move appendix to its own doc and out of index.


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/3e54d897
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/3e54d897
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/3e54d897

Branch: refs/heads/more-recipes
Commit: 3e54d897e70f26fdefbe30daf5ce19a38a052e4d
Parents: 7783c83
Author: Stephen Mallette <sp...@genoprime.com>
Authored: Tue Jan 10 08:52:58 2017 -0500
Committer: Stephen Mallette <sp...@genoprime.com>
Committed: Tue Jan 10 08:52:58 2017 -0500

----------------------------------------------------------------------
 docs/src/recipes/appendix.asciidoc | 112 ++++++++++++++++++++++++++++++++
 docs/src/recipes/index.asciidoc    |  97 +--------------------------
 2 files changed, 113 insertions(+), 96 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3e54d897/docs/src/recipes/appendix.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/appendix.asciidoc b/docs/src/recipes/appendix.asciidoc
new file mode 100644
index 0000000..63ec447
--- /dev/null
+++ b/docs/src/recipes/appendix.asciidoc
@@ -0,0 +1,112 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+Appendix
+========
+
+Many of the recipes are based on questions and answers provided on the gremlin-users mailing list. This section
+contains a number of traversals from the mailing list that do not easily fit any particular pattern (i.e. a recipe),
+but are nonetheless interesting and thus remain good tools for learning Gremlin.
+
+[[appendix-a]]
+_For each person in a "follows" graph, determine the number of followers and list their names._
+
+[gremlin-groovy]
+----
+g.addV('name','marko').as('marko').
+  addV('name','josh').as('josh').
+  addV('name','daniel').as('daniel').
+  addV('name','matthias').as('matthias').
+  addE('follows').from('josh').to('marko').
+  addE('follows').from('matthias').to('josh').
+  addE('follows').from('daniel').to('josh').
+  addE('follows').from('daniel').to('marko').iterate()
+g.V().as('p').
+  map(__.in('follows').values('name').fold()).
+  group().by(select('p').by('name')).
+          by(project('numFollowers','followers').
+               by(count(local)).by()).next()
+----
+
+[[appendix-b]]
+_In the "modern" graph, show each person, the software they worked on and the co-worker count for the software and
+the names of those co-workers._
+
+[gremlin-groovy,modern]
+----
+g.V().hasLabel("person").as("p").
+  out("created").as("s").
+  map(__.in("created").
+    where(neq("p")).values("name").fold()).
+  group().by(select("p").by("name")).
+    by(group().by(select("s").by("name")).
+    by(project("numCoworkers","coworkers").
+         by(count(local)).by())).next()
+----
+
+[[appendix-c]]
+_Assuming a graph of students, classes and times, detect students who have a conflicting schedule._
+
+[gremlin-groovy]
+----
+g.addV(label, "student", "name", "Pete").as("s1").
+  addV(label, "student", "name", "Joe").as("s2").
+  addV(label, "class", "name", "Java's GC").as("c1").
+  addV(label, "class", "name", "FP Principles").as("c2").
+  addV(label, "class", "name", "Memory Management in C").as("c3").
+  addV(label, "class", "name", "Memory Management in C++").as("c4").
+  addV(label, "timeslot", "date", "11/25/2016", "fromTime", "10:00", "toTime", "11:00").as("t1").
+  addV(label, "timeslot", "date", "11/25/2016", "fromTime", "11:00", "toTime", "12:00").as("t2").
+  addE("attends").from("s1").to("c1").
+  addE("attends").from("s1").to("c2").
+  addE("attends").from("s1").to("c3").
+  addE("attends").from("s1").to("c4").
+  addE("attends").from("s2").to("c2").
+  addE("attends").from("s2").to("c3").
+  addE("allocated").from("c1").to("t1").
+  addE("allocated").from("c1").to("t2").
+  addE("allocated").from("c2").to("t1").
+  addE("allocated").from("c3").to("t2").
+  addE("allocated").from("c4").to("t2").iterate()
+g.V().hasLabel("student").as("s").
+  out("attends").as("c").
+  out("allocated").as("t").
+  select("s").
+  out("attends").
+  where(neq("c")).
+  out("allocated").
+  where(eq("t")).
+  group().
+    by(select("s").by("name")).
+    by(group().by(select("t").by(valueMap("fromTime","toTime"))).
+               by(select("c").dedup().values("name").fold())).next()
+----
+
+[[appendix-d]]
+_In the "modern" graph, with a duplicate edge added, find the vertex pairs that have more than one edge between them._
+
+[gremlin-groovy,modern]
+----
+g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate()
+g.V(1).outE("created")
+g.V().as("a").
+  out().as("b").
+  groupCount().
+    by(select("a","b")).
+  unfold().
+  filter(select(values).is(gt(1))).
+  select(keys)
+----
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/3e54d897/docs/src/recipes/index.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/index.asciidoc b/docs/src/recipes/index.asciidoc
index 9c42f0f..f77b929 100644
--- a/docs/src/recipes/index.asciidoc
+++ b/docs/src/recipes/index.asciidoc
@@ -135,99 +135,4 @@ GitHub and JIRA are linked.  As mentioned earlier in this section, the recipe wi
 committers prior to merging. This process may take several days to complete. We look forward to receiving your
 submissions!
 
-Appendix
-========
-
-Many of the recipes are based on questions and answers provided on the gremlin-users mailing list. This section
-contains a number of traversals from the mailing list that do not easily fit any particular pattern (i.e. a recipe),
-but are nonetheless interesting and thus remain good tools for learning Gremlin.
-
-[[appendix-a]]
-_For each person in a "follows" graph, determine the number of followers and list their names._
-
-[gremlin-groovy]
-----
-g.addV('name','marko').as('marko').
-  addV('name','josh').as('josh').
-  addV('name','daniel').as('daniel').
-  addV('name','matthias').as('matthias').
-  addE('follows').from('josh').to('marko').
-  addE('follows').from('matthias').to('josh').
-  addE('follows').from('daniel').to('josh').
-  addE('follows').from('daniel').to('marko').iterate()
-g.V().as('p').
-  map(__.in('follows').values('name').fold()).
-  group().by(select('p').by('name')).
-          by(project('numFollowers','followers').
-               by(count(local)).by()).next()
-----
-
-[[appendix-b]]
-_In the "modern" graph, show each person, the software they worked on and the co-worker count for the software and
-the names of those co-workers._
-
-[gremlin-groovy,modern]
-----
-g.V().hasLabel("person").as("p").
-  out("created").as("s").
-  map(__.in("created").
-    where(neq("p")).values("name").fold()).
-  group().by(select("p").by("name")).
-    by(group().by(select("s").by("name")).
-    by(project("numCoworkers","coworkers").
-         by(count(local)).by())).next()
-----
-
-[[appendix-c]]
-_Assuming a graph of students, classes and times, detect students who have a conflicting schedule._
-
-[gremlin-groovy]
-----
-g.addV(label, "student", "name", "Pete").as("s1").
-  addV(label, "student", "name", "Joe").as("s2").
-  addV(label, "class", "name", "Java's GC").as("c1").
-  addV(label, "class", "name", "FP Principles").as("c2").
-  addV(label, "class", "name", "Memory Management in C").as("c3").
-  addV(label, "class", "name", "Memory Management in C++").as("c4").
-  addV(label, "timeslot", "date", "11/25/2016", "fromTime", "10:00", "toTime", "11:00").as("t1").
-  addV(label, "timeslot", "date", "11/25/2016", "fromTime", "11:00", "toTime", "12:00").as("t2").
-  addE("attends").from("s1").to("c1").
-  addE("attends").from("s1").to("c2").
-  addE("attends").from("s1").to("c3").
-  addE("attends").from("s1").to("c4").
-  addE("attends").from("s2").to("c2").
-  addE("attends").from("s2").to("c3").
-  addE("allocated").from("c1").to("t1").
-  addE("allocated").from("c1").to("t2").
-  addE("allocated").from("c2").to("t1").
-  addE("allocated").from("c3").to("t2").
-  addE("allocated").from("c4").to("t2").iterate()
-g.V().hasLabel("student").as("s").
-  out("attends").as("c").
-  out("allocated").as("t").
-  select("s").
-  out("attends").
-  where(neq("c")).
-  out("allocated").
-  where(eq("t")).
-  group().
-    by(select("s").by("name")).
-    by(group().by(select("t").by(valueMap("fromTime","toTime"))).
-               by(select("c").dedup().values("name").fold())).next()
-----
-
-[[appendix-d]]
-_In the "modern" graph, with a duplicate edge added, find the vertex pairs that have more than one edge between them._
-
-[gremlin-groovy,modern]
-----
-g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate()
-g.V(1).outE("created")
-g.V().as("a").
-  out().as("b").
-  groupCount().
-    by(select("a","b")).
-  unfold().
-  filter(select(values).is(gt(1))).
-  select(keys)
-----
\ No newline at end of file
+include::appendix.asciidoc[]
\ No newline at end of file