You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tinkerpop.apache.org by GitBox <gi...@apache.org> on 2021/10/26 22:08:39 UTC

[GitHub] [tinkerpop] mschmidt00 commented on a change in pull request #1487: Equality, Equivalence, Comparability, and Orderability Semantics prop…

mschmidt00 commented on a change in pull request #1487:
URL: https://github.com/apache/tinkerpop/pull/1487#discussion_r736957101



##########
File path: docs/src/dev/future/equality_proposal.asciidoc
##########
@@ -0,0 +1,682 @@
+= Proposal for equality, equivalance, comparability and orderability semantics for TinkerPop
+
+== Motivation
+
+How values compare to each other is crucial to the behavior of a query language. While comparison semantics may sound like a trivial question at first, when looking under the surface many interesting questions arise, including aspects around equality and comparability in the context of type casting (e.g., over numerics of different types), slightly different variants of equality being used in different context (e.g. predicates vs. deduplication), questions around comparability and ordering across different logical types, as well as around the identity of elements (such as vertex and edge properties). 
+
+TinkerPop / Gremlin is written in and (partially) relies upon Java / JVM, and there is no clear semantics defined and published for the different types of equality and comparison operations as of today. Rather, what equals what and how values compare is often implicitly defined by the semantics of the underlying Java data structures that are being used, and hence may vary from context to context. We believe that a concise definition of these concepts is critical for both TinkerPop customers — who need to be able to reason about the outcome of their queries — as well as custom implementations of the TinkerPop API, who would benefit from a concise definition to follow. Therefore, TinkerPop should provide a complete and cohesive semantics for equality / comparison such that all Graph providers can easily ensure that their query processing approach aligns with the TinkerPop implementation. Helping customers and implementers alike, this will help increase the adoption of Gremlin as a
  query language. 
+
+This documentation is a proposal that shall serve as a basis for a community discussion on how TinkerPop should handle equality / comparison in different contexts. Motivated by different examples of the status today, we formalize different notions of equality and comparability and describe the contexts in which they apply. While the semantics that we propose is largely aligned with the semantics that is implemented in TinkerPop today, this proposal aims to fill in some existing gaps (such as providing a complete, cross-datatype ordering instead of throwing exceptions) and proposes modifications for a few edge cases, as to make the overall semantics more predictable, coherent, and documentable.    
+
+=== Examples
+
+Below are a couple of example scenarios where defining semantics can help clarify and mitigate inconsistent / undefined behavior in TinkerPop today:
+
+==== The underspecified/undocumented behavior
+
+Consider an equality check such as 
+
+[source]
+----
+gremlin> g.V().has(id, 19)
+----
+
+Without a precise definition, both users and Graph providers don't know whether this query matches only nodes with an ID that is exactly equal to the integer value 19 or, for instance, all numerical values that cast to an Integer value 19. To see that, right now, they need to dig into the TinkerPop code base. While, in the above example, type casting rules apply, in other cases such as
+
+[source]
+----
+gremlin> g.V().property(numericValue).dedup()
+----
+
+the two values above would always be treated as different entities.
+
+==== The behavior that is inherently driven by Java:
+
+Another example is equality over composite type.
+
+[source]
+----
+gremlin> g.V().aggregate("a").out().aggregate("b").cap("a").where(eq("b"))
+----
+
+This query compares two BulkSet objects produced by cap-Step. But the comparison is Java dependent and we don’t have a clear definition of how the comparison works for this kind of types.
+Same as Map, e.g.
+
+[source]
+----
+gremlin> g.V().group().unfold().as("a").V().group().unfold().as("b").where(eq("a", "b"))
+----
+
+and even we have comparison over Map.Entry which is Java dependent type.
+
+[source]
+----
+gremlin> g.V().group().unfold().order() 
+class java.util.HashMap$Node cannot be cast to class java.lang.Comparable (java.util.HashMap$Node and java.lang.Comparable are in module java.base of loader 'bootstrap')
+----
+
+==== Potentially unexpected results due to incompleteness
+
+A query which tries to determine the order across multiple types fails today. 
+
+[source]
+----
+// Propertis have values of Integer and String.
+gremlin> g.V().values("some property").order()
+
+class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
+
+// This query aims to order a heterogeneous result set
+gremlin> g.V().union(out(), outE()).order()
+class org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex cannot be cast to class java.lang.Comparable (org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex is in unnamed module of loader 'app'; java.lang.Comparable is in module java.base of loader 'bootstrap')
+----
+
+It would be more helpful for users to define the complete order across types and returns a result instead of throwing an Exception.
+
+==== Inconsistencies results
+
+Handling for `NaN`, `NULL`, `+0.0`, `-0.0`, `+INF`, `-INF` is tricky, and TinkerPop does not cover all cases consistently at this moment.
+
+[source]
+----
+gremlin> g.V("1").properties("key")
+==>vp[key→0]
+
+// NaN == 0 holds true for this equality check.  
+gremlin> g.V("1").has("key", Double.NaN)
+==>v[1]
+
+gremlin> g.V("1").properties()
+==>vp[key→Infinity]
+
+// 0.0 is interepreted as BigDecial in Groovy and it tries to promote Infinity to BigDecimal as well,
+// then the type casting fails. This is observed when using Java11.
+gremlin> g.V("1").has("key", gt(0.0))
+Character I is neither a decimal digit number, decimal point, nor "e" notation exponential mark.
+----
+
+In the next section, we provide a conceptual proposal to define concepts around how values compare and are ordered, which aims to provide an answer to these and other questions. We seek the feedback from the community to discuss and reach a consensus around the proposal and are open to all other ideas around how these concepts should be defined in TinkerPop / Gremlin.
+
+== Conceptualization of Equality and Comparison
+
+In the above section we used the notions of "equality" and "comparison" in a generalized way. Inspired by the formalization in https://s3.amazonaws.com/artifacts.opencypher.org/openCypher9.pdf[the openCypher specification], we now refine these two notions into four, where we distinguish between equality vs. equivalence and comparability vs. orderability, which constitute two flavors of these concepts tailored to their usage in different concepts.  We summarize and contrast these concepts in the following subsections; more technical details and discussion of edge cases can be found in the technical appendix.
+
+=== Proposed semantics
+
+==== Equality vs. Equivalence
+
+Equality defines when two values are considered equal in the context of database lookups and predicates, while  equivalence defines value collation semantics in the context of, for instance, deduplication. For instance, equivalence over two values `a := Double.NaN` and `b:= Double.NaN` is true, but equality would (in our proposal) be defined as false; the rational here (which is commonly found in query and programming languages) is that comparing two "unknown" numbers — which is a frequent use case for NaN, cannot certainly be identified as equal in comparison, but it typically makes sense to group them together in, for instance, aggregations. 
+
+Both equality and equivalence can be understood as complete, i.e. the result of equality and equivalence checks is always either TRUE or FALSE (in particular, it never returns NULL or throws an exception). The details on equality and equivalence are sketched in the following two subsections, respectively.
+
+===== Equality 
+
+* Used by equality and membership predicates (such as https://github.com/apache/tinkerpop/blob/734f4a8745e797f794c4860962912b04313f312a/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/P.java#L130[P.eq], https://github.com/apache/tinkerpop/blob/734f4a8745e797f794c4860962912b04313f312a/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/P.java#L139[P.neq], and the list membership https://github.com/apache/tinkerpop/blob/72be3549a5e4f99115e9d491e0fc051fff77998a/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/Contains.java#L52[P.within]) in Gremlin. When this eq operator returns TRUE for 2 values (LHS and RHS), by definition LHS and RHS are equal to each other.
+
+* If graph providers need join semantics in query execution, equality should be used to join data over join keys. +
+Example:
+
+[code]
+----
+// equality over 2 ids
+gremlin> g.V().has(id, "some id")
+// equality over vertices
+gremlin> g.V().as("v").out().out().where(eq("v"))
+----
+
+* Equality adheres to type promotion semantics for numerical values, i.e. equality holds for values of different numerical type if they cast into the exactly same same value of the lowest common super type.
+* Other than the type promotion between Numbers, 2 values of different type are always regarded as not equal.
+* Equality checks always return TRUE or FALSE. They never result in NULL output, undefined behavior, nor do they ever throw an error. Detailed behavior is described in
+
+===== Equivalence
+
+* Equivalence defines how TinkerPop deals with 2 values to be grouped or de-duplicated. Specifically it is necessary for the dedup and group steps in Gremlin. +
+Example:
+
+[code]
+----
+// deduplication needs equivalence over 2 property values
+gremlin> g.V().dedup().by("name")
+// grouping by equivalence over 2 property values
+gremlin> g.V().group().by("age") 
+----
+
+* Equivalence ignores type promotion semantics, i.e. two values of different types (e.g. 2^^int vs. 2.0^^float) are always considered to be non-equivalent. (There is an open question whether equivalence takes type promotion into account). +
+
+* For Number, 
+** Because type promotion is not effective, if the types are different then two numbers are never equivalent
+** NaN is not equal to NaN, but equivalent to each other
+
+* Other than the edge case around NaN (and, as of today, Numbers), equivalence in TinkerPop is identical to equality.
+* Like equality, equivalence checks always return TRUE or FALSE. They never result in NULL output, undefined behavior, nor do they ever throw an error.
+
+==== Comparability vs. Orderability
+
+Comparability and orderability can be understood as the "dual" concepts of equality and equivalence for range comparisons (rather than exact comparison). For the 2 values of the same type (except for NaN), comparability is stronger than orderability in the sense that everything that every order between two values that holds TRUE w.r.t. comparability also holds TRUE w.r.t. orderability, but not vice versa. Comparability is what is being used in range predicates. It is restricted to comparison within the same type or, for numerics, class of types; comparability is complete within a given type, but returns NULL if the two types are considered incomparable (e.g., an integer cannot be compared to a string). Orderability fills these gaps, by providing a stable sort order over mixed type results; it is consistent with comparability within a type, and complete both within and across types, i.e. it will never return NULL or throw an exception. +
+More details on comparability and orderability are sketched in the following two subsections, respectively.
+
+===== Comparability
+
+* Used by the comparison operators (https://github.com/apache/tinkerpop/blob/050f66a956ae36ceede55613097cc86e19b8a737/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/Compare.java#L88[P.gt], https://github.com/apache/tinkerpop/blob/050f66a956ae36ceede55613097cc86e19b8a737/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/Compare.java#L138[P.lt], https://github.com/apache/tinkerpop/blob/050f66a956ae36ceede55613097cc86e19b8a737/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/Compare.java#L117[P.gte], https://github.com/apache/tinkerpop/blob/050f66a956ae36ceede55613097cc86e19b8a737/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/Compare.java#L168[P.lte]) in Gremlin and defines how to compare 2 values. +
+Example:
+
+[code]
+----
+// comparison over 2 property values
+gremlin> g.E().has("weight", gt(1))  
+----
+
+* For numbers,
+** it should be aligned to equality conceptually as far as type promotion is concerned. e.g. `1.0 < 2 < 3L`
+* Comparison should not result in undefined behavior, but can return NULL if and only if we are comparing incomparable data types. How this NULL result is handled is Graph provider dependent.
+* Otherwise Comparison does return TRUE or FALSE
+
+===== Orderability
+
+* Used to determine the order. In TinkerPop, the order step follows the notion of orderability.
+* Orderability must not result in NULL / undefined behavior.
+* Orderability must not throw an error. In other words, even if 2 values are incomparable we should still be able to determine the order of those two. This inevitably leads to the requirement to define the order across different data types. For the detailed order across types, see appendix.
+* Orderability determines if 2 values are ordered at the same position or one value is positioned earlier than another.
+* The concept of equivalence is used to determine if the 2 values are at the same position
+* When the position is identical, which value comes first (in other words, whether it should perform stable sort) depends on graph providers' implementation.
+* For values of the same type, comparability can be used to determine which comes first except for NaN in Number. For a different type, we have a dedicated order as described in the section below.
+
+===== Mapping table for TinkerPop operators
+
+Shown as below is a table for which notion proposed above each TinkerPop construct used.
+
+[%header]
+|================
+|Construct|Concept                
+|P.eq     |Equality               
+|P.neq    |Equality               
+|P.within |Equality               
+|P.without|Equality               
+|P.lt     |Comparability          
+|P.gt     |Comparability          
+|P.lte    |Equality, Comparability
+|P.gte    |Equality, Comparability
+|P.inside |Comparability          
+|P.outside|Comparability          
+|P.between|Equality, Comparability
+|================
+
+== What would change ?
+
+=== Semantics
+
+In terms of Semantics, right now TinkerPop does not have formal semantics to define these characteristics introduced in this proposal. Therefore this semantics should be published on the official TinkerPop doc.
+
+=== Behavioral changes
+==== Equality
+
+* NaN +
+JDK11 seems to produce a different error from JDK8 when it comes to BigDecimal comparisons that hit NaN and such. For JDK8 they seem to produce NumberFormatException but for JDK11 you get stuff like:
+
+[code]
+----
+gremlin> g.V().has("key", Float.NaN)
+Character N is neither a decimal digit number, decimal point, nor "e" notation exponential mark.
+----
+When Double / Float Number is stored, it always throws. With the proposed change, it wouldn't throw but because NaN is not equal to any numbers this returns empty result.
+
+* BigDecimal +
+Equality around BigDecimal and special values which cannot be parsed as Integer such as NaN, INF should not produce exceptions and should filter.
+
+[code]
+----
+gremlin> g.addV().property('key',Float.NaN)
+==>v[0]
+gremlin> g.addV().property('key',1.0f)
+==>v[2]
+gremlin> g.V().has('key',Float.NaN)
+==>v[0]
+gremlin> g.V().has('key',1.0f)
+==>v[2]
+gremlin> g.V().values("key").is(eq(1.0f)) // 3.5.x
+==>1.0
+gremlin> g.V().has('key',1.0) // 3.5.x - likely due to Groovy going to BigDecimal for "1.0"
+java.lang.NumberFormatException
+Type ':help' or ':h' for help.
+Display stack trace? [yN]n
+gremlin> g.V().values("key").is(eq(new BigDecimal(1.0f))) // 3.5.x
+java.lang.NumberFormatException
+Type ':help' or ':h' for help.
+Display stack trace? [yN]
+gremlin> g.V().has('key',1.0) // proposed
+==>v[2]
+gremlin> g.V().values("key").is(eq(1.0)) // proposed
+==>1.0
+----
+
+==== Comparability
+
+* NaN +
+Comparing on NaN should return no results.
+
+[code]
+----
+gremlin> g.addV().property('key',-5)
+==>v[0]
+gremlin> g.addV().property('key',0)
+==>v[2]
+gremlin> g.addV().property('key',5)
+==>v[4]
+gremlin> g.addV().property('key',Double.NaN)
+==>v[6]
+gremlin> g.V().values("key").is(lte(Double.NaN)) // 3.5.x
+==>-5
+==>0
+==>NaN
+gremlin> g.V().values("key").is(gte(Double.NaN)) // 3.5.x
+==>0
+==>5
+==>NaN
+gremlin> g.V().values("key").is(lt(Double.NaN)) // 3.5.x
+==>-5
+gremlin> g.V().values("key").is(gt(Double.NaN)) // 3.5.x
+==>5
+gremlin> g.V().values("key").is(lte(Double.NaN)) // proposed
+==>NaN
+gremlin> g.V().values("key").is(gte(Double.NaN)) // proposed
+==>NaN
+gremlin> g.V().values("key").is(lte(Double.NaN)) // proposed
+gremlin> g.V().values("key").is(gte(Double.NaN)) // proposed
+----
+
+* Comparability throws exception today but based on the proposal, it returns NULL when comparing incompatibile types.
+  ** When Vertex / Edge / VertexProperty  is compared, today it throws but it should return NULL.
+  ** When NULL is compared, today it throws an exception but it should return NULL. 
+
+==== Equivalence
+
+TinkerPop today uses a hash value for original values for grouping and the behavior is unchanged.
+
+==== Orderability
+
+- Currently, TinkerPop follows comparability for orderability, thus non-comparable and mixed-type values will fail in ordering. The proposed change is to be able to order any types.
+
+[code]
+----
+gremlin> g.V().order(). // 3.5.x
+org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex cannot be cast to java.lang.Comparable
+Type ':help' or ':h' for help.
+Display stack trace? [yN]
+gremlin> g.V(1).values('name').union(identity(),V(2)).order() // 3.5.x
+org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex cannot be cast to java.lang.Comparable
+Type ':help' or ':h' for help.
+Display stack trace? [yN]n
+gremlin> g.V().order()  // proposed
+==>v[1]
+==>v[2]
+==>v[3]
+==>v[4]
+==>v[5]
+==>v[6]
+gremlin> g.V(1).values('name').union(identity(),V(2)).order() // proposed
+==>v[2]
+==>marko
+gremlin> g.addV().property("key", 100)
+==>v[0]
+gremlin> g.addV().property("key", "100000")
+==>v[2]
+gremlin> g.V().values('key').order() // 3.5.x
+java.lang.Integer cannot be cast to java.lang.String
+Type ':help' or ':h' for help.
+Display stack trace? [yN]
+gremlin> g.V().values('key').order() // proposed
+==>100
+==>100000
+----
+
+== Open Questions
+
+* Should we take type-promotion into account in terms of equivalence ? +
+[code]
+----
+// In this case below,
+gremlin> g.V().property()
+==>[key:1.0]
+==>[key:1]
+
+// which is more natural, whether we don't de-duplicate them
+gremlin> g.V().property().dedup()
+==>[key:1.0]
+==>[key:1]
+
+// or de-dup them
+gremlin> g.V().property().dedup()
+==>[key:1.0]
+----
+        
+If de-duping, there is another question which value we should filter out. We need to define priority over types in Number. 
+Also note that TinkerPop is Java based and we have Double.NaN and Float.NaN, ±Double.INF and ±Float.INF. Not adhering type casting means, for example, Double.NaN and Float.NaN is not de-duplicated / grouped according to the semantics.
+        
+* Map.Entry is Java dependent type. Instead of defining semantics for Map.Entry, do we introduce a concept of like key-value tuple for it to generalize ?

Review comment:
       The alternative would be to leave a little more flexibility for implementations, and just define (non-exhaustive) contracts that implementations must satisfy when comparing maps (for instance, two maps cannot compare equal if the unordered set of key-value pairs compares non-equal). TP builds internally on LinkedHashMaps, which would make the above a good match -- but other implementations may differ and use unordered representations. The question here is how critical the order within a map is to key use cases of the query language -- any thoughts on that @joshsh? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org