You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@tinkerpop.apache.org by svanteschubert <gi...@git.apache.org> on 2018/07/15 18:59:47 UTC

[GitHub] tinkerpop pull request #892: Tinkerpop 2007

GitHub user svanteschubert opened a pull request:

    https://github.com/apache/tinkerpop/pull/892

    Tinkerpop 2007

    For ease of handling, I let the easier fix of [TINKERPOP-2007](https://issues.apache.org/jira/browse/TINKERPOP-2007) depend on [TINKERPOP-2006](https://issues.apache.org/jira/browse/TINKERPOP-2006).
    
    We might consider extending the fix to overwrite all Comparators not only those for GraphML.
    I did this once in a test run, but then I would need to exchange also JSON test files and did not oversee all implications like in the isolation of the GraphMLWriter.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/svanteschubert/tinkerpop TINKERPOP-2007

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tinkerpop/pull/892.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #892
    
----
commit cfc7ecefd370ac438afea03f1e6d718f1eff0fbe
Author: Svante Schubert <sv...@...>
Date:   2018-07-15T18:26:10Z

    TINKERPOP-2006 - Fix for valid GraphML export when graph properties of a vertex and edge have similar name

commit 10292f4ead80796748f3ecc8cba8a8b0412afe14
Author: Svante Schubert <sv...@...>
Date:   2018-07-15T18:33:16Z

    TINKERPOP-2007 - GraphML export is normalized by default and better ordered to ease 3rd party regression testing

----


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    @robertdale  Generally you are right, but here we are talking about normalisation, which implies a certain order. 
    
    I find your statement to test XML with XML tools and graphs with graph tools hard to follow.
    It sounds a bit like the eat-your-own-dogfood principle, which is not bad in general, but not applicable here.
    One goal of regression tests is to run fast, another is to provide human-readable output.
    A line based compare is quite appropriate for this task, XML tools are far more unnecessary complex and time-consuming. 


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    a) Regression testing seems easier when there is a deterministic way of serialization, in case this is already, the normalization is fine to be optional but at least reachable by API via Gremlin console for GraphML serialization. Would you agree to this? Otherwise, we can dismiss this patch.
    Could you provide some console access for the normalization, Stephen?
    b) Shall we check for integer and/or long and do the fall-back to the simple string comparison or what would be your suggestion? Remember, the integer ID is the default key of graphs created by Tinkerpop graph and the current normalization is far from the human way of sorting integer.  This seems important to me, as GraphML is especially easy to view and edit the XML and do testing based on this.
    
    Again, if you do not like this at all, let's dismiss simply the patch and save us the time. ;-)


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    > Do you think there is a performance real performance penalty when we do sorting for serialization?
    
    The graphs you have been working with are small (judging from your questions on the mailing list you mentioned only a few hundred edges in size). That's not a big deal - you will likely sort that with no performance penalty. On the other hand if you were to try to stream out a significantly larger graph that isn't completely in-memory then I believe the sort would add cost. If you can demonstrate that there is no cost then I would say we not only normalize by default but that we deprecate the configuration option and get rid of it. I think we'd always like to see normalization if there is no penalty for using it.
    
    > I am available for a code review over VOIP if you like to speed up things, Stephen.
    
    I'm in no rush :smile: - we have no releases officially planned so we can take our time. Thanks for the offer though.
    



---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    As we do not know anything about the nature of the user given keys, we can not provide the 'correct' sorting algorithm for normalization. In this case, I tend to favour the natural ordering of integer, which is currently just 'broken' by the given JDK String sort.
    The downside is the performance check on the existence integers, which we could implement that it works only if it works for all and we would only offer for an optional normalization.
    
    In case you agree: I would provide an update on the patch. As both patches seem to be overlapping, would it be fine for you to provide a single patch, again? 


---

[GitHub] tinkerpop pull request #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on a diff in the pull request:

    https://github.com/apache/tinkerpop/pull/892#discussion_r202642609
  
    --- Diff: gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/io/graphml/GraphMLWriter.java ---
    @@ -18,32 +18,32 @@
      */
     package org.apache.tinkerpop.gremlin.structure.io.graphml;
     
    -import org.apache.tinkerpop.gremlin.structure.Direction;
    -import org.apache.tinkerpop.gremlin.structure.Edge;
    -import org.apache.tinkerpop.gremlin.structure.Element;
    -import org.apache.tinkerpop.gremlin.structure.Graph;
    -import org.apache.tinkerpop.gremlin.structure.Property;
    -import org.apache.tinkerpop.gremlin.structure.Vertex;
    -import org.apache.tinkerpop.gremlin.structure.VertexProperty;
    -import org.apache.tinkerpop.gremlin.structure.io.GraphWriter;
    -import org.apache.tinkerpop.gremlin.structure.io.Io;
    -import org.apache.tinkerpop.gremlin.structure.util.Comparators;
    -import org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils;
    -
    -import javax.xml.XMLConstants;
    -import javax.xml.stream.XMLOutputFactory;
    -import javax.xml.stream.XMLStreamException;
    -import javax.xml.stream.XMLStreamWriter;
     import java.io.IOException;
     import java.io.OutputStream;
     import java.util.ArrayList;
     import java.util.Collection;
     import java.util.Collections;
    +import java.util.Comparator;
     import java.util.HashMap;
     import java.util.Iterator;
     import java.util.List;
     import java.util.Map;
     import java.util.Optional;
    +import javax.xml.XMLConstants;
    +import javax.xml.stream.XMLOutputFactory;
    +import javax.xml.stream.XMLStreamException;
    +import javax.xml.stream.XMLStreamWriter;
    +import org.apache.commons.collections.CollectionUtils;
    --- End diff --
    
    Note that our code formatting is such that tinkerpop packages are first. Perhaps your IDE is configured to do this differently and quietly made the change in the background.


---

[GitHub] tinkerpop pull request #892: Tinkerpop 2007

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/tinkerpop/pull/892


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    Yes, let us not waste any further time on such a trivial topic. 
    
    If everyone else feels fine with an optional normalization of integer keys jumping around than it should be fine for me as well and I will work on a local fixed version... ;-)
    
    You may close this patch and I will provide one solely for the previous one creating a new branch and patch.


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    Thanks for contributing to Apache TinkerPop. I've done some initial review of your pull request here - I wouldn't say that I'm done with comments, but I think I posted enough there to get discussion started.


---

[GitHub] tinkerpop pull request #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on a diff in the pull request:

    https://github.com/apache/tinkerpop/pull/892#discussion_r202651724
  
    --- Diff: gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/io/graphml/GraphMLWriter.java ---
    @@ -58,13 +58,14 @@
      */
     public final class GraphMLWriter implements GraphWriter {
         private final XMLOutputFactory inputFactory = XMLOutputFactory.newInstance();
    -    private boolean normalize = false;
    +    private boolean normalize = true;
    --- End diff --
    
    I don't think we should normalize by default. We should offer the cheapest/fastest option as the default.


---

[GitHub] tinkerpop pull request #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on a diff in the pull request:

    https://github.com/apache/tinkerpop/pull/892#discussion_r202653620
  
    --- Diff: gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/io/graphml/GraphMLWriter.java ---
    @@ -218,20 +220,33 @@ private void writeTypes(final Map<String, String> identifiedVertexKeyTypes,
                                 final Map<String, String> identifiedEdgeKeyTypes,
                                 final XMLStreamWriter writer) throws XMLStreamException {
             // <key id="weight" for="edge" attr.name="weight" attr.type="float"/>
    -        final Collection<String> vertexKeySet = getVertexKeysAndNormalizeIfRequired(identifiedVertexKeyTypes);
    +        Collection<String> vertexKeySet = getDataStructureKeysAndNormalizeIfRequired(identifiedVertexKeyTypes);
    +        Collection<String> edgeKeySet = getDataStructureKeysAndNormalizeIfRequired(identifiedEdgeKeyTypes);
    +        // in case vertex and edge may have the same attribute name, the key id in graphml have to be different
    +        intersection = CollectionUtils.intersection(vertexKeySet, edgeKeySet);
    --- End diff --
    
    In one respect, I think that using a suffix in the `id` is a good idea generally - in other words, don't bother to even track the intersections, just add the suffix, but I'm not sure how different tools use the `id` attribute. I assume they use it as we use it, just to look up the `attr.name` and `attr.value` - if so, then the `id` is simply internal to the graphml and you don't need to track for intersections to add the suffix.
    
    If that is not the case and the choice is to not generalize this, then we'd need tests to cover this logic.


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    Long sounds fine. Good catch!
    
    The sort algorithm I suggested is indeed a heuristic, still a better heuristic as the default String sort by Java ;-)
    Ordering ["a100","100a","1a1"] in my head seems not easy and is more a guessing, I am happy with the given result, yes! ;-)
    
    Okay, I agree that performance penalty by default is a suboptimal idea, then let's go the other way and make debug/normalization easier accessible by command line. I suggest adding the debug option even in the [TinkerPop Docu GraphML initial mentioning](tinkerpop.apache.org/docs/current/reference/#_graphml_reader_writer) (I suggest to add this option for other formats as JSON as well). 
    
    The new normalized output is more predictable (at least for me) ;-) and would ease the analysis of regressions tests as I plan to do for my usage. Indeed, not mandatory but I find it annoying enough to spend my weekend fixing it...
    
    PS: Sorry for the import reordering, using Netbeans and the reording of imports happens on save, I will install the latest Apache Netbeans version and see how I prevent this noice from happening next time!
    
    Thanks for your quick review, Stephen!
    Svante


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
     I use to work on the console, stability and clarity comes there before the
    performance when drafting on a solution (my product version would run from
    the console anyway).
    
    Do you think there is a performance real performance penalty when we do
    sorting for serialization?
    
    I am available for a code review over VOIP if you like to speed up things,
    Stephen.
    
    PS: The regarding your question, I looked it up on the GraphML Primer, it
    is an internal ID, see
    http://graphml.graphdrawing.org/primer/graphml-primer.html#AttributesDefinition
    
    2018-07-16 13:45 GMT+02:00 stephen mallette <no...@github.com>:
    
    > *@spmallette* commented on this pull request.
    > ------------------------------
    >
    > In gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/
    > structure/io/graphml/GraphMLWriter.java
    > <https://github.com/apache/tinkerpop/pull/892#discussion_r202653620>:
    >
    > > @@ -218,20 +220,33 @@ private void writeTypes(final Map<String, String> identifiedVertexKeyTypes,
    >                              final Map<String, String> identifiedEdgeKeyTypes,
    >                              final XMLStreamWriter writer) throws XMLStreamException {
    >          // <key id="weight" for="edge" attr.name="weight" attr.type="float"/>
    > -        final Collection<String> vertexKeySet = getVertexKeysAndNormalizeIfRequired(identifiedVertexKeyTypes);
    > +        Collection<String> vertexKeySet = getDataStructureKeysAndNormalizeIfRequired(identifiedVertexKeyTypes);
    > +        Collection<String> edgeKeySet = getDataStructureKeysAndNormalizeIfRequired(identifiedEdgeKeyTypes);
    > +        // in case vertex and edge may have the same attribute name, the key id in graphml have to be different
    > +        intersection = CollectionUtils.intersection(vertexKeySet, edgeKeySet);
    >
    > In one respect, I think that using a suffix in the id is a good idea
    > generally - in other words, don't bother to even track the intersections,
    > just add the suffix, but I'm not sure how different tools use the id
    > attribute. I assume they use it as we use it, just to look up the
    > attr.name and attr.value - if so, then the id is simply internal to the
    > graphml and you don't need to track for intersections to add the suffix.
    >
    > If that is not the case and the choice is to not generalize this, then
    > we'd need tests to cover this logic.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/tinkerpop/pull/892#pullrequestreview-137390995>,
    > or mute the thread
    > <https://github.com/notifications/unsubscribe-auth/AAyW2-CrJxLMO6tNEZK0-nwK9TZ8TIl9ks5uHHzmgaJpZM4VQWY9>
    > .
    >



---

[GitHub] tinkerpop pull request #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on a diff in the pull request:

    https://github.com/apache/tinkerpop/pull/892#discussion_r202651403
  
    --- Diff: gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/io/graphml/GraphMLWriter.java ---
    @@ -242,7 +257,31 @@ private void writeTypes(final Map<String, String> identifiedVertexKeyTypes,
         private void writeEdges(final XMLStreamWriter writer, final Graph graph) throws XMLStreamException {
             if (normalize) {
                 final List<Edge> edges = IteratorUtils.list(graph.edges());
    -            Collections.sort(edges, Comparators.ELEMENT_COMPARATOR);
    +            Collections.sort(edges, new Comparator<Edge>() {
    --- End diff --
    
    Your algorithm for sorting leaves me with some concerns. It assumes that a "string of numbers" will fit into the integer space. I suppose that you could use `Long` instead, but what if the "string of numbers" is `39098403984093284023984032840329840389430984` - you'd still blow the space. 
    
    Maybe I'm not following the code properly but if you had keys like this:
    
    ```text
    gremlin> x = ["a100","100a","1a1"]
    ==>a100
    ==>100a
    ==>1a1
    ```
    
    and you sorted them with existing sorting you get:
    
    ```text
    gremlin> Collections.sort(x, Comparator.comparing({ it.toString() }, String.CASE_INSENSITIVE_ORDER))
    gremlin> x
    ==>100a
    ==>1a1
    ==>a100
    ```
    
    with your approach I get:
    
    ```text
    gremlin> extractInt = { s ->
    ......1>   num = s.replaceAll("\\D", "");
    ......2>   return num.isEmpty() ? 0 : Integer.parseInt(num);
    ......3> }
    gremlin> x = ["a100","100a","1a1"]
    ==>a100
    ==>100a
    ==>1a1
    gremlin> x.sort{a,b ->
    ......1> o1fullString = a.toString();
    ......2> o2fullString = b.toString();
    ......3> 
    ......3> o1StringPart = o1fullString.replaceAll("\\d", "");
    ......4> o2StringPart = o2fullString.replaceAll("\\d", "");
    ......5> 
    ......5> if(o1StringPart.equalsIgnoreCase(o2StringPart)) {
    ......6>   return extractInt(o1fullString) - extractInt(o2fullString);
    ......7> }
    ......8> return o1fullString.compareTo(o2fullString);
    ......9> }
    ==>1a1
    ==>a100
    ==>100a
    ```
    
    Is that what you expected to have happen? 
    
    With IDs you really can't assume too much because TinkerPop is not ID aware. The underlying graph implementation is the only thing that really knows anything about what the type of format of the ID is.



---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by svanteschubert <gi...@git.apache.org>.

Github user svanteschubert commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    Exactly, in case of normalization of GraphML output use the existing string sort heuristic as a fallback and test the keys for Integer and cast them. 
    Perhaps the goal to allow natural sorting for integer can be implemented more efficiently, but that's what I desire - keeping the natural order of keys consisting only of integers.


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    > at least reachable by API via Gremlin console for GraphML serialization
    
    It's reachable - just configured the `GraphWriter` [directly](http://tinkerpop.apache.org/docs/current/reference/#_graphml_reader_writer) rather than rely on the default provided by `graph.io(graphml())`
    
    ```java
    final Graph graph = TinkerFactory.createModern();
    try (final OutputStream os = new FileOutputStream("tinkerpop-modern.xml")) {
        graph.io(IoCore.graphml()).writer().normalize(true).create().writeGraph(os, graph);
    }
    ```
    
    > Again, if you do not like this at all, let's dismiss simply the patch and save us the time
    
    i think #891 should definitely go in as there is not much to think about on that one. this one is fine, but the sorting should make sense for the full scope of identifiers from all different graphs. It can't be optimized to work for just TinkerGraph defaulted to `Integer` values - it wouldn't be fair to other graphs if we had that kind of favoritism :smile:  I can think about the sorting some more, but right now i'm not sure what would be best.


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    > Long sounds fine.
    
    I don't think that `Long` by itself is good enough. As I mentioned in my comments, you can't assume you know anything about the identifier. TinkerPop has no knowledge of what it contains.  It could be that a `Graph` implementation uses numbers that exceed the size of `Long` which would generate an error in your current code.
    
    > The sort algorithm I suggested is indeed a heuristic, still a better heuristic as the default String sort by Java 
    
    What you have may be better for identifiers that have pure numerics, but imo, it makes sorts of strings mixed with numbers less intuitive that the default. I think that if you want your change for numbers then you need to make them both make sense. 
    
    > let's go the other way and make debug/normalization easier accessible by command line. I suggest adding the debug option even in the TinkerPop Docu GraphML initial mentioning (I suggest to add this option for other formats as JSON as well).
    
    Your link in the above didn't work, but I think I know what you are referring to. I guess I don't have a problem with calling more attention to the normalization option, but I can't say that I fully follow why normalization should be so important to anyone. 
    
    The only use case that I am aware of is if you were doing file comparisons/diffs on generated files - normalization would help you there, especially if you were pushing such files into source control or something similar. But, to me, that's not a common use case. I would venture to say that most people just want to "save their data" and don't really care about looking at it in any way. 
    
    Do you have some other use case that makes this feature so important? You mentioned "analysis of regression tests" but I would like to hear more about that. We tend to be careful about features that don't suit wide use cases.
    
    >  Sorry for the import reordering,
    
    No problem - happens all the time with all different IDEs
    
    A final point, depending on the complexity of how this sorting algorithm ends up, I think that it might be important for us to see what the performance differences are between your changes and what we had before. That's nothing that we need to have immediately as the PR is still changing, but I just wanted to let you know that it may be necessary to fully evaluate this PR.


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    I'm sorry but I'm not following what you're proposing for the id sorting. Just to be clear, are you saying you want to convert to integer if the id is integer and sort on that otherwise, use the string representation of the id for sorting?


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    Ok - personally, I'm not in favor of an `Integer` only approach because it is too narrow a case and only helps TinkerGraph in one specific situation. I think that for your use case of regression testing you already have a normalized GraphML file so that's already ok, but you don't have the "human readable" order you would like which seems like less of a concern in the bigger picture of things. 
    
    perhaps we just focus on getting #891 merged and leave the ordering as it currently is?


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by robertdale <gi...@git.apache.org>.

Github user robertdale commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    I don't like this patch.  If the regression tests rely on some sort order in xml, they're doing it wrong.  If it's the graph being validated, then the test should pull it into a graph engine and validate using graph queries.  If the xml is being validated, then the test should use xml tools, e.g. xmldiff, schema validation, etc.  If the test really requires sort order, which it shouldn't because GraphML does not guarantee any order, then it should transform the xml with xslt before validating it.
    
    That said, the writer already provides an optional, **deterministic** order feature.   If someone doesn't like that order, then someone else won't like this order.  So, I think the right solution would be to be able to provide a custom comparator as a parameter and let the user create any order they see fit.


---

[GitHub] tinkerpop issue #892: Tinkerpop 2007

Posted by spmallette <gi...@git.apache.org>.

Github user spmallette commented on the issue:

    https://github.com/apache/tinkerpop/pull/892
  
    Thanks for understanding - i'll see you over on #891 


---