You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jena.apache.org by kinow <gi...@git.apache.org> on 2017/04/12 12:52:39 UTC

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

GitHub user kinow opened a pull request:

    https://github.com/apache/jena/pull/237

    JENA-1313: compare using a Collator when both literals are tagged with same language

    Mimics the behaviour of Dydra described [here](http://blog.dydra.com/2015/05/06/collation).
    
    When there are strings with the same language, then instead of simply comparing the text, it uses [java.text.Collator](https://docs.oracle.com/javase/7/docs/api/java/text/Collator.html) and the language locale to compare strings.
    
    This does not create a collate:collate function as described in JENA-1313 as a possible solution, but could be still useful for users that expect the sort results to follow the values' language collation.
    
    Needs further tests and discussion.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kinow/jena JENA-1313-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #237
    
----
commit fdcfc6307d7d0f4cbd850adeeb48d3ca9300c266
Author: Bruno P. Kinoshita <br...@yahoo.com.br>
Date:   2017-04-12T12:44:42Z

    JENA-1313: compare using a Collator when both literals are tagged with same language

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    @osma,
    
    >Did you by any chance test this with the performance test case that I wrote up earlier? I'd like to know how it compares to a plain ORDER BY in terms of performance. I can test that myself too when I have a suitable slot of time, but that might take a while since many deadlines are coming up in the next few days...
    
    Well remembered. Updated my sandbox to include [a JMH test](https://github.com/kinow/jena-arq-filter/blob/master/src/test/java/br/eti/kinoshita/jena/ArqOrderByTest.java#L24).
    
    Initial version was using the average time. Here are the results.
    
    ```
    Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByCollation":
      3058822481.830 ±(99.9%) 51737778.554 ns/op [Average]
      (min, avg, max) = (2669383311.000, 3058822481.830, 3841515554.000), stdev = 219060994.044
      CI (99.9%): [3007084703.276, 3110560260.384] (assumes normal distribution)
    
    Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByLang":
      3017545546.455 ±(99.9%) 47500397.951 ns/op [Average]
      (min, avg, max) = (2646169688.000, 3017545546.455, 3499258012.000), stdev = 201119659.239
      CI (99.9%): [2970045148.504, 3065045944.406] (assumes normal distribution)
    
    
    # Run complete. Total time: 00:41:33
    
    Benchmark                            Mode  Cnt           Score          Error  Units
    ArqOrderByTest.testOrderByCollation  avgt  200  3058822481.830 ± 51737778.554  ns/op
    ArqOrderByTest.testOrderByLang       avgt  200  3017545546.455 ± 47500397.951  ns/op
    ```
    
    Then updated it to actually benchmark the throughput.
    
    ```
    Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByCollation":
      ≈ 10⁻⁹ ops/ns
    
    Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByLang":
      ≈ 10⁻⁹ ops/ns
    
    
    Benchmark                             Mode  Cnt   Score     Error   Units
    ArqOrderByTest.testOrderByCollation  thrpt  200  ≈ 10⁻⁹            ops/ns
    ArqOrderByTest.testOrderByLang       thrpt  200  ≈ 10⁻⁹            ops/ns
    ```
    
    Throughput displays no difference. Average time was about the same for minimum, but average and max displayed a slight increase when using collation. But I think the overhead won't be really noticeable for end users.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r114431718
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -954,7 +971,8 @@ public boolean isDayTimeDuration()
         public boolean     getBoolean()     { raise(new ExprEvalTypeException("Not a boolean: "+this)) ; return false ; }
         public String      getString()      { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
         public String      getLang()        { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
    -    
    +    public String      getCollation()   { raise(new ExprEvalTypeException("Not a collation: "+this)) ; return null ; }
    --- End diff --
    
    Nice catch! I'm updating the code, experimenting with the idea of having a `NodeValueSortKey` type instead. If that works well, will update the message to "Not a sort key"... 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r111250998
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -784,10 +786,15 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return x ;
                     }
     
    -                // same lang tag (case insensitive)
    -                x = StrUtils.strCompare(node1.getLiteralLexicalForm(), node2.getLiteralLexicalForm()) ;
    +                // same lang tag, handle collation
    +                // TBD: cache locales? cache collators? pre define both/any? a simple in-memory lru-map-cache?
    +                Locale desiredLocale = Locale.forLanguageTag(node1.getLiteralLanguage());
    +                Collator collator = Collator.getInstance(desiredLocale);
    --- End diff --
    
    Yes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116374282
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -978,7 +967,6 @@ public boolean isDayTimeDuration()
         public boolean     getBoolean()     { raise(new ExprEvalTypeException("Not a boolean: "+this)) ; return false ; }
         public String      getString()      { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
         public String      getLang()        { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
    -    public String      getCollation()   { raise(new ExprEvalTypeException("Not a sort key: "+this)) ; return null ; }
     
    --- End diff --
    
    Yup, that makes sense. See comment above. I'd be fine re-adding this method, and making it return a NodeValueSortKey, though wouldn't mind postponing it to try some generics-fu first :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r111237018
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -784,10 +786,15 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return x ;
                     }
     
    -                // same lang tag (case insensitive)
    -                x = StrUtils.strCompare(node1.getLiteralLexicalForm(), node2.getLiteralLexicalForm()) ;
    +                // same lang tag, handle collation
    +                // TBD: cache locales? cache collators? pre define both/any? a simple in-memory lru-map-cache?
    +                Locale desiredLocale = Locale.forLanguageTag(node1.getLiteralLanguage());
    +                Collator collator = Collator.getInstance(desiredLocale);
    --- End diff --
    
    Or would we want to collate all literals with the same lang tag together and for those with a lang-specific collator available, subsort according thereto?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116365710
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -774,19 +772,10 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                 }
                 case VSPACE_SORTKEY :
                 {
    -                int cmp = 0;
    -                String c1 = nv1.getCollation();
    -                String c2 = nv2.getCollation();
    -                if (c1 != null && c2 != null && c1.equals(c2)) {
    -                    // locales are parsed. Here we could think about caching if necessary
    -                    Locale desiredLocale = Locale.forLanguageTag(c1);
    -                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    -                    Collator collator = Collator.getInstance(desiredLocale);
    -                    cmp = collator.compare(nv1.getString(), nv2.getString());
    -                } else {
    -                    cmp = XSDFuncOp.compareString(nv1, nv2) ;
    +                if (!(nv1 instanceof NodeValueSortKey) || !(nv2 instanceof NodeValueSortKey)) {
    +                    raise(new ExprNotComparableException("Can't compare (not node value sort keys) "+nv1+" and "+nv2)) ;
                     }
    -                return cmp;
    +                return ((NodeValueSortKey) nv1).compareTo((NodeValueSortKey) nv2);
                 }
    --- End diff --
    
    Should that be `nv1.getSortKey()`?  All the NodeValues have downcast operations getXYZ().


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/237
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    +1 once merged, I can start working on documentation in the website :+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/237
  
    I've been thinking about this, and I can't see how this could produce a usable order when there are several language tags (even subtags) involved. For example, in a multilingual SKOS thesaurus, it's quite likely that there are `en-US` and `en-GB` labels mixed together (I know at least a couple of thesauri that do this in their SKOS files). Think about e.g.
    
    ```
    ex:zea_mays a skos:Concept ;
      skos:prefLabel "corn"@en-US, "maize"@en-GB .
    
    ex:coffee a skos:Concept ;
      skos:prefLabel "coffee"@en-US, "coffee"@en-GB .
    ```
    
    Now an ORDER BY that sorts primarily by language tag, then by language-specific collation rules, would order these skos:prefLabels as:
    
    1. `"coffee"@en-GB`
    1. `"maize"@en-GB`
    1. `"coffee"@en-US`
    1. `"corn"@en-US`
    
    I have a hard time seeing how this order would be useful to anyone. These are all English language words; as a user I don't care whether they are sorted by GB or US collation rules (even if they differed, as in fr-CA vs. fr-FR), but this is clearly worse than the current behavior which sorts first by lexical value, then by language tag.
    
    My conclusion of this thought experiment is that there should be a way to specify the collation order in the ORDER BY statement independent of the language tags of the literals being sorted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116365872
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -774,19 +772,10 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                 }
                 case VSPACE_SORTKEY :
                 {
    -                int cmp = 0;
    -                String c1 = nv1.getCollation();
    -                String c2 = nv2.getCollation();
    -                if (c1 != null && c2 != null && c1.equals(c2)) {
    -                    // locales are parsed. Here we could think about caching if necessary
    -                    Locale desiredLocale = Locale.forLanguageTag(c1);
    -                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    -                    Collator collator = Collator.getInstance(desiredLocale);
    -                    cmp = collator.compare(nv1.getString(), nv2.getString());
    -                } else {
    -                    cmp = XSDFuncOp.compareString(nv1, nv2) ;
    +                if (!(nv1 instanceof NodeValueSortKey) || !(nv2 instanceof NodeValueSortKey)) {
    +                    raise(new ExprNotComparableException("Can't compare (not node value sort keys) "+nv1+" and "+nv2)) ;
    --- End diff --
    
    That shouldn't happen if the comparison is VALUE_SORTKEY


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    @ajs6f,
    
    The first param, you mean the collation? If we get two NodeValueSortKey with different collation language tags, it will sort the text as normal strings (i.e. ignoring locale specific collation orders, using JVM default behaviour of String.compareTo I think).
    
    Not sure if we could take another action here...
    
    >we mark it as unsupported?
    
    I think so, and add to the function documentation (probably somewhere [here](https://jena.apache.org/documentation/query/library-function.html) ?) about the implementation details, risks, considerations, and so on.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    If there are no objections I will merge this PR tomorrow or over the weekend, and will start working on the documentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/237
  
    @kinow I think this looks promising! I don't have much comments about the implementation code, but just being able to use `ORDER BY arq:collation("fi", ?label)` seems that it would do a good job of solving my original problem. I do like the way you have smuggled in the collation information so that it's accessible to the `NodeValue.compare` function and it can thus rely on `Collator.compare`, which should be pretty fast. Maybe there's a more elegant way of doing that smuggling, but at least it seems to get the job done based on your example results.
    
    Did you by any chance test this with the [performance test case](https://issues.apache.org/jira/browse/JENA-1313?focusedCommentId=15963023&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15963023) that I wrote up earlier? I'd like to know how it compares to a plain ORDER BY in terms of performance. I can test that myself too when I have a suitable slot of time, but that might take a while since many deadlines are coming up in the next few days...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/237
  
    +1 for merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r111201417
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -784,10 +786,15 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return x ;
                     }
     
    -                // same lang tag (case insensitive)
    -                x = StrUtils.strCompare(node1.getLiteralLexicalForm(), node2.getLiteralLexicalForm()) ;
    +                // same lang tag, handle collation
    +                // TBD: cache locales? cache collators? pre define both/any? a simple in-memory lru-map-cache?
    +                Locale desiredLocale = Locale.forLanguageTag(node1.getLiteralLanguage());
    +                Collator collator = Collator.getInstance(desiredLocale);
    --- End diff --
    
    This needs to be much more defensively coded. Language tags are not constrained to be valid languages so they could be no locale or collation associated with a given tag to make sure that we fall back to the existing behaviour for unknown languages


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116378590
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    I thought that in this case, `getCollation` was not just a control function. I thought it was the equivalent of `getInteger` for an int node, and `getString` for a string.
    
    So in this case `getCollation` would return the `NodeSortKey` itself, as we may need string+collation. And with that, we wouldn't have to cast the node value to `NodeSortKey`, we could simply call `getCollation().compareTo(other.getCollation())`.
    
    Or with the generics alternative, it would call `nv1.getNodeValue().compareTo(nv2.getNodeValue())`, where `getNodeValue()` for a `NodeSortKey extends NodeValue<NodeSortKey>` would return a `NodeSortKey`.
    
    Hope that makes sense


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/237
  
    On a second thought, I guess implementing collation like this would still allow forcing a particular collation order at query time using e.g. `ORDER BY STRLANG(?label, 'en')`
    So maybe this is worthwhile after all...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116342234
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    Added a simple comment, not sure if good enough. Also marked `NodeSortKey` as final for now to show users it is not supposed to be extended yet, and must be treated as for internal use only.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/237
  
    I agree with the other commenters, the general order should be (lang, lex) to avoid potentially inconsistent ordering. Also the language tag may not match any Locale. We also need to have unit tests that verify that the code works in corner cases like this.
    
    But what about subtags like `en-US` and `en-GB`? If the language tag is the primary sort key, then all `en-GB` values would sort before `"a"@en-US`, which I think would be confusing for most users.
    The sort order and collation locale could be based on just the main tag (`en` in this case) ignoring the subtags, but I'm quite sure there is some language subtag out there in the world that requires a different collation order from that of the main language...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    Pull request updated. Now we have a new `NodeValueSortKey` type, and a value space for it (`VSPACE_SORTKEY`). Unit tests added.
    
    When the query contains text annotated with multiple languages, and you use the arq:collation function, everything gets overwritten as a sort key < collation, string >. So say that you have values such as "Casa"@es, "Casa"@pt, "Haus"@de, and "House" in your values.
    
    The function will get the string part (i.e. Casa, Casa, Haus, House), discard anything else, and will sort everything according to the collation that the user provided.
    
    All unit tests passing. Might be interesting testing now in Fuseki, with more elaborated queries. I believe the desired behaviour will work with this change, but would be nice to have others playing a little bit with the function and checking if there are no undesired changes.
    
    Thanks!
    Bruno


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116373375
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -774,19 +772,10 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                 }
                 case VSPACE_SORTKEY :
                 {
    -                int cmp = 0;
    -                String c1 = nv1.getCollation();
    -                String c2 = nv2.getCollation();
    -                if (c1 != null && c2 != null && c1.equals(c2)) {
    -                    // locales are parsed. Here we could think about caching if necessary
    -                    Locale desiredLocale = Locale.forLanguageTag(c1);
    -                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    -                    Collator collator = Collator.getInstance(desiredLocale);
    -                    cmp = collator.compare(nv1.getString(), nv2.getString());
    -                } else {
    -                    cmp = XSDFuncOp.compareString(nv1, nv2) ;
    +                if (!(nv1 instanceof NodeValueSortKey) || !(nv2 instanceof NodeValueSortKey)) {
    +                    raise(new ExprNotComparableException("Can't compare (not node value sort keys) "+nv1+" and "+nv2)) ;
    --- End diff --
    
    Agree, right now there is no way of that happening, as the value space for sort keys is returned only when both node values are `NodeSortKey`s. Added more to prevent bugs in case anyone ever changed the function returning the value space.
    
    Should we remove it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    > Doesn't this run into the unstable sort issue that @afs cautioned against?
    
    Not sure. I think not because of this approach, but I tried to find if sort could be unstable, and think I found one case.
    
    > I think it could be avoided by the following logic: If two `NodeValueSortKey`s have different collation languages, sort them by the collation languages instead of even looking at the text.
    >
    >This is the (lang, lex) approach discussed earlier, just applied in a slightly different context.
    
    Sounds like a plan. Let's wait and see what other think.
    
    Now, on stability...
    
    I tried finding ways that the sort would be unstable, but for two values A and B, with same collation, the result would be stable. For two values C and D with different collations, or missing collations, the result would be the sort by the string literal. The node produced would be a `Node_Literal` (function rewrites any node given to it as a `Node_Literal`).
    
    Now here is the interesting part. `#equals(Object)` and `#hashcode()` use the node value, i.e.  the `Node_Literal` string to compare values. Using the approach suggested by @osma `NodeValue#compare(NodeValue, NodeValue)` for `NodeValueSortKey`("Casa", "es") and `NodeValueSortKey`("Casa", "pt") would return that `NodeValueSortKey`("Casa", "es") < `NodeValueSortKey`("Casa", "pt"). i.e. since both values have different collation language tags, we would compare "es" and "pt".
    
    However, `#equals(Object)` and `#hashcode()` would report true based only on the `Node_Literal` node. So `NodeValueSortKey`("Casa", "es").equals( `NodeValueSortKey`("Casa", "pt") ) would return true.
    
    I believe this could cause problems, where the merge-sort sort would be stable (I think), but using the elements (sorted or not) in a map/set could result in weird behaviours...
    
    Some code to illustrate the above stated:
    
    ```
    NodeValueSortKey nvsk1 = new NodeValueSortKey("Casa", "es");
    NodeValueSortKey nvsk2 = new NodeValueSortKey("Casa", "pt");
    System.out.println(nvsk1.equals(nvsk2));
    // true
    
    NodeValueLang nvl1 = new NodeValueLang("Casa", "es");
    NodeValueLang nvl2 = new NodeValueLang("Casa", "pt");
    System.out.println(nvl1.equals(nvl2));
    // false
    ```
    
    For `NodeValueLang`s, when a `Node_Literal` is created, it is given a `LiteralLabel` object that it wraps. Then, when you call `NodeValueLang#equals(Object)`, `NodeValueLang` uses the `LiteralLabel#equals(Object)` to compare the other `NodeValueLang`. `LiteralLabel` is checking the language tag.
    
    I wonder if we should create a new `Node_Concrete` implementation in `org.apache.jena.graph` (Node_SortKey?), or if we should modify `Node_Literal`... I feel like the latter would be less elegant than the former.
    
    By using the current implementation, plus @osma's suggestion of comparing the collation language tag, and finally by making sure equals/hashcode agree with what our comparable says; then I believe we would have a stable sort. Thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/237
  
    Yes - I was agreeing with the approach of a collation function and it being app-decided not fixed by the nature of the data.
    
    If it can be done without to be hardwired in to `NodeValue.compare` and overloading themeaning of `NodeValueString` I think we will be better placed long term.
    
    `NodeValueString` says "// A plain string, with no language tag, or an xsd:string." and it seems likely this assumption is used elsewhere. I think sneaking in something to `NodeValueString` may have effects elsewhere as it violates the assumption of `NodeValueString` (e.g. equality).
    
    `NodeValueSortKey`, and a new value space `VSPACE_SORTKEY`, which only is useful in `ORDER BY` (and `makeNode` is an exception) means we can say "only for sorting".  `arq:collation` then stops being usable as a general function.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116380794
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    I guess my confusion is coming in as follows: my (very possibly wrong) understanding is that for (e.g.) `NodeValueInteger`, etc, the whole state of the thing is in the (e.g.) `BigInteger`, which would come entirely from the underlying binding, _from the RDF_. But the collation is coming _from the query_. It seems different. But like I say, maybe I've got the wrong end of the stick on this and I hope I'm not confusing things further!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/jena/pull/237


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    >I agree with the other commenters, the general order should be (lang, lex) to avoid potentially inconsistent ordering.
    
    Ack, that makes sense +1
    
    >Also the language tag may not match any Locale. We also need to have unit tests that verify that the code works in corner cases like this.
    
    Sure, tests and more defensive programming will come later. Right now looking more for comments on how to sort, where to sort, etc.
    
    Besides typos/mispellings, there are also valid tags such as i-klingon (I believe this is mentioned in some specification linked in the SPARQL spec page). For cases like this I think we would simply try to match against the JVM's available locales, and if not existing, then just use normal string comparison.
    
    >But what about subtags like en-US and en-GB? If the language tag is the primary sort key, then all en-GB values would sort before "a"@en-US, which I think would be confusing for most users.
    The sort order and collation locale could be based on just the main tag (en in this case) ignoring the subtags, but I'm quite sure there is some language subtag out there in the world that requires a different collation order from that of the main language...
    
    The sort order of accented letters is different for en-CA and en-FR.
    
    en-FR:
    
    * cote
    * cot�
    * c�te
    * c�t�
    
    en-CA:
    
    * cote
    * c�te
    * cot�
    * c�t�



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116219667
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    Add comments that `makeNode` are fake (they don't round trip).
    
    This could be one of the XSD binary datatypes (base64binary, hexBinary) but really we have to acknowledge to ourselves that `NodeSortKey` is "internal" and appearing in output or in expressions is not going to fully work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116342551
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -783,6 +772,22 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return Expr.CMP_GREATER ;
                     return Expr.CMP_EQUAL;  // Both plain or both xsd:string.
                 }
    +            case VSPACE_SORTKEY :
    +            {
    +                int cmp = 0;
    +                String c1 = nv1.getCollation();
    +                String c2 = nv2.getCollation();
    +                if (c1 != null && c2 != null && c1.equals(c2)) {
    +                    // locales are parsed. Here we could think about caching if necessary
    +                    Locale desiredLocale = Locale.forLanguageTag(c1);
    +                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    +                    Collator collator = Collator.getInstance(desiredLocale);
    +                    cmp = collator.compare(nv1.getString(), nv2.getString());
    +                } else {
    --- End diff --
    
    What a great idea being comparable! Done, it reduced changes in NodeValue, and made writing tests for the comparison (core part of this change) easier. Thanks Andy!
    
    As for the general mechanism, I'd be +1, but maybe later I think. The class is final for now and has a few comments. If I understand it correctly, someone in the future may remove the final mark, remove the collation and move it to subtype/enums. This way we could re-use `NodeSortKey` later for comparing other things. One could even write a simple function that would compare values by some string edit distance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/237
  
    What is the outcome of ordering mixed language tags and also plain xsd:string?
    
    Changing `NodeValueString` looks dubious - `NodeValueString` is an xsd:string and can be used in expressions involving strings. `makeNode()` is lossy so round tripping Node<->NodeValue loses the collation.
    
    A `NodeValueLangString` is much safer (`NodeValue.makeNode` returns `"foo"@fi`).
    
    I still think there is merit in `ORDER BY arq:collation("fi", ?label)` approach. Have the app say what collation it wants. I like this because it means `xsd:string`s can be sorted by locale collation,even differently for different users.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116374267
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    I see, so a `NodeValueXyx` would have the equivalent method `getXyz`. In our case, we have `NodeValueSortKey` and we had (before I just deleted) the `getCollation`.
    
    Should I simply add back the `getCollation` method, and return a `NodeValueSortKey`?
    
    Would it make sense in the future try to add a generic type `<T>` to `NodeValueSortKey`, making a single method `public T getNodeValue()`? I think this way we wouldn't have to add one method for each new NodeValue, and the design contract would be easier to be followed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116365836
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    Marking final makes sense. There are some details in the way NodeValueSortKey does not follow the pattern of other NodeValues e.g. the "getXYZ" operations return its value (so getString isfor xsd:string to return their string value, getInteger on NodeValueInteger etc.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116378420
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    @kinow, I like the idea of making that sort of contract more specifically visible in the type system. You are distinguishing the two case @afs mentioned by putting `getString`, `getInteger`, etc. into the generic and leaving "control" functions like `getCollation` out of it?
    
    Wouldn't this be just as much a good pattern for `NodeValue` itself, except we would have to be careful about migrating it?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    @afs,
    
    >Changing NodeValueString looks dubious
    
    Agreed. I thought there should be another way Having a new `NodeValue` type like `NodeValueSortKey` sounds like a good plan, as well as a new value space type. I'm updating the pull request (99% done in my local working copy), but will spend some time reviewing the code, and writing initial unit tests now :-)
    
    Will update this pull request in the next hours.
    Bruno


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116365929
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -978,7 +967,6 @@ public boolean isDayTimeDuration()
         public boolean     getBoolean()     { raise(new ExprEvalTypeException("Not a boolean: "+this)) ; return false ; }
         public String      getString()      { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
         public String      getLang()        { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
    -    public String      getCollation()   { raise(new ExprEvalTypeException("Not a sort key: "+this)) ; return null ; }
     
    --- End diff --
    
    This would be the "value" of a NodeValueSortKey ... which is itself so NodeValueSortKey then the casts aren't needed in the comparision.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    I have a sandbox project here https://github.com/kinow/jena-arq-filter, not really unit tests, but some renamed main methods that I use for experimenting. You can try checking out this pull request, opening in the same workspace both projects, then trying something like this:
    
    ```
            // Query String
            final String queryString = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n" + 
                    "PREFIX arq: <http://jena.apache.org/ARQ/function#>\n"  +
                    "SELECT ?label WHERE {\n" + 
                    "   VALUES ?label { \"tsahurin kieli\"@fi \"t\u0161ekin kieli\"@fi \"tulun kieli\"@fi \"t�yht�hyypp�\"@fi }\n" + 
                    "}\n" + 
                    "ORDER BY arq:collation(\"fi\", ?label)";
            // --- Model
            Model model = ModelFactory.createDefaultModel();
            // Query object
            Query query = QueryFactory.create(queryString);
            // Execute query
            try (QueryExecution qExec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qExec.execSelect();
                while (results.hasNext()) {
                    QuerySolution solution = results.nextSolution();
                    System.out.println(solution);
                }
            }
    ```
    
    The result will be:
    
    ```
    ( ?label = "tsahurin kieli"@fi )
    ( ?label = "t\u0161ekin kieli"@fi )
    ( ?label = "tulun kieli"@fi )
    ( ?label = "t�yht�hyypp�"@fi )
    ```
    
    If you change the locale for "en", then it will be:
    
    ```
    ( ?label = "t�yht�hyypp�"@fi )
    ( ?label = "tsahurin kieli"@fi )
    ( ?label = "t\u0161ekin kieli"@fi )
    ( ?label = "tulun kieli"@fi )
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    Merged. Will start working on the documentation, probably this next weekend. Thanks everyone!!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/237
  
    The collator is applied multiple times so if there is a mixed set of lang-tagged-literals this is inconsistent. It only applies to a same-lang,same-lang comparison while it may also be called with same first argument but different, different-lang argument at some other point in the sort.
    
    [Earlier email](https://lists.apache.org/thread.html/bae6f297200bee71bf749612f25e5a6f55d6a44c9c79b7240435d05a@%3Cusers.jena.apache.org%3E)
    
    This runs into the problem of unstable sorts or Java sort crashing out.
    
    We need to switch to sorting by an ordering of (lang, lex), not (lex, lang), not just reorder within same-lang.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r114351493
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -954,7 +971,8 @@ public boolean isDayTimeDuration()
         public boolean     getBoolean()     { raise(new ExprEvalTypeException("Not a boolean: "+this)) ; return false ; }
         public String      getString()      { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
         public String      getLang()        { raise(new ExprEvalTypeException("Not a string: "+this)) ; return null ; }
    -    
    +    public String      getCollation()   { raise(new ExprEvalTypeException("Not a collation: "+this)) ; return null ; }
    --- End diff --
    
    Isn't this `"Doesn't have a collation: "`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/237
  
    @afs I may be (as usual) confused, but isn't `ORDER BY arq:collation("fi", ?label)` just what @kinow is trying to do?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/237
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/237
  
    How about merging this PR to establish a baseline in the codebase and refine it, if needed, from this point onwards?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    Also, note that I left comments that are more questions for this implementation. I wonder if it would be OK to change NodeValueString, or if it would be better to create a new NodeValue. Also the prefix used for the function, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    Sorry about the mess. I reverted the previous changes, and wanted to keep everything in the branch history in case we decided to go back that way, but messed up with a `git rebase`. Cherry picked a few commits, now it's looking OK.
    
    So now this updated pull request is following a different direction. Instead of changing the default behaviour, based on language tags, it contains a 2-parameters "collation" function. All changes in ARQ.
    
    Please, ignore comments/unit tests/code readability/etc, as what this pull request is right now is a mere suggestion of an alternative for JENA-1313, and may be again discarded in case there are too many problems with this implementation.
    
    The FN_Collation.java contains the code for the new function. The first argument is a locale, used for finding the collator. The second argument to the function is the NodeValue (Expr). What the function does, is quite simple - and possibly na�ve?. It extracts the string literal from the Expr part, then creates a new NodeValue that contains both String + locale.
    
    Further down, the NodeValueString was modified as well to keep track of the string locale. Alternatively, we could create a new NodeValue subtype, instead of adding an optional locale (backward binary compatible change, as we add, but not change existing methods).
    
    Then, when the SortCondition in the Query is evaluated, and then the NodeValueString#compare method is called, it checks if it was given a desired locale. If so, it sorts using that locale.
    
    Notice that this function will be applied always in the String Value Space in ARQ, as even when we have a Language Tag, it is discarded and we use only the string. Basically, any node with a literal string will become a NodeValueString, when this function is applied to the node.
    
    With this, users are able to choose a Collation, overriding any language tags. This way, if your data contains @en and @en-GB, you can decide to use any Collation you desire on your query.
    
    Thoughts?
    
    Cheers
    Bruno


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    Hmm, messed up the pull request. Let me try to update it correctly...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/237
  
    It's hard for me to reason about what will happen if the first param is allowed to vary along the bindings...  do we care about weirdness like that or is that just such a strange thing to do that we mark it as unsupported?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116390940
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/nodevalue/NodeValueSortKey.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.sparql.expr.nodevalue;
    +
    +import org.apache.jena.graph.Node;
    +import org.apache.jena.graph.NodeFactory;
    +import org.apache.jena.sparql.expr.NodeValue;
    +import org.apache.jena.sparql.util.FmtUtils;
    +
    +/**
    + * A {@link NodeValue} that supports collation value for a string. This allows query values
    + * to be sorted following rules for a specific collation.
    + */
    +public class NodeValueSortKey extends NodeValue {
    +
    +    /**
    +     * Node value text.
    +     */
    +    private final String string;
    +    /**
    +     * Node value collation language tag (e.g. fi, pt-BR, en, en-CA, etc).
    +     */
    +    private final String collation;
    +
    +    public NodeValueSortKey(final String string, final String collation) {
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    public NodeValueSortKey(final String string, final String collation, Node n) {
    +        super(n);
    +        this.string = string;
    +        this.collation = collation;
    +    }
    +
    +    @Override
    +    public boolean isSortKey() {
    +        return Boolean.TRUE;
    +    }
    +
    +    @Override
    +    public String getString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String asString() {
    +        return string;
    +    }
    +
    +    @Override
    +    public String getCollation() {
    +        return collation;
    +    }
    +
    +    @Override
    +    protected Node makeNode() {
    +        return NodeFactory.createLiteral(string);
    +    }
    +
    --- End diff --
    
    I'm still learning how/where `NodeValue` is used, and its relation to a `Node`. So far, `NodeSortKey` indeed does not completely match other `NodeValue`s. But I think **if** generics worked here (and it may not work due to how `NodeValue` is designed) maybe it would make more sense. `NodeSortKey` would be a bit weird, but would be a weird type, not interfering with the `NodeValue` definition per se.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    >My conclusion of this thought experiment is that there should be a way to specify the collation order in the ORDER BY statement independent of the language tags of the literals being sorted.
    
    Good points. I believe we want to give users the option to specify the collation and override the language tag then. I think we could, however, still offer this as fall back, in case no collation is specified.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by osma <gi...@git.apache.org>.
Github user osma commented on the issue:

    https://github.com/apache/jena/pull/237
  
    > If we get two NodeValueSortKey with different collation language tags, it will sort the text as normal strings
    
    Doesn't this run into the unstable sort issue that @afs cautioned against?
    
    I think it could be avoided by the following logic: If two NodeValueSortKeys have different collation languages, sort them by the collation languages instead of even looking at the text. 
    
    This is the (lang, lex) approach discussed earlier, just applied in a slightly different context.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    >The collator is applied multiple times so if there is a mixed set of lang-tagged-literals this is inconsistent. It only applies to a same-lang,same-lang comparison while it may also be called with same first argument but different, different-lang argument at some other point in the sort.
    > (...)
    >We need to switch to sorting by an ordering of (lang, lex), not (lex, lang), not just reorder within same-lang.
    
    Thanks @afs ! Will try to update the pull request following the comments here so far.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request #237: JENA-1313: compare using a Collator when both litera...

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116218340
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -783,6 +772,22 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return Expr.CMP_GREATER ;
                     return Expr.CMP_EQUAL;  // Both plain or both xsd:string.
                 }
    +            case VSPACE_SORTKEY :
    +            {
    +                int cmp = 0;
    +                String c1 = nv1.getCollation();
    +                String c2 = nv2.getCollation();
    +                if (c1 != null && c2 != null && c1.equals(c2)) {
    +                    // locales are parsed. Here we could think about caching if necessary
    +                    Locale desiredLocale = Locale.forLanguageTag(c1);
    +                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    +                    Collator collator = Collator.getInstance(desiredLocale);
    +                    cmp = collator.compare(nv1.getString(), nv2.getString());
    +                } else {
    --- End diff --
    
    Would it be better to have `NodeSortKey` being comparable rather than putting the comparison here?
    
    The then removes the need for `getCollation`.
    
    And, as a general mechanism, `NodeSortKey` are not restricted to language-based collection.  Maybe an extension is for `NodeSortKey` based on an enum-like interpretation of a value (e.g. a persons title or rank or even "one", "two", "three").


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---