You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by stain <gi...@git.apache.org> on 2017/01/12 14:53:20 UTC

[GitHub] commons-rdf pull request #30: COMMONSRDF-51 language tags compared lower cas...

GitHub user stain opened a pull request:

    https://github.com/apache/commons-rdf/pull/30

    COMMONSRDF-51 language tags compared lower case

    This fixes [COMMONSRDF-51](https://issues.apache.org/jira/browse/COMMONSRDF-51) - at least from `Literal.equals()` and `Literal.hashCode()`
    
    Further test might be needed to verify consistent behaviour in `Graph` and `Dataset` if underlying framework does not correctly do langtag comparison in lower case (e.g. Turkish locale problem).
    
    Please comment on the fixes and the suggested updated javadoc:
    
    * [Literal.equals(Object)](http://stain.github.io/commons-rdf/COMMONSRDF-51/org/apache/commons/rdf/api/Literal.html#equals-java.lang.Object-)
    * [Literal.hashCode()](http://stain.github.io/commons-rdf/COMMONSRDF-51/org/apache/commons/rdf/api/Literal.html#hashCode--)
    * [Literal.getLanguageTag()](http://stain.github.io/commons-rdf/COMMONSRDF-51/org/apache/commons/rdf/api/Literal.html#getLanguageTag--)
    
    For code improvements of this PR, feel free to push to the `COMMONSRDF-51-langtag-lcase` branch at https://git-wip-us.apache.org/repos/asf/commons-rdf.git or use the "Start review" mechanism in GitHub.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/commons-rdf COMMONSRDF-51-langtag-lcase

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/commons-rdf/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #30
    
----
commit 3064d219606cbe42c0150d81dbf6cdbc74bf7491
Author: Stian Soiland-Reyes <st...@apache.org>
Date:   2017-01-12T14:51:26Z

    COMMONSRDF-51: compare language tags in lower case

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    You mean you tried to Graph.delete a triple?
    
    That deletes exact triples (.equals). Graph.remove deletes triples that match (same value) by using "find" that has the value-matching capability. (checked - works for me).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    Thanks @afs , that makes sense, `JenaGraphImpl` was indeed using `graph.delete()`. 
    
    I have fixed in both `JenaGraphImpl` and `JenaDatasetImpl`.  See comment - do you think there is much performance gain from not splitting into pattern but passing the original Jena Triple (safe only when there's no Literal object with langtag) - or shall we always use the pattern?
    
    (e.g. would Jena TDB do things like get an internal Triple row ID out of the jena `Triple` for faster delete?)
    
    I added tests for Dataset that reveals that statements in default graph come back from Jena in the named graph `<urn:x-arq:DefaultGraph>` - that's a separate bug in `JenaDatasetImpl` and the converters - we should represent that always as `Optional.empty()` in Commons RDF land.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    @ansell mentions one of the reasons the wording for RDF 1.1is not so direct - RDF 1.0 did not sanction the common normalization defined in BCP47 canonicalization, although that actually requires consulting the registry as well.
    
    Jena is lax by default, and retains the form as originally written. In practice, datasets seem to be internally consistent, all lower case or all syntax-canonical. 
    
    Variations of case are different nodes in the general case but are `Node.sameValue` (compare) and cause matching in graph.find. Some storage layers may differ and canonicalize the form, in order to index.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    OK, so then it makes sense for the Commons RDF tests to only care about the
    value being preserved (whatever the case going in or out is upper or
    lower), and that our .equals and .hashCode is based on lowercase in the
    ROOT Locale.
    
    We don't have equivalent tests if datatyped floats etc preserve their
    specific syntactic value (e.g. "-.0"^^xsd:float) so we should not do that
    for langtags either.
    
    I'll modify the branch and merge.
    
    On 21 Jan 2017 9:39 pm, "Andy Seaborne" <no...@github.com> wrote:
    
    > RDF 1.1 mentions:
    >
    >    1. Turtle parsing - there is a lang tag rule.
    >    2. The text that conversion to a lowercase lexical is allowed.
    >    3. Value-comparison is case insensitive.
    >
    > Which is that test for? Lexical or value?
    >
    > At least acknowledging that RDF's "lowercase" is not in keeping with BCP
    > 47 syntax canonicalization (the registry may change the characters)
    > whatever the spec makes sense to me and I suspect domain experts; it's
    > following the spec that "owns" language tags. Focus on the value comparison.
    >
    > \u2014
    > You are receiving this because you authored the thread.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/commons-rdf/pull/30#issuecomment-274290063>,
    > or mute the thread
    > <https://github.com/notifications/unsubscribe-auth/AAPd5Q9mhDWV3MwkRzl0QLO5FS7bVl6Wks5rUnsmgaJpZM4Lh1hF>
    > .
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    There seems to be consensus on http://lists.w3.org/Archives/Public/public-rdf-comments/2017Jan/thread.html and http://lists.w3.org/Archives/Public/semantic-web/2017Jan/thread.html in the _Are literal language tags case sensitive?_ threads that it is not meant to be a change from RDF 1.0 - that language tags should still be compared case insensitively.
    
    That should be inline with what this PR suggests - case insensitive in `.equals()` and `.hashCode()`
    
    Do you agree on that line, @afs and @ansell ..?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf pull request #30: COMMONSRDF-51 language tags compared lower cas...

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on a diff in the pull request:

    https://github.com/apache/commons-rdf/pull/30#discussion_r96309546
  
    --- Diff: api/src/test/java/org/apache/commons/rdf/api/AbstractRDFTest.java ---
    @@ -194,6 +194,114 @@ public void testCreateLiteralLangISO693_3() throws Exception {
             assertEquals("\"Herbert Van de Sompel\"@vls", vls.ntriplesString());
         }
     
    +    public void testCreateLiteralLangCaseInsensitive() throws Exception {
    --- End diff --
    
    Does this need @Test annotation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    Right, BCP47 normalisation would make sense, but sadly that is not directly
    permitted by RDF 1.1, only normalisation to lower case :-( - probably to
    avoid dependency on the registry.
    
    However I think we can try to make Commons RDF present a consistent RDF
    1.1-compliant view, which I would think includes that creating a literal
    with en-gb lowercase would return en-gb lowercase, also with RDF4J as back
    end. (Would this require our wrapper LiteralImpl to always lowercase for
    RDF4J?)
    
    Can we extend the RDF4J test to also cover the other settings? How are they
    provided? If the user is explicitly asking to go beyond the RDF standards,
    then they should not be surprised if Commons RDF's view goes along with
    that (or falls over), so then perhaps we don't need to worry about it here?
    (e.g. Jena can be configured to support generalized RDF which don't work
    well with the normal TripleImpl).
    
    
    
    On 16 Jan 2017 9:52 pm, "Peter Ansell" <no...@github.com> wrote:
    
    *@ansell* commented on this pull request.
    
    Looks fairly good to me. I disagree with the test assertion that disallows
    normalisation using the BCP47 conventions (e.g., en-GB) in their
    constructors, but it is a minor issue.
    ------------------------------
    
    In api/src/test/java/org/apache/commons/rdf/api/AbstractRDFTest.java
    <https://github.com/apache/commons-rdf/pull/30#pullrequestreview-16887719>:
    
    > @@ -194,6 +194,114 @@ public void testCreateLiteralLangISO693_3() throws Exception {
             assertEquals("\"Herbert Van de Sompel\"@vls", vls.ntriplesString());
         }
    
    +    public void testCreateLiteralLangCaseInsensitive() throws Exception {
    
    Does this need @Test <https://github.com/Test> annotation?
    ------------------------------
    
    In api/src/test/java/org/apache/commons/rdf/api/AbstractRDFTest.java
    <https://github.com/apache/commons-rdf/pull/30#pullrequestreview-16887719>:
    
    > @@ -194,6 +194,114 @@ public void testCreateLiteralLangISO693_3() throws Exception {
             assertEquals("\"Herbert Van de Sompel\"@vls", vls.ntriplesString());
         }
    
    +    public void testCreateLiteralLangCaseInsensitive() throws Exception {
    +        // COMMONSRDF-51: Literal langtag may not be in lowercase, but
    +        // must be COMPARED (aka .equals and .hashCode()) in lowercase
    +        // as the language space is lower case.
    +        final Literal lower = factory.createLiteral("Hello", "en-gb");
    +        final Literal upper = factory.createLiteral("Hello", "EN-GB");
    +        final Literal mixed = factory.createLiteral("Hello", "en-GB");
    +
    +
    +        assertEquals("en-gb", lower.getLanguageTag().get());
    
    RDF4J may not follow this in some cases. It may use the BCP47 normalisation
    conventions to obtain en-GB instead.
    
    \u2014
    You are receiving this because you authored the thread.
    Reply to this email directly, view it on GitHub
    <https://github.com/apache/commons-rdf/pull/30#pullrequestreview-16887719>,
    or mute the thread
    <https://github.com/notifications/unsubscribe-auth/AAPd5Zd7XV-563iLQNzvWAEI4dO7Hm4Qks5rS-aYgaJpZM4Lh1hF>
    .



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    RDF 1.1 mentions:
    
    1. Turtle parsing - there is a lang tag rule.
    2. The text that conversion to a lowercase lexical is allowed.
    3. Value-comparison is case insensitive.
    
    Which is that test for? Lexical or value?
    
    At least acknowledging that RDF's "lowercase" is not in keeping with BCP 47 syntax canonicalization (the registry may change the characters) whatever the spec makes sense to me and I suspect domain experts; it's following the spec that "owns" language tags. Focus on the value comparison.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf pull request #30: COMMONSRDF-51 language tags compared lower cas...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/commons-rdf/pull/30


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

https://github.com/apache/commons-rdf/pull/30

This pull request returns `getLanguageTag()` in whatever case the underlying platform does (e.g. I think RDF4J and JSONLD-Java preserves casing, while Jena and Simple converts to lowercase.

I think it is only in `.equals()` and `.hashCode()` we need case insensitivity.

There's arguments both ways if we should provide a consistent view across the implementations (e.g. always lowercase); or if we should provide a consistency with what the underlying implementation does (e.g. if it is preserves casing for presentation purposes).

Commons RDF don't have any value handling mechanisms now for say converting`"13.37"^^xsd:float` to a Java float `13.37f` (without going through the underlying implementations and related methods); or determining value equality, so I think it is not too weird if Commons RDF doesn't do anything clever about language tags either (beyond spec compliance).

But if someone were to add a Common RDF API for such literal value handling, it could be natural to also add "utils" methods for presenting or parsing language tags (e.g. `isLanguageTagEqual("en-us", "en-US")` as well as hierarchical comparisons, something like `isSameLanguageTagFamily("en-us", "en-GB")`

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    Don't take what is not explicitly said in the standard to mean that it is disallowed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    In the RDF4J/Sesame case, we have had some users request, and some other users complain about , both lowercasing, which was used in the past, and canonicalisation, so RDF4J will default to leaving case alone, but any user is free to switch on the canonicalisation. Currently there isn't a lowercase-all-tags option, but that may also appear in the future.
    
    For reference, the language tag canonicalisation procedure that RDF4J optionally uses, which relies on the JDK's copy of the IANA Language Subtag Registry, is:
    
    ```
    new Locale.Builder().setLanguageTag(tag).build().toLanguageTag()
    ```
    
    There are other possible methods, but the method above is the only one that I could find which throws an error if the original tag is illformed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf pull request #30: COMMONSRDF-51 language tags compared lower cas...

Posted by ansell <gi...@git.apache.org>.

Github user ansell commented on a diff in the pull request:

    https://github.com/apache/commons-rdf/pull/30#discussion_r96309778
  
    --- Diff: api/src/test/java/org/apache/commons/rdf/api/AbstractRDFTest.java ---
    @@ -194,6 +194,114 @@ public void testCreateLiteralLangISO693_3() throws Exception {
             assertEquals("\"Herbert Van de Sompel\"@vls", vls.ntriplesString());
         }
     
    +    public void testCreateLiteralLangCaseInsensitive() throws Exception {
    +        // COMMONSRDF-51: Literal langtag may not be in lowercase, but
    +        // must be COMPARED (aka .equals and .hashCode()) in lowercase
    +        // as the language space is lower case.       
    +        final Literal lower = factory.createLiteral("Hello", "en-gb"); 
    +        final Literal upper = factory.createLiteral("Hello", "EN-GB"); 
    +        final Literal mixed = factory.createLiteral("Hello", "en-GB");
    +
    +        
    +        assertEquals("en-gb", lower.getLanguageTag().get());
    --- End diff --
    
    RDF4J may not follow this in some cases. It may use the BCP47 normalisation conventions to obtain en-GB instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    I added equivalent tests for `Graph` add/contains/remove, and this fails for Jena:
    
    ```
    java.lang.AssertionError
    	at org.junit.Assert.fail(Assert.java:86)
    	at org.junit.Assert.assertTrue(Assert.java:41)
    	at org.junit.Assert.assertFalse(Assert.java:64)
    	at org.junit.Assert.assertFalse(Assert.java:74)
    	at org.apache.commons.rdf.api.AbstractGraphTest.containsLanguageTagsCaseInsensitive(AbstractGraphTest.java:415)
            ...
    ```
    
    basically if I add `"Hello"@EN-GB` to a Jena Graph and then try to remove `"Hello"@en-GB` then the statement remains in the graph as `"Hello"@EN-GB`.
    
    How can we fix this? It would not be enough to just lowercase in all Commons RDF methods with Jena Literal language tags, as the Jena graph/model could be populated through other means (e.g. parsing a file).
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[GitHub] commons-rdf issue #30: COMMONSRDF-51 language tags compared lower case

Posted by stain <gi...@git.apache.org>.

Github user stain commented on the issue:

    https://github.com/apache/commons-rdf/pull/30
  
    I propose now to merge this branch following COMMONSRDF-55 fixing. Thanks everyone!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org