Posted to issues@commons.apache.org by "Stian Soiland-Reyes (JIRA)" <ji...@apache.org> on 2017/01/11 15:03:58 UTC

[jira] [Commented] (COMMONSRDF-51) RDF-1.1 specifies that language tags need to be compared using lower-case

    [ https://issues.apache.org/jira/browse/COMMONSRDF-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818566#comment-15818566 ] 

Stian Soiland-Reyes commented on COMMONSRDF-51:
-----------------------------------------------

I think this needs to be clarified on public-rdf-comments@w3.org, as our "character by character" wording is a [quote from the spec|https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal]:

{quote}

Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term. For example:

      "1"^^xs:integer
      "01"^^xs:integer
    
denote the same value, but are not the same literal RDF terms and are not term-equal because their lexical form differs.
{quote}

The spec also says, above that passage, that the value space of language tags is always lower case, but it then defines equality "character by character" rather than by value space. (As the example shows, the lexical forms of datatypes such as integers are likewise compared by character rather than by value.)
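
To make the distinction concrete, here is a minimal sketch of term equality versus value equality for the lexical forms in the quoted example. The class and method names are hypothetical and are not part of Commons RDF; parsing with BigInteger is only an assumption standing in for the integer value space.

```java
import java.math.BigInteger;

// Hypothetical illustration; not part of the Commons RDF API.
final class TermEqualityDemo {

    /** Term equality on lexical forms: plain character-by-character comparison. */
    static boolean sameLexicalForm(String a, String b) {
        return a.equals(b);
    }

    /** Value equality for integer lexical forms (assumes both parse as integers). */
    static boolean sameIntegerValue(String a, String b) {
        return new BigInteger(a).equals(new BigInteger(b));
    }

    public static void main(String[] args) {
        System.out.println(sameLexicalForm("1", "01"));   // false: the characters differ
        System.out.println(sameIntegerValue("1", "01"));  // true: both denote the value 1
        // The same tension applies to language tags:
        System.out.println("en-GB".equals("en-gb"));            // false: character by character
        System.out.println("en-GB".equalsIgnoreCase("en-gb"));  // true: case-insensitive
    }
}
```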

I have nevertheless started a branch [COMMONSRDF-51-langtag-lcase|https://github.com/apache/commons-rdf/compare/COMMONSRDF-51-langtag-lcase] to try this out. This revealed bugs in the bindings for simple (just the Turkish case), jsonld-java (which does no validation of language tags), rdf4j (fails the Turkish test) and jena (fails the Turkish test).

As both RDF4J and Jena are vulnerable to the Turkish case, that should be reported upstream after rdf-comments clarifications.
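
For reference, the Turkish case can be reproduced with a few lines of plain Java. This is only a sketch with hypothetical class and method names. It passes the locale explicitly instead of changing the JVM default, and it is not the test used on the branch.

```java
import java.util.Locale;

// Hypothetical illustration of the locale-sensitive lowercasing problem.
final class TurkishCaseDemo {

    /** Lowercases with the Turkish locale, as happens implicitly when it is the JVM default. */
    static String lowerTurkish(String tag) {
        return tag.toLowerCase(Locale.forLanguageTag("tr"));
    }

    /** Lowercases with a fixed locale, as recommended in the issue description. */
    static String lowerEnglish(String tag) {
        return tag.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        String tag = "IT";
        // Under Turkish rules 'I' becomes dotless i (U+0131), so the result is not ASCII "it".
        System.out.println(lowerTurkish(tag).equals("it"));       // false
        System.out.println(lowerTurkish(tag).equals("\u0131t"));  // true
        System.out.println(lowerEnglish(tag).equals("it"));       // true
    }
}
```

Any code path that calls toLowerCase() without a locale argument uses the JVM default locale, so it behaves like lowerTurkish when that default is Turkish.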

Would it make sense for Commons RDF to strengthen getLanguageTag() to ALWAYS return the language tag in lower case for any RDF implementation (e.g. normalizing if the implementation does not do so correctly internally), as a kind of interoperability/RDF 1.1 measure, or should we strive to keep their current case representation as-is?
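
A rough sketch of what the first option could look like, as a wrapper that normalizes on the way out. TaggedLiteral is a hypothetical stand-in interface so the example is self-contained; it is not the Commons RDF Literal interface, and the Optional return type is an assumption made for this sketch.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-in for a literal with an optional language tag.
interface TaggedLiteral {
    Optional<String> getLanguageTag();
}

// Wraps another literal and always reports its language tag in lower case.
final class NormalizingLiteral implements TaggedLiteral {
    private final TaggedLiteral delegate;

    NormalizingLiteral(TaggedLiteral delegate) {
        this.delegate = delegate;
    }

    @Override
    public Optional<String> getLanguageTag() {
        // Fixed locale so the result does not depend on the JVM default (e.g. Turkish).
        return delegate.getLanguageTag().map(tag -> tag.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        TaggedLiteral raw = () -> Optional.of("en-GB");
        System.out.println(new NormalizingLiteral(raw).getLanguageTag().get()); // en-gb
    }
}
```

The other option, keeping the stored case as-is, would leave getLanguageTag() untouched and apply the same lowercasing only inside equals and hashCode.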

> RDF-1.1 specifies that language tags need to be compared using lower-case
> -------------------------------------------------------------------------
>
>                 Key: COMMONSRDF-51
>                 URL: https://issues.apache.org/jira/browse/COMMONSRDF-51
>             Project: Apache Commons RDF
>          Issue Type: Bug
>          Components: api
>    Affects Versions: 0.3.0
>            Reporter: Peter Ansell
>            Assignee: Stian Soiland-Reyes
>
> The RDF-1.1 specification states that the [value space of Literal language tags is lowercase|https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal], which does not conflict with the case-insensitive specification in BCP47. The Literal.equals and Literal.hashCode API contracts should specify that language tags must be compared using lowercase, even if they are otherwise stored and returned as upper-case by getLanguageTag. The API currently has incorrect language by saying "character-by-character" for language tag comparisons, as that implies case-sensitive comparisons are used.
> The lowercasing must also be done using a locale that is consistent (known example where lowercase and uppercase do not roundtrip as expected for US-ASCII characters is Turkish [1]), so I would recommend actually stating that .toLowerCase(Locale.ENGLISH) is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)