Posted to dev@jena.apache.org by "Pascal Christoph (JIRA)" <ji...@apache.org> on 2013/05/17 11:57:16 UTC
[jira] [Updated] (JENA-457) ntriples: Object-URIs should be %-encoded

     [ https://issues.apache.org/jira/browse/JENA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pascal Christoph updated JENA-457:
----------------------------------

    Description: 
N-Triples serialization is pure ASCII for now [1], so IRIs (see RFC 3987) cannot be written as-is because UTF-8 is not allowed. Serializing a Model to N-Triples escapes non-ASCII characters with '\u' escaping. The resulting URIs don't resolve in most cases, e.g. at DBpedia. These are the three possible notations:

1. http://de.dbpedia.org/resource/T\u00FCr
2. http://de.dbpedia.org/resource/T%fcr
3. http://de.dbpedia.org/resource/Tür
[EDIT: rendering of 3. is broken; see http://www.fileformat.info/info/unicode/char/00fc for more info]

While notation 1 doesn't resolve and notation 3 is not ASCII, notation 2 (percent-encoding) fulfills both requirements. So I would like notation 2 to be used to encode object URIs in ASCII N-Triples serialization. See also https://answers.semanticweb.com/questions/18508/best-way-to-encode-uri-refsiris-for-n-triples .
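
To make the difference between the notations concrete, here is a minimal JDK-only sketch (not Jena code) that derives the '\u'-escaped form and the percent-encoded form from the same IRI. The class and method names are hypothetical. It assumes UTF-8 as the basis for percent-encoding, which is what java.net.URI.toASCIIString() uses; under that assumption 'ü' comes out as %C3%BC, whereas %fc corresponds to the ISO-8859-1 octet.

```java
import java.net.URI;

// Hypothetical demo class for illustration; not part of Jena.
final class IriNotations {

    // Notation 1: replace every non-ASCII code point by a backslash escape
    // (4 hex digits in the BMP, 8 hex digits above it), N-Triples style.
    static String unicodeEscape(String iri) {
        StringBuilder sb = new StringBuilder();
        iri.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else if (cp <= 0xFFFF) {
                sb.append(String.format("\\u%04X", cp));
            } else {
                sb.append(String.format("\\U%08X", cp));
            }
        });
        return sb.toString();
    }

    // Notation 2: percent-encode non-ASCII characters as UTF-8 octets.
    // URI.create throws IllegalArgumentException for syntactically invalid input.
    static String percentEncode(String iri) {
        return URI.create(iri).toASCIIString();
    }

    public static void main(String[] args) {
        // The escape in this literal is Java source syntax for U+00FC;
        // at runtime the string holds the real character (notation 3).
        String iri = "http://de.dbpedia.org/resource/T\u00FCr";
        System.out.println(unicodeEscape(iri));
        System.out.println(percentEncode(iri)); // http://de.dbpedia.org/resource/T%C3%BCr
    }
}
```

Only the second output is both pure ASCII and a plain URI that a server can be asked for without unescaping first.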

One could use Jena to serialize as Turtle and convert that Turtle file to N-Triples with rapper. But rapper only converts literals containing Unicode escape sequences to UTF-8 and leaves URIs untouched (wisely, since they are identifiers). So this does not help.

Some concrete code which is responsible for this serialization:

 RDFWriter fasterWriter = model.getWriter("N-TRIPLE");

It should be safe to apply a patch like this in NTripleWriter.java:

private static void writeURIString(String s, PrintWriter writer) {
    writer.print(org.apache.commons.httpclient.util.URIUtil.encodeQuery(s) ) ;
}
(not tested)
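
The patch above pulls in commons-httpclient. As an alternative illustration, the following is a minimal JDK-only sketch of the same idea. It is a hypothetical stand-in, not the actual NTripleWriter code, and it assumes that only non-ASCII characters should be touched and that UTF-8 is the encoding behind the percent escapes.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the patched method; not the actual Jena source.
final class PercentEncodingSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Writes s with every non-ASCII character replaced by its percent-encoded
    // UTF-8 octets. ASCII characters pass through unchanged, because every
    // octet of a multi-byte UTF-8 sequence is >= 0x80.
    static void writeURIString(String s, PrintWriter writer) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80) {
                writer.print((char) octet);
            } else {
                writer.print('%');
                writer.print(HEX[octet >> 4]);
                writer.print(HEX[octet & 0x0F]);
            }
        }
    }

    // Convenience wrapper (an assumption for this demo) that collects the output.
    static String encode(String s) {
        StringWriter out = new StringWriter();
        PrintWriter writer = new PrintWriter(out);
        writeURIString(s, writer);
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape in this literal is Java source syntax for U+00FC.
        System.out.println(encode("http://de.dbpedia.org/resource/T\u00FCr"));
    }
}
```

This sketch deliberately leaves all ASCII characters alone, including '#' and existing '%' escapes, so already-valid URIs are written unchanged. Whether the writer should also encode ASCII characters that are illegal in URIs (spaces, for instance) is a separate design question.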

What do you think?
-o

[1] See a month-old W3C note that proposes using UTF-8 instead of ASCII: http://www.w3.org/TR/2013/NOTE-n-triples-20130409/#n-triple-changes


    
> ntriples: Object-URIs should be %-encoded
> -----------------------------------------
>
>                 Key: JENA-457
>                 URL: https://issues.apache.org/jira/browse/JENA-457
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ, Jena, RDF API
>    Affects Versions: ARQ 2.9.3
>         Environment: everywhere
>            Reporter: Pascal Christoph
>            Priority: Minor
>              Labels: patch
