Posted to dev@jena.apache.org by "Pascal Christoph (JIRA)" <ji...@apache.org> on 2013/05/17 11:57:16 UTC
[jira] [Updated] (JENA-457) ntriples: Object-URIs should be %-encoded
[ https://issues.apache.org/jira/browse/JENA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pascal Christoph updated JENA-457:
----------------------------------
Description:
N-Triples serialization is pure ASCII for now [1], so IRIs cannot be written directly because UTF-8 is not allowed (see RFC 3987). Serializing a Model to N-Triples escapes non-ASCII characters with '\u' escapes. In most cases the resulting URIs do not resolve, e.g. at DBpedia. Three notations are possible:
1. http://de.dbpedia.org/resource/T\u00FCr
2. http://de.dbpedia.org/resource/T%fcr
3. http://de.dbpedia.org/resource/Tür
[EDIT: if the rendering of 3. is broken, see http://www.fileformat.info/info/unicode/char/00fc for more info]
While 1. doesn't resolve and 3. is not ASCII, 2. (the percent-octet encoding) fulfills both requirements. So I would like to see 2. used to encode object URIs in ASCII N-Triples serialization. See also https://answers.semanticweb.com/questions/18508/best-way-to-encode-uri-refsiris-for-n-triples .
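To make the difference between notations 1 and 2 concrete, here is a minimal sketch that derives both from the IRI in notation 3. The class and method names (NotationSketch, unicodeEscape, percentEncode) are hypothetical and are not part of Jena. The sketch assumes the percent-octets are taken from the UTF-8 form of each character, which is the mapping RFC 3987 describes for converting an IRI to a URI; under that assumption 'ü' becomes %C3%BC, whereas the %fc shown in notation 2 is the single Latin-1 octet.

```java
import java.nio.charset.StandardCharsets;

class NotationSketch {
    // Notation 1: replace each non-ASCII code point with an N-Triples style
    // backslash escape (4 hex digits, or capital-U with 8 digits beyond the BMP).
    static String unicodeEscape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else if (cp <= 0xFFFF) {
                out.append(String.format("\\u%04X", cp));
            } else {
                out.append(String.format("\\U%08X", cp));
            }
        });
        return out.toString();
    }

    // Percent-octet notation (assumption: UTF-8 octets, as in RFC 3987).
    // ASCII characters are copied unchanged; nothing else is validated.
    static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String iri = "http://de.dbpedia.org/resource/T\u00FCr";
        System.out.println(unicodeEscape(iri));
        System.out.println(percentEncode(iri));
    }
}
```

Which octets a given server expects (UTF-8 or Latin-1) would still need to be checked against the target site.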
One could use Jena to serialize as Turtle and convert that Turtle file to N-Triples with rapper. But rapper only converts literals containing Unicode escape sequences to UTF-8 and leaves URIs untouched (wisely, since they are identifiers). So this does not help.
The concrete code responsible for this serialization:
    RDFWriter fasterWriter = model.getWriter("N-TRIPLE");
It should be safe to apply a patch like this in NTripleWriter.java:
    private static void writeURIString(String s, PrintWriter writer) {
        writer.print(org.apache.commons.httpclient.util.URIUtil.encodeQuery(s));
    }
(not tested)
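As a point of comparison for the untested patch above, here is a hedged sketch of the same hook using only the JDK. It relies on java.net.URI.toASCIIString(), which percent-encodes non-ASCII characters as UTF-8 octets and leaves existing ASCII (including existing %-escapes) as it is. The class name UriWriterSketch is hypothetical, and whether Jena's NTripleWriter can be patched this way is an assumption. URI.create throws IllegalArgumentException for strings that are not syntactically valid URIs, so a real writer would need a fallback for those.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.URI;

class UriWriterSketch {
    // Same signature as the proposed patch; prints an ASCII-only form of s.
    // Assumption: s parses as a java.net.URI, otherwise this throws.
    static void writeURIString(String s, PrintWriter writer) {
        writer.print(URI.create(s).toASCIIString());
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        writeURIString("http://de.dbpedia.org/resource/T\u00FCr", writer);
        writer.flush();
        System.out.println(buffer);
    }
}
```

This avoids the extra dependency on Commons HttpClient, at the cost of rejecting strings that java.net.URI cannot parse.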
What do you think?
-o
[1] See the month-old W3C note that proposes using UTF-8 instead of ASCII: http://www.w3.org/TR/2013/NOTE-n-triples-20130409/#n-triple-changes
> ntriples: Object-URIs should be %-encoded
> -----------------------------------------
>
> Key: JENA-457
> URL: https://issues.apache.org/jira/browse/JENA-457
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ, Jena, RDF API
> Affects Versions: ARQ 2.9.3
> Environment: everywhere
> Reporter: Pascal Christoph
> Priority: Minor
> Labels: patch