Posted to dev@any23.apache.org by "Peter Ansell (JIRA)" <ji...@apache.org> on 2012/05/22 02:45:40 UTC

[jira] [Created] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor

Peter Ansell created ANY23-99:
---------------------------------

             Summary: NQuadsWriter should force ASCII in OutputStream constructor
                 Key: ANY23-99
                 URL: https://issues.apache.org/jira/browse/ANY23-99
             Project: Apache Any23
          Issue Type: Bug
          Components: core
    Affects Versions: 0.8.0
            Reporter: Peter Ansell


The NQuads specification states that all NQuads documents must be ASCII encoded. [1] The current NQuadsWriter(OutputStream) constructor does not enforce this when it creates the OutputStreamWriter that wraps the given OutputStream. Without an explicit charset, the OutputStreamWriter uses the platform default charset, which is derived from the user's locale and JVM settings and may not be US-ASCII.

Patch is to replace the constructor with:

        this( new OutputStreamWriter(os, Charset.forName("US-ASCII")) );

[1] http://sw.deri.org/2008/07/n-quads/#mediatype
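For illustration, a minimal sketch of what the patched constructor chaining could look like. AsciiPinnedWriter is a hypothetical stand-in, not the real NQuadsWriter; only the OutputStream constructor mirrors the proposed patch, and writeLine and demo are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical stand-in for a writer class; not the Any23 implementation.
final class AsciiPinnedWriter {
    private final Writer writer;

    AsciiPinnedWriter(Writer writer) {
        this.writer = writer;
    }

    // Pin the charset explicitly instead of inheriting the platform default.
    AsciiPinnedWriter(OutputStream os) {
        this(new OutputStreamWriter(os, Charset.forName("US-ASCII")));
    }

    // Invented helper: write one already-escaped statement line and flush.
    void writeLine(String line) throws IOException {
        writer.write(line);
        writer.write('\n');
        writer.flush();
    }

    // Writes one sample statement and returns the raw bytes produced.
    static byte[] demo() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            new AsciiPinnedWriter(out).writeLine(
                "<http://example.org/s> <http://example.org/p> \"caf\\u00E9\" <http://example.org/g> .");
            return out.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the charset pinned, the bytes no longer depend on the environment the JVM was started in.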

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281138#comment-13281138 ] 

Andy Seaborne commented on ANY23-99:
------------------------------------

Is there a specific example of this happening?

The encoding rules for NQuads are to use \u escapes, so something has to encode to ASCII before the writer is reached; it is not enough to rely on the writer doing the chars-to-bytes conversion.

I think this is handled via the calls:

Literals:

org.openrdf.rio.ntriples.NTriplesUtil.toNTriplesString

URIs:

org.openrdf.rio.ntriples.NTriplesUtil.escapeString

Comments:

handleComment does not encode - this is (arguably) not quite right.
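To make the escaping step concrete, here is a rough sketch of the kind of encoding those calls are expected to perform. AsciiEscaper and its escape method are hypothetical names for this example and are not the NTriplesUtil code; the rules follow the original N-Triples style of a four-digit \u escape for the Basic Multilingual Plane and an eight-digit \U escape above it.

```java
// Hypothetical helper, written for illustration only.
final class AsciiEscaper {
    private AsciiEscaper() {
    }

    // Returns a copy of the input in which every character outside
    // printable ASCII is replaced by a backslash escape sequence.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                sb.append("\\\\");
            } else if (cp == '"') {
                sb.append("\\\"");
            } else if (cp == '\n') {
                sb.append("\\n");
            } else if (cp == '\r') {
                sb.append("\\r");
            } else if (cp == '\t') {
                sb.append("\\t");
            } else if (cp >= 0x20 && cp <= 0x7E) {
                // Printable ASCII passes through unchanged.
                sb.append((char) cp);
            } else if (cp <= 0xFFFF) {
                // Four hex digits for the Basic Multilingual Plane.
                sb.append(String.format("\\u%04X", cp));
            } else {
                // Eight hex digits for supplementary code points.
                sb.append(String.format("\\U%08X", cp));
            }
        }
        return sb.toString();
    }
}
```

Passing comment text through the same kind of function would be the strict way to close the handleComment gap.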

Also:

The charset requirements may well change. The soon-to-be-published working draft of the formal spec for N-Triples defines it to be UTF-8 when used with application/n-triples. The old rules for text/plain (US-ASCII) still apply. I would expect N-Quads to follow N-Triples. This is all in the future.

                

        

[jira] [Commented] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282027#comment-13282027 ] 

Peter Ansell commented on ANY23-99:
-----------------------------------

It seems like it would be better to read it all in as UTF-8 as you say and then handle the exceptions when the data comes back in via a parser, so they have a chance to fix the document. Silent corruption is never good. 

It does violate the general rule to write strictly according to the specification and read somewhat liberally, within reason, but if people are not generally aware of the ASCII encoding rules then it may be more useful to support them than to exclude them.
                

        

[jira] [Commented] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281300#comment-13281300 ] 

Peter Ansell commented on ANY23-99:
-----------------------------------

I have had instances in the past where the most difficult fault to find in a UTF-8-standardised system was the use of the OutputStreamWriter(OutputStream) constructor instead of the OutputStreamWriter(OutputStream,Charset) constructor. I have no specific example of non-ASCII output coming out of the NQuadsWriter. Are there any character sets that could create non-ASCII-compatible NQuads documents if the user's locale was set up with that charset and OutputStreamWriter(OutputStream) inherited it by default because we didn't specify US-ASCII explicitly? The escaping seems to make it okay at a semantic level, but in practice the output would still vary with the JVM environment properties if the charset isn't explicitly set. Not changing the constructor just seems like we are inviting a bug that could be easily avoided (based on the current spec saying ASCII-only).
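One way to probe the question above is to encode an already-escaped, pure-ASCII line in a few charsets and compare the bytes with the US-ASCII encoding. The sketch below uses only charsets every JVM must provide; AsciiCompatibilityCheck is a hypothetical name, and UTF-16 appears purely as a guaranteed-available example of an ASCII-incompatible encoding, not as a charset anyone is known to run as a platform default.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper, written for illustration only.
final class AsciiCompatibilityCheck {
    private AsciiCompatibilityCheck() {
    }

    // True if encoding the given ASCII-only text in the named charset
    // yields exactly the bytes that US-ASCII would produce.
    static boolean sameBytesAsAscii(String asciiText, String charsetName) {
        byte[] expected = asciiText.getBytes(Charset.forName("US-ASCII"));
        byte[] actual = asciiText.getBytes(Charset.forName(charsetName));
        return Arrays.equals(expected, actual);
    }
}
```

UTF-8 and ISO-8859-1 agree with US-ASCII on such input, while UTF-16 does not, so a default charset that is not ASCII-compatible would change the bytes on disk even when the escaping is correct.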

There are examples of non-ASCII data successfully going into the NQuadsParser in NQuadsParserTest, which is to be expected if we accept liberally and output standardised NQuads, although it is a little strange that the test suite explicitly supports it given the specification is very clear currently about the \u encoding rules for all non-ASCII characters.

It would be great if both NTriples and NQuads would be able to fully support UTF-8 when they are revised. It is also great that NTriples is getting a specific MIME type this time around. Hopefully the distinction between the two types for essentially the same format doesn't confuse people. It seems fairly unique to have a scenario where a single format has two legitimate types where the only difference is the encoding rules. It would be ideal to be able to handle \uNNNN the same as the native UTF-8 bytes and that would make it possible to parse old documents while all new documents just use UTF-8 without having to check whether they wanted text/plain NTriples or application/n-triples NTriples when writing out. 

Naively I would see this possibly requiring two different Rio writers (as Rio writers have a unique relationship with a single RDFFormat, which has a single charset attached to it) and possibly two different Rio parsers for the same reason. That doesn't really seem ideal, but if necessary it may be a workaround.
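As a rough illustration of handling \uNNNN the same as native UTF-8 bytes on the reading side, the sketch below decodes the input as UTF-8 and then resolves \u and \U escapes, so both spellings of a character end up as the same string. LenientLiteralReader is a hypothetical name and this is not how the Rio parsers are structured; other escapes are copied through untouched, and a malformed escape simply raises an exception.

```java
import java.nio.charset.Charset;

// Hypothetical helper, written for illustration only.
final class LenientLiteralReader {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    private LenientLiteralReader() {
    }

    // Decodes bytes as UTF-8 (which also covers pure ASCII input) and
    // resolves four-digit and eight-digit hexadecimal escapes.
    static String read(byte[] raw) {
        String text = new String(raw, UTF8);
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                char kind = text.charAt(i + 1);
                int digits = kind == 'u' ? 4 : kind == 'U' ? 8 : 0;
                if (digits > 0 && i + 2 + digits <= text.length()) {
                    String hex = text.substring(i + 2, i + 2 + digits);
                    // Throws on malformed hex or an invalid code point.
                    sb.appendCodePoint(Integer.parseInt(hex, 16));
                    i += 2 + digits;
                    continue;
                }
                // Any other escape (including an escaped backslash) is copied as-is.
                sb.append(c).append(kind);
                i += 2;
                continue;
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }
}
```

A reader along these lines would accept old ASCII-only documents and new UTF-8 documents through a single code path.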
                

        

[jira] [Commented] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287133#comment-13287133 ] 

Peter Ansell commented on ANY23-99:
-----------------------------------

The Sesame Rio NTriplesWriter currently forces US-ASCII on all OutputStreams in the OutputStream constructor:

	public NTriplesWriter(OutputStream out) {
		this(new OutputStreamWriter(out, Charset.forName("US-ASCII")));
	}

If we did the same in the NQuadsWriter we would have that as a precedent.
                

        

[jira] [Commented] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281437#comment-13281437 ] 

Andy Seaborne commented on ANY23-99:
------------------------------------

For parsing, a single parser can handle the ASCII and UTF-8 versions if it handles UTF-8 (ASCII is a strict subset of UTF-8). My experience is that N-triples does occur with non-ASCII in it because the ASCII restriction isn't universally known.  N-Quads is probably the same.

OutputStreamWriter(OutputStream,ASCII) does have one consequence - should for some reason non-ASCII be fed into such a writer, the output is wrong (default is to print a "?" i.e. silent corruption of the data).  By the time the writer sees non-ASCII in the stream it's too late.
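The replacement behaviour can be seen in a small sketch. AsciiEncodingDemo is a hypothetical name; the first method uses the plain charset constructor, and the second shows one possible alternative, an explicitly configured CharsetEncoder that reports unmappable characters instead of replacing them, so the failure is at least loud.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, written for illustration only.
final class AsciiEncodingDemo {
    private static final Charset ASCII = Charset.forName("US-ASCII");

    private AsciiEncodingDemo() {
    }

    // Writes text through an ASCII OutputStreamWriter and returns what
    // actually reached the stream; unmappable characters are replaced.
    static String lossy(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(out, ASCII);
            w.write(text);
            w.close();
            return out.toString("US-ASCII");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if a strict encoder refuses the text instead of
    // silently replacing characters it cannot represent.
    static boolean strictRejects(String text) {
        CharsetEncoder strict = ASCII.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            Writer w = new OutputStreamWriter(new ByteArrayOutputStream(), strict);
            w.write(text);
            w.close();
            return false;
        } catch (CharacterCodingException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whether a hard failure is preferable to a "?" is a design choice for the writer; either way the unescaped character has already arrived too late to be fixed at this layer.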

So changing the OutputStreamWriter is fine - relying on the platform default is never good.  But if there is a problem, it's in the code sending the data to the writer, and the writer can't fix it up (unless we have a writer that generates the \u itself).

From a code inspection, the comments should be fixed.  A strict fix is to encode them, a lax fix is to output UTF-8 and leave comments as written for convenience of reading them again.

The most common use of N-Quads I see is as DB dumps - no comments.

                