Posted to xmlrpc-dev@ws.apache.org by "Daniel L. Rall" <dl...@finemaltcoding.com> on 2005/05/02 05:50:08 UTC

Re: [PATCH] characters invalid for an encoding

On Fri, 2005-04-29 at 02:05 +0200, Jochen Wiedmann wrote:
> Hi, Daniel,
> 
> if I get your patch right, then the character handling is a matter
> of encoding only. However, if that's the case ...
> 
> > +                    if (isUnicode)
> > +                    {
> ...
> > +                    }
> > +                    else
> > +                    {
> > +                        throw new XmlRpcException(0, "Invalid character data "
> > +                                                  + "corresponding to XML "
> > +                                                  + "entity &#"
> > +                                                  + String.valueOf((int) c)
> > +                                                  + ';');
> > +                    }
> 
> then I see no reason to throw an exception here. One should simply fall
> back to writing a numeric entity reference, as in the case "c < 0x20".
> Or do I get things wrong?

Ignoring the encoding="..." XML header, that would indeed be valid XML.
However, when the XML is subsequently processed, the entity-encoded
value will be transformed into ASCII.  Since there is no equivalent 7 bit
ASCII character which fits within that range (e.g. 0x7f is ASCII 127), a
parse error will be generated.

Jochen, thanks a lot for taking the time to look the patch over.

- Dan



Re: [PATCH] characters invalid for an encoding

Posted by John Wilson <tu...@wilson.co.uk>.
On 16 May 2005, at 23:52, Daniel Rall wrote:

> I took a look through the spec, but nothing stood out.  John, are  
> there
> any particular portions of the spec that I should be looking at in
> particular?  The section on valid characters is really clear that the
> majority of control characters can't occur, but I didn't see any
> discussion as to why replacing them with character references isn't a
> good enough escaping mechanism.  Not trying to be obstructionist --  
> just
> trying to understand.
>


Daniel, I quite understand....

Section 2.2 defines the character ranges which can occur in a parsed  
entity. My understanding of a parsed entity is that the parsing process
replaces all the character references. So the "escaping" of
characters has no effect.

I fired up Oxygen and did an experiment. When a document contains  
&#x00; I get the following error when I check for well formedness:

F Character reference "&#x00" is an invalid XML character.

I believe that Xerces is used to perform this check.
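The same check can be reproduced with the JDK's bundled SAX parser (a sketch; JAXP has shipped with the JDK since 1.4, and its default parser is Xerces-based):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    // Returns true if the document parses as well-formed XML,
    // false if the parser reports a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```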



John Wilson
The Wilson Partnership
http://www.wilson.co.uk



Re: [PATCH] characters invalid for an encoding

Posted by John Wilson <tu...@wilson.co.uk>.
On 16 May 2005, at 23:52, Daniel Rall wrote:

> I did some looking around, and the closest thing I could find
> supporting that is an email by Tim Bray:
>
> http://lists.xml.org/archives/xml-dev/199804/msg00502.html
>

Here's a clearer one from the same list;

http://lists.xml.org/archives/xml-dev/200502/msg00156.html



John Wilson
The Wilson Partnership
http://www.wilson.co.uk



Re: [PATCH] characters invalid for an encoding

Posted by Daniel Rall <dl...@finemaltcoding.com>.
On Fri, 2005-05-06 at 14:59 +0100, John Wilson wrote: 
>On 6 May 2005, at 12:03, Jochen Wiedmann wrote:
... 
>>> For maximum interoperability I would suggest we use UTF-8 but use
>>> character references for all values > 0X7F. This means that even  
>>> if  the
>>> other end gets the encoding wrong it will still almost certainly
>>> understand the characters. If the other end does not understand
>>> character encodings it will be very easy to see what the problem is
>>> (which is not quite so easy to do if it mistakes UTF-8 for ISO8859-1,
>>> for example)
>>
> That is, as far as I can say, what Daniel's proposed patch does.
>
>Yes, it would appear to do this. However it also seems to emit invalid
>XML code points as character references (e.g. the NULL character  
>would be emitted as &#x0;). 

That's right -- it was intentional, as I was unaware of this
restriction, and figured I'd start with the parts it seemed that
everyone agreed on.  :-)

>I do not believe that the XML spec allows  
>this. I believe that these code points cannot appear in a well formed  
>document in any form. The intent is to allow the consuming  
>application to be 100% sure it never sees these characters.

I did some looking around, and the closest thing I could find
supporting that is an email by Tim Bray:

http://lists.xml.org/archives/xml-dev/199804/msg00502.html

I also found some conformance testing materials against a really old XML
parser from Sun:

http://www.xml.com/1999/09/conformance/reports/report-sun-val.html

I took a look through the spec, but nothing stood out.  John, are there
any particular portions of the spec that I should be looking at in
particular?  The section on valid characters is really clear that the
majority of control characters can't occur, but I didn't see any
discussion as to why replacing them with character references isn't a
good enough escaping mechanism.  Not trying to be obstructionist -- just
trying to understand.


I've committed patches to CVS HEAD and XMLRPC_1_2_BRANCH implementing
everything we've discussed (including test cases), _except_ the blocking
of these suspect control characters.  Attached is a patch which could be
applied to CVS HEAD to block such characters, but if we end up going
that route, it's probably time to optimize the changes I've made
recently to XmlWriter.


Re: [PATCH] characters invalid for an encoding

Posted by John Wilson <tu...@wilson.co.uk>.
On 6 May 2005, at 12:03, Jochen Wiedmann wrote:

>
> For version 3, I have code ready that checks the presence of Java 1.4.
> If that is available, an instance of Charset is queried.

Yes that works fine - I'm too used to living with the need to support  
J2ME. I forget the nice things in 1.4 :)

>
>
>
>> For maximum interoperability I would suggest we use UTF-8 but use
>> character references for all values > 0X7F. This means that even  
>> if  the
>> other end gets the encoding wrong it will still almost certainly
>> understand the characters. If the other end does not understand
>> character encodings it will be very easy to see what the problem is
>> (which is not quite so easy to do if it mistakes UTF-8 for ISO8859-1,
>> for example)
>>
>
> That is, as far as I can say, what Daniel's proposed patch does.
>


Yes, it would appear to do this. However it also seems to emit invalid
XML code points as character references (e.g. the NULL character  
would be emitted as &#x0;). I do not believe that the XML spec allows  
this. I believe that these code points cannot appear in a well formed  
document in any form. The intent is to allow the consuming  
application to be 100% sure it never sees these characters.


John Wilson
The Wilson Partnership
http://www.wilson.co.uk



Re: [PATCH] characters invalid for an encoding

Posted by Jochen Wiedmann <jo...@gmail.com>.
John Wilson wrote:

> The problem with allowing arbitrary encoding is that the writer has  no
> idea of what the mapping of code Unicode code point to character 
> encoding is. i.e. there is no way of answering the question "I have a 
> Unicode code point with value X can I represent that directly in 
> encoding Y?" If the answer to this question is "NO" then it has to  emit
> a character reference.

For version 3, I have code ready that checks the presence of Java 1.4.
If that is available, an instance of Charset is queried.


> For maximum interoperability I would suggest we use UTF-8 but use 
> character references for all values > 0X7F. This means that even if  the
> other end gets the encoding wrong it will still almost certainly 
> understand the characters. If the other end does not understand 
> character encodings it will be very easy to see what the problem is 
> (which is not quite so easy to do if it mistakes UTF-8 for ISO8859-1, 
> for example)

That is, as far as I can say, what Daniel's proposed patch does.


Jochen

Re: [PATCH] characters invalid for an encoding

Posted by John Wilson <tu...@wilson.co.uk>.
On 5 May 2005, at 22:47, Daniel Rall wrote:

> On Thu, 2005-05-05 at 15:24 +0100, John Wilson wrote:
>
>> I'm not sure I follow this either :)
>>
>> Currently we emit an XML declaration which says we are using
>> ISO8859-1 encoding.
>>
>
> The declaration generated depends upon the encoding in use by  
> XmlWriter,
> no?
>
>         write(PROLOG_START);
>         write(canonicalizeEncoding(enc));
>         write(PROLOG_END);
>

The problem with allowing arbitrary encoding is that the writer has  
no idea of what the mapping of code Unicode code point to character  
encoding is. i.e. there is no way of answering the question "I have a  
Unicode code point with value X can I represent that directly in  
encoding Y?" If the answer to this question is "NO" then it has to  
emit a character reference.

If the writer wants to support arbitrary encodings it has to be given  
a mechanism to determine  when to emit character references.  
Personally I don't think the flexibility in choosing a character  
encoding is worth the complexity in supporting it. My view is that  
the writer should only support UTF-8 and (possibly) UTF-16. These  
encodings do not require any XML declaration and can represent all  
Unicode code points.

For maximum interoperability I would suggest we use UTF-8 but use  
character references for all values > 0X7F. This means that even if  
the other end gets the encoding wrong it will still almost certainly  
understand the characters. If the other end does not understand  
character encodings it will be very easy to see what the problem is  
(which is not quite so easy to do if it mistakes UTF-8 for ISO8859-1,  
for example)
>
>> Unicode code points in the range 0X00 to 0XFF
>> have the same value as the ISO8859-1 character values. If we wish to
>> send Unicode code points with values > 0XFF then we have to emit
>> character references (e.g. &#x1FF;)
>>
>> If we were to change the encoding to UTF-8 or UTF-16 then we would
>> never have to emit character references (though we still could if we
>> wanted to).
>>
>
> Like you say below, we'd still have to emit character references for
> Unicode code points not allowed in XML documents, yes?

No. Characters which are not allowed in XML documents (e.g. USASCII
control characters like NUL and VT) are not allowed even when
represented by a character reference.
>
>
>> The XML 1.0 spec forbids some Unicode code points from appearing in a
>> well formed XML document (only these code points are allowed: #x9 |
>> #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -
>> see section 2.2 of the spec.). Note that USASCII control characters
>> other than HT, CR and NL are not allowed. Using a character reference
>> doesn't make any difference <a>&#x0;</a> is not a well formed XML
>> document and should be rejected by an XML parser (MinML used not to
>> complain about this - later versions do).
>>
>
> What range are these control characters in (e.g. < 0x20)?

The USASCII control characters which are not allowed are all those  
with values < 0X20 with the exception of 0X09, 0X0A and 0X0D.
>
>
>> There is another little wrinkle with CR and LF. An XML parser is
>> required to "normalise" line endings (see section 2.11 of the spec).
>> This normalisation involves replacing CR NL or CR with NL. This
>> normalisation does not occur if the CR is represented using a
>> character reference.
>>
>> So a correct XML writer should do the following:
>>
>> 1/ refuse to write characters with Unicode code points which are not
>> allowed in an XML document.
>>
>
> Do you suggest throwing an exception here, or writing a '?' character?

You have to throw an exception. There is no point in sending a  
message you know that the other end will not be able to understand.
>
>
>> 2/ replace characters with a Unicode code point which is not allowed
>> in the encoding being used with the appropriate character reference.
>>
>
> For any random encoding, anyone know a good way of determining whether
> such a character is representable by said encoding?

No - this is a classic deficiency in the Java Writer API. If we had a  
canRepresent() function then the world would be a better place for  
XML encoders.
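As Jochen mentions elsewhere in the thread, Java 1.4's java.nio.charset API does in fact provide essentially this, via CharsetEncoder.canEncode. A minimal sketch (the wrapper name canRepresent is illustrative, not an existing API):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanRepresent {
    // Returns true if the given character can be directly represented
    // in the named encoding (requires Java 1.4+).
    static boolean canRepresent(String encoding, char c) {
        CharsetEncoder enc = Charset.forName(encoding).newEncoder();
        return enc.canEncode(c);
    }
}
```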

>
>
>> 3/ replace <,& and > with either the pre defined entities (&lt; etc)
>> or with a character reference.
>>
>
> We're already replacing them with pre-defined entities, so we're in
> good shape here.
>
>
>> 4/ replace all CR characters with a character reference.
>>
>
> We do this to keep them from getting normalized by the XML parser, I
> take it?  Previously, we'd write them literally.

Yes - this hasn't caused problems in the past but it could in principle.
>
>
>> If we wanted to have the greatest possible chance of interoperating
>> we should emit no XML encoding declaration and replace code points
>> with values > 0X7F with character references.
>>
>
> I agree with the part about replacing code points with values > 0x7f
> with character references (see exchange with Jochen).
>
> Can non-ASCII encodings be determined by the parser using the BOM, or
> some such heuristic?  Would we write all non-ASCII encodings as UTF-8?
>
The XML spec has a section which describes heuristics which can  
determine many encodings by looking at the first four octets. This is  
based on the fact that the first character of a well formed XML  
document must be '<'. See Appendix F of the spec for the full picture
(it's a really clever mechanism IMO).
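A simplified sketch of that heuristic (BOM and '<?' detection only; the real Appendix F table also covers UCS-4 layouts and EBCDIC, and any encoding declaration that follows must still be honoured):

```java
public class EncodingSniffer {
    // Guess the encoding family from the first octets of a document.
    static String sniff(byte[] b) {
        if (b.length >= 2) {
            if ((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE"; // BOM
            if ((b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE"; // BOM
        }
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) return "UTF-8"; // BOM
        if (b.length >= 4) {
            // No BOM: the '<?' that starts the declaration reveals the
            // code unit layout, since the first character must be '<'.
            if (b[0] == 0x00 && b[1] == '<' && b[2] == 0x00 && b[3] == '?') return "UTF-16BE";
            if (b[0] == '<' && b[1] == 0x00 && b[2] == '?' && b[3] == 0x00) return "UTF-16LE";
        }
        return "UTF-8"; // default per the XML spec when nothing else matches
    }
}
```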

>


John Wilson
The Wilson Partnership
http://www.wilson.co.uk



Re: [PATCH] characters invalid for an encoding

Posted by Jochen Wiedmann <jo...@gmail.com>.
Daniel Rall wrote:

> Do you suggest throwing an exception here, or writing a '?' character?

*Never* use something like the '?', please! :-)


> I'm attaching a patch as a discussion piece which implements some of the
> discussion from this thread.

+1 for your patch.


Jochen

Re: [PATCH] characters invalid for an encoding

Posted by Daniel Rall <dl...@finemaltcoding.com>.
On Thu, 2005-05-05 at 15:24 +0100, John Wilson wrote:
>I'm not sure I follow this either :)
>
>Currently we emit an XML declaration which says we are using  
>ISO8859-1 encoding. 

The declaration generated depends upon the encoding in use by XmlWriter,
no?

        write(PROLOG_START);
        write(canonicalizeEncoding(enc));
        write(PROLOG_END);
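For illustration, here is roughly what that sequence produces; the constant values and the canonicalization rule below are assumptions, not the actual XmlWriter definitions:

```java
public class PrologSketch {
    // Assumed values -- the real PROLOG_START/PROLOG_END constants
    // live in XmlWriter and may differ.
    static final String PROLOG_START = "<?xml version=\"1.0\" encoding=\"";
    static final String PROLOG_END = "\"?>";

    // Hypothetical canonicalization: upper-case the charset name,
    // e.g. "utf-8" -> "UTF-8".
    static String canonicalizeEncoding(String enc) {
        return enc.toUpperCase(java.util.Locale.ROOT);
    }

    static String prolog(String enc) {
        return PROLOG_START + canonicalizeEncoding(enc) + PROLOG_END;
    }
}
```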

>Unicode code points in the range 0X00 to 0XFF  
>have the same value as the ISO8859-1 character values. If we wish to  
>send Unicode code points with values > 0XFF then we have to emit  
>character references (e.g. &#x1FF;)
>
>If we were to change the encoding to UTF-8 or UTF-16 then we would  
>never have to emit character references (though we still could if we  
>wanted to).

Like you say below, we'd still have to emit character references for
Unicode code points not allowed in XML documents, yes?

>The XML 1.0 spec forbids some Unicode code points from appearing in a  
>well formed XML document (only these code points are allowed: #x9 |  
>#xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -  
>see section 2.2 of the spec.). Note that USASCII control characters  
>other than HT, CR and NL are not allowed. Using a character reference  
>doesn't make any difference <a>&#x0;</a> is not a well formed XML  
>document and should be rejected by an XML parser (MinML used not to  
>complain about this - later versions do).

What range are these control characters in (e.g. < 0x20)?

>There is another little wrinkle with CR and LF. An XML parser is  
>required to "normalise" line endings (see section 2.11 of the spec).  
>This normalisation involves replacing CR NL or CR with NL. This  
>normalisation does not occur if the CR is represented using a  
>character reference.
>
>So a correct XML writer should do the following:
>
>1/ refuse to write characters with Unicode code points which are not  
>allowed in an XML document.

Do you suggest throwing an exception here, or writing a '?' character?

>2/ replace characters with a Unicode code point which is not allowed  
>in the encoding being used with the appropriate character reference.

For any random encoding, anyone know a good way of determining whether
such a character is representable by said encoding?

>3/ replace <,& and > with either the pre defined entities (&lt; etc)  
>or with a character reference.

We're already replacing them with pre-defined entities, so we're in
good shape here.

>4/ replace all CR characters with a character reference.

We do this to keep them from getting normalized by the XML parser, I
take it?  Previously, we'd write them literally.

>If we wanted to have the greatest possible chance of interoperating  
>we should emit no XML encoding declaration and replace code points  
>with values > 0X7F with character references.

I agree with the part about replacing code points with values > 0x7f
with character references (see exchange with Jochen).

Can non-ASCII encodings be determined by the parser using the BOM, or
some such heuristic?  Would we write all non-ASCII encodings as UTF-8?


I'm attaching a patch as a discussion piece which implements some of the
discussion from this thread.

Re: [PATCH] characters invalid for an encoding

Posted by John Wilson <tu...@wilson.co.uk>.
I'm not sure I follow this either :)

Currently we emit an XML declaration which says we are using  
ISO8859-1 encoding. Unicode code points in the range 0X00 to 0XFF  
have the same value as the ISO8859-1 character values. If we wish to  
send Unicode code points with values > 0XFF then we have to emit  
character references (e.g. &#x1FF;)

If we were to change the encoding to UTF-8 or UTF-16 then we would  
never have to emit character references (though we still could if we  
wanted to).

The XML 1.0 spec forbids some Unicode code points from appearing in a  
well formed XML document (only these code points are allowed: #x9 |  
#xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -  
see section 2.2 of the spec.). Note that USASCII control characters  
other than HT, CR and NL are not allowed. Using a character reference  
doesn't make any difference <a>&#x0;</a> is not a well formed XML  
document and should be rejected by an XML parser (MinML used not to  
complain about this - later versions do).
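That Char production from section 2.2 translates directly into a predicate (a sketch; the argument is an int so that supplementary code points above 0xFFFF can be checked too):

```java
public class XmlChars {
    // XML 1.0, section 2.2:
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    //        | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }
}
```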

There is another little wrinkle with CR and LF. An XML parser is  
required to "normalise" line endings (see section 2.11 of the spec).  
This normalisation involves replacing CR NL or CR with NL. This  
normalisation does not occur if the CR is represented using a  
character reference.

So a correct XML writer should do the following:

1/ refuse to write characters with Unicode code points which are not  
allowed in an XML document.

2/ replace characters with a Unicode code point which is not allowed  
in the encoding being used with the appropriate character reference.

3/ replace <,& and > with either the pre defined entities (&lt; etc)  
or with a character reference.

4/ replace all CR characters with a character reference.

If we wanted to have the greatest possible chance of interoperating  
we should emit no XML encoding declaration and replace code points  
with values > 0X7F with character references.
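Taken together, the four rules plus the escape-above-0x7F suggestion might look roughly like this (a sketch only; names are illustrative, and supplementary-plane characters would need surrogate-pair handling):

```java
public class EscapeSketch {
    // Escape one character per the four rules above, assuming an
    // effective US-ASCII repertoire (everything > 0x7F gets a
    // character reference). Returns the text to emit, or throws
    // for characters XML forbids outright (rule 1).
    static String escape(char c) {
        boolean valid = c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);
        if (!valid) {                                                  // rule 1
            throw new IllegalArgumentException(
                "Character 0x" + Integer.toHexString(c) + " is not allowed in XML");
        }
        switch (c) {
            case '<': return "&lt;";                                   // rule 3
            case '>': return "&gt;";
            case '&': return "&amp;";
            case '\r': return "&#xD;";                                 // rule 4
        }
        if (c > 0x7F) return "&#x" + Integer.toHexString(c) + ";";     // rule 2
        return String.valueOf(c);
    }
}
```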


I would recommend Tim Bray's Annotated XML spec
http://www.xml.com/axml/testaxml.htm if you would like to check that I
have the details right.



John Wilson
The Wilson Partnership
http://www.wilson.co.uk



Re: [PATCH] characters invalid for an encoding

Posted by Daniel Rall <dl...@finemaltcoding.com>.
On Thu, 2005-05-05 at 02:35 +0200, Jochen Wiedmann wrote:
>Daniel Rall wrote:
...
>What does "invalid as un-encoded XML" mean? Not being within the
>encoding's character set?

I was referring to characters which had not been entity-encoded using
references like &lt; or &#xFFFF;.

>If so, the range 0x20 to 0xff is quite arbitrary and not even valid in
>all cases. For example, it fails for "US-ASCII" encoding. In other
>words, to me this wasn't good.

This change was only intended to catch characters invalid in XML, which
it did an incomplete job of.

...
>- Choose UTF-8 as the encoding; that means that only very few
>  characters ('<', for example) have to be escaped.

Ideally speaking, this option also strikes me as the cleanest.  Sadly,
the reality is that there are a lot of old XML-RPC clients and servers
out there in production, and that we could only offer this behavior as a
non-default configuration toggle.

>- Choose US-ASCII as the encoding. In other words, escape everything
>  beyond 0x7f.

John Wilson also made this suggestion.  Given the very real inter-op
concerns we have to live with, I propose that this be the default
behavior.

>- Invent a new interface and let the user decide, for example:
>
>      public interface XmlRpcEncoder {
>          String getEncoding();
>          boolean isEscaped(char pChar);
>      }

Not to over-engineer things, I also envisioned this type of solution to
implement the UTF-8 toggle discussed above.



Re: [PATCH] characters invalid for an encoding

Posted by Jochen Wiedmann <jo...@gmail.com>.
Daniel Rall wrote:

> On 2002/08/19, CVS rev 1.3 of XmlWriter introduced code to entity encode
> characters in the range 0x20 to 0xff, characters which are invalid as
> un-encoded _XML_.  And so it was Good.

Sorry for asking, but I tend to become more and more confused. :-)

What does "invalid as un-encoded XML" mean? Not being within the
encoding's character set?

If so, the range 0x20 to 0xff is quite arbitrary and not even valid in
all cases. For example, it fails for "US-ASCII" encoding. In other
words, to me this wasn't good.


> With the restriction on ASCII-only <string> payloads removed, do we want
> to go back to the days of CVS rev 1.3, where all characters which are
> not valid _XML_ are entity encoded, and no special handling is enforced
> based on the XmlWriter's encoding?

Besides the fact that I do not understand what has actually been
restricted (the lexical representation or the actual character set),
and that the latter would make XML-RPC pretty useless to most of us:
the restriction is gone.

So we have, IMO, the following options:

- Choose UTF-8 as the encoding; that means that only very few
  characters ('<', for example) have to be escaped.

- Choose US-ASCII as the encoding. In other words, escape everything
  beyond 0x7f.

- Invent a new interface and let the user decide, for example:

      public interface XmlRpcEncoder {
          String getEncoding();
          boolean isEscaped(char pChar);
      }
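To make the idea concrete, a hypothetical implementation of that interface for the escape-everything-beyond-0x7f option (names are illustrative, not existing code):

```java
// Hypothetical sketch mirroring the proposal above.
interface XmlRpcEncoder {
    String getEncoding();
    boolean isEscaped(char pChar);
}

class AsciiXmlRpcEncoder implements XmlRpcEncoder {
    public String getEncoding() { return "US-ASCII"; }
    // Everything beyond 0x7f gets escaped as a character reference.
    public boolean isEscaped(char pChar) { return pChar > 0x7F; }
}
```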

Personally, I clearly favour the first option.


Jochen

Re: [PATCH] characters invalid for an encoding

Posted by Daniel Rall <dl...@finemaltcoding.com>.
On Wed, 2005-05-04 at 13:42 -0700, Steve Quint wrote: 
>I must first profess that I'm not very smart, and reading the XML 
>"recommendation" on W3C makes my head spin.  My apologies if my 
>interpretation of this "recommendation" is wrong.

I feel that.  There's a reason I sat on this patch for _months_.  ;-)

>At 4:45 PM -0700 5/3/05, Daniel Rall wrote:
>>
>>Hi Steve, can you elaborate on this?  Both the XML RFCs and certain
>>encodings dictate what constitutes valid content, and how content must
>>be represented.  For instance, certain multi-byte characters simply
>>aren't representable in a 7 bit encoding like ASCII -- the only way to
>>deliver them through an ASCII encoding is to use another encoding which
>>can be represented in ASCII (e.g. base-64).

Re-reading this and doing some research, I see that I've mis-remembered
part of the problem, and that my description here may be (debatably)
incorrect near the end given the removal of the "ASCII-only" restriction
from the XML-RPC spec.

>>I don't follow you.  The XML-RPC spec itself used to dictate that the
>>XML payload must be ASCII.  That changed only recently.
>
>Unless I'm missing something, can't any multi-byte character be
>represented using entity encoding?  Why is this operation reserved
>for UTF-8 and UTF-16?

Indeed, multi-byte characters can be represented using entity encoding.
Here's the deal...

On 2002/08/19, CVS rev 1.3 of XmlWriter introduced code to entity encode
characters in the range 0x20 to 0xff, characters which are invalid as
un-encoded _XML_.  And so it was Good.

On 2002/08/20, CVS rev 1.4 of XmlWriter incorrectly changed the code
introduced in rev 1.3 to throw an exception when encountering characters
in that same range of 0x20 to 0xff, claiming that such characters were
not valid in XML-RPC <string> payloads, because at that time, XML-RPC
allowed only ASCII data for its <string> data type.  Rev 1.4 _should've_
looked more similar to the change I just committed, which disallowed
characters outside the range 0x20 to 0x7f, and did so only within
<string> data.

On 6/30/03, Dave Winer removed the restriction about only ASCII being
allowed in <string> payloads from the XML-RPC specification.


With the restriction on ASCII-only <string> payloads removed, do we want
to go back to the days of CVS rev 1.3, where all characters which are
not valid _XML_ are entity encoded, and no special handling is enforced
based on the XmlWriter's encoding?  (What does this mean for inter-op
with older XML-RPC clients/servers?)  Or do we -- as the code I just
checked in does -- assume that the XML parser that will be receiving the
content generated by XmlWriter could be converting the data into ASCII
whenever it's declared to be ASCII?  (Again, keeping support for old
XML-RPC clients/servers in mind here.)



Re: [PATCH] characters invalid for an encoding

Posted by Steve Quint <li...@nanohertz.com>.
I must first profess that I'm not very smart, and reading the XML 
"recommendation" on W3C makes my head spin.  My apologies if my 
interpretation of this "recommendation" is wrong.

At 4:45 PM -0700 5/3/05, Daniel Rall wrote:
>
>Hi Steve, can you elaborate on this?  Both the XML RFCs and certain
>encodings dictate what constitutes valid content, and how content must
>be represented.  For instance, certain multi-byte characters simply
>aren't representable in a 7 bit encoding like ASCII -- the only way to
>deliver them through an ASCII encoding is to use another encoding which
>can be represented in ASCII (e.g. base-64).


>I don't follow you.  The XML-RPC spec itself used to dictate that the
>XML payload must be ASCII.  That changed only recently.

Unless I'm missing something, can't any multi-byte character be
represented using entity encoding?  Why is this operation reserved
for UTF-8 and UTF-16?
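For illustration, any code point can indeed be written as a hexadecimal character reference; a one-method sketch:

```java
public class CharRef {
    // Write any code point as a hexadecimal character reference,
    // e.g. U+20AC (the Euro sign) becomes "&#x20ac;".
    static String charRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }
}
```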

-- 

Steve

------------------------------------------------------------
"Always ... always remember: Less is less. More is more. More is
better. And twice as much is good too. Not enough is bad. And too
much is never enough except when it's just about right."
			-- The Tick
------------------------------------------------------------

Re: [PATCH] characters invalid for an encoding

Posted by Daniel Rall <dl...@finemaltcoding.com>.
On Mon, 2005-05-02 at 17:43 -0700, Steve Quint wrote:
>At 8:50 PM -0700 5/1/05, Daniel L. Rall wrote:
>>
>>Ignoring the encoding="..." XML header, that would indeed be valid XML.
>>However, when the XML is subsequently processed, the entity-encoded
>>value will be transformed into ASCII.  Since there is no equivalent 7 bit
>>ASCII character which fits within that range (e.g. 0x7f is ASCII 127), a
>>parse error will be generated.
>
>I believe it is incorrect for XMLWriter to make any assumptions about 
>encodings, especially since the encoding is passed in the 
>constructor.  

Hi Steve, can you elaborate on this?  Both the XML RFCs and certain
encodings dictate what constitutes valid content, and how content must
be represented.  For instance, certain multi-byte characters simply
aren't representable in a 7 bit encoding like ASCII -- the only way to
deliver them through an ASCII encoding is to use another encoding which
can be represented in ASCII (e.g. base-64).

>If I create an instance of XMLWriter that I expect to 
>support a binary encoding that I passed in the constructor, I 
>shouldn't have to worry about an encoding related error later on. 
>Encoding related errors should probably happen at the time I specify 
>an encoding.

How do you recommend that be done, given the content to write isn't
available until the writeObject() method is called?  It seems to me that
you don't know whether the content is representable in your specified
encoding until you get the content...

This behavior seems reasonably consistent with the JDK, which declares
that a Writer can throw an IOException (which
UnsupportedEncodingException is a subclass of).

http://java.sun.com/j2se/1.4.2/docs/api/java/io/Writer.html#write(java.lang.String)

>If this is truly the behavior you wish to enforce, and I see no 
>reason for it, perhaps the Constructor should throw an 
>UnsupportedEncodingException if the encoding is neither UTF-8 nor 
>UTF-16.

I don't follow you.  The XML-RPC spec itself used to dictate that the
XML payload must be ASCII.  That changed only recently.

Thanks for the review!
- Dan



Re: [PATCH] characters invalid for an encoding

Posted by Steve Quint <li...@nanohertz.com>.
At 8:50 PM -0700 5/1/05, Daniel L. Rall wrote:
>
>Ignoring the encoding="..." XML header, that would indeed be valid XML.
>However, when the XML is subsequently processed, the entity-encoded
>value will be transformed into ASCII.  Since there is no equivalent 7 bit
>ASCII character which fits within that range (e.g. 0x7f is ASCII 127), a
>parse error will be generated.

I believe it is incorrect for XMLWriter to make any assumptions about 
encodings, especially since the encoding is passed in the 
constructor.  If I create an instance of XMLWriter that I expect to 
support a binary encoding that I passed in the constructor, I 
shouldn't have to worry about an encoding related error later on. 
Encoding related errors should probably happen at the time I specify 
an encoding.

If this is truly the behavior you wish to enforce, and I see no 
reason for it, perhaps the Constructor should throw an 
UnsupportedEncodingException if the encoding is neither UTF-8 nor 
UTF-16.
-- 

Steve

------------------------------------------------------------
"Always ... always remember: Less is less. More is more. More is
better. And twice as much is good too. Not enough is bad. And too
much is never enough except when it's just about right."
			-- The Tick
------------------------------------------------------------