
Posted to c-dev@xerces.apache.org by Daniel Schroeder <da...@mozquito.com> on 2000/09/13 21:38:11 UTC

XMLFormatTarget::writeChars() for UTF-16?

Is there a way to use the XMLFormatter to write out utf-16 coded output?

The method to actually write to a target, XMLFormatTarget::writeChars(const
XMLByte* const toWrite), takes a pointer to a string of XMLBytes (which is
typedef'd as unsigned char). As it has no additional parameter for the
length of the string, any implementation of writeChars() must rely on a \0
byte as the terminating character. This makes it impossible to use it for any
encoding that may have embedded null bytes.
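A minimal sketch of the problem, assuming UTF-16LE output. The helper name and the sample buffer are hypothetical and only illustrate what a length-less writeChars() is forced to do:

```cpp
#include <cstddef>
#include <cstring>

typedef unsigned char XMLByte;   // the typedef described above

// What a writeChars() implementation has to do when no length is passed:
// scan for the first zero byte.
std::size_t byteLengthAssumingNulByte(const XMLByte* const toWrite)
{
    return std::strlen(reinterpret_cast<const char*>(toWrite));
}

// "<a>" encoded as UTF-16LE, followed by a 16-bit terminator.
static const XMLByte kUtf16LeSample[] = {
    0x3C, 0x00,   // '<'
    0x61, 0x00,   // 'a'
    0x3E, 0x00,   // '>'
    0x00, 0x00    // terminator
};
```

For this buffer the byte-wise scan stops at the high byte of '<', so it reports a length of 1 even though the payload is 6 bytes long.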

Am I (again) missing something, or is this a bug?

If it's a bug, the obvious fix would be to additionally pass the length of
the string along with the string itself. However, as the implementation of
writeChars() is not in Xerces, but in the user's code, this is an external
interface, and if we change it, we break everybody's implementation.

Daniel

-- ------------------------------
Daniel Schröder (daniel@mozquito.com)
Senior Software Engineer
Stack Overflow AG

Phone: +49-89-76736370

"My software never has bugs. It just develops random features."

RE: XMLFormatTarget::writeChars() for UTF-16?

Posted by Daniel Schroeder <da...@mozquito.com>.

Okay, I think I have complained enough now about writeChars() getting passed
an improperly terminated string; it's time for me to prove that I can
actually *do* something useful... :-). So I sat down and tried to create a
patch that makes the call to writeChars() more useful.

But before I go into details - I found an unrelated issue: it seems several
XMLFormatter data members are new'ed as an array - for example

  ((XMLFormatter*)this)->fAposRef = new XMLByte[outBytes + 1];

- but are deleted with a plain (non-array) delete:

  delete fAposRef;

This affects fAposRef, fAmpRef, fGTRef, fLTRef, and fQuoteRef. Memory from
new[] must be released with delete[], so this looks like a bug to me; I
changed that in the patch described below.
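A minimal sketch of the matching new[]/delete[] pairing. The class CachedRef is a hypothetical stand-in for one cached entity reference such as fAposRef; it is not the actual XMLFormatter code:

```cpp
#include <cstddef>
#include <cstring>

typedef unsigned char XMLByte;

// Hypothetical holder for one cached, null-terminated byte sequence.
class CachedRef
{
public:
    explicit CachedRef(const char* text)
        : fBytes(0), fLength(std::strlen(text))
    {
        fBytes = new XMLByte[fLength + 1];        // array new ...
        std::memcpy(fBytes, text, fLength + 1);   // copies the trailing \0 too
    }

    ~CachedRef()
    {
        delete [] fBytes;                         // ... needs array delete
    }

    const XMLByte* bytes() const { return fBytes; }
    std::size_t length() const { return fLength; }

private:
    CachedRef(const CachedRef&);                  // not copyable
    CachedRef& operator=(const CachedRef&);

    XMLByte*    fBytes;
    std::size_t fLength;
};
```

Releasing an array allocated with new[] through plain delete is undefined behavior in C++, even for a byte type where many implementations happen to tolerate it.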

Back to my problem - a short recap:
The XMLFormatter passes writeChars(), which is a user-provided function, a
raw byte pointer to some data which should then be output by writeChars() in
an application-specific way. The problem was that the data were already
transcoded to the proper output encoding, but were terminated by a single
null byte, regardless of the actual encoding. A single null byte is not a
valid termination character for various encodings, including UTF-16.

I thought about additionally passing the length of the data string to
writeChars() as a quick fix. However, this a) would have broken backward
compatibility, and b) wouldn't have been correct from an architectural point
of view.

So I decided to do The Right Thing and terminate said data with the proper
termination character in the selected output encoding. My changes make a few
assumptions which I believe to be safe:
- The encoded termination character must not be longer than 10 bytes. Does
  anybody know of any encoding that has a longer termination character?
- The termination character must be constant and not depend on the previous
  data.
The termination character in the output encoding is generated (and cached)
in much the same way as the character entities fAposRef, fAmpRef etc.
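A rough sketch of the caching idea, not the patch itself. The struct Terminator and both functions are hypothetical; instead of running a null character through a real transcoder, the sketch assumes the encoded terminator is simply one all-zero code unit of the given width, which holds for UTF-8, UTF-16 and UTF-32:

```cpp
#include <cstddef>
#include <vector>

typedef unsigned char XMLByte;

// Upper bound taken from the first assumption above.
const std::size_t kMaxTerminatorBytes = 10;

// Hypothetical cache entry: the encoded form of the null character.
struct Terminator
{
    XMLByte     bytes[kMaxTerminatorBytes];
    std::size_t length;
};

// Stand-in for transcoding a single null character once per formatter.
// codeUnitBytes is 1, 2 or 4 for the encodings this sketch assumes.
Terminator makeTerminator(std::size_t codeUnitBytes)
{
    Terminator t;
    t.length = codeUnitBytes <= kMaxTerminatorBytes ? codeUnitBytes
                                                    : kMaxTerminatorBytes;
    for (std::size_t i = 0; i < kMaxTerminatorBytes; ++i)
        t.bytes[i] = 0;
    return t;
}

// Appends the cached terminator to already transcoded output.
void appendTerminator(std::vector<XMLByte>& out, const Terminator& t)
{
    out.insert(out.end(), t.bytes, t.bytes + t.length);
}
```

Because the terminator is constant for a given output encoding (the second assumption), it only has to be computed once and can then be appended to every buffer handed to writeChars().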

This patch is backward compatible, so no code changes are required in the
user's application.

The patch does not solve the issue that it's still hard to write a good and
fast writeChars() routine, as this routine needs to know the actual encoding
of the data being passed to do anything useful with them. This information
can be obtained, but this may be expensive in terms of execution time.

I tested my changes under Windows 2000 with Visual C++ only. Maybe someone
can have a look at them and/or run them through a compiler on different
platforms? Also, as I mostly work in a Windows environment, this was my
first time actually creating diff files. If they are no good, please let me
know.

Regards
  Daniel

-- ------------------------------
Daniel Schröder (daniel@mozquito.com)
Senior Software Engineer
Mozquito Technologies AG

Phone: +49-89-7299740

"My software never has bugs. It just develops random features."

Re: XMLFormatTarget::writeChars() for UTF-16?

Posted by Dean Roddey <dr...@charmedquark.com>.

No, it has to assume it's terminated in whatever way is appropriate for the
encoding it's in. In order for you to output it, you have to know what format
it's in. If you know what format it's in, you know how to find the length. If
you know it's UTF-16, then you know to look for a UTF-16 end of string, i.e.
a 16-bit null.
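A minimal sketch of that length search, assuming the buffer really is terminated by a 16-bit null (which is what the patch in this thread aims to guarantee). The function name is hypothetical:

```cpp
#include <cstddef>

typedef unsigned char XMLByte;

// Returns the payload length in bytes of a UTF-16 buffer that ends with a
// 16-bit null. The scan advances one code unit (two bytes) at a time, so a
// zero byte inside a character is not mistaken for the terminator. An
// all-zero code unit looks the same in both byte orders, so the scan works
// for UTF-16LE and UTF-16BE alike.
std::size_t utf16ByteLength(const XMLByte* const data)
{
    std::size_t n = 0;
    while (data[n] != 0 || data[n + 1] != 0)
        n += 2;
    return n;
}
```

Stepping by whole code units matters: scanning byte by byte for two adjacent zero bytes would stop too early on a sequence such as 0x41 0x00 0x00 0x41.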

And looking at the encoding name, as I said before, isn't always as easy as
it seems. There are many variations for a particular encoding, and some
transcoders might accept different ones than others.

Also, in order to know what APIs to call to write out the data you are
getting, i.e. do you call a short character API or a wide character API, you
have to know what format it's in. This also would require interpreting
encoding names.
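A small sketch of what interpreting encoding names involves. Both functions and the alias list are hypothetical and deliberately incomplete; a pluggable transcoder may accept spellings that are not listed here, which is exactly the difficulty described above:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Upper-cases the name and drops '-', '_' and spaces, so that "utf-16",
// "UTF_16" and "Utf16" all compare equal.
std::string normalizeEncodingName(const std::string& name)
{
    std::string out;
    for (std::string::size_type i = 0; i < name.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (c == '-' || c == '_' || c == ' ')
            continue;
        out += static_cast<char>(std::toupper(c));
    }
    return out;
}

// Returns the code unit width in bytes, or 0 if the name is not recognized.
std::size_t codeUnitBytes(const std::string& encodingName)
{
    const std::string n = normalizeEncodingName(encodingName);
    if (n == "UTF16" || n == "UTF16LE" || n == "UTF16BE" || n == "UCS2")
        return 2;
    if (n == "UTF32" || n == "UTF32LE" || n == "UTF32BE" || n == "UCS4")
        return 4;
    if (n == "UTF8" || n == "USASCII" || n == "ASCII" || n == "ISO88591")
        return 1;
    return 0;   // unknown: the caller cannot tell how the data is terminated
}
```

A result of 2 or 4 would steer a target toward a wide character API, 1 toward a short character API, and 0 leaves the target with no safe choice.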

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"It takes two buttocks to make friction"
    - African Proverb



RE: XMLFormatTarget::writeChars() for UTF-16?

Posted by Daniel Schroeder <da...@mozquito.com>.

Dean Roddey wrote:

> This is an issue, which is not trivial to deal with. The output formatter
> must really know what the output format is. Just passing a length won't
> help. You have to know how to interpret the data. By having just a raw byte
> pointer, you can pass anything to it. The formatter just needs to know what
> it's really dealing with and handle that appropriately.

Dean, I'm not quite sure I understand what you mean. When you say
'formatter', do you mean the writeChars() function? I don't have any problem
with the formatter itself, just with the interface to writeChars(). I'm also
happy with the data being passed as a pointer to raw data - if writeChars()
wants to know the current encoding, it can always call getEncodingName(), so
that shouldn't be a problem.

However, writeChars() just does not *know* how many characters (in whatever
encoding) it has received. So writeChars() has no choice but to *assume*
that this raw data is terminated by a single null byte. This assumption is
plain wrong for UTF-16.

On the other hand, if writeChars() knew the number of valid bytes, the
interface would be (almost) complete - it can ask for the current output
encoding, and then has enough information to do something useful with the
data, as it need not rely on any assumptions.
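A sketch of what such a length-carrying interface could look like. The classes CountedFormatTarget and MemoryTarget are hypothetical and only illustrate the idea; they are not the Xerces declarations:

```cpp
#include <cstddef>
#include <string>

typedef unsigned char XMLByte;

// Hypothetical variant of the target interface that also receives the
// number of valid bytes.
class CountedFormatTarget
{
public:
    virtual ~CountedFormatTarget() {}
    virtual void writeChars(const XMLByte* const toWrite,
                            std::size_t byteCount) = 0;
};

// Example target that collects the raw bytes in memory. With an explicit
// count, embedded zero bytes are copied like any other byte.
class MemoryTarget : public CountedFormatTarget
{
public:
    virtual void writeChars(const XMLByte* const toWrite,
                            std::size_t byteCount)
    {
        fBuffer.append(reinterpret_cast<const char*>(toWrite), byteCount);
    }

    const std::string& buffer() const { return fBuffer; }

private:
    std::string fBuffer;
};
```

With the count available, the target no longer needs to know the terminator of the encoding at all; it only needs the encoding if it wants to interpret the bytes.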

Daniel

-- ------------------------------
Daniel Schröder (daniel@mozquito.com)
Senior Software Engineer
Stack Overflow AG

Phone: +49-89-76736370

"My software never has bugs. It just develops random features."


Re: XMLFormatTarget::writeChars() for UTF-16?

Posted by Dean Roddey <dr...@charmedquark.com>.

This is an issue, which is not trivial to deal with. The output formatter
must really know what the output format is. Just passing a length won't
help. You have to know how to interpret the data. By having just a raw byte
pointer, you can pass anything to it. The formatter just needs to know what
it's really dealing with and handle that appropriately.

However, given the many ways that a particular encoding can be referred to,
it's kind of difficult to have the formatter figure out reliably just from
the encoding name what the actual format is. Therefore, I punted on the
original implementation and didn't address this. If you want to, you can of
course create your own formatter, and use the encoding name to figure out
what it really is. But, given that we can plug any number of transcoders
under the parser, and they can choose to use encoding name variations of any
sort they want, I didn't try to have the default formatter implementations
try to deal with this.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"It takes two buttocks to make friction"
    - African Proverb

