Posted to c-dev@xerces.apache.org by Bill Schindler <de...@bitranch.com> on 2000/10/08 13:14:28 UTC

Re: Bug in XMLFormatter::getAmpRef() and similar methods

"Patrick Parks" <pa...@whidbey.com> wrote:
> I have found a bug in XMLFormatter::getAmpRef() and similar methods when
> using wide-character encodings such as "UTF-16LE".
> [ ... ]
> Since this method appears to rely only upon a null terminator (rather
> than including a byte count argument [as it should, IMO]) ...

The problem runs somewhat deeper than just the get...Ref() methods. Some
patches were submitted a week or two ago (look for the subject
"XMLFormatter patches") to deal with the problem. The patches are waiting
for someone to look at them and check them in.
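
To make the failure mode concrete, here is a minimal sketch (not taken from the Xerces sources) of why a byte string that relies on a single null terminator cannot carry UTF-16LE. The array and the helper name bytesSeenByStrlen are illustrative assumptions; only the UTF-16LE byte layout of "&amp;" is a fact.

```cpp
#include <cstddef>
#include <cstring>

// "&amp;" transcoded to UTF-16LE: each ASCII character becomes the
// character byte followed by a 0x00 byte, then a two-byte terminator.
static const unsigned char kAmpRefUtf16LE[] = {
    0x26, 0x00,   // '&'
    0x61, 0x00,   // 'a'
    0x6D, 0x00,   // 'm'
    0x70, 0x00,   // 'p'
    0x3B, 0x00,   // ';'
    0x00, 0x00    // UTF-16 terminator
};

// Number of payload bytes actually present (terminator excluded).
static const std::size_t kAmpRefByteCount = sizeof( kAmpRefUtf16LE ) - 2 ;

// What a consumer that stops at the first null byte would see.
std::size_t bytesSeenByStrlen()
{
    return std::strlen( reinterpret_cast<const char *>( kAmpRefUtf16LE ) ) ;
}
```

The scan stops at the high byte of the very first code unit, so anything downstream that measures the string this way loses all but one byte of the reference. An explicit byte count avoids the problem.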

(Hello? Anyone with commit authority still on the list?)

--Bill

Re: Bug in XMLFormatter::getAmpRef() and similar methods

Posted by Bill Schindler <de...@bitranch.com>.

"Patrick Parks" <pa...@whidbey.com> wrote:
> I would vote for a non-pure implementation of both overloads of writeChars()
> method (as you suggested).

As long as everyone realizes that the default implementation of the one
method will do nothing. (And the other one's default implementation will
call the one that does nothing.) With that tiny caveat, I think that's the
best way of doing it.
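
A rough sketch of what that caveat means in practice. Everything here is a stand-in: SketchFormatTarget and SketchFormatter are hypothetical names, not the patched Xerces classes, and the thread does not say which overload forwards to which. This sketch assumes the counted overload forwards to the null-terminated one.

```cpp
#include <cstddef>

typedef unsigned char XMLByte ;   // stand-in for the Xerces typedef
class SketchFormatter ;           // hypothetical stand-in for XMLFormatter

class SketchFormatTarget
{
  public:
    virtual ~SketchFormatTarget() {}

    // Null-terminated overload: the default does nothing.
    virtual void writeChars( const XMLByte * const /* toWrite */ ) {}

    // Counted overload: the default forwards to the overload above
    // (assumed direction), dropping the count and the formatter pointer.
    virtual void writeChars( const XMLByte * const toWrite
                           , const unsigned int /* count */
                           , SketchFormatter * const /* formatter */ )
    {
        writeChars( toWrite ) ;
    }
} ;

// A target that overrides only the null-terminated overload still
// receives data sent through the counted one.
class CountingTarget : public SketchFormatTarget
{
  public:
    CountingTarget() : calls( 0 ) {}
    using SketchFormatTarget::writeChars ;   // keep the counted overload visible
    virtual void writeChars( const XMLByte * const ) { ++calls ; }
    int calls ;
} ;
```

A target that overrides neither overload compiles cleanly and silently discards everything, which is the caveat in question.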

> You said something about sharing the same instance of a format writer object
> between multiple XMLFormatter instances.  Does that really make sense?

Yes, it makes sense in some cases. (But it would take entirely too much
work to hack one of those cases into an email-able example.) Actually, I'm
not sure that I could prove that the functionality makes really good sense,
but the capability was built into the design of XMLFormatter and
FormatTarget. The minute we break that capability, someone will certainly
complain.

> This leads me to question the need to pass an XMLFormatter pointer to the second
> version of your writeChars() method.  If the purpose is to be able to get character
> encoding, wouldn't that impose a heavy performance overhead into a writer that didn't
> have a very good idea of what it needs to do with the byte stream it receives through
> writeChars()?

Yes, using the XMLFormatter pointer would impose a heavy overhead on the
writer. That pointer is there because someone made a strong case for being
able to determine the encoding from within XMLFormatTarget. If you look at
the patches I did for DOMPrint and SAXPrint, I ignore the pointer (which
costs almost nothing) -- because I agree with your analysis of the cost of
using it. But it's there for those who need it.

I think that having some way to query the bit-size of the encoding would be
_far_ more useful than getting the name of the encoding, especially for
generalized tools like DOMPrint. For most cases, FormatTarget doesn't care
at all about the name of the encoding; it just needs to know how wide the
encoding's base character is.
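
As an illustration of the kind of query meant here, a minimal sketch follows. The function codeUnitBytes is hypothetical (nothing like it is claimed to exist in XMLFormatter), and the name table covers only a handful of encodings whose code-unit widths are not in doubt.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: bytes per code unit for a few well-known encoding
// names, or 0 when the name is not recognised.
unsigned int codeUnitBytes( const std::string & encodingName )
{
    // Encoding names are compared case-insensitively.
    std::string name ;
    for ( std::string::size_type i = 0 ; i < encodingName.size() ; ++i )
        name += static_cast<char>(
            std::toupper( static_cast<unsigned char>( encodingName[i] ) ) ) ;

    if ( name == "UTF-8" || name == "US-ASCII" || name == "ISO-8859-1" )
        return 1 ;
    if ( name == "UTF-16" || name == "UTF-16LE" || name == "UTF-16BE" )
        return 2 ;
    if ( name == "UTF-32" || name == "UTF-32LE" || name == "UTF-32BE" )
        return 4 ;
    return 0 ;   // unknown: the caller has to decide what to do
}
```

A target could call something like this once, at construction time, rather than inspecting the encoding name on every write.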

--Bill

Re: Bug in XMLFormatter::getAmpRef() and similar methods

Posted by Patrick Parks <pa...@whidbey.com>.

Thanks Bill,

I worked around this bug by using "UTF-8" encoding (which seems to work OK)
and then doing my own UTF-8 to UTF-16 widening.
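
The widening code itself is not shown in this thread. For readers who want the shape of such a workaround, here is a minimal sketch under stated assumptions: the input is well-formed UTF-8, Utf16Unit is 16 bits wide, and the function name widenUtf8 is made up for this example. It does not validate continuation bytes or reject overlong forms.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned short Utf16Unit ;   // assumed to be 16 bits wide

// Widen UTF-8 bytes to UTF-16 code units. Assumes well-formed input.
std::vector<Utf16Unit> widenUtf8( const std::string & utf8 )
{
    std::vector<Utf16Unit> out ;
    std::size_t i = 0 ;
    while ( i < utf8.size() )
    {
        const unsigned char lead = static_cast<unsigned char>( utf8[i] ) ;
        unsigned long cp ;       // code point being assembled
        std::size_t   extra ;    // continuation bytes that follow the lead
        if      ( lead < 0x80 ) { cp = lead ;        extra = 0 ; }
        else if ( lead < 0xE0 ) { cp = lead & 0x1F ; extra = 1 ; }
        else if ( lead < 0xF0 ) { cp = lead & 0x0F ; extra = 2 ; }
        else                    { cp = lead & 0x07 ; extra = 3 ; }

        if ( i + extra >= utf8.size() )
            break ;   // truncated sequence: stop rather than read past the end

        // Each continuation byte contributes its low six bits.
        for ( std::size_t k = 1 ; k <= extra ; ++k )
            cp = ( cp << 6 ) | ( static_cast<unsigned char>( utf8[i + k] ) & 0x3F ) ;
        i += extra + 1 ;

        if ( cp < 0x10000 )
            out.push_back( static_cast<Utf16Unit>( cp ) ) ;
        else
        {
            cp -= 0x10000 ;   // split into a surrogate pair
            out.push_back( static_cast<Utf16Unit>( 0xD800 + ( cp >> 10 ) ) ) ;
            out.push_back( static_cast<Utf16Unit>( 0xDC00 + ( cp & 0x3FF ) ) ) ;
        }
    }
    return out ;
}
```

The result is a sequence of code units in host byte order; producing "UTF-16LE" bytes from it would still require writing each unit low byte first.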

However, I think your "XMLFormatter patches (take two)" changes make sense.
I would vote for a non-pure implementation of both overloads of the
writeChars() method (as you suggested).  It would be annoying to require
definition of a do-nothing version of writeChars() that never gets called.

There was one comment in your "XMLFormatter -- only for single-byte
encodings?" thread that I don't think I agree with.  You said something
about sharing the same instance of a format writer object between multiple
XMLFormatter instances.  Does that really make sense?  It seems that most
FormatTarget objects (that are not just doing cout << toWrite) would have to
carry some state information in order to know what to do with the characters
they receive.  Here's a version I used to collect the emitted XML into an STL
string:

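    class StringTarget : public XMLFormatTarget
    {
      private:
        std::string & xml ;   // destination for the emitted XML

      public:
        StringTarget( std::string & target ) : xml( target ) {}

        // Appends up to the first null byte, so this is only safe for
        // encodings without embedded zero bytes (e.g. UTF-8).
        void writeChars( const XMLByte * const toWrite )
        {
            xml += reinterpret_cast<const char *>( toWrite ) ;
        }
    } ;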

In my case I carry a reference to the target string and append bytes on each
writeChars() call.  This leads me to question the need to pass an
XMLFormatter pointer to the second version of your writeChars() method.  If
the purpose is to be able to get character encoding, wouldn't that impose a
heavy performance overhead into a writer that didn't have a very good idea
of what it needs to do with the byte stream it receives through
writeChars()?  I guess I'm saying that adding the byte-count argument is a
good idea... but I don't think there's much need for the extra pointer to
XMLFormatter.  If you need a reference to XMLFormatter, or any other context
information in the format writer, these can simply be passed as constructor
arguments.  For example, my XMWriter constructor, which sets up the
XMLFormatter, looks like this:

    XMWriter::XMWriter( std::string & xml )
        : target( xml )
        , formatter( "UTF-8"
                   , &target
                   , XMLFormatter::NoEscapes
                   )
    {
        ...
    }

where target and formatter are declared as member variables in XMWriter.h
as:

    StringTarget target ;
    XMLFormatter formatter ;

If I needed access to the XMLFormatter instance calling writeChars() I could
deal with this by adding an XMWriter pointer argument to my StringTarget
constructor and changing my XMWriter constructor like this:

    XMWriter::XMWriter( std::string & xml )
        : target( this, xml )
        , formatter( "UTF-8"
                   , &target
                   , XMLFormatter::NoEscapes
                   )
    {
        ...
    }

This would give me access to my XMWriter instance and to XMLFormatter (via
its formatter member variable).  And my XMWriter object has the potential
to carry a lot more interesting state information than an instance of
XMLFormatter.  For example, if encoding is interesting to my format writer,
I could pre-digest some information about the encoding (such as
character-width) and stash it in my XMWriter class.
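
A minimal sketch of that idea, combined with the byte-count argument discussed earlier in this message. CountedStringTarget is a hypothetical class written for this example: it does not derive from XMLFormatTarget, its two-argument writeChars() is an assumed signature, and the width passed to the constructor is assumed to be non-zero.

```cpp
#include <string>

typedef unsigned char XMLByte ;   // stand-in for the Xerces typedef

// Hypothetical target: context arrives through the constructor, and the
// byte count arrives with each call, so no null terminator is needed.
class CountedStringTarget
{
  private:
    std::string &      xml ;         // destination for the emitted bytes
    const unsigned int unitBytes ;   // pre-digested code-unit width (non-zero)

  public:
    CountedStringTarget( std::string & target, unsigned int codeUnitWidth )
        : xml( target ), unitBytes( codeUnitWidth ) {}

    // Appends exactly count bytes, embedded zero bytes included.
    void writeChars( const XMLByte * const toWrite, const unsigned int count )
    {
        xml.append( reinterpret_cast<const char *>( toWrite ), count ) ;
    }

    // Code units currently held in the target string.
    std::string::size_type unitsWritten() const
    {
        return xml.size() / unitBytes ;
    }
} ;
```

Because the width is fixed at construction, the write path never has to ask a formatter anything, which is the performance point being argued.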

/Patrick

----- Original Message -----
From: "Bill Schindler" <de...@bitranch.com>
To: <xe...@xml.apache.org>
Sent: Sunday, October 08, 2000 11:14
Subject: Re: Bug in XMLFormatter::getAmpRef() and similar methods

[...]

> The problem runs somewhat deeper than just the get...Ref() methods. Some
> patches were submitted a week or two ago (look for the subject
> "XMLFormatter patches") to deal with the problem. The patches are waiting
> for someone to look at them and check them in.