Posted to c-dev@xerces.apache.org by "Daniel J. Pyra" <da...@provider.pl> on 2002/02/27 12:44:45 UTC

Sample code improvement

Hi XercesC community!

I would like to announce an improvement to the DOMPrint sample program (for
XercesC 1.6.0).
DOMPrint works pretty well, but for documents with a tag holding a very large
amount of text data (> 2MB) it slows down significantly. It is probably not
a good idea to put such data in a single tag, but in the existing system that I
must deal with I have no choice: my interface receives large files saved
in Base64 format. I have adapted the DOMPrint program so that I can write XML
documents to any C++ stream.
In the function ostream& operator<<(ostream& target, DOM_Node& toWrite) in
DOMPrint.cpp, the call
gFormatter->formatBuf("large buffer", "large count", ...)
is the bottleneck. It is probably a good idea to divide the "large buffer" into
pieces.

***
// original source:

...
unsigned long lent = nodeValue.length() ;

switch (toWrite.getNodeType())
{
    case DOM_Node::TEXT_NODE:
    {
        gFormatter->formatBuf(nodeValue.rawBuffer(), lent,
                              XMLFormatter::CharEscapes) ;
        break ;
    }

...
***

***
// suggested source:

...
unsigned long lent = nodeValue.length() ;

switch (toWrite.getNodeType())
{
    case DOM_Node::TEXT_NODE:
    {
        /*
        Index of the beginning of data portion of tag being put in a stream
        */
        unsigned long ind = 0L ;

        /*
        Tag values are written out in portions <= XERCESC_XMLFRMBUF_SIZE,
        so writing a "large" tag does not differ from writing several
        smaller tags.
        */
        while( ind + XERCESC_XMLFRMBUF_SIZE < lent )
        {
            gFormatter->formatBuf(&nodeValue.rawBuffer()[ind],
                                  XERCESC_XMLFRMBUF_SIZE,
                                  XMLFormatter::CharEscapes) ;
            ind += XERCESC_XMLFRMBUF_SIZE ;
        }

        /*
        Write the remaining lent - ind characters. The count must not be
        lent itself, or the call reads past the end of the buffer.
        */
        gFormatter->formatBuf(&nodeValue.rawBuffer()[ind], lent - ind,
                              XMLFormatter::CharEscapes) ;
        break ;
    }

...
***

After analysing the source of the XMLFormatter class I set
XERCESC_XMLFRMBUF_SIZE to 65536, which looks suitable for the internal
transcoding buffer. With this change the function no longer slows down on
XML documents with a tag holding a very large amount of text
data.
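
For readers who want to try the chunking idea outside of Xerces-C, here is a minimal, self-contained sketch. The names `Sink` and `writeInChunks` are hypothetical and not part of Xerces-C; the callback merely stands in for a call such as gFormatter->formatBuf(ptr, count, ...). The last piece is sized as the number of characters still remaining, not the full length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for gFormatter->formatBuf(ptr, count, ...).
using Sink = std::function<void(const char*, std::size_t)>;

// Hand `len` characters starting at `data` to `sink` in pieces of at most
// `chunkSize` characters. Returns the number of calls made to `sink`.
std::size_t writeInChunks(const char* data, std::size_t len,
                          std::size_t chunkSize, const Sink& sink)
{
    assert(chunkSize > 0);  // a zero chunk size would never make progress

    std::size_t calls = 0;
    for (std::size_t ind = 0; ind < len; ind += chunkSize)
    {
        // The final piece holds only what is left: len - ind characters.
        const std::size_t count = std::min(chunkSize, len - ind);
        sink(data + ind, count);
        ++calls;
    }
    return calls;
}
```

With a 150000-character input and a chunk size of 65536, the sink is called three times (65536 + 65536 + 18928 characters) and the concatenated pieces equal the input.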

All the documents I process are encoded in UTF-8. I have tested the code on
Windows 2000 (VC++) and HP-UX 11 (a++) with XercesC 1.6.0. Any
ideas, comments or further improvements are welcome.

Daniel Pyra
Software Engineer
dan@provider.pl
Poland

Re: Sample code improvement

Posted by TU ANH NGUYEN <tn...@kstc.konicabt.com>.
Hi,

Has anybody used Xerces C++ to parse XML node attribute values on VxWorks?
What do I need to do with Autosense.hpp to port Xerces to the VxWorks
operating system?  (GNU, GCCDefs.hpp) ???

Tu


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


Re: Sample code improvement

Posted by PeiYong PY Zhang <pe...@ca.ibm.com>.
Daniel Pyra,

>gFormatter->formatBuf("large buffer", "large count",.) is the bottle neck.

   Could you please elaborate on why this function is the bottleneck?

>After analysing source of XMLFormatter class I assigned for
>XERCESC_XMLFRMBUF_SIZE value 65536, which looks suit for internal transcoding
buffer

XMLFormatter::formatBuf() tries to send transcodeTo() a block of data no
larger than kTmpBufSize (which is 16 * 1024 = 2^14), as shown below:

                const unsigned int srcCount = tmpPtr - srcPtr;
                const unsigned srcChars = srcCount > kTmpBufSize ?
                                          kTmpBufSize : srcCount;

                const unsigned int outBytes = fXCoder->transcodeTo
                (
                    srcPtr
                    , srcChars
                    , fTmpBuf
                    , kTmpBufSize
                    , charsEaten
                    , unRepOpts
                );

By limiting each chunk to 65536 characters, you reduce the number of
transcodeTo() calls made from within each formatBuf() call, but the total number
of transcodeTo() calls stays the same once your loop around formatBuf() is
taken into account.
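
The call-count argument can be checked with a little arithmetic. The sketch below is a simplified model, not Xerces-C code: it assumes the text contains no characters that need escaping (true for Base64 data) and that every transcodeTo() call consumes a full block of up to kTmpBufSize characters. The function names are hypothetical; only the kTmpBufSize value is taken from the snippet above.

```cpp
#include <cassert>
#include <cstddef>

// Value quoted from XMLFormatter above: 16 * 1024.
const std::size_t kTmpBufSize = 16 * 1024;

// Modelled number of transcodeTo() calls made by one formatBuf() call over
// `count` escape-free characters: ceil(count / kTmpBufSize).
std::size_t transcodeCallsPerFormatBuf(std::size_t count)
{
    return (count + kTmpBufSize - 1) / kTmpBufSize;
}

// Modelled total number of transcodeTo() calls when `total` characters are
// passed to formatBuf() in pieces of at most `chunkSize` characters.
std::size_t transcodeCallsChunked(std::size_t total, std::size_t chunkSize)
{
    assert(chunkSize > 0);  // a zero chunk size would never make progress

    std::size_t calls = 0;
    for (std::size_t ind = 0; ind < total; ind += chunkSize)
    {
        const std::size_t left = total - ind;
        calls += transcodeCallsPerFormatBuf(left < chunkSize ? left : chunkSize);
    }
    return calls;
}
```

Under this model a 2 MB text node (2097152 characters) costs 128 transcodeTo() calls from a single formatBuf() call, and also 128 calls when split into 32 chunks of 65536 characters (4 calls per chunk). If chunking helps in practice, the gain would have to come from somewhere other than the number of transcodeTo() calls.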

>All documents I process are encoded in UTF-8. I have tested the code in
>Windows 2000 (VC++) and HP-UX 11 (a++) environments with XercesC 1.6.0. Any
>ideas, comments, further improvements are welcome.

   Did you try your enhancement on DOMPrint itself (not your application code)
and test it against your sample XML files? Is there any significant improvement?
If so, would you mind sending us your sample XML files? Thanks.

PeiYong

