
Posted to j-users@xerces.apache.org by "Weston, Steven" <St...@compuware.com> on 2007/06/01 17:35:42 UTC

Xerces processing encoded characters

We are having a strange problem with encoded characters and I'm
wondering if there are any suggestions on how to correct the problem.
We have a party name tag in our XML document, and some of those names
have encoded ampersands in the data associated with that tag (something
like the following -- company name &amp; co.).  If I've read the
documentation correctly, Xerces should convert that encoded ampersand to
a simple ampersand, so we end up with a value something like -- company
name & co.
The problem that we are running into is that, for some reason, the
processing of the encoded character is causing the party name to
replicate, and in some cases (when the name is long) the result exceeds
the maximum length allowed for that piece of data within the database.
In the example above we would end up with the following...
Company name Company name &Company name & co
The first copy of the name drops the ampersand and everything after it,
the second drops everything after the ampersand, and the third has the
name properly converted.
We recently upgraded the parser to version 2.9.0 in the hope that it
would handle this encoded character better, but the same problem
persists.  Any suggestions on what we might do to correct this?
Thanks
steve

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it. 


Re: Xerces processing encoded characters

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Steve,

Most likely you're using SAX and are registering a ContentHandler which 
assumes that all character data of an element is reported in a single 
chunk. This is probably the most common SAX programming error.

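characters() may be called multiple times [1][2] for contiguous text, and
an entity reference such as &amp; is a typical place for the parser to
split it. Your ContentHandler needs to accumulate the text returned in
each call of characters() until you receive a callback that isn't
characters(), for example startElement() or endElement().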
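
As a minimal sketch of that accumulation pattern: the handler below appends every characters() chunk to a buffer and only reads the buffer in endElement(). The class name PartyNameHandler, the element name partyName and the parse() helper are assumptions made for illustration; they are not taken from the original document or code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of each <partyName> element.
class PartyNameHandler extends DefaultHandler {
    // Buffer for the text of the element currently being read.
    private final StringBuilder text = new StringBuilder();
    private final List<String> partyNames = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // A new element begins: discard any text seen before it.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so only
        // append here; never treat a single call as the whole value.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The text is complete only once a non-characters callback arrives.
        if ("partyName".equals(qName)) {
            partyNames.add(text.toString());
        }
        text.setLength(0);
    }

    List<String> getPartyNames() {
        return partyNames;
    }

    // Convenience helper (an assumption of this sketch): parse an XML string
    // with the default, non-namespace-aware JAXP SAX parser.
    static List<String> parse(String xml) throws Exception {
        PartyNameHandler handler = new PartyNameHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getPartyNames();
    }
}
```

Because the value is read only in endElement(), the result does not depend on how many chunks the parser chooses to deliver. With a namespace-aware parser you would probably compare localName rather than qName.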

Thanks.

[1] http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)
[2] http://xerces.apache.org/xerces2-j/faq-sax.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

"Weston, Steven" <St...@compuware.com> wrote on 06/01/2007 
11:35:42 AM:

> We are having a strange problem with encoded characters and I'm 
> wondering if there are any suggestions on how to correct the 
> problem.  We have a party name tag in our xml document and some of 
> those names have encoded ampersands in the data associated with that
> tag (something like the following -- company name &amp; co.).  If 
> I've read the documentation correctly xerces should convert that 
> encoded ampersand to a simple ampersand so we end up with a value 
> something like -- company name & co.. 
> The problem that we are running into is that for some reason the 
> processing of the encoded character is causing the party name to 
> replicate, which in some cases (when the name is long) it exceeds 
> the maximum length allowed for that piece of data within the 
> database.  In the example above we would end up with the following...
> Company name Company name &Company name & co
> It drops the ampersand and everything after it in the first copy of
> the name, in the second it drops everything after the ampersand, and
> finally in the third instance of the name it has the name properly
> converted.
> We have recently changed the parser to the new version 2.9.0 in the
> hopes that it would handle this encoded character better, but the
> same problem persists.  Any suggestions on what we might do to correct
> this?
> Thanks 
> steve
