Posted to j-users@xerces.apache.org by Fuzzo <mc...@gmail.com> on 2008/10/22 09:54:18 UTC

Xerces2 vs Xerces1 Element Text Parsing Implementation

Hi all!

Let me explain the problem with an example.
I have to parse XML of this form:

<anomaly id="0012" severity="4">some_text_with_%A7_symbol</anomaly>

With the Xerces1 SAX parser, the element text (some_text_with_%A7_symbol)
is delivered in one go, at full length, by a single call to the
characters(char[] ch, int start, int length) method.

With Xerces2, the element text is delivered in 30-byte chunks, and the
method is invoked several times until the element text is fully parsed.

In my application, the element text is sometimes encoded with the
java.net.URLEncoder class and then decoded with java.net.URLDecoder.

With Xerces2, it can happen that a chunk ends in the middle of an escape,
e.g. first_part_of_text_%, and URLDecoder cannot handle the trailing %
character. It fails with "URLDecoder: Incomplete trailing escape (%)
pattern" because it does not find the two characters that should follow
(e.g. %A7 is the § symbol in the Cp1252 encoding).
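
A minimal sketch of the failure described above, assuming nothing beyond
the standard java.net.URLDecoder. The class name TrailingEscapeDemo and the
helper decodeOrMessage are made up for illustration. Decoding the complete
text works, while decoding a chunk that ends in a bare % makes URLDecoder
throw an IllegalArgumentException:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of the original application.
class TrailingEscapeDemo {

    // Decodes s as Cp1252. Instead of letting the IllegalArgumentException
    // escape, returns its message so both cases can be printed side by side.
    static String decodeOrMessage(String s) {
        try {
            return URLDecoder.decode(s, "Cp1252");
        } catch (IllegalArgumentException e) {
            return "ERROR: " + e.getMessage();
        } catch (UnsupportedEncodingException e) {
            // Only reached if this JVM does not ship the Cp1252 charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Complete text: %A7 is decoded to the section sign.
        System.out.println(decodeOrMessage("some_text_with_%A7_symbol"));
        // Chunk cut right after the %: the decoder reports an
        // incomplete trailing escape.
        System.out.println(decodeOrMessage("first_part_of_text_%"));
    }
}
```

The decoder needs the whole escape sequence at once, so it can only be
applied safely to the complete element text, not to individual chunks.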

Is there a way to configure Xerces2 to deliver the element text in a
single call?

Many thanks!


-- 
View this message in context: http://www.nabble.com/Xerces2-vs-Xerces1-Element-Text-Parsing-Implementation-tp20105730p20105730.html
Sent from the Xerces - J - Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org

Re: Xerces2 vs Xerces1 Element Text Parsing Implementation

Posted by Fuzzo <mc...@gmail.com>.

Hi!
Many, many thanks for the answer!

Problem solved with a StringBuffer!


Re: Xerces2 vs Xerces1 Element Text Parsing Implementation

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi,

Fuzzo <mc...@gmail.com> wrote on 10/22/2008 03:54:18 AM:

> Is there a way to configure Xerces2 to deliver the element text in a
> single call?

No. characters() may be called multiple times [1][2] for contiguous text.
You cannot assume it will only be called once. Your ContentHandler needs to
accumulate the text returned in each call of characters() until you receive
a callback that isn't characters.
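
A minimal sketch of the accumulation pattern, not a definitive
implementation. It assumes the element contains only text (no child
elements), so it flushes in endElement() rather than on every
non-characters callback. The class name AnomalyHandler, its anomalies list
and the URL decoding step are assumptions taken from the example in the
question:

```java
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the decoded text of each <anomaly>.
class AnomalyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final List<String> anomalies = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        text.setLength(0); // start a fresh buffer for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text: only append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("anomaly".equals(qName)) {
            try {
                // The text is complete here, so decoding is safe.
                anomalies.add(URLDecoder.decode(text.toString(), "Cp1252"));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e);
            }
        }
        text.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<anomaly id=\"0012\" severity=\"4\">"
                + "some_text_with_%A7_symbol</anomaly>";
        AnomalyHandler handler = new AnomalyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        System.out.println(handler.anomalies);
    }
}
```

Because decoding happens only once the element is closed, it no longer
matters how the parser splits the text across characters() calls. A
StringBuffer works the same way as the StringBuilder used here.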

Thanks.

[1] http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)
[2] http://xerces.apache.org/xerces2-j/faq-sax.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org