Posted to commons-dev@ws.apache.org by Kasun Indrasiri <ka...@gmail.com> on 2010/04/30 11:51:34 UTC

XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Hi,

When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing"
set to false), Axiom breaks large text entries into multiple chunks. As a
result, a CDATA section with lengthy text gets translated into multiple
CDATA sections.

thanks,
-- 
Kasun Indrasiri
Senior Software Engineer,
WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
Blog : http://kasunpanorama.blogspot.com/

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Andreas Veithen <an...@gmail.com>.

I've updated the Javadoc of OMText [1] to clarify all this.

Andreas

[1] https://svn.apache.org/repos/asf/webservices/commons/trunk/modules/axiom/modules/axiom-api/src/main/java/org/apache/axiom/om/OMText.java

On Sat, May 1, 2010 at 11:40, Hiranya Jayathilaka <hi...@gmail.com> wrote:
> On Fri, Apr 30, 2010 at 9:12 PM, Andreas Veithen
> <an...@gmail.com>wrote:
>
>> Axiom always creates the nodes based on the events received from the
>> underlying parser. If javax.xml.stream.isCoalescing is set to false on
>> the parser, then by definition the parser may return large text nodes
>> in multiple chunks. The problem is that if
>> javax.xml.stream.isCoalescing is set to true, StAX doesn't report
>> CDATA sections in the document as CDATA events, but as CHARACTER
>> events. It is however possible to configure Woodstox to report CDATA
>> sections without splitting text nodes into chunks. Note that even with
>> such a configuration, OMElement#getText should always be used to
>> extract the text content of an element (to cover the case where the
>> element contains a mix of text nodes and CDATA sections).
>>
>> Note that while coalescing is switched off by default at the StAX
>> level, Axiom overrides this so that by default coalescing is turned on
>> [1]. It is not surprising that there is code that implicitly relies on
>> this. Therefore, working with Axiom in non coalescing mode is always a
>> risk.
>>
>
> Thanks Andreas. This explains a lot.
>
> Thanks,
> Hiranya
>
>
>>
>> Andreas
>>
>> [1] http://people.apache.org/~veithen/axiom/userguide/ch04.html#d0e866
>>
>> On Fri, Apr 30, 2010 at 11:51, Kasun Indrasiri <ka...@gmail.com> wrote:
>> > Hi,
>> >
>> > When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
>> > false) Axiom breaks down large text entries to multiple chunks. Therefore
>> CDATA
>> > elements with lengthy texts get translated into multiple CDATA elements.
>> >
>> > thanks,
>> > --
>> > Kasun Indrasiri
>> > Senior Software Engineer,
>> > WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
>> > Blog : http://kasunpanorama.blogspot.com/
>> >
>>
>
>
>
> --
> Hiranya Jayathilaka
> Software Engineer;
> WSO2 Inc.;  http://wso2.org
> E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
> Blog: http://techfeast-hiranya.blogspot.com
>

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Hiranya Jayathilaka <hi...@gmail.com>.

On Fri, Apr 30, 2010 at 9:12 PM, Andreas Veithen
<an...@gmail.com> wrote:

> Axiom always creates the nodes based on the events received from the
> underlying parser. If javax.xml.stream.isCoalescing is set to false on
> the parser, then by definition the parser may return large text nodes
> in multiple chunks. The problem is that if
> javax.xml.stream.isCoalescing is set to true, StAX doesn't report
> CDATA sections in the document as CDATA events, but as CHARACTER
> events. It is however possible to configure Woodstox to report CDATA
> sections without splitting text nodes into chunks. Note that even with
> such a configuration, OMElement#getText should always be used to
> extract the text content of an element (to cover the case where the
> element contains a mix of text nodes and CDATA sections).
>
> Note that while coalescing is switched off by default at the StAX
> level, Axiom overrides this so that by default coalescing is turned on
> [1]. It is not surprising that there is code that implicitly relies on
> this. Therefore, working with Axiom in non coalescing mode is always a
> risk.
>

Thanks Andreas. This explains a lot.

Thanks,
Hiranya


>
> Andreas
>
> [1] http://people.apache.org/~veithen/axiom/userguide/ch04.html#d0e866
>
> On Fri, Apr 30, 2010 at 11:51, Kasun Indrasiri <ka...@gmail.com> wrote:
> > Hi,
> >
> > When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
> > false) Axiom breaks down large text entries to multiple chunks. Therefore
> CDATA
> > elements with lengthy texts get translated into multiple CDATA elements.
> >
> > thanks,
> > --
> > Kasun Indrasiri
> > Senior Software Engineer,
> > WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> > Blog : http://kasunpanorama.blogspot.com/
> >
>



-- 
Hiranya Jayathilaka
Software Engineer;
WSO2 Inc.;  http://wso2.org
E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
Blog: http://techfeast-hiranya.blogspot.com

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Hiranya Jayathilaka <hi...@gmail.com>.

Hi Kasun,

On Sat, May 1, 2010 at 11:03 AM, Kasun Indrasiri <ka...@gmail.com> wrote:

> Hi,
>
> I guess this becomes even more riskier in a scenario like this.
>
> XML string :  "<a> a_ lengthy_string</a>" -> omElem
>
> Once we parse this xml in non-coalescing mode and create an OM
> element(omElem) with this,
>
> - first Child : contains the first portion of 'a_lengthy_string' string
> - last Child : contains the rest
>
> However, as Hiranya mentioned 'omEle.getText()' will give us the correct
> value of the text content.
>
> Is this the acceptable behavior?
>

Yes. It seems if you are using non-coalescing mode, you should use the
getText() method to retrieve the full text from elements.

Thanks,
Hiranya


>
> regards,
>
> Kasun
>
>
> On Fri, Apr 30, 2010 at 9:12 PM, Andreas Veithen
> <an...@gmail.com>wrote:
>
> > Axiom always creates the nodes based on the events received from the
> > underlying parser. If javax.xml.stream.isCoalescing is set to false on
> > the parser, then by definition the parser may return large text nodes
> > in multiple chunks. The problem is that if
> > javax.xml.stream.isCoalescing is set to true, StAX doesn't report
> > CDATA sections in the document as CDATA events, but as CHARACTER
> > events. It is however possible to configure Woodstox to report CDATA
> > sections without splitting text nodes into chunks. Note that even with
> > such a configuration, OMElement#getText should always be used to
> > extract the text content of an element (to cover the case where the
> > element contains a mix of text nodes and CDATA sections).
> >
> > Note that while coalescing is switched off by default at the StAX
> > level, Axiom overrides this so that by default coalescing is turned on
> > [1]. It is not surprising that there is code that implicitly relies on
> > this. Therefore, working with Axiom in non coalescing mode is always a
> > risk.
> >
> > Andreas
> >
> > [1] http://people.apache.org/~veithen/axiom/userguide/ch04.html#d0e866
> >
> > On Fri, Apr 30, 2010 at 11:51, Kasun Indrasiri <ka...@gmail.com>
> wrote:
> > > Hi,
> > >
> > > When parsing XML in non-coalescing mode
> ("javax.xml.stream.isCoalescing",
> > > false) Axiom breaks down large text entries to multiple chunks.
> Therefore
> > CDATA
> > > elements with lengthy texts get translated into multiple CDATA
> elements.
> > >
> > > thanks,
> > > --
> > > Kasun Indrasiri
> > > Senior Software Engineer,
> > > WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> > > Blog : http://kasunpanorama.blogspot.com/
> > >
> >
>
>
>
> --
> Kasun Indrasiri
> Senior Software Engineer,
> WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> Blog : http://kasunpanorama.blogspot.com/
>



-- 
Hiranya Jayathilaka
Software Engineer;
WSO2 Inc.;  http://wso2.org
E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
Blog: http://techfeast-hiranya.blogspot.com

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Andreas Veithen <an...@gmail.com>.

On Sat, May 1, 2010 at 07:33, Kasun Indrasiri <ka...@gmail.com> wrote:
> Hi,
>
> I guess this becomes even more riskier in a scenario like this.
>
> XML string :  "<a> a_ lengthy_string</a>" -> omElem
>
> Once we parse this xml in non-coalescing mode and create an OM
> element(omElem) with this,
>
> - first Child : contains the first portion of 'a_lengthy_string' string
> - last Child : contains the rest
>
> However, as Hiranya mentioned 'omEle.getText()' will give us the correct
> value of the text content.
>
> Is this the acceptable behavior?

It's not the default behavior, but if someone explicitly configures
Axiom to switch off coalescing, then he has to live with the
consequences ;-)

> regards,
>
> Kasun
>
>
> On Fri, Apr 30, 2010 at 9:12 PM, Andreas Veithen
> <an...@gmail.com>wrote:
>
>> Axiom always creates the nodes based on the events received from the
>> underlying parser. If javax.xml.stream.isCoalescing is set to false on
>> the parser, then by definition the parser may return large text nodes
>> in multiple chunks. The problem is that if
>> javax.xml.stream.isCoalescing is set to true, StAX doesn't report
>> CDATA sections in the document as CDATA events, but as CHARACTER
>> events. It is however possible to configure Woodstox to report CDATA
>> sections without splitting text nodes into chunks. Note that even with
>> such a configuration, OMElement#getText should always be used to
>> extract the text content of an element (to cover the case where the
>> element contains a mix of text nodes and CDATA sections).
>>
>> Note that while coalescing is switched off by default at the StAX
>> level, Axiom overrides this so that by default coalescing is turned on
>> [1]. It is not surprising that there is code that implicitly relies on
>> this. Therefore, working with Axiom in non coalescing mode is always a
>> risk.
>>
>> Andreas
>>
>> [1] http://people.apache.org/~veithen/axiom/userguide/ch04.html#d0e866
>>
>> On Fri, Apr 30, 2010 at 11:51, Kasun Indrasiri <ka...@gmail.com> wrote:
>> > Hi,
>> >
>> > When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
>> > false) Axiom breaks down large text entries to multiple chunks. Therefore
>> CDATA
>> > elements with lengthy texts get translated into multiple CDATA elements.
>> >
>> > thanks,
>> > --
>> > Kasun Indrasiri
>> > Senior Software Engineer,
>> > WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
>> > Blog : http://kasunpanorama.blogspot.com/
>> >
>>
>
>
>
> --
> Kasun Indrasiri
> Senior Software Engineer,
> WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> Blog : http://kasunpanorama.blogspot.com/
>

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Kasun Indrasiri <ka...@gmail.com>.

Hi,

I guess this becomes even riskier in a scenario like this.

XML string :  "<a>a_lengthy_string</a>" -> omElem

Once we parse this XML in non-coalescing mode and create an OM
element (omElem) from it,

- first child : contains the first portion of the 'a_lengthy_string' string
- last child : contains the rest

However, as Hiranya mentioned, 'omElem.getText()' will give us the correct
value of the text content.

Is this the acceptable behavior?

regards,

Kasun


On Fri, Apr 30, 2010 at 9:12 PM, Andreas Veithen
<an...@gmail.com> wrote:

> Axiom always creates the nodes based on the events received from the
> underlying parser. If javax.xml.stream.isCoalescing is set to false on
> the parser, then by definition the parser may return large text nodes
> in multiple chunks. The problem is that if
> javax.xml.stream.isCoalescing is set to true, StAX doesn't report
> CDATA sections in the document as CDATA events, but as CHARACTER
> events. It is however possible to configure Woodstox to report CDATA
> sections without splitting text nodes into chunks. Note that even with
> such a configuration, OMElement#getText should always be used to
> extract the text content of an element (to cover the case where the
> element contains a mix of text nodes and CDATA sections).
>
> Note that while coalescing is switched off by default at the StAX
> level, Axiom overrides this so that by default coalescing is turned on
> [1]. It is not surprising that there is code that implicitly relies on
> this. Therefore, working with Axiom in non coalescing mode is always a
> risk.
>
> Andreas
>
> [1] http://people.apache.org/~veithen/axiom/userguide/ch04.html#d0e866
>
> On Fri, Apr 30, 2010 at 11:51, Kasun Indrasiri <ka...@gmail.com> wrote:
> > Hi,
> >
> > When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
> > false) Axiom breaks down large text entries to multiple chunks. Therefore
> CDATA
> > elements with lengthy texts get translated into multiple CDATA elements.
> >
> > thanks,
> > --
> > Kasun Indrasiri
> > Senior Software Engineer,
> > WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> > Blog : http://kasunpanorama.blogspot.com/
> >
>



-- 
Kasun Indrasiri
Senior Software Engineer,
WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
Blog : http://kasunpanorama.blogspot.com/

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Andreas Veithen <an...@gmail.com>.

Axiom always creates the nodes based on the events received from the
underlying parser. If javax.xml.stream.isCoalescing is set to false on
the parser, then by definition the parser may return large text nodes
in multiple chunks. The catch is that if
javax.xml.stream.isCoalescing is set to true, StAX doesn't report
CDATA sections in the document as CDATA events, but as CHARACTERS
events. It is, however, possible to configure Woodstox to report CDATA
sections without splitting text nodes into chunks. Note that even with
such a configuration, OMElement#getText should always be used to
extract the text content of an element (to cover the case where the
element contains a mix of text nodes and CDATA sections).

Note that while coalescing is switched off by default at the StAX
level, Axiom overrides this so that by default coalescing is turned on
[1]. It is not surprising that there is code that implicitly relies on
this. Therefore, working with Axiom in non-coalescing mode is always a
risk.
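
To make the StAX side of this concrete, here is a minimal sketch that uses
only the JDK's javax.xml.stream API (no Axiom or Woodstox involved). The
class name CoalescingDemo and the helper textEvents are made up for
illustration. It lists the text-like events a parser reports with the
coalescing property switched on and off. How a non-coalescing parser
chunks the text is implementation-specific, so no particular output is
claimed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Axiom or StAX.
final class CoalescingDemo {

    // Returns one entry per text-like event, formatted as "TYPE:text".
    static List<String> textEvents(String xml, boolean coalescing) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        List<String> events = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int type = reader.next();
                if (type == XMLStreamConstants.CHARACTERS) {
                    events.add("CHARACTERS:" + reader.getText());
                } else if (type == XMLStreamConstants.CDATA) {
                    events.add("CDATA:" + reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<a>before<![CDATA[ inside ]]>after</a>";
        System.out.println("coalescing on:  " + textEvents(xml, true));
        System.out.println("coalescing off: " + textEvents(xml, false));
    }
}
```

Whatever the chunking turns out to be, concatenating the text of all
reported events should give back the same content in both modes, which is
why code that reads only a single text node is fragile.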

Andreas

[1] http://people.apache.org/~veithen/axiom/userguide/ch04.html#d0e866

On Fri, Apr 30, 2010 at 11:51, Kasun Indrasiri <ka...@gmail.com> wrote:
> Hi,
>
> When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
> false) Axiom breaks down large text entries to multiple chunks. Therefore CDATA
> elements with lengthy texts get translated into multiple CDATA elements.
>
> thanks,
> --
> Kasun Indrasiri
> Senior Software Engineer,
> WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> Blog : http://kasunpanorama.blogspot.com/
>

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Hiranya Jayathilaka <hi...@gmail.com>.

Hi Benson,

On Fri, Apr 30, 2010 at 6:21 PM, Benson Margulies <bi...@gmail.com> wrote:

> I don't see how it could be a bug. Any XML parser is permitted to
> split these things for it's own convenience, it doesn't change the
> infoset.
>

Then the problem is in the way we have used the Axiom API. We currently
use the following bit of code to extract the text content from an element.
The text content is wrapped in a CDATA section.

OMNode nodeValue = elem.getFirstOMChild();
String text = ((OMText) nodeValue).getText().trim();

If the CDATA section gets translated into multiple CDATA sections, the above
code will return the text of the first CDATA section only. The rest of the
text is lost.

If we use the following code instead, things seem to work fine:

String text = elem.getText();

Is it OK to use the getText() method as shown above to retrieve CDATA
content?
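
For comparison, the same pitfall can be reproduced at the plain StAX
level. The sketch below uses only the JDK's javax.xml.stream API rather
than Axiom, and the class ElementTextDemo with its two methods is a
hypothetical name chosen for illustration. firstChunkOnly mirrors the
getFirstOMChild() approach by reading a single text event, while
wholeText relies on XMLStreamReader#getElementText, which concatenates
the text and CDATA events up to the end tag (similar in spirit to what
elem.getText() is used for above).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Axiom or StAX.
final class ElementTextDemo {

    // Opens a non-coalescing reader positioned on the root START_ELEMENT.
    private static XMLStreamReader openAtRoot(String xml)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
        reader.nextTag();
        return reader;
    }

    // Fragile: returns only the first text-like event inside the root.
    static String firstChunkOnly(String xml) {
        try {
            XMLStreamReader reader = openAtRoot(xml);
            int type = reader.next();
            boolean isText = type == XMLStreamConstants.CHARACTERS
                    || type == XMLStreamConstants.CDATA;
            String text = isText ? reader.getText() : "";
            reader.close();
            return text;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Robust: lets the reader concatenate all text up to the end tag.
    static String wholeText(String xml) {
        try {
            XMLStreamReader reader = openAtRoot(xml);
            String text = reader.getElementText();
            reader.close();
            return text;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

The result of firstChunkOnly depends on how the parser happens to split
the content, whereas wholeText does not.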

Thanks,
Hiranya


> On Fri, Apr 30, 2010 at 8:48 AM, Hiranya Jayathilaka
> <hi...@gmail.com> wrote:
> > On Fri, Apr 30, 2010 at 3:21 PM, Kasun Indrasiri <ka...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> When parsing XML in non-coalescing mode
> ("javax.xml.stream.isCoalescing",
> >> false) Axiom breaks down large text entries to multiple chunks.
> Therefore
> >> CDATA
> >> elements with lengthy texts get translated into multiple CDATA elements.
> >>
> >
> > Folks, is this a bug? This behavior is causing some complications with
> > Synapse. Any feedback would be highly appreciated.
> >
> > Thanks,
> > Hiranya
> >
> >
> >>
> >> thanks,
> >> --
> >> Kasun Indrasiri
> >> Senior Software Engineer,
> >> WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> >> Blog : http://kasunpanorama.blogspot.com/
> >>
> >
> >
> >
> > --
> > Hiranya Jayathilaka
> > Software Engineer;
> > WSO2 Inc.;  http://wso2.org
> > E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
> > Blog: http://techfeast-hiranya.blogspot.com
> >
>



-- 
Hiranya Jayathilaka
Software Engineer;
WSO2 Inc.;  http://wso2.org
E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
Blog: http://techfeast-hiranya.blogspot.com

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Benson Margulies <bi...@gmail.com>.

I don't see how it could be a bug. Any XML parser is permitted to
split these things for its own convenience; doing so doesn't change the
infoset.

On Fri, Apr 30, 2010 at 8:48 AM, Hiranya Jayathilaka
<hi...@gmail.com> wrote:
> On Fri, Apr 30, 2010 at 3:21 PM, Kasun Indrasiri <ka...@gmail.com> wrote:
>
>> Hi,
>>
>> When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
>> false) Axiom breaks down large text entries to multiple chunks. Therefore
>> CDATA
>> elements with lengthy texts get translated into multiple CDATA elements.
>>
>
> Folks, is this a bug? This behavior is causing some complications with
> Synapse. Any feedback would be highly appreciated.
>
> Thanks,
> Hiranya
>
>
>>
>> thanks,
>> --
>> Kasun Indrasiri
>> Senior Software Engineer,
>> WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
>> Blog : http://kasunpanorama.blogspot.com/
>>
>
>
>
> --
> Hiranya Jayathilaka
> Software Engineer;
> WSO2 Inc.;  http://wso2.org
> E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
> Blog: http://techfeast-hiranya.blogspot.com
>

Re: XML with large text entries are broken down to chunks when parsing with Axiom - non-coalescing mode.

Posted by Hiranya Jayathilaka <hi...@gmail.com>.

On Fri, Apr 30, 2010 at 3:21 PM, Kasun Indrasiri <ka...@gmail.com> wrote:

> Hi,
>
> When parsing XML in non-coalescing mode ("javax.xml.stream.isCoalescing",
> false) Axiom breaks down large text entries to multiple chunks. Therefore
> CDATA
> elements with lengthy texts get translated into multiple CDATA elements.
>

Folks, is this a bug? This behavior is causing some complications with
Synapse. Any feedback would be highly appreciated.

Thanks,
Hiranya


>
> thanks,
> --
> Kasun Indrasiri
> Senior Software Engineer,
> WSO2 Inc. - "Lean . Enterprise . Middleware" - http://www.wso2.com/
> Blog : http://kasunpanorama.blogspot.com/
>



-- 
Hiranya Jayathilaka
Software Engineer;
WSO2 Inc.;  http://wso2.org
E-mail: hiranya@wso2.com;  Mobile: +94 77 633 3491
Blog: http://techfeast-hiranya.blogspot.com