
Posted to users@cocoon.apache.org by Tom Bloomfield <to...@shopbloomfield.com> on 2004/11/19 02:45:32 UTC

Large XML transformations in Cocoon.

I'm planning to do XML -> text transformations (for tab-delimited 
output) and XML -> FOP on large XML datasets.  The XML I will be 
processing will be 10-12 MB in size, and will grow from there. Based on 
planning, the XSL will contain around 50 node traversals and will 
iterate over my XML dataset around 46,000 times.  Until now, my 
Cocoon transformations haven't been nearly this big.

The amount of JVM memory I have to deal with is limited (<256M).  This 
transformation will need to run in real-time. 

Does anyone have experience dealing with large datasets like this?

TIA,
Tom









---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org

Re: Large XML transformations in Cocoon.

Posted by Upayavira <uv...@upaya.co.uk>.

Tom Bloomfield wrote:

> I'm planning to do xml -> text transformations (for tab-delimited 
> output) and xml -> FOP on large XML datasets.  The XML I will  be 
> processing will be 10-12 MB in size, and will grow from there. Based 
> on planning, the XSL will contain around 50 node traversals and will 
> iterate over my XML dataset around 46,000 times.  Previous to this, my 
> Cocoon transformations haven't been nearly this big.
>
> The amount of JVM memory I have to deal with is limited (<256M).  This 
> transformation will need to run in real-time.
> Does anyone have experience dealing with large datasets like this?

That sounds like quite a challenge. XSLT isn't that appropriate for that 
sort of thing. Firstly, in XSLT, avoid arbitrary wanders around your XML 
tree - stay as close to the context node as you can.

Alternatively, look at STX (there is an STX block). See if you can 
manage your transformations with that. This is "streaming" 
transformations for XML, i.e. it is designed for streaming, and thus 
should be able to handle large datasets.
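
To illustrate what "streaming" means here, below is a minimal sketch in 
Python using the standard library's SAX parser. It is not STX or Cocoon 
code. It assumes a hypothetical layout in which each database row is a 
flat <row> element whose children are the fields; those names are 
assumptions, not taken from the original dataset. The point is only 
that a streaming handler holds one row at a time, so memory use does 
not depend on how many rows there are.

```python
import io
import xml.sax


class RowsToTsv(xml.sax.ContentHandler):
    """Write one tab-delimited line per <row>, holding only the current row."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.fields = None  # values of the row being read; None outside a row
        self.text = []      # character chunks of the field being read

    def startElement(self, name, attrs):
        if name == "row":          # hypothetical row element name
            self.fields = []
        elif self.fields is not None:
            self.text = []         # a new field starts inside the row

    def characters(self, content):
        if self.fields is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "row":
            # The row is complete: emit it and forget it.
            self.out.write("\t".join(self.fields) + "\n")
            self.fields = None
        elif self.fields is not None:
            self.fields.append("".join(self.text).strip())
            self.text = []


def to_tsv(source, out):
    """Parse `source` (a file name or binary stream) and write TSV to `out`."""
    xml.sax.parse(source, RowsToTsv(out))
```

Calling to_tsv(io.BytesIO(data), sys.stdout) would write each line as 
soon as its closing </row> tag is seen, which is the behaviour an STX 
stylesheet is meant to give you inside a pipeline.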

Regards, Upayavira


Re: Large XML transformations in Cocoon.

Posted by Bertrand Delacretaz <bd...@apache.org>.

On 19 Nov 2004, at 02:45, Tom Bloomfield wrote:

> ...The XML I will  be processing will be 10-12 MB in size, and will 
> grow from there. Based on planning, the XSL will contain around 50 
> node traversals and will iterate over my XML dataset around 46,000 
> times....

You'll probably have a hard time doing this on a 256-MB system.

In such a case I'd ask myself if my problem is *so* hard as to require 
46'000 iterations over the XML dataset. Of course it depends on the 
kind of data you're processing, but this sounds very unusual.

-Bertrand

Re: Large XML transformations in Cocoon.

Posted by Miles Elam <mi...@pcextremist.com>.

Go right ahead.  Anything I write to this mailing list is fair 
game/public domain.

- Miles Elam


On Nov 20, 2004, at 7:26 AM, Upayavira wrote:

> Miles Elam wrote:
>
> Very useful piece. Would you mind if I put this on the wiki?
>
> Regards, Upayavira
>
>> As someone who has used STX, I can recommend it in this situation 
>> wholeheartedly.  STX looks very much like XSLT but uses a different 
>> namespace and doesn't have as many options for transformation.



Re: Large XML transformations in Cocoon.

Posted by Upayavira <uv...@upaya.co.uk>.

Miles Elam wrote:

Very useful piece. Would you mind if I put this on the wiki?

Regards, Upayavira

> As someone who has used STX, I can recommend it in this situation 
> wholeheartedly.  STX looks very much like XSLT but uses a different 
> namespace and doesn't have as many options for transformation.
>
> Unless something drastic has changed lately in the XSLT used by 
> Cocoon, it uses a document table model (like a DOM but tailored toward 
> a read-only view and a transformation source).  This is necessary 
> because XSLT allows several passes over the same source document and 
> also allows arbitrary access to any point in the tree (although this 
> is usually quite inefficient).  So while XSLT is the preferred method 
> for XML transformation in general, certain circumstances like yours 
> would point toward alternatives.
>
> As far as streaming XSLT results is concerned, it's possible to 
> configure it this way at the expense of overall processing time.  But 
> you don't appear to have the memory for even one full transformation 
> let alone many at the same time.  STX is your best bet in my opinion.  
> This always streams the output by its very nature.
>
> Also, do NOT put this into a caching pipeline.  With such a large 
> source, memory constraints will get worse before they get better.  
> Reprocess each time (or pregenerate on intervals a la cron) to shift 
> the weight from memory to CPU/disk in this case.
>
> Of course, a final option is to write your own custom Cocoon 
> transformer, but I would recommend the STX route as it would likely be 
> almost as fast and a whole lot more flexible and maintainable in the 
> long run.
>
> - Miles Elam
>
>
> On Nov 19, 2004, at 7:07 AM, Tom Bloomfield wrote:
>
>> The number of iterations corresponds to the number of rows returned 
>> from the database.  There are roughly 46,000 rows present now, so I  
>> need at least that many rows in my display.  The XSL design enables 
>> me to use SAX which should help.  The easiest thing would be to limit 
>> the number of rows returned to something more reasonable like 10,000 
>> (or up the JVM memory :P), but this is the requirement I'm stuck with.
>>
>> Help me understand this: If I apply a transformation using XSLT, 
>> streaming the XML in, does Cocoon "stream" the results out?  I.e., 
>> does the entire transformation happen in memory and then get flushed 
>> to the client, or does Cocoon flush the buffer to the client as xxx 
>> bytes are filled?  I made an assumption that Cocoon does this 
>> automatically.

Re: Large XML transformations in Cocoon.

Posted by Tom Bloomfield <to...@shopbloomfield.com>.

Miles,

Thanks for the tips.  I'll move forward on coding this using STX and 
post some benchmarking numbers when I finish. 

TB




Re: Large XML transformations in Cocoon.

Posted by Miles Elam <mi...@pcextremist.com>.

As someone who has used STX, I can recommend it in this situation 
wholeheartedly.  STX looks very much like XSLT but uses a different 
namespace and doesn't have as many options for transformation.

Unless something drastic has changed lately in the XSLT used by Cocoon, 
it uses a document table model (like a DOM but tailored toward a 
read-only view and a transformation source).  This is necessary because 
XSLT allows several passes over the same source document and also 
allows arbitrary access to any point in the tree (although this is 
usually quite inefficient).  So while XSLT is the preferred method for 
XML transformation in general, certain circumstances like yours would 
point toward alternatives.

As far as streaming XSLT results is concerned, it's possible to 
configure it this way at the expense of overall processing time.  But 
you don't appear to have the memory for even one full transformation 
let alone many at the same time.  STX is your best bet in my opinion.  
This always streams the output by its very nature.
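
To make the difference between the two models concrete, here is a 
minimal Python sketch. It is an analogy only and does not show how 
Cocoon or its XSLT processor is implemented. The <row> element name is 
a hypothetical placeholder. The first function builds the whole tree 
before producing anything, much as a document model must; the second 
emits each line as soon as its row has been read.

```python
import io
import xml.etree.ElementTree as ET


def rows_buffered(source, row_tag="row"):
    """Tree approach: parse the whole document first, then emit every line."""
    root = ET.parse(source).getroot()  # the entire tree is in memory here
    return ["\t".join((f.text or "").strip() for f in row) + "\n"
            for row in root.iter(row_tag)]


def rows_streamed(source, row_tag="row"):
    """Incremental approach: yield each line once its row has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield "\t".join((f.text or "").strip() for f in elem) + "\n"
            # Free the row's children; the emptied row element itself stays
            # attached to the root, so memory still grows slightly per row.
            elem.clear()
```

Both produce the same lines for the same input. The buffered version 
can revisit any node at any time, which is what XSLT needs, but its 
memory use grows with the document. The streamed version can only look 
at the row in hand, which is roughly the trade-off STX makes.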

Also, do NOT put this into a caching pipeline.  With such a large 
source, memory constraints will get worse before they get better.  
Reprocess each time (or pregenerate on intervals a la cron) to shift 
the weight from memory to CPU/disk in this case.

Of course, a final option is to write your own custom Cocoon 
transformer, but I would recommend the STX route as it would likely be 
almost as fast and a whole lot more flexible and maintainable in the 
long run.

- Miles Elam

On Nov 19, 2004, at 7:07 AM, Tom Bloomfield wrote:

> The number of iterations corresponds to the number of rows returned 
> from the database.  There are roughly 46,000 rows present now, so I  
> need at least that many rows in my display.  The XSL design enables me 
> to use SAX which should help.  The easiest thing would be to limit the 
> number of rows returned to something more reasonable like 10,000 (or 
> up the JVM memory :P), but this is the requirement I'm stuck with.
>
> Help me understand this: If I apply a transformation using XSLT, 
> streaming the XML in, does Cocoon "stream" the results out?  I.e., does 
> the entire transformation happen in memory and then get flushed to the 
> client, or does Cocoon flush the buffer to the client as xxx bytes are 
> filled?  I made an assumption that Cocoon does this automatically.


Re: Large XML transformations in Cocoon.

Posted by Tom Bloomfield <to...@shopbloomfield.com>.

Upayavira, thanks for the heads up about STX.  I'll check out Joost 
later today.

The number of iterations corresponds to the number of rows returned from 
the database.  There are roughly 46,000 rows present now, so I  need at 
least that many rows in my display.  The XSL design enables me to use 
SAX which should help.  The easiest thing would be to limit the number 
of rows returned to something more reasonable like 10,000 (or up the JVM 
memory :P), but this is the requirement I'm stuck with.

Help me understand this: If I apply a transformation using XSLT, 
streaming the XML in, does Cocoon "stream" the results out?  I.e., does 
the entire transformation happen in memory and then get flushed to the 
client, or does Cocoon flush the buffer to the client as xxx bytes are 
filled?  I made an assumption that Cocoon does this automatically.

If anyone else has any suggestions, please let me know. 
TIA,
Tom

Bertrand Delacretaz wrote:

>
> On 19 Nov 2004, at 02:45, Tom Bloomfield wrote:
>
>> ...The XML I will  be processing will be 10-12 MB in size, and will 
>> grow from there. Based on planning, the XSL will contain around 50 
>> node traversals and will iterate over my XML dataset around 46,000 
>> times....
>
>
> You'll probably have a hard time doing this on a 256-MB system.
>
> In such a case I'd ask myself if my problem is *so* hard as to require 
> 46'000 iterations over the XML dataset. Of course it depends on the 
> kind of data you're processing, but this sounds very unusual.
>
> -Bertrand
>
