Posted to dev@xalan.apache.org by Aliger Martin <XA...@cs.felk.cvut.cz> on 2000/03/26 12:07:08 UTC

processing large XML

Hi,

I am trying to use Xalan for XML->XML transformation. It works well for small 
inputs, but I also need to handle larger ones (up to 500MB), even on poor 
machines (12MB+windowze)...

Problem:
 I tried this: ran Java Xalan 0.20 on my Linux box with SUN JDK 1.2.2
 Program: SimpleTransform,SAX,DOMParser+SimpleTransform
 Inputs: standard XSL, no problem
         XML (2kB,20kB,1MB,20MB), content is always similar 
               (only <a>1</a> and <b>2</b>)
 Result: on the smaller ones everything is fine
         on the 20MB one the JVM produces a SEGMENTATION FAULT

Is this an error in the JDK or in Xalan? (I hope it isn't my program :-)
Could somebody help?

With regards
  Martin Aliger

PS: I have placed everything somewhere under http://cs.felk.cvut.cz/~xaliger

Re: processing large XML

Posted by Aliger Martin <XA...@cs.felk.cvut.cz>.
> Yes, the problem is: a template may select any node of the whole document
> (e.g. <xsl:apply-templates select="//*[@attr=$var]"/>). A mathematician may find
> a rule to recognize one-pass stylesheets, but as far as I know all XSL 
> processors fetch the whole document into memory. 
> 
> Saxon seems to be the most 'sophisticated' free processor. The preview tag
> tells the processor that you don't select the children of another preview element.
> But I am afraid, if you can't split the document, the preview tag will not
> help you :-(.
> 
> The summary - if you implement a one-pass processor, you will become
> the hero of heroes!
> 
> We are looking forward to seeing your Xalan one-pass extensions :-).

:)))))))))) Thanks. I will try to handle this problem. Somehow. I have to. But 
I don't know if an XSL processor is the right way.

Bye - I will try something. And if it works - maybe I could commit it as a Xalan 
extension :-). I haven't much practice in committing to free projects, but 
this could be a good opportunity, couldn't it? :-)))

I know that it is impossible to handle everything with a one-pass processor. But 
some "reasonable" subset ... For example, only such XSLs which have 
templates only of the form apply-templates select="name". And no 
reordering, of course.

It's pretty late (2:20) - going to bed. morning ....

MSK ALIK

Re: processing large XML

Posted by Edwin Glaser <ed...@pannenleiter.de>.
Hello,

You wrote:
> Is it really necessary to have every node in memory? I don't think so. I would
> be happy if some limited transformations (no reordering) could be done 
> with one sequential pass. But I still need some advanced features on 
> smaller files (sorting,reordering,...). And testing where the small/large 
> boundary lies is quite difficult - and you need 2 program branches...

Yes, the problem is: a template may select any node of the whole document
(e.g. <xsl:apply-templates select="//*[@attr=$var]"/>). A mathematician may find
a rule to recognize one-pass stylesheets, but as far as I know all XSL 
processors fetch the whole document into memory. 

Saxon seems to be the most 'sophisticated' free processor. The preview tag
tells the processor that you don't select the children of another preview element.
But I am afraid, if you can't split the document, the preview tag will not
help you :-(.

The summary - if you implement a one-pass processor, you will become
the hero of heroes!

We are looking forward to seeing your Xalan one-pass extensions :-).

-- 
Edwin Glaser -- edwin@pannenleiter.de


Re: processing large XML

Posted by Aliger Martin <XA...@cs.felk.cvut.cz>.
> Hello,
Hello,
> 
> You wrote:
> > I am trying to use Xalan for XML->XML transformation. It works well for small 
> > inputs, but I also need to handle larger ones (up to 500MB), even on poor 
> > machines (12MB+windowze)...
> 
> Do you really need one big file ? Is it possible to split the file
> into 100 chunks ? You can prepare a StylesheetRoot object and apply
> it 100 times.

:-((. This isn't possible :-(. It is generated XML and I know absolutely 
nothing about its structure and so on...

> The saxon processor (http://users.iclway.co.uk/mhkay/saxon/) might
> be another solution for your problem. 
> - The saxon:preview element is a top-level element used to identify
>   elements that will be processed in preview mode. The purpose of 
>   preview mode is to enable XSL processing of very large documents 
>   that are too big to fit in memory: the idea is that subtrees of 
>   the document can be processed and then discarded as soon as they
>   are encountered.

Thank you - I will try this. I'm new to this XML field (for two weeks?) and 
still looking for a good solution...

> Just another thought: Can you accept the performance of xsl
> transformations ? If you use a raw sax parser and hard coded 
> transformations, your program will run 10 to 100 times faster. 

Yep. You are right. Maybe the XSL performance will not be acceptable. I'm 
still testing. Yes - I use SAX (yesterday I found it - and was happy :-), 
but writing my own processor is a hard job. I have been coding one for three 
months already - and it is very limited (compared with Xalan).

Is it really necessary to have every node in memory? I don't think so. I would 
be happy if some limited transformations (no reordering) could be done 
with one sequential pass. But I still need some advanced features on 
smaller files (sorting,reordering,...). And testing where the small/large 
boundary lies is quite difficult - and you need 2 program branches...

Thanks
  Martin Aliger

Re: processing large XML

Posted by Edwin Glaser <ed...@pannenleiter.de>.
Hello,

You wrote:
> I am trying to use Xalan for XML->XML transformation. It works well for small 
> inputs, but I also need to handle larger ones (up to 500MB), even on poor 
> machines (12MB+windowze)...

Do you really need one big file ? Is it possible to split the file
into 100 chunks ? You can prepare a StylesheetRoot object and apply
it 100 times.
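
A minimal sketch of the "prepare once, apply many times" idea, assuming the
input really can be split into well-formed chunks. It does not use the
StylesheetRoot class mentioned above; it swaps in the standard JAXP
Templates interface (javax.xml.transform), which plays the same role of a
compiled, reusable stylesheet. The stylesheet, the chunk strings, and the
class and method names are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ChunkedTransformSketch {
    // Hypothetical stylesheet: turns each <a> element into an <x> element.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='a'><x><xsl:value-of select='.'/></x></xsl:template>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet once; the Templates object can be reused.
    static Templates compile(String xsl) {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xsl)));
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet did not compile", e);
        }
    }

    // Apply the compiled stylesheet to one chunk; only this chunk is in memory.
    static String apply(Templates templates, String xmlChunk) {
        try {
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xmlChunk)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Templates templates = compile(XSL);
        // Stand-ins for the chunks of a split input file.
        for (String chunk : new String[] {"<a>1</a>", "<a>2</a>"}) {
            System.out.println(apply(templates, chunk));
        }
    }
}
```

Memory use is then bounded by the largest chunk rather than by the whole
file, but this only works when no template needs to look across chunk
boundaries.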

The saxon processor (http://users.iclway.co.uk/mhkay/saxon/) might
be another solution for your problem. 
- The saxon:preview element is a top-level element used to identify
  elements that will be processed in preview mode. The purpose of 
  preview mode is to enable XSL processing of very large documents 
  that are too big to fit in memory: the idea is that subtrees of 
  the document can be processed and then discarded as soon as they
  are encountered.

Just another thought: Can you accept the performance of xsl
transformations ? If you use a raw sax parser and hard coded 
transformations, your program will run 10 to 100 times faster. 
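
A rough sketch of what such a hard-coded SAX transformation could look
like, using the standard org.xml.sax API. The rule itself (rename <a> to
<x>, copy everything else) is made up for illustration, as are the class
names; attributes are dropped to keep it short. Output is written as the
events arrive, so no document tree is built. The sketch makes no claim
about the actual speed-up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class OnePassSaxSketch {
    // Writes each event straight to the output; nothing is kept in memory.
    static class RenamingHandler extends DefaultHandler {
        private final Writer out;

        RenamingHandler(Writer out) { this.out = out; }

        // Hypothetical hard-coded rule: <a> becomes <x>, other names are kept.
        private static String map(String qName) {
            return qName.equals("a") ? "x" : qName;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            write("<" + map(qName) + ">"); // attributes are ignored in this sketch
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            write("</" + map(qName) + ">");
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            write(escape(new String(ch, start, length)));
        }

        private void write(String s) throws SAXException {
            try {
                out.write(s);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    // Re-escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Parse the input once and stream the transformed result to 'out'.
    static void transform(InputSource in, Writer out) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(in, new RenamingHandler(out));
            out.flush();
        } catch (Exception e) {
            throw new IllegalStateException("one-pass transformation failed", e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        transform(new InputSource(new StringReader("<root><a>1</a><b>2</b></root>")), out);
        System.out.println(out); // prints <root><x>1</x><b>2</b></root>
    }
}
```

For a real file, the StringReader and StringWriter would be replaced by
file streams. Anything that needs sorting or reordering does not fit this
pattern, because each element is written out as soon as it is seen.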

Hope it helps. edwin.
-- 
Edwin Glaser -- edwin@pannenleiter.de