Posted to j-dev@xerces.apache.org by ruwired <ch...@citi.com> on 2009/01/20 23:04:07 UTC

Parsing a large XML file in Xerces

Currently I'm transforming from the command line with Xalan. 
Example:  java -ms1024m -mx1024m org.apache.xalan.xslt.Process -in
WebCrd.xml -xsl webcrd_EmpHist.xsl -out i_emphist.dat -INCREMENTAL

Now I'm trying to do the same exact process with Xerces.
It doesn't seem like Xerces has a similar incremental option...
Is there a way to parse a large file (500+mb) in a quick way that also
doesn't run the system out of memory?
I'm wondering how can I get Xerces to parse an input file "incrementally"
like Xalan?

I'm very new at all this so any ideas, solutions, or general help on where
to get started would be greatly appreciated.
-- 
View this message in context: http://www.nabble.com/Parsing-a-large-XML-file-in-Xerces-tp21572644p21572644.html
Sent from the Xerces - J - Dev mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org


RE: Parsing a large XML file in Xerces

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
ruwired <ch...@citi.com> wrote on 01/22/2009 10:21:23 AM:

> Hmm...I see.
>
> The guy who gave this project to me is misinformed. I was told Xalan itself
> is a parser but apparently it's not. I was also told Xerces is the parser
> recommended as a replacement for Xalan in Java 1.5. That is apparently wrong
> too since Xerces is the replacement for Crimson. Correct me if I'm wrong.

That's right. Xerces is an XML parser. Xalan is an XSLT processor. Sun Java
5 includes derivatives of these.

> The point of this whole thing was to see if Xerces would cut down on the
> time it takes to process a 500mb XML file and pass through about 13 XSLs.
> What would be the optimal way (if there is one) to do this if I'm not using
> Xalan?

If the transformation is simple you could roll your own, perhaps by writing
a SAX ContentHandler which does the work. I imagine your scenario isn't
simple though.
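
A minimal sketch of the "roll your own" approach, assuming the goal is only to pull the text of one kind of element into a flat file. The class name SaxTextExtractor, the element name "name", and the sample document are hypothetical and are not taken from the original files. The handler sees one event at a time, so memory use does not grow with the size of the input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: writes the text of every matching element, one per line.
class SaxTextExtractor extends DefaultHandler {
    private final String targetElement;   // element whose text is wanted (assumed name)
    private final Writer out;              // destination, e.g. a FileWriter for a .dat file
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget = false;

    SaxTextExtractor(String targetElement, Writer out) {
        this.targetElement = targetElement;
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(targetElement)) {
            inTarget = true;
            text.setLength(0);             // start collecting a fresh value
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so accumulate.
        if (inTarget) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (inTarget && qName.equals(targetElement)) {
            try {
                out.write(text.toString().trim());
                out.write('\n');
            } catch (IOException e) {
                throw new SAXException(e);
            }
            inTarget = false;
        }
    }

    // Parses the input with whatever SAX parser JAXP finds and streams matches to 'out'.
    static void extract(InputSource in, String element, Writer out) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new SaxTextExtractor(element, out));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<staff><emp><name>Ann</name></emp><emp><name>Bob</name></emp></staff>";
        StringWriter out = new StringWriter();
        extract(new InputSource(new StringReader(xml)), "name", out);
        System.out.print(out);             // prints "Ann" and "Bob" on separate lines
    }
}
```

For a real run the StringReader and StringWriter would be replaced by file streams. Anything that needs sorting, grouping, or lookups across the whole document would require extra state in the handler, which is where this approach stops being simple.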

> In any case, now I don't see what the point is of removing Xalan from the
> equation and just using Xerces. It seems like all I would be doing is
> removing the ability to do XSL transforms in a simplified way. Doesn't Xalan
> use the Xerces implementation by default as the parser?

It uses whatever parser it finds through JAXP's factory mechanism. The
default will often be Xerces.
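
One way to see which parser the factory mechanism actually picks on a given machine is to ask JAXP for a parser and print the implementation class. This is only a sketch; the class name WhichParser is made up for illustration, and the output depends on the JDK and on what is on the classpath.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which SAX implementation JAXP resolves to.
class WhichParser {
    // Returns the class name of the XMLReader behind the default SAXParserFactory.
    static String parserClassName() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        return parser.getXMLReader().getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Factory: " + SAXParserFactory.newInstance().getClass().getName());
        System.out.println("Reader:  " + parserClassName());
    }
}
```

With the Xerces jars on the classpath, the lookup can also be forced by setting the system property javax.xml.parsers.SAXParserFactory to org.apache.xerces.jaxp.SAXParserFactoryImpl on the java command line.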

> Michael Glavassevich-3 wrote:
> >
> > Most XSLT processors build some data model of the document regardless of
> > the form of the input. So even if you fire SAX events to the XSLT
> > processor I would expect that you would still run out of memory for very
> > large documents.
> >
> > Michael Glavassevich
> > XML Parser Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com
> > E-mail: mrglavas@apache.org
> >
>

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

RE: Parsing a large XML file in Xerces

Posted by ruwired <ch...@citi.com>.
Hmm...I see.

The guy who gave this project to me is misinformed. I was told Xalan itself
is a parser but apparently it's not. I was also told Xerces is the parser
recommended as a replacement for Xalan in Java 1.5. That is apparently wrong
too since Xerces is the replacement for Crimson. Correct me if I'm wrong.

The point of this whole thing was to see if Xerces would cut down on the
time it takes to process a 500mb XML file and pass through about 13 XSLs.
What would be the optimal way (if there is one) to do this if I'm not using
Xalan?

In any case, now I don't see what the point is of removing Xalan from the
equation and just using Xerces. It seems like all I would be doing is
removing the ability to do XSL transforms in a simplified way. Doesn't Xalan
use the Xerces implementation by default as the parser?





Michael Glavassevich-3 wrote:
> 
> Most XSLT processors build some data model of the document regardless of
> the form of the input. So even if you fire SAX events to the XSLT
> processor
> I would expect that you would still run out of memory for very large
> documents.
> 
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
> 

-- 
View this message in context: http://www.nabble.com/Parsing-a-large-XML-file-in-Xerces-tp21572644p21606335.html
Sent from the Xerces - J - Dev mailing list archive at Nabble.com.




RE: Parsing a large XML file in Xerces

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Most XSLT processors build some data model of the document regardless of
the form of the input. So even if you fire SAX events to the XSLT processor,
I would expect that you would still run out of memory for very large
documents.
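
For reference, firing SAX events at an XSLT processor through JAXP looks roughly like the sketch below. The class name SaxFedTransform, the sample document, and the stylesheet are hypothetical. The parser hands events to the Transformer via a SAXSource, but as noted above the processor will typically still build its own in-memory model, so this is not expected to lower memory use for a 500 MB input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical example: feeds a SAX parser's events to whatever XSLT processor JAXP finds.
class SaxFedTransform {
    static final String SAMPLE_XML =
        "<staff><emp><name>Ann</name></emp><emp><name>Bob</name></emp></staff>";

    // Made-up stylesheet: emits each employee name followed by a semicolon, as plain text.
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='staff/emp'>"
      + "<xsl:value-of select='name'/><xsl:text>;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml, String xsl) throws Exception {
        // XSLT processing needs namespace-aware SAX events.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));

        // The SAXSource pairs the parser with the input; the processor pulls events from it.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xml)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSL)); // prints "Ann;Bob;"
    }
}
```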

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

ruwired <ch...@citi.com> wrote on 01/21/2009 11:30:56 AM:

> I'm using XSL to get the text I want into a dat file.  What do I use for an
> XSLT processor with Xerces?  Saxon?
>
> Thanks.
>
> MeBigFatGuy wrote:
> >
> > Use SAX of course
> >
> > http://xerces.apache.org/xerces2-j/samples-sax.html
> >
> > -----Original Message-----
> > From: "ruwired"
> > Sent: Tuesday, January 20, 2009 5:04pm
> > To: j-dev@xerces.apache.org
> > Subject: Parsing a large XML file in Xerces
> >
> >
> > Currently I'm transforming from the command line with Xalan.
> > Example:  java -ms1024m -mx1024m org.apache.xalan.xslt.Process -in
> > WebCrd.xml -xsl webcrd_EmpHist.xsl -out i_emphist.dat -INCREMENTAL
> >
> > Now I'm trying to do the same exact process with Xerces.
> > It doesn't seem like Xerces has a similar incremental option...
> > Is there a way to parse a large file (500+mb) in a quick way that also
> > doesn't run the system out of memory?
> > I'm wondering how can I get Xerces to parse an input file "incrementally"
> > like Xalan?
> >
> > I'm very new at all this so any ideas, solutions, or general help on where
> > to get started would be greatly appreciated.

RE: Parsing a large XML file in Xerces

Posted by ruwired <ch...@citi.com>.
I'm using XSL to get the text I want into a dat file.  What do I use for an
XSLT processor with Xerces?  Saxon?  

Thanks.



MeBigFatGuy wrote:
> 
> Use SAX of course
> 
> http://xerces.apache.org/xerces2-j/samples-sax.html
> 
> -----Original Message-----
> From: "ruwired" 
> Sent: Tuesday, January 20, 2009 5:04pm
> To: j-dev@xerces.apache.org
> Subject: Parsing a large XML file in Xerces
> 
> 
> Currently I'm transforming from the command line with Xalan. 
> Example:  java -ms1024m -mx1024m org.apache.xalan.xslt.Process -in
> WebCrd.xml -xsl webcrd_EmpHist.xsl -out i_emphist.dat -INCREMENTAL
> 
> Now I'm trying to do the same exact process with Xerces.
> It doesn't seem like Xerces has a similar incremental option...
> Is there a way to parse a large file (500+mb) in a quick way that also
> doesn't run the system out of memory?
> I'm wondering how can I get Xerces to parse an input file "incrementally"
> like Xalan?
> 
> I'm very new at all this so any ideas, solutions, or general help on where
> to get started would be greatly appreciated.

-- 
View this message in context: http://www.nabble.com/Parsing-a-large-XML-file-in-Xerces-tp21572644p21586777.html
Sent from the Xerces - J - Dev mailing list archive at Nabble.com.

