Posted to java-user@lucene.apache.org by "Sudarsan, Sithu D." <Si...@fda.hhs.gov> on 2009/05/21 16:42:59 UTC

Parsing large xml files

Hi,

While trying to parse XML documents of about 50MB in size, we run into
an OutOfMemoryError due to Java heap space. Increasing the JVM heap to
close to 2GB (that is the max) does not help. Is there any API that
could be used to handle such large single XML files?

If Lucene is not the right place to ask, please let me know where else
to look.

Thanks in advance,
Sithu D Sudarsan
sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu




Re: Parsing large xml files

Posted by Erick Erickson <er...@gmail.com>.
What fails and what is the stack trace? Have you tried just
parsing the XML in a stand-alone program independent of
indexing?

You should easily be able to parse a 50MB file with that much
memory, so I suspect something else is going on here. Perhaps you're
not *really* allocating that much memory to the process. If you're
working in an IDE, for instance, you could be allocating memory to the
IDE but not setting the correct runtime parameters for programs
run within that IDE.
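
A quick way to check both points at once is a stand-alone test that
prints the heap the JVM actually received and then stream-parses the
file. A minimal sketch, assuming a JAXP SAX parser on the classpath
and the file path passed as the first argument:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class ParseCheck {
        public static void main(String[] args) throws Exception {
            // Verify how much heap this process really received;
            // an IDE may not pass your -Xmx flag through.
            System.out.println("Max heap: "
                    + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");

            // Count elements while streaming; no DOM tree is built.
            final int[] count = {0};
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File(args[0]), new DefaultHandler() {
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
            System.out.println("Parsed " + count[0] + " elements");
        }
    }

If the printed heap is far below what was asked for, the -Xmx flag is
not reaching the process.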

If that is irrelevant, perhaps you could add more details...

Best
Erick




RE: Parsing large xml files

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
Thanks, I'll try that and get back to you.


Sincerely,
Sithu D Sudarsan

-----Original Message-----
From: Michael Barbarelli [mailto:mbarbarelli@gmail.com] 
Sent: Thursday, May 21, 2009 10:52 AM
To: java-user@lucene.apache.org
Subject: Re: Parsing large xml files

Why not use an XML pull parser?  I recommend against using an in-memory
parser.



Re: Parsing large xml files

Posted by Joel Halbert <jo...@su3analytics.com>.
Try http://piccolo.sourceforge.net/
It is small and fast.
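
Piccolo is SAX-compatible, so it can be dropped in through JAXP. A
minimal sketch; the factory class name is recalled from Piccolo's
documentation, so verify it against the version you download:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class PiccoloExample {
        public static void main(String[] args) throws Exception {
            // Ask JAXP for Piccolo instead of the default parser.
            System.setProperty("javax.xml.parsers.SAXParserFactory",
                    "com.bluecast.xml.JAXPSAXParserFactory");
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Streaming parse: memory use stays flat regardless of file size.
            parser.parse(new java.io.File(args[0]), new DefaultHandler());
        }
    }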




Re: Parsing large xml files

Posted by Michael Barbarelli <mb...@gmail.com>.
Why not use an XML pull parser?  I recommend against using an in-memory
parser.
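
One standard pull API is StAX (javax.xml.stream, bundled with Java 6);
the caller asks for events one at a time, so only the current event is
held in memory regardless of file size. A minimal sketch, with the
"record" element name invented for illustration:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class PullExample {
        public static void main(String[] args) throws Exception {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            // The caller drives the parse, one event at a time.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // handle one record, e.g. collect its text for indexing
                }
            }
            reader.close();
        }
    }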


Re: Parsing large xml files

Posted by prasanna pradhan <pr...@gmail.com>.
We had a similar problem, where we had to parse 1GB XML files. Better
to transform the XML into an array-like JSON structure and write a
custom search API using Lucene.
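
As a rough sketch of that shape: stream over the large file and index
each record as its own Lucene document, so memory use is bounded by
the record size rather than the file size. The "record" element and
"content" field are illustrative, and the IndexWriter calls follow the
Lucene 2.4-era API:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class XmlIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("index",
                    new StandardAnalyzer(), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(r.getLocalName())) {
                    text.setLength(0);   // start collecting a fresh record
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(r.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "record".equals(r.getLocalName())) {
                    // One Lucene document per record, so memory use
                    // is bounded by the size of a single record.
                    Document doc = new Document();
                    doc.add(new Field("content", text.toString(),
                            Field.Store.NO, Field.Index.ANALYZED));
                    writer.addDocument(doc);
                }
            }
            r.close();
            writer.close();
        }
    }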



-- 
Thanks,
Prasanna

Re: Parsing large xml files

Posted by Matthew Hall <mh...@informatics.jax.org>.
Yeah, there's a setting on Windows that allows you to use up to 3G, I
think it was. The limitation there comes from how 32-bit Windows
splits the virtual address space between user space and the kernel,
not from the file system. I don't remember offhand exactly what that
setting was, but I'm 100% certain that it's there.

If you do a Google search for JVM maximum memory settings on Windows
you should be able to find a few articles about it.

(At least that's certainly my recollection.)

Secondly, if you have a Linux machine available you should likely just
use that, particularly if it has a 64-bit processor, because then a
whole ton more memory becomes available to you.

When I'm developing my indexes I do it via Eclipse on my Windows
platform, but with the actual directories mounted from a Solaris
machine. When I go to actually MAKE the indexes I simply log in to
that machine, do a quick ant compile, and run them. Sure, it's an
extra step, but the gains are more than worth it in our case.

Matt



RE: Parsing large xml files

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
 
Hi Matt,

We use a 32-bit JVM. Though it is supposed to address up to 4GB, any
heap setting above 2GB on Windows XP fails. The machine has two
quad-core processors.

On Linux we're able to use 4GB, though!

If there is any setting that will let us use 4GB, do let me know.

Thanks,
Sithu D Sudarsan



Re: Parsing large xml files

Posted by Matthew Hall <mh...@informatics.jax.org>.
2G should not be a maximum for any JVM that I know of.

Assuming you are running a 32-bit JVM, you are actually able to
address a bit under 4G of memory; I've always used around 3.6G when
trying to max out a 32-bit JVM. Technically speaking it should be able
to address 4G under a 32-bit OS, but a certain percentage of the
address space is set aside for overhead, so you can really only use a
bit less than the max.

If you have a 64-bit OS/JVM (which you likely do), you can use the
-d64 setting for your runtime environment to set your maximum memory
much, MUCH higher; for example, we regularly use 6G of memory on our
application servers here at the lab.
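
For what it's worth, concrete invocations along those lines, with the
heap sizes purely illustrative and "MyIndexer" a stand-in for your own
main class (-d64 applies to Sun JVMs on 64-bit Solaris/Linux):

    # 32-bit JVM: a bit under 4G is the theoretical ceiling
    java -Xmx3600m -cp . MyIndexer

    # 64-bit OS and JVM: -d64 selects the 64-bit VM, allowing much larger heaps
    java -d64 -Xmx6g -cp . MyIndexer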

Hope this helps you a bit,

Matt



Re: Parsing large xml files

Posted by cr...@comcast.net.
Yes, that is something worth thinking about. Thanks for bringing this up.
----- Original Message ----- 
From: "Michael Wechner" <mi...@wyona.com> 
To: java-user@lucene.apache.org 
Sent: Friday, May 22, 2009 11:41:51 AM GMT -08:00 US/Canada Pacific 
Subject: Re: Parsing large xml files 

crackeur@comcast.net wrote:
> Once you get comfortable with vtd-xml, you will rarely want to go back to DOM and SAX...
>

maybe you want to consider contributing a vtd-xml-based parsing
implementation to Lucene ;-)

Thanks 

Michael 


Re: Parsing large xml files

Posted by Michael Wechner <mi...@wyona.com>.
crackeur@comcast.net wrote:
> Once you get comfortable with vtd-xml, you will rarely want to go back to DOM and SAX...
>

maybe you want to consider contributing a vtd-xml-based parsing
implementation to Lucene ;-)

Thanks

Michael


Re: Parsing large xml files

Posted by cr...@comcast.net.
Once you get comfortable with vtd-xml, you will rarely want to go back to DOM and SAX...
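
For a flavor of it, a small sketch against the VTD-XML API
(com.ximpleware), written from memory of its examples, so treat the
details as approximate; the "//record" XPath is illustrative:

    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;

    public class VtdExample {
        public static void main(String[] args) throws Exception {
            VTDGen gen = new VTDGen();
            // Array-backed parse: no per-node objects are created.
            if (!gen.parseFile(args[0], true)) {
                System.err.println("parse failed");
                return;
            }
            VTDNav nav = gen.getNav();
            AutoPilot ap = new AutoPilot(nav);
            ap.selectXPath("//record");
            while (ap.evalXPath() != -1) {
                int i = nav.getText();
                if (i != -1) System.out.println(nav.toString(i));
            }
        }
    }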


RE: Parsing large xml files

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
Thanks, everyone, for your useful suggestions/links.

Lucene uses DOM, and we tried SAX.

XML Pull and vtd-xml, as well as Piccolo, seem good.

However, for now, we've broken the file into smaller chunks and are
parsing those.

When we get some time, we'd like to refactor with the suggested ones.

Erick: We do use Eclipse, but running from the CLI gives the same
error! Maybe there is a way to address the memory issues, but for now
breaking the file into smaller chunks has worked...
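
A rough sketch of that chunking idea (not the poster's actual code),
assuming StAX and one output file per top-level child element, with
file names and encoding handling kept deliberately simple:

    import java.io.FileInputStream;
    import java.io.FileWriter;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.XMLEvent;

    public class Splitter {
        public static void main(String[] args) throws Exception {
            XMLEventReader in = XMLInputFactory.newInstance()
                    .createXMLEventReader(new FileInputStream(args[0]));
            XMLOutputFactory factory = XMLOutputFactory.newInstance();
            XMLEventWriter out = null;
            FileWriter file = null;
            int depth = 0, chunk = 0;
            while (in.hasNext()) {
                XMLEvent e = in.nextEvent();
                if (e.isStartElement()) {
                    depth++;
                    if (depth == 2) {  // a new top-level child: open a new chunk
                        file = new FileWriter("chunk" + (chunk++) + ".xml");
                        out = factory.createXMLEventWriter(file);
                    }
                }
                if (depth >= 2 && out != null) {
                    out.add(e);        // copy the event into the current chunk
                }
                if (e.isEndElement()) {
                    if (depth == 2) {  // top-level child closed: finish the chunk
                        out.close();
                        file.close();
                        out = null;
                    }
                    depth--;
                }
            }
            in.close();
        }
    }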


Sincerely,
Sithu D Sudarsan



Re: Parsing large xml files

Posted by Michael Wechner <mi...@wyona.com>.
crackeur@comcast.net wrote:
> http://vtd-xml.sf.net 

I am not familiar with that particular part of Lucene, but is it
possible that Lucene is using DOM for this parsing?
If so, one could try to replace it with SAX and hence get rid of the
OutOfMemory issue.

Cheers

Michael
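
To make the DOM/SAX contrast concrete, a minimal sketch: the commented
DOM call materializes the whole tree on the heap, while the SAX version
only sees callbacks and keeps nothing it does not choose to keep:

    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxInsteadOfDom {
        public static void main(String[] args) throws Exception {
            // DOM equivalent (holds the entire tree in memory):
            //   DocumentBuilderFactory.newInstance().newDocumentBuilder()
            //       .parse(new java.io.File(args[0]));
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new java.io.File(args[0]),
                    new DefaultHandler() {
                        public void characters(char[] ch, int start, int len) {
                            // text streams past; nothing accumulates
                        }
                    });
        }
    }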


Re: Parsing large xml files

Posted by cr...@comcast.net.

http://vtd-xml.sf.net 

