Posted to solr-user@lucene.apache.org by Josh Harness <jo...@jtv.com> on 2011/11/08 17:10:58 UTC

DataImportHandler Streaming XML Parse

All -

     We're using DIH to import flat xml files. We're getting Heap memory
exceptions due to the file size. Is there any way to force DIH to do a
streaming parse rather than a DOM parse? I really don't want to chunk my
files up or increase the heap size.

Many Thanks!

Josh

Re: DataImportHandler Streaming XML Parse

Posted by Chris Hostetter <ho...@fucit.org>.
:      We're using DIH to import flat xml files. We're getting Heap memory
: exceptions due to the file size. Is there any way to force DIH to do a
: streaming parse rather than a DOM parse? I really don't want to chunk my
: files up or increase the heap size.

The XPathEntityProcessor uses a streaming parser -- it doesn't read in
the entire DOM for each file (that's the main reason why it doesn't
support full XPath expressions, just a subset).
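
As a rough illustration of what a streaming parse means here, the sketch
below walks an XML document one event at a time with the JDK's StAX API
instead of building a DOM, so only the current event is held in memory.
This is not DIH's actual code; the class name StreamingCount, the method
countRecords and the sample "doc" element are assumptions made up for the
example.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of Solr or DIH.
public class StreamingCount {

    // Counts elements named recordName by pulling parse events one at a
    // time; no tree of the whole document is ever built.
    static int countRecords(Reader in, String recordName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && recordName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch.
        String xml = "<docs><doc><id>1</id></doc><doc><id>2</id></doc></docs>";
        System.out.println(countRecords(new StringReader(xml), "doc"));
    }
}
```

With a parser of this kind, memory use depends mostly on what the caller
chooses to keep per record, not on the size of the file.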

If you are getting OOM errors, the source of the problem is possibly
simply a heap that is unreasonably small, or some other bug -- you haven't
really provided many details to go on (ie: how big your current heap is,
what types of things you do in this Solr server (ie: index & search?
using filterCache? sorting?), what your DIH configs look like, how big
each individual entity is in the XML files, etc...) so it's hard to guess
what your problem might be.

One of the best ways to narrow down a problem like this is to look at
some heap visualization tools to see what is actually using all the heap
(who knows: maybe you can help us track down a bug no one else has
discovered yet because your use case is unusual).


-Hoss