Posted to solr-user@lucene.apache.org by jimtronic <ji...@gmail.com> on 2013/08/06 16:23:53 UTC

Help importing XML file as raw XML

Hi,

I found a few threads dealing with this problem, but none of them gave
much detail about the solution.

I have large XML files (500 MB to over 2 GB) with a complex nested
structure. It's impossible for me to import the exact structure into a Solr
representation, and, honestly, I don't need to. But I do need to store the
raw XML for each main item in a Solr field for use by other clients.

I tried using the xsl option for the XPathEntityProcessor, and it works
perfectly for small files. However, it cannot handle the big files, or at
least the machine I have doesn't have enough memory for the task.

A normal import with the XPathEntityProcessor takes just a few minutes. I
do this job a couple of times a day and I don't want it to eat up all the
memory on one of my nodes.

I tried using xsltproc to pre-transform the file, but it also took a long
time and eventually failed after running out of memory.

My best option now seems to be using awk or sed to transform the file
before the Solr import, perhaps by removing line breaks and then using the
LineEntityProcessor and some scripts.
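
For illustration only, the pre-transform idea could also be sketched in
Python instead of awk or sed. This is a rough sketch, not a tested
production script: the function name and the "item" tag are hypothetical
placeholders, and it assumes each main item is a complete element with a
known tag name. It streams the file with ElementTree's iterparse so the
whole document is never held in memory, and writes each item as a single
line that a line-oriented import could pick up. Note that the output is
re-serialized XML, not the original bytes: comments are dropped and
namespace prefixes may be rewritten.

```python
import xml.etree.ElementTree as ET


def write_items_one_per_line(source, out, item_tag="item"):
    """Stream `source` and write every <item_tag> element to `out`, one per line.

    `source` is a filename or binary file object, `out` is a text file object.
    `item_tag` is a hypothetical tag name; use whatever wraps a main item.
    Returns the number of items written.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it
    # so already-processed children can be released as we go.
    _, root = next(context)

    count = 0
    for event, elem in context:
        if event != "end" or elem.tag != item_tag:
            continue
        # The tail is whitespace after the closing tag, not part of the item.
        elem.tail = None
        raw = ET.tostring(elem, encoding="unicode")
        # Encode line breaks as character references so the record fits on
        # one line but still parses back to the same text content.
        raw = raw.replace("\r", "&#13;").replace("\n", "&#10;")
        out.write(raw + "\n")
        count += 1
        # Free memory: empty this item and drop finished children of the root.
        elem.clear()
        root.clear()
    return count
```

It could be called as write_items_one_per_line("big.xml", open("items.txt",
"w", encoding="utf-8")), where both file names are made up for the example.
Memory use should stay roughly proportional to the largest single item
rather than to the file size, though that has not been measured on files of
the size described here.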

My other thought is that since the XPathEntityProcessor knows the
structure, there must be some way to extend it so that it outputs the raw
input if requested.

Anyone have any other thoughts?

Thanks!
Jim 


