Posted to solr-user@lucene.apache.org by je...@bnf.fr on 2013/05/24 16:11:24 UTC

error while indexing huge filesystem with data import handler and FileListEntityProcessor

Hello,


We are trying to use the data import handler, particularly on a collection
which contains many files (one xml file per document).

Our configuration works for a small number of files, but dataimport fails
with an OutOfMemoryError when running it on 10M files (spread over several
directories...)

This is the content of our config.xml:

			<entity name="noticebib"
					datasource="null"
					processor="FileListEntityProcessor"
					fileName="^.*\.xml$" recursive="true"
					baseDir="${noticesBIB.basedir}"
					rootEntity="false"
				>

				<entity  name="processorDocument"
					processor="XPathEntityProcessor"
					url="${noticebib.fileAbsolutePath}"
					xsl="xslt/mnb/IXM_MNb.xsl"
					forEach="/record"
					transformer="fr.bnf.solr.BnfDateTransformer"
				>
				<all my mapping>

When we try it on a directory which contains 10 subdirectories, each
containing 1000 subdirectories, each of those containing 1000 xml files
(10M files in total), the indexing process doesn't work anymore.

We get a java.lang.OutOfMemoryError (even with 512 MB and 1 GB of heap):
ERROR 2013-05-24 15:26:25,733 http-9145-2 org.apache.solr.handler.dataimport.DataImporter (96) - Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be cast to java.lang.Exception
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
        at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)

Monitoring the JVM with VisualVM, I've seen that most of the time is spent
in the method FileListEntityProcessor.accept (called by getFolderFiles), so
I assume the error occurs while the list of files to be indexed is being
filled: that list is indeed built by getFolderFiles.

Basically, the list of files to index is built by getFolderFiles, which is
itself called on the first call to nextRow(). The indexing itself starts
only after that.
org/apache/solr/handler/dataimport/FileListEntityProcessor.java
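  // Recursively walks dir and appends one Map of file details per
  // accepted file to fileDetails -- so the complete list is held in
  // memory before the first document is handed to the indexer.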
  private void getFolderFiles(File dir, final List<Map<String, Object>> fileDetails) {

I tracked down the fileDetails variable, which contains the list of my xml
files. It already holds 611,345 entries (for approximately 500 MB of
memory), and I have 10M xml files (more or less...). That's why I think the
listing is not finished yet.
At roughly 0.8 KB per entry (500 MB / 611,345 entries), I guess holding the
entire list would need something between 5 and 10 GB for my process.

So I have several questions:
_ Is it possible to attach several FileListEntityProcessor entities to a
single XPathEntityProcessor in the data-config.xml? That way I could run
the import in ten passes, one per first-level directory (a sketch of this
idea follows the list).
_ Is there a roadmap to optimize this method, for example by not building
the complete file list up front, but fetching it in batches of 1000
documents, for instance?
_ Or to store the file list in a temporary file in order to save some
memory?
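
Here is a rough sketch of the ten-pass idea, assuming DIH's request-parameter
substitution (${dataimporter.request.paramname}); the parameter name "subdir"
and the example path are made up:

			<entity name="noticebib"
					dataSource="null"
					processor="FileListEntityProcessor"
					fileName="^.*\.xml$"
					recursive="true"
					baseDir="${dataimporter.request.subdir}"
					rootEntity="false">
				<!-- same inner XPathEntityProcessor entity as above -->
			</entity>

			<!-- one import request per first-level directory, with clean=false
			     so each pass keeps the documents of the previous ones, e.g.:
			     /solr/dataimport?command=full-import&clean=false&subdir=/data/noticesBIB/00 -->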

Regards,
-----------------------------------------------
Jérôme Dupont
-----------------------------------------------


Re: error while indexing huge filesystem with data import handler and FileListEntityProcessor

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On Fri, May 24, 2013 at 10:11 AM,  <je...@bnf.fr> wrote:
> Or to store the file list in a temporary file in order to save some
> memory?

This is probably your easiest option. Use an O/S level 'find' or
similar command to get all the files. Massage them in a text editor to
match the required format and use LineEntityProcessor to read those
URLs.
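
Something along these lines (an untested sketch; the list location
/data/filelist.txt and the base directory are just examples):

<!-- First build the list outside Solr, e.g.:
     find /data/noticesBIB -name '*.xml' > /data/filelist.txt -->
<dataConfig>
  <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- LineEntityProcessor streams the list one line at a time and
         exposes each line as ${files.rawLine}, so the full 10M-entry
         list never has to sit in memory -->
    <entity name="files"
            processor="LineEntityProcessor"
            url="/data/filelist.txt"
            dataSource="fileReader"
            rootEntity="false">
      <entity name="processorDocument"
              processor="XPathEntityProcessor"
              url="${files.rawLine}"
              dataSource="fileReader"
              xsl="xslt/mnb/IXM_MNb.xsl"
              forEach="/record"
              transformer="fr.bnf.solr.BnfDateTransformer">
        <!-- field mappings as in the original config -->
      </entity>
    </entity>
  </document>
</dataConfig>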

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)