Posted to solr-user@lucene.apache.org by Lannig Carina <La...@ssi-schaefer-noell.com> on 2010/08/12 14:32:19 UTC

Indexing large files using Solr Cell causes OutOfMemory error

Hi,

I'm trying to index a text file (~150MB) using Solr Cell/Tika.
The curl command aborts due to a java.lang.OutOfMemoryError.
*****************************************************************
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.lang.StringBuilder.toString(StringBuilder.java:430)
        at org.apache.solr.handler.extraction.SolrContentHandler.newDocument(SolrContentHandler.java:124)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:119)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:125)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:619)
[rest of the Apache Tomcat/6.0.26 HTML error page trimmed]
*****************************************************************

AFAIK Tika keeps the whole file in RAM and posts it to Solr as one single string.
I'm using the JVM arg -Xmx1024M and Solr's default config with
*****************************************************************
  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    ...
  </mainIndex>

  <requestDispatcher handleSelect="true" >
    <!--Make sure your system has some authentication before enabling remote streaming!  -->
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" />
   ...
*****************************************************************
Is there a way to force Solr/Tika to flush memory while indexing a file?
Having to grow the heap in proportion to the largest file to be indexed does not seem like a good solution.
Did I miss a configuration option, or do I have to modify the Java code? I only found http://osdir.com/ml/tika-dev.lucene.apache.org/2009-02/msg00020.html and I'm wondering whether there is a solution yet.
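
For a plain-text source, one possible workaround is to bypass Solr Cell entirely and stream the file in bounded chunks from the client, indexing each chunk as its own document via SolrJ. The following is only a sketch: the "id" and "content" field names, the core URL, and the chunk size are assumptions to adapt to your schema, and the builder API shown is from SolrJ 6+ (older releases use CommonsHttpSolrServer instead).
*****************************************************************
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Streams a large plain-text file into Solr in bounded chunks, one
// document per chunk, so neither client nor server ever holds the
// whole file as a single string.
public class ChunkedTextIndexer {
    private static final int CHUNK_CHARS = 1 << 20; // ~1M chars per chunk; tune as needed

    public static void main(String[] args) throws Exception {
        String path = args[0];
        // Hypothetical core URL; adjust to your installation.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            StringBuilder chunk = new StringBuilder();
            int n = 0;
            String line;
            while ((line = in.readLine()) != null) {
                chunk.append(line).append('\n');
                if (chunk.length() >= CHUNK_CHARS) {
                    addChunk(solr, path, n++, chunk.toString());
                    chunk.setLength(0); // reset the buffer so heap use stays bounded
                }
            }
            if (chunk.length() > 0) {
                addChunk(solr, path, n, chunk.toString());
            }
        }
        solr.commit();
        solr.close();
    }

    private static void addChunk(SolrClient solr, String path, int n, String text) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path + "#" + n); // unique id per chunk
        doc.addField("content", text);
        solr.add(doc);
    }
}
*****************************************************************
Note that splitting one file into several documents changes search semantics (hits point at chunks rather than the whole file), so whether this is acceptable depends on the application.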

Carina

Re: Indexing large files using Solr Cell causes OutOfMemory error

Posted by Gora Mohanty <go...@srijan.in>.
On Thu, 12 Aug 2010 14:32:19 +0200
Lannig Carina <La...@ssi-schaefer-noell.com> wrote:

> Hi,
> 
> I'm trying to index a text file (~150MB) using Solr Cell/Tika.
> The curl command aborts due to a java.lang.OutOfMemoryError.
[...]
> AFAIK Tika keeps the whole file in RAM and posts it to Solr as one
> single string. I'm using the JVM arg -Xmx1024M and Solr's default
> config with
[...]

I do not know about Tika, but what is the size of your Solr index,
and how many documents are in it? Solr does need RAM, and while we
did not run real benchmarks, even with a few tens of thousands of
documents, performance seemed to improve once we allocated 2GB of
RAM. Besides, unless you are on a very tight budget, throwing a few
more GB of RAM at the problem seems an easy, and not very expensive,
way out.
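
If you do raise the heap, keep in mind that Tomcat usually takes its
memory settings from CATALINA_OPTS/JAVA_OPTS rather than from a flag
on a plain java command line, so it is worth confirming what the
container's JVM actually got. A minimal, hypothetical check (run it
standalone with the same flags, or adapt it into a JSP inside Tomcat):
*****************************************************************
// Prints the maximum heap this JVM will use, to confirm that -Xmx
// (or CATALINA_OPTS, for Tomcat) actually took effect.
public class MaxHeap {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
*****************************************************************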

Regards,
Gora

Re: Indexing large files using Solr Cell causes OutOfMemory error

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Indexing large files using Solr Cell causes OutOfMemory error
: References: <AA...@mail.gmail.com>
: In-Reply-To: <AA...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss