Posted to solr-user@lucene.apache.org by Geeta Subramanian <gs...@commvault.com> on 2011/03/17 19:58:55 UTC

OOM for large files

Hi,



I am getting an OOM after posting a 100 MB document to Solr. The stack trace is:

Exception in thread "main" org.apache.solr.common.SolrException: Java heap space  java.lang.OutOfMemoryError: Java heap space
                at java.util.Arrays.copyOf(Unknown Source)
                at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
                at java.lang.AbstractStringBuilder.append(Unknown Source)
                at java.lang.StringBuilder.append(Unknown Source)
                at org.apache.solr.handler.extraction.SolrContentHandler.characters(SolrContentHandler.java:257)
                at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:124)
                at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:153)
                at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:124)
                at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:124)
                at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
                at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
                at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:113)
                at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
                at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:175)
                at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:144)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:142)
                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
                at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:193)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
                at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
                at org.apache.solr.se

I have given the JVM 1024M of heap, but it still fails. Can somebody tell me the minimum heap size required relative to file size, so that the document gets indexed successfully?
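As a rough sketch of why 1024M may not be enough: the trace shows the extracted text being accumulated in a StringBuilder, and (assuming the usual expandCapacity policy of roughly doubling, new capacity = 2*old + 2) each grow copies the old buffer while both old and new char[] are live. A quick simulation of that growth for ~100M chars:

```java
// Sketch (illustrative numbers, assuming the 2*old+2 growth policy):
// estimate char[] churn when a StringBuilder grows from its default
// 16-char buffer to hold ~100M chars of extracted text.
public class HeapEstimate {
    public static void main(String[] args) {
        long target = 100L * 1024 * 1024;  // ~100M chars of extracted text
        long capacity = 16;                // StringBuilder default capacity
        long copied = 0;                   // chars moved by Arrays.copyOf
        int resizes = 0;
        while (capacity < target) {
            copied += capacity;            // each grow copies the old buffer
            capacity = capacity * 2 + 2;   // assumed expandCapacity policy
            resizes++;
        }
        // During the final copy the old buffer (~capacity/2) and the new
        // buffer are both live, at 2 bytes per char (UTF-16).
        long peakBytes = (capacity + capacity / 2) * 2;
        System.out.println("resizes=" + resizes
                + " finalCapacityChars=" + capacity
                + " approxPeakBytes=" + peakBytes);
    }
}
```

Under these assumptions the builder ends up around 150M chars of capacity with a transient peak of well over 400 MB in char data alone, before counting the raw content stream, Tika's own buffers, and the index writer — so 1024M being insufficient for a 100 MB text file is plausible. These numbers are only a model of the growth policy, not a measurement.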



Also, just a side question:

In Tika's code there is a place where a char[] is initialized to 4096. When this is used in a StringWriter and the array fills up, expandCapacity (highlighted in the trace above) performs an array copy. So with just a 4 KB buffer, processing a 100 MB document generates a lot of intermediate char arrays, and we have to depend on GC to clean them up.

If I change the Tika code to initialize the char array to more than ~4 KB, will there be any performance improvement?
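A small accounting sketch (a hypothetical helper that just models the doubling policy, not Tika's actual code) suggests the win from a bigger starting buffer would be modest: with doubling growth, the total chars copied across all resizes is only about 1.4x the final size, i.e. amortized O(n), so enlarging the 4 KB buffer mainly reduces per-call overhead. Pre-sizing the accumulating builder in one shot (if the content length were known up front) is what would eliminate the copies entirely:

```java
// Sketch: compare total copy traffic for default growth vs. pre-sizing,
// assuming the 2*old+2 growth policy. Hypothetical helper for illustration.
public class PresizeDemo {
    // Total chars copied by resizes while growing from `initial` to >= `target`.
    static long copiedChars(long initial, long target) {
        long cap = initial, copied = 0;
        while (cap < target) {
            copied += cap;        // each resize copies the old buffer once
            cap = cap * 2 + 2;
        }
        return copied;
    }

    public static void main(String[] args) {
        long n = 100L * 1024 * 1024;  // ~100M chars
        System.out.println("default(16) copied=" + copiedChars(16, n));
        System.out.println("presized(n) copied=" + copiedChars(n, n)); // no resizes
    }
}
```

In other words, the intermediate garbage is bounded regardless of the 4 KB read buffer; the peak heap is dominated by the final accumulated char[] either way, which is why the OOM above persists even with GC keeping up.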



Thanks for your time,

Regards,

Geeta
