Posted to users@jackrabbit.apache.org by "Cech. Ulrich" <Ul...@aeb.de> on 2011/03/16 11:14:02 UTC

Indexer behaves differently while starting the repository?

Hello,

I have a somewhat "funny" problem with the Jackrabbit indexer mechanism. We store text files with sizes between 2 and 170 MB. Xmx is set to 850 MB, and the indexer works without problems when storing the stream in the Jackrabbit data store (FileDataStore). Up to this point everything is ok.
But if I delete the workspace index directory so that Jackrabbit rebuilds it on the next start, the indexer starts, processes some files and then fails with a java.lang.OutOfMemoryError: Java heap space.

Can someone tell me where the difference is between "indexing while storing" and "(re)indexing while starting up the repository"?

Thank you very much for any hint,
Best regards,
Ulrich

I have appended the stack trace here:

2011-03-16 11:04:36,916 WARN : [LazyTextExtractorField] Failed to extract text from a binary property
java.lang.OutOfMemoryError: Java heap space
            at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
            at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:518)
            at java.lang.StringBuilder.append(StringBuilder.java:190)
            at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.characters(LazyTextExtractorField.java:191)
            at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
            at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:153)
            at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
            at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
            at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
            at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
            at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
            at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
            at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:113)
            at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
            at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
            at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
            at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
            at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
            at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
            at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
            at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:192)
            at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
            at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
            at java.util.concurrent.FutureTask.run(FutureTask.java:123)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
            at java.lang.Thread.run(Thread.java:595)

Re: Indexer behaves differently while starting the repository?

Posted by "Cech. Ulrich" <Ul...@aeb.de>.
I think I found something out:

The reason the indexer works while "storing" is that there is actually only one thread storing the data.
When the repository starts, the indexer spawns multiple "ParsingTasks". I suspect that if many sessions stored data concurrently, there would also be many indexer threads, so the same error would come up then as well.
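The effect of several concurrent ParsingTasks can be sketched with a plain executor. This is a minimal illustration, not Jackrabbit's actual extractor code: each simulated task buffers its whole document, so the aggregate footprint scales with pool size times document size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtractDemo {

    // Each simulated "ParsingTask" buffers its whole document in memory,
    // so the aggregate footprint is poolSize * charsPerDoc characters.
    static int totalBufferedChars(int poolSize, int charsPerDoc) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < poolSize; i++) {
                results.add(pool.submit(() -> {
                    StringBuilder sb = new StringBuilder();
                    for (int c = 0; c < charsPerDoc; c++) {
                        sb.append('x'); // stands in for extracted text
                    }
                    return sb.length();
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With one storing session there is one buffer alive at a time; at startup re-indexing, several of these buffers are alive simultaneously.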

And the implementation of AbstractStringBuilder.expandCapacity() is quite dangerous: if the char[] already holds 500 MB and we need 510 MB, expandCapacity() is called and the new char[] is allocated with roughly 500 MB * 2, which triggers the OutOfMemoryError (or multiple ParsingTasks each expand from 100 to 200 MB at the same time).
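The arithmetic above can be sketched in a few lines. This only illustrates the doubling strategy, not the exact JDK growth formula (which differs between versions); it estimates the transient heap needed while the old and the new backing array are both reachable:

```java
public class GrowthDemo {

    // Simulates a StringBuilder-style doubling strategy: during each
    // expansion the old char[] and the new (twice as large) char[] are
    // both reachable, so the transient footprint is old + new array.
    static long peakBytes(long neededChars) {
        long cap = 16;   // typical initial capacity
        long peak = 0;
        while (cap < neededChars) {
            long next = cap * 2;                      // doubling on expand
            peak = Math.max(peak, (cap + next) * 2);  // 2 bytes per char
            cap = next;
        }
        return peak;
    }

    public static void main(String[] args) {
        // 250 million chars is roughly 500 MB of character data
        System.out.println(peakBytes(250_000_000L) / (1024 * 1024) + " MB");
        // prints 768 MB
    }
}
```

So under this model a single extraction of ~500 MB of text already peaks near 768 MB during the final expansion; with several ParsingTasks running concurrently, an 850 MB heap is exhausted much earlier.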

Can someone tell me how you index very large files? Or is it generally a bad idea to fulltext-index files this big?
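One possible mitigation (an assumption on my side, not something this Jackrabbit version does for you out of the box) is to cap how much extracted text is buffered per file, similar in spirit to Tika's WriteOutContentHandler with a write limit. A minimal self-contained sketch of such a capped buffer:

```java
import java.io.Writer;

/** A Writer that keeps at most `limit` characters and silently discards the rest. */
public class CappedWriter extends Writer {

    private final StringBuilder buffer = new StringBuilder();
    private final int limit;

    public CappedWriter(int limit) {
        this.limit = limit;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        int room = limit - buffer.length();
        if (room > 0) {
            // never grow the buffer past the cap
            buffer.append(cbuf, off, Math.min(len, room));
        }
    }

    /** True once input beyond the cap has been seen (or the cap is exactly filled). */
    public boolean limitReached() {
        return buffer.length() >= limit;
    }

    public String text() {
        return buffer.toString();
    }

    @Override public void flush() { }
    @Override public void close() { }
}
```

Indexing only the first few million characters of each file would bound the per-task heap while still preserving most of the search value.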

Thanks in advance,
Ulrich