Posted to solr-user@lucene.apache.org by P Williams <wi...@gmail.com> on 2011/11/03 17:48:26 UTC

Re: Stream still in memory after tika exception? Possible memoryleak?

Hi All,

I'm experiencing a similar problem to the others in this thread.

I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
apache-solr-4.0-2011-10-14_08-56-59.war and then to
apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 PDFs of various
sizes, using the TikaEntityProcessor.  My indexing ran to completion and
was completely successful under the June build.  The only problem was the
readability of the full text in highlighting, which was fixed in Tika 0.10
(TIKA-611).  I chose the October 14 build of Solr because Tika 0.10 had
recently been included (SOLR-2372).
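
For reference, a stripped-down sketch of the kind of entity definition I'm
using (the data source, URLs, and field names below are placeholders, and
the outer entity that supplies the document URLs is left out):

<dataConfig>
  <dataSource type="BinURLDataSource" name="bin"/>
  <document>
    <!-- outer entity that yields ${doc.url} omitted -->
    <entity name="tika" processor="TikaEntityProcessor"
            dataSource="bin" url="${doc.url}" format="text">
      <field column="text" name="fulltext"/>
    </entity>
  </document>
</dataConfig>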

On the same machine, without changing any memory settings, my initial
problem is a PermGen error.  Fine, I increase the PermGen space.
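(Concretely, that means raising -XX:MaxPermSize on the JVM running Solr;
the exact value isn't the point.)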

I've set the "onError" parameter to "skip" for the TikaEntityProcessor.
Now I get several (6) pairs of:

SEVERE: Exception thrown while getting data
java.net.SocketTimeoutException: Read timed out
SEVERE: Exception in entity :
tika:org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception in invoking url <url removed> # 2975

And after ~3881 documents, with auto commit set unreasonably frequently, I
consistently get an OutOfMemoryError:

SEVERE: Exception while processing: f document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemoryError: Java heap space

The stack trace points to
org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
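
(By "unreasonably frequently" I mean an autoCommit block in solrconfig.xml
roughly like the following; the exact numbers are only illustrative:)

<autoCommit>
  <maxDocs>100</maxDocs>
  <maxTime>15000</maxTime>
</autoCommit>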

The October 30 build performs identically.

The funny thing is that monitoring via JConsole doesn't reveal any memory
issues.

Because the OutOfMemoryError did not occur under the June build, I believe
a bug has been introduced to the code since then.  Should I open an issue
in JIRA?

Thanks,
Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs <ja...@gmail.com> wrote:

> Hi Erick,
>
> I am using Solr 3.3.0, but I see the same problems with 1.4.1.
> The connector is a homemade program in C# and is posting via HTTP remote
> streaming (i.e.
> http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1).
> I'm using Tika to extract the content (it comes with Solr Cell).
>
> A possible problem is that the file stream needs to be closed by the client
> application after extracting, but it seems that something goes wrong when a
> Tika exception is thrown: the stream never leaves memory. At least, that is
> my assumption.
>
> What is the common way to extract content from office files (pdf, doc, rtf,
> xls, etc.) and index them? Do you write a content extractor/validator
> yourself? Or is it possible to do this with Solr Cell without huge memory
> consumption? Please let me know. Thanks in advance.
>
> Marc
>
> 2011/8/30 Erick Erickson <er...@gmail.com>
>
> > What version of Solr are you using, and how are you indexing?
> > DIH? SolrJ?
> >
> > I'm guessing you're using Tika, but how?
> >
> > Best
> > Erick
> >
> > On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs <ja...@gmail.com> wrote:
> > > Hi all,
> > >
> > > Currently I'm testing Solr's indexing performance, but unfortunately I'm
> > > running into memory problems.
> > > It looks like Solr is not closing the file stream after an exception,
> > > but I'm not really sure.
> > >
> > > The current system I'm using has 150GB of memory, and while I'm indexing
> > > the memory consumption grows and grows (eventually to more than 50GB).
> > > In the attached graph I indexed about 70k office documents (pdf, doc,
> > > xls, etc.) and between 1 and 2 percent throw an exception.
> > > The commits happen after 64MB, after 60 seconds, or after a job (there
> > > are 6 evenly divided jobs).
> > >
> > > After indexing, the memory consumption doesn't drop. Even after an
> > > optimize command it's still there.
> > > What am I doing wrong? I can't imagine I'm the only one with this
> > > problem.
> > > Thanks in advance!
> > >
> > > Kind regards,
> > >
> > > Marc
> > >
> >
>

Re: Stream still in memory after tika exception? Possible memoryleak?

Posted by Lance Norskog <go...@gmail.com>.
Yes, please open a JIRA for this, with as much info as possible.

Lance



-- 
Lance Norskog
goksron@gmail.com