Posted to user@tika.apache.org by Tim Allison <ta...@apache.org> on 2018/08/07 12:36:00 UTC

Fwd: Memory Leak in 7.3 to 7.4

Thomas,
   Thank you for raising this on the Solr list. Please let us know if we
can help you help us figure out what’s going on...or if you’ve already
figured it out!
    Thank you!

    Best,
       Tim

---------- Forwarded message ---------
From: Thomas Scheffler <th...@uni-jena.de>
Date: Thu, Aug 2, 2018 at 6:06 AM
Subject: Memory Leak in 7.3 to 7.4
To: solr-user@lucene.apache.org <so...@lucene.apache.org>


Hi,

we noticed a memory leak in a rather small setup: about 40,000 metadata
documents, with nearly as many files carrying "literal.*" fields. While
7.2.1 brought some Tika issues (due to a beta version), the real problems
started with version 7.3.0 and are still unresolved in 7.4.0. Memory
consumption has gone through the roof: where a 512MB heap was previously
enough, 6GB is now not enough to index all the files.
I have been able to track this down to the libraries in
solr-7.4.0/contrib/extraction/lib/: if I replace them all with the
libraries shipped with 7.2.1, the problem disappears. As most of the files
are PDF documents, I tried updating PDFBox to 2.0.11 and Tika to 1.18, but
that did not solve the problem. I will next try downgrading just those two
libraries back to 2.0.6 and 1.16 to see whether they are the source of the
memory leak.
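
For anyone who wants to reproduce the pattern: each file goes through
Solr's extraction handler with "literal.*" parameters, which in SolrJ
terms looks roughly like the sketch below. The Solr URL, document id and
field names here are placeholders, not the real setup.

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractIndexSketch {
    public static void main(String[] args) throws Exception {
        // URL, collection, id and field values are placeholders
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/path/to/some.pdf"), "application/pdf");
            // the "literal.*" parameters become stored fields next to the extracted text
            req.setParam("literal.id", "doc-0001");
            req.setParam("literal.title", "Some title");
            solr.request(req);
            solr.commit();
        }
    }
}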

In the meantime I would like to know whether anybody else has experienced
the same problems.

kind regards,

Thomas

Re: Memory Leak in 7.3 to 7.4

Posted by David Pilato <da...@pilato.fr>.
My bad. The issue I mentioned was reported against Tika 1.16, so it is not
related to the current thread. Most likely a problem in my own code :)


On Aug 8, 2018 at 05:25 +0200, Robert Neal Clayton <ro...@gmail.com> wrote:
> I have a remarkably similar setup to David's: I’m running through about 50,000 PDF files with OCR tools at the moment. I have Tika 1.18 running standalone, and a shell script sends each PDF to it via curl, POSTing the file to the /meta URL to extract metadata before the OCR step.
>
> After 10 hours of uptime, Tika is using about 5.6 GB of memory. After restarting the Tika server, that appears to be about the same amount of memory it uses when it starts fresh.
>
> So whatever the issue is, it’s not in anything that falls under /meta; that part is working great for me.

Re: Memory Leak in 7.3 to 7.4

Posted by Robert Neal Clayton <ro...@gmail.com>.
I have a remarkably similar setup to David's: I’m running through about 50,000 PDF files with OCR tools at the moment. I have Tika 1.18 running standalone, and a shell script sends each PDF to it via curl, POSTing the file to the /meta URL to extract metadata before the OCR step.

After 10 hours of uptime, Tika is using about 5.6 GB of memory. After restarting the Tika server, that appears to be about the same amount of memory it uses when it starts fresh.

So whatever the issue is, it’s not in anything that falls under /meta; that part is working great for me.
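
If it helps to reproduce this outside the shell script: each call is just the raw file body sent to tika-server's /meta endpoint. A self-contained sketch would look roughly like the following, assuming tika-server's default port 9998, a JSON Accept header, and java.net.http from a current JDK; the file path is whatever you pass in.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TikaMetaSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/meta"))   // default tika-server port assumed
                .header("Accept", "application/json")            // ask for the metadata as JSON
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());                     // one JSON object of metadata per file
    }
}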

> On Aug 7, 2018, at 9:57 AM, Tim Allison <ta...@apache.org> wrote:
> 
> Thank you, David!  It would be helpful to know if downgrading to 1.16
> solves the problems with .txt files, as it does (apparently) with
> pdfs.
> On Tue, Aug 7, 2018 at 9:10 AM David Pilato <da...@pilato.fr> wrote:
>> 
>> That's interesting. Someone ran some tests on a project I'm working on and also reported a lot of memory usage (even with only .txt files).
>> I have not dug into the issue yet, so I don't know whether it is related, but I thought I'd share it here: https://github.com/dadoonet/fscrawler/issues/566


Re: Fwd: Memory Leak in 7.3 to 7.4

Posted by Tim Allison <ta...@apache.org>.
Thank you, David!  It would be helpful to know if downgrading to 1.16
solves the problems with .txt files, as it does (apparently) with
pdfs.
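If it helps to take Solr out of the picture entirely, a bare-bones loop
over a directory of sample files with the Tika facade, run once against
1.18 and once against 1.16, should show whether the heap growth happens
inside Tika itself. Something along these lines (the directory path is
whatever you point it at):

import java.io.File;

import org.apache.tika.Tika;

public class TikaLeakCheck {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File[] files = new File(args[0]).listFiles();   // directory of sample .txt/.pdf files
        if (files == null) {
            return;
        }
        for (int pass = 0; pass < 100; pass++) {        // repeat so slow growth becomes visible
            for (File f : files) {
                if (f.isFile()) {
                    tika.parseToString(f);              // output discarded; only heap usage matters
                }
            }
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("pass " + pass + ": used heap " + usedMb + " MB");
        }
    }
}

Watching the printed numbers across passes (or attaching a profiler)
should make a genuine leak obvious.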
On Tue, Aug 7, 2018 at 9:10 AM David Pilato <da...@pilato.fr> wrote:
>
> That's interesting. Someone ran some tests on a project I'm working on and also reported a lot of memory usage (even with only .txt files).
> I have not dug into the issue yet, so I don't know whether it is related, but I thought I'd share it here: https://github.com/dadoonet/fscrawler/issues/566

Re: Fwd: Memory Leak in 7.3 to 7.4

Posted by David Pilato <da...@pilato.fr>.
That's interesting. Someone ran some tests on a project I'm working on and also reported a lot of memory usage (even with only .txt files).
I have not dug into the issue yet, so I don't know whether it is related, but I thought I'd share it here: https://github.com/dadoonet/fscrawler/issues/566


On Aug 7, 2018 at 14:36 +0200, Tim Allison <ta...@apache.org> wrote:
> Thomas,
>    Thank you for raising this on the Solr list. Please let us know if we can help you help us figure out what’s going on...or if you’ve already figured it out!
>     Thank you!
>
>     Best,
>        Tim