Posted to user@uima.apache.org by Ar...@bka.bund.de on 2016/08/08 06:10:22 UTC

CPE memory usage

Hi!

I'm using uimaFIT 2.2.0 and uimaj 2.8.1. The collection processing engine is slowly eating up all memory until it gets killed by the system. This happens even when I'm running just a collection reader and no other components (no analysis at all). Has anyone experienced similar behavior, or does anyone have any ideas?

Best,
Armin

Re: CPE memory usage

Posted by Ar...@bka.bund.de.
Hi Jens,

Nice tips. I will try the one with the filters first; I just need to make a few changes.

Thank you,
Armin

Re: CPE memory usage

Posted by Jens Grivolla <j+...@grivolla.net>.
Hi Armin, glad I could help. Getting all IDs first also avoids problems with changing data, which could mess with the offsets. This way you have a fixed snapshot of all documents that existed at the beginning.

Best,
Jens

Re: CPE memory usage

Posted by Ar...@bka.bund.de.
Hi Jens,

I just want to confirm your information. As you said, the query gets slower the larger start is, even when using filters. The best solution is to get all ids first (which may take some time) and then to fetch each document by id successively. There is a request handler (get) and a Java API method (HttpSolrClient.getById()) for this, roughly as sketched below.
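
For the record, the pattern looks roughly like this in SolrJ (a sketch only, assuming SolrJ 5.x/6.x; the core URL, the filter query, and the field names are invented):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class IdFirstFetch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs");

        // 1) One (possibly slow) query that collects all matching ids, unranked.
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("type:report");                // hypothetical filter
        q.setFields("id");
        q.setSort(SolrQuery.SortClause.asc("_docid_")); // index order, no ranking
        q.setRows(Integer.MAX_VALUE);                   // assumes the id list fits in memory
        List<String> ids = new ArrayList<>();
        for (SolrDocument d : solr.query(q).getResults()) {
            ids.add((String) d.getFieldValue("id"));
        }

        // 2) One cheap lookup per document via the real-time get handler.
        for (String id : ids) {
            SolrDocument doc = solr.getById(id);
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}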

Thanks to your help, my queries are now consistently fast.

Cheers,
Armin

Re: CPE memory usage

Posted by Jens Grivolla <j+...@grivolla.net>.
Solr is known not to be very good at deep paging; it is geared towards returning the
top relevant results. Running a query that asks for the millionth document is
pretty much the worst thing you can do, as Solr has to rank all documents
again, up to the millionth, and return just that one. It can also be unreliable
if your document collection changes.

We did get it to work quite well, though. I believe we used only filters
and retrieved the results in natural order, so that Solr wouldn't have to
rank the documents. We also had a version where we first retrieved all
matching document ids in one go, and then queried for the documents by id,
one by one, in getNext().

Deep paging has also seen some major improvements over time IIRC, so newer
Solr versions should perform much better than the ones from a few years ago.
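
In newer versions (4.7 and later, if I remember correctly) the supported way to walk an entire result set is cursorMark. A quick sketch, with an invented core URL and page size:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(500);                            // page size, arbitrary
        q.setSort(SolrQuery.SortClause.asc("id")); // cursors need a sort on the uniqueKey
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument d : rsp.getResults()) {
                System.out.println(d.getFieldValue("id"));
            }
            String next = rsp.getNextCursorMark();
            if (next.equals(cursor)) {
                break; // cursor no longer advances: all pages consumed
            }
            cursor = next;
        }
        solr.close();
    }
}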

Best,
Jens

Re: CPE memory usage

Posted by Ar...@bka.bund.de.
Hi!

Finally, it looks like Solr is causing the high memory consumption. The SolrClient isn't meant to be used the way I used it, but that isn't documented either. The Solr documentation is very poor. I just happened to find a solution on the web by accident.

Thanks,
Armin

Re: CPE memory usage

Posted by Richard Eckart de Castilho <re...@apache.org>.
Do you have code for a minimal test case?

Cheers,

-- Richard

Re: CPE memory usage

Posted by Ar...@bka.bund.de.
Hi Richard!

I've changed the document reader to a kind of no-op reader that always sets the document text to an empty string: same behavior, but a much slower increase in memory usage. A sketch of the no-op reader follows below.
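
For reference, the no-op reader boils down to something like this (a sketch using uimaFIT's CasCollectionReader_ImplBase; the document count is arbitrary):

import org.apache.uima.cas.CAS;
import org.apache.uima.fit.component.CasCollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class NoOpReader extends CasCollectionReader_ImplBase {
    private static final int DOCS = 1_000_000; // arbitrary document count
    private int produced = 0;

    @Override
    public boolean hasNext() {
        return produced < DOCS;
    }

    @Override
    public void getNext(CAS cas) {
        cas.setDocumentText(""); // always an empty document
        produced++;
    }

    @Override
    public Progress[] getProgress() {
        return new Progress[] { new ProgressImpl(produced, DOCS, Progress.ENTITIES) };
    }
}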

Cheers,
Armin

Re: CPE memory usage

Posted by Richard Eckart de Castilho <re...@apache.org>.
I am not aware of any resource leaks in uimaFIT or UIMA-J either.
Maybe Solr doesn't handle resources in the way you expect? E.g.
queries or documents may have to be closed/returned when they
are no longer needed. You could generate a heap dump using
JVisualVM to figure out what kind of objects are accumulating.
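
If you'd rather trigger the dump from inside the process, HotSpot JVMs expose a diagnostic MBean for it (a sketch; the output file name is arbitrary):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        // Ask the HotSpot diagnostic MBean to write a heap dump.
        HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        hotspot.dumpHeap("cpe-heap.hprof", true); // true = live objects only
    }
}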

Cheers,

-- Richard

Re: CPE memory usage

Posted by Ar...@bka.bund.de.
Hello Richard!

No, I can't change the reader. It's reading from Solr. The response documents are put in a queue, and the querying logic is done in hasNext(). hasNext() returns true if the queue is not empty. If the queue is empty, hasNext() sends a request to Solr and puts the response documents into the empty queue. If there are no more response documents from Solr, the queue remains empty and the reader is done. getNext() pulls a document from the queue and sets the CAS's document text. The number of documents returned from Solr per request can be given as a parameter; currently it's set to 1, that is, one document per request. The only fields of the reader class are the parameters and the document queue; all other variables are local to their methods. It's pretty simple (see the sketch below). There shouldn't be any resource leaks.
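
In outline, the reader looks something like this (a sketch, not the actual code; the core URL, the "text" field, and the parameter values are invented, and in the real reader they are @ConfigurationParameter fields). It uses uimaFIT's CasCollectionReader_ImplBase and SolrJ's start/rows paging:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.component.CasCollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class SolrCollectionReader extends CasCollectionReader_ImplBase {
    private final HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs");
    private final int rows = 1;  // documents per Solr request
    private int start = 0;       // paging offset
    private final Deque<SolrDocument> queue = new ArrayDeque<>();

    @Override
    public boolean hasNext() throws IOException, CollectionException {
        if (queue.isEmpty()) {
            try {
                // Refill the queue with the next page; an empty response means we are done.
                SolrQuery q = new SolrQuery("*:*").setStart(start).setRows(rows);
                queue.addAll(solr.query(q).getResults());
                start += rows;
            } catch (SolrServerException e) {
                throw new CollectionException(e);
            }
        }
        return !queue.isEmpty();
    }

    @Override
    public void getNext(CAS cas) throws IOException, CollectionException {
        SolrDocument doc = queue.poll();
        cas.setDocumentText(String.valueOf(doc.getFieldValue("text"))); // "text" is assumed
    }

    @Override
    public Progress[] getProgress() {
        return new Progress[] { new ProgressImpl(start, 0, Progress.ENTITIES) };
    }
}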

Best,
Armin

Re: CPE memory usage

Posted by Richard Eckart de Castilho <re...@apache.org>.
Did you try using a different reader?

Cheers,

-- Richard
