Posted to dev@lucene.apache.org by S G <sg...@gmail.com> on 2016/12/02 19:01:41 UTC

Memory leak in Solr

Hi,

This post shows some stats on Solr which indicate that there might be a
memory leak in there.

http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr

Can someone please help to debug this?
It might be a very good step in making Solr stable if we can fix this.

Thanks
SG

Re: Memory leak in Solr

Posted by Scott Blum <dr...@gmail.com>.

Are you sure it's an actual leak, not just memory pinned by caches?

Related: https://issues.apache.org/jira/browse/SOLR-9810

Re: Memory leak in Solr

Posted by Walter Underwood <wu...@wunderwood.org>.

That is a huge heap.

Once you have enough heap memory to hold a Java program’s working set,
more memory doesn’t make it faster. It just makes the GC take longer.

If you have GC monitoring, look at how much memory is in use after a full GC.
Add the space for new generation (eden, whatever), then a bit more for 
burst memory usage. Set the heap to that.
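
To make that arithmetic concrete, here is a minimal sketch of the sizing rule.
The function name, the 20% burst margin, and the sample figures are
illustrative assumptions, not measurements from this thread:

```python
def suggested_heap_mb(live_after_full_gc_mb, new_gen_mb, burst_fraction=0.2):
    """Estimate a heap size in MB.

    Adds the live set left after a full GC, the new generation size, and
    some headroom for bursts. burst_fraction is a hypothetical margin
    applied to the live set; pick your own from observed spikes.
    """
    if live_after_full_gc_mb < 0 or new_gen_mb < 0 or burst_fraction < 0:
        raise ValueError("inputs must be non-negative")
    burst_mb = live_after_full_gc_mb * burst_fraction
    return int(round(live_after_full_gc_mb + new_gen_mb + burst_mb))


# Hypothetical inputs: 5000 MB live after a full GC, 2048 MB new generation.
print(suggested_heap_mb(5000, 2048))
```

With these made-up inputs the estimate is 8048 MB, which is roughly an 8G heap.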

I recommend a fairly large new generation. An HTTP service
has a fair amount of allocation that has a lifetime of one HTTP request. Those
allocations should never be promoted to tenured space.

We run with an 8G heap and a 2G new generation with 4.10.4.

Of course, make sure you are running some sort of parallel GC. You can use
G1 or use CMS with ParNew, your choice. We are running CMS/ParNew, but
will be experimenting with G1 soon.
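
For illustration only, a small helper that assembles JVM options for either
choice. It is written with Java 7/8 HotSpot in mind (newer JDKs have dropped
CMS). Pinning -Xms to -Xmx and the helper itself are assumptions of this
sketch, not our actual startup script:

```python
def jvm_gc_flags(heap_gb, new_gen_gb=None, collector="cms"):
    """Build a list of HotSpot options for a fixed-size heap.

    collector is "cms" (CMS with ParNew) or "g1".
    """
    # Assumption: start the heap at its maximum size so it never resizes.
    flags = [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]
    if collector == "cms":
        if new_gen_gb is not None:
            flags.append(f"-Xmn{new_gen_gb}g")
        flags += ["-XX:+UseConcMarkSweepGC", "-XX:+UseParNewGC"]
    elif collector == "g1":
        # G1 normally sizes the young generation itself, so this sketch
        # deliberately leaves -Xmn out.
        flags.append("-XX:+UseG1GC")
    else:
        raise ValueError(f"unknown collector: {collector}")
    return flags


# The 8G heap / 2G new generation setup mentioned above.
print(" ".join(jvm_gc_flags(8, 2)))
```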

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 4, 2016, at 11:07 AM, S G <sg...@gmail.com> wrote:
> 
> Thank you Erick.
> Our Solr version is 4.10 and we are not doing any sorting or faceting.
> 
> I am trying to find ways to investigate this problem, so I am asking a few
> more questions about the steps normally taken in such situations.
> (I searched for some of them on the Internet but could not find anything
> good.)
> Any pointers provided here will help us resolve this more quickly.
> 
> 
> 1) Is there a conclusive way to detect memory leaks?
>  How does Solr ensure with each release that there are no memory leaks?
>  With a 24 GB heap (-Xmx parameter), I now sometimes see GC pauses of about
>  1 second.
>  Looks like we will need to scale it down.
>  Total VM memory is 92 GB and Solr is the only process running on it.
> 
> 
> 2) How can I tell whether the ZooKeeper connectivity to Solr is poor?
>  What commands/steps are normally used to resolve this?
>  Does Solr have any metrics that expose ZooKeeper interaction statistics?
> 
> 
> 3) In a span of 9 hours, I see:
>  4 times: java.net.SocketException: Connection reset
>  32 times: java.net.SocketTimeoutException: Read timed out
> 
> And several other exceptions that ultimately bring a whole shard down
> (leader is recovery-failed and replica is down).
> 
> I understand that the above information might not be sufficient to get the
> full picture.
> But in case someone has resolved or debugged these issues before,
> please share your experience.
> It would be of great help to me.
> 
> Thanks,
> SG
> 
> 
> 
> 
> 
> On Sun, Dec 4, 2016 at 8:59 AM, Erick Erickson <er...@gmail.com>
> wrote:
> 
>> All of this is consistent with not having a properly
>> tuned Solr instance wrt # documents, usage
>> pattern, memory allocated to the JVM, GC
>> settings and the like.
>> 
>> Your leader issues can be explained by long
>> GC pauses too. Zookeeper periodically pings
>> each replica it knows about and if the response
>> times out (due to GC in this case) then Zookeeper
>> thinks the node has gone away and marks
>> it as "down". Similarly when a leader forwards
>> an update to a follower and the request times
>> out, the leader will mark the follower as down.
>> Do this enough and the state of the cluster gets
>> "interesting".
>> 
>> You still haven't told us what version of Solr
>> you're using, the "Version" you took from
>> the core stats is the version of the _index_,
>> not Solr.
>> 
>> You have almost 200M documents on
>> a single core. That's definitely on the high side,
>> although I've seen that work. Assuming
>> you aren't doing things like faceting and
>> sorting and the like on non docValues fields.
>> 
>> As others have pointed out, the link you
>> provided doesn't provide much in the way of
>> any "smoking guns" as far as a memory
>> leak is concerned.
>> 
>> I've certainly seen situations where memory
>> required by Solr is close to the total memory
>> allocated to the JVM for instance. Then the GC
>> cycle kicks in and recovers just enough to
>> go on for a very brief time before going into another
>> GC cycle resulting in very poor performance.
>> 
>> So overall this looks like you need to do some
>> serious tuning of your Solr instances, take a
>> hard look at how you're using your physical
>> machines. You specify that these are VMs,
>> but how many VMs are you running per box?
>> How much JVM have you allocated for each?
>> How much total physical memory do you have
>> to work with per box?
>> 
>> Even if you provide the answers to the above
>> questions, there's not much we can do to
>> help you resolve your issues assuming it's
>> simply inappropriate sizing. I'd really recommend
>> you create a stress environment so you can
>> test different scenarios to become confident about
>> your expected performance, here's a blog on the
>> subject:
>> 
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> 
>> Best,
>> Erick
>> 
>> On Sat, Dec 3, 2016 at 8:46 PM, S G <sg...@gmail.com> wrote:
>>> The symptom we see is that the java clients querying Solr see response
>>> times in 10s of seconds (not milliseconds).
>>> And on the tomcat's gc.log file (where Solr is running), we see very bad GC
>>> pauses - threads being paused for 0.5 seconds per second approximately.
>>> 
>>> Some numbers for the Solr Cloud:
>>> 
>>> *Overall infrastructure:*
>>> - Only one collection
>>> - 16 VMs used
>>> - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
>>> 
>>> *Overview from one core:*
>>> - Num Docs:193,623,388
>>> - Max Doc:230,577,696
>>> - Heap Memory Usage:231,217,880
>>> - Deleted Docs:36,954,308
>>> - Version:2,357,757
>>> - Segment Count:37
>>> 
>>> *Stats from QueryHandler/select*
>>> - requests:78,557
>>> - errors:358
>>> - timeouts:0
>>> - totalTime:1,639,975.27
>>> - avgRequestsPerSecond:2.62
>>> - 5minRateReqsPerSecond:1.39
>>> - 15minRateReqsPerSecond:1.64
>>> - avgTimePerRequest:20.87
>>> - medianRequestTime:0.70
>>> - 75thPcRequestTime:1.11
>>> - 95thPcRequestTime:191.76
>>> 
>>> *Stats from QueryHandler/update*
>>> - requests:33,555
>>> - errors:0
>>> - timeouts:0
>>> - totalTime:227,870.58
>>> - avgRequestsPerSecond:1.12
>>> - 5minRateReqsPerSecond:1.16
>>> - 15minRateReqsPerSecond:1.23
>>> - avgTimePerRequest:6.79
>>> - medianRequestTime:3.16
>>> - 75thPcRequestTime:5.27
>>> - 95thPcRequestTime:9.33
>>> 
>>> And yet the Solr clients are reporting timeouts and very long read times.
>>> 
>>> Plus, on every server, we are seeing lots of exceptions.
>>> For example:
>>> 
>>> Between 8:06:55 PM and 8:21:36 PM, exceptions are:
>>> 
>>> 1) Request says it is coming from leader, but we are the leader:
>>> update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
>>> 
>>> 2) org.apache.solr.common.SolrException: Request says it is coming from
>>> leader, but we are the leader
>>> 
>>> 3) org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>>
>>> 4) null:org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>>
>>> 5) org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>>
>>> 6) null:org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>> 
>>> 7) org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
>>> available to handle this request. Zombie server list:
>>> [HOSTA_ca_1_1456429897]
>>> 
>>> 8) null:org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
>>> available to handle this request. Zombie server list:
>>> [HOSTA_ca_1_1456429897]
>>> 
>>> 9) org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>>
>>> 10) null:org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>>
>>> 11) org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>>
>>> 12) null:org.apache.solr.common.SolrException:
>>> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>>> operation and it timed out, so failing fast
>>> 
>>> Why are we seeing so many timeouts then and why so huge response times on
>>> the client?
>>> 
>>> Thanks
>>> SG
>>> 
>>> 
>>> 
>>> On Sat, Dec 3, 2016 at 4:19 PM, <bi...@gmail.com> wrote:
>>> 
>>>> What tool is that? I would like to run those stats on my Solr instance.
>>>> 
>>>> Bill Bell
>>>> Sent from mobile
>>>> 
>>>> 
>>>>> On Dec 2, 2016, at 4:49 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>>>>> 
>>>>>> On 12/2/2016 12:01 PM, S G wrote:
>>>>>> This post shows some stats on Solr which indicate that there might be a
>>>>>> memory leak in there.
>>>>>>
>>>>>> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
>>>>>>
>>>>>> Can someone please help to debug this?
>>>>>> It might be a very good step in making Solr stable if we can fix this.
>>>>> 
>>>>> +1 to what Walter said.
>>>>> 
>>>>> I replied earlier on the stackoverflow question.
>>>>> 
>>>>> FYI -- your 95th percentile request time of about 16 milliseconds is NOT
>>>>> something that I would characterize as "very high."  I would *love* to
>>>>> have statistics that good.
>>>>>
>>>>> Even your 99th percentile request time is not much more than a full
>>>>> second.  If a search takes a couple of seconds, most users will not
>>>>> really care, and some might not even notice.  It's when a large
>>>>> percentage of queries start taking several seconds that complaints start
>>>>> coming in.  On your system, 99 percent of your queries are completing in
>>>>> 1.3 seconds or less, and 95 percent of them are less than 17
>>>>> milliseconds.  That sounds quite good to me.
>>>>>
>>>>> In my experience, the time it takes for the browser to receive the
>>>>> search result page and render it is a significant part of the total time
>>>>> to see results, and often dwarfs the time spent getting info from Solr.
>>>>> 
>>>>> Here's some numbers from Solr in my organization:
>>>>> 
>>>>> requests:               4102054
>>>>> errors:                 364894
>>>>> timeouts:               49
>>>>> totalTime:              799446287.45041
>>>>> avgRequestsPerSecond:   1.2375565828793849
>>>>> 5minRateReqsPerSecond:  0.8444329508327961
>>>>> 15minRateReqsPerSecond: 0.8631197328073346
>>>>> avgTimePerRequest:      194.88926460997587
>>>>> medianRequestTime:      20.8566605
>>>>> 75thPcRequestTime:      85.51328849999999
>>>>> 95thPcRequestTime:      2202.277466549999
>>>>> 99thPcRequestTime:      5280.375381280002
>>>>> 999thPcRequestTime:     6866.020122961001
>>>>> 
>>>>> The numbers above come from a distributed index that contains 167
>>>>> million documents and takes up about 200GB of disk space across two
>>>>> machines.
>>>>> 
>>>>> requests:               192683
>>>>> errors:                 124
>>>>> timeouts:               0
>>>>> totalTime:              199380421.985073
>>>>> avgRequestsPerSecond:   0.042222722771354554
>>>>> 5minRateReqsPerSecond:  0.00800545427600684
>>>>> 15minRateReqsPerSecond: 0.017521222412364163
>>>>> avgTimePerRequest:      1034.7587591280653
>>>>> medianRequestTime:      541.591858
>>>>> 75thPcRequestTime:      1683.83246125
>>>>> 95thPcRequestTime:      5644.542019949997
>>>>> 99thPcRequestTime:      9445.592394760004
>>>>> 999thPcRequestTime:     14602.166640771007
>>>>> 
>>>>> These numbers are from an index with about 394 million documents, taking
>>>>> up nearly 500GB of disk space.  This index is also distributed on
>>>>> multiple machines.
>>>>>
>>>>> Are you experiencing any problems other than what you perceive as slow
>>>>> queries?  I asked some other questions on stackoverflow.  In particular,
>>>>> I'd like to know the total memory on the server, the total number of
>>>>> documents (maxDoc and numDoc) you're handling with this server, as well
>>>>> as the total index size.  What do your queries look like?  What version
>>>>> and vendor of Java are you using?  Can you share your config/schema?
>>>>> 
>>>>> A memory leak is very unlikely, unless your Java or your operating
>>>>> system is broken.  I can't say for sure that it's not happening, but
>>>>> it's just not something we see around here.
>>>>> 
>>>>> Here's what I have collected on performance issues in Solr.  This page
>>>>> does mostly concern itself with memory, though it touches briefly on
>>>>> other topics:
>>>>> 
>>>>> https://wiki.apache.org/solr/SolrPerformanceProblems
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>> 
>>>> 
>>

Re: Memory leak in Solr

Posted by S G <sg...@gmail.com>.

Thank you Erick.
Our Solr version is 4.10 and we are not doing any sorting or faceting.

I am trying to find ways to investigate this problem, so I am asking a few
more questions about the steps normally taken in such situations.
(I searched for some of them on the Internet but could not find anything
good.)
Any pointers provided here will help us resolve this more quickly.


1) Is there a conclusive way to detect memory leaks?
  How does Solr ensure with each release that there are no memory leaks?
  With a 24 GB heap (-Xmx parameter), I now sometimes see GC pauses of about
  1 second.
  Looks like we will need to scale it down.
  Total VM memory is 92 GB and Solr is the only process running on it.


2) How can I tell whether the ZooKeeper connectivity to Solr is poor?
  What commands/steps are normally used to resolve this?
  Does Solr have any metrics that expose ZooKeeper interaction statistics?


3) In a span of 9 hours, I see:
  4 times: java.net.SocketException: Connection reset
  32 times: java.net.SocketTimeoutException: Read timed out

And several other exceptions that ultimately bring a whole shard down
(leader is recovery-failed and replica is down).
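
One way to get counts like the ones above is to tally exception class names
in the log. A minimal sketch, assuming plain-text log lines already read into
memory; the regular expression and the sample lines are illustrative, not
taken from a real Solr log:

```python
import re
from collections import Counter

# Matches fully qualified Java class names ending in "Exception" or "Error",
# e.g. java.net.SocketTimeoutException (the pattern is an assumption of this sketch).
EXCEPTION_RE = re.compile(r"\b(?:[a-z_]\w*\.)+[A-Z]\w*(?:Exception|Error)\b")


def count_exceptions(lines):
    """Tally how often each exception class name appears in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(EXCEPTION_RE.findall(line))
    return counts


# Hypothetical log lines for demonstration.
sample = [
    "ERROR - java.net.SocketException: Connection reset",
    "ERROR - java.net.SocketTimeoutException: Read timed out",
    "WARN  - java.net.SocketTimeoutException: Read timed out",
]
for name, n in count_exceptions(sample).most_common():
    print(f"{n} times: {name}")
```

Lining these counts up against timestamps in the GC log would show whether
the timeouts cluster around long pauses.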

I understand that the above information might not be sufficient to get the
full picture.
But in case someone has resolved or debugged these issues before,
please share your experience.
It would be of great help to me.

Thanks,
SG





On Sun, Dec 4, 2016 at 8:59 AM, Erick Erickson <er...@gmail.com>
wrote:

> All of this is consistent with not having a properly
> tuned Solr instance wrt # documents, usage
> pattern, memory allocated to the JVM, GC
> settings and the like.
>
> Your leader issues can be explained by long
> GC pauses too. Zookeeper periodically pings
> each replica it knows about and if the response
> times out (due to GC in this case) then Zookeeper
> thinks the node has gone away and marks
> it as "down". Similarly when a leader forwards
> an update to a follower and the request times
> out, the leader will mark the follower as down.
> Do this enough and the state of the cluster gets
> "interesting".
>
> You still haven't told us what version of Solr
> you're using, the "Version" you took from
> the core stats is the version of the _index_,
> not Solr.
>
> You have almost 200M documents on
> a single core. That's definitely on the high side,
> although I've seen that work. Assuming
> you aren't doing things like faceting and
> sorting and the like on non docValues fields.
>
> As others have pointed out, the link you
> provided doesn't provide much in the way of
> any "smoking guns" as far as a memory
> leak is concerned.
>
> I've certainly seen situations where memory
> required by Solr is close to the total memory
> allocated to the JVM for instance. Then the GC
> cycle kicks in and recovers just enough to
> go on for a very brief time before going into another
> GC cycle resulting in very poor performance.
>
> So overall this looks like you need to do some
> serious tuning of your Solr instances, take a
> hard look at how you're using your physical
> machines. You specify that these are VMs,
> but how many VMs are you running per box?
> How much JVM have you allocated for each?
> How much total physical memory do you have
> to work with per box?
>
> Even if you provide the answers to the above
> questions, there's not much we can do to
> help you resolve your issues assuming it's
> simply inappropriate sizing. I'd really recommend
> you create a stress environment so you can
> test different scenarios to become confident about
> your expected performance, here's a blog on the
> subject:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Sat, Dec 3, 2016 at 8:46 PM, S G <sg...@gmail.com> wrote:
> > The symptom we see is that the java clients querying Solr see response
> > times in 10s of seconds (not milliseconds).
> > And on the tomcat's gc.log file (where Solr is running), we see very bad GC
> > pauses - threads being paused for 0.5 seconds per second approximately.
> >
> > Some numbers for the Solr Cloud:
> >
> > *Overall infrastructure:*
> > - Only one collection
> > - 16 VMs used
> > - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
> >
> > *Overview from one core:*
> > - Num Docs:193,623,388
> > - Max Doc:230,577,696
> > - Heap Memory Usage:231,217,880
> > - Deleted Docs:36,954,308
> > - Version:2,357,757
> > - Segment Count:37
> >
> > *Stats from QueryHandler/select*
> > - requests:78,557
> > - errors:358
> > - timeouts:0
> > - totalTime:1,639,975.27
> > - avgRequestsPerSecond:2.62
> > - 5minRateReqsPerSecond:1.39
> > - 15minRateReqsPerSecond:1.64
> > - avgTimePerRequest:20.87
> > - medianRequestTime:0.70
> > - 75thPcRequestTime:1.11
> > - 95thPcRequestTime:191.76
> >
> > *Stats from QueryHandler/update*
> > - requests:33,555
> > - errors:0
> > - timeouts:0
> > - totalTime:227,870.58
> > - avgRequestsPerSecond:1.12
> > - 5minRateReqsPerSecond:1.16
> > - 15minRateReqsPerSecond:1.23
> > - avgTimePerRequest:6.79
> > - medianRequestTime:3.16
> > - 75thPcRequestTime:5.27
> > - 95thPcRequestTime:9.33
> >
> > And yet the Solr clients are reporting timeouts and very long read times.
> >
> > Plus, on every server, we are seeing lots of exceptions.
> > For example:
> >
> > Between 8:06:55 PM and 8:21:36 PM, exceptions are:
> >
> > 1) Request says it is coming from leader, but we are the leader:
> > update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
> >
> > 2) org.apache.solr.common.SolrException: Request says it is coming from
> > leader, but we are the leader
> >
> > 3) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 4) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 5) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 6) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 7) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> > available to handle this request. Zombie server list:
> > [HOSTA_ca_1_1456429897]
> >
> > 8) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> > available to handle this request. Zombie server list:
> > [HOSTA_ca_1_1456429897]
> >
> > 9) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 10) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 11) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 12) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > Why are we seeing so many timeouts then and why so huge response times on
> > the client?
> >
> > Thanks
> > SG
> >
> >
> >
> > On Sat, Dec 3, 2016 at 4:19 PM, <bi...@gmail.com> wrote:
> >
> >> What tool is that? I would like to run those stats on my Solr instance.
> >>
> >> Bill Bell
> >> Sent from mobile
> >>
> >>
> >> > On Dec 2, 2016, at 4:49 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> >> >
> >> >> On 12/2/2016 12:01 PM, S G wrote:
> >> >> This post shows some stats on Solr which indicate that there might be a
> >> >> memory leak in there.
> >> >>
> >> >> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
> >> >>
> >> >> Can someone please help to debug this?
> >> >> It might be a very good step in making Solr stable if we can fix this.
> >> >
> >> > +1 to what Walter said.
> >> >
> >> > I replied earlier on the stackoverflow question.
> >> >
> >> > FYI -- your 95th percentile request time of about 16 milliseconds is NOT
> >> > something that I would characterize as "very high."  I would *love* to
> >> > have statistics that good.
> >> >
> >> > Even your 99th percentile request time is not much more than a full
> >> > second.  If a search takes a couple of seconds, most users will not
> >> > really care, and some might not even notice.  It's when a large
> >> > percentage of queries start taking several seconds that complaints start
> >> > coming in.  On your system, 99 percent of your queries are completing in
> >> > 1.3 seconds or less, and 95 percent of them are less than 17
> >> > milliseconds.  That sounds quite good to me.
> >> >
> >> > In my experience, the time it takes for the browser to receive the
> >> > search result page and render it is a significant part of the total time
> >> > to see results, and often dwarfs the time spent getting info from Solr.
> >> >
> >> > Here's some numbers from Solr in my organization:
> >> >
> >> > requests:               4102054
> >> > errors:                 364894
> >> > timeouts:               49
> >> > totalTime:              799446287.45041
> >> > avgRequestsPerSecond:   1.2375565828793849
> >> > 5minRateReqsPerSecond:  0.8444329508327961
> >> > 15minRateReqsPerSecond: 0.8631197328073346
> >> > avgTimePerRequest:      194.88926460997587
> >> > medianRequestTime:      20.8566605
> >> > 75thPcRequestTime:      85.51328849999999
> >> > 95thPcRequestTime:      2202.277466549999
> >> > 99thPcRequestTime:      5280.375381280002
> >> > 999thPcRequestTime:     6866.020122961001
> >> >
> >> > The numbers above come from a distributed index that contains 167
> >> > million documents and takes up about 200GB of disk space across two
> >> > machines.
> >> >
> >> > requests:               192683
> >> > errors:                 124
> >> > timeouts:               0
> >> > totalTime:              199380421.985073
> >> > avgRequestsPerSecond:   0.042222722771354554
> >> > 5minRateReqsPerSecond:  0.00800545427600684
> >> > 15minRateReqsPerSecond: 0.017521222412364163
> >> > avgTimePerRequest:      1034.7587591280653
> >> > medianRequestTime:      541.591858
> >> > 75thPcRequestTime:      1683.83246125
> >> > 95thPcRequestTime:      5644.542019949997
> >> > 99thPcRequestTime:      9445.592394760004
> >> > 999thPcRequestTime:     14602.166640771007
> >> >
> >> > These numbers are from an index with about 394 million documents, taking
> >> > up nearly 500GB of disk space.  This index is also distributed on
> >> > multiple machines.
> >> >
> >> > Are you experiencing any problems other than what you perceive as slow
> >> > queries?  I asked some other questions on stackoverflow.  In particular,
> >> > I'd like to know the total memory on the server, the total number of
> >> > documents (maxDoc and numDoc) you're handling with this server, as well
> >> > as the total index size.  What do your queries look like?  What version
> >> > and vendor of Java are you using?  Can you share your config/schema?
> >> >
> >> > A memory leak is very unlikely, unless your Java or your operating
> >> > system is broken.  I can't say for sure that it's not happening, but
> >> > it's just not something we see around here.
> >> >
> >> > Here's what I have collected on performance issues in Solr.  This page
> >> > does mostly concern itself with memory, though it touches briefly on
> >> > other topics:
> >> >
> >> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >>
>

Re: Memory leak in Solr

Posted by Erick Erickson <er...@gmail.com>.

All of this is consistent with not having a properly
tuned Solr instance wrt # documents, usage
pattern, memory allocated to the JVM, GC
settings and the like.

Your leader issues can be explained by long
GC pauses too. Zookeeper periodically pings
each replica it knows about and if the response
times out (due to GC in this case) then Zookeeper
thinks the node has gone away and marks
it as "down". Similarly when a leader forwards
an update to a follower and the request times
out, the leader will mark the follower as down.
Do this enough and the state of the cluster gets
"interesting".
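
A rough way to check this against a GC log is to compare the observed pause
durations with the ZooKeeper session timeout (zkClientTimeout in Solr). A
minimal sketch; the pause values and the 15-second timeout are hypothetical,
and pauses that merely come close to the timeout can also cause trouble:

```python
def risky_pauses(pause_seconds, session_timeout_seconds):
    """Return the GC pauses long enough to outlast the session timeout, longest first."""
    if session_timeout_seconds <= 0:
        raise ValueError("session_timeout_seconds must be positive")
    risky = [p for p in pause_seconds if p >= session_timeout_seconds]
    return sorted(risky, reverse=True)


# Hypothetical pause durations (seconds) pulled from a GC log.
pauses = [0.5, 1.0, 16.2, 0.4, 20.0]
print(risky_pauses(pauses, session_timeout_seconds=15.0))
```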

You still haven't told us what version of Solr
you're using. The "Version" you took from
the core stats is the version of the _index_,
not Solr.

You have almost 200M documents on
a single core. That's definitely on the high side,
although I've seen that work, assuming
you aren't doing things like faceting and
sorting and the like on non-docValues fields.

As others have pointed out, the link you
provided doesn't provide much in the way of
any "smoking guns" as far as a memory
leak is concerned.

I've certainly seen situations where memory
required by Solr is close to the total memory
allocated to the JVM for instance. Then the GC
cycle kicks in and recovers just enough to
go on for a very brief time before going into another
GC cycle resulting in very poor performance.
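
To put a number on that pattern, you can compute the fraction of wall-clock
time spent in GC pauses. A minimal sketch; the 10% cutoff is an assumed rule
of thumb rather than a Solr or JVM limit, and the sample window mirrors the
"0.5 seconds per second" figure reported earlier in the thread:

```python
def gc_overhead(pause_seconds, window_seconds):
    """Fraction of a wall-clock window spent in stop-the-world GC pauses."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return sum(pause_seconds) / window_seconds


def looks_like_thrashing(overhead, threshold=0.10):
    """True when GC overhead exceeds the threshold (0.10 is an assumed cutoff)."""
    return overhead > threshold


# Hypothetical window: ten 0.5-second pauses observed over 10 seconds.
overhead = gc_overhead([0.5] * 10, 10.0)
print(f"GC overhead: {overhead:.0%}, thrashing: {looks_like_thrashing(overhead)}")
```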

So overall this looks like you need to do some
serious tuning of your Solr instances and take a
hard look at how you're using your physical
machines. You specify that these are VMs,
but how many VMs are you running per box?
How much JVM heap have you allocated for each?
How much total physical memory do you have
to work with per box?

Even if you provide the answers to the above
questions, there's not much we can do to
help you resolve your issues assuming it's
simply inappropriate sizing. I'd really recommend
you create a stress environment so you can
test different scenarios to become confident about
your expected performance. Here's a blog on the
subject:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Sat, Dec 3, 2016 at 8:46 PM, S G <sg...@gmail.com> wrote:
> The symptom we see is that the java clients querying Solr see response
> times in 10s of seconds (not milliseconds).
> And on the tomcat's gc.log file (where Solr is running), we see very bad GC
> pauses - threads being paused for 0.5 seconds per second approximately.
>
> Some numbers for the Solr Cloud:
>
> *Overall infrastructure:*
> - Only one collection
> - 16 VMs used
> - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
>
> *Overview from one core:*
> - Num Docs:193,623,388
> - Max Doc:230,577,696
> - Heap Memory Usage:231,217,880
> - Deleted Docs:36,954,308
> - Version:2,357,757
> - Segment Count:37
>
> *Stats from QueryHandler/select*
> - requests:78,557
> - errors:358
> - timeouts:0
> - totalTime:1,639,975.27
> - avgRequestsPerSecond:2.62
> - 5minRateReqsPerSecond:1.39
> - 15minRateReqsPerSecond:1.64
> - avgTimePerRequest:20.87
> - medianRequestTime:0.70
> - 75thPcRequestTime:1.11
> - 95thPcRequestTime:191.76
>
> *Stats from QueryHandler/update*
> - requests:33,555
> - errors:0
> - timeouts:0
> - totalTime:227,870.58
> - avgRequestsPerSecond:1.12
> - 5minRateReqsPerSecond:1.16
> - 15minRateReqsPerSecond:1.23
> - avgTimePerRequest:6.79
> - medianRequestTime:3.16
> - 75thPcRequestTime:5.27
> - 95thPcRequestTime:9.33
>
> And yet the Solr clients are reporting timeouts and very long read times.
>
> Plus, on every server, we are seeing lots of exceptions.
> For example:
>
> Between 8:06:55 PM and 8:21:36 PM, exceptions are:
>
> 1) Request says it is coming from leader, but we are the leader:
> update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
>
> 2) org.apache.solr.common.SolrException: Request says it is coming from
> leader, but we are the leader
>
> 3) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 4) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 5) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 6) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 7) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this request. Zombie server list:
> [HOSTA_ca_1_1456429897]
>
> 8) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this request. Zombie server list:
> [HOSTA_ca_1_1456429897]
>
> 9) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 10) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 11) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 12) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> Why are we seeing so many timeouts then and why so huge response times on
> the client?
>
> Thanks
> SG
>
>
>
> On Sat, Dec 3, 2016 at 4:19 PM, <bi...@gmail.com> wrote:
>
>> What tool is that ? The stats I would like to run on my Solr instance
>>
>> Bill Bell
>> Sent from mobile
>>
>>
>> > On Dec 2, 2016, at 4:49 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>> >
>> >> On 12/2/2016 12:01 PM, S G wrote:
>> >> This post shows some stats on Solr which indicate that there might be a
>> >> memory leak in there.
>> >>
>> >> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
>> >>
>> >> Can someone please help to debug this?
>> >> It might be a very good step in making Solr stable if we can fix this.
>> >
>> > +1 to what Walter said.
>> >
>> > I replied earlier on the stackoverflow question.
>> >
>> > FYI -- your 95th percentile request time of about 16 milliseconds is NOT
>> > something that I would characterize as "very high."  I would *love* to
>> > have statistics that good.
>> >
>> > Even your 99th percentile request time is not much more than a full
>> > second.  If a search takes a couple of seconds, most users will not
>> > really care, and some might not even notice.  It's when a large
>> > percentage of queries start taking several seconds that complaints start
>> > coming in.  On your system, 99 percent of your queries are completing in
>> > 1.3 seconds or less, and 95 percent of them are less than 17
>> > milliseconds.  That sounds quite good to me.
>> >
>> > In my experience, the time it takes for the browser to receive the
>> > search result page and render it is a significant part of the total time
>> > to see results, and often dwarfs the time spent getting info from Solr.
>> >
>> > Here's some numbers from Solr in my organization:
>> >
>> > requests:               4102054
>> > errors:                 364894
>> > timeouts:               49
>> > totalTime:              799446287.45041
>> > avgRequestsPerSecond:   1.2375565828793849
>> > 5minRateReqsPerSecond:  0.8444329508327961
>> > 15minRateReqsPerSecond: 0.8631197328073346
>> > avgTimePerRequest:      194.88926460997587
>> > medianRequestTime:      20.8566605
>> > 75thPcRequestTime:      85.51328849999999
>> > 95thPcRequestTime:      2202.277466549999
>> > 99thPcRequestTime:      5280.375381280002
>> > 999thPcRequestTime:     6866.020122961001
>> >
>> > The numbers above come from a distributed index that contains 167
>> > million documents and takes up about 200GB of disk space across two
>> > machines.
>> >
>> > requests:               192683
>> > errors:                 124
>> > timeouts:               0
>> > totalTime:              199380421.985073
>> > avgRequestsPerSecond:   0.042222722771354554
>> > 5minRateReqsPerSecond:  0.00800545427600684
>> > 15minRateReqsPerSecond: 0.017521222412364163
>> > avgTimePerRequest:      1034.7587591280653
>> > medianRequestTime:      541.591858
>> > 75thPcRequestTime:      1683.83246125
>> > 95thPcRequestTime:      5644.542019949997
>> > 99thPcRequestTime:      9445.592394760004
>> > 999thPcRequestTime:     14602.166640771007
>> >
>> > These numbers are from an index with about 394 million documents, taking
>> > up nearly 500GB of disk space.  This index is also distributed on
>> > multiple machines.
>> >
>> > Are you experiencing any problems other than what you perceive as slow
>> > queries?  I asked some other questions on stackoverflow.  In particular,
>> > I'd like to know the total memory on the server, the total number of
>> > documents (maxDoc and numDoc) you're handling with this server, as well
>> > as the total index size.  What do your queries look like?  What version
>> > and vendor of Java are you using?  Can you share your config/schema?
>> >
>> > A memory leak is very unlikely, unless your Java or your operating
>> > system is broken.  I can't say for sure that it's not happening, but
>> > it's just not something we see around here.
>> >
>> > Here's what I have collected on performance issues in Solr.  This page
>> > does mostly concern itself with memory, though it touches briefly on
>> > other topics:
>> >
>> > https://wiki.apache.org/solr/SolrPerformanceProblems
>> >
>> > Thanks,
>> > Shawn
>> >
>>

Re: Memory leak in Solr

Posted by William Bell <bi...@gmail.com>.

What do you mean by JVM level? Run Solr on different ports on the same
machine? If you have a 32-core box, would you run 2, 3, or 4 JVMs?

On Sun, Dec 4, 2016 at 8:46 PM, Jeff Wartes <jw...@whitepages.com> wrote:

>
> Here’s an earlier post where I mentioned some GC investigation tools:
> https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3C8F8FA32D-EC0E-4352-86F7-4B2D8A906903@whitepages.com%3E
>
> In my experience, there are many aspects of the Solr/Lucene memory
> allocation model that scale with things other than documents returned
> (such as cardinality, or simply index size). A single query on a large index
> might consume dozens of megabytes of heap to complete. But that heap should
> also be released quickly after the query finishes.
> The key characteristic of a memory leak is that the software is allocating
> memory that it cannot reclaim. If it’s a leak, you ought to be able to
> reproduce it at any query rate - have you tried this? A run with, say, half
> the rate, over twice the duration?
>
> I’m inclined to agree with others here, that although you’ve correctly
> attributed the cause to GC, it’s probably less an indication of a leak, and
> more an indication of simply allocating memory faster than it can be
> reclaimed, combined with the long pauses that are increasingly unavoidable
> as heap size goes up.
> Note that in the case of a CMS allocation failure, the fallback full-GC is
> *single threaded*, which means it’ll usually take considerably longer than
> a normal GC - even for a comparable amount of garbage.
>
> In addition to GC tuning, you can address these by sharding more, both at
> the core and jvm level.
>
>
> On 12/4/16, 3:46 PM, "Shawn Heisey" <ap...@elyograg.org> wrote:
>
>     On 12/3/2016 9:46 PM, S G wrote:
>     > The symptom we see is that the java clients querying Solr see response
>     > times in 10s of seconds (not milliseconds).
>     <snip>
>     > Some numbers for the Solr Cloud:
>     >
>     > *Overall infrastructure:*
>     > - Only one collection
>     > - 16 VMs used
>     > - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
>     >
>     > *Overview from one core:*
>     > - Num Docs:193,623,388
>     > - Max Doc:230,577,696
>     > - Heap Memory Usage:231,217,880
>     > - Deleted Docs:36,954,308
>     > - Version:2,357,757
>     > - Segment Count:37
>
>     The heap memory usage number isn't useful.  It doesn't cover all the
>     memory used.
>
>     > *Stats from QueryHandler/select*
>     > - requests:78,557
>     > - errors:358
>     > - timeouts:0
>     > - totalTime:1,639,975.27
>     > - avgRequestsPerSecond:2.62
>     > - 5minRateReqsPerSecond:1.39
>     > - 15minRateReqsPerSecond:1.64
>     > - avgTimePerRequest:20.87
>     > - medianRequestTime:0.70
>     > - 75thPcRequestTime:1.11
>     > - 95thPcRequestTime:191.76
>
>     These times are in *milliseconds*, not seconds .. and these are even
>     better numbers than you showed before.  Where are you seeing 10 plus
>     second query times?  Solr is not showing numbers like that.
>
>     If your VM host has 16 VMs on it and each one has a total memory size of
>     92GB, then if that machine doesn't have 1.5 terabytes of memory, you're
>     oversubscribed, and this is going to lead to terrible performance... but
>     the numbers you've shown here do not show terrible performance.
>
>     > Plus, on every server, we are seeing lots of exceptions.
>     > For example:
>     >
>     > Between 8:06:55 PM and 8:21:36 PM, exceptions are:
>     >
>     > 1) Request says it is coming from leader, but we are the leader:
>     > update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
>     >
>     > 2) org.apache.solr.common.SolrException: Request says it is coming from
>     > leader, but we are the leader
>     >
>     > 3) org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 4) null:org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 5) org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 6) null:org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 7) org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
>     > available to handle this request. Zombie server list:
>     > [HOSTA_ca_1_1456429897]
>     >
>     > 8) null:org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
>     > available to handle this request. Zombie server list:
>     > [HOSTA_ca_1_1456429897]
>     >
>     > 9) org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 10) null:org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 11) org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>     >
>     > 12) null:org.apache.solr.common.SolrException:
>     > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
>     > operation and it timed out, so failing fast
>
>     These errors sound like timeouts, possibly caused by long GC pauses ...
>     but as already mentioned, the query handler statistics do not indicate
>     long query times.  If a long GC were to happen during a query, then the
>     query time would be long as well.
>
>     The core information above doesn't include the size of the index on
>     disk.  That number would be useful for telling you whether there's
>     enough memory.
>
>     As I said at the beginning of the thread, I haven't seen anything here
>     to indicate a memory leak, and others are using version 4.10 without any
>     problems.  If there were a memory leak in a released version of Solr,
>     many people would have run into problems with it.
>
>     Thanks,
>     Shawn
>
>
>
>


-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076

Re: Memory leak in Solr

Posted by Jeff Wartes <jw...@whitepages.com>.

Here’s an earlier post where I mentioned some GC investigation tools:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3C8F8FA32D-EC0E-4352-86F7-4B2D8A906903@whitepages.com%3E

In my experience, there are many aspects of the Solr/Lucene memory allocation model that scale with things other than documents returned (such as cardinality, or simply index size). A single query on a large index might consume dozens of megabytes of heap to complete. But that heap should also be released quickly after the query finishes.
The key characteristic of a memory leak is that the software is allocating memory that it cannot reclaim. If it’s a leak, you ought to be able to reproduce it at any query rate - have you tried this? A run with, say, half the rate, over twice the duration?
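
One way to tell the two cases apart is to look at how much heap is still
in use right after each full GC. With a leak, that floor keeps climbing;
with plain allocation pressure, it stays roughly flat. The following is a
minimal sketch of that check, not a diagnostic tool. The sample points and
the 50 MB/hour threshold are invented for illustration and do not come
from this thread.

```python
# Each point is (hours since start, heap MB still in use right after a full GC).
# All sample values below are hypothetical.

def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

def looks_like_leak(points, mb_per_hour_threshold=50.0):
    """True if the post-full-GC floor rises faster than the (arbitrary) threshold."""
    return slope(points) > mb_per_hour_threshold

steady = [(1, 4100), (2, 4080), (3, 4120), (4, 4090), (5, 4110)]
rising = [(1, 4100), (2, 4400), (3, 4750), (4, 5050), (5, 5400)]

print(looks_like_leak(steady))
print(looks_like_leak(rising))
```

Running the same check at half the query rate over twice the duration, as
suggested above, should give a similar verdict if the growth is a real leak.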

I’m inclined to agree with others here, that although you’ve correctly attributed the cause to GC, it’s probably less an indication of a leak, and more an indication of simply allocating memory faster than it can be reclaimed, combined with the long pauses that are increasingly unavoidable as heap size goes up.
Note that in the case of a CMS allocation failure, the fallback full-GC is *single threaded*, which means it’ll usually take considerably longer than a normal GC - even for a comparable amount of garbage.

In addition to GC tuning, you can address these by sharding more, both at the core and JVM level.


On 12/4/16, 3:46 PM, "Shawn Heisey" <ap...@elyograg.org> wrote:

    On 12/3/2016 9:46 PM, S G wrote:
    > The symptom we see is that the java clients querying Solr see response
    > times in 10s of seconds (not milliseconds).
    <snip>
    > Some numbers for the Solr Cloud:
    >
    > *Overall infrastructure:*
    > - Only one collection
    > - 16 VMs used
    > - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
    >
    > *Overview from one core:*
    > - Num Docs:193,623,388
    > - Max Doc:230,577,696
    > - Heap Memory Usage:231,217,880
    > - Deleted Docs:36,954,308
    > - Version:2,357,757
    > - Segment Count:37
    
    The heap memory usage number isn't useful.  It doesn't cover all the
    memory used.
    
    > *Stats from QueryHandler/select*
    > - requests:78,557
    > - errors:358
    > - timeouts:0
    > - totalTime:1,639,975.27
    > - avgRequestsPerSecond:2.62
    > - 5minRateReqsPerSecond:1.39
    > - 15minRateReqsPerSecond:1.64
    > - avgTimePerRequest:20.87
    > - medianRequestTime:0.70
    > - 75thPcRequestTime:1.11
    > - 95thPcRequestTime:191.76
    
    These times are in *milliseconds*, not seconds .. and these are even
    better numbers than you showed before.  Where are you seeing 10 plus
    second query times?  Solr is not showing numbers like that.
    
    If your VM host has 16 VMs on it and each one has a total memory size of
    92GB, then if that machine doesn't have 1.5 terabytes of memory, you're
    oversubscribed, and this is going to lead to terrible performance... but
    the numbers you've shown here do not show terrible performance.
    
    > Plus, on every server, we are seeing lots of exceptions.
    > For example:
    >
    > Between 8:06:55 PM and 8:21:36 PM, exceptions are:
    >
    > 1) Request says it is coming from leader, but we are the leader:
    > update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
    >
    > 2) org.apache.solr.common.SolrException: Request says it is coming from
    > leader, but we are the leader
    >
    > 3) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 4) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 5) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 6) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 7) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
    > available to handle this request. Zombie server list:
    > [HOSTA_ca_1_1456429897]
    >
    > 8) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
    > available to handle this request. Zombie server list:
    > [HOSTA_ca_1_1456429897]
    >
    > 9) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 10) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 11) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 12) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    
    These errors sound like timeouts, possibly caused by long GC pauses ...
    but as already mentioned, the query handler statistics do not indicate
    long query times.  If a long GC were to happen during a query, then the
    query time would be long as well.
    
    The core information above doesn't include the size of the index on
    disk.  That number would be useful for telling you whether there's
    enough memory.
    
    As I said at the beginning of the thread, I haven't seen anything here
    to indicate a memory leak, and others are using version 4.10 without any
    problems.  If there were a memory leak in a released version of Solr,
    many people would have run into problems with it.
    
    Thanks,
    Shawn

Re: Memory leak in Solr

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/3/2016 9:46 PM, S G wrote:
> The symptom we see is that the java clients querying Solr see response
> times in 10s of seconds (not milliseconds).
<snip>
> Some numbers for the Solr Cloud:
>
> *Overall infrastructure:*
> - Only one collection
> - 16 VMs used
> - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
>
> *Overview from one core:*
> - Num Docs:193,623,388
> - Max Doc:230,577,696
> - Heap Memory Usage:231,217,880
> - Deleted Docs:36,954,308
> - Version:2,357,757
> - Segment Count:37

The heap memory usage number isn't useful.  It doesn't cover all the
memory used.

> *Stats from QueryHandler/select*
> - requests:78,557
> - errors:358
> - timeouts:0
> - totalTime:1,639,975.27
> - avgRequestsPerSecond:2.62
> - 5minRateReqsPerSecond:1.39
> - 15minRateReqsPerSecond:1.64
> - avgTimePerRequest:20.87
> - medianRequestTime:0.70
> - 75thPcRequestTime:1.11
> - 95thPcRequestTime:191.76

These times are in *milliseconds*, not seconds .. and these are even
better numbers than you showed before.  Where are you seeing 10 plus
second query times?  Solr is not showing numbers like that.
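
As a rough cross-check of the select handler stats quoted above, the
arithmetic can be sketched as below. It assumes that avgRequestsPerSecond
is averaged over the handler's whole lifetime, which the stats themselves
do not state, and the variable names are made up for this sketch.

```python
# Figures copied from the QueryHandler/select stats quoted above.
requests = 78_557
total_time = 1_639_975.27        # totalTime, as reported
avg_requests_per_second = 2.62   # avgRequestsPerSecond, as reported

# Should land close to the reported avgTimePerRequest of 20.87.
avg_time_per_request = total_time / requests

# Approximate time the handler has been counting, under the assumption above.
uptime_seconds = requests / avg_requests_per_second

# With totalTime read as milliseconds, the share of uptime spent serving queries.
busy_fraction_if_ms = (total_time / 1000.0) / uptime_seconds

print(round(avg_time_per_request, 2))
print(round(uptime_seconds))
print(round(busy_fraction_if_ms, 3))
```

Read as milliseconds, the handler spent only a small fraction of its
uptime answering queries, which fits the point that Solr itself is not
reporting multi-second query times.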

If your VM host has 16 VMs on it and each one has a total memory size of
92GB, then if that machine doesn't have 1.5 terabytes of memory, you're
oversubscribed, and this is going to lead to terrible performance... but
the numbers you've shown here do not show terrible performance.
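
The oversubscription arithmetic is simple enough to sketch. This assumes,
as the paragraph above does, that all 16 VMs share one physical host; the
1024 GB host size is a hypothetical example, not a figure from the thread.

```python
def required_host_memory_gb(vm_count, memory_per_vm_gb):
    """Memory the host needs if every VM is to get its full allocation."""
    return vm_count * memory_per_vm_gb

def is_oversubscribed(host_memory_gb, vm_count, memory_per_vm_gb):
    """True if the VMs together are promised more memory than the host has."""
    return required_host_memory_gb(vm_count, memory_per_vm_gb) > host_memory_gb

needed = required_host_memory_gb(16, 92)   # 1472 GB, about 1.5 terabytes
print(needed)
print(is_oversubscribed(1024, 16, 92))     # hypothetical 1 TB host
```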

> Plus, on every server, we are seeing lots of exceptions.
> For example:
>
> Between 8:06:55 PM and 8:21:36 PM, exceptions are:
>
> 1) Request says it is coming from leader, but we are the leader:
> update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
>
> 2) org.apache.solr.common.SolrException: Request says it is coming from
> leader, but we are the leader
>
> 3) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 4) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 5) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 6) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 7) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this request. Zombie server list:
> [HOSTA_ca_1_1456429897]
>
> 8) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this request. Zombie server list:
> [HOSTA_ca_1_1456429897]
>
> 9) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 10) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 11) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 12) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast

These errors sound like timeouts, possibly caused by long GC pauses ...
but as already mentioned, the query handler statistics do not indicate
long query times.  If a long GC were to happen during a query, then the
query time would be long as well.

The core information above doesn't include the size of the index on
disk.  That number would be useful for telling you whether there's
enough memory.

As I said at the beginning of the thread, I haven't seen anything here
to indicate a memory leak, and others are using version 4.10 without any
problems.  If there were a memory leak in a released version of Solr,
many people would have run into problems with it.

Thanks,
Shawn

Re: Memory leak in Solr

Posted by S G <sg...@gmail.com>.

The symptom we see is that the java clients querying Solr see response
times in 10s of seconds (not milliseconds).
And in Tomcat's gc.log file (Tomcat is where Solr is running), we see very
bad GC pauses: threads are paused for approximately 0.5 seconds out of
every second.
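
A pause fraction like this can be estimated from the log with a sketch
along the following lines. It assumes a HotSpot JVM started with
-XX:+PrintGCApplicationStoppedTime, which writes "Total time for which
application threads were stopped" lines; the thread does not confirm which
GC logging flags are in use, and the sample log lines are invented.

```python
import re

# Matches the stopped-time lines written when PrintGCApplicationStoppedTime is on.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def total_pause_seconds(log_lines):
    """Sum every stopped-time value found in the given log lines."""
    total = 0.0
    for line in log_lines:
        match = STOPPED.search(line)
        if match:
            total += float(match.group(1))
    return total

def pause_fraction(log_lines, wall_clock_seconds):
    """Share of the wall-clock window during which application threads were stopped."""
    return total_pause_seconds(log_lines) / wall_clock_seconds

# Hypothetical two-second window of a gc.log.
sample = [
    "10.000: Total time for which application threads were stopped: 0.4500000 seconds",
    "11.100: some unrelated log line",
    "11.600: Total time for which application threads were stopped: 0.5500000 seconds",
]

print(pause_fraction(sample, 2.0))
```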

Some numbers for the Solr Cloud:

*Overall infrastructure:*
- Only one collection
- 16 VMs used
- 8 shards (1 leader and 1 replica per shard - each core on separate VM)

*Overview from one core:*
- Num Docs:193,623,388
- Max Doc:230,577,696
- Heap Memory Usage:231,217,880
- Deleted Docs:36,954,308
- Version:2,357,757
- Segment Count:37

*Stats from QueryHandler/select*
- requests:78,557
- errors:358
- timeouts:0
- totalTime:1,639,975.27
- avgRequestsPerSecond:2.62
- 5minRateReqsPerSecond:1.39
- 15minRateReqsPerSecond:1.64
- avgTimePerRequest:20.87
- medianRequestTime:0.70
- 75thPcRequestTime:1.11
- 95thPcRequestTime:191.76

*Stats from QueryHandler/update*
- requests:33,555
- errors:0
- timeouts:0
- totalTime:227,870.58
- avgRequestsPerSecond:1.12
- 5minRateReqsPerSecond:1.16
- 15minRateReqsPerSecond:1.23
- avgTimePerRequest:6.79
- medianRequestTime:3.16
- 75thPcRequestTime:5.27
- 95thPcRequestTime:9.33

And yet the Solr clients are reporting timeouts and very long read times.

Plus, on every server, we are seeing lots of exceptions.
For example:

Between 8:06:55 PM and 8:21:36 PM, exceptions are:

1) Request says it is coming from leader, but we are the leader:
update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2

2) org.apache.solr.common.SolrException: Request says it is coming from
leader, but we are the leader

3) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

4) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

5) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

6) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

7) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request. Zombie server list:
[HOSTA_ca_1_1456429897]

8) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request. Zombie server list:
[HOSTA_ca_1_1456429897]

9) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

10) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

11) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

12) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

So why are we seeing so many timeouts, and why such huge response times on
the client?

Thanks
SG



On Sat, Dec 3, 2016 at 4:19 PM, <bi...@gmail.com> wrote:

> What tool is that ? The stats I would like to run on my Solr instance
>
> Bill Bell
> Sent from mobile
>
>
> > On Dec 2, 2016, at 4:49 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> >
> >> On 12/2/2016 12:01 PM, S G wrote:
> >> This post shows some stats on Solr which indicate that there might be a
> >> memory leak in there.
> >>
> >> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
> >>
> >> Can someone please help to debug this?
> >> It might be a very good step in making Solr stable if we can fix this.
> >
> > +1 to what Walter said.
> >
> > I replied earlier on the stackoverflow question.
> >
> > FYI -- your 95th percentile request time of about 16 milliseconds is NOT
> > something that I would characterize as "very high."  I would *love* to
> > have statistics that good.
> >
> > Even your 99th percentile request time is not much more than a full
> > second.  If a search takes a couple of seconds, most users will not
> > really care, and some might not even notice.  It's when a large
> > percentage of queries start taking several seconds that complaints start
> > coming in.  On your system, 99 percent of your queries are completing in
> > 1.3 seconds or less, and 95 percent of them are less than 17
> > milliseconds.  That sounds quite good to me.
> >
> > In my experience, the time it takes for the browser to receive the
> > search result page and render it is a significant part of the total time
> > to see results, and often dwarfs the time spent getting info from Solr.
> >
> > Here's some numbers from Solr in my organization:
> >
> > requests:               4102054
> > errors:                 364894
> > timeouts:               49
> > totalTime:              799446287.45041
> > avgRequestsPerSecond:   1.2375565828793849
> > 5minRateReqsPerSecond:  0.8444329508327961
> > 15minRateReqsPerSecond: 0.8631197328073346
> > avgTimePerRequest:      194.88926460997587
> > medianRequestTime:      20.8566605
> > 75thPcRequestTime:      85.51328849999999
> > 95thPcRequestTime:      2202.277466549999
> > 99thPcRequestTime:      5280.375381280002
> > 999thPcRequestTime:     6866.020122961001
> >
> > The numbers above come from a distributed index that contains 167
> > million documents and takes up about 200GB of disk space across two
> > machines.
> >
> > requests:               192683
> > errors:                 124
> > timeouts:               0
> > totalTime:              199380421.985073
> > avgRequestsPerSecond:   0.042222722771354554
> > 5minRateReqsPerSecond:  0.00800545427600684
> > 15minRateReqsPerSecond: 0.017521222412364163
> > avgTimePerRequest:      1034.7587591280653
> > medianRequestTime:      541.591858
> > 75thPcRequestTime:      1683.83246125
> > 95thPcRequestTime:      5644.542019949997
> > 99thPcRequestTime:      9445.592394760004
> > 999thPcRequestTime:     14602.166640771007
> >
> > These numbers are from an index with about 394 million documents, taking
> > up nearly 500GB of disk space.  This index is also distributed on
> > multiple machines.
> >
> > Are you experiencing any problems other than what you perceive as slow
> > queries?  I asked some other questions on stackoverflow.  In particular,
> > I'd like to know the total memory on the server, the total number of
> > documents (maxDoc and numDoc) you're handling with this server, as well
> > as the total index size.  What do your queries look like?  What version
> > and vendor of Java are you using?  Can you share your config/schema?
> >
> > A memory leak is very unlikely, unless your Java or your operating
> > system is broken.  I can't say for sure that it's not happening, but
> > it's just not something we see around here.
> >
> > Here's what I have collected on performance issues in Solr.  This page
> > does mostly concern itself with memory, though it touches briefly on
> > other topics:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Thanks,
> > Shawn
> >
>

Re: Memory leak in Solr

Posted by bi...@gmail.com.

What tool is that? I would like to collect those stats on my Solr instance.

Bill Bell
Sent from mobile


> On Dec 2, 2016, at 4:49 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
>> On 12/2/2016 12:01 PM, S G wrote:
>> This post shows some stats on Solr which indicate that there might be a
>> memory leak in there.
>> 
>> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
>> 
>> Can someone please help to debug this?
>> It might be a very good step in making Solr stable if we can fix this.
> 
> +1 to what Walter said.
> 
> I replied earlier on the stackoverflow question.
> 
> FYI -- your 95th percentile request time of about 16 milliseconds is NOT
> something that I would characterize as "very high."  I would *love* to
> have statistics that good.
> 
> Even your 99th percentile request time is not much more than a full
> second.  If a search takes a couple of seconds, most users will not
> really care, and some might not even notice.  It's when a large
> percentage of queries start taking several seconds that complaints start
> coming in.  On your system, 99 percent of your queries are completing in
> 1.3 seconds or less, and 95 percent of them are less than 17
> milliseconds.  That sounds quite good to me.
> 
> In my experience, the time it takes for the browser to receive the
> search result page and render it is a significant part of the total time
> to see results, and often dwarfs the time spent getting info from Solr.
> 
> Here's some numbers from Solr in my organization:
> 
> requests:               4102054
> errors:                 364894
> timeouts:               49
> totalTime:              799446287.45041
> avgRequestsPerSecond:   1.2375565828793849
> 5minRateReqsPerSecond:  0.8444329508327961
> 15minRateReqsPerSecond: 0.8631197328073346
> avgTimePerRequest:      194.88926460997587
> medianRequestTime:      20.8566605
> 75thPcRequestTime:      85.51328849999999
> 95thPcRequestTime:      2202.277466549999
> 99thPcRequestTime:      5280.375381280002
> 999thPcRequestTime:     6866.020122961001
> 
> The numbers above come from a distributed index that contains 167
> million documents and takes up about 200GB of disk space across two
> machines.
> 
> requests:               192683
> errors:                 124
> timeouts:               0
> totalTime:              199380421.985073
> avgRequestsPerSecond:   0.042222722771354554
> 5minRateReqsPerSecond:  0.00800545427600684
> 15minRateReqsPerSecond: 0.017521222412364163
> avgTimePerRequest:      1034.7587591280653
> medianRequestTime:      541.591858
> 75thPcRequestTime:      1683.83246125
> 95thPcRequestTime:      5644.542019949997
> 99thPcRequestTime:      9445.592394760004
> 999thPcRequestTime:     14602.166640771007
> 
> These numbers are from an index with about 394 million documents, taking
> up nearly 500GB of disk space.  This index is also distributed on
> multiple machines.
> 
> Are you experiencing any problems other than what you perceive as slow
> queries?  I asked some other questions on stackoverflow.  In particular,
> I'd like to know the total memory on the server, the total number of
> documents (maxDoc and numDoc) you're handling with this server, as well
> as the total index size.  What do your queries look like?  What version
> and vendor of Java are you using?  Can you share your config/schema?
> 
> A memory leak is very unlikely, unless your Java or your operating
> system is broken.  I can't say for sure that it's not happening, but
> it's just not something we see around here.
> 
> Here's what I have collected on performance issues in Solr.  This page
> does mostly concern itself with memory, though it touches briefly on
> other topics:
> 
> https://wiki.apache.org/solr/SolrPerformanceProblems
> 
> Thanks,
> Shawn
>

Re: Memory leak in Solr

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/2/2016 12:01 PM, S G wrote:
> This post shows some stats on Solr which indicate that there might be a
> memory leak in there.
>
> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
>
> Can someone please help to debug this?
> It might be a very good step in making Solr stable if we can fix this.

+1 to what Walter said.

I replied earlier on the stackoverflow question.

FYI -- your 95th percentile request time of about 16 milliseconds is NOT
something that I would characterize as "very high."  I would *love* to
have statistics that good.

Even your 99th percentile request time is not much more than a full
second.  If a search takes a couple of seconds, most users will not
really care, and some might not even notice.  It's when a large
percentage of queries start taking several seconds that complaints start
coming in.  On your system, 99 percent of your queries are completing in
1.3 seconds or less, and 95 percent of them are less than 17
milliseconds.  That sounds quite good to me.
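To make the percentile reading concrete, here is a minimal sketch using the nearest-rank definition. The latency values are invented for illustration, and Solr's own statistics may be computed differently (for example from a sampled histogram), so treat this as a way to read the numbers rather than as Solr's method.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # pct * n is exact for integer inputs, which avoids rounding surprises.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request times in milliseconds: mostly fast, a few slow outliers.
latencies_ms = [5.0] * 90 + [16.0] * 5 + [400.0] * 4 + [1300.0]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"median={p50} ms, 95th={p95} ms, 99th={p99} ms")
```

With a distribution shaped like this, a handful of slow requests shows up only in the highest percentiles while nearly everything else stays fast.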

In my experience, the time it takes for the browser to receive the
search result page and render it is a significant part of the total time
to see results, and often dwarfs the time spent getting info from Solr.

Here are some numbers from Solr in my organization:

requests:               4102054
errors:                 364894
timeouts:               49
totalTime:              799446287.45041
avgRequestsPerSecond:   1.2375565828793849
5minRateReqsPerSecond:  0.8444329508327961
15minRateReqsPerSecond: 0.8631197328073346
avgTimePerRequest:      194.88926460997587
medianRequestTime:      20.8566605
75thPcRequestTime:      85.51328849999999
95thPcRequestTime:      2202.277466549999
99thPcRequestTime:      5280.375381280002
999thPcRequestTime:     6866.020122961001

The numbers above come from a distributed index that contains 167
million documents and takes up about 200GB of disk space across two
machines.

requests:               192683
errors:                 124
timeouts:               0
totalTime:              199380421.985073
avgRequestsPerSecond:   0.042222722771354554
5minRateReqsPerSecond:  0.00800545427600684
15minRateReqsPerSecond: 0.017521222412364163
avgTimePerRequest:      1034.7587591280653
medianRequestTime:      541.591858
75thPcRequestTime:      1683.83246125
95thPcRequestTime:      5644.542019949997
99thPcRequestTime:      9445.592394760004
999thPcRequestTime:     14602.166640771007

These numbers are from an index with about 394 million documents, taking
up nearly 500GB of disk space.  This index is also distributed on
multiple machines.
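The derived figures in both listings can be cross-checked from the raw counters. This is a small sketch that assumes totalTime is in milliseconds and that avgTimePerRequest is simply totalTime divided by requests (the numbers above are consistent with that assumption). The dictionary keys are labels made up for this example.

```python
# Raw counters copied from the two listings above.
stats = {
    "167M-doc index": {
        "requests": 4102054,
        "errors": 364894,
        "totalTime": 799446287.45041,
    },
    "394M-doc index": {
        "requests": 192683,
        "errors": 124,
        "totalTime": 199380421.985073,
    },
}

def summarize(counters):
    """Return (mean request time in ms, fraction of requests that errored)."""
    avg_ms = counters["totalTime"] / counters["requests"]
    error_rate = counters["errors"] / counters["requests"]
    return avg_ms, error_rate

for name, counters in stats.items():
    avg_ms, error_rate = summarize(counters)
    print(f"{name}: avg {avg_ms:.2f} ms/request, error rate {error_rate:.2%}")
```

In both listings the mean is well above the median (about 195 ms against 21 ms, and about 1035 ms against 542 ms), which is what you would expect when a small share of slow requests dominates the total time.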

Are you experiencing any problems other than what you perceive as slow
queries?  I asked some other questions on stackoverflow.  In particular,
I'd like to know the total memory on the server, the total number of
documents (maxDoc and numDoc) you're handling with this server, as well
as the total index size.  What do your queries look like?  What version
and vendor of Java are you using?  Can you share your config/schema?

A memory leak is very unlikely, unless your Java or your operating
system is broken.  I can't say for sure that it's not happening, but
it's just not something we see around here.

Here's what I have collected on performance issues in Solr.  This page
does mostly concern itself with memory, though it touches briefly on
other topics:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn

Re: Memory leak in Solr

Posted by Scott Blum <dr...@gmail.com>.

Are you sure it's an actual leak, not just memory pinned by caches?

Related: https://issues.apache.org/jira/browse/SOLR-9810

On Fri, Dec 2, 2016 at 2:01 PM, S G <sg...@gmail.com> wrote:

> Hi,
>
> This post shows some stats on Solr which indicate that there might be a
> memory leak in there.
>
> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
>
> Can someone please help to debug this?
> It might be a very good step in making Solr stable if we can fix this.
>
> Thanks
> SG
>

Re: Memory leak in Solr

Posted by Greg Harris <ha...@gmail.com>.

Hi,

All your stats show is that Solr has large memory requirements. There is no
direct mapping from the number of documents and queries to memory
requirements, as requested in that article. Different Solr projects can
yield extremely different requirements. If you want to understand your
memory usage better, you need to take a heap dump and analyze it with
something like Eclipse Memory Analyzer or YourKit. Taking the dump is
stop-the-world (STW), so you will have a little bit of downtime. In 4.10, my
first guess would be that your culprit is not using docValues for fields
that are faceted, grouped, or sorted on. That leaves you with a large
fieldCache and large memory requirements, which will not be cleaned up by a
GC because the entries are still "live objects". While I couldn't say that's
true for sure without more analysis, it's, IME, pretty common.

Greg


On Dec 2, 2016 11:01 AM, "S G" <sg...@gmail.com> wrote:

Hi,

This post shows some stats on Solr which indicate that there might be a
memory leak in there.

http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr

Can someone please help to debug this?
It might be a very good step in making Solr stable if we can fix this.

Thanks
SG

Re: Memory leak in Solr

Posted by Walter Underwood <wu...@wunderwood.org>.

We’ve been running Solr 4.10.4 in prod for a couple of years. There aren’t any obvious
memory leaks in it. It stays up for months.

Objects ejected from the cache will almost always be tenured, so that tends to cause 
full GCs.

If there are very few repeats in your query load, you’ll see a lot of cache ejections. 
This can also happen if you have an HTTP cache in front of the Solr hosts.
What are the hit rates on the Solr caches?

Also, are you using “NOW” in your queries? That will cause a very low hit rate
on the query result cache.
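A toy model of why this happens, assuming the cache is keyed on the query after NOW has been resolved to a millisecond timestamp. The LruCache class, the "timestamp" field name, and the request timings are all hypothetical and are not Solr's implementation. Rounding the resolved time to the day here stands in for Solr date math such as NOW/DAY.

```python
from collections import OrderedDict

class LruCache:
    """Minimal LRU cache that counts lookups and hits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def get_or_compute(self, key, compute):
        self.lookups += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        value = compute(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

def simulate(request_times_ms, round_to_ms):
    """Send one query per request time, resolving NOW to that time rounded
    down to a multiple of round_to_ms, and return the cache hit rate."""
    cache = LruCache(capacity=512)
    for now_ms in request_times_ms:
        resolved = now_ms - (now_ms % round_to_ms)
        key = f"timestamp:[{resolved} TO *]"  # hypothetical range query
        cache.get_or_compute(key, lambda k: "results")
    return cache.hit_rate()

# 1000 requests spread over a bit more than two minutes.
request_times = [1480705301000 + i * 137 for i in range(1000)]

exact_hit_rate = simulate(request_times, 1)             # NOW at full precision
rounded_hit_rate = simulate(request_times, 86_400_000)  # rounded to the day
print(f"exact NOW: {exact_hit_rate:.1%}, rounded to day: {rounded_hit_rate:.1%}")
```

In this model every full-precision NOW produces a distinct key, so nothing is ever reused, while the rounded form maps all requests within the same day to one cache entry.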

We can’t help without a lot more information, like your search architecture, the 
search collections, the query load, cache sizes, etc.

Finally, this is not a question for the dev list. This belongs on solr-user, so I’m
dropping the reply to the dev list.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 2, 2016, at 11:01 AM, S G <sg...@gmail.com> wrote:
> 
> Hi,
> 
> This post shows some stats on Solr which indicate that there might be a memory leak in there.
> 
> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
> 
> Can someone please help to debug this?
> It might be a very good step in making Solr stable if we can fix this.
> 
> Thanks
> SG