Posted to solr-user@lucene.apache.org by Maulin Rathod <mr...@asite.com> on 2018/02/07 12:20:03 UTC

RE: Long GC Pauses

Hi Erick,



Thanks for your response. The pauses do show up in the Solr GC logs (see the GC log excerpt below, which shows a 138.4138211 sec pause).



It seems that some bad queries cause high memory allocation.

Analyzing the issue further, we found that asking for too many rows (e.g. rows=10000000) can cause full GC problems, as mentioned in the links below. Some of our queries were asking for far too many rows, so for now we have reduced the number of rows requested. After this change we no longer see any large GC pauses (the maximum GC pause is 3-4 seconds), so the issue seems resolved for now. Still, please let me know if you have any other suggestions that could help us understand the root cause.


https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/





Solr GC Log

==========



2018-01-04T12:13:40.346+0000: 140501.570: [CMS-concurrent-abortable-preclean-start]

{Heap before GC invocations=8909 (full 1001):

par new generation   total 10922688K, used 4187699K [0x0000000080000000, 0x00000003a0000000, 0x00000003a0000000)

  eden space 8738176K,  41% used [0x0000000080000000, 0x000000015fb1f6d8, 0x0000000295560000)

  from space 2184512K,  23% used [0x0000000295560000, 0x00000002b53cd8b8, 0x000000031aab0000)

  to   space 2184512K,   0% used [0x000000031aab0000, 0x000000031aab0000, 0x00000003a0000000)

concurrent mark-sweep generation total 39321600K, used 38082444K [0x00000003a0000000, 0x0000000d00000000, 0x0000000d00000000)

Metaspace       used 46676K, capacity 47372K, committed 50136K, reserved 51200K

2018-01-04T12:13:40.393+0000: 140501.618: [GC (Allocation Failure) 140501.618: [CMS2018-01-04T12:13:40.534+0000: 140501.766: [CMS-concurrent-abortable-preclean: 0.149/0.196 secs] [Times: user=0.41 sys=0.00, real=0.19 secs]

 (concurrent mode failure): 38082444K->18127660K(39321600K), 138.3709254 secs] 42270144K->18127660K(50244288K), [Metaspace: 46676K->46676K(51200K)], 138.3746435 secs] [Times: user=12.75 sys=22.89, real=138.38 secs]

Heap after GC invocations=8910 (full 1002):

par new generation   total 10922688K, used 0K [0x0000000080000000, 0x00000003a0000000, 0x00000003a0000000)

  eden space 8738176K,   0% used [0x0000000080000000, 0x0000000080000000, 0x0000000295560000)

  from space 2184512K,   0% used [0x0000000295560000, 0x0000000295560000, 0x000000031aab0000)

  to   space 2184512K,   0% used [0x000000031aab0000, 0x000000031aab0000, 0x00000003a0000000)

concurrent mark-sweep generation total 39321600K, used 18127660K [0x00000003a0000000, 0x0000000d00000000, 0x0000000d00000000)

Metaspace       used 46676K, capacity 47372K, committed 50136K, reserved 51200K

}

2018-01-04T12:15:58.772+0000: 140639.993: Total time for which application threads were stopped: 138.4138211 seconds, Stopping threads took: 0.0369886 seconds







Regards,



Maulin





-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 31 January 2018 22:47
To: solr-user <so...@lucene.apache.org>
Subject: Re: Long GC Pauses



Just to double check: when you say you're seeing 60-200 sec GC pauses, are you looking at the GC logs (or using some kind of monitor), or is that the time it takes the query to respond to the client? Because a single GC pause that long on 40G is unusual no matter what. Another take on Jason's question: for all the JVMs you're running, how much _total_ heap is allocated?

And how much physical memory is on the box? I generally start with _at least_ half the memory left to the OS....



These are fairly horrible, what generates such queries?

AND * AND *



Best,

Erick







On Wed, Jan 31, 2018 at 7:28 AM, Jason Gerlowski <ge...@gmail.com> wrote:

> Hi Maulin,

>

> To clarify, when you said "...allocated 40 GB RAM to each shard."

> above, I'm going to assume you meant "to each node" instead.  If you

> actually did mean "to each shard" above, please correct me and anyone

> who chimes in afterward.

>

> Firstly, it's really hard to even take guesses about potential causes

> or remediations without more details about your load characteristics

> (average/peak QPS, index size, average document size, etc.).  If no

> one gives any satisfactory advice, please consider uploading

> additional details to help us help you.

>

> Secondly, I don't know anything about the load characteristics you're

> putting on your Solr cluster, but I'm curious whether you've

> experimented with lower RAM settings.  Generally speaking, the more

> RAM you have, the longer your GC pauses are likely to be (even with

> the tuning that various GC settings provide).  If you can get away

> with giving the Solr process less RAM, you should see your GC pauses

> shrink.  Was 40GB chosen after some trial-and-error experimentation, or is it something you could investigate?

>

> For a bit more overview on this, see this slightly outdated (but still

> useful) wiki page:

> https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

>

> Hope that helps, even if just to disqualify some potential

> causes/solutions to close in on a real fix.

>

> Best,

>

> Jason

>

> On Wed, Jan 31, 2018 at 8:17 AM, Maulin Rathod <mr...@asite.com> wrote:

>

>> Hi,

>>

>> We are using Solr Cloud 6.1. We have around 20 collections on 4 nodes

>> (we have 2 shards and each shard has 2 replicas). We have allocated

>> 40 GB RAM to each shard.

>>

>> Intermittently we found long GC pauses (60 sec to 200 sec) due to

>> which Solr stops responding and hence the collections go into recovery

>> mode. It takes a minimum of 5-10 minutes (sometimes more, and we

>> have to restart the Solr node) to recover all collections. We are

>> using the default GC settings (CMS) as per solr.cmd.

>>

>> We tried G1 GC to see if it helps, but we still see long GC

>> pauses (60 sec to 200 sec) and also found that memory usage is higher

>> in the case of G1 GC.

>>

>> What could be the reason for the long GC pauses, and how can we fix them?

>> Insufficient memory, a problem with the GC settings, or something else? Any

>> suggestion would be greatly appreciated.

>>

>> In our analysis, we also found some inefficient queries (which use *

>> many times in the query) in the Solr logs. Could they be the reason for high memory usage?

>>

>> Slow Query

>> --------------

>>

>> INFO  (qtp1239731077-498778) [c:documents s:shard1 r:core_node1

>> x:documents] o.a.s.c.S.Request [documents]  webapp=/solr path=/select

>> params={df=summary&distrib=false&fl=id&shards.purpose=4&

>> start=0&fsv=true&sort=description+asc,id+desc&fq=&shard.url=

>> s1.asite.com:8983/solr/documents|s1r1.asite.com:

>> 8983/solr/documents&rows=250&version=2&q=((id:(

>> REV78364_24705418+REV78364_24471492+REV78364_24471429+

>> REV78364_24470771+REV78364_24470271+))+OR+summary:((HPC*+

>> AND+*+AND+*+AND+OH1150*+AND+*+AND+*+AND+U0*+AND+*+AND+*+AND+

>> HGS*+AND+*+AND+*+AND+MDL*+AND+*+AND+*+AND+100067*+AND+*+AND+

>> -*+AND+Reinforcement*+AND+*+AND+Mode*)+))++AND++(title:((*

>> HPC\+\-\+OH1150\+\-\+U0\+\-\+HGS\+\-\+MDL\+\-\+100067\+-\+

>> Reinforcement\+Mode*)+))+AND+project_id:(-2+78243+78365+

>> 78364)+AND+is_active:true+AND+((isLatest:(true)+AND+

>> isFolderActive:true+AND+isXref:false+AND+-document_

>> type_id:(3+7)+AND+((is_public:true+OR+distribution_list:

>> 4858120+OR+folderadmin_list:4858120+OR+author_user_id:

>> 4858120)+AND+((defaultAccess:(true)+OR+allowedUsers:(

>> 4858120)+OR+allowedRoles:(6342201+172408+6336860)+OR+

>> combinationUsers:(4858120))+AND+-blockedUsers:(4858120))))

>> +OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(

>> 4858120)+AND+-folderadmin_list:(4858120)))&shards.tolerant=true&NOW=

>> 1516786982952&isShard=true&wt=javabin} hits=0 status=0 QTime=83309

>>

>>

>>

>>

>> Regards,

>>

>> Maulin

>>


>>

>>

Re: Long GC Pauses

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/7/2018 8:08 AM, Shawn Heisey wrote:
> If your queries are producing the correct results,
> then I will tell you that the "summary" part of your query example is 
> quite possibly completely unnecessary

After further thought, I have concluded that this part of what I said is 
probably completely wrong.

But I do not think that you need *any* of the fourteen bare wildcards 
that are in your query.

Thanks,
Shawn

Re: Long GC Pauses

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/7/2018 5:20 AM, Maulin Rathod wrote:
> Analyzing the issue further, we found that asking for too many rows (e.g. rows=10000000) can cause full GC problems [...]

This is because when you ask for 10 million rows, Solr allocates a 
memory structure capable of storing information for each of those 10 
million rows, even before it knows how many documents are actually going 
to match the query.  This problem is mentioned in Toke's blog post that you 
linked.
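
As a rough illustration of the alternative, here is a minimal SolrJ sketch of bounded paging instead of one giant rows value.  The URL, collection name, query, and cap are placeholders for this example, not taken from your setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class BoundedPaging {
    public static void main(String[] args) throws Exception {
        // Placeholder node URL and collection; substitute your own.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
            final int pageSize = 250;   // bounded page size, instead of rows=10000000
            int start = 0;
            long numFound;
            do {
                SolrQuery q = new SolrQuery("is_active:true");   // illustrative query
                q.setStart(start);
                q.setRows(pageSize);    // Solr sizes its result structures from start+rows
                QueryResponse rsp = client.query(q);
                numFound = rsp.getResults().getNumFound();
                for (SolrDocument doc : rsp.getResults()) {
                    // process each document here
                }
                start += pageSize;
            } while (start < numFound && start < 100_000);  // cap shallow paging; deep paging
                                                            // is better done with cursorMark
        }
    }
}

For genuinely deep result sets, cursorMark (discussed elsewhere in this thread) is the better tool.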

Bare wildcard queries can also lead to big problems with memory churn, 
and are not recommended.  Your query has a bare "*" included in it 
FOURTEEN times, on the summary field.  The name of that field suggests 
that it will have a very high term count.  If it does have a lot of 
unique terms, then ONE wildcard is going to be horrifically slow and 
consume a ton of memory.  Fourteen of them is going to be particularly 
insane.  You've also got a number of wildcards with text prefixes, which 
will not be as bad as the bare wildcard, but can still chew up a lot of 
memory and time.

I suspect that entire "summary" part of your query generation needs to 
be reworked.

You also have wildcards in the part of the query on the "title" field.

The kind of query you do with wildcards can often be completely replaced 
with ngram or edgengram filtering on the analysis chain, usually with a 
big performance advantage.
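
As a rough sketch of what that buys you (assuming you added a copy field, called summary_edge here purely for this example, analyzed with EdgeNGramFilterFactory at index time), each prefix match becomes an ordinary term lookup and the bare wildcards can simply be dropped:

import org.apache.solr.client.solrj.SolrQuery;

public class EdgeNgramQuerySketch {
    public static void main(String[] args) {
        // The logged query mixes prefix wildcards with bare "*" clauses, along the lines of:
        String wildcardStyle =
            "summary:(HPC* AND * AND * AND OH1150* AND * AND U0* AND Reinforcement* AND Mode*)";

        // With a hypothetical "summary_edge" field indexed through EdgeNGramFilterFactory,
        // each prefix is an ordinary term (no wildcard expansion at query time), and the
        // bare "*" clauses, which match every document anyway, are dropped entirely.
        String edgeNgramStyle =
            "summary_edge:(HPC AND OH1150 AND U0 AND Reinforcement AND Mode)";

        SolrQuery q = new SolrQuery(edgeNgramStyle);
        q.setRows(250);
        System.out.println(q);   // prints the encoded request parameters
    }
}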

I suspect that the large number of wildcards is a big part of why your 
example query took 83 seconds to execute.  There may have also been some 
nasty GC pauses during the query.

You still have not answered the questions asked early in this thread 
about memory.  Is the heap 40GB, or is that the total memory installed 
in the server?  What is the total size of all Solr heaps on the machine, 
how much total memory is in the server, and how much index data (both 
document count and disk space size) is being handled by all the Solr 
instances on that machine?

The portion of your GC log that you included is too short, and has also 
been completely mangled by being pasted into an email.  If you want it 
analyzed, we will need a full copy of the logfile, without any 
modification, which likely means you need to use a file sharing site to 
transport it.

What I *can* decipher from your GC log suggests that your heap size may 
actually be 48GB, not 40GB.  After the big GC event, there was a little 
over 17GB of heap memory allocations remaining.  So my first bit of 
advice is to try reducing the heap size.  Without a large GC log, my 
current thought is to make it half what it currently is -- 24GB.  With a 
more extensive GC log, I could make a more accurate recommendation.  My 
second bit of advice would be to eliminate as many wildcards from your 
query as you can.  If your queries are producing the correct results, 
then I will tell you that the "summary" part of your query example is 
quite possibly completely unnecessary, and is going to require a LOT of 
memory.



Additional advice, not really related to the main discussion:

Some of the query looks like it is a perfect candidate for extraction 
into filter queries.  Any portion of the query that is particularly 
static is probably going to benefit from being changed into a filter 
query.  Possible filters you could use based on what I see:

fq=isFolderActive:true
fq=isXref:false
fq=*:* -document_type_id:(3 7)

If your index activity is well-suited for heavy filterCache usage, 
filters like this can achieve incredible speedups.

A lot of the other things in the query appear to be for ID values that 
are likely to change for every user.  Query clauses like that are not a 
good fit for filter queries.
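
A SolrJ sketch of that split follows.  The URL is a placeholder and the exact clause grouping is only illustrative, based on the field names visible in your logged query:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterQuerySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder node URL and collection; substitute your own.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
            SolrQuery q = new SolrQuery();

            // Static clauses go into filter queries, so they can be cached in the
            // filterCache and reused across requests (they do not affect scoring).
            q.addFilterQuery("is_active:true");
            q.addFilterQuery("isFolderActive:true");
            q.addFilterQuery("isXref:false");
            q.addFilterQuery("*:* -document_type_id:(3 7)");

            // User-specific clauses change on every request, so they stay in the main
            // query rather than polluting the filterCache.  This combination is made up
            // for illustration only.
            q.setQuery("summary:(HPC OH1150 U0) AND (distribution_list:4858120 OR author_user_id:4858120)");
            q.setRows(250);

            QueryResponse rsp = client.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}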

Thanks,
Shawn

Re: Long GC Pauses

Posted by Ere Maijala <er...@helsinki.fi>.
Hi Maulin,

I'll chime in by referring to my own findings when analyzing Solr 
performance: 
https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html

Yonik has a good article about paging: 
http://yonik.com/solr/paging-and-deep-paging/. While it's about deep 
paging, the same issues affect searches that request too many rows. We 
have switched to cursorMark when needing a large set of records and 
found that it, while consuming less memory, also performs much better 
when fetching something like 100 000 records. We also limited paging in our 
UI to the first 100 000 hits, which avoids the intermittent issues experienced 
when a user tried to e.g. go to the last result page of a really large result 
set. Unfortunately there's no mechanism that would allow one to present 
a large result set completely while allowing jumping to a random page.
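
A simplified SolrJ sketch of cursorMark paging (the URL, collection and query are placeholders; the sort must include the uniqueKey field, assumed here to be "id"):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder node URL and collection; substitute your own.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
            SolrQuery q = new SolrQuery("is_active:true");      // illustrative query
            q.setRows(1000);                                    // modest page per round trip
            q.setSort(SolrQuery.SortClause.asc("id"));          // sort must include the uniqueKey

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;   // "*"
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = client.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // process each document here
                }
                String next = rsp.getNextCursorMark();
                done = cursorMark.equals(next);    // same mark returned means no more results
                cursorMark = next;
            }
        }
    }
}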

--Ere

Maulin Rathod wrote on 7.2.2018 at 14.20:
> Analyzing the issue further, we found that asking for too many rows (e.g. rows=10000000) can cause full GC problems. Some of our queries were asking for far too many rows, so for now we have reduced the number of rows requested. After this change we no longer see any large GC pauses (the maximum GC pause is 3-4 seconds).
> [...]

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland