Posted to solr-user@lucene.apache.org by Karol Grzyb <gr...@gmail.com> on 2020/10/06 11:43:04 UTC

Java GC issue investigation

Hi,

I'm involved in investigating an issue with huge GC overhead that
occurs during performance tests on Solr nodes. The Solr version is
6.1. The last tests were done on the staging environment, and we ran
into problems at <100 requests/second.

The index itself is ~200MB (~50K docs).
The index receives small updates every 15 minutes.



Queries involve sorting and faceting.

I've gathered some heap dumps, and I can see from them that most of
the heap memory is retained by objects of the following classes:

- org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector (>4G, 91% of heap)
- org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
- org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
- org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector (>3.7G, 76% of heap)



Based on the information above, is there anything generic that could
be looked at as a source of potential improvement without diving
deeply into the schema and queries (which may be very difficult to
change at this moment)? I don't see docValues being enabled - could
enabling them help? If I read the docs correctly, they are
specifically helpful when there is a lot of sorting, grouping, and
faceting.
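
For reference, enabling docValues would be a per-field change in
schema.xml; a minimal sketch (the field name "category" below is a
made-up example, not from our actual schema) would look like:

    <field name="category" type="string" indexed="true" stored="true" docValues="true"/>

As far as I understand, changing docValues on an existing field also
requires a full reindex.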

Additionally, I see that many threads are blocked on LRUCache.get;
should I recommend switching to FastLRUCache?
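
In case it matters, I believe that switch would be a solrconfig.xml
change along these lines (the sizes below are placeholders, not our
real values):

    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>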

Also, I wonder whether -Xmx12288m for the Java heap is too much on a
16G machine? I see some page faults (~5/s) in Dynatrace during the
heaviest traffic.
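
(For completeness, the OS-level memory picture could be cross-checked
with standard tools, for example:

    free -h     # RAM left for the OS page cache beside the 12G heap
    sar -B 5 5  # paging activity, including faults/s and major faults/s

The ~5/s figure above comes from Dynatrace, not from these.)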

Thank you very much for any help,
Kind regards,
Karol

Re: Java GC issue investigation

Posted by Walter Underwood <wu...@wunderwood.org>.
First thing is to stop using CMS and use G1GC.

We’ve been using these settings with over a hundred machines
in prod for nearly four years.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

Re: Java GC issue investigation

Posted by Karol Grzyb <gr...@gmail.com>.
Hi Matthew, Erick!

Thank you very much for the feedback; I'll try to convince them to
reduce the heap size.

current GC settings:

-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90

Kind regards,
Karol


Re: Java GC issue investigation

Posted by Erick Erickson <er...@gmail.com>.
12G is not that huge, it’s surprising that you’re seeing this problem.

However, there are a couple of things to look at:

1> If you’re saying that you have 16G total physical memory and are allocating 12G to Solr, that’s an anti-pattern. See: 
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If at all possible, you should allocate between 25% and 50% of your physical memory to Solr...

2> What garbage collector are you using? G1GC might be a better choice; a rough sketch covering both points follows.
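
Assuming a 16G host and the standard bin/solr scripts, something in
this range in solr.in.sh would address both points (treat the numbers
as starting points, not a recommendation tuned to your queries):

    SOLR_HEAP=4g   # roughly 25% of 16G physical RAM, leaving the rest for the OS page cache
    GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250"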


Re: Java GC issue investigation

Posted by matthew sporleder <ms...@gmail.com>.
Your index is so small that it should easily get cached into OS memory
as it is accessed.  Having a too-big heap is a known problem
situation.

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?

Re: Java GC issue investigation

Posted by Karol Grzyb <gr...@gmail.com>.
Hi Matthew,

Thank you for the answer. I cannot reproduce the setup locally, but
I'll try to convince them to reduce Xmx; I guess they won't agree to
1GB, but certainly to something less than 12G.
I'd also like to get a proper dev setup, because for now we can only
test prod or stage, which are difficult to adjust.

Is getting stuck in GC common behaviour under heavier load when the
index is small compared to the available heap? I was more worried
about the ratio of heap to total host memory.

Regards,
Karol


Re: Java GC issue investigation

Posted by matthew sporleder <ms...@gmail.com>.
You have a 12G heap for a 200MB index?  Can you just try changing Xmx
to, like, 1g?
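
For example, assuming the stock start scripts, something like:

    bin/solr stop -all
    bin/solr start -m 1g

(or setting SOLR_HEAP=1g in solr.in.sh) should be enough to test it.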
