Posted to solr-user@lucene.apache.org by Patrick O'Lone <po...@townnews.com> on 2013/11/26 17:59:01 UTC

Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while with
periodic stalls of Solr 3.6.1. I'm running into a wall on ideas to try
and thought I might get some insight from others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core
machine with 40GB of RAM. I have about 25GB of index data that is
replicated to this server every 5 minutes. It's taking about 200
connections per second and roughly every 5-10 minutes it will stall for
about 30 seconds to a minute. The stall causes the load to go to as high
as 90. It is all CPU bound in user space - all cores go to 99%
utilization (spinlock?). When doing a thread dump, the following line is
blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (
FieldCacheImpl.java:230 )

Looking at the source code in 3.6.1, that is a call into a
synchronized block, which serializes all threads and causes the
backlog. I've tried to correlate these events with the replication
events - but even with replication disabled, this still happens. We run
multiple data centers using Solr, and comparing garbage collection
between them, I noted that the old generation is collected very
differently in this data center versus the others. Here the old
generation is collected in one massive collection event (several
gigabytes' worth) - the other data center is more sawtoothed and
collects only 500MB-1GB at a time. Here are my parameters to java (the
same in all environments):

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start
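The thread-dump symptom above (every Tomcat thread blocked in FieldCacheImpl$Cache.get) comes from a synchronized lookup-and-populate step. As a rough sketch only - this is not Lucene's actual code, and createValue() here is a trivial stand-in for the expensive field un-inversion Lucene performs - the pattern looks like this:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the synchronized-cache pattern behind
// FieldCacheImpl$Cache.get in Lucene 3.6.1 (illustrative, not the real code).
// One global lock guards both lookup and population, so while one thread is
// inside the expensive createValue() step, every other search thread that
// needs any cached field stalls on the same monitor.
class SyncCache {
    private final Map<String, long[]> cache = new HashMap<String, long[]>();

    // Hypothetical stand-in for value creation; in Lucene this un-inverts
    // a field for sorting/faceting and can take seconds on a large index.
    private long[] createValue(String field) {
        return new long[] { field.length() };
    }

    public long[] get(String field) {
        synchronized (cache) {          // global lock: all readers serialize here
            long[] v = cache.get(field);
            if (v == null) {
                v = createValue(field); // lock is held while populating
                cache.put(field, v);
            }
            return v;
        }
    }
}

public class Main {
    public static void main(String[] args) {
        SyncCache c = new SyncCache();
        // repeated gets return the same cached instance
        if (c.get("start_time") != c.get("start_time")) throw new AssertionError();
        System.out.println("ok");
    }
}
```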

I've tried a few GC option changes from this (we've been running this
way for a couple of years now) - primarily removing CMS incremental
mode, since we have 8 cores and remarks on the internet suggest it is
only for smaller SMP setups. Removing incremental mode did not fix
anything.

I've considered that the heap is way too large (30GB out of 40GB) and
may not leave enough memory for mmap operations (MMap appears to be
used by the field cache). Based on active memory utilization in Java,
it seems like I could safely reduce it to 22GB - but I'm not sure that
will help with the CPU issues.

I think the field cache is used for sorting and faceting. I've started
to investigate facet.method, but from what I can tell, this doesn't
influence sorting at all - only facet queries. I've tried setting
useFilterForSortQuery, and it seems to require less field cache, but it
doesn't address the stalling issues.
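For reference - and as an assumption worth double-checking against your Solr version - the solrconfig.xml element is spelled useFilterForSortedQuery (with "Sorted") and lives in the <query> section:

```xml
<!-- solrconfig.xml, inside the <query> section; element name assumed
     to be useFilterForSortedQuery in stock Solr 3.x -->
<query>
  <useFilterForSortedQuery>true</useFilterForSortedQuery>
</query>
```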

Is there something I am overlooking? Perhaps the system is becoming
oversubscribed in terms of resources? Thanks for any help that is offered.

-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrick O'Lone <po...@townnews.com>.
I initially thought this was the case as well. These are slave nodes
that receive updates every 5-10 minutes. However, this issue happens
even if replication is turned off and no update handler is provided at all.

I have confirmed against my data that simply querying the fq for a
start_time range takes 11-13 seconds to actually populate the cache. If
I make the fq not cache at all, my QTime rises by about 100ms, but the
stalling effect goes away. A purely negative query also seems to have
this effect, that is:

fq=-start_time:[NOW/MINUTE TO *]

But, I'm not sure if that is because it actually caches the negative
query or if it discards it entirely.
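For what it's worth, the non-cached variant described above can be written with Solr's local-params syntax (the cache=false local param exists in Solr 3.4+, so it should apply to 3.6.1; field name borrowed from this thread):

```
fq={!cache=false}start_time:[* TO NOW/MINUTE]
```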

> Patrick,
> 
> Are you getting these stalls following a commit? If so, then the
> issue is most likely fieldCache warming pauses. To stop your users
> from seeing this pause, you'll need to add static warming queries to
> your solrconfig.xml to warm the fieldCache before the new searcher is
> registered.
> 
> 
> On Mon, Dec 9, 2013 at 12:33 PM, Patrick O'Lone <polone@townnews.com> wrote:
> 
>     Well, I want to include everything that will start in the next 5
>     minute interval and everything that came before. The query is more
>     like:
> 
>     fq=start_time:[* TO NOW+5MINUTE/5MINUTE]
> 
>     so that it rounds to the nearest 5 minute interval on the
>     right-hand side. But as soon as 1 second after that 5 minute
>     window, everything pauses waiting for the filter cache (at least
>     that's my working theory based on observation). Is it possible to
>     do something like:
> 
>     fq=start_time:[* TO NOW+1DAY/DAY]&q=start_time:[* TO NOW/MINUTE]
> 
>     where it would use the filter cache to narrow down by day
>     resolution and then filter as part of the standard query, or
>     something like that?
> 
>     My thought is that this would still gain a benefit from a query
>     cache, but somewhat slower since it must remove results for things
>     appearing later in the day.
> 
>     > If you want a start time within the next 5 minutes, I think
>     > your filter is not the right one.
>     > The * will be replaced by the first date in your field.
>     >
>     > Try :
>     > fq=start_time:[NOW TO NOW+5MINUTE]
>     >
>     > Franck Brisbart
>     >
>     >
>     > On Monday, December 9, 2013 at 09:07 -0600, Patrick O'Lone wrote:
>     >> I have a new question about this issue - I create filter
>     >> queries of the form:
>     >>
>     >> fq=start_time:[* TO NOW/5MINUTE]
>     >>
>     >> This is used to restrict the set of documents to only items
>     >> that have a start time within the next 5 minutes. Most of my
>     >> indexes have millions of documents, with few documents that
>     >> start sometime in the future. Nearly all of my queries include
>     >> this; would this cause every other search thread to block until
>     >> the filter query is re-cached every 5 minutes, and if so, is
>     >> there a better way to do it? Thanks for any continued help with
>     >> this issue!
>     >>
>     >>> We have a webapp running with a very high HEAP size (24GB)
>     >>> and we have no problems with it AFTER we enabled the new GC
>     >>> that is meant to eventually replace the CMS GC - but you need
>     >>> Java 6 update "Some number I couldn't find but latest should
>     >>> cover" to be able to use it:
>     >>>
>     >>> 1. Remove all GC options you have and...
>     >>> 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/
>     >>>
>     >>> As a test, of course. You can read more in the following (and
>     >>> interesting) article; we also have Solr running with these
>     >>> options - no more pauses or HEAP size hitting the sky.
>     >>>
>     >>> Don't get bored reading the 1st (and small) introduction page
>     >>> of the article; pages 2 and 3 will make a lot of sense:
>     >>> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061
>     >>>
>     >>> HTH,
>     >>>
>     >>> Guido.
>     >>>
>     >>> On 26/11/13 21:59, Patrick O'Lone wrote:
>     >>>> We do perform a lot of sorting - on multiple fields, in fact.
>     >>>> We have different kinds of Solr configurations - our news
>     >>>> searches do little faceting, but sort heavily. We provide
>     >>>> classified ad searches, and those use faceting heavily. I
>     >>>> might try reducing the JVM memory and the perm generation
>     >>>> size as suggested earlier. It feels like a GC issue, and
>     >>>> loading the cache just happens to be the victim of a
>     >>>> stop-the-world event at the worst possible time.
>     >>>>
>     >>>>> My gut instinct is that your heap size is way too high. Try
>     >>>>> decreasing it to like 5-10G. I know you say it uses more
>     >>>>> than that, but that just seems bizarre unless you're doing
>     >>>>> something like faceting and/or sorting on every field.
>     >>>>>
>     >>>>> -Michael


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Joel Bernstein <jo...@gmail.com>.
Patrick,

Are you getting these stalls following a commit? If so, then the issue is
most likely fieldCache warming pauses. To stop your users from seeing this
pause, you'll need to add static warming queries to your solrconfig.xml to
warm the fieldCache before the new searcher is registered.
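A sketch of what such static warming queries might look like in solrconfig.xml - the q/sort values here are illustrative placeholders borrowed from elsewhere in this thread, not a recommendation for this index:

```xml
<!-- solrconfig.xml, inside the <query> section: run warming queries
     against a new searcher before it is registered, so the fieldCache
     for the sort field is populated off the user-facing request path -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">start_time desc</str>
    </lst>
  </arr>
</listener>
<!-- a firstSearcher listener with the same queries covers cold startup -->
```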





-- 
Joel Bernstein
Search Engineer at Heliosearch

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrick O'Lone <po...@townnews.com>.
Well, I want to include everything that will start in the next 5 minute
interval and everything that came before. The query is more like:

fq=start_time:[* TO NOW+5MINUTE/5MINUTE]

so that it rounds to the nearest 5 minute interval on the right-hand
side. But as soon as 1 second after that 5 minute window, everything
pauses waiting for the filter cache (at least that's my working theory
based on observation). Is it possible to do something like:

fq=start_time:[* TO NOW+1DAY/DAY]&q=start_time:[* TO NOW/MINUTE]

where it would use the filter cache to narrow down by day resolution and
then filter as part of the standard query, or something like that?

My thought is that this would still gain a benefit from a query cache,
but somewhat slower since it must remove results for things appearing
later in the day.
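One hedged way to sketch that coarse/fine split is a cached coarse filter plus a non-cached fine one (the cache=false local param exists in Solr 3.4+; untested against this setup):

```
# coarse filter: cached, and its rounded NOW/DAY+1DAY value is stable all day
fq=start_time:[* TO NOW/DAY+1DAY]
# fine filter: not cached, so there is no 5-minute re-warm stall
fq={!cache=false}start_time:[* TO NOW/MINUTE]
```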

> If you want a start time within the next 5 minutes, I think your
> filter is not the right one.
> The * will be replaced by the first date in your field.
> 
> Try :
> fq=start_time:[NOW TO NOW+5MINUTE]
> 
> Franck Brisbart

-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by fbrisbart <fb...@bestofmedia.com>.
If you want a start time within the next 5 minutes, I think your filter
is not the right one.
The * will be replaced by the first date in your field.

Try :
fq=start_time:[NOW TO NOW+5MINUTE]

Franck Brisbart
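One caveat with NOW-based filters, for what it's worth: an unrounded NOW differs on every request, so such a filter can never be reused from the filter cache. Rounding both endpoints keeps the value stable (here, for up to a minute):

```
fq=start_time:[NOW/MINUTE TO NOW/MINUTE+5MINUTES]
```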


On Monday, December 9, 2013 at 09:07 -0600, Patrick O'Lone wrote:
> I have a new question about this issue - I create filter queries of
> the form:
> 
> fq=start_time:[* TO NOW/5MINUTE]
> 
> This is used to restrict the set of documents to only items that have a
> start time within the next 5 minutes. Most of my indexes have millions
> of documents with few documents that start sometime in the future.
> Nearly all of my queries include this; would this cause every other
> search thread to block until the filter query is re-cached every 5
> minutes, and if so, is there a better way to do it? Thanks for any
> continued help with this issue!
> 
> > We have a webapp running with a very high HEAP size (24GB) and we have
> > no problems with it AFTER we enabled the new GC that is meant to replace
> > sometime in the future the CMS GC, but you have to have Java 6 update
> > "Some number I couldn't find but latest should cover" to be able to use:
> > 
> > 1. Remove all GC options you have and...
> > 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/
> > 
> > As a test of course, more information you can read on the following (and
> > interesting) article, we also have Solr running with these options, no
> > more pauses or HEAP size hitting the sky.
> > 
> > Don't get bored reading the 1st (and small) introduction page of the
> > article, page 2 and 3 will make lot of sense:
> > http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061
> > 
> > 
> > HTH,
> > 
> > Guido.
> > 
> > On 26/11/13 21:59, Patrick O'Lone wrote:
> >> We do perform a lot of sorting - on multiple fields in fact. We have
> >> different kinds of Solr configurations - our news searches do little
> >> with regards to faceting, but heavily sort. We provide classified ad
> >> searches and that heavily uses faceting. I might try reducing the JVM
> >> memory some and amount of perm generation as suggested earlier. It feels
> >> like a GC issue and loading the cache just happens to be the victim of a
> >> stop-the-world event at the worst possible time.
> >>
> >>> My gut instinct is that your heap size is way too high. Try
> >>> decreasing it to like 5-10G. I know you say it uses more than that,
> >>> but that just seems bizarre unless you're doing something like
> >>> faceting and/or sorting on every field.
> >>>
> >>> -Michael
> >>>
> >>> -----Original Message-----
> >>> From: Patrick O'Lone [mailto:polone@townnews.com]
> >>> Sent: Tuesday, November 26, 2013 11:59 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache
> >>>
> >>> I've been tracking a problem in our Solr environment for a while with
> >>> periodic stalls of Solr 3.6.1. I'm running up against a wall on ideas to
> >>> try and thought I might get some insight from some others on this list.
> >>>
> >>> The load on the server is normally anywhere between 1-3. It's an
> >>> 8-core machine with 40GB of RAM. I have about 25GB of index data that
> >>> is replicated to this server every 5 minutes. It's taking about 200
> >>> connections per second and roughly every 5-10 minutes it will stall
> >>> for about 30 seconds to a minute. The stall causes the load to go to
> >>> as high as 90. It is all CPU bound in user space - all cores go to
> >>> 99% utilization (spinlock?). When doing a thread dump, the following
> >>> line is blocked in all running Tomcat threads:
> >>>
> >>> org.apache.lucene.search.FieldCacheImpl$Cache.get (
> >>> FieldCacheImpl.java:230 )
> >>>
> >>> Looking at the source code in 3.6.1, that is a call into a
> >>> synchronized() block which blocks all threads and causes the backlog. I've
> >>> tried to correlate these events to the replication events - but even
> >>> with replication disabled - this still happens. We run multiple data
> >>> centers using Solr, and I was comparing garbage collection processes
> >>> between them. I noted that the old generation is collected very
> >>> differently on this data center versus others. Here, the old generation is
> >>> collected in one massive collection event (several gigabytes' worth); the
> >>> other data center shows a more sawtoothed pattern, collecting only
> >>> 500MB-1GB at a time. Here are my parameters to java (the same in all environments):
> >>>
> >>> /usr/java/jre/bin/java \
> >>> -verbose:gc \
> >>> -XX:+PrintGCDetails \
> >>> -server \
> >>> -Dcom.sun.management.jmxremote \
> >>> -XX:+UseConcMarkSweepGC \
> >>> -XX:+UseParNewGC \
> >>> -XX:+CMSIncrementalMode \
> >>> -XX:+CMSParallelRemarkEnabled \
> >>> -XX:+CMSIncrementalPacing \
> >>> -XX:NewRatio=3 \
> >>> -Xms30720M \
> >>> -Xmx30720M \
> >>> -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
> >>> -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
> >>> -Dcatalina.base=/usr/local/share/apache-tomcat \
> >>> -Dcatalina.home=/usr/local/share/apache-tomcat \
> >>> -Djava.io.tmpdir=/tmp \ org.apache.catalina.startup.Bootstrap start
> >>>
> >>> I've tried a few GC option changes from this (been running this way
> >>> for a couple of years now) - primarily removing CMS Incremental mode
> >>> as we have 8 cores and remarks on the internet suggest that it is
> >>> only for smaller SMP setups. Removing CMS did not fix anything.
> >>>
> >>> I've considered that the heap is way too large (30GB from 40GB) and
> >>> may not leave enough memory for mmap operations (MMap appears to be
> >>> used in the field cache). Based on active memory utilization in Java,
> >>> it seems like I might be able to reduce down to 22GB safely - but I'm
> >>> not sure if that will help with the CPU issues.
> >>>
> >>> I think field cache is used for sorting and faceting. I've started to
> >>> investigate facet.method, but from what I can tell, this doesn't seem
> >>> to influence sorting at all - only facet queries. I've tried setting
> >>> useFilterForSortQuery, and it seems to require less field cache but
> >>> doesn't address the stalling issues.
> >>>
> >>> Is there something I am overlooking? Perhaps the system is becoming
> >>> oversubscribed in terms of resources? Thanks for any help that is
> >>> offered.
> >>>
> >>> -- 
> >>> Patrick O'Lone
> >>> Director of Software Development
> >>> TownNews.com
> >>>
> >>> E-mail ... polone@townnews.com
> >>> Phone .... 309-743-0809
> >>> Fax ...... 309-743-0830
> >>>
> >>>
> >>
> > 
> > 
> 
> 



Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrick O'Lone <po...@townnews.com>.
Yeah, I tried G1, but it did not help - I don't think it is a garbage
collection issue. I've made various changes to iCMS as well and the
issue ALWAYS happens - no matter what I do. If I'm taking heavy traffic
(200 requests per second), then as soon as I hit a 5-minute mark the world
stops - garbage collection would be less predictable than that. Nearly all
of my requests have this 5-minute windowing behavior on time, which is
why it is my strongest suspect now. If requests block on that - even for a
couple of seconds - my traffic backlog will be 600-800 requests.
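If that suspicion is right, the failure mode would be a lock convoy rather than GC: one global lock guards the cache, one thread rebuilds the expired entry under that lock, and every other searcher thread queues up behind it. A sketch of that pattern - my own illustration, not Lucene's actual FieldCacheImpl:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustration of the suspected convoy: a single synchronized get-or-compute
// means the expensive rebuild happens while holding the one global lock.
public class GlobalLockCache {
    private final Map<String, Object> cache = new HashMap<>();

    public synchronized Object get(String key, Supplier<Object> loader) {
        Object v = cache.get(key);
        if (v == null) {
            v = loader.get();   // expensive un-invert / filter rebuild
            cache.put(key, v);  // every other caller blocks until this returns
        }
        return v;
    }

    // When the 5-minute boundary rolls over, the cache key changes
    // (a new upper bound on start_time), so the next request pays the
    // full rebuild cost while all concurrent requests wait on the lock.
}
```

At 200 requests/second, even a 3-4 second rebuild under that lock matches the observed 600-800 request backlog.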

> Did you add the Garbage collection JVM options I suggested you?
> 
> -XX:+UseG1GC -XX:MaxGCPauseMillis=50
> 
> Guido.
> 
> On 09/12/13 16:33, Patrick O'Lone wrote:
>> Unfortunately, in a test environment, this happens in version 4.4.0 of
>> Solr as well.


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Guido Medina <gu...@temetra.com>.
Did you add the Garbage collection JVM options I suggested you?

-XX:+UseG1GC -XX:MaxGCPauseMillis=50

Guido.

On 09/12/13 16:33, Patrick O'Lone wrote:
> Unfortunately, in a test environment, this happens in version 4.4.0 of
> Solr as well.


Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrick O'Lone <po...@townnews.com>.
Unfortunately, in a test environment, this happens in version 4.4.0 of
Solr as well.



-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Guido Medina <gu...@temetra.com>.
I was trying to locate the release notes for 3.6.x, but it is too old. If I
were you I would update to 3.6.2 (from 3.6.1) - it shouldn't affect you
since it is a minor release - then locate the release notes and see if
something that is affecting you got fixed. Also, I would be thinking of
moving on to 4.x, which is quite stable and fast.

Like anything with Java and concurrency, it will just get better (and 
faster) with bigger numbers and concurrency frameworks becoming more and 
more reliable, standard and stable.

Regards,

Guido.

On 09/12/13 15:07, Patrick O'Lone wrote:
> I have a new question about this issue - I create filter queries of
> the form:
>
> fq=start_time:[* TO NOW/5MINUTE]
>
> This is used to restrict the set of documents to only items that have a
> start time within the next 5 minutes. Most of my indexes have millions
> of documents with few documents that start sometime in the future.
> Nearly all of my queries include this, would this cause every other
> search thread to block until the filter query is re-cached every 5
> minutes and if so, is there a better way to do it? Thanks for any
> continued help with this issue!
>>>>
>>>> Is there something I am overlooking? Perhaps the system is becoming
>>>> oversubscribed in terms of resources? Thanks for any help that is
>>>> offered.
>>>>
>>>> -- 
>>>> Patrick O'Lone
>>>> Director of Software Development
>>>> TownNews.com
>>>>
>>>> E-mail ... polone@townnews.com
>>>> Phone .... 309-743-0809
>>>> Fax ...... 309-743-0830
>>>>
>>>>
>>
>


Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrick O'Lone <po...@townnews.com>.
I have a new question about this issue. I create filter queries of the
form:

fq=start_time:[* TO NOW/5MINUTE]

This restricts the set of documents to items whose start time has
already passed, rounded to a 5-minute boundary (so items starting in the
future become visible within 5 minutes of their start time). Most of my
indexes have millions of documents, with only a few whose start time is
in the future. Nearly all of my queries include this filter. Would this
cause every other search thread to block while the filter query is
re-cached every 5 minutes, and if so, is there a better way to do it?
Thanks for any continued help with this issue!
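For what it's worth, the date rounding is exactly what makes this filter cacheable, and also what invalidates it for every query at once. A small sketch of the rounding behavior (plain Python with made-up timestamps, not Solr code):

```python
from datetime import datetime, timedelta

def round_down(ts: datetime, minutes: int = 5) -> datetime:
    """Mimic Solr date math NOW/5MINUTE: truncate to the previous 5-minute boundary."""
    return ts.replace(second=0, microsecond=0) - timedelta(minutes=ts.minute % minutes)

# Within one 5-minute window every query produces the same filter string,
# so the filter cache gets a hit; at the boundary the value changes and
# the filter must be rebuilt (one expensive walk over the index).
a = round_down(datetime(2013, 11, 26, 17, 59, 1))
b = round_down(datetime(2013, 11, 26, 17, 57, 30))
c = round_down(datetime(2013, 11, 26, 18, 0, 0))

print(a)          # 2013-11-26 17:55:00
print(a == b)     # True  -> same cache entry, filter reused
print(a == c)     # False -> new cache entry, filter recomputed
```

The first request after each 5-minute boundary has to rebuild the cached filter while concurrent requests wait. Rounding coarser (e.g. NOW/HOUR) where the application can tolerate it, or marking the clause non-cached (e.g. with the {!cache=false} local param, where the Solr version supports it), are the usual ways to soften that spike.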

> We have a webapp running with a very high HEAP size (24GB) and we have
> no problems with it AFTER we enabled the new GC that is meant to replace
> sometime in the future the CMS GC, but you have to have Java 6 update
> "Some number I couldn't find but latest should cover" to be able to use:
> 
> 1. Remove all GC options you have and...
> 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/
> 
> As a test of course, more information you can read on the following (and
> interesting) article, we also have Solr running with these options, no
> more pauses or HEAP size hitting the sky.
> 
> Don't get bored reading the 1st (and small) introduction page of the
> article, page 2 and 3 will make lot of sense:
> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061
> 
> 
> HTH,
> 
> Guido.


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Guido Medina <gu...@temetra.com>.
We have a webapp running with a very high heap size (24GB) and we have 
had no problems with it since we enabled the new GC that is meant to 
eventually replace the CMS GC. You do need a recent Java 6 update (some 
number I couldn't find, but the latest should cover it) to be able to use it:

 1. Remove all the GC options you have and...
 2. Replace them with "-XX:+UseG1GC -XX:MaxGCPauseMillis=50"

As a test, of course. You can read more in the following (and 
interesting) article; we also run Solr with these options, with no more 
pauses or the heap size hitting the sky.

Don't get bored reading the 1st (and small) introduction page of the 
article; pages 2 and 3 will make a lot of sense: 
http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061

HTH,

Guido.
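For concreteness, swapping this suggestion into the startup command quoted later in the thread would look roughly like this (an untested sketch: only the GC flags change, everything else is from the original post):

```shell
/usr/java/jre/bin/java \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -server \
  -Dcom.sun.management.jmxremote \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=50 \
  -Xms30720M \
  -Xmx30720M \
  -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
  -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
  -Dcatalina.base=/usr/local/share/apache-tomcat \
  -Dcatalina.home=/usr/local/share/apache-tomcat \
  -Djava.io.tmpdir=/tmp \
  org.apache.catalina.startup.Bootstrap start
```

Note that -XX:NewRatio and the CMS flags are dropped, per the "remove all GC options" step; whether G1 is actually available depends on the exact Java 6 update in use.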

On 26/11/13 21:59, Patrick O'Lone wrote:
> We do perform a lot of sorting - on multiple fields in fact. We have
> different kinds of Solr configurations - our news searches do little
> with regards to faceting, but heavily sort. We provide classified ad
> searches and that heavily uses faceting. I might try reducing the JVM
> memory some and amount of perm generation as suggested earlier. It feels
> like a GC issue and loading the cache just happens to be the victim of a
> stop-the-world event at the worst possible time.


Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrick O'Lone <po...@townnews.com>.
We do perform a lot of sorting - on multiple fields, in fact. We have
different kinds of Solr configurations: our news searches do little
faceting but sort heavily, while our classified ad searches use faceting
heavily. I might try reducing the JVM memory some, and the size of the
perm generation, as suggested earlier. It feels like a GC issue, and
loading the cache just happens to be the victim of a stop-the-world
event at the worst possible time.

> My gut instinct is that your heap size is way too high. Try decreasing it to like 5-10G. I know you say it uses more than that, but that just seems bizarre unless you're doing something like faceting and/or sorting on every field.
> 
> -Michael


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

RE: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Michael Ryan <mr...@moreover.com>.
My gut instinct is that your heap size is way too high. Try decreasing it to like 5-10G. I know you say it uses more than that, but that just seems bizarre unless you're doing something like faceting and/or sorting on every field.

-Michael

-----Original Message-----
From: Patrick O'Lone [mailto:polone@townnews.com] 
Sent: Tuesday, November 26, 2013 11:59 AM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while, with periodic stalls of Solr 3.6.1. I'm running into a wall on ideas to try and thought I might get some insight from some others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core machine with 40GB of RAM. I have about 25GB of index data that is replicated to this server every 5 minutes. It's taking about 200 connections per second and roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The stall causes the load to go to as high as 90. It is all CPU bound in user space - all cores go to 99% utilization (spinlock?). When doing a thread dump, the following line is blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (
FieldCacheImpl.java:230 )

Looking at the source code in 3.6.1, that is a call into a synchronized() block, which blocks all other threads and causes the backlog. I've tried to correlate these events with replication events, but even with replication disabled this still happens. We run multiple data centers using Solr, and comparing garbage collection between them, I noted that the old generation is collected very differently in this data center versus the others: here it is collected in one massive collection event (several gigabytes' worth), while the other data center is more saw-toothed and collects only 500MB-1GB at a time. Here are my parameters to java (the same in all environments):
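The convoy described here can be sketched in miniature: a cache whose get() takes one global lock while a missing value is computed makes every concurrent reader queue behind the loader. A toy model (Python, hypothetical names, not the actual Lucene code):

```python
import threading

class GlobalLockCache:
    """Toy model of a cache guarded by a single lock, like the
    synchronized get() described above: while one thread populates a
    missing entry, every other reader blocks on the same lock."""

    def __init__(self, loader):
        self._lock = threading.Lock()
        self._data = {}
        self._loader = loader
        self.loads = 0  # how many times the expensive loader actually ran

    def get(self, key):
        with self._lock:  # ONE lock for the whole cache
            if key not in self._data:
                self.loads += 1
                # expensive population: all other readers wait here
                self._data[key] = self._loader(key)
            return self._data[key]

def expensive_loader(key):
    # stand-in for un-inverting a field into the field cache
    return sum(range(100_000))

cache = GlobalLockCache(expensive_loader)
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("start_time")))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()

print(cache.loads)        # 1 -> loader ran once; the other 7 threads blocked on the lock
print(len(set(results)))  # 1 -> every thread got the same cached value
```

Later Lucene versions reduced this contention (and DocValues eventually replaced the field cache), but on 3.6.1 the practical levers are the ones discussed in this thread: fewer and cheaper cache rebuilds, and GC tuning.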

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (been running this way for a couple of years now) - primarily removing CMS Incremental mode as we have 8 cores and remarks on the internet suggest that it is only for smaller SMP setups. Removing CMS did not fix anything.

I've considered that the heap is way too large (30GB out of 40GB) and may not leave enough memory for mmap operations (mmap appears to be used by the field cache). Based on active memory utilization in Java, it seems like I might be able to reduce it to 22GB safely, but I'm not sure if that will help with the CPU issues.

I think the field cache is used for sorting and faceting. I've started to investigate facet.method, but from what I can tell, this doesn't influence sorting at all, only facet queries. I've tried setting useFilterForSortQuery, and it seems to require less field cache, but it doesn't address the stalling issues.
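The heap-versus-page-cache worry above is easy to put numbers on. The figures are the ones from this thread; ignoring other processes and off-heap JVM overhead is my simplification:

```python
# Rough memory budget for the box described above (all figures in GB).
ram = 40
heap = 30          # current -Xms/-Xmx
index_size = 25    # on-disk index replicated to this server

# Whatever the JVM heap doesn't take is roughly what the OS has left for
# the page cache that backs mmap'd index files.
page_cache = ram - heap
coverage = page_cache / index_size

print(page_cache)          # 10
print(round(coverage, 2))  # 0.4 -> only ~40% of the index can stay cached

# With the proposed 22GB heap instead:
print(round((ram - 22) / index_size, 2))  # 0.72 -> ~72% of the index cacheable
```

So shrinking the heap to 22GB would nearly double the fraction of the index the OS can keep resident, which mainly helps I/O stalls; whether it helps the CPU-bound stalls is a separate question, as the post says.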

Is there something I am overlooking? Perhaps the system is becoming oversubscribed in terms of resources? Thanks for any help that is offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

RE: Solr 3.6.1 stalling with high CPU and blocking on field cache

Posted by Patrice Monroe Pustavrh <pa...@bisnode.si>.
I am not completely sure about this, but if I remember correctly (it has been more than a year since I did it, and I was lazy enough not to take notes :( ), it helped when I reduced the size of the permanent generation (somehow, more GC on a smaller perm gen, but that kind doesn't block the system, and it may prevent the really large GCs at the cost of more smaller ones). This is far from sound advice, though: it's just a distant memory, and I could have mixed things up (I've been doing many other things in between), so my advice could well be misleading. Also make sure your heap is still big enough; once you get below a reasonable value, nothing will help. 
P.S. If it worked for you, just let us know. 

Regards
Patrice Monroe Pustavrh, 
Software developer, 
Bisnode Slovenia d.o.o.
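For reference, the permanent-generation sizing described above is controlled on these pre-Java-8 HotSpot VMs with flags along these lines (the values here are illustrative assumptions, not recommendations):

```shell
# Illustrative values only -- measure your own perm gen occupancy first.
-XX:PermSize=128m \
-XX:MaxPermSize=256m \
```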

-----Original Message-----
From: Patrick O'Lone [mailto:polone@townnews.com] 
Sent: Tuesday, November 26, 2013 5:59 PM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while, with periodic stalls of Solr 3.6.1. I'm running into a wall on ideas to try and thought I might get some insight from some others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core machine with 40GB of RAM. I have about 25GB of index data that is replicated to this server every 5 minutes. It's taking about 200 connections per second and roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The stall causes the load to go to as high as 90. It is all CPU bound in user space - all cores go to 99% utilization (spinlock?). When doing a thread dump, the following line is blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (
FieldCacheImpl.java:230 )

Looking at the source code in 3.6.1, that is a call into a synchronized() block, which blocks all other threads and causes the backlog. I've tried to correlate these events with replication events, but even with replication disabled this still happens. We run multiple data centers using Solr, and comparing garbage collection between them, I noted that the old generation is collected very differently in this data center versus the others: here it is collected in one massive collection event (several gigabytes' worth), while the other data center is more saw-toothed and collects only 500MB-1GB at a time. Here are my parameters to java (the same in all environments):

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (been running this way for a couple of years now) - primarily removing CMS Incremental mode as we have 8 cores and remarks on the internet suggest that it is only for smaller SMP setups. Removing CMS did not fix anything.

I've considered that the heap is way too large (30GB out of 40GB) and may not leave enough memory for mmap operations (mmap appears to be used by the field cache). Based on active memory utilization in Java, it seems like I might be able to reduce it to 22GB safely, but I'm not sure if that will help with the CPU issues.

I think the field cache is used for sorting and faceting. I've started to investigate facet.method, but from what I can tell, this doesn't influence sorting at all, only facet queries. I've tried setting useFilterForSortQuery, and it seems to require less field cache, but it doesn't address the stalling issues.

Is there something I am overlooking? Perhaps the system is becoming oversubscribed in terms of resources? Thanks for any help that is offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... polone@townnews.com
Phone .... 309-743-0809
Fax ...... 309-743-0830

