Posted to users@solr.apache.org by Paul Russell <pa...@qflow.com> on 2021/10/26 12:10:43 UTC

SOLR Performance on RHEL 7

I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. Each SOLR
instance uses a 25G JVM on a RHEL 6 server configured with 64G of memory,
managing a 900G collection. Measured response time to queries averages about
100ms.

I am attempting to move the cluster to new RHEL 7 servers with the same
configuration (8 cores / 64G memory) and am having performance issues.

On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the CPU,
and query response times are being measured at 500-1000 ms.

I tried setting vm.swappiness to both 0 and 1 and have been
unable to change the behavior. If I trim the SOLR JVM to 16GB, response
times get better and GC logs show the JVM is operating correctly.

Has anyone else had a similar issue? I tried upgrading to SOLR 7.7.2
as part of the process, and that hasn't helped.

Any suggestions?
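For context, a back-of-envelope memory budget for the configuration described above (64G RAM, 25G heap, 900G collection; the ~2G OS allowance is an editorial assumption, not a figure from the thread):

```shell
#!/bin/sh
# Rough page-cache headroom for the setup above. Numbers are from the post;
# the 2G OS/daemon allowance is a guess.
TOTAL_GB=64
HEAP_GB=25
OS_GB=2                                   # assumed OS overhead
CACHE_GB=$((TOTAL_GB - HEAP_GB - OS_GB))
INDEX_GB=900
PCT=$((CACHE_GB * 100 / INDEX_GB))
echo "page cache headroom: ${CACHE_GB}GB (~${PCT}% of the ${INDEX_GB}GB index)"
# prints: page cache headroom: 37GB (~4% of the 900GB index)
```

With only a few percent of the index cacheable, the hot set has to be small for 100ms queries to be achievable, which is the point several replies below make.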

Re: SOLR Performance on RHEL 7

Posted by Paul Russell <pa...@qflow.com>.
Elaine,

Thanks for the feedback. That's troubling to hear that your issues with
SOLR and RHEL 7 weren't resolved. Perhaps I should pursue an alternate
O/S as well if needed.

I finally got the customer to add the 64GB SSD swap volume. The strange thing
is that with both the O/S tools and a Grafana node exporter running, I don't see
any usage of swap at all. All storage was switched to SSD and I only just
brought the RHEL 7 node into the cluster, so I don't know yet whether I have a
fix.

I appreciate hearing about the similarities. Tomorrow will tell the tale pretty quickly.

On Wed, Dec 8, 2021 at 6:13 PM Elaine Cario <et...@gmail.com> wrote:

> I'm late to the dance but FWIW, we also experienced some similar swap-like
> issues when we upgraded from Centos 7.6 to Centos 7.9 (this was Solr 8.3) -
> some of the Solr nodes would end up reading from disk like crazy, and query
> response times would suffer accordingly.  At one point we had 1/2 the nodes
> (with 1 set of replicas) on 7.6 and the other 1/2 (with 2nd replicas) on
> 7.9, and could see disk reads and io waits an order of magnitude higher on
> 7.9, with all other things being equal.
>
> We never really solved it: after countless weeks of testing
> various configurations, we threw up our hands and started migrating
> everything to Amazon Linux 2 (there were other reasons for that, but this
> was a definite driver).  We also have some servers still hosting RedHat
> Enterprise 7.9 so far without issues but these are also slated for
> migration in the coming weeks.
>
> On Tue, Oct 26, 2021 at 8:11 AM Paul Russell <pa...@qflow.com>
> wrote:
>
> > I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All
> SOLR
> > instances use a 25G JVM on the RHEL 6 server configured with 64G of
> memory
> > managing a 900G collection. Measured response time to queries average
> about
> > 100ms.
> >
> > I am attempting to move the cluster to new RHEL 7 servers with the same
> > configuration (8 cores/ 64G memory) and having performance issues.
> >
> > On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the
> CPU
> > and response time is being measured at 500-1000 ms for queries.
> >
> > I tried using the vm.swappiness setting at both 0 and 1 and have been
> > unable to change the behavior. If I trim the SOLR JVM to 16Gb response
> > times get better and GC logs show the JVM is operating correctly..
> >
> > Has anyone else had a similar issue? I have tried upgrading to SOLR 7.7.2
> > as part of the process and that hasn't helped.
> >
> > Any suggestions?
> >
>

Re: SOLR Performance on RHEL 7

Posted by Elaine Cario <et...@gmail.com>.
I'm late to the dance, but FWIW we also experienced similar swap-like
issues when we upgraded from CentOS 7.6 to CentOS 7.9 (this was Solr 8.3):
some of the Solr nodes would end up reading from disk like crazy, and query
response times would suffer accordingly.  At one point we had half the nodes
(with one set of replicas) on 7.6 and the other half (with the second replicas) on
7.9, and could see disk reads and I/O waits an order of magnitude higher on
7.9, all other things being equal.

We never really solved it: after countless weeks of testing
various configurations, we threw up our hands and started migrating
everything to Amazon Linux 2 (there were other reasons for that, but this
was a definite driver).  We also have some servers still on Red Hat
Enterprise Linux 7.9, so far without issues, but these too are slated for
migration in the coming weeks.

On Tue, Oct 26, 2021 at 8:11 AM Paul Russell <pa...@qflow.com> wrote:

> I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All SOLR
> instances use a 25G JVM on the RHEL 6 server configured with 64G of memory
> managing a 900G collection. Measured response time to queries average about
> 100ms.
>
> I am attempting to move the cluster to new RHEL 7 servers with the same
> configuration (8 cores/ 64G memory) and having performance issues.
>
> On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the CPU
> and response time is being measured at 500-1000 ms for queries.
>
> I tried using the vm.swappiness setting at both 0 and 1 and have been
> unable to change the behavior. If I trim the SOLR JVM to 16Gb response
> times get better and GC logs show the JVM is operating correctly..
>
> Has anyone else had a similar issue? I have tried upgrading to SOLR 7.7.2
> as part of the process and that hasn't helped.
>
> Any suggestions?
>

Re: SOLR Performance on RHEL 7

Posted by dmitri maziuk <dm...@gmail.com>.
On 2021-10-26 10:24 AM, Shawn Heisey wrote:
...
> I don't think swap is the problem.  Disabling swap entirely would be a 
> good test to confirm.  For general server use cases, I would not 
> recommend that action, but for dedicated systems with plenty of memory 
> like what is described in this thread, running without swap space seems 
> like a very good idea.

But check your /etc/fstab first and make sure nothing is mounted as tmpfs.

You really should stick in an SSD and have a 64G swap partition on it:
it won't hurt and it costs almost nothing. You should also check `iostat -dmx`
when your system is slow; for example, seeing near-100% utilization on disk writes
usually means a non-TLER disk is going bad without telling the OS about it
(and slowing everything down to a crawl).

Dima
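The fstab check suggested above can be sketched as a small script. This is an editorial sketch, not from the thread; the file path is a parameter so the logic can be exercised against a sample file. tmpfs mounts live in RAM and shrink what is left for the page cache.

```shell
#!/bin/sh
# Flag any tmpfs entries in fstab (ignoring comment lines).
FSTAB="${1:-/etc/fstab}"
if grep -qE '^[^#]*\btmpfs\b' "$FSTAB"; then
    echo "tmpfs entries in $FSTAB:"
    grep -E '^[^#]*\btmpfs\b' "$FSTAB"
else
    echo "no tmpfs entries in $FSTAB"
fi
# When the system is slow, also watch per-device utilization:
#   iostat -dmx 5
```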

Re: SOLR Performance on RHEL 7

Posted by matthew sporleder <ms...@gmail.com>.
On Tue, Oct 26, 2021 at 11:25 AM Shawn Heisey <el...@elyograg.org> wrote:
>
> On 10/26/21 8:34 AM, Michael Gibney wrote:
> > In my experience, running Solr on CentOS 7 (comparable to RHEL 7) -- on
> > VMWare, but "ballooning" was _not_ the issue -- I found that setting
> > vm.swappiness=0 or 1 did not actually prevent swapping. Notwithstanding
> > Shawn's excellent suggestions above, if you still suspect that swapping is
> > the issue and you are ok with foregoing swap altogether, you might try
> > straight-up `swapoff -a`. This ended up being the right choice for my case,
> > fwiw.
>
> Not related to the OP, replying to Michael:
>
> I've seen some very strange behavior related to swap on Linux.  My
> server at home (Ubuntu 20, kernel 5.11.0) has 64GB of memory and two
> 6-core CPUs.  It is not running Solr.  Even with what I classify as zero
> memory pressure (40 or more gigabytes of memory used by OS disk
> caching), and with swappiness at 0 or 1, it seems to prefer swapping out
> (swap partition is 8GB) rather than just reclaiming memory from cache,
> and I had thought that lowering swappiness would reverse that
> preference.  I haven't figured out a way to keep that from happening,
> other than disabling swap.  I don't need the swap space -- 64GB is more
> than enough for what the server does.
>
> Related to the OP:
>
> I don't think swap is the problem.  Disabling swap entirely would be a
> good test to confirm.  For general server use cases, I would not
> recommend that action, but for dedicated systems with plenty of memory
> like what is described in this thread, running without swap space seems
> like a very good idea.
>
> Thanks,
> Shawn
>
>

The meaning of 'swappiness' changed a few years ago, and the values 0 and 1
are no longer intuitive: https://eklitzke.org/swappiness
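To make whatever value you settle on persistent, a sysctl.d fragment is the usual route (editorial sketch; the file name is arbitrary). Roughly since kernel 3.5, vm.swappiness=0 avoids swapping anonymous memory almost entirely (risking OOM kills under pressure), while 1 permits a minimal amount; neither keeps kswapd from reclaiming page cache.

```shell
# /etc/sysctl.d/99-swappiness.conf -- persist the setting across reboots
vm.swappiness = 1

# apply without rebooting:
#   sudo sysctl --system
```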

Re: SOLR Performance on RHEL 7

Posted by Shawn Heisey <el...@elyograg.org>.
On 10/26/21 8:34 AM, Michael Gibney wrote:
> In my experience, running Solr on CentOS 7 (comparable to RHEL 7) -- on
> VMWare, but "ballooning" was _not_ the issue -- I found that setting
> vm.swappiness=0 or 1 did not actually prevent swapping. Notwithstanding
> Shawn's excellent suggestions above, if you still suspect that swapping is
> the issue and you are ok with foregoing swap altogether, you might try
> straight-up `swapoff -a`. This ended up being the right choice for my case,
> fwiw.

Not related to the OP, replying to Michael:

I've seen some very strange behavior related to swap on Linux.  My 
server at home (Ubuntu 20, kernel 5.11.0) has 64GB of memory and two 
6-core CPUs.  It is not running Solr.  Even with what I classify as zero 
memory pressure (40 or more gigabytes of memory used by OS disk 
caching), and with swappiness at 0 or 1, it seems to prefer swapping out 
(swap partition is 8GB) rather than just reclaiming memory from cache, 
and I had thought that lowering swappiness would reverse that 
preference.  I haven't figured out a way to keep that from happening, 
other than disabling swap.  I don't need the swap space -- 64GB is more 
than enough for what the server does.

Related to the OP:

I don't think swap is the problem.  Disabling swap entirely would be a 
good test to confirm.  For general server use cases, I would not 
recommend that action, but for dedicated systems with plenty of memory 
like what is described in this thread, running without swap space seems 
like a very good idea.

Thanks,
Shawn



Re: SOLR Performance on RHEL 7

Posted by Michael Gibney <mi...@michaelgibney.net>.
In my experience, running Solr on CentOS 7 (comparable to RHEL 7) -- on
VMWare, but "ballooning" was _not_ the issue -- I found that setting
vm.swappiness=0 or 1 did not actually prevent swapping. Notwithstanding
Shawn's excellent suggestions above, if you still suspect that swapping is
the issue and you are ok with foregoing swap altogether, you might try
straight-up `swapoff -a`. This ended up being the right choice for my case,
fwiw.
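If you go the `swapoff -a` route, it is worth confirming swap is actually gone. A sketch (editorial, not from the thread; the meminfo path is parameterized so the parsing can be exercised on a sample file):

```shell
#!/bin/sh
# After `sudo swapoff -a`, SwapTotal in meminfo should read 0.
MEMINFO="${1:-/proc/meminfo}"
SWAP_KB=$(awk '/^SwapTotal:/ {print $2}' "$MEMINFO")
if [ "${SWAP_KB:-0}" -eq 0 ]; then
    echo "swap is disabled (SwapTotal = 0)"
else
    echo "swap still enabled: ${SWAP_KB} kB"
fi
# To keep it off across reboots, also comment out the swap line in /etc/fstab.
```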

On Tue, Oct 26, 2021 at 10:20 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/26/21 6:10 AM, Paul Russell wrote:
> > I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All
> SOLR
> > instances use a 25G JVM on the RHEL 6 server configured with 64G of
> memory
> > managing a 900G collection. Measured response time to queries average
> about
> > 100ms.
>
> Congrats on getting that performance.  With the numbers you have
> described, I would not expect to see anything that good.
>
> > On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the
> CPU
> > and response time is being measured at 500-1000 ms for queries.
>
> How long are you giving the system, and how many queries have been
> handled by the cluster before you begin benchmarking?  The only way the
> old cluster could see performance that good is handling a LOT of queries
> ... enough that the OS can figure out how to effectively cache the index
> with limited memory.  By my calculations, your systems have less than
> 40GB of free memory to cache a 900GB index.  And that assumes that Solr
> is the only software running on these systems.
>
> > I tried using the vm.swappiness setting at both 0 and 1 and have been
> > unable to change the behavior.
>
> Did you see any information other than kswapd0 CPU usage that led you to
> this action?  I would not expect swap to be the problem with this, and
> your own experiments seem to say the same.
>
> > If I trim the SOLR JVM to 16Gb response
> > times get better and GC logs show the JVM is operating correctly..
>
>
> Sounds like you have a solution.  Is there a problem with simply
> changing the heap size?  If everything works with a lower heap size,
> then the lower heap size is strongly encouraged.  You seem to be making
> a point here about the JVM operating correctly with a 16GB heap.  Are
> you seeing something in GC logs to indicate incorrect operation with the
> higher heap?  Solr 6.x uses CMS for garbage collection. You might see
> better GC performance by switching to G1. Switching to another collector
> would require a much newer Java version, one that is probably not
> compatible with Solr 6.x. Here is the GC_TUNE setting (goes in
> solr.in.sh) for newer Solr versions:
>
>        GC_TUNE=('-XX:+UseG1GC' \
>          '-XX:+PerfDisableSharedMem' \
>          '-XX:+ParallelRefProcEnabled' \
>          '-XX:MaxGCPauseMillis=250' \
>          '-XX:+UseLargePages' \
>          '-XX:+AlwaysPreTouch' \
>          '-XX:+ExplicitGCInvokesConcurrent')
>
> If your servers have more than one physical CPU and NUMA architecture,
> then I would strongly recommend adding "-XX:+UseNUMA" to the argument
> list.  Adding it on systems with only one NUMA node will not cause
> problems.
>
> I would not expect the problem to be in the OS, but I could be wrong.
> It is possible that changes in the newer kernel make it less efficient
> at figuring out proper cache operation, and that would affect Solr.
> Usually things get better with an upgrade, but you never know.
>
> It seems more likely to be some other difference between the systems.
> Top culprit in my mind is Java.  Are the two systems running the same
> version of Java from the same vendor?  What I would recommend for Solr
> 6.x is the latest OpenJDK 8.  In the past I would have recommended
> Oracle Java, but they changed their licensing, so now I go with
> OpenJDK.  Avoid IBM Java or anything that descends from it -- it is
> known to have bugs running Lucene software.  If you want to use a newer
> Java version than Java 8, you'll need to upgrade Solr.  Upgrading from
> 6.x to 8.x is something that requires extensive testing, and a complete
> reindex from scratch.
>
> I would be interested in seeing the screenshot described here:
>
>
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue
>
> RHEL uses gnu top.
>
> My own deployments use Ubuntu.  Back when I did have access to large
> Solr installs, they were running on CentOS, which is effectively the
> same as RHEL.  I do not recall whether they were CentOS 6 or 7.
>
> Thanks,
> Shawn
>
>
>

Re: SOLR Performance on RHEL 7

Posted by Walter Underwood <wu...@wunderwood.org>.
How big are the indexes? If performance improves with a smaller heap, it could mean the indexes were not fitting in the file buffers.

You can verify this by looking at iostat output with the different heap sizes. There should be almost no disk reads while Solr is handling queries. If there is disk I/O, there is not enough RAM available to cache the index files and queries will be a lot slower.
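One way to quantify "almost no disk reads" is to diff the sectors-read counter in /proc/diskstats over an interval while Solr serves queries. This is an editorial sketch; the device name and stats path are parameters (the default `sda` is an assumption, adjust to the device holding the index).

```shell
#!/bin/sh
# Print sectors read from a device over a 5-second window. Near-zero while
# serving queries means the index is coming from the page cache.
DEV="${1:-sda}"
STATS="${2:-/proc/diskstats}"
# /proc/diskstats: field 3 is the device name, field 6 is sectors read.
sectors_read() {
    awk -v d="$DEV" '$3 == d {print $6; found=1} END {if (!found) print 0}' "$STATS"
}
R1=$(sectors_read)
sleep 5
R2=$(sectors_read)
echo "sectors read from $DEV in 5s: $((R2 - R1))"
```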

There could be some change between RHEL versions where new daemons or something else is taking up more RAM.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 26, 2021, at 10:07 AM, Dave <ha...@gmail.com> wrote:
> 
> I have always preferred completely turning off swap on solr dedicated machines, and especially if you can’t use an SSD. 
> 
>> On Oct 26, 2021, at 12:59 PM, Paul Russell <pa...@qflow.com> wrote:
>> 
>> Thanks for all the helpful information.
>> 
>> Currently we are averaging about 5.5k requests a minute for this collection
>> that is supported by a 3 node SOLR cluster. RHEL6 (Current Servers) and
>> RHEL 7 (New Servers)  are both utilizing OpenJDK8. Older servers have an
>> older version 8.131 new servers have 8.302 jdk installations.
>> 
>> GC is configured the same on all servers.
>> 
>> GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=200
>> -XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+PerfDisableSharedMem
>> -XX:MetaspaceSize=64M"
>> 
>> 
>> Because I can bring the nodes on-line during off peak hours and load test
>> I'll take a look at 'swap-off" option. I dont control the hardware but I
>> also think a larger SSD based swap fs is also an option unless turning swap
>> off doesnt work
>> 
>> 
>> Thanks again..
>> 
>> 
>> 
>> 
>> 
>>> On Tue, Oct 26, 2021 at 9:20 AM Shawn Heisey <ap...@elyograg.org> wrote:
>>> 
>>>> On 10/26/21 6:10 AM, Paul Russell wrote:
>>>> I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All
>>> SOLR
>>>> instances use a 25G JVM on the RHEL 6 server configured with 64G of
>>> memory
>>>> managing a 900G collection. Measured response time to queries average
>>> about
>>>> 100ms.
>>> 
>>> Congrats on getting that performance.  With the numbers you have
>>> described, I would not expect to see anything that good.
>>> 
>>>> On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the
>>> CPU
>>>> and response time is being measured at 500-1000 ms for queries.
>>> 
>>> How long are you giving the system, and how many queries have been
>>> handled by the cluster before you begin benchmarking?  The only way the
>>> old cluster could see performance that good is handling a LOT of queries
>>> ... enough that the OS can figure out how to effectively cache the index
>>> with limited memory.  By my calculations, your systems have less than
>>> 40GB of free memory to cache a 900GB index.  And that assumes that Solr
>>> is the only software running on these systems.
>>> 
>>>> I tried using the vm.swappiness setting at both 0 and 1 and have been
>>>> unable to change the behavior.
>>> 
>>> Did you see any information other than kswapd0 CPU usage that led you to
>>> this action?  I would not expect swap to be the problem with this, and
>>> your own experiments seem to say the same.
>>> 
>>>> If I trim the SOLR JVM to 16Gb response
>>>> times get better and GC logs show the JVM is operating correctly..
>>> 
>>> 
>>> Sounds like you have a solution.  Is there a problem with simply
>>> changing the heap size?  If everything works with a lower heap size,
>>> then the lower heap size is strongly encouraged.  You seem to be making
>>> a point here about the JVM operating correctly with a 16GB heap.  Are
>>> you seeing something in GC logs to indicate incorrect operation with the
>>> higher heap?  Solr 6.x uses CMS for garbage collection. You might see
>>> better GC performance by switching to G1. Switching to another collector
>>> would require a much newer Java version, one that is probably not
>>> compatible with Solr 6.x. Here is the GC_TUNE setting (goes in
>>> solr.in.sh) for newer Solr versions:
>>> 
>>>      GC_TUNE=('-XX:+UseG1GC' \
>>>        '-XX:+PerfDisableSharedMem' \
>>>        '-XX:+ParallelRefProcEnabled' \
>>>        '-XX:MaxGCPauseMillis=250' \
>>>        '-XX:+UseLargePages' \
>>>        '-XX:+AlwaysPreTouch' \
>>>        '-XX:+ExplicitGCInvokesConcurrent')
>>> 
>>> If your servers have more than one physical CPU and NUMA architecture,
>>> then I would strongly recommend adding "-XX:+UseNUMA" to the argument
>>> list.  Adding it on systems with only one NUMA node will not cause
>>> problems.
>>> 
>>> I would not expect the problem to be in the OS, but I could be wrong.
>>> It is possible that changes in the newer kernel make it less efficient
>>> at figuring out proper cache operation, and that would affect Solr.
>>> Usually things get better with an upgrade, but you never know.
>>> 
>>> It seems more likely to be some other difference between the systems.
>>> Top culprit in my mind is Java.  Are the two systems running the same
>>> version of Java from the same vendor?  What I would recommend for Solr
>>> 6.x is the latest OpenJDK 8.  In the past I would have recommended
>>> Oracle Java, but they changed their licensing, so now I go with
>>> OpenJDK.  Avoid IBM Java or anything that descends from it -- it is
>>> known to have bugs running Lucene software.  If you want to use a newer
>>> Java version than Java 8, you'll need to upgrade Solr.  Upgrading from
>>> 6.x to 8.x is something that requires extensive testing, and a complete
>>> reindex from scratch.
>>> 
>>> I would be interested in seeing the screenshot described here:
>>> 
>>> 
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue
>>> 
>>> RHEL uses gnu top.
>>> 
>>> My own deployments use Ubuntu.  Back when I did have access to large
>>> Solr installs, they were running on CentOS, which is effectively the
>>> same as RHEL.  I do not recall whether they were CentOS 6 or 7.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>>> 
>> 
>> -- 
>> Paul
>> Russell
>> VP Integration/Support Services
>> *main:* 314.968.9906
>> *direct:* 314.255.2135
>> *cell:* 314.258.0864
>> 9317 Manchester Rd.
>> St. Louis, MO 63119
>> qflow.com <https://www.qflow.com/>


Re: SOLR Performance on RHEL 7

Posted by Dave <ha...@gmail.com>.
I have always preferred completely turning off swap on dedicated Solr machines, especially if you can’t use an SSD.

> On Oct 26, 2021, at 12:59 PM, Paul Russell <pa...@qflow.com> wrote:
> 
> Thanks for all the helpful information.
> 
> Currently we are averaging about 5.5k requests a minute for this collection
> that is supported by a 3 node SOLR cluster. RHEL6 (Current Servers) and
> RHEL 7 (New Servers)  are both utilizing OpenJDK8. Older servers have an
> older version 8.131 new servers have 8.302 jdk installations.
> 
> GC is configured the same on all servers.
> 
> GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=200
> -XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+PerfDisableSharedMem
> -XX:MetaspaceSize=64M"
> 
> 
> Because I can bring the nodes on-line during off peak hours and load test
> I'll take a look at 'swap-off" option. I dont control the hardware but I
> also think a larger SSD based swap fs is also an option unless turning swap
> off doesnt work
> 
> 
> Thanks again..
> 
> 
> 
> 
> 
>> On Tue, Oct 26, 2021 at 9:20 AM Shawn Heisey <ap...@elyograg.org> wrote:
>> 
>>> On 10/26/21 6:10 AM, Paul Russell wrote:
>>> I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All
>> SOLR
>>> instances use a 25G JVM on the RHEL 6 server configured with 64G of
>> memory
>>> managing a 900G collection. Measured response time to queries average
>> about
>>> 100ms.
>> 
>> Congrats on getting that performance.  With the numbers you have
>> described, I would not expect to see anything that good.
>> 
>>> On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the
>> CPU
>>> and response time is being measured at 500-1000 ms for queries.
>> 
>> How long are you giving the system, and how many queries have been
>> handled by the cluster before you begin benchmarking?  The only way the
>> old cluster could see performance that good is handling a LOT of queries
>> ... enough that the OS can figure out how to effectively cache the index
>> with limited memory.  By my calculations, your systems have less than
>> 40GB of free memory to cache a 900GB index.  And that assumes that Solr
>> is the only software running on these systems.
>> 
>>> I tried using the vm.swappiness setting at both 0 and 1 and have been
>>> unable to change the behavior.
>> 
>> Did you see any information other than kswapd0 CPU usage that led you to
>> this action?  I would not expect swap to be the problem with this, and
>> your own experiments seem to say the same.
>> 
>>> If I trim the SOLR JVM to 16Gb response
>>> times get better and GC logs show the JVM is operating correctly..
>> 
>> 
>> Sounds like you have a solution.  Is there a problem with simply
>> changing the heap size?  If everything works with a lower heap size,
>> then the lower heap size is strongly encouraged.  You seem to be making
>> a point here about the JVM operating correctly with a 16GB heap.  Are
>> you seeing something in GC logs to indicate incorrect operation with the
>> higher heap?  Solr 6.x uses CMS for garbage collection. You might see
>> better GC performance by switching to G1. Switching to another collector
>> would require a much newer Java version, one that is probably not
>> compatible with Solr 6.x. Here is the GC_TUNE setting (goes in
>> solr.in.sh) for newer Solr versions:
>> 
>>       GC_TUNE=('-XX:+UseG1GC' \
>>         '-XX:+PerfDisableSharedMem' \
>>         '-XX:+ParallelRefProcEnabled' \
>>         '-XX:MaxGCPauseMillis=250' \
>>         '-XX:+UseLargePages' \
>>         '-XX:+AlwaysPreTouch' \
>>         '-XX:+ExplicitGCInvokesConcurrent')
>> 
>> If your servers have more than one physical CPU and NUMA architecture,
>> then I would strongly recommend adding "-XX:+UseNUMA" to the argument
>> list.  Adding it on systems with only one NUMA node will not cause
>> problems.
>> 
>> I would not expect the problem to be in the OS, but I could be wrong.
>> It is possible that changes in the newer kernel make it less efficient
>> at figuring out proper cache operation, and that would affect Solr.
>> Usually things get better with an upgrade, but you never know.
>> 
>> It seems more likely to be some other difference between the systems.
>> Top culprit in my mind is Java.  Are the two systems running the same
>> version of Java from the same vendor?  What I would recommend for Solr
>> 6.x is the latest OpenJDK 8.  In the past I would have recommended
>> Oracle Java, but they changed their licensing, so now I go with
>> OpenJDK.  Avoid IBM Java or anything that descends from it -- it is
>> known to have bugs running Lucene software.  If you want to use a newer
>> Java version than Java 8, you'll need to upgrade Solr.  Upgrading from
>> 6.x to 8.x is something that requires extensive testing, and a complete
>> reindex from scratch.
>> 
>> I would be interested in seeing the screenshot described here:
>> 
>> 
>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue
>> 
>> RHEL uses gnu top.
>> 
>> My own deployments use Ubuntu.  Back when I did have access to large
>> Solr installs, they were running on CentOS, which is effectively the
>> same as RHEL.  I do not recall whether they were CentOS 6 or 7.
>> 
>> Thanks,
>> Shawn
>> 
>> 
>> 
> 
> -- 
> Paul
> Russell
> VP Integration/Support Services
> *main:* 314.968.9906
> *direct:* 314.255.2135
> *cell:* 314.258.0864
> 9317 Manchester Rd.
> St. Louis, MO 63119
> qflow.com <https://www.qflow.com/>

Re: SOLR Performance on RHEL 7

Posted by Shawn Heisey <el...@elyograg.org>.
On 10/26/21 10:58 AM, Paul Russell wrote:
> Currently we are averaging about 5.5k requests a minute for this collection
> that is supported by a 3 node SOLR cluster. RHEL6 (Current Servers) and
> RHEL 7 (New Servers)  are both utilizing OpenJDK8. Older servers have an
> older version 8.131 new servers have 8.302 jdk installations.

Can you easily try 8u302 or a later release on RHEL 6, or the 8u131 release on
RHEL 7, to see if that makes any difference?

I'm trying to find where releases of OpenJDK 8 can be downloaded and am
coming up empty.  You could download releases from Oracle for testing
purposes and not be in violation of the license.

This should be a download link for the Oracle Java 8u311 JRE, Linux
64-bit version:

https://javadl.oracle.com/webapps/download/AutoDL?BundleId=245469_4d5417147a92418ea8b615e228bb6935

To use that, you would just need to extract it to a location on your 
filesystem, maybe in /opt, and then use the JAVA_HOME environment 
variable to tell Solr where to find it.
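Once extracted, it is worth confirming that the JAVA_HOME you hand to Solr actually points at a working JVM, since JAVA_HOME takes precedence over whatever `java` is on the PATH. A sketch (the /opt path below is an assumption; use wherever you extracted the archive):

```shell
#!/bin/sh
# Check that the JVM Solr will pick up really exists, then print its version.
JAVA_HOME="${JAVA_HOME:-/opt/jre1.8.0_311}"
if [ -x "$JAVA_HOME/bin/java" ]; then
    "$JAVA_HOME/bin/java" -version
else
    echo "no java executable at $JAVA_HOME/bin/java"
fi
# To make it stick, set JAVA_HOME in solr.in.sh (path varies by install).
```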

I'm having trouble finding download links for older Java 8 releases.  
Oracle has been fiddling with the website since the last time I looked 
at it.

Thanks,
Shawn



Re: SOLR Performance on RHEL 7

Posted by Paul Russell <pa...@qflow.com>.
Thanks for all the helpful information.

Currently we are averaging about 5.5k requests a minute for this collection,
which is supported by a 3-node SOLR cluster. The RHEL 6 (current) and
RHEL 7 (new) servers are both using OpenJDK 8: the older servers have the
8u131 release, the new servers 8u302.

GC is configured the same on all servers.

GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=200
-XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+PerfDisableSharedMem
-XX:MetaspaceSize=64M"


Because I can bring the nodes on-line during off-peak hours and load test,
I'll take a look at the 'swapoff' option. I don't control the hardware, but I
also think a larger SSD-based swap filesystem is an option if turning swap
off doesn't work.


Thanks again.





On Tue, Oct 26, 2021 at 9:20 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/26/21 6:10 AM, Paul Russell wrote:
> > I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All
> SOLR
> > instances use a 25G JVM on the RHEL 6 server configured with 64G of
> memory
> > managing a 900G collection. Measured response time to queries average
> about
> > 100ms.
>
> Congrats on getting that performance.  With the numbers you have
> described, I would not expect to see anything that good.
>
> > On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the
> CPU
> > and response time is being measured at 500-1000 ms for queries.
>
> How long are you giving the system, and how many queries have been
> handled by the cluster before you begin benchmarking?  The only way the
> old cluster could see performance that good is handling a LOT of queries
> ... enough that the OS can figure out how to effectively cache the index
> with limited memory.  By my calculations, your systems have less than
> 40GB of free memory to cache a 900GB index.  And that assumes that Solr
-- 
Paul Russell
VP Integration/Support Services
*main:* 314.968.9906
*direct:* 314.255.2135
*cell:* 314.258.0864
9317 Manchester Rd.
St. Louis, MO 63119
qflow.com <https://www.qflow.com/>

Re: SOLR Performance on RHEL 7

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/26/21 6:10 AM, Paul Russell wrote:
> I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All SOLR
> instances use a 25G JVM on the RHEL 6 server configured with 64G of memory
> managing a 900G collection. Measured response time to queries average about
> 100ms.

Congrats on getting that performance.  With the numbers you have 
described, I would not expect to see anything that good.

> On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the CPU
> and response time is being measured at 500-1000 ms for queries.

How long are you giving the system, and how many queries have been 
handled by the cluster before you begin benchmarking?  The only way the 
old cluster could see performance that good is handling a LOT of queries 
... enough that the OS can figure out how to effectively cache the index 
with limited memory.  By my calculations, your systems have less than 
40GB of free memory to cache a 900GB index.  And that assumes that Solr 
is the only software running on these systems.
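The back-of-envelope arithmetic behind that figure, using the numbers from this thread (the 2GB OS allowance is an assumption, not something Paul reported):

```shell
# Rough page-cache headroom left for the index after the JVM takes its share.
total_gb=64        # physical RAM per server
heap_gb=25         # Solr JVM heap
os_overhead_gb=2   # assumed allowance for the OS and other processes
cache_gb=$((total_gb - heap_gb - os_overhead_gb))
echo "~${cache_gb}GB of page cache available for a 900GB index"
```

That is only about 4% of the index, which is why warm-up traffic matters so much here.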

> I tried using the vm.swappiness setting at both 0 and 1 and have been
> unable to change the behavior.

Did you see any information other than kswapd0 CPU usage that led you to 
this action?  I would not expect swap to be the problem with this, and 
your own experiments seem to say the same.
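For anyone wanting to reproduce the experiment, the usual way to inspect and persist the setting on RHEL looks roughly like this (the drop-in filename is illustrative, not something from this thread):

```shell
# Check the current value (readable without root).
sysctl vm.swappiness

# Apply immediately (requires root); this does not survive a reboot.
sudo sysctl -w vm.swappiness=1

# Persist across reboots via a sysctl drop-in file.
echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/90-solr.conf
sudo sysctl --system
```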

> If I trim the SOLR JVM to 16Gb response
> times get better and GC logs show the JVM is operating correctly..


Sounds like you have a solution.  Is there a problem with simply 
changing the heap size?  If everything works with a lower heap size, 
then the lower heap size is strongly encouraged.  You seem to be making 
a point here about the JVM operating correctly with a 16GB heap.  Are 
you seeing something in GC logs to indicate incorrect operation with the 
higher heap?  Solr 6.x uses CMS for garbage collection. You might see 
better GC performance by switching to G1. Switching to another collector 
would require a much newer Java version, one that is probably not 
compatible with Solr 6.x. Here is the GC_TUNE setting (goes in 
solr.in.sh) for newer Solr versions:

       GC_TUNE=('-XX:+UseG1GC' \
         '-XX:+PerfDisableSharedMem' \
         '-XX:+ParallelRefProcEnabled' \
         '-XX:MaxGCPauseMillis=250' \
         '-XX:+UseLargePages' \
         '-XX:+AlwaysPreTouch' \
         '-XX:+ExplicitGCInvokesConcurrent')

If your servers have more than one physical CPU and NUMA architecture, 
then I would strongly recommend adding "-XX:+UseNUMA" to the argument 
list.  Adding it on systems with only one NUMA node will not cause problems.
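A quick way to check whether a box actually has more than one NUMA node before bothering with the flag (numactl may need to be installed separately):

```shell
# More than 1 NUMA node means -XX:+UseNUMA is worth adding.
lscpu | grep -i 'numa node(s)'

# Or, with numactl installed, show the full memory topology.
numactl --hardware
```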

I would not expect the problem to be in the OS, but I could be wrong.  
It is possible that changes in the newer kernel make it less efficient 
at figuring out proper cache operation, and that would affect Solr.  
Usually things get better with an upgrade, but you never know.

It seems more likely to be some other difference between the systems.  
Top culprit in my mind is Java.  Are the two systems running the same 
version of Java from the same vendor?  What I would recommend for Solr 
6.x is the latest OpenJDK 8.  In the past I would have recommended 
Oracle Java, but they changed their licensing, so now I go with 
OpenJDK.  Avoid IBM Java or anything that descends from it -- it is 
known to have bugs running Lucene software.  If you want to use a newer 
Java version than Java 8, you'll need to upgrade Solr.  Upgrading from 
6.x to 8.x is something that requires extensive testing, and a complete 
reindex from scratch.
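Comparing the two hosts' JVMs takes only a moment (a diagnostic sketch):

```shell
# Vendor and version of the default JVM on each host.
java -version

# Resolve which installation that binary actually comes from.
readlink -f "$(command -v java)"
```

Run both commands on a RHEL 6 and a RHEL 7 node and diff the output before looking anywhere else.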

I would be interested in seeing the screenshot described here:

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue

RHEL uses GNU top.

My own deployments use Ubuntu.  Back when I did have access to large 
Solr installs, they were running on CentOS, which is effectively the 
same as RHEL.  I do not recall whether they were CentOS 6 or 7.

Thanks,
Shawn



Re: SOLR Performance on RHEL 7

Posted by matthew sporleder <ms...@gmail.com>.
On Tue, Oct 26, 2021 at 8:11 AM Paul Russell <pa...@qflow.com> wrote:
>
> I have a current SOLR cluster running SOLR 6.6 on RHEL 6 servers. All SOLR
> instances use a 25G JVM on the RHEL 6 server configured with 64G of memory
> managing a 900G collection. Measured response time to queries average about
> 100ms.
>
> I am attempting to move the cluster to new RHEL 7 servers with the same
> configuration (8 cores/ 64G memory) and having performance issues.
>
> On the RHEL 7 servers the kswapd0 process is consuming up to 30% of the CPU
> and response time is being measured at 500-1000 ms for queries.
>
> I tried using the vm.swappiness setting at both 0 and 1 and have been
> unable to change the behavior. If I trim the SOLR JVM to 16Gb response
> times get better and GC logs show the JVM is operating correctly..
>
> Has anyone else had a similar issue? I have tried upgrading to SOLR 7.7.2
> as part of the process and that hasn't helped.
>
> Any suggestions?

Are you running tuned on the servers?  If so, have you tried different
profiles, like 'latency-performance'?
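If tuned is in play, checking and switching profiles looks like this (profile names as shipped with RHEL 7):

```shell
# Show the active profile and the ones available on this host.
tuned-adm active
tuned-adm list

# Switch to the low-latency profile for testing (requires root).
sudo tuned-adm profile latency-performance
```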