Posted to solr-user@lucene.apache.org by Dominique Bejean <do...@eolya.fr> on 2020/10/18 09:22:13 UTC
SolrCloud 6.6.2 suddenly crash due to slow queries and Log4j issue
Hi,
A few months ago, I reported an issue with Solr nodes crashing due to the
old generation heap suddenly growing and generating an OOM. The problem
occurred again this week. I have thread dumps for each minute during the 3
minutes the problem occurred, and I am using fastthread.io in order to
analyse these dumps.
The thread scenario on the failing node is:
=== 15h54 -> looks fine
Old gen heap: 0.5 GB (3 GB max)
67 threads TIMED_WAITING
26 threads RUNNABLE
7 threads WAITING
=== 15h55 -> fastthread.io reports a few suspects
Old gen heap starts growing: from 0.5 GB to 2 GB (3 GB max)
42 threads TIMED_WAITING
41 threads RUNNABLE
10 threads WAITING
The first symptom: 8 RUNNABLE threads are stuck (same stack trace),
waiting for responses from some other nodes
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
at org.apache.http.impl.io.SocketInputBuffer.isDataAvailable(SocketInputBuffer.java:95)
at org.apache.http.impl.AbstractHttpClientConnection.isStale(AbstractHttpClientConnection.java:310)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.isStale(ManagedClientConnectionImpl.java:158)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:433)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:447)
at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:388)
at org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:302)
at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:166)
at org.apache.solr.handler.component.HttpShardHandler$$Lambda$192/1788637481.call(Unknown Source)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/986729174.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
=== 15h56 -> fastthread.io reports an issue
Old gen heap full: 3 GB (3 GB max)
57 threads TIMED_WAITING
126 threads RUNNABLE
18 threads WAITING
14 threads BLOCKED
7 RUNNABLE threads are still stuck (same stack trace), waiting for
responses from some other nodes
1 BLOCKED thread acquired org.apache.log4j.Logger's lock and did not
release it; because of that, 13 threads are BLOCKED (same stack trace) on
org.apache.log4j.Category.callAppenders
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:204)
- waiting to lock <0x00000007005a6f08> (a org.apache.log4j.Logger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:304)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2482)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
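The BLOCKED pattern above matches the known log4j 1.x design: Category.callAppenders() synchronizes on the logger while it walks the appenders, so one thread held up inside an appender (e.g. on file I/O) queues every other logging thread behind the same monitor. A minimal sketch of that effect in plain Java (this is an illustration of the locking pattern, not the actual log4j source; names like AppenderLockDemo and the 50 ms delay are invented for the demo):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AppenderLockDemo {
    // Plays the role of the org.apache.log4j.Logger monitor seen in the dump.
    static final Object loggerLock = new Object();
    // Tracks how many threads are queued on the lock at once.
    static final AtomicInteger inCall = new AtomicInteger();
    static final AtomicInteger peakQueued = new AtomicInteger();

    // Analogue of Category.callAppenders(): synchronized, and slow because
    // the "appender" does blocking work while the lock is held.
    static void callAppenders(long appenderMillis) throws InterruptedException {
        inCall.incrementAndGet();
        synchronized (loggerLock) {
            peakQueued.accumulateAndGet(inCall.get(), Math::max);
            Thread.sleep(appenderMillis); // slow appender, e.g. file I/O
        }
        inCall.decrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 16; i++) {
            pool.submit(() -> {
                try {
                    callAppenders(50);
                } catch (InterruptedException ignored) {
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // With 16 logging threads and a 50 ms appender, threads pile up on the
        // single monitor, mirroring the BLOCKED threads in the dumps above.
        System.out.println("peak threads queued on logger lock: " + peakQueued.get());
    }
}
```

The point of the sketch: the slowness of one appender is multiplied across every thread that logs, which is why request-handling threads end up BLOCKED in callAppenders rather than doing work.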
=== 15h57 -> fastthread.io reports an issue
Old gen heap full: 3 GB (3 GB max)
43 threads TIMED_WAITING
250 threads RUNNABLE
110 threads WAITING
112 threads BLOCKED
18 RUNNABLE threads are still stuck (same stack trace), waiting for
responses from some other nodes
22 threads are in BLOCKED state (same stack trace):
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:204)
- waiting to lock <0x00000007005a4900> (a org.apache.log4j.Logger)
90 threads are in BLOCKED state (same stack trace):
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:816)
- waiting to lock <0x000000070087b910> (a org.apache.solr.util.stats.InstrumentedHttpClient)
=== 15h58 -> Young Gen heap full -> OOM
The global scenario is (6 Solr nodes):
1/
The problem starts with very slow queries (30 seconds to 2 minutes) on all
nodes.
2/
On the failing node:
* Some threads are stuck in
java.net.SocketInputStream.socketRead0(Native Method), waiting for
responses from other nodes
* The Log4j issue starts (
https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)
* The old generation heap grows within a few seconds
* More and more threads handling incoming requests are blocked
* The young generation heap fills up -> OOM
OK, the slow queries have to be fixed, but even if some nodes are not
responding to another node, that node should not end up crashing because of
this Log4j issue.
Until the slow queries are fixed, any suggestions to avoid this crash?
Regards.
Dominique
Re: SolrCloud 6.6.2 suddenly crash due to slow queries and Log4j issue
Posted by Dominique Bejean <do...@eolya.fr>.
Shawn,
According to the log4j bug description (
https://bz.apache.org/bugzilla/show_bug.cgi?id=57714), the issue is related
to the lock held while walking the appender collection.
In addition to the CONSOLE and file appenders in the default log4j.properties,
my customer added 2 extra FileAppenders dedicated to all requests and slow
requests. I suggested removing these two extra appenders.
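For context, the two extra appenders were of roughly this shape (the logger names, appender names, and file paths below are illustrative guesses, not the customer's actual configuration); every additional appender attached to a busy logger adds work done while the Category lock is held:

```properties
# Illustrative example only -- not the actual customer log4j.properties.
# Extra FileAppender logging every request:
log4j.logger.org.apache.solr.core.SolrCore.Request=INFO, requests
log4j.appender.requests=org.apache.log4j.FileAppender
log4j.appender.requests.File=logs/solr_requests.log
log4j.appender.requests.layout=org.apache.log4j.PatternLayout
log4j.appender.requests.layout.ConversionPattern=%d{ISO8601} %m%n

# Extra FileAppender for slow requests:
log4j.logger.org.apache.solr.core.SolrCore.SlowRequest=WARN, slowrequests
log4j.appender.slowrequests=org.apache.log4j.FileAppender
log4j.appender.slowrequests.File=logs/solr_slow_requests.log
log4j.appender.slowrequests.layout=org.apache.log4j.PatternLayout
log4j.appender.slowrequests.layout.ConversionPattern=%d{ISO8601} %m%n
```

Deleting lines of this kind restores the default appender set shipped with Solr 6.x.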
Regards
Dominique
On Mon, Oct 19, 2020 at 3:48 PM, Dominique Bejean <do...@eolya.fr>
wrote:
Re: SolrCloud 6.6.2 suddenly crash due to slow queries and Log4j issue
Posted by Dominique Bejean <do...@eolya.fr>.
Hi Shawn,
Thank you for your response.
You are confirming my diagnosis.
This is in fact an 8-node cluster with one single collection with 4 shards
and 1 replica (8 cores).
4 GB heap and 90 GB RAM.
When no issue occurs, nearly 50% of the heap is used.
Num docs in collection: 10,000,000
Num docs per core: roughly 2,500,000
Max docs per core: roughly 3,000,000
Core data size: roughly 70 GB
Here are the JVM settings
-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=18983
-Dcom.sun.management.jmxremote.rmi.port=18983
-Dcom.sun.management.jmxremote.ssl=false
-Dhost=XXXXXXXX
-Djava.rmi.server.hostname=XXXXXXX
-Djetty.home=/xxxxx/server
-Djetty.port=8983
-Dlog4j.configuration=file:/xxxxxx/log4j.properties
-Dsolr.install.dir=/xxxxxx/solr
-Dsolr.jetty.request.header.size=32768
-Dsolr.log.dir=/xxxxxxx/Logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=/xxxxxxxx/data
-Duser.timezone=Europe/Paris
-DzkClientTimeout=30000
-DzkHost=xxxxxxx
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseGCLogFileRotation
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:GCLogFileSize=20M
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/xxxxxxx/solr/bin/oom_solr.sh 8983 /xxxxxx/Logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/xxxxxx/solr_gc.log
-Xloggc:/xxxxxx/solr_gc.log
-Xms4g
-Xmx4g
-Xss256k
-verbose:gc
Here is a screenshot of the top command for the node that failed last week.
[image: 2020-10-19 15_48_06-Photos.png]
Regards
Dominique
On Sun, Oct 18, 2020 at 10:03 PM, Shawn Heisey <ap...@elyograg.org> wrote:
Re: SolrCloud 6.6.2 suddenly crash due to slow queries and Log4j issue
Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/18/2020 3:22 AM, Dominique Bejean wrote:
> A few months ago, I reported an issue with Solr nodes crashing due to the
> old generation heap suddenly growing and generating an OOM. The problem
> occurred again this week. I have thread dumps for each minute during the 3
> minutes the problem occurred, and I am using fastthread.io in order to
> analyse these dumps.
<snip>
> * The Log4j issue starts (
> https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)
If the log4j bug is the root cause here, then the only way you can fix
this is to upgrade to at least Solr 7.4. That is the Solr version where
we first upgraded from log4j 1.2.x to log4j2. You cannot upgrade log4j
in Solr 6.6.2 without changing Solr code. The code changes required
were extensive. Note that I did not do anything to confirm whether the
log4j bug is responsible here. You seem pretty confident that this is
the case.
Note that if you upgrade to 8.x, you will need to reindex from scratch.
Upgrading an existing index is possible with one major version bump, but
if your index has ever been touched by a release that's two major
versions back, it won't work. In 8.x, that is enforced -- 8.x will not
even try to read an old index touched by 6.x or earlier.
In the following wiki page, I provided instructions for getting a
screenshot of the process listing.
https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems
In addition to that screenshot, I would like to know the on-disk size of
all the cores running on the problem node, along with a document count
from those cores. It might be possible to work around the OOM just by
increasing the size of the heap. That won't do anything about problems
with log4j.
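For reference, on a standard install the heap increase suggested above is usually made in solr.in.sh (the path and the 6g value below are illustrative assumptions; the right size should come from GC log analysis, and the CMS tuning flags listed earlier in the thread may need revisiting for a larger heap):

```shell
# solr.in.sh (location varies by install -- an assumption here).
# Replaces the current -Xms4g / -Xmx4g pair:
SOLR_HEAP="6g"
```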
Thanks,
Shawn