Posted to solr-user@lucene.apache.org by Davis <da...@gmail.com> on 2020/01/30 21:21:42 UTC

Severe performance issues of Solr 6.6.0 with debug logging

I have recently observed severe performance issues with a 1-collection, 2-shard,
4-server SolrCloud cluster (Solr 6.6.0 running on Windows with the
AdoptOpenJDK 1.8 JRE; NSSM was used to run Solr as a Windows service).
During recovery of a replica, the network utilization of the server hosting
the replica (the one in the Recovering state) was very low (around 8 Mbps).
The recovery often failed and restarted. When the recovery failed,
a warning like the following was logged on the shard leader:
Exception while writing response for params: generation=XXXXX&qt=/replication&file=_randomlookingfilename.fdt&offset=YYYYY&checksum=true&wt=filestream&command=filecontent
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 50016/50000 ms
    at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:219)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:220)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
    at org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:90)
    at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
    at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:83)
    at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1520)
    at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2601)
    at org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
    at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:809)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:538)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 50016/50000 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
    at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    ... 1 more

And some seconds later (often around 30), a warning like this was logged on
the recovering node:
Error in fetching file: _randomlookingfilename.fdt (downloaded NNNNN of MMMMM bytes)
org.apache.http.MalformedChunkCodingException: CRLF expected at end of chunk
    at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
    at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
    at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
    at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
    at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
    at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
    at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1512)
    at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1463)
    at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1444)
    at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:926)
    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:492)
    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:301)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:400)
    at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:219)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:471)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:284)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

The issue was observable on relatively slow hardware (e.g. an Azure B4ms
virtual machine); on more powerful hardware the replica was able to recover,
although the network usage during recovery was still much lower than
expected.

When analyzing the resource bottleneck, I observed that conhost.exe had
higher CPU usage than java.exe (the Solr process). The total CPU usage was
around 25% (on a 4-core machine).
It turned out that debug logging (enabled by the -v command-line option) was
the root cause of the performance issues. Solr was writing an excessive amount
of log data to the console, and all of that output was being rendered on an
invisible console window. The rendering was slow enough, and blocking, to
cause problems for Solr.
Redirecting Solr's STDOUT and STDERR to nul (in the NSSM settings) increased
the network utilization during replica recovery several times over and made it
possible for the replica to recover and become active.
The final solution was to revert to the default logging level (INFO) by
removing the -v command-line option. This improved the network utilization
during recovery even more than the STDOUT/STDERR redirection did.
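
For reference, this is roughly what the workaround looked like. The service
name "solr" and the start parameters are just placeholders here; adjust them
to your own NSSM configuration:

    rem Discard Solr's console output instead of rendering it on the hidden console:
    nssm set solr AppStdout nul
    nssm set solr AppStderr nul

    rem Final fix: start Solr without -v so the root logging level stays at INFO,
    rem i.e. a start command like
    rem     solr.cmd start -f -c -z <zkhost>:2181
    rem instead of
    rem     solr.cmd start -f -c -v -z <zkhost>:2181

(I believe the level can also be changed temporarily at runtime via the
logging admin endpoint, /solr/admin/info/logging, but that does not persist
across restarts.)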

Would it be possible to mention in the Solr documentation (e.g. in
https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems
and https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceFactors
) that debug logging might seriously hurt performance?
Perhaps Solr could also log a warning (that debug logging may seriously hurt
performance) when the root logging level is set to debug?
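
Something as simple as the following sketch (using the SLF4J API; the actual
place to hook it in, the class name DebugLoggingCheck, and the wording are of
course up to the Solr developers) would have saved me a lot of debugging time:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class DebugLoggingCheck {
      // Sketch only: call once at startup to warn if the root logger is at DEBUG or finer.
      public static void warnIfDebugLoggingEnabled() {
        Logger root = LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
        if (root.isDebugEnabled()) {
          root.warn("Root logging level is DEBUG or finer; this can seriously hurt"
              + " performance (e.g. very slow replica recovery). Consider INFO.");
        }
      }
    }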