Posted to solr-user@lucene.apache.org by Software Dev <st...@gmail.com> on 2014/03/22 20:23:40 UTC

Solr Cloud collection keeps going down?

We have 2 collections with 1 shard each, replicated over 5 servers in the
cluster. We see a lot of flapping (down or recovering) on one of the
collections. When this happens, the other collection hosted on the same
machine is still marked as active. It also takes a fairly long time
(~30 minutes) for the affected collection to come back online, if at all. I
find that it's usually more reliable to completely shut down Solr on the
affected machine and bring it back up with its core disabled. We then
re-enable the core when it's marked as active.

A few questions:

1) What is the healthcheck in SolrCloud? Put another way, what is failing
that marks one collection as down but the other on the same machine as up?

2) Why does recovery take forever when a node goes down, even if it's only
down for 30 seconds? Our index is only 7-8G and we are running on SSDs.

3) What can be done to diagnose and fix this problem?

Re: Solr Cloud collection keeps going down?

Posted by Software Dev <st...@gmail.com>.
Some logs. The core in question is "items".

---------------------------------------------------------------------------------------------------------
WARN  - 2014-03-22 02:37:13.344;
org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
zkNodeName=10.0.14.101:8983_solr_itemscore=items

WARN  - 2014-03-22 02:37:16.352; org.apache.solr.update.PeerSync;
PeerSync: core=items url=http://10.0.14.101:8983/solr too many updates
received since start - startingUpdates no longer overlaps with our
currentUpdates

WARN  - 2014-03-22 02:46:26.277; org.apache.solr.core.SolrCore;
[items] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

WARN  - 2014-03-22 02:49:27.736;
org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
tlog{file=/var/lib/solr-parent/solr-web/home/items/data/tlog/tlog.0000000000000000409
refcount=2} active=true starting pos=98026856

WARN  - 2014-03-22 02:50:07.896; org.apache.solr.core.SolrCore;
[items] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

ERROR - 2014-03-22 02:51:49.640; org.apache.solr.common.SolrException;
null:org.eclipse.jetty.io.EofException

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)

at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:147)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)

at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)

at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)

at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)

at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)

at org.apache.solr.util.FastWriter.write(FastWriter.java:55)

at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:92)

at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)

at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)

at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)

at java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)

... 51 more


ERROR - 2014-03-22 02:51:49.645; org.apache.solr.common.SolrException;
null:org.eclipse.jetty.io.EofException

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)

at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:147)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)

at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)

at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)

at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)

at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)

at org.apache.solr.util.FastWriter.write(FastWriter.java:55)

at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:92)

at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)

at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)

at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)

at java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)

... 51 more


WARN  - 2014-03-22 02:51:49.645; org.eclipse.jetty.server.Response;
Committed before 500 {msg=Connection
reset,trace=org.eclipse.jetty.io.EofException

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)

at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:147)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)

at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)

at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)

at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)

at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)

at org.apache.solr.util.FastWriter.write(FastWriter.java:55)

at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:92)

at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)

at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)

at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)

at java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)

... 51 more

,code=500}

WARN  - 2014-03-22 02:51:49.650;
org.eclipse.jetty.servlet.ServletHandler; /solr/items/browse

java.lang.IllegalStateException: Committed

at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1144)

at org.eclipse.jetty.server.Response.sendError(Response.java:314)

at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:824)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:448)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

WARN  - 2014-03-22 02:53:52.206;
org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
recoveryInfo=RecoveryInfo{adds=44296 deletes=5996 deleteByQuery=0
errors=0 positionOfStart=98026856}

WARN  - 2014-03-22 02:53:52.232;
org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
zkNodeName=10.0.14.101:8983_solr_itemscore=items

WARN  - 2014-03-22 02:53:52.234;
org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
zkNodeName=10.0.14.101:8983_solr_usersscore=users

WARN  - 2014-03-22 02:53:53.262;
org.apache.solr.cloud.ElectionContext; cancelElection did not find
election node to remove

WARN  - 2014-03-22 02:53:53.282;
org.apache.solr.cloud.ElectionContext; cancelElection did not find
election node to remove

WARN  - 2014-03-22 02:53:53.294;
org.apache.solr.cloud.ElectionContext; cancelElection did not find
election node to remove

WARN  - 2014-03-22 03:22:51.595; org.apache.solr.core.SolrCore;
[users] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

ERROR - 2014-03-22 06:33:47.917; org.apache.solr.common.SolrException;
null:org.eclipse.jetty.io.EofException

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)

at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:147)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)

at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)

at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)

at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)

at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)

at org.apache.solr.util.FastWriter.write(FastWriter.java:55)

at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:87)

at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)

at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)

at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)

at java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)

... 51 more


ERROR - 2014-03-22 06:33:47.919; org.apache.solr.common.SolrException;
null:org.eclipse.jetty.io.EofException

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)

at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:147)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)

at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)

at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)

at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)

at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)

at org.apache.solr.util.FastWriter.write(FastWriter.java:55)

at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:87)

at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)

at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)

at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)

at java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)

... 51 more


WARN  - 2014-03-22 06:33:47.919; org.eclipse.jetty.server.Response;
Committed before 500 {msg=Connection
reset,trace=org.eclipse.jetty.io.EofException

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)

at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:147)

at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)

at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)

at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)

at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)

at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)

at org.apache.solr.util.FastWriter.write(FastWriter.java:55)

at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:87)

at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)

at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)

at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)

at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)

at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37)

at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440)

at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)

at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:744)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)

at java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)

at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182)

at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)

... 51 more


,code=500}



On Sat, Mar 22, 2014 at 12:23 PM, Software Dev
<st...@gmail.com> wrote:
> We have 2 collections with 1 shard each replicated over 5 servers in the
> cluster. We see a lot of flapping (down or recovering) on one of the
> collections. When this happens the other collection hosted on the same
> machine is still marked as active. When this happens it takes a fairly long
> time (~30 minutes) for the collection to come back online, if at all. I find
> that its usually more reliable to completely shutdown solr on the affected
> machine and bring it back up with its core disabled. We then re-enable the
> core when its marked as active.
>
> A few questions:
>
> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing
> that marks one collection as down but the other on the same machine as up?
>
> 2) Why does recovery take forever when a node goes down.. even if its only
> down for 30 seconds. Our index is only 7-8G and we are running on SSD's.
>
> 3) What can be done to diagnose and fix this problem?
>
>
>

Re: Solr Cloud collection keeps going down?

Posted by Michael Della Bitta <mi...@appinions.com>.
What kind of load are the machines under when this happens? A lot of
writes? A lot of HTTP connections?

Do your zookeeper logs mention anything about losing clients?

Have you tried turning on GC logging or profiling GC?

Have you tried running with a smaller max heap size, or
setting -XX:CMSInitiatingOccupancyFraction?
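
For reference, a minimal sketch of what those suggestions could look like
as startup options (Java 7-era HotSpot flags; the heap size, log path and
occupancy fraction below are placeholders to adapt, not tested values):

java -Xms6g -Xmx6g \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/solr/gc.log \
     -jar start.jar

The GC log will show pause times directly, which is the quickest way to
confirm or rule out long collections around the time a replica drops.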

Just a shot in the dark, since I'm not familiar with Jetty's logging
statements, but that looks like plain old dropped HTTP sockets to me.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Mar 25, 2014 at 1:13 PM, Software Dev <st...@gmail.com> wrote:

> Can anyone else chime in? Thanks
>
> On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
> <st...@gmail.com> wrote:
> > Shawn,
> >
> > Thanks for pointing me in the right direction. After consulting the
> > above document I *think* that the problem may be too large of a heap
> > and which may be affecting GC collection and hence causing ZK
> > timeouts.
> >
> > We have around 20G of memory on these machines with a min/max of heap
> > at 6, 8 respectively (-Xms6G -Xmx10G). The rest was allocated for
> > aside for disk cache. Why did we choose 6-10? No other reason than we
> > wanted to allot enough for disk cache and then everything else was
> > thrown and Solr. Does this sound about right?
> >
> > I took some screenshots for VisualVM and our NewRelic reporting as
> > well as some relevant portions of our SolrConfig.xml. Any
> > thoughts/comments would be greatly appreciated.
> >
> > http://postimg.org/gallery/4t73sdks/1fc10f9c/
> >
> > Thanks
> >
> >
> >
> >
> > On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <so...@elyograg.org> wrote:
> >> On 3/22/2014 1:23 PM, Software Dev wrote:
> >>> We have 2 collections with 1 shard each replicated over 5 servers in
> the
> >>> cluster. We see a lot of flapping (down or recovering) on one of the
> >>> collections. When this happens the other collection hosted on the same
> >>> machine is still marked as active. When this happens it takes a fairly
> long
> >>> time (~30 minutes) for the collection to come back online, if at all. I
> >>> find that its usually more reliable to completely shutdown solr on the
> >>> affected machine and bring it back up with its core disabled. We then
> >>> re-enable the core when its marked as active.
> >>>
> >>> A few questions:
> >>>
> >>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is
> failing
> >>> that marks one collection as down but the other on the same machine as
> up?
> >>>
> >>> 2) Why does recovery take forever when a node goes down.. even if its
> only
> >>> down for 30 seconds. Our index is only 7-8G and we are running on
> SSD's.
> >>>
> >>> 3) What can be done to diagnose and fix this problem?
> >>
> >> Unless you are actually using the ping request handler, the healthcheck
> >> config will not matter.  Or were you referring to something else?
> >>
> >> Referencing the logs you included in your reply:  The EofException
> >> errors happen because your client code times out and disconnects before
> >> the request it made has completed.  That is most likely just a symptom
> >> that has nothing at all to do with the problem.
> >>
> >> Read the following wiki page.  What I'm going to say below will
> >> reference information you can find there:
> >>
> >> http://wiki.apache.org/solr/SolrPerformanceProblems
> >>
> >> Relevant side note: The default zookeeper client timeout is 15 seconds.
> >>  A typical zookeeper config defines tickTime as 2 seconds, and the
> >> timeout cannot be configured to be more than 20 times the tickTime,
> >> which means it cannot go beyond 40 seconds.  The default timeout value
> >> 15 seconds is usually more than enough, unless you are having
> >> performance problems.
> >>
> >> If you are not actually taking Solr instances down, then the fact that
> >> you are seeing the log replay messages indicates to me that something is
> >> taking so much time that the connection to Zookeeper times out.  When it
> >> finally responds, it will attempt to recover the index, which means
> >> first it will replay the transaction log and then it might replicate the
> >> index from the shard leader.
> >>
> >> Replaying the transaction log is likely the reason it takes so long to
> >> recover.  The wiki page I linked above has a "slow startup" section that
> >> explains how to fix this.
> >>
> >> There is some kind of underlying problem that is causing the zookeeper
> >> connection to timeout.  It is most likely garbage collection pauses or
> >> insufficient RAM to cache the index, possibly both.
> >>
> >> You did not indicate how much total RAM you have or how big your Java
> >> heap is.  As the wiki page mentions in the SSD section, SSD is not a
> >> substitute for having enough RAM to cache at significant percentage of
> >> your index.
> >>
> >> Thanks,
> >> Shawn
> >>
>

Re: Solr Cloud collection keeps going down?

Posted by Software Dev <st...@gmail.com>.
Can anyone else chime in? Thanks

On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
<st...@gmail.com> wrote:
> Shawn,
>
> Thanks for pointing me in the right direction. After consulting the
> above document I *think* that the problem may be too large of a heap
> and which may be affecting GC collection and hence causing ZK
> timeouts.
>
> We have around 20G of memory on these machines with a min/max of heap
> at 6, 8 respectively (-Xms6G -Xmx10G). The rest was allocated for
> aside for disk cache. Why did we choose 6-10? No other reason than we
> wanted to allot enough for disk cache and then everything else was
> thrown and Solr. Does this sound about right?
>
> I took some screenshots for VisualVM and our NewRelic reporting as
> well as some relevant portions of our SolrConfig.xml. Any
> thoughts/comments would be greatly appreciated.
>
> http://postimg.org/gallery/4t73sdks/1fc10f9c/
>
> Thanks
>
>
>
>
> On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <so...@elyograg.org> wrote:
>> On 3/22/2014 1:23 PM, Software Dev wrote:
>>> We have 2 collections with 1 shard each replicated over 5 servers in the
>>> cluster. We see a lot of flapping (down or recovering) on one of the
>>> collections. When this happens the other collection hosted on the same
>>> machine is still marked as active. When this happens it takes a fairly long
>>> time (~30 minutes) for the collection to come back online, if at all. I
>>> find that its usually more reliable to completely shutdown solr on the
>>> affected machine and bring it back up with its core disabled. We then
>>> re-enable the core when its marked as active.
>>>
>>> A few questions:
>>>
>>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing
>>> that marks one collection as down but the other on the same machine as up?
>>>
>>> 2) Why does recovery take forever when a node goes down.. even if its only
>>> down for 30 seconds. Our index is only 7-8G and we are running on SSD's.
>>>
>>> 3) What can be done to diagnose and fix this problem?
>>
>> Unless you are actually using the ping request handler, the healthcheck
>> config will not matter.  Or were you referring to something else?
>>
>> Referencing the logs you included in your reply:  The EofException
>> errors happen because your client code times out and disconnects before
>> the request it made has completed.  That is most likely just a symptom
>> that has nothing at all to do with the problem.
>>
>> Read the following wiki page.  What I'm going to say below will
>> reference information you can find there:
>>
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>
>> Relevant side note: The default zookeeper client timeout is 15 seconds.
>>  A typical zookeeper config defines tickTime as 2 seconds, and the
>> timeout cannot be configured to be more than 20 times the tickTime,
>> which means it cannot go beyond 40 seconds.  The default timeout value
>> 15 seconds is usually more than enough, unless you are having
>> performance problems.
>>
>> If you are not actually taking Solr instances down, then the fact that
>> you are seeing the log replay messages indicates to me that something is
>> taking so much time that the connection to Zookeeper times out.  When it
>> finally responds, it will attempt to recover the index, which means
>> first it will replay the transaction log and then it might replicate the
>> index from the shard leader.
>>
>> Replaying the transaction log is likely the reason it takes so long to
>> recover.  The wiki page I linked above has a "slow startup" section that
>> explains how to fix this.
>>
>> There is some kind of underlying problem that is causing the zookeeper
>> connection to timeout.  It is most likely garbage collection pauses or
>> insufficient RAM to cache the index, possibly both.
>>
>> You did not indicate how much total RAM you have or how big your Java
>> heap is.  As the wiki page mentions in the SSD section, SSD is not a
>> substitute for having enough RAM to cache at significant percentage of
>> your index.
>>
>> Thanks,
>> Shawn
>>

Re: Solr Cloud collection keeps going down?

Posted by Software Dev <st...@gmail.com>.
Shawn,

Thanks for pointing me in the right direction. After consulting the
above document I *think* that the problem may be too large of a heap,
which may be affecting garbage collection and hence causing ZK
timeouts.

We have around 20G of memory on these machines with heap min/max at
6 and 10 respectively (-Xms6G -Xmx10G). The rest was set aside for disk
cache. Why did we choose 6-10? No other reason than we wanted to allot
enough for disk cache; everything else was thrown at Solr. Does this
sound about right?
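
As a rough sanity check, the split being described works out to something
like this (just arithmetic on the numbers above, not a recommendation):

Total RAM per machine:              ~20 GB
JVM heap (-Xms6G -Xmx10G):          up to 10 GB
Left for OS page cache + overhead:  ~10 GB  (index is 7-8 GB)

So even at the full 10 GB heap the index could still fit in the page
cache, provided nothing else heavy runs on the box.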

I took some screenshots from VisualVM and our New Relic reporting, as
well as some relevant portions of our solrconfig.xml. Any
thoughts/comments would be greatly appreciated.

http://postimg.org/gallery/4t73sdks/1fc10f9c/

Thanks




On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <so...@elyograg.org> wrote:
> On 3/22/2014 1:23 PM, Software Dev wrote:
>> We have 2 collections with 1 shard each replicated over 5 servers in the
>> cluster. We see a lot of flapping (down or recovering) on one of the
>> collections. When this happens the other collection hosted on the same
>> machine is still marked as active. When this happens it takes a fairly long
>> time (~30 minutes) for the collection to come back online, if at all. I
>> find that its usually more reliable to completely shutdown solr on the
>> affected machine and bring it back up with its core disabled. We then
>> re-enable the core when its marked as active.
>>
>> A few questions:
>>
>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing
>> that marks one collection as down but the other on the same machine as up?
>>
>> 2) Why does recovery take forever when a node goes down.. even if its only
>> down for 30 seconds. Our index is only 7-8G and we are running on SSD's.
>>
>> 3) What can be done to diagnose and fix this problem?
>
> Unless you are actually using the ping request handler, the healthcheck
> config will not matter.  Or were you referring to something else?
>
> Referencing the logs you included in your reply:  The EofException
> errors happen because your client code times out and disconnects before
> the request it made has completed.  That is most likely just a symptom
> that has nothing at all to do with the problem.
>
> Read the following wiki page.  What I'm going to say below will
> reference information you can find there:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Relevant side note: The default zookeeper client timeout is 15 seconds.
>  A typical zookeeper config defines tickTime as 2 seconds, and the
> timeout cannot be configured to be more than 20 times the tickTime,
> which means it cannot go beyond 40 seconds.  The default timeout value
> 15 seconds is usually more than enough, unless you are having
> performance problems.
>
> If you are not actually taking Solr instances down, then the fact that
> you are seeing the log replay messages indicates to me that something is
> taking so much time that the connection to Zookeeper times out.  When it
> finally responds, it will attempt to recover the index, which means
> first it will replay the transaction log and then it might replicate the
> index from the shard leader.
>
> Replaying the transaction log is likely the reason it takes so long to
> recover.  The wiki page I linked above has a "slow startup" section that
> explains how to fix this.
>
> There is some kind of underlying problem that is causing the zookeeper
> connection to timeout.  It is most likely garbage collection pauses or
> insufficient RAM to cache the index, possibly both.
>
> You did not indicate how much total RAM you have or how big your Java
> heap is.  As the wiki page mentions in the SSD section, SSD is not a
> substitute for having enough RAM to cache at significant percentage of
> your index.
>
> Thanks,
> Shawn
>

Re: Solr Cloud collection keeps going down?

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/22/2014 1:23 PM, Software Dev wrote:
> We have 2 collections with 1 shard each replicated over 5 servers in the
> cluster. We see a lot of flapping (down or recovering) on one of the
> collections. When this happens the other collection hosted on the same
> machine is still marked as active. When this happens it takes a fairly long
> time (~30 minutes) for the collection to come back online, if at all. I
> find that its usually more reliable to completely shutdown solr on the
> affected machine and bring it back up with its core disabled. We then
> re-enable the core when its marked as active.
> 
> A few questions:
> 
> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing
> that marks one collection as down but the other on the same machine as up?
> 
> 2) Why does recovery take forever when a node goes down.. even if its only
> down for 30 seconds. Our index is only 7-8G and we are running on SSD's.
> 
> 3) What can be done to diagnose and fix this problem?

Unless you are actually using the ping request handler, the healthcheck
config will not matter.  Or were you referring to something else?
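
For reference, the healthcheck hook Solr itself exposes is the ping handler
in solrconfig.xml; a sketch of the stock definition (the healthcheckFile
line is optional and only matters if you enable/disable it via that file):

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <!-- <str name="healthcheckFile">server-enabled.txt</str> -->
</requestHandler>

None of this affects how SolrCloud marks a replica down; that state comes
from ZooKeeper, not from the ping handler.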

Referencing the logs you included in your reply:  The EofException
errors happen because your client code times out and disconnects before
the request it made has completed.  That is most likely just a symptom
that has nothing at all to do with the problem.

Read the following wiki page.  What I'm going to say below will
reference information you can find there:

http://wiki.apache.org/solr/SolrPerformanceProblems

Relevant side note: The default zookeeper client timeout is 15 seconds.
 A typical zookeeper config defines tickTime as 2 seconds, and the
timeout cannot be configured to be more than 20 times the tickTime,
which means it cannot go beyond 40 seconds.  The default timeout value
15 seconds is usually more than enough, unless you are having
performance problems.
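
Concretely, the two settings involved look roughly like this (the 30000 ms
below is only an illustration of a raised value; as noted above the default
is 15 seconds, and ZooKeeper caps the session at 20 x tickTime):

# zoo.cfg on the ZooKeeper nodes
tickTime=2000

<!-- solr.xml, Solr 4.x style -->
<solrcloud>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>

Raising the timeout only buys headroom; it does not remove whatever pause
is causing the session to expire in the first place.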

If you are not actually taking Solr instances down, then the fact that
you are seeing the log replay messages indicates to me that something is
taking so much time that the connection to Zookeeper times out.  When it
finally responds, it will attempt to recover the index, which means
first it will replay the transaction log and then it might replicate the
index from the shard leader.

Replaying the transaction log is likely the reason it takes so long to
recover.  The wiki page I linked above has a "slow startup" section that
explains how to fix this.
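
The usual shape of that fix is a hard autoCommit with openSearcher=false,
so the transaction log rolls over regularly and there is much less to
replay; a sketch for solrconfig.xml (maxTime here is a placeholder to tune):

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

Because openSearcher is false, this only flushes segments and caps tlog
size; it does not change when documents become visible to searches.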

There is some kind of underlying problem that is causing the zookeeper
connection to time out.  It is most likely garbage collection pauses or
insufficient RAM to cache the index, possibly both.

You did not indicate how much total RAM you have or how big your Java
heap is.  As the wiki page mentions in the SSD section, SSD is not a
substitute for having enough RAM to cache a significant percentage of
your index.

Thanks,
Shawn