Posted to solr-user@lucene.apache.org by Joseph Obernberger <jo...@lovehorsepower.com> on 2015/06/04 02:39:50 UTC

Lost connection to Zookeeper

Hi All - I've run into a problem where every once in a while one or more 
of the shards (27 shard cluster) will lose connection to zookeeper and 
report "updates are disabled".  In addition to the CLUSTERSTATUS 
timeout errors, which don't seem to cause any issue, this one certainly 
does as that shard no longer takes any (you guessed it!) updates!
We are using Zookeeper with 7 nodes (7 servers in our quorum).
The stack trace is:

---------
282833508 [qtp1221263105-801058] INFO 
org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
params={wt=javabin&version=2} {add=[COLLECT20001208773720 
(1502857505963769856)]} 0 3
282837711 [qtp1221263105-802489] INFO 
org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
params={wt=javabin&version=2} {add=[COLLECT20001208773796 
(1502857510369886208)]} 0 3
282839485 [qtp1221263105-800319] INFO 
org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
params={wt=javabin&version=2} {add=[COLLECT20001208773821 
(1502857512230060032)]} 0 4
282841460 [qtp1221263105-801228] INFO 
org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
params={wt=javabin&version=2} {} 0 1
282841461 [qtp1221263105-801228] ERROR org.apache.solr.core.SolrCore  
[UNCLASS shard17 core_node17 UNCLASS] 
org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates 
are disabled.
         at 
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1474)
         at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:661)
         at 
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
         at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
         at 
org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:94)
         at 
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:96)
         at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:166)
         at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
         at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
         at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
         at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:190)
         at 
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
         at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
         at 
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
         at 
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
         at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:103)
         at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
         at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
         at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
         at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)
         at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
         at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
         at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
         at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
         at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
         at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
         at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
         at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
         at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
         at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
         at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
         at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
         at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
         at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
         at org.eclipse.jetty.server.Server.handle(Server.java:368)
         at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
         at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
         at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
         at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
         at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
         at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
         at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
         at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
         at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
         at java.lang.Thread.run(Thread.java:745)
---------

Any ideas on how to debug this, or a solution?  I believe this only 
happens when we are actively indexing, which is nearly 100% of the 
time.  I checked the zookeeper logs, but I don't see any errors there.
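
In case it helps with reproducing: the indexer is a SolrJ client doing adds in 
a loop, roughly like the sketch below. This is a simplified stand-in rather 
than our actual code - the zkHost string, collection name, and fields are 
placeholders, and I'm assuming the "Cannot talk to ZooKeeper" error reaches 
the client as an exception that can simply be retried after a pause.

---------
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class RetryingIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder zkHost and collection - not the real quorum.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr5");
        client.setDefaultCollection("UNCLASS");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "COLLECT-example-1");

        // Crude retry so a shard reporting "updates are disabled" doesn't
        // force us to drop the document outright.
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                client.add(doc);
                break;
            } catch (SolrException | SolrServerException e) {
                System.err.println("add attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(5000L * attempt);
            }
        }
        client.close();
    }
}
---------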
Thank you!

-Joe


Re: Lost connection to Zookeeper

Posted by Eirik Hungnes <hu...@rubrikk.no>.
Hi,

We are facing the same issues on our setup. 3 zk nodes, 1 shard, 10
collections, 1 replica. v. 5.0.0. default startup params.
Solr Servers: 2 core cpu, 7gb memory
Index size: 28g, 3gb heap

This setup was running on v. 4.6 before upgrading to 5 without any of these
errors. The timeout seems to happen randomly and only to 1 of the replicas
(fortunately) at a time. Joe: did you get anywhere with the perf hints?
If not, any other tips appreciated.

null:org.apache.solr.common.SolrException: CLUSTERSTATUS the collection
time out:180s
at
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:630)
at
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:582)
at
org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:932)
at
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:256)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:736)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

- Eirik


On Fri, Jun 5, 2015 at 15:58, Joseph Obernberger <
joeo@lovehorsepower.com> wrote:

> Thank you Shawn!  Yes - it is now a Solr 5.1.0 cloud on 27 nodes and we
> use the startup scripts.  The current index size is 3.0T - about 115G
> per node - index is stored in HDFS which is spread across those 27 nodes
> and about (a guess) - 256 spindles.  Each node has 26G of HDFS cache
> (MaxDirectMemorySize) allocated to Solr.  Zookeeper storage is on local
> disk.  Solr and HDFS run on the same machines. Each node is connected to
> a switch over 1G Ethernet, but the backplane is 40G.
> Do you think the clusterstatus and the zookeeper timeouts are related to
> performance issues talking to HDFS?
>
> The JVM parameters are:
> -----------------------------------------
> -DSTOP.KEY=solrrocks
> -DSTOP.PORT=8100
> -Dhost=helios
> -Djava.net.preferIPv4Stack=true
> -Djetty.port=9100
> -DnumShards=27
> -Dsolr.clustering.enabled=true
> -Dsolr.install.dir=/opt/solr
> -Dsolr.lock.type=hdfs
> -Dsolr.solr.home=/opt/solr/server/solr
> -Duser.timezone=UTC
> -DzkClientTimeout=15000
> -DzkHost=eris.querymasters.com:2181,daphnis.querymasters.com:2181,
> triton.querymasters.com:2181,oberon.querymasters.com:2181,
> portia.querymasters.com:2181,puck.querymasters.com:2181/solr5
>
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseLargePages
> -XX:+UseParNewGC
> -XX:CMSFullGCsBeforeCompaction=1
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:CMSTriggerPermRatio=80
> -XX:ConcGCThreads=8
> -XX:MaxDirectMemorySize=26g
> -XX:MaxTenuringThreshold=8
> -XX:NewRatio=3
> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 9100 /opt/solr/server/logs
> -XX:ParallelGCThreads=8
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> -Xloggc:/opt/solr/server/logs/solr_gc.log
> -Xms8g
> -Xmx16g
> -Xss256k
> -verbose:gc
> --------------------
>
> The directoryFactory is configured as follows:
>
> <directoryFactory name="DirectoryFactory"
>          class="solr.HdfsDirectoryFactory">
>          <bool name="solr.hdfs.blockcache.enabled">true</bool>
>          <int name="solr.hdfs.blockcache.slab.count">200</int>
>          <bool
> name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
>          <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
>          <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
>          <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
>          <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
>          <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">64</int>
>          <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">512</int>
>          <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr5</str>
>          <str
> name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
>      </directoryFactory>
>
> -Joe
>
> On 6/5/2015 9:34 AM, Shawn Heisey wrote:
> > On 6/3/2015 6:39 PM, Joseph Obernberger wrote:
> >> Hi All - I've run into a problem where every once in a while one or more
> >> of the shards (27 shard cluster) will lose connection to zookeeper and
> >> report "updates are disabled".  In addition to the CLUSTERSTATUS
> >> timeout errors, which don't seem to cause any issue, this one certainly
> >> does as that shard no longer takes any (you guessed it!) updates!
> >> We are using Zookeeper with 7 nodes (7 servers in our quorum).
> >> The stack trace is:
> > Other messages you have sent talk about Solr 5.x, and one of them
> > mentions a 16-node cluster with a 2.9 terabyte index, with the index
> > data stored on HDFS.
> >
> > I'm going to venture a guess that you don't have anywhere near enough
> > RAM for proper disk caching, leading to general performance issues,
> > which ultimately cause timeouts.  With HDFS, I'm not sure whether OS
> > disk cache on the Solr server matters very much, or whether that needs
> > to be on the HDFS servers.  I would guess the latter.  Also, if your
> > storage networking is gigabit or slower, HDFS may have significantly
> > more latency than local storage.  For good network storage speed, you
> > want 10gig ethernet or Infiniband.
> >
> > If it's Solr 5.x and you are using the included startup scripts, then
> > long GC pauses are probably not a major issue.  The startup scripts
> > include significant GC tuning. If you have deployed in your own
> > container, GC tuning might be an issue -- it is definitely required.
> >
> > Here is where I have written down everything I've learned about Solr
> > performance problems, most of which are due to one problem or another
> > with memory:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Is your zookeeper database on local storage or HDFS?  I would suggest
> > keeping that on local storage for optimal performance.
> >
> > Thanks,
> > Shawn
> >
> >
>
>

Re: Lost connection to Zookeeper

Posted by Joseph Obernberger <jo...@lovehorsepower.com>.
Thank you Shawn!  Yes - it is now a Solr 5.1.0 cloud on 27 nodes and we 
use the startup scripts.  The current index size is 3.0T - about 115G 
per node - index is stored in HDFS which is spread across those 27 nodes 
and about (a guess) - 256 spindles.  Each node has 26G of HDFS cache 
(MaxDirectMemorySize) allocated to Solr.  Zookeeper storage is on local 
disk.  Solr and HDFS run on the same machines. Each node is connected to 
a switch over 1G Ethernet, but the backplane is 40G.
Do you think the clusterstatus and the zookeeper timeouts are related to 
performance issues talking to HDFS?

The JVM parameters are:
-----------------------------------------
-DSTOP.KEY=solrrocks
-DSTOP.PORT=8100
-Dhost=helios
-Djava.net.preferIPv4Stack=true
-Djetty.port=9100
-DnumShards=27
-Dsolr.clustering.enabled=true
-Dsolr.install.dir=/opt/solr
-Dsolr.lock.type=hdfs
-Dsolr.solr.home=/opt/solr/server/solr
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=eris.querymasters.com:2181,daphnis.querymasters.com:2181,triton.querymasters.com:2181,oberon.querymasters.com:2181,portia.querymasters.com:2181,puck.querymasters.com:2181/solr5 

-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseLargePages
-XX:+UseParNewGC
-XX:CMSFullGCsBeforeCompaction=1
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:CMSTriggerPermRatio=80
-XX:ConcGCThreads=8
-XX:MaxDirectMemorySize=26g
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 9100 /opt/solr/server/logs
-XX:ParallelGCThreads=8
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/opt/solr/server/logs/solr_gc.log
-Xms8g
-Xmx16g
-Xss256k
-verbose:gc
--------------------
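
Since -Xloggc and -XX:+PrintGCApplicationStoppedTime are already on, one check 
I'm planning is to scan solr_gc.log for stop-the-world pauses longer than the 
15 second zkClientTimeout, since a pause that long would expire the ZooKeeper 
session. A rough sketch of that scan - it assumes the usual "Total time for 
which application threads were stopped: N seconds" lines and our log path:

---------
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseCheck {
    public static void main(String[] args) throws Exception {
        double zkClientTimeoutSec = 15.0;  // matches -DzkClientTimeout=15000
        // Lines written by -XX:+PrintGCApplicationStoppedTime look like:
        // "Total time for which application threads were stopped: 17.1234567 seconds"
        Pattern pause = Pattern.compile("stopped: ([0-9.]+) seconds");
        try (BufferedReader in = new BufferedReader(
                new FileReader("/opt/solr/server/logs/solr_gc.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = pause.matcher(line);
                if (m.find() && Double.parseDouble(m.group(1)) > zkClientTimeoutSec) {
                    System.out.println("Pause longer than zkClientTimeout: " + line);
                }
            }
        }
    }
}
---------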

The directoryFactory is configured as follows:

<directoryFactory name="DirectoryFactory"
         class="solr.HdfsDirectoryFactory">
         <bool name="solr.hdfs.blockcache.enabled">true</bool>
         <int name="solr.hdfs.blockcache.slab.count">200</int>
         <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
         <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
         <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
         <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
         <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
         <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">64</int>
         <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">512</int>
         <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr5</str>
         <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
     </directoryFactory>
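
For reference on the cache sizing: assuming the default block cache block size 
of 8 KB (solr.hdfs.blockcache.blocksize isn't overridden here), those settings 
work out to roughly 200 slabs x 16384 blocks x 8192 bytes, about 25.6 GB of 
off-heap cache, which is why MaxDirectMemorySize is set to 26g.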

-Joe

On 6/5/2015 9:34 AM, Shawn Heisey wrote:
> On 6/3/2015 6:39 PM, Joseph Obernberger wrote:
>> Hi All - I've run into a problem where every once in a while one or more
>> of the shards (27 shard cluster) will lose connection to zookeeper and
>> report "updates are disabled".  In addition to the CLUSTERSTATUS
>> timeout errors, which don't seem to cause any issue, this one certainly
>> does as that shard no longer takes any (you guessed it!) updates!
>> We are using Zookeeper with 7 nodes (7 servers in our quorum).
>> The stack trace is:
> Other messages you have sent talk about Solr 5.x, and one of them
> mentions a 16-node cluster with a 2.9 terabyte index, with the index
> data stored on HDFS.
>
> I'm going to venture a guess that you don't have anywhere near enough
> RAM for proper disk caching, leading to general performance issues,
> which ultimately cause timeouts.  With HDFS, I'm not sure whether OS
> disk cache on the Solr server matters very much, or whether that needs
> to be on the HDFS servers.  I would guess the latter.  Also, if your
> storage networking is gigabit or slower, HDFS may have significantly
> more latency than local storage.  For good network storage speed, you
> want 10gig ethernet or Infiniband.
>
> If it's Solr 5.x and you are using the included startup scripts, then
> long GC pauses are probably not a major issue.  The startup scripts
> include significant GC tuning. If you have deployed in your own
> container, GC tuning might be an issue -- it is definitely required.
>
> Here is where I have written down everything I've learned about Solr
> performance problems, most of which are due to one problem or another
> with memory:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Is your zookeeper database on local storage or HDFS?  I would suggest
> keeping that on local storage for optimal performance.
>
> Thanks,
> Shawn
>
>


Re: Lost connection to Zookeeper

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/3/2015 6:39 PM, Joseph Obernberger wrote:
> Hi All - I've run into a problem where every once in a while one or more
> of the shards (27 shard cluster) will lose connection to zookeeper and
> report "updates are disabled".  In addition to the CLUSTERSTATUS
> timeout errors, which don't seem to cause any issue, this one certainly
> does as that shard no longer takes any (you guessed it!) updates!
> We are using Zookeeper with 7 nodes (7 servers in our quorum).
> The stack trace is:

Other messages you have sent talk about Solr 5.x, and one of them
mentions a 16-node cluster with a 2.9 terabyte index, with the index
data stored on HDFS.

I'm going to venture a guess that you don't have anywhere near enough
RAM for proper disk caching, leading to general performance issues,
which ultimately cause timeouts.  With HDFS, I'm not sure whether OS
disk cache on the Solr server matters very much, or whether that needs
to be on the HDFS servers.  I would guess the latter.  Also, if your
storage networking is gigabit or slower, HDFS may have significantly
more latency than local storage.  For good network storage speed, you
want 10gig ethernet or Infiniband.

If it's Solr 5.x and you are using the included startup scripts, then
long GC pauses are probably not a major issue.  The startup scripts
include significant GC tuning. If you have deployed in your own
container, GC tuning might be an issue -- it is definitely required.

Here is where I have written down everything I've learned about Solr
performance problems, most of which are due to one problem or another
with memory:

https://wiki.apache.org/solr/SolrPerformanceProblems

Is your zookeeper database on local storage or HDFS?  I would suggest
keeping that on local storage for optimal performance.

Thanks,
Shawn


Re: Lost connection to Zookeeper

Posted by Joseph Obernberger <jo...@lovehorsepower.com>.
Any thoughts on this / any configuration items I can check? Could 
the 180 second clusterstatus timeout messages that I'm getting be 
related?  Any issue with running 7 nodes in the zookeeper quorum?  For 
reference the clusterstatus stack trace is:

org.apache.solr.common.SolrException: CLUSTERSTATUS the collection time 
out:180s
     at 
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:740)
     at 
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:692)
     at 
org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:1042)
     at 
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:259)
     at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
     at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:783)
     at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:282)
     at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
     at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
     at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
     at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
     at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
     at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
     at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
     at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
     at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
     at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
     at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
     at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
     at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
     at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
     at org.eclipse.jetty.server.Server.handle(Server.java:368)
     at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
     at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
     at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
     at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
     at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
     at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
     at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
     at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
     at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
     at java.lang.Thread.run(Thread.java:745)
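
On the quorum question - to rule out the ZooKeeper side I'm going to poll each 
server directly with the four-letter-word commands (a healthy server answers 
"imok" to "ruok"). A minimal sketch of that check; the hostnames below are 
placeholders, not our actual quorum:

---------
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkQuorumCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder hostnames - substitute the real 7-node quorum here.
        String[] servers = {"zk1.example.com", "zk2.example.com", "zk3.example.com"};
        for (String host : servers) {
            try (Socket s = new Socket(host, 2181)) {
                OutputStream out = s.getOutputStream();
                out.write("ruok".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                byte[] buf = new byte[64];
                int n = s.getInputStream().read(buf);
                String reply = n > 0 ? new String(buf, 0, n, StandardCharsets.US_ASCII) : "(no reply)";
                System.out.println(host + " -> " + reply);
            } catch (Exception e) {
                System.out.println(host + " -> unreachable: " + e.getMessage());
            }
        }
    }
}
---------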

Thanks for any thoughts!

-Joe

On 6/3/2015 8:39 PM, Joseph Obernberger wrote:
> Hi All - I've run into a problem where every once in a while one or 
> more of the shards (27 shard cluster) will lose connection to 
> zookeeper and report "updates are disabled".  In addition to the 
> CLUSTERSTATUS timeout errors, which don't seem to cause any issue, 
> this one certainly does as that shard no longer takes any (you guessed 
> it!) updates!
> We are using Zookeeper with 7 nodes (7 servers in our quorum).
> The stack trace is:
>
> ---------
> 282833508 [qtp1221263105-801058] INFO 
> org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
> core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
> params={wt=javabin&version=2} {add=[COLLECT20001208773720 
> (1502857505963769856)]} 0 3
> 282837711 [qtp1221263105-802489] INFO 
> org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
> core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
> params={wt=javabin&version=2} {add=[COLLECT20001208773796 
> (1502857510369886208)]} 0 3
> 282839485 [qtp1221263105-800319] INFO 
> org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
> core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
> params={wt=javabin&version=2} {add=[COLLECT20001208773821 
> (1502857512230060032)]} 0 4
> 282841460 [qtp1221263105-801228] INFO 
> org.apache.solr.update.processor.LogUpdateProcessor  [UNCLASS shard17 
> core_node17 UNCLASS] [UNCLASS] webapp=/solr path=/update 
> params={wt=javabin&version=2} {} 0 1
> 282841461 [qtp1221263105-801228] ERROR org.apache.solr.core.SolrCore  
> [UNCLASS shard17 core_node17 UNCLASS] 
> org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - 
> Updates are disabled.
>         at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1474)
>         at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:661)
>         at 
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>         at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>         at 
> org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:94)
>         at 
> org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:96)
>         at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:166)
>         at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
>         at 
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
>         at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
>         at 
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:190)
>         at 
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
>         at 
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
>         at 
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
>         at 
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
>         at 
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:103)
>         at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>         at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:368)
>         at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
>         at 
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>         at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
>         at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
>         at 
> org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
>         at 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>         at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>         at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Thread.java:745)
> ---------
>
> Any ideas on how to debug this, or a solution?  I believe this only 
> happens when we are actively indexing, which is nearly 100% of the 
> time.  I checked the zookeeper logs, but I don't see any errors there.
> Thank you!
>
> -Joe
>
>