Posted to solr-user@lucene.apache.org by James Keeney <ne...@gmail.com> on 2018/02/27 17:57:36 UTC

Configuration of SOLR Cluster

I'm setting up a Solr cluster in the AWS cloud and I need help with the
configuration of ZooKeeper. The cluster has 3 ZK nodes and 3 Solr nodes.

There are two behaviors that are of concern:

*1 - ZK ensemble not accepting return of node*
Currently, when a ZK node in the ensemble goes down the ensemble is able to
do what it should do and keeps working. However, when I bring the 3rd node
back online the other two nodes reject connection requests from the 3rd
node until I restart the nodes. The sequence is:


   1. Bring 3rd node back on line
   2. Restart follower in existing ensemble
   3. Restart leader in existing ensemble

When this is done the third node happily becomes part of the ensemble.

*2 - Solr nodes unable to connect*
When setting up the cluster for the first time, the ensemble rejects the
Solr connection requests until ZooKeeper on the ensemble members is
restarted.

So the sequence is:


   1. Setup ensemble
   2. Bring up solr nodes
   3. Restart followers on ZK ensemble
   4. Restart leader on ZK ensemble


When I do this everything is fine and the cluster is now stable.

However, we have also seen that if we have a problem with one of the Solr
nodes that requires restarting more than one node, we have to restart ZK to
reconnect the nodes with the ensemble again.

We are trying to achieve a self-correcting cluster. In other words, we
would like to get to the point where, if a node goes down, all that is
necessary is to restart it (after the issue is resolved) and it will add
itself back into the cluster. Obviously this is an issue if ZK has to be
restarted.

Is there a configuration that I am missing? Why is ZK so finicky?

Our ZK config is very simple:

clientPort=2181
dataDir=/var/opt/zookeeper/data
tickTime=2000
autopurge.purgeInterval=24
initLimit=100
syncLimit=5
server.1=<AWS internal IP1>:2888:3888
server.2=<AWS internal IP2>:2888:3888
server.3=<AWS internal IP3>:2888:3888


Any help would be greatly appreciated.


Jim K.

-- 
Jim Keeney
President, FitterWeb
E: jim@fitterweb.com
M: 703-568-5887

*FitterWeb Consulting*
*Are you lean and agile enough? *

Re: Configuration of SOLR Cluster

Posted by Shawn Heisey <el...@elyograg.org>.
On 2/28/2018 6:54 AM, James Keeney wrote:
> I did notice one thing in the logs:
>
> 2018-02-28 13:21:58,932 [myid:1] - INFO 
> [/172.31.86.130:3888:QuorumCnxManager$Listener@743] - *Received 
> connection request /172.31.73.122:34804*

<snip>

> When the restarted node attempts to reconnect with the ensemble it 
> looks like it does so on a random port. Could it be that nodes in the 
> ensemble are rejecting the new request to rejoin because they are not 
> listening on that port? And why is it not requesting on 3888:2888? 
> This is confusing to me.

That appears to be the source port, which is generally going to be a 
very high port and semi-unpredictable.  Normal TCP operation.
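
A minimal Python sketch of that point, using only the loopback interface. The listener stands in for a ZK election port and is not ZooKeeper itself; `show_ports` is a name made up for this illustration. The connecting side gets an OS-assigned (ephemeral) source port, and that is the port the listening side sees and logs, just as 34804 shows up in the log line quoted above.

```python
import socket


def show_ports():
    """Return (listen_port, client_source_port, peer_port_seen_by_listener)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    listen_port = listener.getsockname()[1]

    # The client never chooses its own port; the OS assigns one on connect.
    client = socket.create_connection(("127.0.0.1", listen_port))
    accepted, peer = listener.accept()
    source_port = client.getsockname()[1]

    accepted.close()
    client.close()
    listener.close()
    return listen_port, source_port, peer[1]


if __name__ == "__main__":
    listen_port, source_port, peer_port = show_ports()
    print("listening on", listen_port)
    print("client connected from source port", source_port)
    print("listener saw the peer on port", peer_port)
```

The destination port is the fixed one (3888 for elections in the posted config); only the source port varies from connection to connection.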

> I have attached a ZK log and a SOLR log. You can watch the whole 
> progression in the ZK log as it goes from happy to disconnected to 
> trying to reconnect to part of the ensemble when the other nodes are 
> restarted. Seems like ZK holds onto a state based on the original 
> ensemble interactions and that state prevents the node from rejoining 
> the ensemble. The state is then lost with the restart which allows the 
> members to re-establish connection and form the new ensemble.

What timestamps correspond to the actions you took?

Lots and lots of connections refused.  Unless there's something 
preventing network access, I would only expect connections to be refused 
if the software isn't running or isn't listening on the destination port.

Which ZK is that log from?  The one that you shut down to begin testing, 
or one of the others?  I see some very large time gaps in the log:

=========
2018-02-26 18:28:18,066 [myid:1] - INFO  [LearnerHandler-/172.31.73.122:57652:LearnerHandler@535] - Received NEWLEADER-ACK message from 3
2018-02-26 18:56:19,656 [myid:1] - INFO  [SyncThread:1:FileTxnLog@203] - Creating new log file: log.400000001
=========

=========
2018-02-26 18:56:26,286 [myid:1] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2018-02-26 19:34:38,103 [myid:] - INFO  [main:QuorumPeerConfig@134] - Reading configuration from: /opt/zookeeper/current/bin/../conf/zoo.cfg
=========

The first gap is about 28 minutes; the second is about 38 minutes.

What happens after the second gap appears to be a program startup.  The 
things logged at 18:56:nn *might* be program shutdown, but the log 
doesn't explicitly say so.  If it is a shutdown, then the program was 
not running for quite a while.
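
Gaps like these can be found mechanically. The following is a rough Python sketch, assuming the log uses the timestamp format shown in the excerpts above; the function name `find_gaps` and the 10-minute threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the leading timestamp of a ZK log line, e.g. "2018-02-26 18:28:18,066".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (previous_ts, next_ts, gap) for consecutive timestamps further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        match = TS.match(line)
        if not match:
            continue  # stack-trace or continuation line without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


# The four log lines quoted above, shortened to timestamp plus message.
SAMPLE = [
    "2018-02-26 18:28:18,066 [myid:1] - INFO  Received NEWLEADER-ACK message from 3",
    "2018-02-26 18:56:19,656 [myid:1] - INFO  Creating new log file: log.400000001",
    "2018-02-26 18:56:26,286 [myid:1] - WARN  Send worker leaving thread",
    "2018-02-26 19:34:38,103 [myid:] - INFO  Reading configuration from: zoo.cfg",
]

if __name__ == "__main__":
    for start, end, gap in find_gaps(SAMPLE):
        print(start, "->", end, "gap", gap)
```

Lining the reported gaps up against the times at which nodes were stopped and started would show whether the silence is an idle server or a server that was not running.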

I would definitely take this problem to the ZK mailing list.  The 
server-side problems don't involve Solr at all.  You are having problems 
with Solr, but they are completely within the ZK client code.  Likely 
both problems have the same root cause, so I'd start with the 
server-side issues.

Solr 6.6.2 contains ZK version 3.4.10.  Not the latest, but close.

Thanks,
Shawn


Re: Configuration of SOLR Cluster

Posted by James Keeney <ne...@gmail.com>.
Shawn -

Thanks again for all your help.


   - On the AWS side, I've confirmed that all of the members of the ensemble
   are able to talk to each other. The security groups are set up so that all
   the members of the ensemble can receive all traffic from the other members
   of the ensemble.
   - The myid files are properly configured.
   - All nodes are open to all traffic from all other nodes.


I took your suggestion and upgraded all of the nodes to 3.4.11. No change
in the behavior.  However, I used this change to test out what is
happening.

This is the sequence:


   - If I stop one node in the ensemble, the remaining 2 nodes properly
   call an election and establish that there are only 2 nodes and who the
   leader is. So far so good.
   - When I restart the disconnected node though it cannot reconnect with
   the ensemble. The ensemble rejects the connection request.
   - I then restart the remaining 2 nodes and they all are able to connect
   again and the full ensemble is restored


I did notice one thing in the logs:

2018-02-28 13:21:58,932 [myid:1] - INFO  [/172.31.86.130:3888:QuorumCnxManager$Listener@743] - *Received connection request /172.31.73.122:34804*
2018-02-28 13:21:58,934 [myid:1] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker
2018-02-28 13:21:58,934 [myid:1] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@941] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)

When the restarted node attempts to reconnect with the ensemble it looks
like it does so on a random port. Could it be that nodes in the ensemble
are rejecting the new request to rejoin because they are not listening on
that port? And why is it not requesting on 3888:2888? This is confusing to
me.

I have attached a ZK log and a SOLR log. You can watch the whole
progression in the ZK log as it goes from happy to disconnected to trying
to reconnect to part of the ensemble when the other nodes are restarted.
Seems like ZK holds onto a state based on the original ensemble
interactions and that state prevents the node from rejoining the ensemble.
The state is then lost with the restart which allows the members to
re-establish connection and form the new ensemble.

You are right. This is definitely a ZK thing. Solr just observes that it
can no longer connect to one of the members of the ensemble in the list it
received. Solr appears to get progressively upset about the fact until
finally it throws an exception and returns to complaining.

Let me know if you want me to take it over to the ZK mailing list.

Jim K.








Re: Configuration of SOLR Cluster

Posted by Shawn Heisey <el...@elyograg.org>.
On 2/27/2018 6:42 PM, James Keeney wrote:
> -DzkHost=<ZK Host internal IP 1>:2181,<ZK Host internal IP 2>:2181,<ZK Host
> internal IP 1>:2181

This looks correct, except that with AWS, I have no idea whether you 
need the internal IP addressing or the external IP addressing.  If all 
of the machines involved (both servers and clients) are able to 
communicate on the internal addresses, then that should be fine.  You 
might want to discuss the IP addressing with Amazon just to make sure.

> java.net.ConnectException: Connection refused

All of the logs you included look like they have this message -- 
connection refused.  Normally this happens when the software isn't 
running -- the OS refuses connections when no software is listening on a 
TCP port.  Sometimes firewalls can refuse connections, but more commonly 
they just drop the traffic silently, and the system starting the 
connection has to wait for a timeout and never gets any kind of 
response.  In this case, there IS a response -- the connection is refused.
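
The distinction between "refused" and "silently dropped" can be checked from any client machine. Below is a small Python sketch under the assumption that a plain TCP connect is enough to tell the cases apart; `probe` is a hypothetical helper name, and the 3-second timeout is arbitrary.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout', or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # something accepted the connection
    except ConnectionRefusedError:
        return "refused"  # the remote OS answered: nothing is listening on that port
    except socket.timeout:
        return "timeout"  # no answer at all, typical of a firewall dropping packets
    except OSError as exc:
        return "error: {}".format(exc)


# Usage (addresses are placeholders for the ZK nodes):
#   for port in (2181, 2888, 3888):
#       print(port, probe("10.0.0.1", port))
```

Note that in a running ensemble port 2888 is normally only open on the current leader, so a "refused" on 2888 for a follower would not by itself indicate a problem.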

It looks like you've pasted parts of the log, but I was actually hoping 
for entire logfiles, or at least entire sections of logfiles, to see 
errors in context with non-errors, and to be sure that nothing is lost, 
and that the formatting isn't destroyed by inclusion in an email 
message.  A paste website or a file sharing website is often the best 
way to share that kind of information.  If you need to redact 
information from the files, please do so in a way that preserves the 
ability to decipher the log.  For IP addresses, you could just redact 
the first two octets and leave the last two -- although if they are 
private addresses, you could leave them intact.
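
As a sketch of that kind of redaction in Python: the regular expression below replaces the first two octets of anything that looks like an IPv4 address and leaves the rest of the line alone. The `x.x` replacement text and the `redact` name are arbitrary choices for illustration.

```python
import re

# Four dot-separated groups of 1-3 digits; deliberately loose, since this is
# only meant for scrubbing log text, not for validating addresses.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def redact(line):
    """Replace the first two octets of each IPv4 address, keeping the last two."""
    return IPV4.sub(r"x.x.\3.\4", line)


if __name__ == "__main__":
    sample = "[/172.31.86.130:3888] - Received connection request /172.31.73.122:34804"
    print(redact(sample))
```

Because the last two octets and the ports survive, different hosts remain distinguishable in the redacted log.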

My instinct here is to think there's either a fundamental networking 
issue (firewalls, other problems), or that there may be some kind of 
problem with ZK.  What version of ZK are you using on the servers, and 
what version of Solr is it?

My instincts could be wrong because of a limited understanding of how ZK 
functions.

My recommendation would be to run ZK version 3.4.11 on your servers.  
Each new release of ZK has a very impressive list of fixed bugs.  The 
client ZK version will depend on the Solr version, since the ZK jar is 
part of Solr.

I looked at your ZK server config.  Your initLimit value is ten times 
what the default config for the embedded ZK in Solr is. Based on the 
comment in the embedded ZK config, that's probably not a problem, but I 
can't say for sure without more ZK knowledge.  The other parts of the 
config seem normal enough.
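
For reference, initLimit and syncLimit are counted in ticks, so their effect depends on tickTime. A short Python sketch of the arithmetic, using the values from the posted config (`parse_cfg` and `limits_ms` are hypothetical helper names):

```python
def parse_cfg(text):
    """Parse key=value lines from a zoo.cfg-style text, skipping blanks and comments."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def limits_ms(cfg):
    """Convert initLimit and syncLimit from ticks to milliseconds."""
    tick = int(cfg["tickTime"])
    return {
        "initLimit_ms": tick * int(cfg["initLimit"]),
        "syncLimit_ms": tick * int(cfg["syncLimit"]),
    }


POSTED = """
tickTime=2000
initLimit=100
syncLimit=5
"""

if __name__ == "__main__":
    print(limits_ms(parse_cfg(POSTED)))
```

With the posted values that works out to 200,000 ms (200 seconds) for followers to connect and sync with the leader initially, and 10,000 ms (10 seconds) for syncLimit.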

Are you configuring the "myid" file in each ZK server's data directory, 
and does the value on each server correspond to the line in the ZK 
config for that server?  I assume you probably have this correct, 
because ZK probably wouldn't work at all if it wasn't right.
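
A quick way to verify this on each server, sketched in Python. It assumes the myid file sits directly in the configured dataDir; `server_ids` and `check_myid` are made-up names, and the paths in the usage comment simply reuse the ones posted in this thread.

```python
import re
from pathlib import Path


def server_ids(cfg_text):
    """Map each N from 'server.N=...' lines to its address string."""
    ids = {}
    for line in cfg_text.splitlines():
        match = re.match(r"\s*server\.(\d+)\s*=\s*(.+?)\s*$", line)
        if match:
            ids[int(match.group(1))] = match.group(2)
    return ids


def check_myid(cfg_text, data_dir):
    """Return (myid, matching server line or None) for this machine."""
    myid = int((Path(data_dir) / "myid").read_text().strip())
    return myid, server_ids(cfg_text).get(myid)


# Usage on one ZK host (paths taken from the posted config and log):
#   cfg = Path("/opt/zookeeper/current/conf/zoo.cfg").read_text()
#   print(check_myid(cfg, "/var/opt/zookeeper/data"))
```

If the second element comes back as `None`, the id in the myid file has no server.N entry; if it names a different machine's address, the ids are swapped.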

I really don't know what might be going on.  Maybe with more complete 
logs I might spot something, but I don't know.

Thanks,
Shawn


Re: Configuration of SOLR Cluster

Posted by James Keeney <ne...@gmail.com>.
Shawn -

First, it's good to know that this is unusual behavior. That actually helps
as it lets me know that I should keep digging.

Here are a couple of things that might help.

In the configuration I am calling out all three ZK nodes. Here is the
configuration of Solr:

-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dhost=solr2
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dlog4j.configuration=file:/data/solr/log4j.properties
-Dsolr.install.dir=/opt/solr
-Dsolr.log.dir=/data/solr/logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=/data/solr/data
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=<ZK Host internal IP 1>:2181,<ZK Host internal IP 2>:2181,<ZK Host internal IP 1>:2181
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:GCLogFileSize=20M
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /data/solr/logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/data/solr/logs/solr_gc.log
-Xms2G
-Xmx6G
-Xss1024k
-Xss256k
-verbose:gc


Here are the types of Solr errors I receive when this happens. I was able
to determine that it was not a security problem using telnet to connect to
port 2181 on the ZK nodes.

2018-02-26 19:58:50.964 WARN  (main-SendThread(<internal IP>:2181)) [   ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-02-26 19:58:52.894 WARN  (main-SendThread(<internal IP>:2181)) [   ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-02-26 19:58:53.456 WARN  (main-SendThread(<internal IP>:2181)) [   ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)


And here are the errors when the ZK nodes are not able to connect to each
other.


2018-02-26 19:57:25,554 [myid:2] - WARN  [WorkerSender[myid=2]:QuorumCnxManager@588] - Cannot open channel to 1 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
    at java.lang.Thread.run(Thread.java:748)
2018-02-26 19:57:25,554 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>
2018-02-26 19:57:25,554 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0xa00000013 (n.zxid), 0x4 (n.round), LOOKING (n.state), 2 (n.sid), 0xa (n.peerEpoch) LOOKING (my state)
2018-02-26 19:57:25,556 [myid:2] - WARN  [WorkerSender[myid=2]:QuorumCnxManager@588] - Cannot open channel to 3 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
    at java.lang.Thread.run(Thread.java:748)
2018-02-26 19:57:25,556 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>
2018-02-26 19:57:25,756 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@588] - Cannot open channel to 1 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:614)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:913)
2018-02-26 19:57:25,757 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>
2018-02-26 19:57:25,812 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@588] - Cannot open channel to 3 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:614)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:913)


Let me know if you need anything else or if I should take my request over
to ZooKeeper.


Thanks.

Jim K.




Re: Configuration of SOLR Cluster

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/27/2018 10:57 AM, James Keeney wrote:
> *1 - ZK ensemble not accepting return of node*
> Currently, when a ZK node in the ensemble goes down the ensemble is able to
> do what it should do and keeps working. However when I bring the 3rd node
> back online the other two nodes reject connection requests from the 3rd
> node until I restart the nodes. The sequence is:
>
>    1. Bring 3rd node back on line
>    2. Restart follower in existing ensemble
>    3. Restart leader in existing ensemble
>
> When this is done the third node happily becomes part of the ensemble.

From what I understand, restarting the other nodes should not be
required.  If everything is configured properly, I don't think that
should be happening, but I don't have deep ZK knowledge.

> *2 - Solr nodes unable to connect*
> When setting up the cluster for the first time the ensemble rejects the
> solr connection requests until the ZK on the ZK ensemble members is
> restarted.

<snip>

> However, we have also seen that if we have a problem with one of the Solr
> nodes that requires restarting more than one node we have to restart ZK to
> reconnect the nodes with the ensemble again.

These problems sound very weird too.  I wish I had some idea, but
without logs showing what kind of errors are encountered, I have no idea
what's happening.

None of these problems are in Solr code.  Solr uses the ZooKeeper client
code without modification.  All the ZK communication is done in ZK code,
initialized with the zkHost string and a few other config bits (like
zkClientTimeout) provided to Solr at startup.

If you want to share the Solr log and the ZK server logs covering the
timeframe when the problems happen, maybe we can find something useful
and at least point you towards the problem, but even then, you may have
to talk to the ZooKeeper mailing list for real help, and they'll want
the same logs.

Are you informing Solr about all three of your ZK hosts when you start
it up?  That is a requirement.  If the zkHost string you send to Solr
doesn't list all your servers, then the ZK client inside Solr will not
be able to fail over correctly.  The version of ZK that Solr includes is
not able to dynamically change the servers that it talks to, and the
version of ZK that *does* have dynamic reconfiguration is still in
beta.  Solr is not going to include ZK 3.5.x until they put out a stable
release.  I don't know when they're going to do that.  It could be soon,
or it could be several months out.  The ZK project does NOT make
frequent releases.
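
A small Python sketch for sanity-checking a zkHost string before handing it to Solr. It assumes the usual comma-separated host:port list with an optional trailing /chroot; `parse_zk_host` and `check_zk_host` are made-up names and the addresses in the example are placeholders. It reports repeated entries and a server count that differs from the expected ensemble size.

```python
def parse_zk_host(zk_host, default_port=2181):
    """Split a zkHost string into (host, port) pairs, ignoring a trailing /chroot."""
    hosts_part = zk_host.split("/", 1)[0]
    pairs = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if ":" in entry:
            host, port = entry.rsplit(":", 1)
            pairs.append((host, int(port)))
        else:
            pairs.append((entry, default_port))
    return pairs


def check_zk_host(zk_host, expected_count=3):
    """Return a list of human-readable problems; an empty list means none were found."""
    pairs = parse_zk_host(zk_host)
    problems = []
    dupes = sorted({p for p in pairs if pairs.count(p) > 1})
    if dupes:
        problems.append("duplicate entries: {}".format(dupes))
    if len(set(pairs)) != expected_count:
        problems.append(
            "expected {} distinct servers, found {}".format(expected_count, len(set(pairs)))
        )
    return problems


if __name__ == "__main__":
    # Placeholder addresses; the first server is listed twice on purpose.
    print(check_zk_host("10.0.0.1:2181,10.0.0.2:2181,10.0.0.1:2181"))
```

A string that names the same server twice and omits another still starts Solr without complaint, but leaves the client with fewer servers to fail over to, which is why a check like this may be worth running.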

Thanks,
Shawn