You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Shawn Heisey <el...@elyograg.org> on 2018/03/01 04:33:53 UTC
Re: Configuration of SOLR Cluster

On 2/28/2018 6:54 AM, James Keeney wrote:
> I did notice one thing in the logs:
>
> 2018-02-28 13:21:58,932 [myid:1] - INFO 
> [/172.31.86.130:3888:QuorumCnxManager$Listener@743] - *Received 
> connection request /172.31.73.122:34804 <http://172.31.73.122:34804>*

<snip>

> When the restarted node attempts to reconnect with the ensemble it 
> looks like it does so on a random port. Could it be that nodes in the 
> ensemble are rejecting the new request to rejoin because they are not 
> listening on that port? And why is it not requesting on 3888:2888? 
> This is confusing to me.

That appears to be the source port.  Which is generally going to be a 
very high port and semi-unpredictable.Normal TCP operation.

> I have attached a ZK log and a SOLR log. You can watch the whole 
> progression in the ZK log as it goes from happy to disconnected to 
> trying to reconnect to part of the ensemble when the other nodes are 
> restarted. Seems like ZK holds onto a state based on the original 
> ensemble interactions and that state prevents the node from rejoining 
> the ensemble. The state is then lost with the restart which allows the 
> members to re-establish connection and form the new ensemble.

What timestamps correspond to the actions you took?

Lots and lots of connections refused.  Unless there's something 
preventing network access, I would only expect connections to be refused 
if the software isn't running or isn't listening on the destination port.

Which ZK is that log from?  The one that you shut down to begin testing, 
or one of the others?  I see some very large time gaps in the log:

=========
2018-02-26 18:28:18,066 [myid:1] - INFO 
[LearnerHandler-/172.31.73.122:57652:LearnerHandler@535] - Received 
NEWLEADER-ACK message from 3
2018-02-26 18:56:19,656 [myid:1] - INFO [SyncThread:1:FileTxnLog@203] - 
Creating new log file: log.400000001
=========

=========
2018-02-26 18:56:26,286 [myid:1] - WARN 
[SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2018-02-26 19:34:38,103 [myid:] - INFO [main:QuorumPeerConfig@134] - 
Reading configuration from: /opt/zookeeper/current/bin/../conf/zoo.cfg
=========

The first gap is nearly half an hour, the second is more than half an hour.

What happens after the second gap appears to be a program startup.  The 
things logged at 18:56:nn *might* be program shutdown, but the log 
doesn't explicitly say so.  If it is a shutdown, then the program was 
not running for quite a while.

I would definitely take this problem to the ZK mailing list.  The 
server-side problems don't involve Solr at all.  You are having problems 
with Solr, but they are completely within the ZK client code.  Likely 
both problems have the same root cause, so I'd start with the 
server-side issues.

Solr 6.6.2 contains ZK version 3.4.10.  Not the latest, but close.

Thanks,
Shawn