Posted to solr-user@lucene.apache.org by Satya Marivada <sa...@gmail.com> on 2018/06/18 13:38:11 UTC

some solr replicas down

Hi, we are using Solr 6.3.0, and a collection has 3 of 4 replicas down with
only 1 up and serving.

I see a single-line error repeating in the logs, shown below, and no other
specific exception apart from it. I am wondering what this message is
saying and whether it is the cause of the nodes being down, but I saw that
it appeared even before the replicas went down.

2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1
r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin Invalid
key request timestamp: 1529297138215 , received timestamp: 1529297151817 ,
TTL: 5000

Thanks,
Satya

Re: some solr replicas down

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/20/2018 6:39 AM, Satya Marivada wrote:
> Yes, there are some other errors saying that javabin character 2 was
> expected but 60 was returned, which is "<".

This happens when the response is an error.  Error responses are sent in
HTML format (so they render properly when viewed in a browser), and the
first character of any HTML-formatted data is the "<" character (decimal 60).
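
A rough way to see this for yourself, assuming Solr is listening on
localhost:8983 and using the poi collection from the log line above (adjust
the host, port, and any credentials for your setup):

# The first byte of a normal javabin response is the javabin version byte, 2.
# If the node instead returns an HTML error page (for example a 401 from
# PKIAuthenticationPlugin), the first byte is "<", which is decimal 60.
curl -s 'http://localhost:8983/solr/poi/select?q=*:*&wt=javabin' | head -c 1 | od -An -tu1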

----------

Extreme pauses due to garbage collection are one thing that can cause
time discrepancies in Java-based software, even when the systems
themselves are in perfect time sync.  Imagine a situation where a
request is received, a GC pause takes several seconds, and only then is
the time data in the request checked against the clock.
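
Solr 6 enables GC logging by default (solr_gc.log in the logs directory),
and if -XX:+PrintGCApplicationStoppedTime is among the flags (I believe it
is in the stock setup), something like this will surface the long pauses;
adjust the path to wherever your GC log lives:

# list stop-the-world pauses of 5 seconds or more from a JDK 8 style GC log
grep -E 'Total time for which application threads were stopped: ([5-9]|[0-9]{2,})\.' solr_gc.log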

----------

If your VMs really are out of time sync, this might apply:

The ntpd daemon is designed to make small, gradual corrections.  By
default it only slews the clock for offsets under about 128 milliseconds;
a larger offset is only stepped after a lengthy wait, and some
configurations refuse to step at all.  So if the system is off by 8-10
seconds, ntpd on its own may take a very long time to get the system
synchronized, or may never manage it.
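
A quick way to see how far off each VM actually is (run this on every
node; the offset column is reported in milliseconds):

ntpq -pn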

Redhat-based distributions of Linux deal with this problem at boot time 
by running ntpdate before ntp itself, to hard-set the system to the 
exact time returned by an NTP source.  Debian-based distributions handle 
it directly in NTP itself when NTP starts, though I do not know the 
exact details.  Either way, the time on the system is then close enough 
for the core NTP algorithm to work.

If your system is a version of Linux based on Redhat, you might be able 
to get NTP working properly with this sequence:

service ntpd stop       # stop the daemon so the clock can be hard-set
service ntpdate start   # one-shot ntpdate run that steps the clock to the NTP source
sleep 10                # give the step a moment to settle
service ntpd start      # restart ntpd to keep the clock in sync from here on

On a Debian-based system (including Ubuntu and others), this might be 
enough:

/etc/init.d/ntp restart

For other operating systems, you'll need to research the equivalent
procedure.

Depending on what virtual machine technology you're using, NTP may need
special consideration.  I think VMware requires some extra configuration
for time synchronization to work properly.

Thanks,
Shawn


Re: some solr replicas down

Posted by Chris Ulicny <cu...@iq.media>.
Having time drift longer than the TTL would definitely cause these types of
problems.

In our case, the clusters are time-synchronized and the error is still
encountered periodically.


Re: some solr replicas down

Posted by Erick Erickson <er...@gmail.com>.
We've seen this exact issue when various machines report different
wall-clock times, so getting those clocks coordinated is definitely the
very first thing I'd do.

It's particularly annoying because if the clocks are drifting apart
gradually, your setup can be running fine for days/weeks/months and then
break all of a sudden....

Best,
Erick


Re: some solr replicas down

Posted by Satya Marivada <sa...@gmail.com>.
Chris,

You are spot on with the timestamps. The date command returns different
times on these VMs, and they are not in sync with NTP. ntpstat shows a
difference of about 8-10 seconds across the 4 VMs, and that would have
caused these synchronization issues and marked the replicas as down. This
just happened recently and was working fine earlier, so I will get this
issue fixed and hope everything falls into place.

Yes, there are some other errors saying that javabin character 2 was
expected but 60 was returned, which is "<".

Thanks,
Satya


Re: some solr replicas down

Posted by Chris Ulicny <cu...@iq.media>.
Satya,

There should be some other log messages that are probably relevant to the
issue you are having. Something along the lines of "leader cannot
communicate with follower...publishing replica as down." There is also
likely a message like "expecting json/xml but got html" in another
instance's logs.

We've seen this problem in various scenarios in our own clusters, usually
during high volumes of requests, and what seems to be happening to us is
the following.

Since authentication is enabled, all requests between nodes must be
authenticated, and Solr is using a timestamp to do this (in some way, not
sure on the details). When the recipient of the request processes it, the
timestamp is checked to see if it is within the Time-To-Live (TTL)
millisecond value (default of 5000). If the timestamp is too old, the
request is rejected with the above error and a response of 401 is delivered
to the sender.
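
Your log line above shows exactly that: 1529297151817 - 1529297138215 =
13602 ms between the request timestamp and the time it was checked, well
over the 5000 ms TTL, so the request was rejected.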

When a request is sent from the leader to the follower and receives a 401
response, the leader sometimes gets overeager and declares the replica
down. In older versions (6.3.0), it seems that the replica will never
recover automatically; the fix is to manually delete the down replicas and
add new ones. Fortunately, as of 7.2.1 (maybe earlier) the down replicas
will usually start to recover at some point (and the leaders seem less
quick to declare replicas down). That said, we have had cases where they
did not recover after being down for hours on 7.2.1.
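
If you do end up having to replace a down replica manually, it can be done
with the Collections API (this sketch assumes the collection, shard, and
replica names from your log line, and that you add whatever credentials
your authentication setup requires):

# remove the dead replica, then add a fresh one
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=poi&shard=shard1&replica=core_node5'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=poi&shard=shard1'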

Likely the solution to the problem is to increase the TTL value by adding
the line

SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=######"

to the solr environment file (solr.in.sh) on each node and restarting them.
Replace ###### with some millisecond value of your choice. I'd suggest just
increasing it by intervals of 5s to start. If this does not fix your
problem, then there is likely too much pressure on your hardware for some
reason or another.

Hopefully that helps.

If anyone with more knowledge about the authentication plugin has
corrections, wants to fill in gaps, or has an idea for figuring out what
requests cause this issue, it'd be greatly appreciated.

Best,
Chris
