Posted to solr-user@lucene.apache.org by Vadim Ivanov <va...@spb.ntk-intourist.ru> on 2018/10/24 09:06:55 UTC

TLOG replica stuck

Hi All !

I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.

My collection has shards and every shard has 3 TLOG replicas on different
nodes.

I've noticed that some replicas stop receiving updates from the leader
without any visible signs from the cluster status.

(All replicas appear active and green in the Admin UI CLOUD graph.) But the
indexversion of the 'ill' replica does not advance with the leader's.

This seems dangerous, because that 'ill' replica could become the leader
after a restart of the nodes, and I have already experienced data loss this way.

I didn't notice any meaningful records in the solr log, except that the
problem probably occurs when the leader changes.

Meanwhile, I monitor the indexversion of all replicas in the cluster via
MBeans and recreate an ill replica when its difference from the leader's
indexversion is more than one.
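A minimal sketch of that lag check (the replica names and indexversion values below are illustrative; in practice the versions would be collected from each core's MBeans or its /replication?command=indexversion endpoint):

```python
# Sketch of the monitor described above: given the indexversion reported by
# each replica, flag any replica whose version trails the leader's by more
# than the allowed lag. All names and numbers here are hypothetical.

def lagging_replicas(versions, leader, max_lag=1):
    """Return replicas whose indexversion trails the leader's by > max_lag."""
    leader_version = versions[leader]
    return sorted(
        name
        for name, version in versions.items()
        if name != leader and leader_version - version > max_lag
    )

versions = {
    "rpk94_1_0_07": 1540370000000,  # leader
    "rpk94_1_0_08": 1540370000000,  # in sync
    "rpk94_1_0_09": 1540369000000,  # stuck: far behind the leader
}
print(lagging_replicas(versions, leader="rpk94_1_0_07"))
# -> ['rpk94_1_0_09']
```

A replica flagged this way would then be dropped and re-added, as described above.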

Any suggestions?

-- 

Best regards, Vadim

 


Re: TLOG replica stuck

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/2/2018 3:12 AM, Vadim Ivanov wrote:
> It seems to me that issue related with:
> - restart solr node
> - rebalance leader
> - reload collection
> - reload core (Core admin is not forbidden but seems obsolete in SolrCloud)

In SolrCloud, CoreAdmin is an expert option.  Many of the things that 
the Collections API does are implemented internally with code that 
includes calls to the CoreAdmin API ... but using CoreAdmin directly is 
strongly discouraged, especially for anything related to manipulating 
replicas or creating indexes.  It is possible to use CoreAdmin for many 
of these things successfully, but it's also very easy to use it 
incorrectly and cause problems that are difficult to fix.  We recommend 
not using it at all, even if you're intimately familiar with the SolrCloud code.

When you reload a collection, all cores (shard replicas) that make up 
the collection are reloaded, even if they are on separate machines.  So 
you do not need to use CoreAdmin to do a reload.  Situations where one 
core in a collection needs a reload but other cores do not are rare.
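For reference, that collection-level reload is a single Collections API call; a sketch of composing it (host and collection names are hypothetical):

```python
# Sketch: composing a Collections API RELOAD request, which reloads every
# core of the collection on every node, so no CoreAdmin calls are needed.
from urllib.parse import urlencode

def reload_collection_url(base_url, collection):
    """Build the Collections API RELOAD URL for a collection."""
    params = urlencode({"action": "RELOAD", "name": collection})
    return f"{base_url}/solr/admin/collections?{params}"

print(reload_collection_url("http://mysolr07:8983", "rpk94"))
# -> http://mysolr07:8983/solr/admin/collections?action=RELOAD&name=rpk94
```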

None of what I've written above addresses the problem that started the
thread; it's about your note in parentheses in this message.

I don't know any more than the other people responding do about why your 
replica is getting out of sync.  If you can come up with simple step by 
step instructions for reproducing the problem that begin with "download 
the X.Y.Z binary version of Solr", that will make it much easier to 
diagnose.  Until the issue can be seen first-hand and there's something 
useful in Solr's log, we're guessing about what could be going wrong.  
Once we can reproduce it, the odds of getting you a new version that 
doesn't have the problem go up significantly.

Thanks,
Shawn


RE: TLOG replica stuck

Posted by Vadim Ivanov <va...@spb.ntk-intourist.ru>.
It seems to me that the issue is related to:
- restart solr node
- rebalance leader
- reload collection
- reload core (CoreAdmin is not forbidden but seems obsolete in SolrCloud)
If nothing changes in the cluster state, everything goes smoothly.
Maybe it can be reproduced with the same test as in the "SolrCloud Replication Failure" thread.
-- Vadim

> -----Original Message-----
> From: Ere Maijala [mailto:ere.maijala@helsinki.fi]
> Sent: Thursday, November 01, 2018 5:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: TLOG replica stucks
> 
> Could it be related to reloading a collection? I need to do some
> testing, but it just occurred to me that reload was done at least once
> during the period the cluster had been up.
> 
> Regards,
> Ere
> 
> Ere Maijala kirjoitti 30.10.2018 klo 12.03:
> > Hi,
> >
> > We had the same happen with PULL replicas with Solr 7.5. Solr was
> > showing that they all had correct index version, but the changes were
> > not showing. Unfortunately the solr.log size was too small to catch any
> > issues, so I've now increased and waiting for it to happen again.
> >
> > Regards,
> > Ere
> >
> > Vadim Ivanov kirjoitti 25.10.2018 klo 18.42:
> >> Thanks Erick for you attention!
> >> My comments below, but supposing that the problem resides in zookeeper
> >> I'll collect more information  from zk logs and solr logs and be back
> >> soon.
> >>
> >>> bq. I've noticed that some replicas stop receiving updates from the
> >>> leader without any visible signs from the cluster status.
> >>>
> >>> Hmm, yes, this isn't expected at all. What are you seeing that causes
> >>> you to say this? You'd have to be monitoring the log for update
> >>> messages to the replicas that aren't leaders or the like.  If anyone is
> >>> going to have a prayer of reproducing we'll need more info on exactly
> >>> what you're seeing and how you're measuring this.
> >>
> >> Meanwhile, I have log level WARN... I'l decrease  it to INFO and see. Tnx
> >>
> >>>
> >>> Have you changed any configurations in your replicas at all? We'd need
> >>> the exact steps you performed if so.
> >> Command to create replicas was like this (implicit sharding and custom
> >> CoreName ) :
> >>
> >> mysolr07:8983/solr/admin/collections?action=ADDREPLICA
> >>     &collection=rpk94
> >>     &shard=rpk94_1_0
> >>     &property.name=rpk94_1_0_07
> >>     &type=tlog
> >>     &node=mysolr07:8983_solr
> >>
> >>>
> >>> On a quick test I didn't see this, but if it were that easy to
> >>> reproduce I'd expect it to have shown up before.
> >>
> >> Yesterday I've tried to reproduce...  trying to change leader with
> >> REBALANCELEADERS command.
> >> It ended up with no leader at all for the shard  and I could not set
> >> leader at all for a long time.
> >>
> >>     There was a problem trying to register as the
> >> leader:org.apache.solr.common.SolrException: Could not register as the
> >> leader because creating the ephemeral registration node in ZooKeeper
> >> failed
> >> ...
> >>     Deleting duplicate registration:
> >>
> /collections/rpk94/leader_elect/rpk94_1_117/election/298318118789952308
> 5-core_node73-n_0000000022
> >>
> >> ...
> >>    Index fetch failed :org.apache.solr.common.SolrException: No
> >> registered leader was found after waiting for 4000ms , collection:
> >> rpk94 slice: rpk94_1_117
> >> ...
> >>
> >> Even to delete all replicas for the shard and recreate Replica to the
> >> same node with the same name did not help - no leader for that shard.
> >> I had to delete collection, wait till morning and then it recreated
> >> successfully.
> >> Suppose some weird znodes were deleted from  zk by morning.
> >>
> >>>
> >>> NOTE: just looking at the cloud graph and having a node be active is
> >>> not _necessarily_ sufficient for the node to be up to date. It
> >>> _should_ be sufficient if (and only if) the node was shut down
> >>> gracefully, but a "kill -9" or similar doesn't give the replicas on
> >>> the node the opportunity to change the state. The "live_nodes" znode
> >>> in ZooKeeper must also contain the node the replica resides on.
> >>
> >> Node was live, cluster was healthy
> >>
> >>>
> >>> If you see this state again, you could try pinging the node directly,
> >>> does it respond? Your URL should look something like:
> >>>
> http://host:port/solr/colection_shard1_replica_t1/query?q=*:*&distrib=false
> >>>
> >>
> >> Yes, sure I did. Ill replica responded and number of documents differs
> >> with the leader
> >>
> >>>
> >>> The "distrib=false" is important as it won't forward the query to any
> >>> other replica. If what you're reporting is really happening, that node
> >>> should respond with a document count different from other nodes.
> >>>
> >>> NOTE: there's a delay between the time the leader indexes a doc and
> >>> it's visible on the follower. Are you sure you're waiting for
> >>> leader_commit_interval+polling_interval+autowarm_time before
> >>> concluding that there's a problem? I'm a bit suspicious that checking
> >>> the versions is concluding that your indexes are out of sync when
> >>> really they're just catching up normally. If it's at all possible to
> >>> turn off indexing for a few minutes when this happens and everything
> >>> just gets better then it's not really a problem.
> >>
> >> Sure, the problem was on many shards but not on all shards
> >> and for the long time.
> >>
> >>>
> >>> If we prove out that this is really happening as you think, then a
> >>> JIRA (with steps to reproduce) is _definitely_ in order.
> >>>
> >>> Best,
> >>> Erick
> >>> On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
> >>> <va...@spb.ntk-intourist.ru> wrote:
> >>>>
> >>>> Hi All !
> >>>>
> >>>> I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.
> >>>>
> >>>> My collection has shards and every shard has 3 TLOG replicas on
> >>>> different
> >>>> nodes.
> >>>>
> >>>> I've noticed that some replicas stop receiving updates from the leader
> >>>> without any visible signs from the cluster status.
> >>>>
> >>>> (all replicas active and green in Admin UI CLOUD graph). But
> >>>> indexversion of
> >>>> 'ill' replica not increasing with the leader.
> >>>>
> >>>> It seems to be dangerous, because that 'ill' replica could become a
> >>>> leader
> >>>> after restart of the nodes and I already experienced data loss.
> >>>>
> >>>> I didn't notice any meaningfull records in solr log, except that
> >>>> probably
> >>>> problem occurs when leader changes.
> >>>>
> >>>> Meanwhile, I monitor indexversion of all replicas in a cluster by
> >>>> mbeans and
> >>>> recreate ill replicas when difference with the leader indexversion
> >>>> more
> >>>> than one
> >>>>
> >>>> Any suggestions?
> >>>>
> >>>> --
> >>>>
> >>>> Best regards, Vadim
> >>>>
> >>>>
> >>>>
> >>
> >
> 
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland


Re: TLOG replica stuck

Posted by Ere Maijala <er...@helsinki.fi>.
Could it be related to reloading a collection? I need to do some 
testing, but it just occurred to me that reload was done at least once 
during the period the cluster had been up.

Regards,
Ere

Ere Maijala kirjoitti 30.10.2018 klo 12.03:
> Hi,
> 
> We had the same happen with PULL replicas with Solr 7.5. Solr was 
> showing that they all had correct index version, but the changes were 
> not showing. Unfortunately the solr.log size was too small to catch any 
> issues, so I've now increased and waiting for it to happen again.
> 
> Regards,
> Ere
> 
> Vadim Ivanov kirjoitti 25.10.2018 klo 18.42:
>> Thanks Erick for you attention!
>> My comments below, but supposing that the problem resides in zookeeper
>> I'll collect more information  from zk logs and solr logs and be back 
>> soon.
>>
>>> bq. I've noticed that some replicas stop receiving updates from the
>>> leader without any visible signs from the cluster status.
>>>
>>> Hmm, yes, this isn't expected at all. What are you seeing that causes
>>> you to say this? You'd have to be monitoring the log for update
>>> messages to the replicas that aren't leaders or the like.  If anyone is
>>> going to have a prayer of reproducing we'll need more info on exactly
>>> what you're seeing and how you're measuring this.
>>
>> Meanwhile, I have log level WARN... I'l decrease  it to INFO and see. Tnx
>>
>>>
>>> Have you changed any configurations in your replicas at all? We'd need
>>> the exact steps you performed if so.
>> Command to create replicas was like this (implicit sharding and custom 
>> CoreName ) :
>>
>> mysolr07:8983/solr/admin/collections?action=ADDREPLICA
>>     &collection=rpk94
>>     &shard=rpk94_1_0
>>     &property.name=rpk94_1_0_07
>>     &type=tlog
>>     &node=mysolr07:8983_solr
>>
>>>
>>> On a quick test I didn't see this, but if it were that easy to
>>> reproduce I'd expect it to have shown up before.
>>
>> Yesterday I've tried to reproduce...  trying to change leader with 
>> REBALANCELEADERS command.
>> It ended up with no leader at all for the shard  and I could not set 
>> leader at all for a long time.
>>
>>     There was a problem trying to register as the 
>> leader:org.apache.solr.common.SolrException: Could not register as the 
>> leader because creating the ephemeral registration node in ZooKeeper 
>> failed
>> ...
>>     Deleting duplicate registration: 
>> /collections/rpk94/leader_elect/rpk94_1_117/election/2983181187899523085-core_node73-n_0000000022 
>>
>> ...
>>    Index fetch failed :org.apache.solr.common.SolrException: No 
>> registered leader was found after waiting for 4000ms , collection: 
>> rpk94 slice: rpk94_1_117
>> ...
>>
>> Even to delete all replicas for the shard and recreate Replica to the 
>> same node with the same name did not help - no leader for that shard.
>> I had to delete collection, wait till morning and then it recreated 
>> successfully.
>> Suppose some weird znodes were deleted from  zk by morning.
>>
>>>
>>> NOTE: just looking at the cloud graph and having a node be active is
>>> not _necessarily_ sufficient for the node to be up to date. It
>>> _should_ be sufficient if (and only if) the node was shut down
>>> gracefully, but a "kill -9" or similar doesn't give the replicas on
>>> the node the opportunity to change the state. The "live_nodes" znode
>>> in ZooKeeper must also contain the node the replica resides on.
>>
>> Node was live, cluster was healthy
>>
>>>
>>> If you see this state again, you could try pinging the node directly,
>>> does it respond? Your URL should look something like:
>>> http://host:port/solr/colection_shard1_replica_t1/query?q=*:*&distrib=false 
>>>
>>
>> Yes, sure I did. Ill replica responded and number of documents differs 
>> with the leader
>>
>>>
>>> The "distrib=false" is important as it won't forward the query to any
>>> other replica. If what you're reporting is really happening, that node
>>> should respond with a document count different from other nodes.
>>>
>>> NOTE: there's a delay between the time the leader indexes a doc and
>>> it's visible on the follower. Are you sure you're waiting for
>>> leader_commit_interval+polling_interval+autowarm_time before
>>> concluding that there's a problem? I'm a bit suspicious that checking
>>> the versions is concluding that your indexes are out of sync when
>>> really they're just catching up normally. If it's at all possible to
>>> turn off indexing for a few minutes when this happens and everything
>>> just gets better then it's not really a problem.
>>
>> Sure, the problem was on many shards but not on all shards
>> and for the long time.
>>
>>>
>>> If we prove out that this is really happening as you think, then a
>>> JIRA (with steps to reproduce) is _definitely_ in order.
>>>
>>> Best,
>>> Erick
>>> On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
>>> <va...@spb.ntk-intourist.ru> wrote:
>>>>
>>>> Hi All !
>>>>
>>>> I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.
>>>>
>>>> My collection has shards and every shard has 3 TLOG replicas on 
>>>> different
>>>> nodes.
>>>>
>>>> I've noticed that some replicas stop receiving updates from the leader
>>>> without any visible signs from the cluster status.
>>>>
>>>> (all replicas active and green in Admin UI CLOUD graph). But 
>>>> indexversion of
>>>> 'ill' replica not increasing with the leader.
>>>>
>>>> It seems to be dangerous, because that 'ill' replica could become a 
>>>> leader
>>>> after restart of the nodes and I already experienced data loss.
>>>>
>>>> I didn't notice any meaningfull records in solr log, except that 
>>>> probably
>>>> problem occurs when leader changes.
>>>>
>>>> Meanwhile, I monitor indexversion of all replicas in a cluster by 
>>>> mbeans and
>>>> recreate ill replicas when difference with the leader indexversion  
>>>> more
>>>> than one
>>>>
>>>> Any suggestions?
>>>>
>>>> -- 
>>>>
>>>> Best regards, Vadim
>>>>
>>>>
>>>>
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Re: TLOG replica stuck

Posted by Ere Maijala <er...@helsinki.fi>.
Hi,

We had the same happen with PULL replicas on Solr 7.5. Solr was 
showing that they all had the correct index version, but the changes were 
not showing up. Unfortunately the solr.log size was too small to catch any 
issues, so I've now increased it and am waiting for it to happen again.

Regards,
Ere

Vadim Ivanov kirjoitti 25.10.2018 klo 18.42:
> Thanks Erick for you attention!
> My comments below, but supposing that the problem resides in zookeeper
> I'll collect more information  from zk logs and solr logs and be back soon.
> 
>> bq. I've noticed that some replicas stop receiving updates from the
>> leader without any visible signs from the cluster status.
>>
>> Hmm, yes, this isn't expected at all. What are you seeing that causes
>> you to say this? You'd have to be monitoring the log for update
>> messages to the replicas that aren't leaders or the like.  If anyone is
>> going to have a prayer of reproducing we'll need more info on exactly
>> what you're seeing and how you're measuring this.
> 
> Meanwhile, I have log level WARN... I'l decrease  it to INFO and see. Tnx
> 
>>
>> Have you changed any configurations in your replicas at all? We'd need
>> the exact steps you performed if so.
> Command to create replicas was like this (implicit sharding and custom CoreName ) :
> 
> mysolr07:8983/solr/admin/collections?action=ADDREPLICA
> 	&collection=rpk94
> 	&shard=rpk94_1_0
> 	&property.name=rpk94_1_0_07
> 	&type=tlog
> 	&node=mysolr07:8983_solr
> 
>>
>> On a quick test I didn't see this, but if it were that easy to
>> reproduce I'd expect it to have shown up before.
> 
> Yesterday I've tried to reproduce...  trying to change leader with REBALANCELEADERS command.
> It ended up with no leader at all for the shard  and I could not set leader at all for a long time.
> 
>     There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
> ...
>     Deleting duplicate registration: /collections/rpk94/leader_elect/rpk94_1_117/election/2983181187899523085-core_node73-n_0000000022
> ...
>    Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: rpk94 slice: rpk94_1_117
> ...
> 
> Even to delete all replicas for the shard and recreate Replica to the same node with the same name did not help - no leader for that shard.
> I had to delete collection, wait till morning and then it recreated successfully.
> Suppose some weird znodes were deleted from  zk by morning.
> 
>>
>> NOTE: just looking at the cloud graph and having a node be active is
>> not _necessarily_ sufficient for the node to be up to date. It
>> _should_ be sufficient if (and only if) the node was shut down
>> gracefully, but a "kill -9" or similar doesn't give the replicas on
>> the node the opportunity to change the state. The "live_nodes" znode
>> in ZooKeeper must also contain the node the replica resides on.
> 
> Node was live, cluster was healthy
> 
>>
>> If you see this state again, you could try pinging the node directly,
>> does it respond? Your URL should look something like:
>> http://host:port/solr/colection_shard1_replica_t1/query?q=*:*&distrib=false
> 
> Yes, sure I did. Ill replica responded and number of documents differs with the leader
> 
>>
>> The "distrib=false" is important as it won't forward the query to any
>> other replica. If what you're reporting is really happening, that node
>> should respond with a document count different from other nodes.
>>
>> NOTE: there's a delay between the time the leader indexes a doc and
>> it's visible on the follower. Are you sure you're waiting for
>> leader_commit_interval+polling_interval+autowarm_time before
>> concluding that there's a problem? I'm a bit suspicious that checking
>> the versions is concluding that your indexes are out of sync when
>> really they're just catching up normally. If it's at all possible to
>> turn off indexing for a few minutes when this happens and everything
>> just gets better then it's not really a problem.
> 
> Sure, the problem was on many shards but not on all shards
> and for the long time.
> 
>>
>> If we prove out that this is really happening as you think, then a
>> JIRA (with steps to reproduce) is _definitely_ in order.
>>
>> Best,
>> Erick
>> On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
>> <va...@spb.ntk-intourist.ru> wrote:
>>>
>>> Hi All !
>>>
>>> I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.
>>>
>>> My collection has shards and every shard has 3 TLOG replicas on different
>>> nodes.
>>>
>>> I've noticed that some replicas stop receiving updates from the leader
>>> without any visible signs from the cluster status.
>>>
>>> (all replicas active and green in Admin UI CLOUD graph). But indexversion of
>>> 'ill' replica not increasing with the leader.
>>>
>>> It seems to be dangerous, because that 'ill' replica could become a leader
>>> after restart of the nodes and I already experienced data loss.
>>>
>>> I didn't notice any meaningfull records in solr log, except that probably
>>> problem occurs when leader changes.
>>>
>>> Meanwhile, I monitor indexversion of all replicas in a cluster by mbeans and
>>> recreate ill replicas when difference with the leader indexversion  more
>>> than one
>>>
>>> Any suggestions?
>>>
>>> --
>>>
>>> Best regards, Vadim
>>>
>>>
>>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland

RE: TLOG replica stuck

Posted by Vadim Ivanov <va...@spb.ntk-intourist.ru>.
Thanks, Erick, for your attention!
My comments are below, but since I suspect the problem resides in ZooKeeper,
I'll collect more information from the zk logs and solr logs and be back soon.

> bq. I've noticed that some replicas stop receiving updates from the
> leader without any visible signs from the cluster status.
> 
> Hmm, yes, this isn't expected at all. What are you seeing that causes
> you to say this? You'd have to be monitoring the log for update
> messages to the replicas that aren't leaders or the like.  If anyone is
> going to have a prayer of reproducing we'll need more info on exactly
> what you're seeing and how you're measuring this.

Meanwhile, I have log level WARN... I'll decrease it to INFO and see. Tnx

> 
> Have you changed any configurations in your replicas at all? We'd need
> the exact steps you performed if so.
The command to create replicas was like this (implicit sharding and a custom core name):

mysolr07:8983/solr/admin/collections?action=ADDREPLICA
	&collection=rpk94
	&shard=rpk94_1_0
	&property.name=rpk94_1_0_07
	&type=tlog
	&node=mysolr07:8983_solr
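(Programmatically, the same call could be composed like this; a sketch using the same hypothetical collection, shard, and node names as above:)

```python
# Sketch: composing the ADDREPLICA call shown above as a single URL.
# Collection, shard, replica, and node names match the example and are
# illustrative only.
from urllib.parse import urlencode

params = urlencode({
    "action": "ADDREPLICA",
    "collection": "rpk94",
    "shard": "rpk94_1_0",
    "property.name": "rpk94_1_0_07",
    "type": "tlog",
    "node": "mysolr07:8983_solr",  # ':' is percent-encoded in the query string
})
url = f"http://mysolr07:8983/solr/admin/collections?{params}"
print(url)
```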

> 
> On a quick test I didn't see this, but if it were that easy to
> reproduce I'd expect it to have shown up before.

Yesterday I tried to reproduce it by changing the leader with the REBALANCELEADERS command.
It ended up with no leader at all for the shard, and I could not set a leader for a long time.

   There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
...
   Deleting duplicate registration: /collections/rpk94/leader_elect/rpk94_1_117/election/2983181187899523085-core_node73-n_0000000022
...
  Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: rpk94 slice: rpk94_1_117
...

Even deleting all replicas for the shard and recreating a replica on the same node with the same name did not help - still no leader for that shard.
I had to delete the collection, wait till morning, and then it was recreated successfully.
I suppose some weird znodes were deleted from zk by morning.

> 
> NOTE: just looking at the cloud graph and having a node be active is
> not _necessarily_ sufficient for the node to be up to date. It
> _should_ be sufficient if (and only if) the node was shut down
> gracefully, but a "kill -9" or similar doesn't give the replicas on
> the node the opportunity to change the state. The "live_nodes" znode
> in ZooKeeper must also contain the node the replica resides on.

Node was live, cluster was healthy

> 
> If you see this state again, you could try pinging the node directly,
> does it respond? Your URL should look something like:
> http://host:port/solr/colection_shard1_replica_t1/query?q=*:*&distrib=false

Yes, sure I did. The ill replica responded, and its number of documents differed from the leader's.

> 
> The "distrib=false" is important as it won't forward the query to any
> other replica. If what you're reporting is really happening, that node
> should respond with a document count different from other nodes.
> 
> NOTE: there's a delay between the time the leader indexes a doc and
> it's visible on the follower. Are you sure you're waiting for
> leader_commit_interval+polling_interval+autowarm_time before
> concluding that there's a problem? I'm a bit suspicious that checking
> the versions is concluding that your indexes are out of sync when
> really they're just catching up normally. If it's at all possible to
> turn off indexing for a few minutes when this happens and everything
> just gets better then it's not really a problem.

Sure, the problem was on many shards, though not on all of them,
and it lasted for a long time.

> 
> If we prove out that this is really happening as you think, then a
> JIRA (with steps to reproduce) is _definitely_ in order.
> 
> Best,
> Erick
> On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
> <va...@spb.ntk-intourist.ru> wrote:
> >
> > Hi All !
> >
> > I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.
> >
> > My collection has shards and every shard has 3 TLOG replicas on different
> > nodes.
> >
> > I've noticed that some replicas stop receiving updates from the leader
> > without any visible signs from the cluster status.
> >
> > (all replicas active and green in Admin UI CLOUD graph). But indexversion of
> > 'ill' replica not increasing with the leader.
> >
> > It seems to be dangerous, because that 'ill' replica could become a leader
> > after restart of the nodes and I already experienced data loss.
> >
> > I didn't notice any meaningfull records in solr log, except that probably
> > problem occurs when leader changes.
> >
> > Meanwhile, I monitor indexversion of all replicas in a cluster by mbeans and
> > recreate ill replicas when difference with the leader indexversion  more
> > than one
> >
> > Any suggestions?
> >
> > --
> >
> > Best regards, Vadim
> >
> >
> >


Re: TLOG replica stuck

Posted by Erick Erickson <er...@gmail.com>.
bq. I've noticed that some replicas stop receiving updates from the
leader without any visible signs from the cluster status.

Hmm, yes, this isn't expected at all. What are you seeing that causes
you to say this? You'd have to be monitoring the log for update
messages to the replicas that aren't leaders or the like. If anyone is
going to have a prayer of reproducing we'll need more info on exactly
what you're seeing and how you're measuring this.

Have you changed any configurations in your replicas at all? We'd need
the exact steps you performed if so.

On a quick test I didn't see this, but if it were that easy to
reproduce I'd expect it to have shown up before.

NOTE: just looking at the cloud graph and having a node be active is
not _necessarily_ sufficient for the node to be up to date. It
_should_ be sufficient if (and only if) the node was shut down
gracefully, but a "kill -9" or similar doesn't give the replicas on
the node the opportunity to change the state. The "live_nodes" znode
in ZooKeeper must also contain the node the replica resides on.

If you see this state again, you could try pinging the node directly,
does it respond? Your URL should look something like:
http://host:port/solr/collection_shard1_replica_t1/query?q=*:*&distrib=false

The "distrib=false" is important as it won't forward the query to any
other replica. If what you're reporting is really happening, that node
should respond with a document count different from other nodes.
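A sketch of that comparison, once the per-replica numFound values have been gathered with distrib=false queries (replica names and counts below are illustrative):

```python
# Sketch: given per-replica document counts obtained via distrib=false
# queries, report the replicas whose count disagrees with the leader's.
# Names and counts are hypothetical.

def out_of_sync(doc_counts, leader):
    """Return replicas whose document count differs from the leader's."""
    expected = doc_counts[leader]
    return sorted(r for r, n in doc_counts.items() if n != expected)

counts = {
    "core_node71": 120000,  # leader
    "core_node72": 120000,  # in sync
    "core_node73": 117500,  # lagging replica
}
print(out_of_sync(counts, leader="core_node71"))
# -> ['core_node73']
```

Remember the commit/polling delay described below before treating a mismatch as a real problem.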

NOTE: there's a delay between the time the leader indexes a doc and
it's visible on the follower. Are you sure you're waiting for
leader_commit_interval+polling_interval+autowarm_time before
concluding that there's a problem? I'm a bit suspicious that checking
the versions is concluding that your indexes are out of sync when
really they're just catching up normally. If it's at all possible to
turn off indexing for a few minutes when this happens and everything
just gets better then it's not really a problem.

If we prove out that this is really happening as you think, then a
JIRA (with steps to reproduce) is _definitely_ in order.

Best,
Erick
On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
<va...@spb.ntk-intourist.ru> wrote:
>
> Hi All !
>
> I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.
>
> My collection has shards and every shard has 3 TLOG replicas on different
> nodes.
>
> I've noticed that some replicas stop receiving updates from the leader
> without any visible signs from the cluster status.
>
> (all replicas active and green in Admin UI CLOUD graph). But indexversion of
> 'ill' replica not increasing with the leader.
>
> It seems to be dangerous, because that 'ill' replica could become a leader
> after restart of the nodes and I already experienced data loss.
>
> I didn't notice any meaningfull records in solr log, except that probably
> problem occurs when leader changes.
>
> Meanwhile, I monitor indexversion of all replicas in a cluster by mbeans and
> recreate ill replicas when difference with the leader indexversion  more
> than one
>
> Any suggestions?
>
> --
>
> Best regards, Vadim
>
>
>