Posted to issues@solr.apache.org by "Viktor Molnár (Jira)" <ji...@apache.org> on 2022/03/21 09:04:00 UTC

[jira] [Commented] (SOLR-14679) TLOGs grow forever, never get out of BUFFERING state

    [ https://issues.apache.org/jira/browse/SOLR-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509700#comment-17509700 ] 

Viktor Molnár commented on SOLR-14679:
--------------------------------------

I ran into the same problem: forever-growing TLOGs for some shards on TLOG followers.
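
For anyone wanting to confirm the symptom, it is visible through the metrics API: {{TLOG.state}} stays at 1 (BUFFERING) and {{TLOG.replay.remaining.bytes}} keeps growing, as in the report below. Here is a minimal SolrJ sketch for checking it; the host/port and the client setup are placeholders, not taken from the original report:
{code:java}
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class TlogStateCheck {
  public static void main(String[] args) throws Exception {
    // assumed local node; adjust the base URL for your cluster
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("group", "core");   // per-core metric registries
      params.set("prefix", "TLOG");  // only the TLOG.* metrics
      NamedList<Object> response = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/metrics", params));
      // look for TLOG.state (1 = BUFFERING) and TLOG.replay.remaining.bytes per core
      System.out.println(response);
    }
  }
}
{code}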

I did my own research (with the *Solr 8.11.0* sources) and I may have found the bug, along with a fix.

In {*}ZkController.java{*}, the method {*}rejoinShardLeaderElection(SolrParams params){*} contains this code:
 
{code:java}
      try (SolrCore core = cc.getCore(coreName)) {
        Replica.Type replicaType = core.getCoreDescriptor().getCloudDescriptor().getReplicaType();
        if (replicaType == Type.TLOG) {
          String leaderUrl = getLeader(core.getCoreDescriptor().getCloudDescriptor(), cloudConfig.getLeaderVoteWait());
          if (!leaderUrl.equals(ourUrl)) {
            // restart the replication thread to ensure the replication is running in each new replica
            // especially if previous role is "leader" (i.e., no replication thread)
            stopReplicationFromLeader(coreName);
            startReplicationFromLeader(coreName, false);
          }
        }
      }

{code}
When {*}startReplicationFromLeader(String coreName, boolean switchTransactionLog){*} is called here, the *second argument* is {*}false{*}, but I think {*}it should be true{*} so that the transaction log gets switched/rotated. I changed it to true, tested it quickly, and the problem {*}seems to be fixed{*}.
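
For clarity, the change I tested is just flipping that flag in the block above; everything else stays the same:
{code:java}
            stopReplicationFromLeader(coreName);
            // switchTransactionLog = true so the follower switches/rotates its
            // transaction log instead of growing the old one forever
            startReplicationFromLeader(coreName, true);
{code}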

Could you please look into this, [~thelabdude], [~erickerickson]?

 

Thank you

> TLOGs grow forever, never get out of BUFFERING state
> ----------------------------------------------------
>
>                 Key: SOLR-14679
>                 URL: https://issues.apache.org/jira/browse/SOLR-14679
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Priority: Major
>
> From the user's list (Gaël Jourdan-Weil):
> https://www.mail-archive.com/solr-user@lucene.apache.org/msg151867.html
> I think I've tracked down the root cause of this mess in our case.
> Everything confirms that the TLOG state is "BUFFERING" rather than "ACTIVE".
> 1/ This can be seen with the metrics API as well where we observe:
> "TLOG.replay.remaining.bytes":48997506,
> "TLOG.replay.remaining.logs":1,
> "TLOG.state":1,
> 2/ When a hard commit occurs, we can see it in the logs and the index files are updated; but we can also see that the postCommit and preCommit UpdateLog methods are called and exit immediately, which, looking at the code, indicates the state is "BUFFERING".
> So, why is this TLOG still in "BUFFERING" state?
> From the code, the only place where state is set to "BUFFERING" seems to be UpdateLog.bufferUpdates.
> From the logs, in our case it comes from the recovery process. We see the message "Begin buffering updates. core=[col_blue_shard1]".
> Just after we can see "Publishing state of core [col_blue_shard1] as recovering, leader is [http://srv2/solr/col_blue_shard1/] and I am [http://srv1/solr/col_blue_shard1/]".
> Up to here, everything is as expected, I guess, but why is the TLOG state not set to "ACTIVE" a bit later?
> Well, the "Begin buffering updates" occurred and 500ms later we can see:
> - "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we shut down srv2; this is the main cause of our problem)
> - "I am going to be the leader srv1"
> - "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
> And 2s later:
> - "Attempting to PeerSync from [http://srv2/solr/es_blue_shard1/] - recoveringAfterStartup=[true]"
> - "Error while trying to recover. core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get fingerprint from leader"
> - "Finished recovery process, successful=[false]"
> At this point, I think the root cause on our side is a rolling update that we did too quickly: we stopped node2 while node1 was still recovering from it.
> It's still not clear how everything went back to the "active" state after such a failed recovery, with a TLOG still in "BUFFERING".
> We shouldn't have been in recovery in the first place, and I think we know why; that is the first thing we have addressed.
> Then we need to add some pauses in our rolling update strategy.
> Does it make sense? Can you think of anything else to check/improve?
> Best Regards,
> Gaël



