Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2021/11/26 11:56:12 UTC

[GitHub] [incubator-doris] Userwhite opened a new issue #7229: [Bug] tablet recover failed when restart BE

Userwhite opened a new issue #7229:
URL: https://github.com/apache/incubator-doris/issues/7229


   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Version
   
   0.14
   
   ### What's Wrong?
   
   1. A tablet is missing a version, and the FE rebalance fails.
   2. The tablet can't be compacted because of the missing version.
   
   **First: grep clone fe.log | less**
   ```java
   ......
   2021-11-23 20:50:23,068 WARN (Thread-34|91) [ReportHandler.handleRecoverTablet():768] find 10 tablets on backend 11008 which is bad or misses versions that need clone or force recovery
   ......
   java.lang.IllegalStateException: fromBe has no replica in the map, can't move
           at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[spark-dpp-1.0.0.jar:1.0.0]
           at org.apache.doris.clone.TwoDimensionalGreedyRebalanceAlgo.moveOneReplica(TwoDimensionalGreedyRebalanceAlgo.java:322) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.clone.TwoDimensionalGreedyRebalanceAlgo.applyMove(TwoDimensionalGreedyRebalanceAlgo.java:261) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.clone.TwoDimensionalGreedyRebalanceAlgo.getNextMoves(TwoDimensionalGreedyRebalanceAlgo.java:142) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.clone.PartitionRebalancer.selectAlternativeTabletsForCluster(PartitionRebalancer.java:110) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.clone.Rebalancer.selectAlternativeTablets(Rebalancer.java:61) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.clone.TabletScheduler.selectTabletsForBalance(TabletScheduler.java:1055) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.clone.TabletScheduler.runAfterCatalogReady(TabletScheduler.java:280) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[palo-fe.jar:3.4.0]
           at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:3.4.0]
   ```
   
   **Second: find an error tablet**
   ```java
   fe.log.20211123-1:2021-11-23 21:09:02,012 WARN (Thread-41|110) [ReportHandler.handleRecoverTablet():768] find 3 tablets on backend 10006 which is bad or misses versions that need clone or force recovery
   fe.log.20211123-1:2021-11-23 21:09:02,012 INFO (Thread-41|110) [ReportHandler.handleSetTabletPartitionId():872] find [0] tablets without partition id, try to set them
   fe.log.20211123-1:2021-11-23 21:09:02,014 INFO (Thread-41|110) [ReportHandler.handleSetTabletInMemory():925] find [0] tablets need set in memory meta
   fe.log.20211123-1:2021-11-23 21:09:02,014 INFO (Thread-41|110) [ReportHandler.tabletReport():315] tablet report from backend[10006] cost: 6 ms
   fe.log.20211123-1:2021-11-23 21:09:02,030 INFO (Thread-41|110) [ReportHandler.tabletReport():241] backend[10004] reports 6213 tablet(s). report version: 16376710370000
   fe.log.20211123-1:2021-11-23 21:09:02,030 INFO (thrift-server-pool-38|193) [ReportHandler.handleReport():174] receive report from be 10004. type: tablet, current queue size: 1
   fe.log.20211123-1:2021-11-23 21:09:02,031 INFO (Thread-41|110) [TabletInvertedIndex.tabletReport():134] begin to do tablet diff with backend[10004]. num: 6213
   replica 668247 of tablet 104608 on backend 10004 need recovery. replica in FE: [replicaId=668247, BackendId=10004, version=18025217, versionHash=2996819540149708548, dataSize=290683805, rowCount=22439433, lastFailedVersion=18025218, lastFailedVersionHash=5097744780157020547, lastSuccessVersion=18025716, lastSuccessVersionHash=8261391189031240653, lastFailedTimestamp=1637671848573, schemaHash=1346645421, state=NORMAL], report version 18025217-2996819540149708548, report schema hash: 1346645421, is bad: unknown, is version missing: true
   ```
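
For reading the `replica in FE: [...]` line above, a small sketch like the following can pull out the key=value fields and compare them. The function name `parse_replica_fields` is hypothetical, the sample line is an abbreviated copy of the log line quoted above, and the field layout is assumed to stay as shown there.

```python
import re

def parse_replica_fields(line):
    """Extract key=value pairs from the bracketed 'replica in FE: [...]' section."""
    m = re.search(r"replica in FE: \[(.*?)\]", line)
    if not m:
        return {}
    fields = {}
    for pair in m.group(1).split(", "):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields

# Abbreviated copy of the FE log line quoted above (some fields omitted).
line = ("replica 668247 of tablet 104608 on backend 10004 need recovery. "
        "replica in FE: [replicaId=668247, BackendId=10004, version=18025217, "
        "lastFailedVersion=18025218, lastSuccessVersion=18025716, state=NORMAL], "
        "report version 18025217-2996819540149708548, is version missing: true")

fields = parse_replica_fields(line)
# How far the last successful version is ahead of the replica's visible version.
gap = int(fields["lastSuccessVersion"]) - int(fields["version"])
print(fields["version"], fields["lastFailedVersion"], gap)
```

For this line the replica is stuck at version 18025217, the first failed version is 18025218, and the last successful version is 499 versions ahead.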
   
   **Third: find tablet 104608 on the BE**
   ```java
   W1123 21:07:23.121587 1816689 internal_service.cpp:105] tablet writer add batch failed, message=tablet writer write failed, tablet_id=104608, txn_id=44871640, err=-215, id=784f29cd6b3149e1-dae986720025fa98, index_id=104531, sender_id=0
   W1123 21:07:23.330183 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025716]
   W1123 21:07:23.330199 1816603 tablet.cpp:988] 104608.1346645421.014b70c508ec195a-e1231460bfef55ae has 1 missed version:[18025218-18025218],
   W1123 21:07:24.154244 1816696 delta_writer.cpp:106] failed to init delta writer. version count: 501, exceed limit: 500. tablet: 104608.1346645421.014b70c508ec195a-e1231460bfef55ae
   W1123 21:07:24.154300 1816696 tablets_channel.cpp:117] tablet writer write failed, tablet_id=104608, txn_id=44871642, err=-215
   ```
   
   **Fourth: the version count exceeds the limit; my guess is that the tablet can't be compacted**
   
   ```cpp
   W1123 20:47:59.494597 1816603 tablet.cpp:988] 104608.1346645421.014b70c508ec195a-e1231460bfef55ae has 1 missed version:[18025218-18025218],
   W1123 20:48:00.514454 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025580]
   W1123 20:48:00.514592 1816603 tablet.cpp:988] 104608.1346645421.014b70c508ec195a-e1231460bfef55ae has 1 missed version:[18025218-18025218],
   W1123 20:48:01.529275 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025581]
   ...
   W1123 20:51:21.389887 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025716]
   W1123 20:51:21.389930 1816603 tablet.cpp:988] 104608.1346645421.014b70c508ec195a-e1231460bfef55ae has 1 missed version:[18025218-18025218],
   W1123 20:51:21.639307 1816700 delta_writer.cpp:106] failed to init delta writer. version count: 501, exceed limit: 500. tablet: 104608.1346645421.014b70c508ec195a-e1231460bfef55ae
   W1123 20:51:21.639377 1816700 tablets_channel.cpp:117] tablet writer write failed, tablet_id=104608, txn_id=44867521, err=-215
   W1123 20:51:21.639390 1816700 internal_service.cpp:105] tablet writer add batch failed, message=tablet writer write failed, tablet_id=104608, txn_id=44867521, err=-215, id=374aa512973dbb93-d6b92597e1d412b1, index_id=104531, sender_id=0
   W1123 20:51:22.408419 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025716]
   ...
   W1123 21:05:10.026687 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025716]
   W1123 21:05:10.026711 1816603 tablet.cpp:988] 104608.1346645421.014b70c508ec195a-e1231460bfef55ae has 1 missed version:[18025218-18025218],
   ```
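
To check whether other tablets on the same BE are affected, the `tablet.cpp` warnings can be scanned for missed versions. This is a rough sketch that assumes the log format is exactly as quoted above, with one range per line. The names `MISSED_RE` and `collect_missed_versions` are hypothetical, and lines listing several ranges are not handled.

```python
import re
from collections import defaultdict

# Matches "<tablet_id>.<schema_hash>.<uid> has N missed version:[start-end]"
# as it appears in the tablet.cpp warnings quoted above.
MISSED_RE = re.compile(
    r"(?P<tablet>\d+)\.\d+\.[0-9a-f]+-[0-9a-f]+ has \d+ missed version:"
    r"\[(?P<start>\d+)-(?P<end>\d+)\]"
)

def collect_missed_versions(lines):
    """Map tablet_id -> set of (start, end) missed version ranges."""
    missed = defaultdict(set)
    for line in lines:
        m = MISSED_RE.search(line)
        if m:
            version_range = (int(m.group("start")), int(m.group("end")))
            missed[int(m.group("tablet"))].add(version_range)
    return dict(missed)

# Two lines copied from the BE log above; only the first reports a missed range.
sample = [
    "W1123 20:47:59.494597 1816603 tablet.cpp:988] 104608.1346645421.014b70c508ec195a-e1231460bfef55ae has 1 missed version:[18025218-18025218],",
    "W1123 20:48:00.514454 1816603 tablet.cpp:548] status:-214, tablet:104608.1346645421.014b70c508ec195a-e1231460bfef55ae, missed version for version:[0-18025580]",
]
print(collect_missed_versions(sample))
```

In practice the `sample` list would be replaced by the lines of the BE warning log file.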
   
   **Summary**
   
   1. Restart the BE.
   2. The BE is missing a version and needs to be recovered.
   3. The FE recovery fails.
   4. The BE can't compact the tablet, so the version count exceeds the limit, which blocks loads.
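
The steps above can be illustrated with a minimal sketch of why one missing version stops the range [0-18025716] from being assembled. This is not the Doris implementation. The function `find_missing_ranges` and the rowset layout in `rowsets` are assumptions made only for illustration. Only the missing version 18025218 and the end version 18025716 come from the logs.

```python
def find_missing_ranges(rowset_versions, target_end):
    """Return gaps in [0, target_end] not covered by (start, end) version ranges."""
    gaps = []
    expected = 0  # next version that must be covered
    for start, end in sorted(rowset_versions):
        if start > target_end:
            break
        if start > expected:
            gaps.append((expected, start - 1))
        expected = max(expected, end + 1)
    if expected <= target_end:
        gaps.append((expected, target_end))
    return gaps

# Hypothetical rowset layout: everything is present except version 18025218.
rowsets = [(0, 18025217), (18025219, 18025219), (18025220, 18025716)]
print(find_missing_ranges(rowsets, 18025716))
```

With a gap like this, a continuous version path from 0 to 18025716 cannot be built. That matches the repeated `missed version for version:[0-18025716]` warnings, while new versions keep accumulating until the limit of 500 is hit.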
   
   
   ### What You Expected?
   
   When I use partition rebalance and a restarted BE has a missing version, the FE should be able to recover it.
   
   ### How to Reproduce?
   
   1. use partition rebalance
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org