Posted to notifications@iotdb.apache.org by "刘珍 (Jira)" <ji...@apache.org> on 2022/12/05 07:56:00 UTC

[jira] [Reopened] (IOTDB-4218) [ remove datanode ] The new peer does not continue to synchronize, the leader does not accept new writes

     [ https://issues.apache.org/jira/browse/IOTDB-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

刘珍 reopened IOTDB-4218:
-----------------------

rel/1.0   20221205    235f663
Private cloud, phase 3
node1="172.20.70.2"
node2="172.20.70.3"
node3="172.20.70.14"
node4="172.20.70.4"
node5="172.20.70.5"

config_node1="172.20.70.13"
config_node2="172.20.70.15"
config_node3="172.20.70.16"

ConfigNode configuration
MAX_HEAP_SIZE="16G"
MAX_DIRECT_MEMORY_SIZE="6G"
cn_connection_timeout_ms=120000

DataNode configuration
MAX_HEAP_SIZE="20G"
MAX_DIRECT_MEMORY_SIZE="6G"

Common configuration
schema_replication_factor=3
data_replication_factor=3
connection_timeout_ms=120000
max_connection_for_internal_service=200
query_timeout_threshold=3600000
max_waiting_time_when_insert_blocked=3600000
time_partition_interval=86400
data_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
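
As a rough sanity check of the settings above, the following Python sketch (not part of IoTDB; the helper names `parse_properties` and `majority` are hypothetical) parses a few of the listed key=value pairs and computes the Raft majority for the configured replication factor. It assumes that a removal can only succeed if at least as many DataNodes remain as there are data replicas; the node list is copied from the environment listed in this comment.

```python
def parse_properties(text):
    """Parse simple key=value lines, stripping optional double quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def majority(replicas):
    """Smallest number of peers that forms a majority in a Raft group."""
    return replicas // 2 + 1


# A subset of the Common configuration from this comment.
COMMON = """
schema_replication_factor=3
data_replication_factor=3
connection_timeout_ms=120000
"""

# DataNode addresses node1..node5 from this comment.
DATA_NODES = ["172.20.70.2", "172.20.70.3", "172.20.70.14",
              "172.20.70.4", "172.20.70.5"]

settings = parse_properties(COMMON)
data_rf = int(settings["data_replication_factor"])
nodes_after_removal = len(DATA_NODES) - 1

# Assumption: every region needs data_rf distinct DataNodes to host its replicas.
can_remove = nodes_after_removal >= data_rf
print(f"data replicas={data_rf}, majority={majority(data_rf)}, "
      f"can remove one node: {can_remove}")
```

With 5 DataNodes and a replication factor of 3, removing one follower should still leave enough nodes for a replacement peer, so this check alone does not explain the failure reported here.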


Run the benchmark (BM) with the attached configuration.
Remove (scale in) ip2, which is a follower.


> [ remove datanode ] The new peer does not continue to synchronize, the leader does not accept new writes
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-4218
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4218
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Song Ziyang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: bm_config.properties, image-2022-08-24-11-05-21-029.png, image-2022-08-24-11-06-50-633.png, image-2022-08-24-11-07-43-638.png, ip1_log_all.log, ip3_log_all.log, ip4_log_all.log, ip5_log_all.log, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
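> Test version: a build provided by Song Ziyang that depends on his local Ratis package; the DataNode code is at commit ID 3585d6b2316b0efcac3ff35fdafd7806185f0285
> 3 replicas, 3C5D (3 ConfigNodes, 5 DataNodes), 1 DataRegion; remove one follower (ip1):
> Problem 1: {color:#DE350B}the data on ip1 has been deleted, but the DataNode process does not exit{color}
> Error log of the removed node ip1: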
>  !screenshot-1.png! 
> Problem 2: {color:#DE350B}the leader (ip4) reports an error and stops accepting new writes (the benchmark keeps writing during the removal){color}
> 2022-08-24 10:12:07,280 [pool-16-IoTDB-Region-Migrate-Pool-1] ERROR o.a.i.d.s.RegionMigrateService$AddRegionPeerTask:294 - add new peer TEndPoint(ip:192.168.130.5, port:40010) for region DataRegion[1] failed, resp: ConsensusGenericResponse{success=false} exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException: Ratis request failed
> Problem 3: following problem 2, {color:#DE350B}the leader reports that add new peer failed, yet show regions shows that the new peer was added successfully{color}.
>  !image-2022-08-24-11-07-43-638.png! 
> Problem 4: {color:#DE350B}the new peer ip5 does not synchronize{color}:
>  !image-2022-08-24-11-05-21-029.png! 
> Raft log of the leader ip4 (no new writes after 10:22, although the benchmark had not stopped at that time)
>  !image-2022-08-24-11-06-50-633.png! 
> Steps to reproduce:
> 1. Hosts: 192.168.130.1/2/3/4/5
> Machines 3, 4, 5: 16 cores, 32 GB
> Machines 1, 2: 8 cores, 32 GB
> The benchmark runs on ip2
> Database configuration parameters
> ConfigNode
> MAX_HEAP_SIZE="4G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> schema_replication_factor=3
> data_replication_factor=3
> DataNode configuration parameters
> MAX_HEAP_SIZE="16G"
> max_waiting_time_when_insert_blocked=3600000
> query_timeout_threshold=36000000
> partition_interval=86400
> 2. Start the benchmark; see the attachment for its configuration file
> Region information:
>  !screenshot-2.png! 
> 3. Remove follower ip1 (ip1 has a snapshot)
> 4. For the observed symptoms, see the problem description above
> The logs for ip1, ip4, and ip5 are attached



--
This message was sent by Atlassian Jira
(v8.20.10#820010)