Posted to notifications@iotdb.apache.org by "刘珍 (Jira)" <ji...@apache.org> on 2023/05/15 07:30:00 UTC

[jira] [Assigned] (IOTDB-5876) [ ConfigNode "Took a snapshot" + expand a config node ] restart this cluster, there is a config node startup failure

     [ https://issues.apache.org/jira/browse/IOTDB-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

刘珍 reassigned IOTDB-5876:
-------------------------

    Assignee: Song Ziyang

> [ ConfigNode "Took a snapshot"  + expand a config node ] restart this cluster, there is a config node startup failure
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-5876
>                 URL: https://issues.apache.org/jira/browse/IOTDB-5876
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>            Reporter: 刘珍
>            Assignee: Song Ziyang
>            Priority: Major
>         Attachments: 1.conf, image-2023-05-15-15-15-39-161.png, image-2023-05-15-15-21-11-259.png, image-2023-05-15-15-21-40-565.png, ip14_logs.tar.gz, ip26_logs.tar.gz, ip27_logs.tar.gz, ip30_logs.tar.gz, ip4_logs.tar.gz, ip5_logs.tar.gz, only_stop_dn.sh, only_stop_timecho_cn.sh, start_timecho_cn.sh, start_timecho_dn.sh
>
>
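> Problem description
> ConfigNode uses the Ratis protocol, schema region uses the Ratis protocol, and data region uses the IoT protocol.
> Start 2C5D (2 ConfigNodes, 5 DataNodes). While the client writes continuously and after the ConfigNode has triggered a snapshot, add 1 ConfigNode.
> Writes continue, and the ConfigNode triggers a snapshot again.
> After the client finishes writing, connect to 1 DataNode and run flush.
> Stop the cluster: stop the 5 DataNodes, then stop the 3 ConfigNodes.
> Clear the operating system cache on every node.
> Start the cluster: the script starts the 3 ConfigNodes at 2-second intervals, then the 5 DataNodes at 1-second intervals.
> Check the cluster status: all nodes are Running.
> Stop the ConfigNode service on ip5, then stop the ConfigNode service on ip4. Queries now fail (only 1 ConfigNode is online: ip14, the node added during expansion).
> Next, {color:#DE350B}*start the ConfigNode on ip4; it fails:*{color}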
>  !image-2023-05-15-15-15-39-161.png! 
> 2023-05-15 11:00:10,015 [main] INFO  o.a.i.c.s.ConfigNode:247 - Successfully initialize ConfigManager.
> 2023-05-15 11:00:10,016 [main] INFO  o.a.i.c.s.ConfigNode:107 - IoTDB-ConfigNode is in restarting process...
> 2023-05-15 11:00:10,196 [main] WARN  o.a.i.c.c.s.SyncConfigNodeClientPool:99 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710), because net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused), retrying 0...
> 2023-05-15 11:00:10,298 [main] WARN  o.a.i.c.c.s.SyncConfigNodeClientPool:99 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710), because net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused), retrying 1...
> 2023-05-15 11:00:10,499 [main] WARN  o.a.i.c.c.s.SyncConfigNodeClientPool:99 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710), because net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused), retrying 2...
> 2023-05-15 11:00:10,901 [main] WARN  o.a.i.c.c.s.SyncConfigNodeClientPool:99 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710), because net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused), retrying 3...
> 2023-05-15 11:00:11,702 [main] WARN  o.a.i.c.c.s.SyncConfigNodeClientPool:99 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710), because net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused), retrying 4...
> 2023-05-15 11:00:13,304 [main] WARN  o.a.i.c.c.s.SyncConfigNodeClientPool:99 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710), because net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused), retrying 5...
> 2023-05-15 11:00:16,507 [main] ERROR o.a.i.c.c.s.SyncConfigNodeClientPool:108 - RESTART_CONFIG_NODE failed on ConfigNode TEndPoint(ip:172.20.70.5, port:10710)
> org.apache.iotdb.commons.client.exception.ClientManagerException: net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused)
>         at org.apache.iotdb.commons.client.ClientManager.borrowClient(ClientManager.java:55)
>         at org.apache.iotdb.confignode.client.sync.SyncConfigNodeClientPool.sendSyncRequestToConfigNodeWithRetry(SyncConfigNodeClientPool.java:72)
>         at org.apache.iotdb.confignode.service.ConfigNode.sendRestartConfigNodeRequest(ConfigNode.java:330)
>         at org.apache.iotdb.confignode.service.ConfigNode.active(ConfigNode.java:115)
>         at org.apache.iotdb.confignode.service.ConfigNodeCommandLine.run(ConfigNodeCommandLine.java:78)
>         at org.apache.iotdb.commons.ServerCommandLine.doMain(ServerCommandLine.java:58)
>         at org.apache.iotdb.confignode.service.ConfigNode.main(ConfigNode.java:90)
> Caused by: net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused)
>         at net.sf.cglib.core.ReflectUtils.newInstance(ReflectUtils.java:235)
>         at net.sf.cglib.core.ReflectUtils.newInstance(ReflectUtils.java:220)
>         at net.sf.cglib.proxy.Enhancer.createUsingReflection(Enhancer.java:639)
>         at net.sf.cglib.proxy.Enhancer.firstInstance(Enhancer.java:538)
>         at net.sf.cglib.core.AbstractClassGenerator.create(AbstractClassGenerator.java:231)
>         at net.sf.cglib.proxy.Enhancer.createHelper(Enhancer.java:377)
>         at net.sf.cglib.proxy.Enhancer.create(Enhancer.java:304)
>         at org.apache.iotdb.commons.client.sync.SyncThriftClientWithErrorHandler.newErrorHandler(SyncThriftClientWithErrorHandler.java:46)
>         at org.apache.iotdb.commons.client.sync.SyncConfigNodeIServiceClient$Factory.makeObject(SyncConfigNodeIServiceClient.java:112)
>         at org.apache.iotdb.commons.client.sync.SyncConfigNodeIServiceClient$Factory.makeObject(SyncConfigNodeIServiceClient.java:94)
>         at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:780)
>         at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:439)
>         at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:350)
>         at org.apache.iotdb.commons.client.ClientManager.borrowClient(ClientManager.java:53)
>         ... 6 common frames omitted
> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
>         at org.apache.thrift.transport.TSocket.open(TSocket.java:243)
>         at org.apache.iotdb.rpc.TElasticFramedTransport.open(TElasticFramedTransport.java:91)
>         at org.apache.iotdb.commons.client.sync.SyncConfigNodeIServiceClient.<init>(SyncConfigNodeIServiceClient.java:62)
>         at org.apache.iotdb.commons.client.sync.SyncConfigNodeIServiceClient$$EnhancerByCGLIB$$74177f2d.<init>(<generated>)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at net.sf.cglib.core.ReflectUtils.newInstance(ReflectUtils.java:228)
>         ... 19 common frames omitted
> Caused by: java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.thrift.transport.TSocket.open(TSocket.java:238)
>         ... 27 common frames omitted
> 2023-05-15 11:00:16,513 [main] ERROR o.a.i.c.s.ConfigNode:204 - Meet error while starting up.
> org.apache.iotdb.commons.exception.StartupException: All retry failed due to: net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused)
>         at org.apache.iotdb.confignode.service.ConfigNode.sendRestartConfigNodeRequest(ConfigNode.java:340)
>         at org.apache.iotdb.confignode.service.ConfigNode.active(ConfigNode.java:115)
>         at org.apache.iotdb.confignode.service.ConfigNodeCommandLine.run(ConfigNodeCommandLine.java:78)
>         at org.apache.iotdb.commons.ServerCommandLine.doMain(ServerCommandLine.java:58)
>         at org.apache.iotdb.confignode.service.ConfigNode.main(ConfigNode.java:90)
> 2023-05-15 11:00:16,514 [main] INFO  o.a.i.c.s.ConfigNode:362 - Deactivating IoTDB-ConfigNode...
> 2023-05-15 11:00:16,514 [main] INFO  o.a.i.c.s.RegisterManager:67 - deregister all service.
> 2023-05-15 11:00:16,514 [main] INFO  o.a.i.c.s.ConfigNode:368 - IoTDB-ConfigNode is deactivated.
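The WARN timestamps above are consistent with a retry loop whose wait doubles after each failed attempt, starting near 100 ms. A minimal Java sketch of such a schedule follows, assuming exactly that doubling rule. The class `RetryBackoffSketch`, its constants, and its methods are hypothetical names for illustration and are not taken from `SyncConfigNodeClientPool`.

```java
/** Hypothetical sketch of a doubling retry schedule; not the IoTDB implementation. */
final class RetryBackoffSketch {
    // Assumed values, inferred only from the log timestamps in this report.
    static final long INITIAL_DELAY_MS = 100L;
    static final int MAX_RETRIES = 6; // the log shows "retrying 0" through "retrying 5"

    /** Wait after the failed attempt with the given 0-based index: 100 ms, doubled each time. */
    static long delayMillis(int attempt) {
        return INITIAL_DELAY_MS << attempt;
    }

    /** Total time spent waiting across all retries before giving up. */
    static long totalWaitMillis() {
        long total = 0L;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            total += delayMillis(attempt);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            System.out.println("retrying " + attempt + ": wait " + delayMillis(attempt) + " ms");
        }
        System.out.println("total wait: " + totalWaitMillis() + " ms");
    }
}
```

Under these assumptions the six waits sum to 6300 ms, close to the roughly 6.3 s between the first WARN (11:00:10,196) and the final ERROR (11:00:16,507). Every attempt in the log targets the same endpoint, 172.20.70.5:10710.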
> Test environment
> 1. Start 2C5D; each node has 8 CPUs and 32 GB of memory
> 2C : 172.20.70.5 / 4
> 5D : 
> 172.20.70.26
> 172.20.70.14
> 172.20.70.27
> 172.20.70.30
> Explicitly configured parameters
> COMMON parameters set explicitly:
> schema_replication_factor=3
> data_replication_factor=2
> time_partition_interval=86400000
> config_node_ratis_snapshot_trigger_threshold=10000
> schema_region_ratis_snapshot_trigger_threshold=10000
> schema_region_group_extension_policy=CUSTOM
> data_region_group_extension_policy=CUSTOM
> default_data_region_group_num_per_database=9
> default_schema_region_group_num_per_database=3
> fsync_wal_delay_in_ms=1000
> wal_buffer_size_in_byte=167772160
> wal_file_size_threshold_in_byte=104857600
> wal_buffer_queue_capacity=1000
> iot_consensus_throttle_threshold_in_byte=536870912000
> CONFIGNODE ENV:
> MAX_HEAP_SIZE="4G"
> HEAP_NEWSIZE="4G"
> MAX_DIRECT_MEMORY_SIZE="2G"
> DATANODE ENV:
> MAX_HEAP_SIZE="20G"
> HEAP_NEWSIZE="20G"
> MAX_DIRECT_MEMORY_SIZE="2G"
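> 2. Start the benchmark to write data; for its configuration parameters see attachment 1.conf
> The benchmark is at 172.20.70.13  /data/iotdb/benchmark/bm_20230428_e7dad04
> 3. Check the ConfigNode log on ip5 (Leader)
> After snapshots have been triggered several times, start the ConfigNode on ip14 (expansion)
> [image]
> Check the ConfigNode log on ip14: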
>  !image-2023-05-15-15-21-11-259.png! 
> 4. Wait for the benchmark writes to finish
> ConfigNode snapshot information on ip5 at this point:
>  !image-2023-05-15-15-21-40-565.png! 
> 5. Connect to ip30 and run flush; it succeeds
> /data/iotdb/t_rc3_0514_4e34fa6/sbin/start-cli.sh -h 172.20.70.30 -e 'flush'
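> The scripts below are located at: 172.20.70.3  /data/iotdb/cluster_shell
> 6. Stop the cluster: first stop the 5 DataNodes, then stop the 3 ConfigNodes; see the attached scripts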
> only_stop_dn.sh
> only_stop_timecho_cn.sh
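> 7. Clear the operating system cache on every node
> See the attached script clear_cache.sh
> 8. Start the cluster: first start the 3 ConfigNodes, then start the 5 DataNodes
> Script that starts the 3 ConfigNodes: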
> start_timecho_cn.sh
> Script that starts the 5 DataNodes:
> start_timecho_dn.sh
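> 9. Check the cluster status: all nodes are Running
> A count(s_0) query on one device succeeds
> 10. Stop the ConfigNode on ip5, then stop the ConfigNode service on ip4
> Only the ConfigNode service on ip14 is online; running a query fails: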
> /data/iotdb/t_rc3_0514_4e34fa6/sbin/start-cli.sh -h 172.20.70.30 -e 'show regions'
> Msg: 305: Error in calling method showRegion, because: Fail to connect to any config node. Please check status of ConfigNodes
> 11. Start the ConfigNode service on ip4; it fails.
> Logs from all nodes are in the attachments.
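
In step 10 only one of the three ConfigNodes is online. The report states that the ConfigNodes use the Ratis protocol, and Ratis implements Raft, which needs a majority of group members to elect a leader. A minimal Java sketch of that majority arithmetic follows, assuming the consensus group counts all three ConfigNodes as members after the expansion. The class `QuorumSketch` and its methods are hypothetical names for illustration only. Whether this accounts for the failed query in step 10 is an assumption, and it does not explain the startup failure in step 11.

```java
/** Hypothetical sketch of Raft majority arithmetic; not IoTDB or Ratis code. */
final class QuorumSketch {
    /** Smallest number of members that forms a strict majority of the group. */
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    /** True when enough members are online for the group to elect a leader. */
    static boolean hasQuorum(int groupSize, int onlineMembers) {
        return onlineMembers >= majority(groupSize);
    }

    public static void main(String[] args) {
        int configNodes = 3; // assumed group size: ip4, ip5, and ip14 after expansion
        System.out.println("majority needed: " + majority(configNodes));
        System.out.println("quorum with only ip14 online: " + hasQuorum(configNodes, 1));
        System.out.println("quorum with ip14 and ip4 online: " + hasQuorum(configNodes, 2));
    }
}
```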



--
This message was sent by Atlassian Jira
(v8.20.10#820010)