Posted to notifications@iotdb.apache.org by "刘珍 (Jira)" <ji...@apache.org> on 2022/10/24 07:31:00 UTC

[jira] [Assigned] (IOTDB-4731) [ remove datanode ] Data is inconsistent ( remove datanode before the synchronization is complete )

     [ https://issues.apache.org/jira/browse/IOTDB-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

刘珍 reassigned IOTDB-4731:
-------------------------

     Attachment: image-2022-10-24-15-22-17-729.png
                 image-2022-10-24-15-13-45-086.png
    Component/s: mpp-cluster
         Sprint: 2022-10-Cluster
       Assignee: Jinrui Zhang
    Description: 
master_1023_2fea011
3 replicas, 3C3D; benchmark write is done.
Start the fourth datanode (ip64).
On ip68: SET SYSTEM TO READONLY ON LOCAL
Remove datanode (ip68).
Before the removal, ip68 is the Leader of DataRegion[14] and still has unsynchronized data:
 !image-2022-10-24-15-13-45-086.png! 

While ip68 is in the Removing state, its datanode log shows these errors:

2022-10-24 14:18:18,092 [pool-49-IoTDB-LogDispatcher-DataRegion[14]-3] ERROR o.a.i.d.w.n.WALNode$PlanNodeIterator:590 - Fail to read wal from wal file /data/liuzhen_test/master_1023_2fea011/sbin/../data/datanode/wal/root.test.g_4-14/_150-200-1.wal, skip this file.
java.nio.channels.ClosedByInterruptException: null
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:315)
        at org.apache.iotdb.db.wal.io.WALByteBufReader.<init>(WALByteBufReader.java:47)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:552)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.next(WALNode.java:683)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.constructBatchFromWAL(LogDispatcher.java:438)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.getBatch(LogDispatcher.java:348)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:274)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2022-10-24 14:18:18,093 [pool-49-IoTDB-LogDispatcher-DataRegion[14]-3] ERROR o.a.i.d.w.n.WALNode$PlanNodeIterator:590 - Fail to read wal from wal file /data/liuzhen_test/master_1023_2fea011/sbin/../data/datanode/wal/root.test.g_4-14/_151-204-1.wal, skip this file.
java.nio.channels.ClosedByInterruptException: null
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:315)
        at org.apache.iotdb.db.wal.io.WALByteBufReader.<init>(WALByteBufReader.java:47)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:552)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:597)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.next(WALNode.java:683)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.constructBatchFromWAL(LogDispatcher.java:438)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.getBatch(LogDispatcher.java:348)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:274)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2022-10-24 14:18:18,093 [pool-49-IoTDB-LogDispatcher-DataRegion[14]-3] ERROR o.a.i.c.m.l.LogDispatcher$LogDispatcherThread:294 - Unexpected error in logDispatcher for peer Peer{groupId=DataRegion[14], endpoint=TEndPoint(ip:192.168.10.64, port:40010), nodeId=6}
java.lang.ArrayIndexOutOfBoundsException: 29
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:530)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:597)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:597)
        at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.next(WALNode.java:683)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.constructBatchFromWAL(LogDispatcher.java:438)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.getBatch(LogDispatcher.java:348)
        at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:274)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
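
A note on the first two errors above: ClosedByInterruptException is what an interruptible channel throws when the calling thread has been interrupted, and the traces show it surfacing from FileChannelImpl.size() inside the WALByteBufReader constructor. Why the LogDispatcher thread was interrupted is not shown in the log; that it was interrupted as part of the removal is only an assumption. The minimal sketch below reproduces the JDK behaviour in isolation. The class name, method name, and temp file are hypothetical and are not IoTDB code.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-alone demo; it does not use any IoTDB classes.
class InterruptedChannelDemo {

    /**
     * Opens a temp file, sets the current thread's interrupt flag, and then
     * calls FileChannel.size(). Returns true if size() fails with
     * ClosedByInterruptException, false if it completes normally.
     */
    static boolean sizeFailsWhenInterrupted() throws IOException {
        Path file = Files.createTempFile("wal-demo", ".wal");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Simulate an interrupt that arrives before the channel is read.
            Thread.currentThread().interrupt();
            try {
                channel.size();
                return false;
            } catch (ClosedByInterruptException e) {
                // The channel is now closed and the interrupt flag is still set.
                return true;
            }
        } finally {
            // Clear the interrupt flag so the cleanup is not affected by it.
            Thread.interrupted();
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("size() failed on interrupted thread: "
                + sizeFailsWhenInterrupted());
    }
}
```

If this is what happens on ip68, "skip this file" in WALNode$PlanNodeIterator would be skipping WAL files that are intact and were merely unreadable on an interrupted thread, so the skipped entries may never reach the peer on ip64.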

After ip68 is removed, query DataRegion[14]:
"select count(s_0) from root.test.g_4.** align by device"
ip64 (DataRegion[14] Leader): 100 points/sensor
Stop the ip64 datanode, then run the same query again:
"select count(s_0) from root.test.g_4.** align by device"
ip66 (48 rows < 100 points):
 !image-2022-10-24-15-22-17-729.png! 
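
The comparison above can be written down as a small check: collect the count each replica returns for the same query and report every replica that falls short of the largest count. This is only a sketch. The class and method names are hypothetical, and the sample values 100 and 48 are taken from this report; treating them as directly comparable counts is an assumption. Fetching the counts from each node is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; it does not query IoTDB itself.
class ReplicaCountCheck {

    /**
     * Returns the replicas whose count is below the largest count seen,
     * mapped to how many points they are missing relative to that maximum.
     */
    static Map<String, Long> laggingReplicas(Map<String, Long> countsByNode) {
        long max = countsByNode.values().stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L);
        Map<String, Long> lagging = new TreeMap<>();
        for (Map.Entry<String, Long> entry : countsByNode.entrySet()) {
            if (entry.getValue() < max) {
                lagging.put(entry.getKey(), max - entry.getValue());
            }
        }
        return lagging;
    }

    public static void main(String[] args) {
        // Sample inputs from the report: ip64 returned 100, ip66 returned 48.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("ip64", 100L);
        counts.put("ip66", 48L);
        System.out.println("lagging replicas: " + laggingReplicas(counts));
    }
}
```

With three replicas and a finished write, an empty result is what one would expect once synchronization has completed.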

Test environment

1. 192.168.10.62 / 66 / 68: 72 CPU, 256 GB
benchmark: ip64, /data/liuzhen_test/weektest/benchmark_tool

ConfigNode
MAX_HEAP_SIZE="8G"
schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
schema_replication_factor=3
data_replication_factor=3

DataNode
MAX_HEAP_SIZE="192G"
MAX_DIRECT_MEMORY_SIZE="32G"
query_timeout_threshold=36000000

2. Benchmark configuration
See attachment.

3. After the benchmark write is done
ip68 cli: "flush"
ip68 cli: SET SYSTEM TO READONLY ON LOCAL

remove-datanode.sh "ip68's NodeID"

4. View the ip68 datanode logs



> [ remove datanode ] Data is inconsistent  ( remove datanode before the synchronization is complete )
> ----------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-4731
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4731
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Jinrui Zhang
>            Priority: Major
>         Attachments: image-2022-10-24-15-13-45-086.png, image-2022-10-24-15-22-17-729.png


