Posted to dev@kvrocks.apache.org by GitBox <gi...@apache.org> on 2022/07/13 17:09:33 UTC

[GitHub] [incubator-kvrocks] ethervoid created a discussion: Master node got unresponsive after restart one of the replicas

GitHub user ethervoid created a discussion: Master node got unresponsive after restart one of the replicas

Hello everyone!

We've been using kvrocks for a while. For context, our setup has 1 master node and 2 replicas, running as part of a Sentinel cluster.

When we want to release an update, we first deploy to the replicas and restart them, then do a manual failover with Sentinel, and finally deploy to the former master node.

Following this workflow today, we found that after deploying the changes to the first replica, our master became unresponsive for some reason. We started to see gaps in our Grafana metrics, as you can see:

<img width="463" alt="Captura de pantalla 2022-07-13 a las 19 05 48" src="https://user-images.githubusercontent.com/741240/178790699-de3a82e7-3127-483e-b5ac-4778372773fd.png">

<img width="428" alt="Captura de pantalla 2022-07-13 a las 19 05 40" src="https://user-images.githubusercontent.com/741240/178790781-9a22389d-e5f0-45d1-8b1d-61bf18f70c24.png">

We connected to the machine and checked the Docker container. It was running, with the following logs for that time window:

```
E0713 12:22:43.009356 14447 replication.cc:111] Write error while sending batch to slave: Broken pipe. batches: 0x243130360D0A7B11FE880D0000000200000003013201250B5F5F6E616D6573706163650000000C735F363737333231333331342F1B266EB733CEAC30060181F782F22601250B5F5F6E616D6573706163650000000C735F363737333231333331342F1B266EB733CEAC6105302E3735300D0A
E0713 12:23:08.211652    32 redis_cmd.cc:3533] checkWALBoundary with sequence: 58132926866, but GetWALIter return older sequence: 58132926860
E0713 12:43:06.192111  9671 replication.cc:111] Write error while sending batch to slave: Broken pipe. batches: 0x2431350D0A510D26890D000000000000000301320D0A
```

We stopped writing to that node, and after a few minutes it recovered and became responsive again without us doing anything else.

Could this be a bug, a known issue, or a misconfiguration?

GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728

----
This is an automatically sent email for dev@kvrocks.apache.org.
To unsubscribe, please send an email to: dev-unsubscribe@kvrocks.apache.org


[GitHub] [incubator-kvrocks] ethervoid added a comment to the discussion: Master node got unresponsive after restart one of the replicas

Posted by GitBox <gi...@apache.org>.
GitHub user ethervoid added a comment to the discussion: Master node got unresponsive after restart one of the replicas

So restarting a replica triggered a full sync, and that full sync blocked the master in terms of network and CPU usage. Is that correct?

If that's the case, any recommendation to avoid that?

GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728#discussioncomment-3145828



[GitHub] [incubator-kvrocks] git-hulk added a comment to the discussion: Master node got unresponsive after restart one of the replicas

Posted by GitBox <gi...@apache.org>.
GitHub user git-hulk added a comment to the discussion: Master node got unresponsive after restart one of the replicas

@ethervoid Thanks for your feedback. Can you paste more INFO logs, along with CPU and network metrics from the period when the node became unreachable? I can't tell what's wrong with your instance from the information above.

GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728#discussioncomment-3142391



[GitHub] [incubator-kvrocks] git-hulk edited a comment on the discussion: Master node got unresponsive after restart one of the replicas

Posted by GitBox <gi...@apache.org>.
GitHub user git-hulk edited a comment on the discussion: Master node got unresponsive after restart one of the replicas

Thanks @ethervoid 

> E0713 12:23:08.211652    32 redis_cmd.cc:3533] checkWALBoundary with sequence: 58132926866, but GetWALIter return older sequence: 58132926860

This log entry means the replica's replication offset was too old, so it had to do a full sync with the master DB. The exporter failed to fetch metrics at the same point in time, so my guess is that the cause was high network bandwidth and CPU usage during the full sync.
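
To make the two numbers in that log line easier to compare, here is a minimal Python sketch that extracts them and reports the gap. The function name `parse_wal_boundary` and the regular expression are illustrative assumptions modelled only on the single log line quoted in this thread; they are not part of kvrocks, and other versions may word the message differently.

```python
import re
from typing import Optional, Tuple

# Pattern modelled on the one log line quoted in this thread (an assumption).
_WAL_BOUNDARY = re.compile(
    r"checkWALBoundary with sequence: (\d+), "
    r"but GetWALIter return older sequence: (\d+)"
)


def parse_wal_boundary(line: str) -> Optional[Tuple[int, int]]:
    """Return (requested_sequence, returned_sequence), or None if no match."""
    match = _WAL_BOUNDARY.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The log line from the original report, split only for readability.
LOG_LINE = (
    "E0713 12:23:08.211652    32 redis_cmd.cc:3533] checkWALBoundary with "
    "sequence: 58132926866, but GetWALIter return older sequence: 58132926860"
)

if __name__ == "__main__":
    parsed = parse_wal_boundary(LOG_LINE)
    if parsed is not None:
        requested, returned = parsed
        print(f"requested={requested} returned={returned} gap={requested - returned}")
```

Running a helper like this over the master's error log around each replica restart is one way to check whether a restart coincided with this message.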

GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728#discussioncomment-3145556



[GitHub] [incubator-kvrocks] ethervoid added a comment to the discussion: Master node got unresponsive after restart one of the replicas

Posted by GitBox <gi...@apache.org>.
GitHub user ethervoid added a comment to the discussion: Master node got unresponsive after restart one of the replicas

@git-hulk Thank you for your answer! Sadly, I can't provide more logs than those because we lost the rest :( but I can provide network data from CloudWatch and CPU usage from the Prometheus metrics provided by the exporter. Let me know if anything else would be useful.

<img width="1218" alt="Captura de pantalla 2022-07-14 a las 10 32 36" src="https://user-images.githubusercontent.com/741240/178940495-5b5c6106-1434-47c8-84ed-b19ac87959a6.png">
<img width="1236" alt="Captura de pantalla 2022-07-14 a las 10 33 14" src="https://user-images.githubusercontent.com/741240/178940512-93ebbd56-970f-49fe-9c9c-434d3e6c1c01.png">
<img width="2236" alt="Captura de pantalla 2022-07-14 a las 10 36 20" src="https://user-images.githubusercontent.com/741240/178941571-22c5dbbf-2ba9-45a5-93b9-1fd8cce72087.png">
<img width="2217" alt="Captura de pantalla 2022-07-14 a las 10 36 44" src="https://user-images.githubusercontent.com/741240/178941592-fa83f06c-7f2b-47da-af2d-522ede7d6852.png">
<img width="2246" alt="Captura de pantalla 2022-07-14 a las 10 37 06" src="https://user-images.githubusercontent.com/741240/178941597-7c91e64a-ce4f-4266-9ec4-48b91ff007df.png">
<img width="2235" alt="Captura de pantalla 2022-07-14 a las 10 37 52" src="https://user-images.githubusercontent.com/741240/178941604-57d17c77-62c5-4594-97c1-9bbd898627cb.png">


GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728#discussioncomment-3144771



[GitHub] [incubator-kvrocks] git-hulk added a comment to the discussion: Master node got unresponsive after restart one of the replicas

Posted by GitBox <gi...@apache.org>.
GitHub user git-hulk added a comment to the discussion: Master node got unresponsive after restart one of the replicas

You can increase [rocksdb.wal_size_limit_mb](https://github.com/apache/incubator-kvrocks/blob/unstable/kvrocks.conf#L497) to reduce the likelihood of a full sync, at the cost of more disk space. For network usage, you can use `max-replication-mb` to limit the replication speed; this setting can be changed online with `config set max-replication-mb xxx`.
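
As a rough illustration of the online change, the Python sketch below builds the `config set max-replication-mb` command in the Redis wire format (RESP) and sends it over a plain socket. The helper names (`encode_resp`, `throttle_command`, `send_command`) and any host, port, or limit value passed to them are hypothetical and not taken from this thread; the sketch also assumes no authentication is required, so an instance with a password would need an `AUTH` command first.

```python
import socket
from typing import List


def encode_resp(args: List[str]) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def throttle_command(limit_mb: int) -> bytes:
    """Build `CONFIG SET max-replication-mb <limit_mb>` as raw RESP bytes."""
    if limit_mb < 0:
        raise ValueError("limit_mb must be non-negative")
    return encode_resp(["CONFIG", "SET", "max-replication-mb", str(limit_mb)])


def send_command(host: str, port: int, payload: bytes, timeout: float = 2.0) -> bytes:
    """Send one encoded command and return the first chunk of the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        return conn.recv(4096)
```

For example, `send_command("127.0.0.1", 6666, throttle_command(50))` would apply a limit of 50 to a local instance; the address, port, and value here are placeholders to adapt. The same effect can be had by typing the command into any Redis-compatible client connected to the master.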

GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728#discussioncomment-3145899



[GitHub] [incubator-kvrocks] ethervoid added a comment to the discussion: Master node got unresponsive after restart one of the replicas

Posted by GitBox <gi...@apache.org>.
GitHub user ethervoid added a comment to the discussion: Master node got unresponsive after restart one of the replicas

Understood! Thank you very much Hulk 🙇 

GitHub link: https://github.com/apache/incubator-kvrocks/discussions/728#discussioncomment-3145908
