Posted to dev@ratis.apache.org by William Song <sz...@163.com> on 2022/09/22 14:51:29 UTC
Inconsistent AppendEntries and OutOfDirectMemoryError in GrpcLogAppender
Hi,
We have new findings on https://issues.apache.org/jira/browse/RATIS-1674. We observe a large number of inconsistent AppendEntries, followed eventually by an OutOfDirectMemoryError on the leader. For previous discussions, please refer to https://lists.apache.org/list?dev@ratis.apache.org:2022-7:DirectOOM and https://lists.apache.org/list?dev@ratis.apache.org:2022-8:DirectOOM.
This time, after carefully analyzing the logs and code, we strongly suspect that the problem is rooted in how the gRPC log appender sends AppendEntries and updates NextIndex.
When a leader switch happens in a cluster containing a slow follower, the previous leader's pending AppendEntries, still queued in the slow follower, cause it to reply with a wrong NextIndex to the new leader. This starts a storm of inconsistent AppendEntries (AE) and finally leads to OOM.
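To make the amplification concrete, here is a minimal toy model. It is not Ratis code: the class and method names, the single-entry requests, and the pipelining window are all assumptions made for illustration. It assumes the leader sends a window of requests without waiting for replies, and that the follower rejects any request that would leave a gap after the end of its log (term checks are ignored).

```java
public class InconsistentAppendSketch {

    /**
     * Hypothetical follower-side check: a request is consistent only if it
     * does not leave a gap after the follower's log. Term checks are omitted.
     */
    static boolean accepts(long followerNextIndex, long requestFirstIndex) {
        return requestFirstIndex <= followerNextIndex;
    }

    /**
     * The leader pipelines `window` single-entry requests starting at
     * leaderNextIndex without waiting for replies. Returns how many of them
     * the follower rejects as inconsistent.
     */
    static int countInconsistentReplies(long followerNextIndex,
                                        long leaderNextIndex,
                                        int window) {
        int inconsistent = 0;
        long followerNext = followerNextIndex;
        for (int i = 0; i < window; i++) {
            long first = leaderNextIndex + i;
            if (accepts(followerNext, first)) {
                // The entry is appended (or already present), so the log end advances if needed.
                followerNext = Math.max(followerNext, first + 1);
            } else {
                // Gap between the follower's log end and the request: rejected.
                inconsistent++;
            }
        }
        return inconsistent;
    }

    public static void main(String[] args) {
        // Leader's NextIndex matches the follower's log end: nothing is rejected.
        System.out.println(countInconsistentReplies(100, 100, 8)); // prints 0
        // Leader was told a NextIndex that is too high: every pipelined request is rejected.
        System.out.println(countInconsistentReplies(100, 150, 8)); // prints 8
    }
}
```

In this simplified model a single wrong NextIndex turns every request in the pipeline into an inconsistent reply. If the leader keeps receiving a wrong value, each retry round repeats the effect.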
A detailed description is provided in https://issues.apache.org/jira/browse/RATIS-1674. Please help me confirm this problem. Thanks in advance!
Regards,
William
Re: [ANNOUNCE] New committer: Junfan Zhang
Posted by Kaijie Chen <ck...@apache.org>.
Congrats!
Best,
Kaijie
Re: Inconsistent AppendEntries and OutOfDirectMemoryError in GrpcLogAppender
Posted by Tsz Wo Sze <sz...@gmail.com>.
Hi William,
Thanks a lot for the follow-up. I will check the detailed description
in RATIS-1674.
Tsz-Wo