Posted to dev@ratis.apache.org by William Song <sz...@163.com> on 2022/09/22 14:51:29 UTC

Inconsistent AppendEntries and OutOfDirectMemoryError in GrpcLogAppender

Hi, 

We have new findings on https://issues.apache.org/jira/browse/RATIS-1674. We observe a large number of inconsistent AppendEntries and, eventually, an OutOfDirectMemoryError on the leader. For the previous discussions, please refer to https://lists.apache.org/list?dev@ratis.apache.org:2022-7:DirectOOM and https://lists.apache.org/list?dev@ratis.apache.org:2022-8:DirectOOM.

This time, after carefully analyzing the logs and the code, we strongly suspect that the problem is rooted in the gRPC Log Appender's mechanism for sending AppendEntries and updating NextIndex.

When a leader switch happens in a cluster containing a slow follower, the previous leader's pending AppendEntries, still queued in the slow follower, cause it to reply with a wrong NextIndex to the new leader. This starts a storm of inconsistent AppendEntries (AE) and finally leads to OOM.
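
To illustrate only the "storm" part, here is a minimal toy model in Java. It is not Ratis code: the class InconsistentAppendSketch, its Follower and Reply types, and the method countRejections are all hypothetical names, and the model assumes one log entry per request and a follower that drains its queue strictly in arrival order. How the NextIndex becomes wrong in the first place is described in RATIS-1674 and is not modeled here. The sketch only shows that once a leader pipelines requests from a wrong NextIndex, every request already in flight is rejected as inconsistent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only, not Ratis code: every class and method name is hypothetical.
class InconsistentAppendSketch {

    // Reply to one request: accepted or not, plus the follower's next-index hint.
    static final class Reply {
        final boolean accepted;
        final long nextIndexHint;

        Reply(boolean accepted, long nextIndexHint) {
            this.accepted = accepted;
            this.nextIndexHint = nextIndexHint;
        }
    }

    // Follower that accepts an entry only when it lands exactly at its next index.
    static final class Follower {
        private long nextIndex;                                  // next log index expected
        private final Queue<Long> pending = new ArrayDeque<>();  // queued entry indices

        Follower(long nextIndex) {
            this.nextIndex = nextIndex;
        }

        void enqueue(long entryIndex) {
            pending.add(entryIndex);
        }

        // Drains the queue in arrival order and returns one reply per request.
        List<Reply> drain() {
            List<Reply> replies = new ArrayList<>();
            while (!pending.isEmpty()) {
                long entryIndex = pending.poll();
                boolean accepted = entryIndex == nextIndex;
                if (accepted) {
                    nextIndex++;
                }
                replies.add(new Reply(accepted, nextIndex));
            }
            return replies;
        }
    }

    // The leader pipelines `pipelined` single-entry requests starting at the index
    // it assumes, without waiting for replies. Returns how many are rejected.
    static int countRejections(long followerNextIndex, long leaderAssumedNextIndex, int pipelined) {
        Follower follower = new Follower(followerNextIndex);
        for (int i = 0; i < pipelined; i++) {
            follower.enqueue(leaderAssumedNextIndex + i);
        }
        int rejected = 0;
        for (Reply reply : follower.drain()) {
            if (!reply.accepted) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(countRejections(10, 10, 5)); // leader's NextIndex is correct
        System.out.println(countRejections(10, 15, 5)); // leader's NextIndex is too high
    }
}
```

With a correct NextIndex the run prints 0 rejections. With a NextIndex that is too high, all 5 pipelined requests are rejected, and each rejection carries the same hint (10), so the number of inconsistent replies grows with the number of requests in flight.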

A detailed description is provided in https://issues.apache.org/jira/browse/RATIS-1674. Please help me confirm this problem. Thanks in advance!

Regards,
William

Re: [ANNOUNCE] New committer: Junfan Zhang

Posted by Kaijie Chen <ck...@apache.org>.
Congrats!

Best,
Kaijie

Re: Inconsistent AppendEntries and OutOfDirectMemoryError in GrpcLogAppender

Posted by Tsz Wo Sze <sz...@gmail.com>.
Hi William,

Thanks a lot for the follow-up.  I will check the detailed description
in RATIS-1674.

Tsz-Wo
