Posted to jira@kafka.apache.org by "Dong Lin (JIRA)" <ji...@apache.org> on 2018/07/25 21:06:00 UTC

[jira] [Updated] (KAFKA-7128) Lagging high watermark can lead to committed data loss after ISR expansion

     [ https://issues.apache.org/jira/browse/KAFKA-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Lin updated KAFKA-7128:
----------------------------
    Description: 
Some model checking exposed a weakness in the ISR expansion logic. We know that the high watermark can go backwards after a leader failover, but we may not have known that this can lead to the loss of committed data.

Say we have three replicas: r1, r2, and r3. Initially, the ISR consists of (r1, r2) and the leader is r1. r3 is a new replica that has not begun fetching. The data up to offset 10 has been committed to the ISR. Here is the initial state (a small model of it follows the listing):

State 1
ISR: (r1, r2)
Leader: r1
r1: [hw=10, leo=10]
r2: [hw=5, leo=10]
r3: [hw=0, leo=0]
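
For concreteness, here is a minimal Python model of the replica state used in this walkthrough. The Replica class and its field names are illustrative only, not the broker's actual data structures:

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    hw: int = 0       # high watermark: highest offset known to be committed
    leo: int = 0      # log end offset: one past the last locally written offset
    online: bool = True

# State 1: offsets up to 10 are committed to the ISR (r1, r2).
r1 = Replica("r1", hw=10, leo=10)
r2 = Replica("r2", hw=5, leo=10)   # the follower's hw lags the leader's
r3 = Replica("r3")                 # new replica, has not begun fetching
leader, isr = r1, {"r1", "r2"}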

Replica 1 then initiates shutdown (or fails) and leaves the ISR, which makes r2 the new leader. r2's high watermark still lags behind r1's.

State 2
ISR: (r2)
Leader: r2
r1 (offline): [hw=10, leo=10]
r2: [hw=5, leo=10]
r3: [hw=0, leo=0]

Replica 3 then catches up to the high watermark on r2 and joins the ISR. Its own high watermark may still lag behind r2's, but this is unimportant. (A sketch of the join check follows the state listing below.)

State 3
ISR: (r2, r3)
Leader: r2
r1 (offline): [hw=10, leo=10]
r2: [hw=5, leo=10]
r3: [hw=0, leo=5]
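
The expansion check that admits r3 behaves roughly like the following. This is a paraphrase of the logic for illustration, not the actual broker code, and it continues the model above:

def can_join_isr(leader: Replica, follower: Replica) -> bool:
    # Buggy: after a failover the leader's hw can be behind the true
    # committed offset, so "caught up to the hw" may admit a follower
    # that is still missing committed data.
    return follower.leo >= leader.hw

# States 2 and 3: r2 is now leader with hw=5; r3 fetches up to offset 5.
r3.leo = 5
assert can_join_isr(r2, r3)   # r3 joins the ISR despite missing offsets 5-10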

Now r2 fails, and r3 is elected leader as the sole member of the ISR. The committed data from offsets 5 to 10 has been lost; a replay of the full sequence in the model follows the state listing.

State 4
ISR: (r3)
Leader: r3
r1 (offline): [hw=10, leo=10]
r2 (offline): [hw=5, leo=10]
r3: [hw=0, leo=5]
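
Playing the whole sequence out in the same model makes the loss concrete: after the two failovers, no online replica holds offsets 5 through 10 even though they were committed in State 1.

# State 2: r1 fails; r2 becomes leader, its hw still lagging at 5.
r1.online = False
leader, isr = r2, {"r2"}

# State 3: r3 reaches r2's (stale) hw and is admitted to the ISR.
r3.leo = 5
isr = {"r2", "r3"}

# State 4: r2 fails; r3 is the only ISR member and becomes leader.
r2.online = False
leader, isr = r3, {"r3"}

# Offsets 5-10 were committed in State 1, but the new leader's log
# ends at 5: that committed data is gone from every online replica.
assert leader.leo == 5 < 10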

The bug is that we allowed r3 into the ISR after it had merely reached the leader's local high watermark. Since the follower does not know the true high watermark from the previous leader's epoch, the leader should not allow a replica to join the ISR until that replica has caught up to an offset within the leader's own epoch.
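
One hedged sketch of that rule: record the start offset of the current leader epoch (the leader's LEO at the moment it was elected) and require a joining follower to have caught up at least that far. The function and parameter names here are hypothetical, continuing the model above:

def can_join_isr_fixed(leader: Replica, follower: Replica,
                       epoch_start_offset: int) -> bool:
    # The leader can only vouch for offsets appended during its own
    # epoch, so require the follower to reach the epoch start offset,
    # not merely the possibly-stale high watermark.
    return follower.leo >= max(epoch_start_offset, leader.hw)

In the scenario above, r2's epoch would start at offset 10 (its LEO when elected), so r3 with leo=5 would be rejected, and the unsafe expansion in State 3 could not happen.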

Note this is related to [https://cwiki.apache.org/confluence/display/KAFKA/KIP-207%3A+Offsets+returned+by+ListOffsetsResponse+should+be+monotonically+increasing+even+during+a+partition+leader+change]


> Lagging high watermark can lead to committed data loss after ISR expansion
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-7128
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7128
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Anna Povzner
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)