Posted to users@kafka.apache.org by Yogesh Sangvikar <yo...@gmail.com> on 2017/09/18 16:08:33 UTC

Fwd: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Hi Team,

Please help to find resolution for below kafka rolling upgrade issue.

Thanks,

Yogesh

On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh Sangvikar 
wrote:
>
> Hi Team,
>
> Currently, we are using a confluent 3.0.0 kafka cluster in our production
> environment, and we are planning to upgrade the cluster to confluent
> 3.2.2.
> We have topics with millions of records and data getting
> continuously published to them. We are also using other
> confluent services like schema-registry, kafka connect and kafka rest to
> process the data.
>
> So, we can't afford a downtime upgrade for the platform.
>
> We have tried a rolling kafka upgrade, as suggested in the documentation below, in our Development
> environment,
>
> https://docs.confluent.io/3.2.2/upgrade.html
>
> https://kafka.apache.org/documentation/#upgrade
>
> But, we are observing data loss on topics while doing the rolling upgrade /
> restart of the kafka servers for "inter.broker.protocol.version=0.10.2".
>
> As per our observation, we suspect the following root cause for the data loss
> (explained for a topic partition having 3 replicas):
>
>    - As the kafka broker protocol version is updated from 0.10.0 to 0.10.2
>    in rolling fashion, the in-sync replicas still on the older version do not
>    allow the updated replicas (0.10.2) back into sync until all are updated.
>    - Also, we have explicitly disabled the "unclean.leader.election.enable"
>    property, so only in-sync replicas will be elected as leader for the given
>    partition.
>    - While doing the rolling update, as mentioned above, the older-version
>    leader does not allow the newer-version replicas to be in sync, so the
>    data pushed through this older-version leader is not synced to the other
>    replicas. If this (older-version) leader goes down for an upgrade, the other
>    updated replicas are shown in the in-sync column and one becomes leader, but
>    they lag behind the old-version leader and only show the offset of the data
>    they have synced so far.
>    - And, once the last replica comes back up with the updated version, it will
>    start syncing data from the current leader.
>
>
> Please let us know your comments on our observation and suggest the proper way to do a
> rolling kafka upgrade, as we can't afford downtime.
>
> Thanks,
> Yogesh
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Ismael Juma <is...@juma.me.uk>.
Great that it's working. Yes, you need retries not to drop messages during
broker restarts.

Ismael

On Tue, Sep 26, 2017 at 3:33 PM, Yogesh Sangvikar <
yogesh.sangvikar@gmail.com> wrote:

> Hi Team,
>
> Thanks a lot for the suggestion Ismael.
>
> We have tried kafka cluster rolling upgrade by doing the version changes
> (CURRENT_KAFKA_VERSION -  0.10.0, CURRENT_MESSAGE_FORMAT_VERSION - 0.10.0
> and upgraded respective version 0.10.2) in upgraded confluent package 3.2.2
> and observed the in-sync replicas are coming up immediately & also, the
> preferred leaders are coming up after version bump post sync.
>
> As per my understanding, the in-sync replicas & leader election happening
> quickly as the new data getting published while upgrade is getting written
> and synced using upgraded package libraries (0.10.2).
>
> Also, observed some records failed to produce due to error,
>
> kafka-rest error response -
>
> {"offsets":[{"partition":null,"offset":null,"error_code":
> 50003,"error":"This
> server is not the leader for that
> topic-partition."}],"key_schema_id":1542,"value_schema_id":1541}
>
> Exception in log file -
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
>
> To resolve the above error, we have override properties *acks=-1 (default,
> 1) retries=3 (default, 0) *for kafka rest producer config
> (kafka-rest.properties) and getting some duplicate events in topic.
>
>
> Thanks,
> Yogesh
>
> On Thu, Sep 21, 2017 at 7:09 AM, yogesh sangvikar <
> yogesh.sangvikar@gmail.com> wrote:
>
> > Thanks Ismael.
> > I will try the solution and update all.
> >
> > Thanks,
> > Yogesh
> > ------------------------------
> > From: Ismael Juma <is...@juma.me.uk>
> > Sent: ‎20-‎09-‎2017 11:57 PM
> > To: Kafka Users <us...@kafka.apache.org>
> > Subject: Re: Data loss while upgrading confluent 3.0.0 kafka cluster
> > toconfluent 3.2.2
> >
> > One clarification below:
> >
> > On Wed, Sep 20, 2017 at 3:50 PM, Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Comments inline.
> > >
> > > On Wed, Sep 20, 2017 at 11:56 AM, Yogesh Sangvikar <
> > > yogesh.sangvikar@gmail.com> wrote:
> > >
> > >> 2. At which point in the sequence below was the code for the brokers
> > >> updated to 0.10.2?
> > >>
> > >> [Comment: On the kafka servers, we have confluent-3.0.0 and
> > >> confluent-3.2.2
> > >> packages deployed separately. So, first for protocol and message
> version
> > >> to
> > >> 0.10.0 we have updated server.properties file in running
> confluent-3.0.0
> > >> package and restarted the service for the same.
> > >
> > > And, for protocol and message version to 0.10.2 bumb, we have modified
> > >> server.properties file in confluent-3.2.2 & stopped the old package
> > >> services and started the kafka services using new one. All restarts
> are
> > >> done rolling fashion and random broker.id sequence (4,3,2,1).]
> > >>
> > >
> > > You have to set version 0.10.0 in the server.properties of the
> 0.10.2/3.2
> > > brokers. This is probably the source of your issue. After all running
> > > brokers are version 0.10.2/3.2, then you can switch the version to
> > 0.10.2.
> > >
> >
> > The last sentence may be clearer with the following change:
> >
> > "After all running brokers are version 0.10.2/3.2, then you can switch
> the
> > inter.broker.protocol.version to 0.10.2 in server.properties."
> >
> > Ismael
> >
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Yogesh Sangvikar <yo...@gmail.com>.
Hi Team,

Thanks a lot for the suggestion Ismael.

We have retried the kafka cluster rolling upgrade with the version changes
(CURRENT_KAFKA_VERSION = 0.10.0, CURRENT_MESSAGE_FORMAT_VERSION = 0.10.0,
then bumped to 0.10.2) set in the upgraded confluent 3.2.2 package, and observed
that the in-sync replicas come back immediately and that the preferred leaders
are elected again after the version bump, once the replicas are in sync.

As per my understanding, the in-sync replicas and leader election recover
quickly because the new data published during the upgrade is written and
synced using the upgraded package libraries (0.10.2).

Also, we observed that some records failed to produce due to the error below,

kafka-rest error response -

{"offsets":[{"partition":null,"offset":null,"error_code":50003,"error":"This
server is not the leader for that
topic-partition."}],"key_schema_id":1542,"value_schema_id":1541}

Exception in log file -
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.


To resolve the above error, we have overridden the properties *acks=-1 (default:
1)* and *retries=3 (default: 0)* in the kafka rest producer config
(kafka-rest.properties), and we now see some duplicate events in the topic.
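
For reference, in kafka-rest.properties those overrides are passed to the REST
proxy's internal producer via the "producer." prefix; a minimal sketch (the values
below mirror the ones from our test and are not a tuning recommendation):

# -1 (equivalent to "all"): wait for all in-sync replicas to acknowledge each write
producer.acks=-1
# retry sends that fail while partition leadership moves during broker restarts
producer.retries=3

The duplicate events are expected with this approach: a retried request that had in
fact already succeeded is simply written again, since there is no idempotent producer
in this Kafka version.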


Thanks,
Yogesh

On Thu, Sep 21, 2017 at 7:09 AM, yogesh sangvikar <
yogesh.sangvikar@gmail.com> wrote:

> Thanks Ismael.
> I will try the solution and update all.
>
> Thanks,
> Yogesh
> ------------------------------
> From: Ismael Juma <is...@juma.me.uk>
> Sent: ‎20-‎09-‎2017 11:57 PM
> To: Kafka Users <us...@kafka.apache.org>
> Subject: Re: Data loss while upgrading confluent 3.0.0 kafka cluster
> toconfluent 3.2.2
>
> One clarification below:
>
> On Wed, Sep 20, 2017 at 3:50 PM, Ismael Juma <is...@juma.me.uk> wrote:
>
> > Comments inline.
> >
> > On Wed, Sep 20, 2017 at 11:56 AM, Yogesh Sangvikar <
> > yogesh.sangvikar@gmail.com> wrote:
> >
> >> 2. At which point in the sequence below was the code for the brokers
> >> updated to 0.10.2?
> >>
> >> [Comment: On the kafka servers, we have confluent-3.0.0 and
> >> confluent-3.2.2
> >> packages deployed separately. So, first for protocol and message version
> >> to
> >> 0.10.0 we have updated server.properties file in running confluent-3.0.0
> >> package and restarted the service for the same.
> >
> > And, for protocol and message version to 0.10.2 bumb, we have modified
> >> server.properties file in confluent-3.2.2 & stopped the old package
> >> services and started the kafka services using new one. All restarts are
> >> done rolling fashion and random broker.id sequence (4,3,2,1).]
> >>
> >
> > You have to set version 0.10.0 in the server.properties of the 0.10.2/3.2
> > brokers. This is probably the source of your issue. After all running
> > brokers are version 0.10.2/3.2, then you can switch the version to
> 0.10.2.
> >
>
> The last sentence may be clearer with the following change:
>
> "After all running brokers are version 0.10.2/3.2, then you can switch the
> inter.broker.protocol.version to 0.10.2 in server.properties."
>
> Ismael
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Ismael Juma <is...@juma.me.uk>.
One clarification below:

On Wed, Sep 20, 2017 at 3:50 PM, Ismael Juma <is...@juma.me.uk> wrote:

> Comments inline.
>
> On Wed, Sep 20, 2017 at 11:56 AM, Yogesh Sangvikar <
> yogesh.sangvikar@gmail.com> wrote:
>
>> 2. At which point in the sequence below was the code for the brokers
>> updated to 0.10.2?
>>
>> [Comment: On the kafka servers, we have confluent-3.0.0 and
>> confluent-3.2.2
>> packages deployed separately. So, first for protocol and message version
>> to
>> 0.10.0 we have updated server.properties file in running confluent-3.0.0
>> package and restarted the service for the same.
>
> And, for protocol and message version to 0.10.2 bumb, we have modified
>> server.properties file in confluent-3.2.2 & stopped the old package
>> services and started the kafka services using new one. All restarts are
>> done rolling fashion and random broker.id sequence (4,3,2,1).]
>>
>
> You have to set version 0.10.0 in the server.properties of the 0.10.2/3.2
> brokers. This is probably the source of your issue. After all running
> brokers are version 0.10.2/3.2, then you can switch the version to 0.10.2.
>

The last sentence may be clearer with the following change:

"After all running brokers are version 0.10.2/3.2, then you can switch the
inter.broker.protocol.version to 0.10.2 in server.properties."
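
As a sketch, for the 0.10.0 -> 0.10.2 path discussed here (no message format change
is needed), server.properties on the confluent 3.2.2 brokers would look roughly like
this across the two phases of the rolling upgrade:

# Phase 1: while swapping brokers over to the 3.2.2 binaries,
# keep speaking the old inter-broker protocol
inter.broker.protocol.version=0.10.0
log.message.format.version=0.10.0

# Phase 2: only once every broker is running the 3.2.2 code, bump the
# protocol and do one more rolling restart
inter.broker.protocol.version=0.10.2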

Ismael

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Ismael Juma <is...@juma.me.uk>.
Comments inline.

On Wed, Sep 20, 2017 at 11:56 AM, Yogesh Sangvikar <
yogesh.sangvikar@gmail.com> wrote:

> 2. At which point in the sequence below was the code for the brokers
> updated to 0.10.2?
>
> [Comment: On the kafka servers, we have confluent-3.0.0 and confluent-3.2.2
> packages deployed separately. So, first for protocol and message version to
> 0.10.0 we have updated server.properties file in running confluent-3.0.0
> package and restarted the service for the same.

And, for protocol and message version to 0.10.2 bumb, we have modified
> server.properties file in confluent-3.2.2 & stopped the old package
> services and started the kafka services using new one. All restarts are
> done rolling fashion and random broker.id sequence (4,3,2,1).]
>

You have to set version 0.10.0 in the server.properties of the 0.10.2/3.2
brokers. This is probably the source of your issue. After all running
brokers are version 0.10.2/3.2, then you can switch the version to 0.10.2.

Let us know if this fixes the issue.

Ismael

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Yogesh Sangvikar <yo...@gmail.com>.
Hi Ismael,

A few questions:

1. Please share the code for the test script.

    [Comment: We are publishing events using the kafka-rest
*POST /topics/<topic-name>* API, and using a JMeter script to call the API
to publish events continuously for 2 hrs. The "key" value for the events is
constant, so that we can check to which partition the events are getting
published.]
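
For illustration, a single produce call of that kind could look like the sketch
below; the host/port, the constant key and the value fields are placeholders for
whatever the JMeter script actually sends, and the schema IDs are simply the ones
that appear in the error response elsewhere in this thread:

curl -X POST http://<rest-proxy-host>:8082/topics/student-activity \
  -H "Content-Type: application/vnd.kafka.avro.v1+json" \
  -d '{"key_schema_id": 1542, "value_schema_id": 1541,
       "records": [{"key": "constant-key", "value": {"field1": "value1"}}]}'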

2. At which point in the sequence below was the code for the brokers
updated to 0.10.2?

[Comment: On the kafka servers, we have the confluent-3.0.0 and confluent-3.2.2
packages deployed separately. So, first, to set the protocol and message version to
0.10.0, we updated the server.properties file of the running confluent-3.0.0
package and restarted the service.
And, for the protocol and message version bump to 0.10.2, we modified the
server.properties file in confluent-3.2.2, stopped the old package
services and started the kafka services using the new one. All restarts are
done in rolling fashion, in a random broker.id sequence (4,3,2,1).]

3. When doing a rolling restart, it's generally a good idea to ensure that
there are no under-replicated partitions.

[Comment: Yes, for every restart we have waited for the required in-sync
replicas to be back.]
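
One way to verify that between restarts, assuming the cluster's ZooKeeper connect
string, is to list only the partitions that are currently under-replicated; an empty
result means it should be safe to take down the next broker:

./bin/kafka-topics --zookeeper <zk-host>:2181 --describe --under-replicated-partitions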

4. Is controlled shutdown completing successfully?

[Comment: Yes. We are stopping & starting the kafka services using
scripts kafka-server-stop & kafka-server-start.]
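
To double-check that each shutdown really was a controlled one, the broker config and
logs can be inspected; a sketch, where the log location is an assumption that depends
on how the service was installed:

# server.properties: controlled shutdown is enabled by default in this version
controlled.shutdown.enable=true

# look for the controlled-shutdown messages around each restart
grep -i "controlled shutdown" /var/log/kafka/server.log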


We are seeing some exceptions in kafka REST logs like,

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
2017-09-20 10:16:49 ERROR ProduceTask:71 - Producer error for request
io.confluent.kafkarest.ProduceTask@228c0e7e
org.apache.kafka.common.errors.NetworkException: The server disconnected
before a response was received.
2017-09-20 10:17:19 ERROR ProduceTask:71 - Producer error for request
io.confluent.kafkarest.ProduceTask@7d68db9d
org.apache.kafka.common.errors.TimeoutException: Batch containing 1
record(s) expired due to timeout while requesting metadata from brokers for
student-activity-3
2017-09-20 10:17:19 ERROR ProduceTask:71 - Producer error for request
io.confluent.kafkarest.ProduceTask@3bd78e12

We hope those exceptions are expected while the kafka servers are being
rolling-restarted and data is getting published.

Also, we have tried the upgrade after explicitly setting the properties,
producer.acks=all
producer.retries=1

but the issue is still the same.

Thanks,
Yogesh

On Tue, Sep 19, 2017 at 6:48 PM, Ismael Juma <is...@juma.me.uk> wrote:

> Hi Yogesh,
>
> A few questions:
>
> 1. Please share the code for the test script.
> 2. At which point in the sequence below was the code for the brokers
> updated to 0.10.2?
> 3. When doing a rolling restart, it's generally a good idea to ensure that
> there are no under-replicated partitions.
> 4. Is controlled shutdown completing successfully?
>
> Ismael
>
> On Tue, Sep 19, 2017 at 12:33 PM, Yogesh Sangvikar <
> yogesh.sangvikar@gmail.com> wrote:
>
> > Hi Team,
> >
> > Thanks for providing comments.
> >
> > Here adding more details on steps followed for upgrade,
> >
> > Cluster details: We are using 4 node kafka cluster and topics with 3
> > replication factor. For upgrade test, we are using a topic with 5
> > partitions & 3 replication factor.
> >
> > Topic:student-activity  PartitionCount:5        ReplicationFactor:3
> > Configs:
> >         Topic: student-activity Partition: 0    Leader: 4       Replicas:
> > 4,2,3 Isr: 4,2,3
> >         Topic: student-activity Partition: 1    Leader: 1       Replicas:
> > 1,3,4 Isr: 1,4,3
> >         Topic: student-activity Partition: 2    Leader: 2       Replicas:
> > 2,4,1 Isr: 2,4,1
> >         Topic: student-activity Partition: 3    Leader: 3       Replicas:
> > 3,1,2 Isr: 1,2,3
> >         Topic: student-activity Partition: 4    Leader: 4       Replicas:
> > 4,3,1 Isr: 4,1,3
> >
> > We are using a test script to publish events continuously to one of the
> > topic partition (here partition 3) and monitoring the scripts total
> > published events count  with the partition 3 offset value.
> >
> > [ Note: The topic partitions offset count may differ from CLI utility and
> > screenshot due to capture delay. ]
> >
> >    - First, we have rolling restarted all kafka brokers for explicit
> >    protocol and message version to 0.10.0, inter.broker.protocol.version=
> 0.10.0
> >
> >    log.message.format.version=0.10.0
> >
> >    - During this restarted, the events are getting published as expected
> >    and counters are increasing & in-sync replicas are coming up
> immediately
> >    post restart.
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    student-activity:3:785
> >    student-activity:0:1
> >    [image: Inline image 1]
> >
> >
> >    - Next, we  have rolling restarted kafka brokers for
> >    "inter.broker.protocol.version=0.10.2" in below broker sequence.
> (note
> >    that, test script is publishing events to the topic partition
> continuously)
> >
> >    - Restarted server with  broker.id = 4,
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    student-activity:3:1189
> >    student-activity:0:1
> >
> >    [image: Inline image 2]
> >
> >    - Restarted server with  broker.id = 3,
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    *student-activity:3:1430*
> >    student-activity:0:1
> >
> >
> >        [image: Inline image 3]
> >
> >
> >    - Restarted server with  broker.id = 2, (here, observe the partition
> 3
> >    offset count is decreased from last restart offset)
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    *student-activity:3:1357*
> >    student-activity:0:1
> >
> >            [image: Inline image 4]
> >
> >
> >    - Restarted last server with  broker.id = 1,
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    student-activity:3:1613
> >    student-activity:0:1
> >    [image: Inline image 5]
> >
> >    - Finally, rolling restarted all brokers (in same sequence above) for
> >    "log.message.format.version=0.10.2"
> >
> >
> > ​ [image: Inline image 6]
> > [image: Inline image 7]
> > [image: Inline image 8]
> >
> > [image: Inline image 9]
> >
> >    - The topic offset counter after final restart,
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    student-activity:3:2694
> >    student-activity:0:1
> >
> >
> >    - And, the topic offset counter after stopping events publish script,
> >
> >    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
> >    kafka.tools.GetOffsetShell --broker-list
> ***.***.***.***:9092,***.***.*
> >    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
> >    student-activity --time -1
> >    student-activity:2:1
> >    student-activity:4:1
> >    student-activity:1:68
> >    student-activity:3:2769
> >    student-activity:0:1
> >
> >    - Calculating missing events counts,
> >    Total events published by script to partition 3 :
> > *3090 *Offset count on Partition 3 :
> > *2769 *
> >    Missing events count : 3090 - 2769 = *321*
> >
> >
> > As per above observation during rolling restart for protocol version,
> >
> >    1. The partition 3 leader changed to in-sync replica 2 (with older
> >    protocol version) and upgraded replicas (3 & 4) are missing from
> in-sync
> >    replica list.
> >    2. And, one we down server 2 down for upgrade, suddenly replicas 3 & 4
> >    appear in in-sync replica list and partition offset count resets.
> >    3. Post server 2 & 1 upgrade, 3 in-sync replicas shown for partition 3
> >    but, missing events lag is not recovered.
> >
> > Please let us know your comments on our observations and correct us if we
> > are missing any upgrade steps.
> >
> > Thanks,
> > Yogesh
> >
> > On Tue, Sep 19, 2017 at 2:07 AM, Ismael Juma <is...@juma.me.uk> wrote:
> >
> >> Hi Scott,
> >>
> >> There is nothing preventing a replica running a newer version from being
> >> in
> >> sync as long as the instructions are followed (i.e.
> >> inter.broker.protocol.version has to be set correctly and, if there's a
> >> message format change, log.message.format.version). That's why I asked
> >> Yogesh for more details. The upgrade path he mentioned (0.10.0 ->
> 0.10.2)
> >> is straightforward, there isn't a message format change, so only
> >> inter.broker.protocol.version needs to be set.
> >>
> >> Ismael
> >>
> >> On Mon, Sep 18, 2017 at 5:50 PM, Scott Reynolds <
> >> sreynolds@twilio.com.invalid> wrote:
> >>
> >> > Can we get some clarity on this point:
> >> > >older version leader is not allowing newer version replicas to be in
> >> sync,
> >> > so the data pushed using this older version leader
> >> >
> >> > That is super scary.
> >> >
> >> > What protocol version is the older version leader running?
> >> >
> >> > Would this happen if you are skipping a protocol version bump?
> >> >
> >> >
> >> >
> >> > On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <is...@juma.me.uk>
> wrote:
> >> >
> >> > > Hi Yogesh,
> >> > >
> >> > > Can you please clarify what you mean by "observing data loss"?
> >> > >
> >> > > Ismael
> >> > >
> >> > > On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
> >> > > yogesh.sangvikar@gmail.com> wrote:
> >> > >
> >> > > > Hi Team,
> >> > > >
> >> > > > Please help to find resolution for below kafka rolling upgrade
> >> issue.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Yogesh
> >> > > >
> >> > > > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh
> >> Sangvikar
> >> > > > wrote:
> >> > > >>
> >> > > >> Hi Team,
> >> > > >>
> >> > > >> Currently, we are using confluent 3.0.0 kafka cluster in our
> >> > production
> >> > > >> environment. And, we are planing to upgrade the kafka cluster for
> >> > > confluent
> >> > > >> 3.2.2
> >> > > >> We are having topics with millions on records and data getting
> >> > > >> continuously published to those topics. And, also, we are using
> >> other
> >> > > >> confluent services like schema-registry, kafka connect and kafka
> >> rest
> >> > to
> >> > > >> process the data.
> >> > > >>
> >> > > >> So, we can't afford downtime upgrade for the platform.
> >> > > >>
> >> > > >> We have tries rolling kafka upgrade as suggested on blogs in
> >> > Development
> >> > > >> environment,
> >> > > >>
> >> > > >>
> >> > > >> https://docs.confluent.io/3.2.2/upgrade.html
> >> > > >>
> >> > > >> https://kafka.apache.org/documentation/#upgrade
> >> > > >>
> >> > > >> But, we are observing data loss on topics while doing rolling
> >> upgrade
> >> > /
> >> > > >> restart of kafka servers for "inter.broker.protocol.version
> >> =0.10.2".
> >> > > >>
> >> > > >> As per our observation, we suspect the root cause for the data
> loss
> >> > > >> (explained for a topic partition having 3 replicas),
> >> > > >>
> >> > > >>    - As the kafka broker protocol version updates from 0.10.0 to
> >> > 0.10.2
> >> > > >>    in rolling fashion, the in-sync replicas having older version
> >> will
> >> > > not
> >> > > >>    allow updated replicas (0.10.2) to be in sync unless are all
> >> > updated.
> >> > > >>    - Also, we have explicitly disabled "unclean.leader.election.
> >> > enabled"
> >> > > >>    property, so only in-sync replicas will be elected as leader
> for
> >> > the
> >> > > given
> >> > > >>    partition.
> >> > > >>    - While doing rolling fashion update, as mentioned above,
> older
> >> > > >>    version leader is not allowing newer version replicas to be in
> >> > sync,
> >> > > so the
> >> > > >>    data pushed using this older version leader, will not be
> synced
> >> > with
> >> > > other
> >> > > >>    replicas and if this leader(older version)  goes down for an
> >> > > upgrade, other
> >> > > >>    updated replicas will be shown in in-sync column and become
> >> leader,
> >> > > but
> >> > > >>    they lag in offset with old version leader and shows the
> offset
> >> of
> >> > > the data
> >> > > >>    till they have synced.
> >> > > >>    - And, once the last replica comes up with updated version,
> will
> >> > > >>    start syncing data from the current leader.
> >> > > >>
> >> > > >>
> >> > > >> Please let us know comments on our observation and suggest proper
> >> way
> >> > > for
> >> > > >> rolling kafka upgrade as we can't afford downtime.
> >> > > >>
> >> > > >> Thanks,
> >> > > >> Yogesh
> >> > > >>
> >> > > >
> >> > >
> >> > --
> >> >
> >> > Scott Reynolds
> >> > Principal Engineer
> >> > [image: twilio] <http://www.twilio.com/?utm_source=email_signature>
> >> > MOBILE (630) 254-2474
> >> > EMAIL sreynolds@twilio.com
> >> >
> >>
> >
> >
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Yogesh,

A few questions:

1. Please share the code for the test script.
2. At which point in the sequence below was the code for the brokers
updated to 0.10.2?
3. When doing a rolling restart, it's generally a good idea to ensure that
there are no under-replicated partitions.
4. Is controlled shutdown completing successfully?

Ismael

On Tue, Sep 19, 2017 at 12:33 PM, Yogesh Sangvikar <
yogesh.sangvikar@gmail.com> wrote:

> Hi Team,
>
> Thanks for providing comments.
>
> Here adding more details on steps followed for upgrade,
>
> Cluster details: We are using 4 node kafka cluster and topics with 3
> replication factor. For upgrade test, we are using a topic with 5
> partitions & 3 replication factor.
>
> Topic:student-activity  PartitionCount:5        ReplicationFactor:3
> Configs:
>         Topic: student-activity Partition: 0    Leader: 4       Replicas:
> 4,2,3 Isr: 4,2,3
>         Topic: student-activity Partition: 1    Leader: 1       Replicas:
> 1,3,4 Isr: 1,4,3
>         Topic: student-activity Partition: 2    Leader: 2       Replicas:
> 2,4,1 Isr: 2,4,1
>         Topic: student-activity Partition: 3    Leader: 3       Replicas:
> 3,1,2 Isr: 1,2,3
>         Topic: student-activity Partition: 4    Leader: 4       Replicas:
> 4,3,1 Isr: 4,1,3
>
> We are using a test script to publish events continuously to one of the
> topic partition (here partition 3) and monitoring the scripts total
> published events count  with the partition 3 offset value.
>
> [ Note: The topic partitions offset count may differ from CLI utility and
> screenshot due to capture delay. ]
>
>    - First, we have rolling restarted all kafka brokers for explicit
>    protocol and message version to 0.10.0, inter.broker.protocol.version=0.10.0
>
>    log.message.format.version=0.10.0
>
>    - During this restarted, the events are getting published as expected
>    and counters are increasing & in-sync replicas are coming up immediately
>    post restart.
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    student-activity:3:785
>    student-activity:0:1
>    [image: Inline image 1]
>
>
>    - Next, we  have rolling restarted kafka brokers for
>    "inter.broker.protocol.version=0.10.2" in below broker sequence. (note
>    that, test script is publishing events to the topic partition continuously)
>
>    - Restarted server with  broker.id = 4,
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    student-activity:3:1189
>    student-activity:0:1
>
>    [image: Inline image 2]
>
>    - Restarted server with  broker.id = 3,
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    *student-activity:3:1430*
>    student-activity:0:1
>
>
>        [image: Inline image 3]
>
>
>    - Restarted server with  broker.id = 2, (here, observe the partition 3
>    offset count is decreased from last restart offset)
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    *student-activity:3:1357*
>    student-activity:0:1
>
>            [image: Inline image 4]
>
>
>    - Restarted last server with  broker.id = 1,
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    student-activity:3:1613
>    student-activity:0:1
>    [image: Inline image 5]
>
>    - Finally, rolling restarted all brokers (in same sequence above) for
>    "log.message.format.version=0.10.2"
>
>
> ​ [image: Inline image 6]
> [image: Inline image 7]
> [image: Inline image 8]
>
> [image: Inline image 9]
>
>    - The topic offset counter after final restart,
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    student-activity:3:2694
>    student-activity:0:1
>
>
>    - And, the topic offset counter after stopping events publish script,
>
>    [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
>    kafka.tools.GetOffsetShell --broker-list ***.***.***.***:9092,***.***.*
>    **.***:9092,***.***.***.***:9092,***.***.***.***:9092 --topic
>    student-activity --time -1
>    student-activity:2:1
>    student-activity:4:1
>    student-activity:1:68
>    student-activity:3:2769
>    student-activity:0:1
>
>    - Calculating missing events counts,
>    Total events published by script to partition 3 :
> *3090 *Offset count on Partition 3 :
> *2769 *
>    Missing events count : 3090 - 2769 = *321*
>
>
> As per above observation during rolling restart for protocol version,
>
>    1. The partition 3 leader changed to in-sync replica 2 (with older
>    protocol version) and upgraded replicas (3 & 4) are missing from in-sync
>    replica list.
>    2. And, one we down server 2 down for upgrade, suddenly replicas 3 & 4
>    appear in in-sync replica list and partition offset count resets.
>    3. Post server 2 & 1 upgrade, 3 in-sync replicas shown for partition 3
>    but, missing events lag is not recovered.
>
> Please let us know your comments on our observations and correct us if we
> are missing any upgrade steps.
>
> Thanks,
> Yogesh
>
> On Tue, Sep 19, 2017 at 2:07 AM, Ismael Juma <is...@juma.me.uk> wrote:
>
>> Hi Scott,
>>
>> There is nothing preventing a replica running a newer version from being
>> in
>> sync as long as the instructions are followed (i.e.
>> inter.broker.protocol.version has to be set correctly and, if there's a
>> message format change, log.message.format.version). That's why I asked
>> Yogesh for more details. The upgrade path he mentioned (0.10.0 -> 0.10.2)
>> is straightforward, there isn't a message format change, so only
>> inter.broker.protocol.version needs to be set.
>>
>> Ismael
>>
>> On Mon, Sep 18, 2017 at 5:50 PM, Scott Reynolds <
>> sreynolds@twilio.com.invalid> wrote:
>>
>> > Can we get some clarity on this point:
>> > >older version leader is not allowing newer version replicas to be in
>> sync,
>> > so the data pushed using this older version leader
>> >
>> > That is super scary.
>> >
>> > What protocol version is the older version leader running?
>> >
>> > Would this happen if you are skipping a protocol version bump?
>> >
>> >
>> >
>> > On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <is...@juma.me.uk> wrote:
>> >
>> > > Hi Yogesh,
>> > >
>> > > Can you please clarify what you mean by "observing data loss"?
>> > >
>> > > Ismael
>> > >
>> > > On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
>> > > yogesh.sangvikar@gmail.com> wrote:
>> > >
>> > > > Hi Team,
>> > > >
>> > > > Please help to find resolution for below kafka rolling upgrade
>> issue.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Yogesh
>> > > >
>> > > > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh
>> Sangvikar
>> > > > wrote:
>> > > >>
>> > > >> Hi Team,
>> > > >>
>> > > >> Currently, we are using confluent 3.0.0 kafka cluster in our
>> > production
>> > > >> environment. And, we are planing to upgrade the kafka cluster for
>> > > confluent
>> > > >> 3.2.2
>> > > >> We are having topics with millions on records and data getting
>> > > >> continuously published to those topics. And, also, we are using
>> other
>> > > >> confluent services like schema-registry, kafka connect and kafka
>> rest
>> > to
>> > > >> process the data.
>> > > >>
>> > > >> So, we can't afford downtime upgrade for the platform.
>> > > >>
>> > > >> We have tries rolling kafka upgrade as suggested on blogs in
>> > Development
>> > > >> environment,
>> > > >>
>> > > >>
>> > > >> https://docs.confluent.io/3.2.2/upgrade.html
>> > > >>
>> > > >> https://kafka.apache.org/documentation/#upgrade
>> > > >>
>> > > >> But, we are observing data loss on topics while doing rolling
>> upgrade
>> > /
>> > > >> restart of kafka servers for "inter.broker.protocol.version
>> =0.10.2".
>> > > >>
>> > > >> As per our observation, we suspect the root cause for the data loss
>> > > >> (explained for a topic partition having 3 replicas),
>> > > >>
>> > > >>    - As the kafka broker protocol version updates from 0.10.0 to
>> > 0.10.2
>> > > >>    in rolling fashion, the in-sync replicas having older version
>> will
>> > > not
>> > > >>    allow updated replicas (0.10.2) to be in sync unless are all
>> > updated.
>> > > >>    - Also, we have explicitly disabled "unclean.leader.election.
>> > enabled"
>> > > >>    property, so only in-sync replicas will be elected as leader for
>> > the
>> > > given
>> > > >>    partition.
>> > > >>    - While doing rolling fashion update, as mentioned above, older
>> > > >>    version leader is not allowing newer version replicas to be in
>> > sync,
>> > > so the
>> > > >>    data pushed using this older version leader, will not be synced
>> > with
>> > > other
>> > > >>    replicas and if this leader(older version)  goes down for an
>> > > upgrade, other
>> > > >>    updated replicas will be shown in in-sync column and become
>> leader,
>> > > but
>> > > >>    they lag in offset with old version leader and shows the offset
>> of
>> > > the data
>> > > >>    till they have synced.
>> > > >>    - And, once the last replica comes up with updated version, will
>> > > >>    start syncing data from the current leader.
>> > > >>
>> > > >>
>> > > >> Please let us know comments on our observation and suggest proper
>> way
>> > > for
>> > > >> rolling kafka upgrade as we can't afford downtime.
>> > > >>
>> > > >> Thanks,
>> > > >> Yogesh
>> > > >>
>> > > >
>> > >
>> > --
>> >
>> > Scott Reynolds
>> > Principal Engineer
>> > [image: twilio] <http://www.twilio.com/?utm_source=email_signature>
>> > MOBILE (630) 254-2474
>> > EMAIL sreynolds@twilio.com
>> >
>>
>
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Yogesh Sangvikar <yo...@gmail.com>.
Hi Team,

Thanks for providing comments.

Here adding more details on steps followed for upgrade,

Cluster details: We are using a 4-node kafka cluster and topics with a
replication factor of 3. For the upgrade test, we are using a topic with 5
partitions & replication factor 3.

Topic:student-activity  PartitionCount:5        ReplicationFactor:3
Configs:
        Topic: student-activity Partition: 0    Leader: 4       Replicas:
4,2,3 Isr: 4,2,3
        Topic: student-activity Partition: 1    Leader: 1       Replicas:
1,3,4 Isr: 1,4,3
        Topic: student-activity Partition: 2    Leader: 2       Replicas:
2,4,1 Isr: 2,4,1
        Topic: student-activity Partition: 3    Leader: 3       Replicas:
3,1,2 Isr: 1,2,3
        Topic: student-activity Partition: 4    Leader: 4       Replicas:
4,3,1 Isr: 4,1,3

We are using a test script to publish events continuously to one of the
topic partitions (here partition 3) and comparing the script's total
published events count with the partition 3 offset value.

[ Note: The topic partition offset counts may differ between the CLI utility and
the screenshots due to capture delay. ]

   - First, we rolling-restarted all kafka brokers to explicitly set the
   protocol and message versions to 0.10.0,
   inter.broker.protocol.version=0.10.0
   log.message.format.version=0.10.0

   - During this restart, events are getting published as expected,
   counters are increasing, and in-sync replicas come back immediately
   post restart.

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   student-activity:3:785
   student-activity:0:1
   [image: Inline image 1]


   - Next, we have rolling-restarted the kafka brokers for
   "inter.broker.protocol.version=0.10.2" in the broker sequence below. (Note
   that the test script is publishing events to the topic partition continuously.)

   - Restarted server with  broker.id = 4,

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   student-activity:3:1189
   student-activity:0:1

   [image: Inline image 2]

   - Restarted server with  broker.id = 3,

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   *student-activity:3:1430*
   student-activity:0:1


       [image: Inline image 3]


   - Restarted server with broker.id = 2 (here, observe that the partition 3
   offset count has decreased from the previous restart's offset)

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   *student-activity:3:1357*
   student-activity:0:1

           [image: Inline image 4]


   - Restarted last server with  broker.id = 1,

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   student-activity:3:1613
   student-activity:0:1
   [image: Inline image 5]

   - Finally, we rolling-restarted all brokers (in the same sequence as above) for
   "log.message.format.version=0.10.2"


​ [image: Inline image 6]
[image: Inline image 7]
[image: Inline image 8]

[image: Inline image 9]

   - The topic offset counter after final restart,

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   student-activity:3:2694
   student-activity:0:1


   - And, the topic offset counter after stopping events publish script,

   [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class
   kafka.tools.GetOffsetShell --broker-list
   ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
   --topic student-activity --time -1
   student-activity:2:1
   student-activity:4:1
   student-activity:1:68
   student-activity:3:2769
   student-activity:0:1

   - Calculating the missing events count:
   Total events published by the script to partition 3: *3090*
   Offset count on partition 3: *2769*
   Missing events count: 3090 - 2769 = *321*


As per the above observations during the rolling restart for the protocol version,

   1. The partition 3 leader changed to in-sync replica 2 (with the older
   protocol version) and the upgraded replicas (3 & 4) went missing from the
   in-sync replica list.
   2. And, once we brought server 2 down for the upgrade, replicas 3 & 4
   suddenly appeared in the in-sync replica list and the partition offset count reset.
   3. After servers 2 & 1 were upgraded, 3 in-sync replicas are shown for partition 3,
   but the missing events are not recovered.

Please let us know your comments on our observations and correct us if we
are missing any upgrade steps.

Thanks,
Yogesh

On Tue, Sep 19, 2017 at 2:07 AM, Ismael Juma <is...@juma.me.uk> wrote:

> Hi Scott,
>
> There is nothing preventing a replica running a newer version from being in
> sync as long as the instructions are followed (i.e.
> inter.broker.protocol.version has to be set correctly and, if there's a
> message format change, log.message.format.version). That's why I asked
> Yogesh for more details. The upgrade path he mentioned (0.10.0 -> 0.10.2)
> is straightforward, there isn't a message format change, so only
> inter.broker.protocol.version needs to be set.
>
> Ismael
>
> On Mon, Sep 18, 2017 at 5:50 PM, Scott Reynolds <
> sreynolds@twilio.com.invalid> wrote:
>
> > Can we get some clarity on this point:
> > >older version leader is not allowing newer version replicas to be in
> sync,
> > so the data pushed using this older version leader
> >
> > That is super scary.
> >
> > What protocol version is the older version leader running?
> >
> > Would this happen if you are skipping a protocol version bump?
> >
> >
> >
> > On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Hi Yogesh,
> > >
> > > Can you please clarify what you mean by "observing data loss"?
> > >
> > > Ismael
> > >
> > > On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
> > > yogesh.sangvikar@gmail.com> wrote:
> > >
> > > > Hi Team,
> > > >
> > > > Please help to find resolution for below kafka rolling upgrade issue.
> > > >
> > > > Thanks,
> > > >
> > > > Yogesh
> > > >
> > > > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh
> Sangvikar
> > > > wrote:
> > > >>
> > > >> Hi Team,
> > > >>
> > > >> Currently, we are using confluent 3.0.0 kafka cluster in our
> > production
> > > >> environment. And, we are planing to upgrade the kafka cluster for
> > > confluent
> > > >> 3.2.2
> > > >> We are having topics with millions on records and data getting
> > > >> continuously published to those topics. And, also, we are using
> other
> > > >> confluent services like schema-registry, kafka connect and kafka
> rest
> > to
> > > >> process the data.
> > > >>
> > > >> So, we can't afford downtime upgrade for the platform.
> > > >>
> > > >> We have tries rolling kafka upgrade as suggested on blogs in
> > Development
> > > >> environment,
> > > >>
> > > >>
> > > >> https://docs.confluent.io/3.2.2/upgrade.html
> > > >>
> > > >> https://kafka.apache.org/documentation/#upgrade
> > > >>
> > > >> But, we are observing data loss on topics while doing rolling
> upgrade
> > /
> > > >> restart of kafka servers for "inter.broker.protocol.
> version=0.10.2".
> > > >>
> > > >> As per our observation, we suspect the root cause for the data loss
> > > >> (explained for a topic partition having 3 replicas),
> > > >>
> > > >>    - As the kafka broker protocol version updates from 0.10.0 to
> > 0.10.2
> > > >>    in rolling fashion, the in-sync replicas having older version
> will
> > > not
> > > >>    allow updated replicas (0.10.2) to be in sync unless are all
> > updated.
> > > >>    - Also, we have explicitly disabled "unclean.leader.election.
> > enabled"
> > > >>    property, so only in-sync replicas will be elected as leader for
> > the
> > > given
> > > >>    partition.
> > > >>    - While doing rolling fashion update, as mentioned above, older
> > > >>    version leader is not allowing newer version replicas to be in
> > sync,
> > > so the
> > > >>    data pushed using this older version leader, will not be synced
> > with
> > > other
> > > >>    replicas and if this leader(older version)  goes down for an
> > > upgrade, other
> > > >>    updated replicas will be shown in in-sync column and become
> leader,
> > > but
> > > >>    they lag in offset with old version leader and shows the offset
> of
> > > the data
> > > >>    till they have synced.
> > > >>    - And, once the last replica comes up with updated version, will
> > > >>    start syncing data from the current leader.
> > > >>
> > > >>
> > > >> Please let us know comments on our observation and suggest proper
> way
> > > for
> > > >> rolling kafka upgrade as we can't afford downtime.
> > > >>
> > > >> Thanks,
> > > >> Yogesh
> > > >>
> > > >
> > >
> > --
> >
> > Scott Reynolds
> > Principal Engineer
> > [image: twilio] <http://www.twilio.com/?utm_source=email_signature>
> > MOBILE (630) 254-2474
> > EMAIL sreynolds@twilio.com
> >
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Scott,

There is nothing preventing a replica running a newer version from being in
sync as long as the instructions are followed (i.e.
inter.broker.protocol.version has to be set correctly and, if there's a
message format change, log.message.format.version). That's why I asked
Yogesh for more details. The upgrade path he mentioned (0.10.0 -> 0.10.2)
is straightforward, there isn't a message format change, so only
inter.broker.protocol.version needs to be set.
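
A sketch of that general pattern in server.properties, for an upgrade path that does
involve a message format change (not the case for 0.10.0 -> 0.10.2):

# during the rolling upgrade of the binaries, pin both to the old versions
inter.broker.protocol.version=CURRENT_KAFKA_VERSION
log.message.format.version=CURRENT_MESSAGE_FORMAT_VERSION

# once all brokers run the new code, bump inter.broker.protocol.version first;
# bump log.message.format.version last, once the consumers have been upgraded too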

Ismael

On Mon, Sep 18, 2017 at 5:50 PM, Scott Reynolds <
sreynolds@twilio.com.invalid> wrote:

> Can we get some clarity on this point:
> >older version leader is not allowing newer version replicas to be in sync,
> so the data pushed using this older version leader
>
> That is super scary.
>
> What protocol version is the older version leader running?
>
> Would this happen if you are skipping a protocol version bump?
>
>
>
> On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <is...@juma.me.uk> wrote:
>
> > Hi Yogesh,
> >
> > Can you please clarify what you mean by "observing data loss"?
> >
> > Ismael
> >
> > On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
> > yogesh.sangvikar@gmail.com> wrote:
> >
> > > Hi Team,
> > >
> > > Please help to find resolution for below kafka rolling upgrade issue.
> > >
> > > Thanks,
> > >
> > > Yogesh
> > >
> > > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh Sangvikar
> > > wrote:
> > >>
> > >> Hi Team,
> > >>
> > >> Currently, we are using confluent 3.0.0 kafka cluster in our
> production
> > >> environment. And, we are planing to upgrade the kafka cluster for
> > confluent
> > >> 3.2.2
> > >> We are having topics with millions on records and data getting
> > >> continuously published to those topics. And, also, we are using other
> > >> confluent services like schema-registry, kafka connect and kafka rest
> to
> > >> process the data.
> > >>
> > >> So, we can't afford downtime upgrade for the platform.
> > >>
> > >> We have tries rolling kafka upgrade as suggested on blogs in
> Development
> > >> environment,
> > >>
> > >>
> > >> https://docs.confluent.io/3.2.2/upgrade.html
> > >>
> > >> https://kafka.apache.org/documentation/#upgrade
> > >>
> > >> But, we are observing data loss on topics while doing rolling upgrade
> /
> > >> restart of kafka servers for "inter.broker.protocol.version=0.10.2".
> > >>
> > >> As per our observation, we suspect the root cause for the data loss
> > >> (explained for a topic partition having 3 replicas),
> > >>
> > >>    - As the kafka broker protocol version updates from 0.10.0 to
> 0.10.2
> > >>    in rolling fashion, the in-sync replicas having older version will
> > not
> > >>    allow updated replicas (0.10.2) to be in sync unless are all
> updated.
> > >>    - Also, we have explicitly disabled "unclean.leader.election.
> enabled"
> > >>    property, so only in-sync replicas will be elected as leader for
> the
> > given
> > >>    partition.
> > >>    - While doing rolling fashion update, as mentioned above, older
> > >>    version leader is not allowing newer version replicas to be in
> sync,
> > so the
> > >>    data pushed using this older version leader, will not be synced
> with
> > other
> > >>    replicas and if this leader(older version)  goes down for an
> > upgrade, other
> > >>    updated replicas will be shown in in-sync column and become leader,
> > but
> > >>    they lag in offset with old version leader and shows the offset of
> > the data
> > >>    till they have synced.
> > >>    - And, once the last replica comes up with updated version, will
> > >>    start syncing data from the current leader.
> > >>
> > >>
> > >> Please let us know comments on our observation and suggest proper way
> > for
> > >> rolling kafka upgrade as we can't afford downtime.
> > >>
> > >> Thanks,
> > >> Yogesh
> > >>
> > >
> >
> --
>
> Scott Reynolds
> Principal Engineer
> [image: twilio] <http://www.twilio.com/?utm_source=email_signature>
> MOBILE (630) 254-2474
> EMAIL sreynolds@twilio.com
>

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Scott Reynolds <sr...@twilio.com.INVALID>.
Can we get some clarity on this point:
>older version leader is not allowing newer version replicas to be in sync,
so the data pushed using this older version leader

That is super scary.

What protocol version is the older version leader running?

Would this happen if you are skipping a protocol version bump?



On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <is...@juma.me.uk> wrote:

> Hi Yogesh,
>
> Can you please clarify what you mean by "observing data loss"?
>
> Ismael
>
> On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
> yogesh.sangvikar@gmail.com> wrote:
>
> > Hi Team,
> >
> > Please help to find resolution for below kafka rolling upgrade issue.
> >
> > Thanks,
> >
> > Yogesh
> >
> > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh Sangvikar
> > wrote:
> >>
> >> Hi Team,
> >>
> >> Currently, we are using confluent 3.0.0 kafka cluster in our production
> >> environment. And, we are planing to upgrade the kafka cluster for
> confluent
> >> 3.2.2
> >> We are having topics with millions on records and data getting
> >> continuously published to those topics. And, also, we are using other
> >> confluent services like schema-registry, kafka connect and kafka rest to
> >> process the data.
> >>
> >> So, we can't afford downtime upgrade for the platform.
> >>
> >> We have tries rolling kafka upgrade as suggested on blogs in Development
> >> environment,
> >>
> >>
> >> https://docs.confluent.io/3.2.2/upgrade.html
> >>
> >>
> >> https://kafka.apache.org/documentation/#upgrade
> >>
> >> But, we are observing data loss on topics while doing rolling upgrade /
> >> restart of kafka servers for "inter.broker.protocol.version=0.10.2".
> >>
> >> As per our observation, we suspect the root cause for the data loss
> >> (explained for a topic partition having 3 replicas),
> >>
> >>    - As the kafka broker protocol version updates from 0.10.0 to 0.10.2
> >>    in rolling fashion, the in-sync replicas having older version will
> not
> >>    allow updated replicas (0.10.2) to be in sync unless are all updated.
> >>    - Also, we have explicitly disabled "unclean.leader.election.enabled"
> >>    property, so only in-sync replicas will be elected as leader for the
> given
> >>    partition.
> >>    - While doing rolling fashion update, as mentioned above, older
> >>    version leader is not allowing newer version replicas to be in sync,
> so the
> >>    data pushed using this older version leader, will not be synced with
> other
> >>    replicas and if this leader(older version)  goes down for an
> upgrade, other
> >>    updated replicas will be shown in in-sync column and become leader,
> but
> >>    they lag in offset with old version leader and shows the offset of
> the data
> >>    till they have synced.
> >>    - And, once the last replica comes up with updated version, will
> >>    start syncing data from the current leader.
> >>
> >>
> >> Please let us know comments on our observation and suggest proper way
> for
> >> rolling kafka upgrade as we can't afford downtime.
> >>
> >> Thanks,
> >> Yogesh
> >>
> >
>
-- 

Scott Reynolds
Principal Engineer
[image: twilio] <http://www.twilio.com/?utm_source=email_signature>
MOBILE (630) 254-2474
EMAIL sreynolds@twilio.com

Re: Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Yogesh,

Can you please clarify what you mean by "observing data loss"?

Ismael

On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
yogesh.sangvikar@gmail.com> wrote:

> Hi Team,
>
> Please help to find resolution for below kafka rolling upgrade issue.
>
> Thanks,
>
> Yogesh
>
> On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh Sangvikar
> wrote:
>>
>> Hi Team,
>>
>> Currently, we are using confluent 3.0.0 kafka cluster in our production
>> environment. And, we are planing to upgrade the kafka cluster for confluent
>> 3.2.2
>> We are having topics with millions on records and data getting
>> continuously published to those topics. And, also, we are using other
>> confluent services like schema-registry, kafka connect and kafka rest to
>> process the data.
>>
>> So, we can't afford downtime upgrade for the platform.
>>
>> We have tries rolling kafka upgrade as suggested on blogs in Development
>> environment,
>>
>> https://docs.confluent.io/3.2.2/upgrade.html
>>
>> https://kafka.apache.org/documentation/#upgrade
>>
>> But, we are observing data loss on topics while doing rolling upgrade /
>> restart of kafka servers for "inter.broker.protocol.version=0.10.2".
>>
>> As per our observation, we suspect the root cause for the data loss
>> (explained for a topic partition having 3 replicas),
>>
>>    - As the kafka broker protocol version updates from 0.10.0 to 0.10.2
>>    in rolling fashion, the in-sync replicas having older version will not
>>    allow updated replicas (0.10.2) to be in sync unless are all updated.
>>    - Also, we have explicitly disabled "unclean.leader.election.enabled"
>>    property, so only in-sync replicas will be elected as leader for the given
>>    partition.
>>    - While doing rolling fashion update, as mentioned above, older
>>    version leader is not allowing newer version replicas to be in sync, so the
>>    data pushed using this older version leader, will not be synced with other
>>    replicas and if this leader(older version)  goes down for an upgrade, other
>>    updated replicas will be shown in in-sync column and become leader, but
>>    they lag in offset with old version leader and shows the offset of the data
>>    till they have synced.
>>    - And, once the last replica comes up with updated version, will
>>    start syncing data from the current leader.
>>
>>
>> Please let us know comments on our observation and suggest proper way for
>> rolling kafka upgrade as we can't afford downtime.
>>
>> Thanks,
>> Yogesh
>>
>