Posted to users@kafka.apache.org by Jhanssen Fávaro <jh...@gmail.com> on 2021/06/17 20:54:50 UTC

Kafka Ate My Data!

Hi all, we were testing Kafka disaster recovery at our sites.

Is there any way to avoid the scenario in this post?
https://blog.softwaremill.com/help-kafka-ate-my-data-ae2e5d3e6576

However, unclean leader election is not an option in our case.
FYI:
We had to disable the systemd service for our Kafka brokers (via systemctl) to avoid a service starting up with a corrupted leader disk.

Best Regards! 


Re: Kafka Ate My Data!

Posted by Ran Lupovich <ra...@gmail.com>.
With the settings described in my previous message (replication factor 3,
min.isr 2), the cluster will tolerate one broker being down without a
service outage.
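
As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes producers write with acks=all (not stated in this thread), and the helper name tolerable_broker_failures is made up for the example; it is not part of any Kafka API.

```python
def tolerable_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can be offline while acks=all writes are still accepted.

    Assumption: a write with acks=all is accepted only while the number of
    in-sync replicas is at least min.insync.replicas.
    """
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    # Replication factor 3 with min.isr 2: one broker may be down.
    print(tolerable_broker_failures(3, 2))
    # Requiring all 3 replicas to be in sync leaves no headroom for writes.
    print(tolerable_broker_failures(3, 3))
```

Under these assumptions, requiring every replica to be in sync trades write availability for nothing extra here: any single broker outage blocks producers.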

On Fri, Jun 18, 2021 at 00:42, Ran Lupovich
<ranlupovich@gmail.com> wrote:

> That's why you run at least 3 brokers in production, with the
> replication factor set to 3, min.isr set to 2, and each broker on a
> different rack. You could also use MM2 or Replicator to copy data to
> another DC...

Re: Kafka Ate My Data!

Posted by Ran Lupovich <ra...@gmail.com>.
That's why you run at least 3 brokers in production, with the
replication factor set to 3, min.isr set to 2, and each broker on a
different rack. You could also use MM2 or Replicator to copy data to
another DC...
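
For reference, the settings mentioned above map onto broker properties roughly as follows. This is only a sketch: default.replication.factor, min.insync.replicas, unclean.leader.election.enable and broker.rack are Kafka broker configuration keys, but the rack names and the helper broker_properties are hypothetical, and the unclean-election line reflects the constraint stated in the original question rather than this message.

```python
def broker_properties(rack: str) -> str:
    """Render a server.properties fragment for one broker (illustrative only)."""
    settings = {
        "default.replication.factor": 3,
        "min.insync.replicas": 2,
        # Constraint from the original question: never elect an out-of-sync leader.
        "unclean.leader.election.enable": "false",
        # Hypothetical rack label; one distinct value per broker.
        "broker.rack": rack,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    for rack in ("rack-a", "rack-b", "rack-c"):
        print(f"# broker in {rack}")
        print(broker_properties(rack))
        print()
```

Topic-level overrides exist for some of these keys as well, so the broker defaults shown here only apply where a topic does not set its own values.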

On Fri, Jun 18, 2021 at 00:33, Jhanssen Fávaro
<jhanssenfavaro@gmail.com> wrote:

> That's a disaster recovery simulation; we need to validate a way to avoid
> that in a disaster scenario! I mean, if we have a disaster and the
> servers get rebooted, we need to guard against this Kafka weakness.
>
> Regards,
> Jhanssen Fávaro de Oliveira

Re: Kafka Ate My Data!

Posted by Jhanssen Fávaro <jh...@gmail.com>.
That's a disaster recovery simulation; we need to validate a way to avoid
that in a disaster scenario! I mean, if we have a disaster and the
servers get rebooted, we need to guard against this Kafka weakness.

Regards,
Jhanssen Fávaro de Oliveira



On Thu, Jun 17, 2021 at 6:30 PM Sunil Unnithan <su...@gmail.com> wrote:

> Why would you reboot all three brokers in the same week/day?

Re: Kafka Ate My Data!

Posted by Sunil Unnithan <su...@gmail.com>.
Why would you reboot all three brokers in the same week/day?

On Thu, Jun 17, 2021 at 5:26 PM Jhanssen Fávaro <jh...@gmail.com>
wrote:

> Sunil,
> Business needs... Anyway, even with min.isr=2 we would face the same
> problem: for example, if the partition leader was the last broker to be
> rebooted and then had its disk corrupted, the erasure would happen the
> same way.
>
> Regards,

Re: Kafka Ate My Data!

Posted by Jhanssen Fávaro <jh...@gmail.com>.
Sunil,
Business needs... Anyway, even with min.isr=2 we would face the same problem: for example, if the partition leader was the last broker to be rebooted and then had its disk corrupted, the erasure would happen the same way.

Regards,

On 2021/06/17 21:23:40, Sunil Unnithan <su...@gmail.com> wrote: 
> Why isr=all? Why not use min.isr=2 in this case?

Re: Kafka Ate My Data!

Posted by Sunil Unnithan <su...@gmail.com>.
Why isr=all? Why not use min.isr=2 in this case?

On Thu, Jun 17, 2021 at 5:11 PM Jhanssen Fávaro <jh...@gmail.com>
wrote:

> Basically, suppose we have 3 brokers with ISR == all, and the broker
> leading a partition is the last server to be restarted/rebooted and
> suffers disk corruption during startup. All the followers will mark
> the topic as offline.
> So, when that last leader broker starts with its corrupted disk, it
> takes back partition leadership and then erases the data on all the
> other followers/brokers in the cluster.
>
> It should at least "ask" the other 2 brokers whether they are zeroed too.
> Is there any way to prevent this data from being truncated on the
> followers?
>
> Best Regards,
> Jhanssen

Re: Kafka Ate My Data!

Posted by Jhanssen Fávaro <jh...@gmail.com>.
Basically, suppose we have 3 brokers with ISR == all, and the broker leading a partition is the last server to be restarted/rebooted and suffers disk corruption during startup. All the followers will mark the topic as offline.
So, when that last leader broker starts with its corrupted disk, it takes back partition leadership and then erases the data on all the other followers/brokers in the cluster.

It should at least "ask" the other 2 brokers whether they are zeroed too.
Is there any way to prevent this data from being truncated on the followers?
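
To make the sequence concrete, here is a toy Python model of the scenario as described in this message. It is not Kafka's replication code: the Replica class, elect_leader and sync_followers are invented for the illustration, and it assumes the ISR had shrunk to the last leader before that broker went down, which is one way to read the situation described here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    broker_id: int
    log: list = field(default_factory=list)


def elect_leader(replicas, isr):
    """Pick the first replica that is in the ISR (unclean election disabled)."""
    for replica in replicas:
        if replica.broker_id in isr:
            return replica
    return None  # no eligible replica: the partition stays offline


def sync_followers(leader, replicas):
    """Toy rule: each follower cuts its log back to the leader's length."""
    for replica in replicas:
        if replica is not leader:
            del replica.log[len(leader.log):]


if __name__ == "__main__":
    records = ["m0", "m1", "m2"]
    replicas = [Replica(broker_id, list(records)) for broker_id in (1, 2, 3)]
    isr = {3}                 # assumption: brokers 1 and 2 had already left the ISR
    replicas[2].log.clear()   # broker 3 comes back with a wiped disk
    leader = elect_leader(replicas, isr)
    sync_followers(leader, replicas)
    print(leader.broker_id, [len(replica.log) for replica in replicas])
```

In this model broker 3 is the only eligible leader, so once it returns empty the two intact followers discard their records as well, which is the outcome the question is trying to prevent.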

Best Regards,
Jhanssen
On 2021/06/17 20:54:50, Jhanssen Fávaro <jh...@gmail.com> wrote:
> Hi all, we were testing Kafka disaster recovery at our sites.
>
> Is there any way to avoid the scenario in this post?
> https://blog.softwaremill.com/help-kafka-ate-my-data-ae2e5d3e6576
>
> However, unclean leader election is not an option in our case.
> FYI:
> We had to disable the systemd service for our Kafka brokers (via systemctl) to avoid a service starting up with a corrupted leader disk.
> 
> Best Regards! 
> 
>