Posted to users@kafka.apache.org by Pieter Hameete <pi...@blockbax.com> on 2019/11/20 13:26:54 UTC

Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Hello,

After some broker issues (too many open files) we managed to recover our brokers, but read_committed consumers are stuck on a specific topic partition. It seems the LSO is stuck at a specific offset. The transactional producer for the topic partition is working without errors, so the latest offset keeps incrementing and transactional producing still works.

What could be wrong here? And how can we get this LSO to increment again?

Thank you in advance for any advice.

Best,

Pieter

Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by Pieter Hameete <pi...@blockbax.com>.
Hello,

A final update on this. I found that there is an open transaction causing the LSO to be stuck at offset 10794778, similar to this Stack Overflow question:

https://stackoverflow.com/questions/56643907/manually-close-old-kafka-transaction

Despite using the same pool of transactional IDs, this old transaction was not aborted after the brokers and client apps came back online.
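
To illustrate why a single open transaction pins the LSO, here is a minimal Python sketch. It assumes the usual definition: the LSO is the first offset that still belongs to an open transaction, or the high watermark when no transaction is open. The record tuples, the function name and the producer IDs are made up for illustration; this is a toy model, not broker code.

```python
from typing import Dict, Iterable, Tuple

# (offset, producer_id, kind) where kind is "data", "commit" or "abort".
# This record shape is a simplification invented for this sketch.
Record = Tuple[int, int, str]


def last_stable_offset(records: Iterable[Record], high_watermark: int) -> int:
    """Replay a toy log and return the LSO under the assumed definition."""
    open_txns: Dict[int, int] = {}  # producer_id -> first offset of its open transaction
    for offset, producer_id, kind in records:
        if kind == "data":
            # The first data record of a producer opens a transaction.
            open_txns.setdefault(producer_id, offset)
        elif kind in ("commit", "abort"):
            # A control marker closes that producer's open transaction.
            open_txns.pop(producer_id, None)
        else:
            raise ValueError(f"unknown record kind: {kind}")
    if open_txns:
        return min(high_watermark, min(open_txns.values()))
    return high_watermark


log = [
    (10794776, 1, "data"),
    (10794777, 1, "commit"),
    (10794778, 2, "data"),    # hypothetical transaction that never gets a marker
    (10794779, 3, "data"),
    (10794780, 3, "commit"),  # later transactions still commit normally
]

print(last_stable_offset(log, high_watermark=10794781))
```

In this model the result stays at 10794778 no matter how many later transactions commit, and it only moves once a commit or abort marker is written for producer 2.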

Is there any way to abort this defective transaction? Or is the only option to migrate all the data from this topic to a new one using a read_uncommitted reader?

Best,

Pieter
________________________________
From: Pieter Hameete <pi...@blockbax.com>
Sent: Wednesday, 20 November 2019 16:33
To: Ashutosh singh <ge...@gmail.com>
CC: users@kafka.apache.org <us...@kafka.apache.org>
Subject: Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Hi Ashu, others,

I have tested with the latest kafkacat (with librdkafka 1.2.2), which also supports transactional reading.

Reading the partition with offset reset from the beginning reads up to offset 10794778 (the offset at which the LSO is stuck).

Reading the partition from any offset after 10794778 (any specific offset greater than 10794778, or auto offset reset to latest) does not read anything at all.

Reading in read_uncommitted mode works properly from any offset.

I think my only solution is to somehow get the LSO on the broker side to advance again. There is nothing I can do on the consumer side to get this working while keeping the isolation level at read_committed.

Best,

Pieter
________________________________
From: Ashutosh singh <ge...@gmail.com>
Sent: Wednesday, 20 November 2019 15:15
To: Pieter Hameete <pi...@blockbax.com>
CC: users@kafka.apache.org <us...@kafka.apache.org>
Subject: Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Alright, got that.
What about resetting or changing the consumer offset? You can try changing it to some previous offset and restarting the consumer. The consumer may have to do some duplicate processing, but it should work.

On Wed, Nov 20, 2019 at 7:18 PM Pieter Hameete <pi...@blockbax.com> wrote:
Hi Ashu,

Thanks for the tip. We have tried restarting the consumer, but that did not help. All read_committed consumers for this partition (we have multiple) have the same issue.

The partition already had different leaders when we performed a rolling restart of the brokers. All brokers report the same stuck LSO, so I don't think deleting the partition will help. Wouldn't it then be restored from another in-sync replica that also has the incorrect LSO?

Best,

Pieter
________________________________
From: Ashutosh singh <ge...@gmail.com>
Sent: Wednesday, 20 November 2019 14:43
To: users@kafka.apache.org <us...@kafka.apache.org>
Subject: Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Hello Pieter,

We had a similar issue.

Did you try restarting your consumer? If that doesn't fix it, you can
try deleting that particular topic partition from the broker and restarting
the broker so that it gets back in sync. Please make sure that you have an
in-sync replica before deleting the partition.

Thanks
Ashu




--
Thanx & Regard
Ashutosh Singh
08151945559




Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by Pieter Hameete <pi...@blockbax.com>.
Hi Ashu, others,

I have tested with the latest kafkacat (with librdkafka 1.2.2), which also supports transactional reading.

Reading the partition with offset reset from the beginning reads up to offset 10794778 (the offset at which the LSO is stuck).

Reading the partition from any offset after 10794778 (any specific offset greater than 10794778, or auto offset reset to latest) does not read anything at all.

Reading in read_uncommitted mode works properly from any offset.

I think my only solution is to somehow get the LSO on the broker side to advance again. There is nothing I can do on the consumer side to get this working while keeping the isolation level at read_committed.
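
The observations above are consistent with a simple model in which a read_committed fetch is bounded by the LSO and a read_uncommitted fetch by the high watermark. The Python sketch below is a toy illustration only, not client or broker code: the function name is invented, the high watermark value is hypothetical, and the log is assumed to start at offset 0.

```python
def visible_range(fetch_from: int, isolation_level: str,
                  lso: int, high_watermark: int) -> range:
    """Offsets a fetch starting at fetch_from could return in this toy model."""
    if isolation_level == "read_committed":
        upper = lso  # records at or beyond the LSO are withheld
    elif isolation_level == "read_uncommitted":
        upper = high_watermark
    else:
        raise ValueError(f"unknown isolation level: {isolation_level}")
    # An empty range means the fetch returns nothing.
    return range(fetch_from, max(fetch_from, upper))


LSO = 10794778             # stuck LSO reported in this thread
HIGH_WATERMARK = 10800000  # hypothetical value, not taken from the thread

# From the beginning, read_committed stops at the LSO.
print(len(visible_range(0, "read_committed", LSO, HIGH_WATERMARK)))
# From any offset past the LSO, read_committed sees nothing.
print(len(visible_range(10794800, "read_committed", LSO, HIGH_WATERMARK)))
# read_uncommitted is not limited by the LSO.
print(len(visible_range(10794800, "read_uncommitted", LSO, HIGH_WATERMARK)))
```

Under these assumptions no consumer-side offset change can help, because every read_committed fetch beyond the LSO maps to an empty range.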

Best,

Pieter

Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by Ashutosh singh <ge...@gmail.com>.
Alright, got that.
What about resetting or changing the consumer offset? You can try
changing it to some previous offset and restarting the consumer. The consumer
may have to do some duplicate processing, but it should work.



-- 
Thanx & Regard
Ashutosh Singh
08151945559

Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by Pieter Hameete <pi...@blockbax.com>.
Hi Ashu,

Thanks for the tip. We have tried restarting the consumer, but that did not help. All read_committed consumers for this partition (we have multiple) have the same issue.

The partition already had different leaders when we performed a rolling restart of the brokers. All brokers report the same stuck LSO, so I don't think deleting the partition will help. Wouldn't it then be restored from another in-sync replica that also has the incorrect LSO?

Best,

Pieter

Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by Ashutosh singh <ge...@gmail.com>.
Hello Pieter,

We had a similar issue.

Did you try restarting your consumer? If that doesn't fix it, you can
try deleting that particular topic partition from the broker and restarting
the broker so that it gets back in sync. Please make sure that you have an
in-sync replica before deleting the partition.

Thanks
Ashu




-- 
Thanx & Regard
Ashutosh Singh
08151945559

Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by Alexandre Dupriez <al...@gmail.com>.
Hi Pieter,

FWIW, you may have encountered the following bug:
https://issues.apache.org/jira/browse/KAFKA-12671

Thanks,
Alexandre

On Fri, 12 Jun 2020 at 00:43, D C <dr...@gmail.com> wrote:
>
> Hey peeps,
>
> Anyone else encountered this and got to the bottom of it?
>
> I'm facing a similar issue, having LSO stuck for some partitions in a topic
> and the consumers can't get data out of it (we're using read_committed =
> true).
>
> When this issue started happening we were on Kafka 2.3.1.
> I tried:
> - restarting the consumers
> - deleting the partition from the leader and letting it get in sync with
> the new leader
> - rolling restart of the brokers
> - shutting down the whole cluster and starting it again
> - deleting the txnindex files (after backing them up) and restarting
> the brokers
> - taking down the follower brokers of a partition and resyncing that
> partition on them from scratch
> - upgrading both the Kafka broker and client to 2.5.0
>
> Now the following questions arise:
> Where is the LSO actually stored? (Even if you get rid of the txnindex
> files, the LSO stays the same.)
> Is there any way that the LSO can be reset?
> Is there any way to manually abort and clean up the state of a stuck
> transaction? (I suspect that this is the reason why the LSO is stuck.)
> Is there any way to manually trigger a consistency check on the log files
> that would fix any existing issues with either the logs or the indexes in
> the partition?
>
> Cheers,
> Dragos
>

Re: Last Stable Offset (LSO) stuck for specific topic partition after Broker issues

Posted by D C <dr...@gmail.com>.
Hey peeps,

Anyone else encountered this and got to the bottom of it?

I'm facing a similar issue, having LSO stuck for some partitions in a topic
and the consumers can't get data out of it (we're using read_committed =
true).
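
For reference, the isolation mode discussed here is controlled by the consumer setting isolation.level, whose values are read_committed and read_uncommitted. A minimal sketch of such a configuration as a Python dict is shown below; the helper name, broker address and group ID are placeholders invented for this example, and the dict would still need to be handed to whichever client library is in use.

```python
from typing import Dict


def consumer_config(bootstrap_servers: str, group_id: str,
                    committed_only: bool = True) -> Dict[str, str]:
    """Build a consumer configuration dict (hypothetical helper)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # read_committed hides records of open or aborted transactions.
        "isolation.level": "read_committed" if committed_only else "read_uncommitted",
    }


# Placeholder address and group ID, not taken from the thread.
config = consumer_config("localhost:9092", "lso-debug")
print(config["isolation.level"])
```

Switching committed_only to False gives the read_uncommitted setting, which is how the data beyond a stuck LSO was still readable earlier in this thread.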

When this issue started happening we were on Kafka 2.3.1.
I tried:
- restarting the consumers
- deleting the partition from the leader and letting it get in sync with
the new leader
- rolling restart of the brokers
- shutting down the whole cluster and starting it again
- deleting the txnindex files (after backing them up) and restarting
the brokers
- taking down the follower brokers of a partition and resyncing that
partition on them from scratch
- upgrading both the Kafka broker and client to 2.5.0

Now the following questions arise:
Where is the LSO actually stored? (Even if you get rid of the txnindex
files, the LSO stays the same.)
Is there any way that the LSO can be reset?
Is there any way to manually abort and clean up the state of a stuck
transaction? (I suspect that this is the reason why the LSO is stuck.)
Is there any way to manually trigger a consistency check on the log files
that would fix any existing issues with either the logs or the indexes in
the partition?

Cheers,
Dragos
