Posted to dev@pulsar.apache.org by Yong Zhang <zh...@gmail.com> on 2022/09/06 09:33:26 UTC

Zookeeper exception handler in Pulsar

Hi all,

I noticed that the Pulsar metadata handler retries an operation when
ZooKeeper throws a connection loss exception. However, the operation may
still fail after the retry.

For example, we update the ledgers map in memory only after successfully
updating the LedgerInfo in ZooKeeper. If the ZooKeeper update executes
successfully on the server but the client sees a connection loss, we retry
the operation, and the callback may then receive a BadVersion exception.
At that point, the in-memory ledgers list differs from what is stored on
the ZooKeeper server, and that may cause other issues on the broker.
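
The sequence above can be sketched in a few lines. This is only an
illustration under stated assumptions: VersionedStore, updateWithRetry and
the two exception classes are hypothetical stand-ins written for this
example, not Pulsar or ZooKeeper APIs. The store applies a write and then
loses the response, so the retry with the same expected version is rejected.

```java
// Hypothetical sketch; none of these types are Pulsar or ZooKeeper classes.
final class RetryRaceSketch {

    static final class ConnectionLossException extends RuntimeException {}
    static final class BadVersionException extends RuntimeException {}

    /** Server side: one versioned value with compare-and-set semantics. */
    static final class VersionedStore {
        private String value = "ledgers=[1]";
        private int version = 0;
        // When true, the next successful write loses its response.
        private boolean dropNextResponse = false;

        void dropNextResponse() { dropNextResponse = true; }
        String value() { return value; }
        int version() { return version; }

        int put(String newValue, int expectedVersion) {
            if (expectedVersion != version) {
                throw new BadVersionException();
            }
            value = newValue;
            version++;
            if (dropNextResponse) {
                dropNextResponse = false;
                // The write is applied, but the client never learns about it.
                throw new ConnectionLossException();
            }
            return version;
        }
    }

    /** Client side: retries once on connection loss with the same expected version. */
    static String updateWithRetry(VersionedStore store, String newValue, int expectedVersion) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                store.put(newValue, expectedVersion);
                return "OK";
            } catch (ConnectionLossException e) {
                // Outcome unknown; fall through and retry.
            } catch (BadVersionException e) {
                return "BAD_VERSION";
            }
        }
        return "CONNECTION_LOSS";
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.dropNextResponse();

        String inMemory = "ledgers=[1]";
        String result = updateWithRetry(store, "ledgers=[1,2]", 0);
        if (result.equals("OK")) {
            inMemory = "ledgers=[1,2]"; // memory is updated only on success
        }
        // prints: BAD_VERSION memory=ledgers=[1] store=ledgers=[1,2]
        System.out.println(result + " memory=" + inMemory + " store=" + store.value());
    }
}
```

The caller sees a failure and leaves its in-memory copy untouched, while the
store already holds the new value.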

We need to do some work on the metastore and managed ledger to keep them
consistent. But that would mean changing most of the meta store callbacks
to handle it.
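
One possible shape for such handling, shown purely as an assumption and not
as the fix proposed in this thread: when a BadVersion follows a connection
loss, re-read the stored value and treat the update as successful if it is
the value we tried to write. VersionedStore and updateAndReconcile are
hypothetical names for this sketch, not Pulsar or ZooKeeper APIs.

```java
// Hypothetical sketch; none of these types are Pulsar or ZooKeeper classes.
final class ReconcileAfterRetrySketch {

    static final class ConnectionLossException extends RuntimeException {}
    static final class BadVersionException extends RuntimeException {}

    /** Server side: one versioned value with compare-and-set semantics. */
    static final class VersionedStore {
        private String value = "ledgers=[1]";
        private int version = 0;
        private boolean dropNextResponse = false;

        void dropNextResponse() { dropNextResponse = true; }
        String value() { return value; }
        int version() { return version; }

        int put(String newValue, int expectedVersion) {
            if (expectedVersion != version) {
                throw new BadVersionException();
            }
            value = newValue;
            version++;
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new ConnectionLossException(); // applied, response lost
            }
            return version;
        }
    }

    /**
     * Returns the version the caller should record for newValue.
     * Throws BadVersionException if the update really lost to another write.
     */
    static int updateAndReconcile(VersionedStore store, String newValue, int expectedVersion) {
        boolean sawConnectionLoss = false;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                return store.put(newValue, expectedVersion);
            } catch (ConnectionLossException e) {
                sawConnectionLoss = true; // outcome of this attempt is unknown
            } catch (BadVersionException e) {
                // Only an ambiguous earlier attempt can explain our own value
                // already being stored. A real store would return value and
                // version from a single read; two calls keep the sketch short.
                if (sawConnectionLoss && newValue.equals(store.value())) {
                    return store.version();
                }
                throw e;
            }
        }
        throw new ConnectionLossException();
    }
}
```

Comparing values cannot distinguish our write from another writer storing
an identical value, so this only narrows the problem rather than solving it.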

I would like to hear your ideas on this. WDYT?

Regards,
Yong

Re: Zookeeper exception handler in Pulsar

Posted by Yong Zhang <zh...@gmail.com>.

Hi Lari,

I filed an issue here: https://github.com/apache/pulsar/issues/17516

> This sounds like a severe issue that could lead to data loss. Is that
> correct? What are the implications of this?

Yes. What we ran into before was a consumer getting stuck: the ledger could
not be read from tiered storage because the ledger had been deleted. The
topic metadata showed that it had been offloaded successfully and had not
expired.

I am still trying to figure out a way to avoid changing many places in
Pulsar, so I don't have a detailed solution yet.

Yong



On Wed, 7 Sept 2022 at 16:51, Lari Hotari <lh...@apache.org> wrote:

> Hi Yong,
>
> Thanks for sharing your findings. Would it make sense to also share the
> issues with some detailed log messages in GH issues so that others that
> experience these problems would be able to find the later fixes for this
> problem and track the status?
>
> > a BadVersion exception. At this moment, the memory ledgers list is
> > different from
> > the zookeeper server. And that may cause some other issues on the broker.
>
> This sounds like a severe issue that could lead to data loss. Is that
> correct? What are the implications of this?
>
> > We need to do some work on the metastore and managed ledger to keep the
> > consistency between them. But that would change most of the callback of
> the
> > meta store to handle it.
>
> This sounds reasonable. Would you be able to share more details about this
> solution?
>
> -Lari

Re: Zookeeper exception handler in Pulsar

Posted by Lari Hotari <lh...@apache.org>.

Hi Yong, 

Thanks for sharing your findings. Would it make sense to also file GH issues with some detailed log messages, so that others who experience these problems can find the later fixes and track the status?

> a BadVersion exception. At this moment, the memory ledgers list is
> different from
> the zookeeper server. And that may cause some other issues on the broker.

This sounds like a severe issue that could lead to data loss. Is that correct? What are the implications of this?

> We need to do some work on the metastore and managed ledger to keep the
> consistency between them. But that would change most of the callback of the
> meta store to handle it.

This sounds reasonable. Would you be able to share more details about this solution?

-Lari
