Posted to dev@bookkeeper.apache.org by Enrico Olivelli <eo...@gmail.com> on 2021/01/08 16:27:21 UTC

Unbounded memory usage for WQ > AQ ?

Hi Matteo,
in this comment you are talking about an issue you saw when WQ is greater
than AQ
https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246

IIUC you are saying that if one bookie is slow the client continues to
accumulate references to the entries that still have not received the
confirmation from it.
I think that this is correct.

Have you seen problems in production related to this scenario?
Can you tell more about them?

Regards
Enrico

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
Right, good catch, Enrico. The issue (#1063) description says: 

> PendingAddOp:maybeRecycle()->recycle() keeps the buffer until writeComplete() 
> is called for each bookie write. We need to keep this buffer only until it is successfully
> transferred by netty. In the current code, the write is retried only if
> disableEnsembleChangeFeature is enabled. Otherwise, there is no point in keeping
> this buffer around.

JV, the author of the PR, also says the following to Sijie:

> toSend buffer is not needed for retries as we discussed on slack.


I don't know what the reason is. JV, Sijie, it has been a while, but perhaps you can elaborate? Specifically, I'm trying to understand what guarantee BK is currently offering for a configuration in which WQ > AQ. I'd think that we guarantee that an entry that is acknowledged is eventually written WQ ways and that it is observable by readers when the ledger is closed.

-Flavio

> On 14 Jan 2021, at 18:34, Enrico Olivelli <eo...@gmail.com> wrote:
> 
> Flavio
> 
> On Thu, 14 Jan 2021 at 17:56, Flavio Junqueira <fp...@apache.org> wrote:
> 
>> Using your example, the PendingAddOp should remain active until there are
>> 3 copies of the add entry. The client can ack back once it receives two
>> positive acks from bookies, but it shouldn't declare the add entry done at
>> that point.
>> 
> 
> Probably this behaviour has been broken by this commit
> https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c
> 
> My understanding is that as soon as we reach AQ we are discarding the
> "toSend" buffer and we cannot retry the write anymore
> 
> Enrico
> 
> 
>> 
>> There is the case that the third bookie is slow, but it could have failed
>> altogether, in which case the entry needs to be replicated in a new bookie.
>> 
>> -Flavio
>> 
>> On 13 Jan 2021, at 17:28, Enrico Olivelli <eo...@gmail.com> wrote:
>> 
>> 
>> 
>> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <fp...@apache.org> wrote:
>> 
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure about which kind of support we should provide at BK level
>>> 
>>> 
>>> Is there an issue for this? If there isn't, then perhaps we can start
>>> that way.
>>> 
>>> And as soon as the application is notified of the result of the write
>>> (success or failure) we are releasing the reference to the payload (as I
>>> have shown in this email thread),
>>> so in theory the application has full control over the retained memory
>>> and it can apply its own memory management mechanisms
>>> 
>>> 
>>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>>> reply, then it is possible that the entry is not going to be written to
>>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>>> application? The contract as I understand it is that an entry is to be
>>> replicated |WQ| ways, even though the application is willing to receive a
>>> confirmation after |AQ| bookie responses.
>>> 
>>> What am I missing?
>>> 
>> 
>> If I am reading the PendingAddOp code correctly, currently we do it this
>> way, say we run with 3-3-2:
>> - enqueue the write request to the 3 PerChannelBookieClients
>> - as soon as we receive 2 confirmations we trigger the callback and
>> discard the payload
>> 
>> so if the first 2 confirmations arrive before we write to the socket
>> (enqueue the payload on Netty channel actually) of the third bookie, we are
>> not sending the entry to the 3rd bookie at all.
>> This should not happen because we serialize the operations per-ledger (by
>> sticking them to one thread), so you cannot process the incoming acks from
>> the first two bookies while executing PendingAddOp write loop.
>> So we are giving a chance to every bookie to receive the entry, if it is
>> in good health (bookie, network...)
>> 
>> Enrico
>> 
>> 
>> 
>>> 
>>> -Flavio
>>> 
>>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eo...@gmail.com> wrote:
>>> 
>>> Flavio
>>> 
>>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <fp...@apache.org> wrote:
>>> 
>>>> I have observed the issue that Matteo describes and I also attributed
>>>> the problem to the absence of a back pressure mechanism in the client.
>>>> Issue #2497 was not about that, though. There was some corruption going on
>>>> that was leading to the server receiving garbage.
>>>> 
>>> 
>>> Correct, #2497 is not about the topic of this email, I just mentioned it
>>> because the discussion started from that comment from Matteo.
>>> 
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure about which kind of support we should provide at BK level
>>> 
>>> Regarding the writer side of this story and memory usage,
>>> we are not performing copies of the original payload that the caller is
>>> passing, in case of a ByteBuf
>>> see PendingAddOp
>>> 
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>>> and here, we simply wrap it in a ByteBufList
>>> 
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>> 
>>> And as soon as the application is notified of the result of the write
>>> (success or failure) we are releasing the reference to the payload (as I
>>> have shown in this email thread),
>>> so in theory the application has full control over the retained memory
>>> and it can apply its own memory management mechanisms
>>> 
>>> 
>>> Enrico
>>> 
>>> 
>>>> -Flavio
>>>> 
>>>>> On 8 Jan 2021, at 22:47, Matteo Merli <mm...@apache.org> wrote:
>>>>> 
>>>>> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>> Hi Matteo,
>>>>>> in this comment you are talking about an issue you saw when WQ is
>>>> greater than AQ
>>>>>> 
>>>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>>>>> 
>>>>>> IIUC you are saying that if one bookie is slow the client continues
>>>> to accumulate references to the entries that still have not received the
>>>> confirmation from it.
>>>>>> I think that this is correct.
>>>>>> 
>>>>>> Have you seen problems in production related to this scenario?
>>>>>> Can you tell more about them?
>>>>> 
>>>>> Yes, for simplicity, assume e=3, w=3, a=2.
>>>>> 
>>>>> If one bookie is slow (not down, just slow), the BK client will ack
>>>>> to the user that the entries are written after the first 2 acks.
>>>>> In the meantime, it will keep waiting for the 3rd bookie to respond.
>>>>> If the bookie responds within the timeout, the entries can now be
>>>>> dropped from memory, otherwise the write will timeout internally and
>>>>> it will get replayed to a new bookie.
>>>>> 
>>>>> In both cases, the amount of memory used in the client will max out at
>>>>> "throughput" * "timeout". This can be a large amount of memory and
>>>>> easily cause OOM errors.
>>>>> 
>>>>> Part of the problem is that it cannot be solved from outside the BK
>>>>> client, since there's no visibility on what entries have 2 or 3 acks
>>>>> and therefore it's not possible to apply backpressure. Instead,
>>>>> there should be a backpressure mechanism in the BK client itself to
>>>>> prevent this kind of issue.
>>>>> One possibility there could be to use the same approach as described
>>>>> in
>>>> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits
>>>> ,
>>>>> giving a max memory limit per BK client instance and throttling
>>>>> everything after the quota is reached.
>>>>> 
>>>>> 
>>>>> Matteo
>>> 
>>> 
>> 


Re: Unbounded memory usage for WQ > AQ ?

Posted by Enrico Olivelli <eo...@gmail.com>.
Flavio

On Thu, 14 Jan 2021 at 17:56, Flavio Junqueira <fp...@apache.org> wrote:

> Using your example, the PendingAddOp should remain active until there are
> 3 copies of the add entry. The client can ack back once it receives two
> positive acks from bookies, but it shouldn't declare the add entry done at
> that point.
>

Probably this behaviour has been broken by this commit
https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c

My understanding is that as soon as we reach AQ we are discarding the
"toSend" buffer and we cannot retry the write anymore

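To make the suspected behaviour concrete, here is a minimal sketch in Java
of the recycle-on-AQ pattern I am describing (illustrative names only, not
the actual PendingAddOp code):

    import io.netty.buffer.ByteBuf;

    // Sketch only: once the AQ-th ack arrives, the payload is released,
    // so a later timeout on the remaining bookie has nothing to resend.
    class PendingAddSketch {
        private final int ackQuorum = 2;   // AQ in a 3-3-2 configuration
        private int acksReceived = 0;
        private ByteBuf toSend;            // wraps the caller's payload

        synchronized void writeComplete(int rc) {
            if (rc != 0) {
                return; // failure/timeout handling elided
            }
            if (++acksReceived == ackQuorum) {
                // The application callback fires here and the buffer is
                // recycled immediately; the write to the third bookie can
                // no longer be retried after this point.
                toSend.release();
                toSend = null;
            }
        }
    }
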
Enrico


>
> There is the case that the third bookie is slow, but it could have failed
> altogether, in which case the entry needs to be replicated in a new bookie.
>
> -Flavio
>
> On 13 Jan 2021, at 17:28, Enrico Olivelli <eo...@gmail.com> wrote:
>
>
>
> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <fp...@apache.org> wrote:
>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>>
>> Is there an issue for this? If there isn't, then perhaps we can start
>> that way.
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms
>>
>>
>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>> reply, then it is possible that the entry is not going to be written to
>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>> application? The contract as I understand it is that an entry is to be
>> replicated |WQ| ways, even though the application is willing to receive a
>> confirmation after |AQ| bookie responses.
>>
>> What am I missing?
>>
>
> If I am reading the PendingAddOp code correctly, currently we do it this
> way, say we run with 3-3-2:
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and
> discard the payload
>
> so if the first 2 confirmations arrive before we write to the socket
> (enqueue the payload on Netty channel actually) of the third bookie, we are
> not sending the entry to the 3rd bookie at all.
> This should not happen because we serialize the operations per-ledger (by
> sticking them to one thread), so you cannot process the incoming acks from
> the first two bookies while executing PendingAddOp write loop.
> So we are giving a chance to every bookie to receive the entry, if it is
> in good health (bookie, network...)
>
> Enrico
>
>
>
>>
>> -Flavio
>>
>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eo...@gmail.com> wrote:
>>
>> Flavio
>>
>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <fp...@apache.org> wrote:
>>
>>> I have observed the issue that Matteo describes and I also attributed
>>> the problem to the absence of a back pressure mechanism in the client.
>>> Issue #2497 was not about that, though. There was some corruption going on
>>> that was leading to the server receiving garbage.
>>>
>>
>> Correct, #2497 is not about the topic of this email, I just mentioned it
>> because the discussion started from that comment from Matteo.
>>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>> Regarding the writer side of this story and memory usage,
>> we are not performing copies of the original payload that the caller is
>> passing, in case of a ByteBuf
>> see PendingAddOp
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>> and here, we simply wrap it in a ByteBufList
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms
>>
>>
>> Enrico
>>
>>
>>> -Flavio
>>>
>>> > On 8 Jan 2021, at 22:47, Matteo Merli <mm...@apache.org> wrote:
>>> >
>>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi Matteo,
>>> >> in this comment you are talking about an issue you saw when WQ is
>>> greater than AQ
>>> >>
>>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>> >>
>>> >> IIUC you are saying that if one bookie is slow the client continues
>>> to accumulate references to the entries that still have not received the
>>> confirmation from it.
>>> >> I think that this is correct.
>>> >>
>>> >> Have you seen problems in production related to this scenario?
>>> >> Can you tell more about them?
>>> >
>>> > Yes, for simplicity, assume e=3, w=3, a=2.
>>> >
>>> > If one bookie is slow (not down, just slow), the BK client will ack
>>> > to the user that the entries are written after the first 2 acks.
>>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>>> > If the bookie responds within the timeout, the entries can now be
>>> > dropped from memory, otherwise the write will timeout internally and
>>> > it will get replayed to a new bookie.
>>> >
>>> > In both cases, the amount of memory used in the client will max out at
>>> > "throughput" * "timeout". This can be a large amount of memory and
>>> > easily cause OOM errors.
>>> >
>>> > Part of the problem is that it cannot be solved from outside the BK
>>> > client, since there's no visibility on what entries have 2 or 3 acks
>>> > and therefore it's not possible to apply backpressure. Instead,
>>> > there should be a backpressure mechanism in the BK client itself to
>>> > prevent this kind of issue.
>>> > One possibility there could be to use the same approach as described
>>> > in
>>> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits
>>> ,
>>> > giving a max memory limit per BK client instance and throttling
>>> > everything after the quota is reached.
>>> >
>>> >
>>> > Matteo
>>
>>
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
Using your example, the PendingAddOp should remain active until there are 3 copies of the add entry. The client can ack back once it receives two positive acks from bookies, but it shouldn't declare the add entry done at that point.

There is the case that the third bookie is slow, but it could have failed altogether, in which case the entry needs to be replicated in a new bookie.
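
As a sketch of what I mean (illustrative names only, this is not a patch):
the callback to the application can fire at AQ, while the operation itself
stays pending until WQ is reached or the slow bookie is replaced:

    import io.netty.buffer.ByteBuf;

    // Sketch only: acknowledge the application at AQ, but keep the
    // operation (and its buffer) alive until all WQ writes complete.
    class AddOpSketch {
        private final int writeQuorum = 3;
        private final int ackQuorum = 2;
        private int acksReceived = 0;
        private boolean userNotified = false;
        private ByteBuf toSend;
        private Runnable userCallback;

        synchronized void writeComplete(int rc) {
            if (rc != 0) {
                return; // would trigger ensemble change/replay, elided
            }
            acksReceived++;
            if (acksReceived >= ackQuorum && !userNotified) {
                userNotified = true;
                userCallback.run();   // application sees success at AQ
            }
            if (acksReceived == writeQuorum) {
                toSend.release();     // only now can the payload be dropped
                toSend = null;
            }
        }
    }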

-Flavio 

> On 13 Jan 2021, at 17:28, Enrico Olivelli <eo...@gmail.com> wrote:
> 
> 
> 
> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <fpj@apache.org> wrote:
>> We should work on some kind of back-pressure mechanism for the client, but I am not sure about which kind of support we should provide at BK level
> 
> Is there an issue for this? If there isn't, then perhaps we can start that way.
> 
>> And as soon as the application is notified of the result of the write (success or failure) we are releasing the reference to the payload (as I have shown in this email thread),
>> so in theory the application has full control over the retained memory and it can apply its own memory management mechanisms 
> 
> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies reply, then it is possible that the entry is not going to be written to |WQ| - |AQ| bookies because the entry data might have been reclaimed by the application? The contract as I understand it is that an entry is to be replicated |WQ| ways, even though the application is willing to receive a confirmation after |AQ| bookie responses.
> 
> What am I missing?
> 
> If I am reading the PendingAddOp code correctly, currently we do it this way, say we run with 3-3-2:
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and discard the payload
> 
> so if the first 2 confirmations arrive before we write to the socket (enqueue the payload on Netty channel actually) of the third bookie, we are not sending the entry to the 3rd bookie at all.
> This should not happen because we serialize the operations per-ledger (by sticking them to one thread), so you cannot process the incoming acks from the first two bookies while executing PendingAddOp write loop.
> So we are giving a chance to every bookie to receive the entry, if it is in good health (bookie, network...)
> 
> Enrico  
> 
>  
> 
> -Flavio  
> 
>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eolivelli@gmail.com> wrote:
>> 
>> Flavio
>> 
>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <fpj@apache.org> wrote:
>> I have observed the issue that Matteo describes and I also attributed the problem to the absence of a back pressure mechanism in the client. Issue #2497 was not about that, though. There was some corruption going on that was leading to the server receiving garbage.
>> 
>> Correct, #2497 is not about the topic of this email, I just mentioned it because the discussion started from that comment from Matteo.
>> 
>> We should work on some kind of back-pressure mechanism for the client, but I am not sure about which kind of support we should provide at BK level
>> 
>> Regarding the writer side of this story and memory usage,
>> we are not performing copies of the original payload that the caller is passing, in case of a ByteBuf
>> see PendingAddOp
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>> and here, we simply wrap it in a ByteBufList 
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>> 
>> And as soon as the application is notified of the result of the write (success or failure) we are releasing the reference to the payload (as I have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms
>> 
>> 
>> Enrico
>> 
>> 
>> -Flavio 
>> 
>> > On 8 Jan 2021, at 22:47, Matteo Merli <mmerli@apache.org> wrote:
>> > 
>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolivelli@gmail.com> wrote:
>> >> 
>> >> Hi Matteo,
>> >> in this comment you are talking about an issue you saw when WQ is greater than AQ
>> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>> >> 
>> >> IIUC you are saying that if one bookie is slow the client continues to accumulate references to the entries that still have not received the confirmation from it.
>> >> I think that this is correct.
>> >> 
>> >> Have you seen problems in production related to this scenario?
>> >> Can you tell more about them?
>> > 
>> > Yes, for simplicity, assume e=3, w=3, a=2.
>> > 
>> > If one bookie is slow (not down, just slow), the BK client will ack
>> > to the user that the entries are written after the first 2 acks.
>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>> > If the bookie responds within the timeout, the entries can now be
>> > dropped from memory, otherwise the write will timeout internally and
>> > it will get replayed to a new bookie.
>> > 
>> > In both cases, the amount of memory used in the client will max out at
>> > "throughput" * "timeout". This can be a large amount of memory and
>> > easily cause OOM errors.
>> > 
>> > Part of the problem is that it cannot be solved from outside the BK
>> > client, since there's no visibility on what entries have 2 or 3 acks
>> > and therefore it's not possible to apply backpressure. Instead,
>> > there should be a backpressure mechanism in the BK client itself to
>> > prevent this kind of issue.
>> > One possibility there could be to use the same approach as described
>> > in https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
>> > giving a max memory limit per BK client instance and throttling
>> > everything after the quota is reached.
>> > 
>> > 
>> > Matteo


Re: Unbounded memory usage for WQ > AQ ?

Posted by Enrico Olivelli <eo...@gmail.com>.
On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <fp...@apache.org> wrote:

> We should work on some kind of back-pressure mechanism for the client, but
> I am not sure about which kind of support we should provide at BK level
>
>
> Is there an issue for this? If there isn't, then perhaps we can start that
> way.
>
> And as soon as the application is notified of the result of the write
> (success or failure) we are releasing the reference to the payload (as I
> have shown in this email thread),
> so in theory the application has full control over the retained memory and
> it can apply its own memory management mechanisms
>
>
> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
> reply, then it is possible that the entry is not going to be written to
> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
> application? The contract as I understand it is that an entry is to be
> replicated |WQ| ways, even though the application is willing to receive a
> confirmation after |AQ| bookie responses.
>
> What am I missing?
>

If I am reading the PendingAddOp code correctly, currently we do it this way,
say we run with 3-3-2:
- enqueue the write request to the 3 PerChannelBookieClients
- as soon as we receive 2 confirmations we trigger the callback and discard
the payload

so if the first 2 confirmations arrive before we write to the socket
(enqueue the payload on Netty channel actually) of the third bookie, we are
not sending the entry to the 3rd bookie at all.
This should not happen because we serialize the operations per-ledger (by
sticking them to one thread), so you cannot process the incoming acks from
the first two bookies while executing PendingAddOp write loop.
So we are giving a chance to every bookie to receive the entry, if it is in
good health (bookie, network...)
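
To make the ordering argument concrete, here is a simplified sketch
(hypothetical names; the real client uses an ordered executor keyed by
ledgerId) of why acks cannot interleave with the send loop:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch only: one single-threaded executor per ledger, so the send
    // loop and every ack callback for a ledger run on the same thread.
    class PerLedgerOrderingSketch {
        private final ExecutorService[] threads;

        PerLedgerOrderingSketch(int numThreads) {
            threads = new ExecutorService[numThreads];
            for (int i = 0; i < numThreads; i++) {
                threads[i] = Executors.newSingleThreadExecutor();
            }
        }

        private ExecutorService forLedger(long ledgerId) {
            return threads[(int) (ledgerId % threads.length)];
        }

        void addEntry(long ledgerId, byte[] payload) {
            // The whole send loop runs on the ledger's thread...
            forLedger(ledgerId).execute(() -> {
                for (int bookie = 0; bookie < 3; bookie++) {
                    sendToBookie(bookie, payload);
                }
            });
        }

        void onAck(long ledgerId, int bookie) {
            // ...and so does every ack, so an ack is never processed
            // while the send loop for the same ledger is still running.
            forLedger(ledgerId).execute(() -> { /* count the ack here */ });
        }

        private void sendToBookie(int bookie, byte[] payload) {
            // enqueue on the bookie's channel, elided
        }
    }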

Enrico



>
> -Flavio
>
> On 13 Jan 2021, at 11:30, Enrico Olivelli <eo...@gmail.com> wrote:
>
> Flavio
>
> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <fp...@apache.org> wrote:
>
>> I have observed the issue that Matteo describes and I also attributed the
>> problem to the absence of a back pressure mechanism in the client. Issue
>> #2497 was not about that, though. There was some corruption going on that
>> was leading to the server receiving garbage.
>>
>
> Correct, #2497 is not about the topic of this email, I just mentioned it
> because the discussion started from that comment from Matteo.
>
> We should work on some kind of back-pressure mechanism for the client, but
> I am not sure about which kind of support we should provide at BK level
>
> Regarding the writer side of this story and memory usage,
> we are not performing copies of the original payload that the caller is
> passing, in case of a ByteBuf
> see PendingAddOp
>
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
> and here, we simply wrap it in a ByteBufList
>
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>
> And as soon as the application is notified of the result of the write
> (success or failure) we are releasing the reference to the payload (as I
> have shown in this email thread),
> so in theory the application has full control over the retained memory
> and it can apply its own memory management mechanisms
>
>
> Enrico
>
>
>> -Flavio
>>
>> > On 8 Jan 2021, at 22:47, Matteo Merli <mm...@apache.org> wrote:
>> >
>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com>
>> wrote:
>> >>
>> >> Hi Matteo,
>> >> in this comment you are talking about an issue you saw when WQ is
>> greater than AQ
>> >>
>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>> >>
>> >> IIUC you are saying that if one bookie is slow the client continues to
>> accumulate references to the entries that still have not received the
>> confirmation from it.
>> >> I think that this is correct.
>> >>
>> >> Have you seen problems in production related to this scenario?
>> >> Can you tell more about them?
>> >
>> > Yes, for simplicity, assume e=3, w=3, a=2.
>> >
>> > If one bookie is slow (not down, just slow), the BK client will ack
>> > to the user that the entries are written after the first 2 acks.
>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>> > If the bookie responds within the timeout, the entries can now be
>> > dropped from memory, otherwise the write will timeout internally and
>> > it will get replayed to a new bookie.
>> >
>> > In both cases, the amount of memory used in the client will max out at
>> > "throughput" * "timeout". This can be a large amount of memory and
>> > easily cause OOM errors.
>> >
>> > Part of the problem is that it cannot be solved from outside the BK
>> > client, since there's no visibility on what entries have 2 or 3 acks
>> > and therefore it's not possible to apply backpressure. Instead,
>> > there should be a backpressure mechanism in the BK client itself to
>> > prevent this kind of issue.
>> > One possibility there could be to use the same approach as described
>> > in
>> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits
>> ,
>> > giving a max memory limit per BK client instance and throttling
>> > everything after the quota is reached.
>> >
>> >
>> > Matteo
>>
>>
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
> We should work on some kind of back-pressure mechanism for the client, but I am not sure about which kind of support we should provide at BK level

Is there an issue for this? If there isn't, then perhaps we can start that way.

> And as soon as the application is notified of the result of the write (success or failure) we are releasing the reference to the payload (as I have shown in this email thread),
> so in theory the application has full control over the retained memory and it can apply its own memory management mechanisms 

Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies reply, then it is possible that the entry is not going to be written to |WQ| - |AQ| bookies because the entry data might have been reclaimed by the application? The contract as I understand it is that an entry is to be replicated |WQ| ways, even though the application is willing to receive a confirmation after |AQ| bookie responses.

What am I missing?

-Flavio  

> On 13 Jan 2021, at 11:30, Enrico Olivelli <eo...@gmail.com> wrote:
> 
> Flavio
> 
> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <fpj@apache.org> wrote:
> I have observed the issue that Matteo describes and I also attributed the problem to the absence of a back pressure mechanism in the client. Issue #2497 was not about that, though. There was some corruption going on that was leading to the server receiving garbage.
> 
> Correct, #2497 is not about the topic of this email, I just mentioned it because the discussion started from that comment from Matteo.
> 
> We should work on some kind of back-pressure mechanism for the client, but I am not sure about which kind of support we should provide at BK level
> 
> Regarding the writer side of this story and memory usage,
> we are not performing copies of the original payload that the caller is passing, in case of a ByteBuf
> see PendingAddOp
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
> and here, we simply wrap it in a ByteBufList 
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
> 
> And as soon as the application is notified of the result of the write (success or failure) we are releasing the reference to the payload (as I have shown in this email thread),
> so in theory the application has full control over the retained memory
> and it can apply its own memory management mechanisms
> 
> 
> Enrico
> 
> 
> -Flavio 
> 
> > On 8 Jan 2021, at 22:47, Matteo Merli <mmerli@apache.org> wrote:
> > 
> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolivelli@gmail.com> wrote:
> >> 
> >> Hi Matteo,
> >> in this comment you are talking about an issue you saw when WQ is greater than AQ
> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
> >> 
> >> IIUC you are saying that if one bookie is slow the client continues to accumulate references to the entries that still have not received the confirmation from it.
> >> I think that this is correct.
> >> 
> >> Have you seen problems in production related to this scenario?
> >> Can you tell more about them?
> > 
> > Yes, for simplicity, assume e=3, w=3, a=2.
> > 
> > If one bookie is slow (not down, just slow), the BK client will ack
> > to the user that the entries are written after the first 2 acks.
> > In the meantime, it will keep waiting for the 3rd bookie to respond.
> > If the bookie responds within the timeout, the entries can now be
> > dropped from memory, otherwise the write will timeout internally and
> > it will get replayed to a new bookie.
> > 
> > In both cases, the amount of memory used in the client will max out at
> > "throughput" * "timeout". This can be a large amount of memory and
> > easily cause OOM errors.
> > 
> > Part of the problem is that it cannot be solved from outside the BK
> > client, since there's no visibility on what entries have 2 or 3 acks
> > and therefore it's not possible to apply backpressure. Instead,
> > there should be a backpressure mechanism in the BK client itself to
> > prevent this kind of issue.
> > One possibility there could be to use the same approach as described
> > in https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
> > giving a max memory limit per BK client instance and throttling
> > everything after the quota is reached.
> > 
> > 
> > Matteo
> 


Re: Unbounded memory usage for WQ > AQ ?

Posted by Enrico Olivelli <eo...@gmail.com>.
Flavio

On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <fp...@apache.org> wrote:

> I have observed the issue that Matteo describes and I also attributed the
> problem to the absence of a back pressure mechanism in the client. Issue
> #2497 was not about that, though. There was some corruption going on that
> was leading to the server receiving garbage.
>

Correct, #2497 is not about the topic of this email, I just mentioned it
because the discussion started from that comment from Matteo.

We should work on some kind of back-pressure mechanism for the client, but
I am not sure about which kind of support we should provide at BK level

Regarding the writer side of this story and memory usage,
we are not performing copies of the original payload that the caller is
passing, in case of a ByteBuf
see PendingAddOp
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
and here, we simply wrap it in a ByteBufList
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116

And as soon as the application is notified of the result of the write
(success or failure) we are releasing the reference to the payload (as I
have shown in this email thread),
so in theory the application has full control over the retained memory
and it can apply its own memory management mechanisms
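
As a minimal illustration of the no-copy path (plain Netty here; the
composite buffer plays the role of the ByteBufList mentioned above):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import java.nio.charset.StandardCharsets;

    // Sketch only: the caller's buffer is wrapped, not copied, and
    // releasing the reference after the callback hands the retained
    // memory back to the application.
    public class NoCopySketch {
        public static void main(String[] args) {
            ByteBuf payload = Unpooled.copiedBuffer("entry-data",
                    StandardCharsets.UTF_8);
            ByteBuf header = Unpooled.buffer(16);
            header.writeLong(1L);   // e.g. ledgerId
            header.writeLong(0L);   // e.g. entryId

            // wrappedBuffer composes header and payload without copying
            // any bytes, analogous to wrapping them in a ByteBufList.
            ByteBuf toSend = Unpooled.wrappedBuffer(header, payload);

            // ... write toSend to the bookie channels ...

            // Once the add is confirmed (or fails), drop the reference.
            toSend.release();   // releases header and payload as well
        }
    }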


Enrico


> -Flavio
>
> > On 8 Jan 2021, at 22:47, Matteo Merli <mm...@apache.org> wrote:
> >
> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com>
> wrote:
> >>
> >> Hi Matteo,
> >> in this comment you are talking about an issue you saw when WQ is
> greater than AQ
> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
> >>
> >> IIUC you are saying that if one bookie is slow the client continues to
> accumulate references to the entries that still have not received the
> confirmation from it.
> >> I think that this is correct.
> >>
> >> Have you seen problems in production related to this scenario?
> >> Can you tell more about them?
> >
> > Yes, for simplicity, assume e=3, w=3, a=2.
> >
> > If one bookie is slow (not down, just slow), the BK client will ack
> > to the user that the entries are written after the first 2 acks.
> > In the meantime, it will keep waiting for the 3rd bookie to respond.
> > If the bookie responds within the timeout, the entries can now be
> > dropped from memory, otherwise the write will timeout internally and
> > it will get replayed to a new bookie.
> >
> > In both cases, the amount of memory used in the client will max out at
> > "throughput" * "timeout". This can be a large amount of memory and
> > easily cause OOM errors.
> >
> > Part of the problem is that it cannot be solved from outside the BK
> > client, since there's no visibility on what entries have 2 or 3 acks
> > and therefore it's not possible to apply backpressure. Instead,
> > there should be a backpressure mechanism in the BK client itself to
> > prevent this kind of issue.
> > One possibility there could be to use the same approach as described
> > in
> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits
> ,
> > giving a max memory limit per BK client instance and throttling
> > everything after the quota is reached.
> >
> >
> > Matteo
>
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
I have observed the issue that Matteo describes and I also attributed the problem to the absence of a back pressure mechanism in the client. Issue #2497 was not about that, though. There was some corruption going on that was leading to the server receiving garbage.

-Flavio 

> On 8 Jan 2021, at 22:47, Matteo Merli <mm...@apache.org> wrote:
> 
> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com> wrote:
>> 
>> Hi Matteo,
>> in this comment you are talking about an issue you saw when WQ is greater than AQ
>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>> 
>> IIUC you are saying that if one bookie is slow the client continues to accumulate references to the entries that still have not received the confirmation from it.
>> I think that this is correct.
>> 
>> Have you seen problems in production related to this scenario?
>> Can you tell more about them?
> 
> Yes, for simplicity, assume e=3, w=3, a=2.
> 
> If one bookie is slow (not down, just slow), the BK client will ack
> to the user that the entries are written after the first 2 acks.
> In the meantime, it will keep waiting for the 3rd bookie to respond.
> If the bookie responds within the timeout, the entries can now be
> dropped from memory, otherwise the write will timeout internally and
> it will get replayed to a new bookie.
> 
> In both cases, the amount of memory used in the client will max out at
> "throughput" * "timeout". This can be a large amount of memory and
> easily cause OOM errors.
> 
> Part of the problem is that it cannot be solved from outside the BK
> client, since there's no visibility on what entries have 2 or 3 acks
> and therefore it's not possible to apply backpressure. Instead,
> there should be a backpressure mechanism in the BK client itself to
> prevent this kind of issue.
> One possibility there could be to use the same approach as described
> in https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
> giving a max memory limit per BK client instance and throttling
> everything after the quota is reached.
> 
> 
> Matteo


Re: Unbounded memory usage for WQ > AQ ?

Posted by Enrico Olivelli <eo...@gmail.com>.
On Mon, 11 Jan 2021 at 18:14, Venkateswara Rao Jujjuri <jujjuri@gmail.com> wrote:

> > new data integrity check that Ivan worked on
> The current auditor should take care of this if
> "auditorLedgerVerificationPercentage" is set to 100%.
> I don't think this is the most efficient way, but I believe it does take
> care of filling holes.
>
> On Mon, Jan 11, 2021 at 12:31 AM Jack Vanlightly
> <jv...@splunk.com.invalid> wrote:
>
> > Hi,
> >
> > I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> > that once confirmed, an entry is not replayed to another bookie. This
> > leaves a "hole" as the entry is now replicated only to 2 bookies,
> however,
> > the new data integrity check that Ivan worked on, when run periodically
> > will be able to repair that hole.
>

Matteo
I did some experiments and it turns out that as soon as we are able to
confirm the write to the application we discard the reference to the
payload (and memory can be released)

see here:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L492

so:
- no more retry operations
- memory is released

Am I missing some detail?

Enrico


> >
> > Jack
> >
> > On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri <jujjuri@gmail.com> wrote:
> >
> > > [ External sender. Exercise caution. ]
> > >
> > > On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli <ma...@gmail.com>
> > > wrote:
> > >
> > > > On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> > > > <ju...@gmail.com> wrote:
> > > > >
> > > > > > otherwise the write will timeout internally and it will get
> > replayed
> > > > to a
> > > > > new bookie.
> > > > > If Qa is met and the writes of Qw-Qa fail after we send the success
> > to
> > > > the
> > > > > client, why would the write be replayed on a new bookie?
> > > >
> > > > I think the original intention was to avoid having 1 bookie with a
> > > > "hole" in the entries sequence. If you then lose one of the 2
> bookies,
> > > > it would be difficult to know which entries need to be recovered.
> > > >
> > >
> > > @Matteo Merli <ma...@gmail.com>  I don't believe we retry the
> > write
> > > on a bookie if Qa is satisfied and the write to a bookie timed out.
> > > Once the entry is ack'ed to the client we move the LAC and can't
> > > retroactively change the active segment's ensemble.
> > >
> > > >  will get replayed to a new bookie
> > > This will happen only if we are not able to satisfy Qa and go through
> > > ensemble changes.
> > > We change the ensemble and retry the write only if the bookie write
> > > fails before satisfying Qa.
> > > We have added a new feature called handling "delayed write failure",
> but
> > > that happens only for
> > > new entries not retroactively.
> > >
> > > I may be missing something here, and not understanding your point.
> > >
> > > Thanks,
> > > JV
> > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
> new data integrity check that Ivan worked on
The current auditor should take care of this if
"auditorLedgerVerificationPercentage" is set to 100%.
I don't think this is the most efficient way, but I believe it does take
care of filling holes.

On Mon, Jan 11, 2021 at 12:31 AM Jack Vanlightly
<jv...@splunk.com.invalid> wrote:

> Hi,
>
> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, an entry is not replayed to another bookie. This
> leaves a "hole" as the entry is now replicated only to 2 bookies, however,
> the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.
>
> Jack
>
> On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri <jujjuri@gmail.com> wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli <ma...@gmail.com>
> > wrote:
> >
> > > On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> > > <ju...@gmail.com> wrote:
> > > >
> > > > > otherwise the write will timeout internally and it will get
> replayed
> > > to a
> > > > new bookie.
> > > > If Qa is met and the writes of Qw-Qa fail after we send the success
> to
> > > the
> > > > client, why would the write be replayed on a new bookie?
> > >
> > > I think the original intention was to avoid having 1 bookie with a
> > > "hole" in the entries sequence. If you then lose one of the 2 bookies,
> > > it would be difficult to know which entries need to be recovered.
> > >
> >
> > @Matteo Merli <ma...@gmail.com>  I don't believe we retry the
> write
> > on a bookie if Qa is satisfied and the write to a bookie timed out.
> > Once the entry is ack'ed to the client we move the LAC and can't
> > retroactively change the active segment's ensemble.
> >
> > >  will get replayed to a new bookie
> > This will happen only if we are not able to satisfy Qa and go through
> > ensemble changes.
> > We change the ensemble and retry the write only if the bookie write fails before
> > satisfying Qa.
> > We have added a new feature called handling "delayed write failure", but
> > that happens only for
> > new entries not retroactively.
> >
> > I may be missing something here, and not understanding your point.
> >
> > Thanks,
> > JV
> >
> >
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
> >
>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Unbounded memory usage for WQ > AQ ?

Posted by Andrey Yegorov <an...@gmail.com>.
I remember issues with bookies OOMing/slowing down due to memory pressure
under load.
https://github.com/apache/bookkeeper/issues/1409
https://github.com/apache/bookkeeper/pull/1410

IIRC, there were a couple of problems:

- Slow bookie kept on accepting data that it could not process (netty kept
on reading it and throwing it into the queue).
AQ < WQ means that the client does not wait after AQ acks are received and
keeps on throwing data at the slow bookie, and the ensemble change did not
happen (or did not happen fast enough?)

- client submitted a lot of requests but was too slow to process responses
(network capacity, NIC bandwidth, something else), and the bookie kept
holding the data.

It's been a while and I don't recall all the details but the PR is merged.
Have you played with these settings:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/conf/ServerConfiguration.java
    // backpressure control
    protected static final String MAX_ADDS_IN_PROGRESS_LIMIT =
"maxAddsInProgressLimit";
    protected static final String MAX_READS_IN_PROGRESS_LIMIT =
"maxReadsInProgressLimit";
    protected static final String CLOSE_CHANNEL_ON_RESPONSE_TIMEOUT =
"closeChannelOnResponseTimeout";
    protected static final String WAIT_TIMEOUT_ON_RESPONSE_BACKPRESSURE =
"waitTimeoutOnResponseBackpressureMs";
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/conf/ClientConfiguration.java
    // backpressure configuration
    protected static final String WAIT_TIMEOUT_ON_BACKPRESSURE =
"waitTimeoutOnBackpressureMs";

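For reference, a sketch of how these knobs could be set; it uses the
generic property setters with the keys above so as not to assume
dedicated setter methods, and the values are illustrative only:

    import org.apache.bookkeeper.conf.ClientConfiguration;
    import org.apache.bookkeeper.conf.ServerConfiguration;

    public class BackpressureConfigSketch {
        public static void main(String[] args) {
            ServerConfiguration serverConf = new ServerConfiguration();
            serverConf.setProperty("maxAddsInProgressLimit", 10000);
            serverConf.setProperty("maxReadsInProgressLimit", 10000);
            serverConf.setProperty("closeChannelOnResponseTimeout", true);
            serverConf.setProperty("waitTimeoutOnResponseBackpressureMs", 1000);

            ClientConfiguration clientConf = new ClientConfiguration();
            clientConf.setProperty("waitTimeoutOnBackpressureMs", 500);
        }
    }
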
----------
Andrey Yegorov


On Tue, Jan 19, 2021 at 4:01 AM Flavio Junqueira <fp...@apache.org> wrote:

> Thanks for the feedback, JV, see comments interspersed:
>
> > On 18 Jan 2021, at 22:54, Venkateswara Rao Jujjuri <ju...@gmail.com>
> wrote:
> >
> > On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo <guosijie@gmail.com> wrote:
> >
> >>> One concern for me in this thread is case (3). I'd expect a client that
> >> doesn't crash to not give up, and eventually replace the bookie if it is
> >> unresponsive.
> >>
> >> The current implementation doesn't retry replacing a bookie if an entry
> is
> >> already acknowledged (receiving AQ responses). It relies on inspection
> to
> >> repair the hole.
> >>
> >
> > Exactly. It is not even practical to do this with the current code.
>
> I'd be interested in knowing more precisely what makes you say it.
>
> > Once Qa is met we move the LAC. So:
> >
> > Ensemble    B0    B1    B2    LAC
> > Entry 0     W     W     W     -1
> > Entry 1     W     W     NR    0    (NR: No Response)
> > Entry 2     W     W     NR    1
> >
> > Now B1 failed with a network error where the write fails immediately:
> > Entry 3: when attempted, the write gets an error immediately and
> > triggers an ensemble change.
> > I think this is wrong. Why do we treat errors after Qa differently
> > from errors before reaching Qa?
> > What is stopping the code from waiting to see if Qa is met or not
> > before attempting an ensemble change? @Sijie Guo
>
>
> If I understand what you are saying, then we could end up in the situation
> that we never try to replace the faulty bookie because all entries get AQ
> replies from B0 and B1(you say that B1 failed, but I think you meant B2
> based on the example). There needs to be a trigger for the bookie
> replacement despite entries receiving AQ replies.
>
> Actually, this point makes me wonder whether one alternative to the back
> pressure discussion in this thread would be to replace a bookie based on
> the number of entries queued in the bookie client. If a bookie client is
> accumulating many entries for a bookie compared to others in the ensemble,
> then we could declare it unhealthy and trigger a replacement. Is this a
> suitable approach?
>
>
> > <guosijie@gmail.com>?
> > Ensemble    B0    B10    B2    LAC
> > Entry 3     W     W      NR    2
> >
> > Since we changed the ensemble, if entries 1 and 2 fail with a timeout we
> > can't go back and retroactively change the ensemble
> >
> >
> >
> >> In case (1), the client crashed and the ledger will be recovered by some
> >> reader. For all entries that have been acknowledged, including e, I'd
> >> expect them to be readable from the closed ledger. Each one of these
> >> entries that haven't been written to bookie b should be written there as
> >> part of the recovery process.
> >>
> > I don't think this can ever happen because we have OSE hashed by
> ledgerId.
> > We can't receive and process any responses before we send out to all Qw
> > bookies.
>
> I'm not sure what you're referring to as not being possible. If an entry e
> has been acknowledged to the application, then the id of the last entry once
> the ledger is closed must be greater than or equal to the id of e, right?
> You might be referring to something else?
>
> >
> > Not sure what consensus was reached on Issue #1063
> > <https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c>.
> > If it appears to be a problem let's have a quick call, maybe that is easy
> > to resolve.
> >
>
> As part of this thread, Jack, Enrico and I have set some time this Friday
> to talk. We scheduled for 4pm CET / 10am EST. Would you and Sijie be
> interested in joining? If so, ping me separately so that I can send you the
> zoom link. In general, anyone interested should feel free to join.
>
> -Flavio
>
> >
> >> So the memory pressure is not coming from retrying. It is simply that
> >> the bookkeeper client references the sendBuffers until it receives any
> >> responses from the slow bookie. The bookkeeper client allows enqueuing
> >> addEntry operations because the operations meet the AQ requirements.
> Pulsar
> >> does add a `maxPendingPublishRequestsPerConnection` mechanism to throttle
> >> the add operations. But this won't work as bookkeeper will notify the
> >> callbacks once the operations meet the AQ requirements. But there is a
> huge
> >> amount of memory (throughput * timeout period) referenced by a slow
> bookie.
> >> Hence we have to add a memory-based throttling mechanism as Matteo
> >> suggested.
> >>
> >> If we want to add the retry logic to replace a bookie, this will add
> more
> >> pressure to the memory. But it can still be solved by a memory-based
> >> back-pressure mechanism.
> >>
> >> Thanks,
> >> Sijie
> >>
> >> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira <fp...@apache.org>
> wrote:
> >>
> >>> In the scenario that WQ > AQ, a client acknowledges the add of an
> entry e
> >>> to the application once it receives AQ bookie acks. Say now that the
> >> client
> >>> is not able to write a copy of e to at least one bookie b, it could be
> >>> because:
> >>>
> >>> 1- The client crashed before it is able to do it
> >>> 2- Bookie b crashed
> >>> 3- The client gave up trying
> >>>
> >>> In case (1), the client crashed and the ledger will be recovered by
> some
> >>> reader. For all entries that have been acknowledged, including e, I'd
> >>> expect them to be readable from the closed ledger. Each one of these
> >>> entries that haven't been written to bookie b should be written there
> as
> >>> part of the recovery process.
> >>>
> >>> In case (2), the client is not able to write entry e to the crashed
> >> bookie
> >>> b, so it will replace the bookie and write e to the new bookie. I see
> in
> >>> this discussion that there is an option to disable bookie replacement,
> >> I'm
> >>> ignoring that for this discussion.
> >>>
> >>> In case (3), the client, say, discards the entry after adding it
> >>> successfully to AQ bookies, and gives up at some point because it
> >>> can't reach the
> >>> bookie. The client maybe replaces bookie b or bookie b eventually comes
> >>> back and the client proceeds with the adds. In either case, there is a
> >> hole
> >>> that can only be fixed by inspecting the ledger.
> >>>
> >>> One concern for me in this thread is case (3). I'd expect a client that
> >>> doesn't crash to not give up, and eventually replace the bookie if it
> is
> >>> unresponsive. But, that certainly leads to the memory pressure problem
> >> that
> >>> was also mentioned in the thread, for which one potential direction
> also
> >>> mentioned is to apply back pressure.
> >>>
> >>> Thanks,
> >>> -Flavio
> >>>
> >>>> On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanlightly@splunk.com.INVALID> wrote:
> >>>>
> >>>>> Did you guys see any issues with the ledger auditor?
> >>>>
> >>>>> The active writer can't guarantee writing entries to WQ because it
> >>> can
> >>>>> crash during retrying adding entries to (WQ - AQ) bookies.
> >>>>
> >>>> The need to repair AQ replicated entries is clear and the auditor is
> >> one
> >>>> such strategy. Ivan has also worked on a self-healing bookie strategy
> >>> where
> >>>> each bookie itself is able to detect these holes and is able to obtain
> >>> the
> >>>> missing entries itself. The detection of these holes using this
> >> strategy
> >>> is
> >>>> more efficient as it only requires network calls for the ledger
> >> metadata
> >>>> scanning (to zk) and the missing entry reads (to other bookies). The
> >>>> auditor as I understand it, reads all entries of all ledgers from all
> >>>> bookies (of an entries ensemble) meaning these entries cross the
> >> network.
> >>>> Using the auditor approach is likely to be run less frequently due to
> >> the
> >>>> network cost.
> >>>>
> >>>> I do also wonder if the writer, on performing an ensemble change,
> >> should
 >>>> replay "AQ but not WQ" entries; this would just leave writer failures
> >>>> causing these AQ replicated entries.
> >>>>
> >>>>> Regarding recovery reads, recovery read doesn't need to be
> >>> deterministic.
> >>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> >>>>> either including it or excluding it in the sealed ledger is correct
> >>>>> behavior. The bookkeeper client guarantees that once a ledger is
> >> sealed,
> >>>>> the entries in the sealed ledger can always be read and can be read
> >>>>> consistently.
> >>>>
> >>>>> I am not sure it is a problem unless I misunderstand it.
> >>>>
> >>>> It is true that it doesn't violate any safety property, but it is a
> >>> strange
> >>>> check to me. It looks like an implementation artefact rather than an
> >>>> explicit protocol design choice. But not a huge deal.
> >>>>
> >>>> Jack
> >>>>
> >>>>
> >>>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <gu...@gmail.com> wrote:
> >>>>
> >>>>> [ External sender. Exercise caution. ]
> >>>>>
> >>>>> Sorry for being late in this thread.
> >>>>>
> >>>>> If I understand this correctly, the main topic is about the "hole"
> >> when
> >>> WQ
> >>>>>> AQ.
> >>>>>
> >>>>>> This leaves a "hole" as the entry is now replicated only to 2
> >> bookies,
> >>>>>
> >>>>> We do have one hole when ensemble change is enabled and WQ > AQ. That
> >>> was a
> >>>>> known behavior. But the hole will be repaired by the ledger auditor
> as
> >>> JV
> >>>>> said. Did you guys see any issues with the ledger auditor?
> >>>>>
> >>>>>> I'd think that we guarantee that an entry that is acknowledged is
> >>>>> eventually written WQ ways and that it is observable by readers when
> >> the
> >>>>> ledger is closed.
> >>>>>
> >>>>> To Flavio's question, we don't guarantee (and can't guarantee) that
> >> the
> >>>>> active writer will eventually write the entries to WQ. For the active
> >>>>> writers, we only guarantee entries are written to AQ. The ledger
> >>> auditor is
> >>>>> to ensure all the entries are written to WQ.
> >>>>>
> >>>>> The active writer can't guarantee writing entries to WQ because it
> >>> can
> >>>>> crash during retrying adding entries to (WQ - AQ) bookies.
> >>>>>
> >>>>>> A single successful read is enough. However
> >>>>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> >>>>> explicit NoSuchEntry/Ledger, the read is considered failed and the
> >>> ledger
> >>>>> recovery process ends there. This means that given the responses
> >>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> >>>>> considered successful is non-deterministic.
> >>>>>
> >>>>> Regarding recovery reads, recovery read doesn't need to be
> >>> deterministic.
> >>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> >>>>> either including it or excluding it in the sealed ledger is correct
> >>>>> behavior. The bookkeeper client guarantees that once a ledger is
> >> sealed,
> >>>>> the entries in the sealed ledger can always be read and can be read
> >>>>> consistently.
> >>>>>
> >>>>> I am not sure it is a problem unless I misunderstand it.
> >>>>>
> >>>>> - Sijie
> >>>>>
> >>>>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> >>>>> <jv...@splunk.com.invalid> wrote:
> >>>>>
> >>>>>> Let's set up a call and create any issues from that. I have already
> >>>>> created
> >>>>>> the patches in our (Splunk) fork and it might be easiest or not to
> >> wait
> >>>>>> until we re-sync up with the open source repo. We can include the
> >> fixes
> >>>>> in
> >>>>>> the discussion.
> >>>>>>
> >>>>>> Jack
> >>>>>>
> >>>>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org>
> >>> wrote:
> >>>>>>
> >>>>>>> Hi Jack,
> >>>>>>>
> >>>>>>> Thanks for getting back.
> >>>>>>>
> >>>>>>>> What's the best way to share the TLA+ findings?
> >>>>>>>
> >>>>>>> Would you be able to share the spec? I'm ok with reading TLA+.
> >>>>>>>
> >>>>>>> As for sharing your specific findings, I'd suggest one of the
> >>>>> following:
> >>>>>>>
> >>>>>>> 1- Create an email thread describing the scenarios that trigger a
> >> bug.
> >>>>>>> 2- Create issues, one for each problem you found.
> >>>>>>> 3- Create a discussion on the project Slack, perhaps a channel
> >>> specific
> >>>>>>> for it.
> >>>>>>> 4- Set up a zoom call to present and discuss with the community.
> >>>>>>>
> >>>>>>> Option 2 is ideal from a community perspective, but we can also set
> >> up
> >>>>> a
> >>>>>>> call inviting everyone and create issues out of that discussion. We
> >>> can
> >>>>>> in
> >>>>>>> fact set up a call even if we create the issues ahead of time.
> >>>>>>>
> >>>>>>> Does it make sense?
> >>>>>>>
> >>>>>>> -Flavio
> >>>>>>>
> >>>>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
> >>>>>> .INVALID>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Flavio,
> >>>>>>>>
> >>>>>>>>>> This is an example of a scenario corresponding to what we
> suspect
> >>>>> is
> >>>>>> a
> >>>>>>>> bug introduced earlier, but Enrico is arguing that this is not the
> >>>>>>> intended
> >>>>>>>> behavior, and at this point, I agree.
> >>>>>>>>
> >>>>>>>>>> By the time a successful callback is received, the client might
> >>>>> only
> >>>>>>>> have replicated AQ ways, so the guarantee can only be at that
> point
> >>>>> of
> >>>>>>>> being able to tolerate AQ - 1 crashes. The ledger configuration
> >>>>> states
> >>>>>>> that
> >>>>>>>> the application wants to have WQ copies >> of each entry, though.
> >> I'd
> >>>>>>>> expect a ledger to have WQ copies of each entry up to the final
> >> entry
> >>>>>>>> number when it is closed. Do you see it differently?
> >>>>>>>>
> >>>>>>>> I also agree and was pretty surprised when I discovered the
> >>>>> behaviour.
> >>>>>> It
> >>>>>>>> is not something that users expect and I think we need to correct
> >> it.
> >>>>>> So
> >>>>>>>> I'm with you.
> >>>>>>>>
> >>>>>>>> What's the best way to share the TLA+ findings?
> >>>>>>>>
> >>>>>>>> Jack
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ
> and
> >>>>> the
> >>>>>>>>>> confirm callback to the client is called and the LAC is set to
> >>>>>> 100.Now
> >>>>>>>>> the
> >>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all
> pending
> >>>>>> adds
> >>>>>>>>> that
> >>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
> >>>>> that
> >>>>>>> the
> >>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry
> >> to
> >>>>>>> meet
> >>>>>>>>>> the rep factor of only AQ.
> >>>>>>>>>
> >>>>>>>>> This is an example of a scenario corresponding to what we suspect
> >>>>> is a
> >>>>>>> bug
> >>>>>>>>> introduced earlier, but Enrico is arguing that this is not the
> >>>>>> intended
> >>>>>>>>> behavior, and at this point, I agree.
> >>>>>>>>>
> >>>>>>>>>> This is alluded to in the docs as they state
> >>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
> >>>>>>>>>
> >>>>>>>>> By the time a successful callback is received, the client might
> >> only
> >>>>>>> have
> >>>>>>>>> replicated AQ ways, so the guarantee can only be at that point of
> >>>>>> being
> >>>>>>>>> able to tolerate AQ - 1 crashes. The ledger configuration states
> >>>>> that
> >>>>>>> the
> >>>>>>>>> application wants to have WQ copies of each entry, though. I'd
> >>>>> expect
> >>>>>> a
> >>>>>>>>> ledger to have WQ copies of each entry up to the final entry
> >> number
> >>>>>>> when it
> >>>>>>>>> is closed. Do you see it differently?
> >>>>>>>>>
> >>>>>>>>>> I'd be happy to set up a meeting to discuss the spec and its
> >>>>>> findings.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> That'd be great, I'm interested.
> >>>>>>>>>
> >>>>>>>>> -Flavio
> >>>>>>>>>
> >>>>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
> >> jvanlightly@splunk.com
> >>>>>>> .INVALID>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> No you cannot miss data, if the client is not able to find a
> >>>>> bookie
> >>>>>>> that
> >>>>>>>>>> is
> >>>>>>>>>>> able to answer with the entry it receives an error.
> >>>>>>>>>>
> >>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ
> and
> >>>>> the
> >>>>>>>>>> confirm callback to the client is called and the LAC is set to
> >> 100.
> >>>>>> Now
> >>>>>>>>> the
> >>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all
> pending
> >>>>>> adds
> >>>>>>>>> that
> >>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
> >>>>> that
> >>>>>>> the
> >>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry
> >> to
> >>>>>>> meet
> >>>>>>>>>> the rep factor of only AQ. This is alluded to in the docs as
> they
> >>>>>> state
> >>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
> >>>>>>>>>>
> >>>>>>>>>>> The recovery read fails if it is not possible to read every
> >> entry
> >>>>>> from
> >>>>>>>>> at
> >>>>>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> >>>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
> >>>>> does
> >>>>>>> not
> >>>>>>>>>>> find enough bookies.
> >>>>>>>>>>
> >>>>>>>>>> This is not quite accurate. A single successful read is enough.
> >>>>>> However
> >>>>>>>>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads
> fail
> >>>>>> with
> >>>>>>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and
> >> the
> >>>>>>> ledger
> >>>>>>>>>> recovery process ends there. This means that given the responses
> >>>>>>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the
> read
> >>>>> is
> >>>>>>>>>> considered successful is non-deterministic. If the response from
> >> b1
> >>>>>> is
> >>>>>>>>>> received last, then the read is already considered failed,
> >>>>> otherwise
> >>>>>>> the
> >>>>>>>>>> read succeeds.
> >>>>>>>>>>
> >>>>>>>>>> I have come to the above conclusions through my reverse
> >> engineering
> >>>>>>>>> process
> >>>>>>>>>> for creating the TLA+ specification. I still have pending to
> >>>>>>>>>> reproduce the AQ rep factor behaviour via some tests, but have
> >>>>>> verified
> >>>>>>>>> via
> >>>>>>>>>> tests the conclusion about ledger recovery reads.
> >>>>>>>>>>
> >>>>>>>>>> Note that I have found two defects with the BookKeeper protocol,
> >>>>> most
> >>>>>>>>>> notably data loss due to that fencing does not prevent further
> >>>>>>> successful
> >>>>>>>>>> adds. Currently the specification and associated documentation
> is
> >>>>> on
> >>>>>> a
> >>>>>>>>>> private Splunk repo, but I'd be happy to set up a meeting to
> >>>>> discuss
> >>>>>>> the
> >>>>>>>>>> spec and its findings.
> >>>>>>>>>>
> >>>>>>>>>> Best
> >>>>>>>>>> Jack
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> >>>>>> eolivelli@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Jonathan,
> >>>>>>>>>>>
> >>>>>>>>>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> >>>>>>>>>>> jbellis@apache.org>
> >>>>>>>>>>> ha scritto:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and
> can
> >>>>>>> confirm
> >>>>>>>>>>>> that
> >>>>>>>>>>>>> once confirmed, that an entry is not replayed to another
> >> bookie.
> >>>>>>> This
> >>>>>>>>>>>>> leaves a "hole" as the entry is now replicated only to 2
> >>>>> bookies,
> >>>>>>>>>>>> however,
> >>>>>>>>>>>>> the new data integrity check that Ivan worked on, when run
> >>>>>>>>> periodically
> >>>>>>>>>>>>> will be able to repair that hole.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Can I read from the bookie with a hole in the meantime, and
> >>>>>> silently
> >>>>>>>>> miss
> >>>>>>>>>>>> data that it doesn't know about?
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> No you cannot miss data, if the client is not able to find a
> >>>>> bookie
> >>>>>>>>> that is
> >>>>>>>>>>> able to answer with the entry it receives an error.
> >>>>>>>>>>>
> >>>>>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
> >>>>> value
> >>>>>> is
> >>>>>>>>>>> stored on ledger metadata once the ledger is "closed".
> >>>>>>>>>>>
> >>>>>>>>>>> When the ledger is still open, that is when the writer is
> >> writing
> >>>>> to
> >>>>>>> it,
> >>>>>>>>>>> the reader is allowed to read only up to the LastAddConfirmed
> >>>>> entry
> >>>>>>>>>>> this LAC value is returned to the reader using a piggyback
> >>>>>> mechanism,
> >>>>>>>>>>> without reading from metadata.
> >>>>>>>>>>> The reader cannot read beyond the latest position that has been
> >>>>>>>>> confirmed
> >>>>>>>>>>> to the writer by AQ bookies.
> >>>>>>>>>>>
> >>>>>>>>>>> We have a third case, the 'recovery read'.
> >>>>>>>>>>> A reader starts a "recovery read" when you want to recover a
> >>>>> ledger
> >>>>>>> that
> >>>>>>>>>>> has been abandoned by a dead writer
> >>>>>>>>>>> or when you are a new leader (Pulsar Bundle Owner) or you want
> >> to
> >>>>>>> fence
> >>>>>>>>> out
> >>>>>>>>>>> the old leader.
> >>>>>>>>>>> In this case the reader merges the current status of the ledger
> >> on
> >>>>>> ZK
> >>>>>>>>> with
> >>>>>>>>>>> the result of a scan of the whole ledger.
> >>>>>>>>>>> Basically it reads the ledger from the beginning up to the
> tail,
> >>>>>> until
> >>>>>>>>> it
> >>>>>>>>>>> is able to "read" entries, this way it is setting the 'fenced'
> >>>>> flag
> >>>>>> on
> >>>>>>>>> the
> >>>>>>>>>>> ledger
> >>>>>>>>>>> on every bookie and also it is able to detect the actual tail
> of
> >>>>> the
> >>>>>>>>> ledger
> >>>>>>>>>>> (because the writer died and it was not able to flush metadata
> >> to
> >>>>>> ZK).
> >>>>>>>>>>>
> >>>>>>>>>>> The recovery read fails if it is not possible to read every
> >> entry
> >>>>>> from
> >>>>>>>>> at
> >>>>>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> >>>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
> >>>>> does
> >>>>>>> not
> >>>>>>>>>>> find enough bookies.
> >>>>>>>>>>>
> >>>>>>>>>>> I hope that helps
> >>>>>>>>>>> Enrico
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
>
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
Thanks for the feedback, JV, see comments interspersed:

> On 18 Jan 2021, at 22:54, Venkateswara Rao Jujjuri <ju...@gmail.com> wrote:
> 
> On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo <guosijie@gmail.com <ma...@gmail.com>> wrote:
> 
>>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive.
>> 
>> The current implementation doesn't retry replacing a bookie if an entry is
>> already acknowledged (receiving AQ responses). It relies on inspection to
>> repair the hole.
>> 
> 
> Exactly. It is not even practical to do this with the current code.

I'd be interested in knowing more precisely what makes you say it.

> Once Qa is met we advance the LAC. So:
>
> Ensemble     B0    B1    B2    LAC
> Entry 0      W     W     W     -1
> Entry 1      W     W     NR     0     (NR: No Response)
> Entry 2      W     W     NR     1
>
> Now B1 fails with a network error, so writes to it fail immediately.
> Entry 3: when we attempt to write it, we get an error immediately and
> attempt an ensemble change.
>     I think this is wrong. Why do we treat errors after Qa differently
> from errors before reaching Qa?
>     What is stopping the code from waiting to see whether Qa is met
> before attempting an ensemble change? @Sijie Guo


If I understand what you are saying, then we could end up in a situation where we never try to replace the faulty bookie, because all entries get AQ replies from B0 and B1 (you say that B1 failed, but I think you meant B2 based on the example). There needs to be a trigger for bookie replacement despite entries receiving AQ replies.

Actually, this point makes me wonder whether one alternative to the back pressure discussion in this thread would be to replace a bookie based on the number of entries queued in the bookie client. If a bookie client is accumulating many entries for a bookie compared to others in the ensemble, then we could declare it unhealthy and trigger a replacement. Is this a suitable approach?
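
To sketch what I mean -- the class and method names below are hypothetical, not the actual client code -- the client could track its backlog of outstanding adds per bookie and flag a bookie whose backlog runs far ahead of the least-loaded member of the ensemble:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical monitor: counts outstanding adds per bookie and flags a
// bookie whose backlog is far ahead of the least-loaded bookie in the
// ensemble, even while AQ acks keep arriving from the other bookies.
class PendingBacklogMonitor {
    private final Map<String, AtomicLong> pendingPerBookie = new ConcurrentHashMap<>();
    private final long maxExcess; // allowed gap vs. the least-loaded bookie

    PendingBacklogMonitor(long maxExcess) {
        this.maxExcess = maxExcess;
    }

    void onSend(String bookie) {
        pendingPerBookie.computeIfAbsent(bookie, b -> new AtomicLong()).incrementAndGet();
    }

    void onResponse(String bookie) {
        pendingPerBookie.computeIfAbsent(bookie, b -> new AtomicLong()).decrementAndGet();
    }

    // true if this bookie should be declared unhealthy and replaced
    boolean isUnhealthy(String bookie) {
        long own = pendingPerBookie.getOrDefault(bookie, new AtomicLong()).get();
        long min = pendingPerBookie.values().stream()
                .mapToLong(AtomicLong::get).min().orElse(0L);
        return own - min > maxExcess;
    }
}

The appeal is that it bounds the retained buffers indirectly: a bookie that keeps falling behind gets replaced before its backlog grows without bound, at the cost of having to tune maxExcess.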


> <guosijie@gmail.com> ?
> Ensemble     B0    B10   B2    LAC
> Entry 3      W     W     NR     2
>
> Since we changed the ensemble, if entries 1 and 2 then fail with a
> timeout we can't go back and retroactively change the ensemble.
> 
> 
> 
>> In case (1), the client crashed and the ledger will be recovered by some
>> reader. For all entries that have been acknowledged, including e, I'd
>> expect them to be readable from the closed ledger. Each one of these
>> entries that haven't been written to bookie b should be written there as
>> part of the recovery process.
>> 
> I don't think this can ever happen because we have OSE hashed by ledgerId.
> We can't receive and process any responses before we send out to all Qw
> bookies.

I'm not sure what you're referring to as not being possible. If an entry e has been acknowledged to the application, then the id of the last entry of the closed ledger must be greater than or equal to the id of e, right? You might be referring to something else?

> 
> Not sure what consensus was reached on Issue #1063
> <https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c>.
> If it appears to be a problem let's have a quick call, maybe that is easy
> to resolve.
> 

As part of this thread, Jack, Enrico and I have set aside some time this Friday to talk. We scheduled it for 4pm CET / 10am EST. Would you and Sijie be interested in joining? If so, ping me separately so that I can send you the zoom link. In general, anyone interested should feel free to join.

-Flavio

> 
>> So the memory pressure is not coming from retrying. It is simply that the
>> bookkeeper client references the sendBuffers until it receives any
>> responses from the slow bookie. The bookkeeper client allows enqueuing
>> addEntry operations because the operations meet the AQ requirements. Pulsar
>> does add the `maxPendingPublishRequestsPerConnection` mechanism to throttle
>> the add operations. But this won't work as bookkeeper will notify the
>> callbacks once the operations meet the AQ requirements. But there is a huge
>> amount of memory (throughput * timeout period) referenced by a slow bookie.
>> Hence we have to add a memory-based throttling mechanism as Matteo
>> suggested.
>> 
>> If we want to add the retry logic to replace a bookie, this will add more
>> pressure to the memory. But it can still be solved by a memory-based
>> back-pressure mechanism.
>> 
>> Thanks,
>> Sijie
>> 
>> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira <fp...@apache.org> wrote:
>> 
>>> In the scenario that WQ > AQ, a client acknowledges the add of an entry e
>>> to the application once it receives AQ bookie acks. Say now that the
>> client
>>> is not able to write a copy of e to at least one bookie b, it could be
>>> because:
>>> 
>>> 1- The client crashed before it is able to do it
>>> 2- Bookie b crashed
>>> 3- The client gave up trying
>>> 
>>> In case (1), the client crashed and the ledger will be recovered by some
>>> reader. For all entries that have been acknowledged, including e, I'd
>>> expect them to be readable from the closed ledger. Each one of these
>>> entries that haven't been written to bookie b should be written there as
>>> part of the recovery process.
>>> 
>>> In case (2), the client is not able to write entry e to the crashed
>> bookie
>>> b, so it will replace the bookie and write e to the new bookie. I see in
>>> this discussion that there is an option to disable bookie replacement,
>> I'm
>>> ignoring that for this discussion.
>>> 
>>> In case (3), the client say discards the entry after adding successfully
>>> to AQ bookies, and gives up at some point because it can't reach the
>>> bookie. The client maybe replaces bookie b or bookie b eventually comes
>>> back and the client proceeds with the adds. In either case, there is a
>> hole
>>> that can only be fixed by inspecting the ledger.
>>> 
>>> One concern for me in this thread is case (3). I'd expect a client that
>>> doesn't crash to not give up, and eventually replace the bookie if it is
>>> unresponsive. But, that certainly leads to the memory pressure problem
>> that
>>> was also mentioned in the thread, for which one potential direction also
>>> mentioned is to apply back pressure.
>>> 
>>> Thanks,
>>> -Flavio
>>> 
>>>> On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanlightly@splunk.com
>> .INVALID>
>>> wrote:
>>>> 
>>>>> Did you guys see any issues with the ledger auditor?
>>>> 
>>>>> The active writer can't guarantee it writing entries to WQ because it
>>> can
>>>>> crash during retrying adding entries to (WQ - AQ) bookies.
>>>> 
>>>> The need to repair AQ replicated entries is clear and the auditor is
>> one
>>>> such strategy. Ivan has also worked on a self-healing bookie strategy
>>> where
>>>> each bookie itself is able to detect these holes and is able to obtain
>>> the
>>>> missing entries itself. The detection of these holes using this
>> strategy
>>> is
>>>> more efficient as it only requires network calls for the ledger
>> metadata
>>>> scanning (to zk) and the missing entry reads (to other bookies). The
>>>> auditor as I understand it, reads all entries of all ledgers from all
>>>> bookies (of an entries ensemble) meaning these entries cross the
>> network.
>>>> Using the auditor approach is likely to be run less frequently due to
>> the
>>>> network cost.
>>>> 
>>>> I do also wonder if the writer, on performing an ensemble change,
>> should
>>>> replay "AQ but not WQ" entries, this would just leave writer failures
>>>> causing these AQ replicated entries.
>>>> 
>>>>> Regarding recovery reads, recovery read doesn't need to be
>>> deterministic.
>>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>>> either including it or excluding it in the sealed ledger is correct
>>>>> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
>>>>> the entries in the sealed ledger can always be read and can be read
>>>>> consistently.
>>>> 
>>>>> I am not sure it is a problem unless I misunderstand it.
>>>> 
>>>> It is true that it doesn't violate any safety property, but it is a
>>> strange
>>>> check to me. It looks like an implementation artefact rather than an
>>>> explicit protocol design choice. But not a huge deal.
>>>> 
>>>> Jack
>>>> 
>>>> 
>>>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <gu...@gmail.com> wrote:
>>>> 
>>>>> Sorry for being late in this thread.
>>>>> 
>>>>> If I understand this correctly, the main topic is about the "hole"
>> when
>>> WQ
>>>>>> AQ.
>>>>> 
>>>>>> This leaves a "hole" as the entry is now replicated only to 2
>> bookies,
>>>>> 
>>>>> We do have one hole when ensemble change is enabled and WQ > AQ. That
>>> was a
>>>>> known behavior. But the hole will be repaired by the ledger auditor as
>>> JV
>>>>> said. Did you guys see any issues with the ledger auditor?
>>>>> 
>>>>>> I'd think that we guarantee that an entry that is acknowledged is
>>>>> eventually written WQ ways and that it is observable by readers when
>> the
>>>>> ledger is closed.
>>>>> 
>>>>> To Flavio's question, we don't guarantee (and can't guarantee) that
>> the
>>>>> active writer will eventually write the entries to WQ. For the active
>>>>> writers, we only guarantee entries are written to AQ. The ledger
>>> auditor is
>>>>> to ensure all the entries are written to WQ.
>>>>> 
>>>>> The active writer can't guarantee it writing entries to WQ because it
>>> can
>>>>> crash during retrying adding entries to (WQ - AQ) bookies.
>>>>> 
>>>>>> A single successful read is enough. However
>>>>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and the
>>> ledger
>>>>> recovery process ends there. This means that given the responses
>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>>>> considered successful is non-deterministic.
>>>>> 
>>>>> Regarding recovery reads, recovery read doesn't need to be
>>> deterministic.
>>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>>> either including it or excluding it in the sealed ledger is correct
>>>>> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
>>>>> the entries in the sealed ledger can always be read and can be read
>>>>> consistently.
>>>>> 
>>>>> I am not sure it is a problem unless I misunderstand it.
>>>>> 
>>>>> - Sijie
>>>>> 
>>>>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
>>>>> <jv...@splunk.com.invalid> wrote:
>>>>> 
>>>>>> Let's set up a call and create any issues from that. I have already
>>>>> created
>>>>>> the patches in our (Splunk) fork and it might be easiest or not to
>> wait
>>>>>> until we re-sync up with the open source repo. We can include the
>> fixes
>>>>> in
>>>>>> the discussion.
>>>>>> 
>>>>>> Jack
>>>>>> 
>>>>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org>
>>> wrote:
>>>>>> 
>>>>>>> Hi Jack,
>>>>>>> 
>>>>>>> Thanks for getting back.
>>>>>>> 
>>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>> 
>>>>>>> Would you be able to share the spec? I'm ok with reading TLA+.
>>>>>>> 
>>>>>>> As for sharing your specific findings, I'd suggest one of the
>>>>> following:
>>>>>>> 
>>>>>>> 1- Create an email thread describing the scenarios that trigger a
>> bug.
>>>>>>> 2- Create issues, one for each problem you found.
>>>>>>> 3- Create a discussion on the project Slack, perhaps a channel
>>> specific
>>>>>>> for it.
>>>>>>> 4- Set up a zoom call to present and discuss with the community.
>>>>>>> 
>>>>>>> Option 2 is ideal from a community perspective, but we can also set
>> up
>>>>> a
>>>>>>> call inviting everyone and create issues out of that discussion. We
>>> can
>>>>>> in
>>>>>>> fact set up a call even if we create the issues ahead of time.
>>>>>>> 
>>>>>>> Does it make sense?
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
>>>>>> .INVALID>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Flavio,
>>>>>>>> 
>>>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>> is
>>>>>> a
>>>>>>>> bug introduced earlier, but Enrico is arguing that this is not the
>>>>>>> intended
>>>>>>>> behavior, and at this point, I agree.
>>>>>>>> 
>>>>>>>>>> By the time a successful callback is received, the client might
>>>>> only
>>>>>>>> have replicated AQ ways, so the guarantee can only be at that point
>>>>> of
>>>>>>>> being able to tolerate AQ - 1 crashes. The ledger configuration
>>>>> states
>>>>>>> that
>>>>>>>> the application wants to have WQ copies >> of each entry, though.
>> I'd
>>>>>>>> expect a ledger to have WQ copies of each entry up to the final
>> entry
>>>>>>>> number when it is closed. Do you see it differently?
>>>>>>>> 
>>>>>>>> I also agree and was pretty surprised when I discovered the
>>>>> behaviour.
>>>>>> It
>>>>>>>> is not something that users expect and I think we need to correct
>> it.
>>>>>> So
>>>>>>>> I'm with you.
>>>>>>>> 
>>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>>> 
>>>>>>>> Jack
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>> the
>>>>>>>>>> confirm callback to the client is called and the LAC is set to
>>>>>> 100.Now
>>>>>>>>> the
>>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
>>>>>> adds
>>>>>>>>> that
>>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
>>>>> that
>>>>>>> the
>>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry
>> to
>>>>>>> meet
>>>>>>>>>> the rep factor of only AQ.
>>>>>>>>> 
>>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>> is a
>>>>>>> bug
>>>>>>>>> introduced earlier, but Enrico is arguing that this is not the
>>>>>> intended
>>>>>>>>> behavior, and at this point, I agree.
>>>>>>>>> 
>>>>>>>>>> This is alluded to in the docs as they state
>>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>>>> 
>>>>>>>>> By the time a successful callback is received, the client might
>> only
>>>>>>> have
>>>>>>>>> replicated AQ ways, so the guarantee can only be at that point of
>>>>>> being
>>>>>>>>> able to tolerate AQ - 1 crashes. The ledger configuration states
>>>>> that
>>>>>>> the
>>>>>>>>> application wants to have WQ copies of each entry, though. I'd
>>>>> expect
>>>>>> a
>>>>>>>>> ledger to have WQ copies of each entry up to the final entry
>> number
>>>>>>> when it
>>>>>>>>> is closed. Do you see it differently?
>>>>>>>>> 
>>>>>>>>>> I'd be happy to set up a meeting to discuss the spec and its
>>>>>> findings.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> That'd be great, I'm interested.
>>>>>>>>> 
>>>>>>>>> -Flavio
>>>>>>>>> 
>>>>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
>> jvanlightly@splunk.com
>>>>>>> .INVALID>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> No you cannot miss data, if the client is not able to find a
>>>>> bookie
>>>>>>> that
>>>>>>>>>> is
>>>>>>>>>>> able to answer with the entry it receives an error.
>>>>>>>>>> 
>>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>> the
>>>>>>>>>> confirm callback to the client is called and the LAC is set to
>> 100.
>>>>>> Now
>>>>>>>>> the
>>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
>>>>>> adds
>>>>>>>>> that
>>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
>>>>> that
>>>>>>> the
>>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry
>> to
>>>>>>> meet
>>>>>>>>>> the rep factor of only AQ. This is alluded to in the docs as they
>>>>>> state
>>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>>>>> 
>>>>>>>>>>> The recovery read fails if it is not possible to read every
>> entry
>>>>>> from
>>>>>>>>> at
>>>>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
>>>>> does
>>>>>>> not
>>>>>>>>>>> find enough bookies.
>>>>>>>>>> 
>>>>>>>>>> This is not quite accurate. A single successful read is enough.
>>>>>> However
>>>>>>>>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
>>>>>> with
>>>>>>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and
>> the
>>>>>>> ledger
>>>>>>>>>> recovery process ends there. This means that given the responses
>>>>>>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
>>>>> is
>>>>>>>>>> considered successful is non-deterministic. If the response from
>> b1
>>>>>> is
>>>>>>>>>> received last, then the read is already considered failed,
>>>>> otherwise
>>>>>>> the
>>>>>>>>>> read succeeds.
>>>>>>>>>> 
>>>>>>>>>> I have come to the above conclusions through my reverse
>> engineering
>>>>>>>>> process
>>>>>>>>>> for creating the TLA+ specification. I still have pending to
>>>>>>>>>> reproduce the AQ rep factor behaviour via some tests, but have
>>>>>> verified
>>>>>>>>> via
>>>>>>>>>> tests the conclusion about ledger recovery reads.
>>>>>>>>>> 
>>>>>>>>>> Note that I have found two defects with the BookKeeper protocol,
>>>>> most
>>>>>>>>>> notably data loss due to that fencing does not prevent further
>>>>>>> successful
>>>>>>>>>> adds. Currently the specification and associated documentation is
>>>>> on
>>>>>> a
>>>>>>>>>> private Splunk repo, but I'd be happy to set up a meeting to
>>>>> discuss
>>>>>>> the
>>>>>>>>>> spec and its findings.
>>>>>>>>>> 
>>>>>>>>>> Best
>>>>>>>>>> Jack
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
>>>>>> eolivelli@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Jonathan,
>>>>>>>>>>> 
>>>>>>>>>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
>>>>>>>>>>> jbellis@apache.org>
>>>>>>>>>>> ha scritto:
>>>>>>>>>>> 
>>>>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
>>>>>>> confirm
>>>>>>>>>>>> that
>>>>>>>>>>>>> once confirmed, that an entry is not replayed to another
>> bookie.
>>>>>>> This
>>>>>>>>>>>>> leaves a "hole" as the entry is now replicated only to 2
>>>>> bookies,
>>>>>>>>>>>> however,
>>>>>>>>>>>>> the new data integrity check that Ivan worked on, when run
>>>>>>>>> periodically
>>>>>>>>>>>>> will be able to repair that hole.
>>>>>>>>>>>> 
>>>>>>>>>>>> Can I read from the bookie with a hole in the meantime, and
>>>>>> silently
>>>>>>>>> miss
>>>>>>>>>>>> data that it doesn't know about?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> No you cannot miss data, if the client is not able to find a
>>>>> bookie
>>>>>>>>> that is
>>>>>>>>>>> able to answer with the entry it receives an error.
>>>>>>>>>>> 
>>>>>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
>>>>> value
>>>>>> is
>>>>>>>>>>> stored on ledger metadata once the ledger is "closed".
>>>>>>>>>>> 
>>>>>>>>>>> When the ledger is still open, that is when the writer is
>> writing
>>>>> to
>>>>>>> it,
>>>>>>>>>>> the reader is allowed to read only up to the LastAddConfirmed
>>>>> entry
>>>>>>>>>>> this LAC value is returned to the reader using a piggyback
>>>>>> mechanism,
>>>>>>>>>>> without reading from metadata.
>>>>>>>>>>> The reader cannot read beyond the latest position that has been
>>>>>>>>> confirmed
>>>>>>>>>>> to the writer by AQ bookies.
>>>>>>>>>>> 
>>>>>>>>>>> We have a third case, the 'recovery read'.
>>>>>>>>>>> A reader starts a "recovery read" when you want to recover a
>>>>> ledger
>>>>>>> that
>>>>>>>>>>> has been abandoned by a dead writer
>>>>>>>>>>> or when you are a new leader (Pulsar Bundle Owner) or you want
>> to
>>>>>>> fence
>>>>>>>>> out
>>>>>>>>>>> the old leader.
>>>>>>>>>>> In this case the reader merges the current status of the ledger
>> on
>>>>>> ZK
>>>>>>>>> with
>>>>>>>>>>> the result of a scan of the whole ledger.
>>>>>>>>>>> Basically it reads the ledger from the beginning up to the tail,
>>>>>> until
>>>>>>>>> it
>>>>>>>>>>> is able to "read" entries, this way it is setting the 'fenced'
>>>>> flag
>>>>>> on
>>>>>>>>> the
>>>>>>>>>>> ledger
>>>>>>>>>>> on every bookie and also it is able to detect the actual tail of
>>>>> the
>>>>>>>>> ledger
>>>>>>>>>>> (because the writer died and it was not able to flush metadata
>> to
>>>>>> ZK).
>>>>>>>>>>> 
>>>>>>>>>>> The recovery read fails if it is not possible to read every
>> entry
>>>>>> from
>>>>>>>>> at
>>>>>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
>>>>> does
>>>>>>> not
>>>>>>>>>>> find enough bookies.
>>>>>>>>>>> 
>>>>>>>>>>> I hope that helps
>>>>>>>>>>> Enrico
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi


Re: Unbounded memory usage for WQ > AQ ?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo <gu...@gmail.com> wrote:

> > One concern for me in this thread is case (3). I'd expect a client that
> doesn't crash to not give up, and eventually replace the bookie if it is
> unresponsive.
>
> The current implementation doesn't retry replacing a bookie if an entry is
> already acknowledged (receiving AQ responses). It relies on inspection to
> repair the hole.
>

Exactly. It is not even practical to do this with the current code.
Once Qa is met we advance the LAC. So:

Ensemble     B0    B1    B2    LAC
Entry 0      W     W     W     -1
Entry 1      W     W     NR     0     (NR: No Response)
Entry 2      W     W     NR     1

Now B1 fails with a network error, so writes to it fail immediately.
Entry 3: when we attempt to write it, we get an error immediately and
attempt an ensemble change.
    I think this is wrong. Why do we treat errors after Qa differently
from errors before reaching Qa?
    What is stopping the code from waiting to see whether Qa is met
before attempting an ensemble change? @Sijie Guo
<gu...@gmail.com> ?

Ensemble     B0    B10   B2    LAC
Entry 3      W     W     NR     2

Since we changed the ensemble, if entries 1 and 2 then fail with a
timeout we can't go back and retroactively change the ensemble.
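
To make the question concrete, here is a minimal sketch of the alternative I have in mind: record the per-entry outcomes and only trigger an ensemble change once Qa is genuinely unreachable. The names are hypothetical, this is not the actual PendingAddOp code:

import java.util.BitSet;

// Hypothetical per-entry bookkeeping: instead of reacting to the first
// write error after Qa is already met, record successes/failures and
// only ask for an ensemble change once Qa cannot be reached anymore.
class AddEntryState {
    private final int writeQuorum; // Qw
    private final int ackQuorum;   // Qa
    private final BitSet acked;
    private final BitSet failed;

    AddEntryState(int writeQuorum, int ackQuorum) {
        this.writeQuorum = writeQuorum;
        this.ackQuorum = ackQuorum;
        this.acked = new BitSet(writeQuorum);
        this.failed = new BitSet(writeQuorum);
    }

    synchronized void onAck(int bookieIndex)   { acked.set(bookieIndex); }
    synchronized void onError(int bookieIndex) { failed.set(bookieIndex); }

    // Qa is met: safe to advance the LAC and complete the client callback.
    synchronized boolean ackQuorumMet() {
        return acked.cardinality() >= ackQuorum;
    }

    // Qa becomes unreachable only once more than Qw - Qa bookies have
    // failed; until then an error alone need not force an ensemble change.
    synchronized boolean needsEnsembleChange() {
        return failed.cardinality() > writeQuorum - ackQuorum;
    }
}

With a rule like needsEnsembleChange(), the immediate error on entry 3 would not force an ensemble change while B0 and B1 can still deliver Qa.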



> In case (1), the client crashed and the ledger will be recovered by some
> reader. For all entries that have been acknowledged, including e, I'd
> expect them to be readable from the closed ledger. Each one of these
> entries that haven't been written to bookie b should be written there as
> part of the recovery process.

I don't think this can ever happen because we have OSE hashed by ledgerId.
We can't receive and process any responses before we send out to all Qw
bookies.

Not sure what consensus was reached on Issue #1063
<https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c>.
If it appears to be a problem let's have a quick call, maybe that is easy
to resolve.

Thanks,
JV


> So the memory pressure is not coming from retrying. It is simply that the
> bookkeeper client references the sendBuffers until it receives any
> responses from the slow bookie. The bookkeeper client allows enqueuing
> addEntry operations because the operations meet the AQ requirements. Pulsar
> does add the `maxPendingPublishRequestsPerConnection` mechanism to throttle
> the add operations. But this won't work as bookkeeper will notify the
> callbacks once the operations meet the AQ requirements. But there is a huge
> amount of memory (throughput * timeout period) referenced by a slow bookie.
> Hence we have to add a memory-based throttling mechanism as Matteo
> suggested.
>
> If we want to add the retry logic to replace a bookie, this will add more
> pressure to the memory. But it can still be solved by a memory-based
> back-pressure mechanism.
>
> Thanks,
> Sijie
>
> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira <fp...@apache.org> wrote:
>
> > In the scenario that WQ > AQ, a client acknowledges the add of an entry e
> > to the application once it receives AQ bookie acks. Say now that the
> client
> > is not able to write a copy of e to at least one bookie b, it could be
> > because:
> >
> > 1- The client crashed before it is able to do it
> > 2- Bookie b crashed
> > 3- The client gave up trying
> >
> > In case (1), the client crashed and the ledger will be recovered by some
> > reader. For all entries that have been acknowledged, including e, I'd
> > expect them to be readable from the closed ledger. Each one of these
> > entries that haven't been written to bookie b should be written there as
> > part of the recovery process.
> >
> > In case (2), the client is not able to write entry e to the crashed
> bookie
> > b, so it will replace the bookie and write e to the new bookie. I see in
> > this discussion that there is an option to disable bookie replacement,
> I'm
> > ignoring that for this discussion.
> >
> > In case (3), the client say discards the entry after adding successfully
> > to AQ bookies, and gives up at some point because it can't reach the
> > bookie. The client maybe replaces bookie b or bookie b eventually comes
> > back and the client proceeds with the adds. In either case, there is a
> hole
> > that can only be fixed by inspecting the ledger.
> >
> > One concern for me in this thread is case (3). I'd expect a client that
> > doesn't crash to not give up, and eventually replace the bookie if it is
> > unresponsive. But, that certainly leads to the memory pressure problem
> that
> > was also mentioned in the thread, for which one potential direction also
> > mentioned is to apply back pressure.
> >
> > Thanks,
> > -Flavio
> >
> > > On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanlightly@splunk.com
> .INVALID>
> > wrote:
> > >
> > >> Did you guys see any issues with the ledger auditor?
> > >
> > >> The active writer can't guarantee it writing entries to WQ because it
> > can
> > >> crash during retrying adding entries to (WQ - AQ) bookies.
> > >
> > > The need to repair AQ replicated entries is clear and the auditor is
> one
> > > such strategy. Ivan has also worked on a self-healing bookie strategy
> > where
> > > each bookie itself is able to detect these holes and is able to obtain
> > the
> > > missing entries itself. The detection of these holes using this
> strategy
> > is
> > > more efficient as it only requires network calls for the ledger
> metadata
> > > scanning (to zk) and the missing entry reads (to other bookies). The
> > > auditor as I understand it, reads all entries of all ledgers from all
> > > bookies (of an entries ensemble) meaning these entries cross the
> network.
> > > Using the auditor approach is likely to be run less frequently due to
> the
> > > network cost.
> > >
> > > I do also wonder if the writer, on performing an ensemble change,
> should
> > > replay "AQ but not WQ" entries, this would just leave writer failures
> > > causing these AQ replicated entries.
> > >
> > >> Regarding recovery reads, recovery read doesn't need to be
> > deterministic.
> > >> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > >> either including it or excluding it in the sealed ledger is correct
> > >> behavior. The bookkeeper client guarantees that once a ledger is
> sealed,
> > >> the entries in the sealed ledger can always be read and can be read
> > >> consistently.
> > >
> > >> I am not sure it is a problem unless I misunderstand it.
> > >
> > > It is true that it doesn't violate any safety property, but it is a
> > strange
> > > check to me. It looks like an implementation artefact rather than an
> > > explicit protocol design choice. But not a huge deal.
> > >
> > > Jack
> > >
> > >
> > > On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <gu...@gmail.com> wrote:
> > >
> > >> Sorry for being late in this thread.
> > >>
> > >> If I understand this correctly, the main topic is about the "hole"
> when
> > WQ
> > >>> AQ.
> > >>
> > >>> This leaves a "hole" as the entry is now replicated only to 2
> bookies,
> > >>
> > >> We do have one hole when ensemble change is enabled and WQ > AQ. That
> > was a
> > >> known behavior. But the hole will be repaired by the ledger auditor as
> > JV
> > >> said. Did you guys see any issues with the ledger auditor?
> > >>
> > >>> I'd think that we guarantee that an entry that is acknowledged is
> > >> eventually written WQ ways and that it is observable by readers when
> the
> > >> ledger is closed.
> > >>
> > >> To Flavio's question, we don't guarantee (and can't guarantee) that
> the
> > >> active writer will eventually write the entries to WQ. For the active
> > >> writers, we only guarantee entries are written to AQ. The ledger
> > auditor is
> > >> to ensure all the entries are written to WQ.
> > >>
> > >> The active writer can't guarantee it writing entries to WQ because it
> > can
> > >> crash during retrying adding entries to (WQ - AQ) bookies.
> > >>
> > >>> A single successful read is enough. However
> > >> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> > >> explicit NoSuchEntry/Ledger, the read is considered failed and the
> > ledger
> > >> recovery process ends there. This means that given the responses
> > >> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > >> considered successful is non-deterministic.
> > >>
> > >> Regarding recovery reads, recovery read doesn't need to be
> > deterministic.
> > >> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > >> either including it or excluding it in the sealed ledger is correct
> > >> behavior. The bookkeeper client guarantees that once a ledger is
> sealed,
> > >> the entries in the sealed ledger can always be read and can be read
> > >> consistently.
> > >>
> > >> I am not sure it is a problem unless I misunderstand it.
> > >>
> > >> - Sijie
> > >>
> > >> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> > >> <jv...@splunk.com.invalid> wrote:
> > >>
> > >>> Let's set up a call and create any issues from that. I have already
> > >> created
> > >>> the patches in our (Splunk) fork and it might be easiest or not to
> wait
> > >>> until we re-sync up with the open source repo. We can include the
> fixes
> > >> in
> > >>> the discussion.
> > >>>
> > >>> Jack
> > >>>
> > >>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org>
> > wrote:
> > >>>
> > >>>> Hi Jack,
> > >>>>
> > >>>> Thanks for getting back.
> > >>>>
> > >>>>> What's the best way to share the TLA+ findings?
> > >>>>
> > >>>> Would you be able to share the spec? I'm ok with reading TLA+.
> > >>>>
> > >>>> As for sharing your specific findings, I'd suggest one of the
> > >> following:
> > >>>>
> > >>>> 1- Create an email thread describing the scenarios that trigger a
> bug.
> > >>>> 2- Create issues, one for each problem you found.
> > >>>> 3- Create a discussion on the project Slack, perhaps a channel
> > specific
> > >>>> for it.
> > >>>> 4- Set up a zoom call to present and discuss with the community.
> > >>>>
> > >>>> Option 2 is ideal from a community perspective, but we can also set
> up
> > >> a
> > >>>> call inviting everyone and create issues out of that discussion. We
> > can
> > >>> in
> > >>>> fact set up a call even if we create the issues ahead of time.
> > >>>>
> > >>>> Does it make sense?
> > >>>>
> > >>>> -Flavio
> > >>>>
> > >>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
> > >>> .INVALID>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hi Flavio,
> > >>>>>
> > >>>>>>> This is an example of a scenario corresponding to what we suspect
> > >> is
> > >>> a
> > >>>>> bug introduced earlier, but Enrico is arguing that this is not the
> > >>>> intended
> > >>>>> behavior, and at this point, I agree.
> > >>>>>
> > >>>>>>> By the time a successful callback is received, the client might
> > >> only
> > >>>>> have replicated AQ ways, so the guarantee can only be at that point
> > >> of
> > >>>>> being able to tolerate AQ - 1 crashes. The ledger configuration
> > >> states
> > >>>> that
> > >>>>> the application wants to have WQ copies >> of each entry, though.
> I'd
> > >>>>> expect a ledger to have WQ copies of each entry up to the final
> entry
> > >>>>> number when it is closed. Do you see it differently?
> > >>>>>
> > >>>>> I also agree and was pretty surprised when I discovered the
> > >> behaviour.
> > >>> It
> > >>>>> is not something that users expect and I think we need to correct
> it.
> > >>> So
> > >>>>> I'm with you.
> > >>>>>
> > >>>>> What's the best way to share the TLA+ findings?
> > >>>>>
> > >>>>> Jack
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
> > >>> wrote:
> > >>>>>
> > >>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > >> the
> > >>>>>>> confirm callback to the client is called and the LAC is set to
> > >>> 100.Now
> > >>>>>> the
> > >>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
> > >>> adds
> > >>>>>> that
> > >>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
> > >> that
> > >>>> the
> > >>>>>>> entry e100 is not replayed to another bookie, causing this entry
> to
> > >>>> meet
> > >>>>>>> the rep factor of only AQ.
> > >>>>>>
> > >>>>>> This is an example of a scenario corresponding to what we suspect
> > >> is a
> > >>>> bug
> > >>>>>> introduced earlier, but Enrico is arguing that this is not the
> > >>> intended
> > >>>>>> behavior, and at this point, I agree.
> > >>>>>>
> > >>>>>>> This is alluded to in the docs as they state
> > >>>>>>> that AQ is also the minimum guaranteed replication factor.
> > >>>>>>
> > >>>>>> By the time a successful callback is received, the client might
> only
> > >>>> have
> > >>>>>> replicated AQ ways, so the guarantee can only be at that point of
> > >>> being
> > >>>>>> able to tolerate AQ - 1 crashes. The ledger configuration states
> > >> that
> > >>>> the
> > >>>>>> application wants to have WQ copies of each entry, though. I'd
> > >> expect
> > >>> a
> > >>>>>> ledger to have WQ copies of each entry up to the final entry
> number
> > >>>> when it
> > >>>>>> is closed. Do you see it differently?
> > >>>>>>
> > >>>>>>> I'd be happy to set up a meeting to discuss the spec and its
> > >>> findings.
> > >>>>>>
> > >>>>>>
> > >>>>>> That'd be great, I'm interested.
> > >>>>>>
> > >>>>>> -Flavio
> > >>>>>>
> > >>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
> jvanlightly@splunk.com
> > >>>> .INVALID>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> No you cannot miss data, if the client is not able to find a
> > >> bookie
> > >>>> that
> > >>>>>>> is
> > >>>>>>>> able to answer with the entry it receives an error.
> > >>>>>>>
> > >>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > >> the
> > >>>>>>> confirm callback to the client is called and the LAC is set to
> 100.
> > >>> Now
> > >>>>>> the
> > >>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
> > >>> adds
> > >>>>>> that
> > >>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
> > >> that
> > >>>> the
> > >>>>>>> entry e100 is not replayed to another bookie, causing this entry
> to
> > >>>> meet
> > >>>>>>> the rep factor of only AQ. This is alluded to in the docs as they
> > >>> state
> > >>>>>>> that AQ is also the minimum guaranteed replication factor.
> > >>>>>>>
> > >>>>>>>> The recovery read fails if it is not possible to read every
> entry
> > >>> from
> > >>>>>> at
> > >>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > >>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
> > >> does
> > >>>> not
> > >>>>>>>> find enough bookies.
> > >>>>>>>
> > >>>>>>> This is not quite accurate. A single successful read is enough.
> > >>> However
> > >>>>>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
> > >>> with
> > >>>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and
> the
> > >>>> ledger
> > >>>>>>> recovery process ends there. This means that given the responses
> > >>>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
> > >> is
> > >>>>>>> considered successful is non-deterministic. If the response from
> b1
> > >>> is
> > >>>>>>> received last, then the read is already considered failed,
> > >> otherwise
> > >>>> the
> > >>>>>>> read succeeds.
> > >>>>>>>
> > >>>>>>> I have come to the above conclusions through my reverse
> engineering
> > >>>>>> process
> > >>>>>>> for creating the TLA+ specification. I still have pending to
> > >>>>>>> reproduce the AQ rep factor behaviour via some tests, but have
> > >>> verified
> > >>>>>> via
> > >>>>>>> tests the conclusion about ledger recovery reads.
> > >>>>>>>
> > >>>>>>> Note that I have found two defects with the BookKeeper protocol,
> > >> most
> > >>>>>>> notably data loss due to that fencing does not prevent further
> > >>>> successful
> > >>>>>>> adds. Currently the specification and associated documentation is
> > >> on
> > >>> a
> > >>>>>>> private Splunk repo, but I'd be happy to set up a meeting to
> > >> discuss
> > >>>> the
> > >>>>>>> spec and its findings.
> > >>>>>>>
> > >>>>>>> Best
> > >>>>>>> Jack
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> > >>> eolivelli@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Jonathan,
> > >>>>>>>>
> > >>>>>>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> > >>>>>>>> jbellis@apache.org>
> > >>>>>>>> ha scritto:
> > >>>>>>>>
> > >>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > >>>> confirm
> > >>>>>>>>> that
> > >>>>>>>>>> once confirmed, that an entry is not replayed to another
> bookie.
> > >>>> This
> > >>>>>>>>>> leaves a "hole" as the entry is now replicated only to 2
> > >> bookies,
> > >>>>>>>>> however,
> > >>>>>>>>>> the new data integrity check that Ivan worked on, when run
> > >>>>>> periodically
> > >>>>>>>>>> will be able to repair that hole.
> > >>>>>>>>>
> > >>>>>>>>> Can I read from the bookie with a hole in the meantime, and
> > >>> silently
> > >>>>>> miss
> > >>>>>>>>> data that it doesn't know about?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> No you cannot miss data, if the client is not able to find a
> > >> bookie
> > >>>>>> that is
> > >>>>>>>> able to answer with the entry it receives an error.
> > >>>>>>>>
> > >>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
> > >> value
> > >>> is
> > >>>>>>>> stored on ledger metadata once the ledger is "closed".
> > >>>>>>>>
> > >>>>>>>> When the ledger is still open, that is when the writer is
> writing
> > >> to
> > >>>> it,
> > >>>>>>>> the reader is allowed to read only up to the LastAddConfirmed
> > >> entry
> > >>>>>>>> this LAC value is returned to the reader using a piggyback
> > >>> mechanism,
> > >>>>>>>> without reading from metadata.
> > >>>>>>>> The reader cannot read beyond the latest position that has been
> > >>>>>> confirmed
> > >>>>>>>> to the writer by AQ bookies.
> > >>>>>>>>
> > >>>>>>>> We have a third case, the 'recovery read'.
> > >>>>>>>> A reader starts a "recovery read" when you want to recover a
> > >> ledger
> > >>>> that
> > >>>>>>>> has been abandoned by a dead writer
> > >>>>>>>> or when you are a new leader (Pulsar Bundle Owner) or you want
> to
> > >>>> fence
> > >>>>>> out
> > >>>>>>>> the old leader.
> > >>>>>>>> In this case the reader merges the current status of the ledger
> on
> > >>> ZK
> > >>>>>> with
> > >>>>>>>> the result of a scan of the whole ledger.
> > >>>>>>>> Basically it reads the ledger from the beginning up to the tail,
> > >>> until
> > >>>>>> it
> > >>>>>>>> is able to "read" entries, this way it is setting the 'fenced'
> > >> flag
> > >>> on
> > >>>>>> the
> > >>>>>>>> ledger
> > >>>>>>>> on every bookie and also it is able to detect the actual tail of
> > >> the
> > >>>>>> ledger
> > >>>>>>>> (because the writer died and it was not able to flush metadata
> to
> > >>> ZK).
> > >>>>>>>>
> > >>>>>>>> The recovery read fails if it is not possible to read every
> entry
> > >>> from
> > >>>>>> at
> > >>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > >>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
> > >> does
> > >>>> not
> > >>>>>>>> find enough bookies.
> > >>>>>>>>
> > >>>>>>>> I hope that helps
> > >>>>>>>> Enrico
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> >
> >
>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
Thanks for the feedback, Sijie:

> On 18 Jan 2021, at 19:53, Sijie Guo <gu...@gmail.com> wrote:
> 
>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive.
>> 
> The current implementation doesn't retry replacing a bookie if an entry is
> already acknowledged (receiving AQ responses). It relies on inspection to
> repair the hole.

Ok, that's good information, let me think a bit more about it. I'd like to understand why we can't keep a pending add op reference until the entry is fully replicated, which, as I understand it, would enable bookie replacement for entries with fewer than WQ acks.
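
As a strawman -- all names here are hypothetical, this is not the current PendingAddOp -- I mean something like this:

// Hypothetical pending-add op that completes the client callback at AQ
// but retains the payload until all WQ bookies hold a copy, so a failed
// bookie can be replaced and the entry replayed to it.
class RetainedAddOp {
    private final int writeQuorum;
    private final int ackQuorum;
    private final Runnable clientCallback;
    private int positiveAcks;
    private byte[] payload; // retained so a replacement bookie can be fed

    RetainedAddOp(int writeQuorum, int ackQuorum, byte[] payload, Runnable clientCallback) {
        this.writeQuorum = writeQuorum;
        this.ackQuorum = ackQuorum;
        this.payload = payload;
        this.clientCallback = clientCallback;
    }

    synchronized void onBookieResponse(boolean success) {
        if (success) {
            positiveAcks++;
            if (positiveAcks == ackQuorum) {
                clientCallback.run(); // acknowledge the application at AQ
            }
            if (positiveAcks == writeQuorum) {
                payload = null;       // WQ copies exist: safe to release
            }
        } else {
            // replace the failed bookie and replay the retained payload;
            // this is exactly why the buffer cannot be recycled at AQ
            replayToReplacementBookie(payload);
        }
    }

    private void replayToReplacementBookie(byte[] data) {
        // stub: a real client would pick a new bookie and resend `data`,
        // counting its eventual ack toward positiveAcks
    }
}

The obvious cost is the memory pressure you describe below: the buffers now live until WQ responses arrive, which is why this only seems viable together with a memory-based back-pressure mechanism.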

> 
> So the memory pressure is not coming from retrying. It is simply that the
> bookkeeper client references the sendBuffers until it receives any
> responses from the slow bookie. The bookkeeper client allows enqueuing
> addEntry operations because the operations meet the AQ requirements.

I see, the entries queuing in the bookie client are inducing memory pressure in the presence of a slow bookie.

> Pulsar
> does add the `maxPendingPublishRequestsPerConnection` mechanism to throttle
> the add operations. But this won't work as bookkeeper will notify the
> callbacks once the operations meet the AQ requirements. But there is a huge
> amount of memory (throughput * timeout period) referenced by a slow bookie.
> Hence we have to add a memory-based throttling mechanism as Matteo
> suggested.

Thanks for pointing to Matteo's message; I reviewed it again. He actually makes two observations:

1- It is difficult to throttle from outside the bookkeeper client because the application using it does not have visibility into what has been fully replicated. A back pressure mechanism internal to the bookie (and possibly configurable) might be necessary.
2- There is some Pulsar work (PIP-74) that could be leveraged to throttle from outside the bookkeeper client based on memory limits.
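
To make (1) concrete, here is a minimal sketch of a memory-bounded gate in the spirit of the PIP-74 memory limits; the class and its names are mine, not an existing API:

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Minimal memory-based back-pressure gate: addEntry blocks while the
// bytes referenced by not-yet-fully-replicated entries exceed a limit.
// Assumes a single entry is never larger than limitBytes.
class AddMemoryLimiter {
    private final long limitBytes;
    private long usedBytes;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();

    AddMemoryLimiter(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    void acquire(long bytes) throws InterruptedException {
        lock.lock();
        try {
            while (usedBytes + bytes > limitBytes) {
                released.await(); // back pressure: block the producer
            }
            usedBytes += bytes;
        } finally {
            lock.unlock();
        }
    }

    void release(long bytes) {
        lock.lock();
        try {
            usedBytes -= bytes;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }
}

Here addEntry would call acquire(entrySize) before sending and release(entrySize) only once all WQ responses have arrived, not at AQ, so a slow bookie eventually blocks producers instead of exhausting memory.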

> 
> If we want to add the retry logic to replace a bookie, this will add more
> pressure to the memory. But it can still be solved by a memory-based
> back-pressure mechanism.
> 

I don't know much about (2), but I'll have a look to form an opinion. At a high level, it seems reasonable. We might still want to consider doing (1) to simplify the job of the application.

-Flavio  

> Thanks,
> Sijie


Re: Unbounded memory usage for WQ > AQ ?

Posted by Sijie Guo <gu...@gmail.com>.
> One concern for me in this thread is case (3). I'd expect a client that
> doesn't crash to not give up, and eventually replace the bookie if it is
> unresponsive.

The current implementation doesn't retry replacing a bookie if an entry is
already acknowledged (i.e., it has received AQ responses). It relies on
inspection to repair the hole.
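
In other words, the replay decision on ensemble change is roughly the following (illustrative pseudocode with a made-up helper, not the actual PendingAddOp code):

    // Only pending adds above the LAC are replayed to the replacement bookie.
    for (PendingAddOp op : pendingAddOps) {
        if (op.entryId > lastAddConfirmed) {
            resendToReplacementBookie(op); // made-up helper
        }
        // Ops at or below the LAC already received AQ acks and are skipped,
        // which is what leaves them with fewer than WQ copies.
    }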

So the memory pressure is not coming from retrying. It is simply that the
bookkeeper client references the sendBuffers until it receives a response
from the slow bookie. The bookkeeper client allows enqueuing addEntry
operations because the operations meet the AQ requirements. Pulsar does
add a `maxPendingPublishdRequestsPerConnection` mechanism to throttle the
add operations, but this won't work here because bookkeeper notifies the
callbacks once the operations meet the AQ requirements, leaving a huge
amount of memory (throughput * timeout period) referenced by a slow
bookie. Hence we have to add a memory-based throttling mechanism as
Matteo suggested.
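
To put a number on (throughput * timeout period), with purely illustrative figures: at 100 MB/s of add throughput and a 30-second add timeout, a single slow bookie can keep roughly 100 MB/s * 30 s = 3 GB of send buffers referenced before those adds time out.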

If we want to add the retry logic to replace a bookie, this will add more
memory pressure. But it can still be solved by a memory-based
back-pressure mechanism.

Thanks,
Sijie


Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
In the scenario that WQ > AQ, a client acknowledges the add of an entry e to the application once it receives AQ bookie acks. Say now that the client is not able to write a copy of e to at least one bookie b; this could be because:

1- The client crashed before it is able to do it
2- Bookie b crashed
3- The client gave up trying

In case (1), the client crashed and the ledger will be recovered by some reader. For all entries that have been acknowledged, including e, I'd expect them to be readable from the closed ledger. Each one of these entries that haven't been written to bookie b should be written there as part of the recovery process.   

In case (2), the client is not able to write entry e to the crashed bookie b, so it will replace the bookie and write e to the new bookie. I see in this discussion that there is an option to disable bookie replacement, I'm ignoring that for this discussion.

In case (3), the client, say, discards the entry after adding it successfully to AQ bookies, and gives up at some point because it can't reach the bookie. The client maybe replaces bookie b, or bookie b eventually comes back and the client proceeds with the adds. In either case, there is a hole that can only be fixed by inspecting the ledger.
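
To illustrate the kind of inspection needed, a repair pass would do something like the following (rough sketch; writeQuorumFor(), entryExistsOn() and replicateEntryTo() are hypothetical helpers):

    // Walk the ledger and re-replicate entries that are missing from any
    // bookie of their write quorum, which is the kind of repair the auditor
    // or a bookie-side integrity check performs.
    for (long entryId = 0; entryId <= lastAddConfirmed; entryId++) {
        for (BookieId bookie : writeQuorumFor(metadata, entryId)) {
            if (!entryExistsOn(bookie, ledgerId, entryId)) {
                replicateEntryTo(bookie, ledgerId, entryId); // copy from a bookie that has it
            }
        }
    }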

One concern for me in this thread is case (3). I'd expect a client that doesn't crash to not give up, and eventually replace the bookie if it is unresponsive. But that certainly leads to the memory pressure problem that was also mentioned in the thread, for which one potential direction is to apply back pressure.

Thanks,
-Flavio



Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
> Based on my understanding, Jack wants the behavior on recovering an entry
> that does not have enough replicas to be deterministic, i.e., if the entry
> does not have enough replicas, we can always exclude the entry. Jack, did I
> get you right?

I see. If that's the case, then part of the problem here is that there is uncertainty in some cases about the state of an entry: we might not be able to read enough copies to determine for sure whether an entry has been sufficiently replicated, and consequently whether it might have been acknowledged to the application. To be conservative and avoid violating safety, we make such an entry part of the closed ledger.
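
For concreteness, my understanding of the per-entry decision during recovery is roughly the following (sketch with made-up names, not the actual recovery read code):

    // Invoked as each bookie in the write quorum responds (sketch only).
    void onRecoveryReadResponse(long entryId, Response r) {
        if (r.isSuccess()) positives++;
        else if (r.isNoSuchEntryOrLedger()) explicitNegatives++;

        if (explicitNegatives >= (WQ - AQ) + 1) {
            sealLedgerAt(entryId - 1);        // entry treated as missing, recovery stops
        } else if (positives >= 1) {
            includeEntryAndReadNext(entryId); // one readable copy keeps the entry
        }
        // Whichever threshold is crossed first wins, hence the arrival-order
        // non-determinism discussed earlier in the thread.
    }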

-Flavio

> On 18 Jan 2021, at 19:55, Sijie Guo <gu...@gmail.com> wrote:
> 
> On Mon, Jan 18, 2021 at 10:18 AM Flavio Junqueira <fpj@apache.org> wrote:
> 
>>>>> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
>>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>>> either including it or excluding it in the sealed ledger is correct
>>>>> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
>>>>> the entries in the sealed ledger can always be read and can be read
>>>>> consistently.
>>>> 
>>>>> I am not sure it is a problem unless I misunderstand it.
>>>> 
>>>> It is true that it doesn't violate any safety property, but it is a
>> strange
>>>> check to me. It looks like an implementation artefact rather than an
>>>> explicit protocol design choice. But not a huge deal.
>>>> 
>>> 
>>> It was discussed in the earlier days as a design choice for this
>> protocol.
>>> 
>>> If we want to make it deterministic, we might need to consider what is
>> the
>>> performance penalty.
>> 
>> 
>> I don't quite follow the observation about a deterministic check. The
>> example that Sijie provides makes sense to me if I understand it right as
>> the entry does not have enough replicas, so it can go either way when the
>> ledger is close. But, that assumes that no later entry has been
>> acknowledged, otherwise we have a data loss if we skip the entry and
>> consequently have a problem with the protocol. If anyone cares to explain
>> the deterministic check referred to, I'd appreciate.
>> 
> 
> Based on my understanding, Jack wants the behavior on recovering an entry
> that does not have enough replicas to be deterministic, i.e., if the entry
> does not have enough replicas, we can always exclude the entry. Jack, did I
> get you right?
> 
> - Sijie
> 
> 
>> 
>> -Flavio
>> 
>>> On 18 Jan 2021, at 18:51, Sijie Guo <gu...@gmail.com> wrote:
>>> 
>>> Jack,
>>> 
>>> Thank you for your replies! That's good, as there are no violations of
>>> the bookkeeper protocol.
>>> 
>>> Comments inline.
>>> 
>>> On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
>>> <jvanlightly@splunk.com.invalid> wrote:
>>> 
>>>>> Did you guys see any issues with the ledger auditor?
>>>> 
>>>>> The active writer can't guarantee it writing entries to WQ because it
>> can
>>>>> crash during retrying adding entries to (WQ - AQ) bookies.
>>>> 
>>>> The need to repair AQ replicated entries is clear and the auditor is one
>>>> such strategy. Ivan has also worked on a self-healing bookie strategy
>> where
>>>> each bookie itself is able to detect these holes and is able to obtain
>> the
>>>> missing entries itself. The detection of these holes using this
>> strategy is
>>>> more efficient as it only requires network calls for the ledger metadata
>>>> scanning (to zk) and the missing entry reads (to other bookies). The
>>>> auditor as I understand it, reads all entries of all ledgers from all
>>>> bookies (of an entries ensemble) meaning these entries cross the
>> network.
>>>> Using the auditor approach is likely to be run less frequently due to
>> the
>>>> network cost.
>>>> 
>>> 
>>> Agreed on the efficiency part. I think the Salesforce team introduced the
>>> Disk Scrubber to solve that problem already unless I confused something
>>> there.
>>> 
>>> +JV Jujjuri <vjujjuri@salesforce.com> can chime in on this part.
>>> 
>>> 
>>>> 
>>>> I do also wonder if the writer, on performing an ensemble change, should
>>>> replay "AQ but not WQ" entries, this would just leave writer failures
>>>> causing these AQ replicated entries.
>>>> 
>>> 
>>> The writer can do that. But there are no guarantees there. You still
>>> need a mechanism to repair the under-replicated entries.
>>> It will also make the writer much more complicated to maintain.
>>> 
>>> 
>>>> 
>>>>> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
>>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>>> either including it or excluding it in the sealed ledger is correct
>>>>> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
>>>>> the entries in the sealed ledger can always be read and can be read
>>>>> consistently.
>>>> 
>>>>> I am not sure it is a problem unless I misunderstand it.
>>>> 
>>>> It is true that it doesn't violate any safety property, but it is a
>> strange
>>>> check to me. It looks like an implementation artefact rather than an
>>>> explicit protocol design choice. But not a huge deal.
>>>> 
>>> 
>>> It was discussed in the earlier days as a design choice for this
>> protocol.
>>> 
>>> If we want to make it deterministic, we might need to consider what is
>> the
>>> performance penalty.
>>> 
>>> 
>>>> 
>>>> Jack
>>>> 
>>>> 
>>>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosijie@gmail.com> wrote:
>>>> 
>>>>> [ External sender. Exercise caution. ]
>>>>> 
>>>>> Sorry for being late in this thread.
>>>>> 
>>>>> If I understand this correctly, the main topic is about the "hole" when
>>>> WQ
>>>>>> AQ.
>>>>> 
>>>>>> This leaves a "hole" as the entry is now replicated only to 2 bookies,
>>>>> 
>>>>> We do have one hole when ensemble change is enabled and WQ > AQ. That
>>>> was a
>>>>> known behavior. But the hole will be repaired by the ledger auditor as
>> JV
>>>>> said. Did you guys see any issues with the ledger auditor?
>>>>> 
>>>>>> I'd think that we guarantee that an entry that is acknowledged is
>>>>> eventually written WQ ways and that it is observable by readers when
>> the
>>>>> ledger is closed.
>>>>> 
>>>>> To Flavio's question, we don't guarantee (and can't guarantee) that the
>>>>> active writer will eventually write the entries to WQ. For the active
>>>>> writers, we only guarantee entries are written to AQ. The ledger
>> auditor
>>>> is
>>>>> to ensure all the entries are written to WQ.
>>>>> 
>>>>> The active writer can't guarantee it writing entries to WQ because it
>> can
>>>>> crash during retrying adding entries to (WQ - AQ) bookies.
>>>>> 
>>>>>> A single successful read is enough. However
>>>>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and the
>> ledger
>>>>> recovery process ends there. This means that given the responses
>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>>>> considered successful is non-deterministic.
>>>>> 
>>>>> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
>>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>>> either including it or excluding it in the sealed ledger is correct
>>>>> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
>>>>> the entries in the sealed ledger can always be read and can be read
>>>>> consistently.
>>>>> 
>>>>> I am not sure it is a problem unless I misunderstand it.
>>>>> 
>>>>> - Sijie
>>>>> 
>>>>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
>>>>> <jvanlightly@splunk.com.invalid> wrote:
>>>>> 
>>>>>> Let's set up a call and create any issues from that. I have already
>>>>> created
>>>>>> the patches in our (Splunk) fork and it might be easiest or not to
>> wait
>>>>>> until we re-sync up with the open source repo. We can include the
>> fixes
>>>>> in
>>>>>> the discussion.
>>>>>> 
>>>>>> Jack
>>>>>> 
>>>>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fpj@apache.org> wrote:
>>>>>> 
>>>>>>> [ External sender. Exercise caution. ]
>>>>>>> 
>>>>>>> Hi Jack,
>>>>>>> 
>>>>>>> Thanks for getting back.
>>>>>>> 
>>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>> 
>>>>>>> Would you be able to share the spec? I'm ok with reading TLA+.
>>>>>>> 
>>>>>>> As for sharing your specific findings, I'd suggest one of the
>>>>> following:
>>>>>>> 
>>>>>>> 1- Create an email thread describing the scenarios that trigger a
>>>> bug.
>>>>>>> 2- Create issues, one for each problem you found.
>>>>>>> 3- Create a discussion on the project Slack, perhaps a channel
>>>> specific
>>>>>>> for it.
>>>>>>> 4- Set up a zoom call to present and discuss with the community.
>>>>>>> 
>>>>>>> Option 2 is ideal from a community perspective, but we can also set
>>>> up
>>>>> a
>>>>>>> call inviting everyone and create issues out of that discussion. We
>>>> can
>>>>>> in
>>>>>>> fact set up a call even if we create the issues ahead of time.
>>>>>>> 
>>>>>>> Does it make sense?
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
>>>>>> .INVALID>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Flavio,
>>>>>>>> 
>>>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>> is
>>>>>> a
>>>>>>>> bug introduced earlier, but Enrico is arguing that this is not the
>>>>>>> intended
>>>>>>>> behavior, and at this point, I agree.
>>>>>>>> 
>>>>>>>>>> By the time a successful callback is received, the client might
>>>>> only
>>>>>>>> have replicated AQ ways, so the guarantee can only be at that point
>>>>> of
>>>>>>>> being able to tolerate AQ - 1 crashes. The ledger configuration
>>>>> states
>>>>>>> that
>>>>>>>> the application wants to have WQ copies >> of each entry, though.
>>>> I'd
>>>>>>>> expect a ledger to have WQ copies of each entry up to the final
>>>> entry
>>>>>>>> number when it is closed. Do you see it differently?
>>>>>>>> 
>>>>>>>> I also agree and was pretty surprised when I discovered the
>>>>> behaviour.
>>>>>> It
>>>>>>>> is not something that users expect and I think we need to correct
>>>> it.
>>>>>> So
>>>>>>>> I'm with you.
>>>>>>>> 
>>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>>> 
>>>>>>>> Jack
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> [ External sender. Exercise caution. ]
>>>>>>>>> 
>>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>> the
>>>>>>>>>> confirm callback to the client is called and the LAC is set to
>>>>>> 100. Now
>>>>>>>>> the
>>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
>>>>>> adds
>>>>>>>>> that
>>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
>>>>> that
>>>>>>> the
>>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry
>>>> to
>>>>>>> meet
>>>>>>>>>> the rep factor of only AQ.
>>>>>>>>> 
>>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>> is a
>>>>>>> bug
>>>>>>>>> introduced earlier, but Enrico is arguing that this is not the
>>>>>> intended
>>>>>>>>> behavior, and at this point, I agree.
>>>>>>>>> 
>>>>>>>>>> This is alluded to in the docs as they state
>>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>>>> 
>>>>>>>>> By the time a successful callback is received, the client might
>>>> only
>>>>>>> have
>>>>>>>>> replicated AQ ways, so the guarantee can only be at that point of
>>>>>> being
>>>>>>>>> able to tolerate AQ - 1 crashes. The ledger configuration states
>>>>> that
>>>>>>> the
>>>>>>>>> application wants to have WQ copies of each entry, though. I'd
>>>>> expect
>>>>>> a
>>>>>>>>> ledger to have WQ copies of each entry up to the final entry
>>>> number
>>>>>>> when it
>>>>>>>>> is closed. Do you see it differently?
>>>>>>>>> 
>>>>>>>>>> I'd be happy to set up a meeting to discuss the spec and its
>>>>>> findings.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> That'd be great, I'm interested.
>>>>>>>>> 
>>>>>>>>> -Flavio
>>>>>>>>> 
>>>>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
>>>> jvanlightly@splunk.com
>>>>>>> .INVALID>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> No you cannot miss data, if the client is not able to find a
>>>>> bookie
>>>>>>> that
>>>>>>>>>> is
>>>>>>>>>>> able to answer with the entry it receives an error.
>>>>>>>>>> 
>>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>> the
>>>>>>>>>> confirm callback to the client is called and the LAC is set to
>>>> 100.
>>>>>> Now
>>>>>>>>> the
>>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
>>>>>> adds
>>>>>>>>> that
>>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
>>>>> that
>>>>>>> the
>>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry
>>>> to
>>>>>>> meet
>>>>>>>>>> the rep factor of only AQ. This is alluded to in the docs as they
>>>>>> state
>>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>>>>> 
>>>>>>>>>>> The recovery read fails if it is not possible to read every
>>>> entry
>>>>>> from
>>>>>>>>> at
>>>>>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
>>>>> does
>>>>>>> not
>>>>>>>>>>> find enough bookies.
>>>>>>>>>> 
>>>>>>>>>> This is not quite accurate. A single successful read is enough.
>>>>>> However
>>>>>>>>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
>>>>>> with
>>>>>>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and
>>>> the
>>>>>>> ledger
>>>>>>>>>> recovery process ends there. This means that given the responses
>>>>>>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
>>>>> is
>>>>>>>>>> considered successful is non-deterministic. If the response from
>>>> b1
>>>>>> is
>>>>>>>>>> received last, then the read is already considered failed,
>>>>> otherwise
>>>>>>> the
>>>>>>>>>> read succeeds.
>>>>>>>>>> 
>>>>>>>>>> I have come to the above conclusions through my reverse engineering
>>>>>>>>>> process for creating the TLA+ specification. I have yet to reproduce
>>>>>>>>>> the AQ rep factor behaviour via some tests, but I have verified via
>>>>>>>>>> tests the conclusion about ledger recovery reads.
>>>>>>>>>> 
>>>>>>>>>> Note that I have found two defects with the BookKeeper protocol, most
>>>>>>>>>> notably data loss due to the fact that fencing does not prevent
>>>>>>>>>> further successful adds. Currently the specification and associated
>>>>>>>>>> documentation are on a private Splunk repo, but I'd be happy to set up
>>>>>>>>>> a meeting to discuss the spec and its findings.
>>>>>>>>>> 
>>>>>>>>>> Best
>>>>>>>>>> Jack
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
>>>>>> eolivelli@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> [ External sender. Exercise caution. ]
>>>>>>>>>>> 
>>>>>>>>>>> Jonathan,
>>>>>>>>>>> 
>>>>>>>>>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
>>>>>>>>>>> jbellis@apache.org>
>>>>>>>>>>> ha scritto:
>>>>>>>>>>> 
>>>>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
>>>>>>> confirm
>>>>>>>>>>>> that
>>>>>>>>>>>>> once confirmed, that an entry is not replayed to another
>>>> bookie.
>>>>>>> This
>>>>>>>>>>>>> leaves a "hole" as the entry is now replicated only to 2
>>>>> bookies,
>>>>>>>>>>>> however,
>>>>>>>>>>>>> the new data integrity check that Ivan worked on, when run
>>>>>>>>> periodically
>>>>>>>>>>>>> will be able to repair that hole.
>>>>>>>>>>>> 
>>>>>>>>>>>> Can I read from the bookie with a hole in the meantime, and
>>>>>> silently
>>>>>>>>> miss
>>>>>>>>>>>> data that it doesn't know about?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> No you cannot miss data, if the client is not able to find a
>>>>> bookie
>>>>>>>>> that is
>>>>>>>>>>> able to answer with the entry it receives an error.
>>>>>>>>>>> 
>>>>>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
>>>>> value
>>>>>> is
>>>>>>>>>>> stored on ledger metadata once the ledger is "closed".
>>>>>>>>>>> 
>>>>>>>>>>> When the ledger is still open, that is when the writer is
>>>> writing
>>>>> to
>>>>>>> it,
>>>>>>>>>>> the reader is allowed to read only up to the LastAddConfirmed
>>>>> entry
>>>>>>>>>>> this LAC value is returned to the reader using a piggyback
>>>>>> mechanism,
>>>>>>>>>>> without reading from metadata.
>>>>>>>>>>> The reader cannot read beyond the latest position that has been
>>>>>>>>> confirmed
>>>>>>>>>>> to the writer by AQ bookies.
>>>>>>>>>>> 
>>>>>>>>>>> We have a third case, the 'recovery read'.
>>>>>>>>>>> A reader starts a "recovery read" when you want to recover a
>>>>> ledger
>>>>>>> that
>>>>>>>>>>> has been abandoned by a dead writer
>>>>>>>>>>> or when you are a new leader (Pulsar Bundle Owner) or you want
>>>> to
>>>>>>> fence
>>>>>>>>> out
>>>>>>>>>>> the old leader.
>>>>>>>>>>> In this case the reader merges the current status of the ledger
>>>> on
>>>>>> ZK
>>>>>>>>> with
>>>>>>>>>>> the result of a scan of the whole ledger.
>>>>>>>>>>> Basically it reads the ledger from the beginning up to the tail,
>>>>>> until
>>>>>>>>> it
>>>>>>>>>>> is able to "read" entries, this way it is setting the 'fenced'
>>>>> flag
>>>>>> on
>>>>>>>>> the
>>>>>>>>>>> ledger
>>>>>>>>>>> on every bookie and also it is able to detect the actual tail of
>>>>> the
>>>>>>>>> ledger
>>>>>>>>>>> (because the writer died and it was not able to flush metadata
>>>> to
>>>>>> ZK).
>>>>>>>>>>> 
>>>>>>>>>>> The recovery read fails if it is not possible to read every
>>>> entry
>>>>>> from
>>>>>>>>> at
>>>>>>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
>>>>> does
>>>>>>> not
>>>>>>>>>>> find enough bookies.
>>>>>>>>>>> 
>>>>>>>>>>> I hope that helps
>>>>>>>>>>> Enrico


Re: Unbounded memory usage for WQ > AQ ?

Posted by Sijie Guo <gu...@gmail.com>.
On Mon, Jan 18, 2021 at 10:18 AM Flavio Junqueira <fp...@apache.org> wrote:

> >>> Regarding recovery reads, recovery read doesn't need to be deterministic.
> >>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> >>> either including it or excluding it in the sealed ledger is correct
> >>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
> >>> the entries in the sealed ledger can always be read and can be read
> >>> consistently.
> >>
> >>> I am not sure it is a problem unless I misunderstand it.
> >>
> >> It is true that it doesn't violate any safety property, but it is a
> >> strange check to me. It looks like an implementation artefact rather than
> >> an explicit protocol design choice. But not a huge deal.
> >>
> >
> > It was discussed in the earlier days as a design choice for this
> > protocol.
> >
> > If we want to make it deterministic, we might need to consider what the
> > performance penalty would be.
>
>
> I don't quite follow the observation about a deterministic check. The
> example that Sijie provides makes sense to me if I understand it right as
> the entry does not have enough replicas, so it can go either way when the
> ledger is close. But, that assumes that no later entry has been
> acknowledged, otherwise we have a data loss if we skip the entry and
> consequently have a problem with the protocol. If anyone cares to explain
> the deterministic check referred to, I'd appreciate.
>

Based on my understanding, Jack wants the behavior when recovering an entry
that does not have enough replicas to be deterministic, i.e. if the entry
does not have enough replicas, we always exclude it. Jack, did I get you
right?
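
To make the non-determinism concrete, here is a minimal sketch in Java
(invented names, not the actual BookKeeper client code) of how a recovery
read tallies responses when WQ > AQ. A deterministic variant would have to
wait for all WQ responses before deciding, which is the performance penalty
mentioned above.

    import java.util.List;

    public class RecoveryReadSketch {
        enum Response { SUCCESS, NO_SUCH_ENTRY, OTHER_ERROR }

        // First threshold crossed wins, so the outcome depends on the order
        // in which the bookie responses happen to arrive.
        static Boolean tally(List<Response> inArrivalOrder, int wq, int aq) {
            int positives = 0, explicitMisses = 0;
            for (Response r : inArrivalOrder) {
                if (r == Response.SUCCESS) positives++;
                if (r == Response.NO_SUCH_ENTRY) explicitMisses++;
                if (positives >= 1) return true;                   // keep entry
                if (explicitMisses >= (wq - aq) + 1) return false; // drop entry
            }
            return null; // undecided
        }

        public static void main(String[] args) {
            // WQ=3, AQ=2, so (WQ-AQ)+1 = 2 explicit misses end the read.
            // Same three responses, different arrival order, different outcome:
            System.out.println(tally(List.of(Response.SUCCESS,
                    Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY), 3, 2)); // true
            System.out.println(tally(List.of(Response.NO_SUCH_ENTRY,
                    Response.NO_SUCH_ENTRY, Response.SUCCESS), 3, 2)); // false
        }
    }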

- Sijie


>
> -Flavio
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
>>> Regarding recovery reads, recovery read doesn't need to be deterministic.
>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>> either including it or excluding it in the sealed ledger is correct
>>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>>> the entries in the sealed ledger can always be read and can be read
>>> consistently.
>> 
>>> I am not sure it is a problem unless I misunderstand it.
>> 
>> It is true that it doesn't violate any safety property, but it is a strange
>> check to me. It looks like an implementation artefact rather than an
>> explicit protocol design choice. But not a huge deal.
>> 
> 
> It was discussed in the earlier days as a design choice for this protocol.
> 
> If we want to make it deterministic, we might need to consider what the
> performance penalty would be.


I don't quite follow the observation about a deterministic check. The example
that Sijie provides makes sense to me, if I understand it right: the entry
does not have enough replicas, so it can go either way when the ledger is
closed. But that assumes that no later entry has been acknowledged; otherwise
we have data loss if we skip the entry, and consequently a problem with the
protocol. If anyone cares to explain the deterministic check referred to, I'd
appreciate it.
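
To spell out the assumption behind "it can go either way", here is a one-line
formalization in Java (my names, not anything from the client code):

    // Sketch: excluding an entry from the sealed ledger is safe only if no
    // entry with an equal or higher id was ever acknowledged to the writer;
    // otherwise sealing below it silently drops acknowledged data.
    static boolean safeToExclude(long entryId, long highestAckedEntryId) {
        return highestAckedEntryId < entryId;
    }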

-Flavio 



Re: Unbounded memory usage for WQ > AQ ?

Posted by Sijie Guo <gu...@gmail.com>.
Jack,

Thank you for your replies! That's good, as there are no violations of the
BookKeeper protocol.

Comments inline.

On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
<jv...@splunk.com.invalid> wrote:

> > Did you guys see any issues with the ledger auditor?
>
> > The active writer can't guarantee it writing entries to WQ because it can
> > crash during retrying adding entries to (WQ - AQ) bookies.
>
> The need to repair AQ-replicated entries is clear, and the auditor is one
> such strategy. Ivan has also worked on a self-healing bookie strategy where
> each bookie detects these holes and obtains the missing entries itself.
> Detecting the holes this way is more efficient, as it only requires network
> calls for the ledger metadata scanning (to zk) and for the missing entry
> reads (to other bookies). The auditor, as I understand it, reads all entries
> of all ledgers from all bookies (of an entry's ensemble), meaning these
> entries cross the network. The auditor approach is therefore likely to be
> run less frequently due to the network cost.
>

Agreed on the efficiency part. I think the Salesforce team introduced the
Disk Scrubber to solve that problem already, unless I have confused something
there.

+JV Jujjuri <vj...@salesforce.com> can chime in on this part.


>
> I do also wonder if the writer, on performing an ensemble change, should
> replay "AQ but not WQ" entries; this would leave writer failures as the only
> remaining cause of these AQ-replicated entries.
>

The writer can do that. But there are no guarantees there. You still need a
mechanism to repair the under-replicated entries.
It will also make the writer much more complicated to maintain.


>
> > Regarding recovery reads, recovery read doesn't need to be deterministic.
> > For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > either including it or excluding it in the sealed ledger is correct
> > behavior. The bookkeeper client guarantees that once a ledger is sealed,
> > the entries in the sealed ledger can always be read and can be read
> > consistently.
>
> > I am not sure it is a problem unless I misunderstand it.
>
> It is true that it doesn't violate any safety property, but it is a strange
> check to me. It looks like an implementation artefact rather than an
> explicit protocol design choice. But not a huge deal.
>

It was discussed in the earlier days as a design choice for this protocol.

If we want to make it deterministic, we might need to consider what the
performance penalty would be.


>
> Jack
>
>
> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <gu...@gmail.com> wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > Sorry for being late in this thread.
> >
> > If I understand this correctly, the main topic is about the "hole" when
> WQ
> > > AQ.
> >
> > > This leaves a "hole" as the entry is now replicated only to 2 bookies,
> >
> > We do have one hole when ensemble change is enabled and WQ > AQ. That
> was a
> > known behavior. But the hole will be repaired by the ledger auditor as JV
> > said. Did you guys see any issues with the ledger auditor?
> >
> > > I'd think that we guarantee that an entry that is acknowledged is
> > eventually written WQ ways and that it is observable by readers when the
> > ledger is closed.
> >
> > To Flavio's question, we don't guarantee (and can't guarantee) that the
> > active writer will eventually write the entries to WQ. For the active
> > writers, we only guarantee entries are written to AQ. The ledger auditor
> is
> > to ensure all the entries are written to WQ.
> >
> > The active writer can't guarantee it writing entries to WQ because it can
> > crash during retrying adding entries to (WQ - AQ) bookies.
> >
> > >  A single successful read is enough. However
> > there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> > explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> > recovery process ends there. This means that given the responses
> > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > considered successful is non-deterministic.
> >
> > Regarding recovery reads, recovery read doesn't need to be deterministic.
> > For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > either including it or excluding it in the sealed ledger is correct
> > behavior. The bookkeeper client guarantees that once a ledger is sealed,
> > the entries in the sealed ledger can always be read and can be read
> > consistently.
> >
> > I am not sure it is a problem unless I misunderstand it.
> >
> > - Sijie
> >
> > On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> > <jv...@splunk.com.invalid> wrote:
> >
> > > Let's set up a call and create any issues from that. I have already
> > created
> > > the patches in our (Splunk) fork and it might be easiest or not to wait
> > > until we re-sync up with the open source repo. We can include the fixes
> > in
> > > the discussion.
> > >
> > > Jack
> > >
> > > On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org>
> wrote:
> > >
> > > > [ External sender. Exercise caution. ]
> > > >
> > > > Hi Jack,
> > > >
> > > > Thanks for getting back.
> > > >
> > > > > What's the best way to share the TLA+ findings?
> > > >
> > > > Would you be able to share the spec? I'm ok with reading TLA+.
> > > >
> > > > As for sharing your specific findings, I'd suggest one of the
> > following:
> > > >
> > > > 1- Create an email thread describing the scenarios that trigger a
> bug.
> > > > 2- Create issues, one for each problem you found.
> > > > 3- Create a discussion on the project Slack, perhaps a channel
> specific
> > > > for it.
> > > > 4- Set up a zoom call to present and discuss with the community.
> > > >
> > > > Option 2 is ideal from a community perspective, but we can also set
> up
> > a
> > > > call inviting everyone and create issues out of that discussion. We
> can
> > > in
> > > > fact set up a call even if we create the issues ahead of time.
> > > >
> > > > Does it make sense?
> > > >
> > > > -Flavio
> > > >
> > > > > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
> > > .INVALID>
> > > > wrote:
> > > > >
> > > > > Hi Flavio,
> > > > >
> > > > >>> This is an example of a scenario corresponding to what we suspect
> > is
> > > a
> > > > > bug introduced earlier, but Enrico is arguing that this is not the
> > > > intended
> > > > > behavior, and at this point, I agree.
> > > > >
> > > > >>> By the time a successful callback is received, the client might
> > only
> > > > > have replicated AQ ways, so the guarantee can only be at that point
> > of
> > > > > being able to tolerate AQ - 1 crashes. The ledger configuration
> > states
> > > > that
> > > > > the application wants to have WQ copies >> of each entry, though.
> I'd
> > > > > expect a ledger to have WQ copies of each entry up to the final
> entry
> > > > > number when it is closed. Do you see it differently?
> > > > >
> > > > > I also agree and was pretty surprised when I discovered the
> > behaviour.
> > > It
> > > > > is not something that users expect and I think we need to correct
> it.
> > > So
> > > > > I'm with you.
> > > > >
> > > > > What's the best way to share the TLA+ findings?
> > > > >
> > > > > Jack
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
> > > wrote:
> > > > >
> > > > >> [ External sender. Exercise caution. ]
> > > > >>
> > > > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > the
> > > > >>> confirm callback to the client is called and the LAC is set to
> > > 100.Now
> > > > >> the
> > > > >>> 3rd bookie times out. Ensemble change is executed and all pending
> > > adds
> > > > >> that
> > > > >>> are above the LAC of 100 are replayed to another bookie, meaning
> > that
> > > > the
> > > > >>> entry e100 is not replayed to another bookie, causing this entry
> to
> > > > meet
> > > > >>> the rep factor of only AQ.
> > > > >>
> > > > >> This is an example of a scenario corresponding to what we suspect
> > is a
> > > > bug
> > > > >> introduced earlier, but Enrico is arguing that this is not the
> > > intended
> > > > >> behavior, and at this point, I agree.
> > > > >>
> > > > >>> This is alluded to in the docs as they state
> > > > >>> that AQ is also the minimum guaranteed replication factor.
> > > > >>
> > > > >> By the time a successful callback is received, the client might
> only
> > > > have
> > > > >> replicated AQ ways, so the guarantee can only be at that point of
> > > being
> > > > >> able to tolerate AQ - 1 crashes. The ledger configuration states
> > that
> > > > the
> > > > >> application wants to have WQ copies of each entry, though. I'd
> > expect
> > > a
> > > > >> ledger to have WQ copies of each entry up to the final entry
> number
> > > > when it
> > > > >> is closed. Do you see it differently?
> > > > >>
> > > > >>> I'd be happy to set up a meeting to discuss the spec and its
> > > findings.
> > > > >>
> > > > >>
> > > > >> That'd be great, I'm interested.
> > > > >>
> > > > >> -Flavio
> > > > >>
> > > > >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
> jvanlightly@splunk.com
> > > > .INVALID>
> > > > >> wrote:
> > > > >>>
> > > > >>>> No you cannot miss data, if the client is not able to find a
> > bookie
> > > > that
> > > > >>> is
> > > > >>>> able to answer with the entry it receives an error.
> > > > >>>
> > > > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > the
> > > > >>> confirm callback to the client is called and the LAC is set to
> 100.
> > > Now
> > > > >> the
> > > > >>> 3rd bookie times out. Ensemble change is executed and all pending
> > > adds
> > > > >> that
> > > > >>> are above the LAC of 100 are replayed to another bookie, meaning
> > that
> > > > the
> > > > >>> entry e100 is not replayed to another bookie, causing this entry
> to
> > > > meet
> > > > >>> the rep factor of only AQ. This is alluded to in the docs as they
> > > state
> > > > >>> that AQ is also the minimum guaranteed replication factor.
> > > > >>>
> > > > >>>> The recovery read fails if it is not possible to read every
> entry
> > > from
> > > > >> at
> > > > >>>> least AQ bookies  (that is it allows WQ-QA read failures),
> > > > >>>> and it does not hazard to "repair" (truncate) the ledger if it
> > does
> > > > not
> > > > >>>> find enough bookies.
> > > > >>>
> > > > >>> This is not quite accurate. A single successful read is enough.
> > > However
> > > > >>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
> > > with
> > > > >>> explicit NoSuchEntry/Ledger, the read is considered failed and
> the
> > > > ledger
> > > > >>> recovery process ends there. This means that given the responses
> > > > >>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
> > is
> > > > >>> considered successful is non-deterministic. If the response from
> b1
> > > is
> > > > >>> received last, then the read is already considered failed,
> > otherwise
> > > > the
> > > > >>> read succeeds.
> > > > >>>
> > > > >>> I have come to the above conclusions through my reverse
> engineering
> > > > >> process
> > > > >>> for creating the TLA+ specification. I still have pending to
> > > > >>> reproduce the AQ rep factor behaviour via some tests, but have
> > > verified
> > > > >> via
> > > > >>> tests the conclusion about ledger recovery reads.
> > > > >>>
> > > > >>> Note that I have found two defects with the BookKeeper protocol,
> > most
> > > > >>> notably data loss due to that fencing does not prevent further
> > > > successful
> > > > >>> adds. Currently the specification and associated documentation is
> > on
> > > a
> > > > >>> private Splunk repo, but I'd be happy to set up a meeting to
> > discuss
> > > > the
> > > > >>> spec and its findings.
> > > > >>>
> > > > >>> Best
> > > > >>> Jack
> > > > >>>
> > > > >>>
> > > > >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> > > eolivelli@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> [ External sender. Exercise caution. ]
> > > > >>>>
> > > > >>>> Jonathan,
> > > > >>>>
> > > > >>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> > > > >>>> jbellis@apache.org>
> > > > >>>> ha scritto:
> > > > >>>>
> > > > >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > > > >>>>>> Hi,
> > > > >>>>>>
> > > > >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > > > confirm
> > > > >>>>> that
> > > > >>>>>> once confirmed, that an entry is not replayed to another
> bookie.
> > > > This
> > > > >>>>>> leaves a "hole" as the entry is now replicated only to 2
> > bookies,
> > > > >>>>> however,
> > > > >>>>>> the new data integrity check that Ivan worked on, when run
> > > > >> periodically
> > > > >>>>>> will be able to repair that hole.
> > > > >>>>>
> > > > >>>>> Can I read from the bookie with a hole in the meantime, and
> > > silently
> > > > >> miss
> > > > >>>>> data that it doesn't know about?
> > > > >>>>>
> > > > >>>>
> > > > >>>> No you cannot miss data, if the client is not able to find a
> > bookie
> > > > >> that is
> > > > >>>> able to answer with the entry it receives an error.
> > > > >>>>
> > > > >>>> The ledger has a known tail (LastAddConfirmed entry) and this
> > value
> > > is
> > > > >>>> stored on ledger metadata once the ledger is "closed".
> > > > >>>>
> > > > >>>> When the ledger is still open, that is when the writer is
> writing
> > to
> > > > it,
> > > > >>>> the reader is allowed to read only up to the LastAddConfirmed
> > entry
> > > > >>>> this LAC value is returned to the reader using a piggyback
> > > mechanism,
> > > > >>>> without reading from metadata.
> > > > >>>> The reader cannot read beyond the latest position that has been
> > > > >> confirmed
> > > > >>>> to the writer by AQ bookies.
> > > > >>>>
> > > > >>>> We have a third case, the 'recovery read'.
> > > > >>>> A reader starts a "recovery read" when you want to recover a
> > ledger
> > > > that
> > > > >>>> has been abandoned by a dead writer
> > > > >>>> or when you are a new leader (Pulsar Bundle Owner) or you want
> to
> > > > fence
> > > > >> out
> > > > >>>> the old leader.
> > > > >>>> In this case the reader merges the current status of the ledger
> on
> > > ZK
> > > > >> with
> > > > >>>> the result of a scan of the whole ledger.
> > > > >>>> Basically it reads the ledger from the beginning up to the tail,
> > > until
> > > > >> it
> > > > >>>> is able to "read" entries, this way it is setting the 'fenced'
> > flag
> > > on
> > > > >> the
> > > > >>>> ledger
> > > > >>>> on every bookie and also it is able to detect the actual tail of
> > the
> > > > >> ledger
> > > > >>>> (because the writer died and it was not able to flush metadata
> to
> > > ZK).
> > > > >>>>
> > > > >>>> The recovery read fails if it is not possible to read every
> entry
> > > from
> > > > >> at
> > > > >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > > > >>>> and it does not hazard to "repair" (truncate) the ledger if it
> > does
> > > > not
> > > > >>>> find enough bookies.
> > > > >>>>
> > > > >>>> I hope that helps
> > > > >>>> Enrico
> > > > >>>>
> > > > >>
> > > > >>
> > > > >>
> > > >
> > > >
> > > >
> > >
> >
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Jack Vanlightly <jv...@splunk.com.INVALID>.
> Did you guys see any issues with the ledger auditor?

> The active writer can't guarantee writing entries to WQ because it can
> crash while retrying the adds to the remaining (WQ - AQ) bookies.

The need to repair AQ-replicated entries is clear, and the auditor is one
such strategy. Ivan has also worked on a self-healing bookie strategy where
each bookie detects these holes and obtains the missing entries itself.
Detection is more efficient with this strategy, as the only network calls
are the ledger metadata scans (to zk) and the missing-entry reads (to other
bookies). The auditor, as I understand it, reads all entries of all ledgers
from all bookies (of an entry's ensemble), meaning every entry crosses the
network. The auditor approach is therefore likely to be run less frequently
due to the network cost.
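
To make the comparison concrete, the bookie-side detection loop would look
roughly like this (a minimal sketch; every name here is illustrative and
not taken from Ivan's actual patch):

    // Sketch: bookie-local hole detection (illustrative names throughout).
    // Only the metadata scans (to zk) and the reads of genuinely missing
    // entries cross the network.
    void repairLocalHoles() {
        for (LedgerMetadata md : metadataStore.ledgersContaining(localBookieId)) {
            for (long entryId : entriesAssignedTo(localBookieId, md)) {
                if (!localStorage.hasEntry(md.getLedgerId(), entryId)) {
                    // Fetch the missing entry from another bookie in the
                    // entry's write set and store it locally.
                    ByteBuf entry = readFromPeerBookie(md, entryId);
                    localStorage.addEntry(md.getLedgerId(), entryId, entry);
                }
            }
        }
    }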

I also wonder whether the writer, on performing an ensemble change, should
replay "AQ but not WQ" entries; that would leave writer failures as the
only remaining cause of these AQ-replicated entries.
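
In PendingAddOp terms, the change would be roughly the following (a sketch
only; ackedBookieCount() is a hypothetical helper, and the real code is
structured differently):

    // Sketch: on ensemble change, pick which pending adds to replay to the
    // replacement bookie. Today only unconfirmed adds are replayed; the
    // idea is to also replay confirmed adds not yet acked by all WQ bookies.
    void replayOnEnsembleChange(int replacedBookieIndex) {
        for (PendingAddOp op : pendingAddOps) {
            boolean confirmed = op.entryId <= lastAddConfirmed;
            boolean fullyReplicated = op.ackedBookieCount() == writeQuorumSize;
            if (!confirmed || !fullyReplicated) { // today: only "!confirmed"
                op.unsetSuccessAndSendWriteRequest(ensemble, replacedBookieIndex);
            }
        }
    }

Of course that only helps if the client still holds the entry payload until
all WQ bookies have acknowledged it, which is exactly the memory trade-off
this thread started with.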

> Regarding recovery reads, recovery read doesn't need to be deterministic.
> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> either including it or excluding it in the sealed ledger is correct
> behavior. The bookkeeper client guarantees that once a ledger is sealed,
> the entries in the sealed ledger can always be read and can be read
> consistently.

> I am not sure it is a problem unless I misunderstand it.

It is true that it doesn't violate any safety property, but it is a strange
check to me. It looks like an implementation artefact rather than an
explicit protocol design choice. But not a huge deal.
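
For reference, the decision rule as I have modelled it amounts to something
like the following (a sketch with illustrative names; whichever condition
is met first wins, which is what makes the outcome depend on response
arrival order):

    // Sketch: per-entry bookkeeping during a ledger recovery read.
    void onRecoveryReadResponse(ReadResponse r) {
        if (r.isOk()) {
            completeRead(r.entry()); // a single positive response suffices
        } else if (r.isNoSuchEntryOrLedger()
                && ++explicitNegatives >= writeQuorumSize - ackQuorumSize + 1) {
            failRead(); // (WQ-AQ)+1 explicit negatives fail the read
        }
        // In this sketch, timeouts and other errors count towards neither
        // condition.
    }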

Jack


On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <gu...@gmail.com> wrote:

>
> Sorry for being late in this thread.
>
> If I understand this correctly, the main topic is about the "hole" when
> WQ > AQ.
>
> > This leaves a "hole" as the entry is now replicated only to 2 bookies,
>
> We do have one hole when ensemble change is enabled and WQ > AQ. That was a
> known behavior. But the hole will be repaired by the ledger auditor as JV
> said. Did you guys see any issues with the ledger auditor?
>
> > I'd think that we guarantee that an entry that is acknowledged is
> eventually written WQ ways and that it is observable by readers when the
> ledger is closed.
>
> To Flavio's question, we don't guarantee (and can't guarantee) that the
> active writer will eventually write the entries to WQ. For the active
> writers, we only guarantee entries are written to AQ. The ledger auditor is
> to ensure all the entries are written to WQ.
>
> The active writer can't guarantee writing entries to WQ because it can
> crash while retrying the adds to the remaining (WQ - AQ) bookies.
>
> >  A single successful read is enough. However
> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> recovery process ends there. This means that given the responses
> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> considered successful is non-deterministic.
>
> Regarding recovery reads, recovery read doesn't need to be deterministic.
> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> either including it or excluding it in the sealed ledger is correct
> behavior. The bookkeeper client guarantees that once a ledger is sealed,
> the entries in the sealed ledger can always be read and can be read
> consistently.
>
> I am not sure it is a problem unless I misunderstand it.
>
> - Sijie
>
> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> <jv...@splunk.com.invalid> wrote:
>
> > Let's set up a call and create any issues from that. I have already
> created
> > the patches in our (Splunk) fork and it may or may not be easiest to wait
> > until we re-sync up with the open source repo. We can include the fixes
> in
> > the discussion.
> >
> > Jack
> >
> > On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org> wrote:
> >
> > >
> > > Hi Jack,
> > >
> > > Thanks for getting back.
> > >
> > > > What's the best way to share the TLA+ findings?
> > >
> > > Would you be able to share the spec? I'm ok with reading TLA+.
> > >
> > > As for sharing your specific findings, I'd suggest one of the
> following:
> > >
> > > 1- Create an email thread describing the scenarios that trigger a bug.
> > > 2- Create issues, one for each problem you found.
> > > 3- Create a discussion on the project Slack, perhaps a channel specific
> > > for it.
> > > 4- Set up a zoom call to present and discuss with the community.
> > >
> > > Option 2 is ideal from a community perspective, but we can also set up
> a
> > > call inviting everyone and create issues out of that discussion. We can
> > in
> > > fact set up a call even if we create the issues ahead of time.
> > >
> > > Does it make sense?
> > >
> > > -Flavio
> > >
> > > > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
> > .INVALID>
> > > wrote:
> > > >
> > > > Hi Flavio,
> > > >
> > > >>> This is an example of a scenario corresponding to what we suspect
> is
> > a
> > > > bug introduced earlier, but Enrico is arguing that this is not the
> > > intended
> > > > behavior, and at this point, I agree.
> > > >
> > > >>> By the time a successful callback is received, the client might
> only
> > > > have replicated AQ ways, so the guarantee can only be at that point
> of
> > > > being able to tolerate AQ - 1 crashes. The ledger configuration
> states
> > > that
> > > > the application wants to have WQ copies of each entry, though. I'd
> > > > expect a ledger to have WQ copies of each entry up to the final entry
> > > > number when it is closed. Do you see it differently?
> > > >
> > > > I also agree and was pretty surprised when I discovered the
> behaviour.
> > It
> > > > is not something that users expect and I think we need to correct it.
> > So
> > > > I'm with you.
> > > >
> > > > What's the best way to share the TLA+ findings?
> > > >
> > > > Jack
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
> > wrote:
> > > >
> > > >>
> > > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> the
> > > >>> confirm callback to the client is called and the LAC is set to
> > 100. Now
> > > >> the
> > > >>> 3rd bookie times out. Ensemble change is executed and all pending
> > adds
> > > >> that
> > > >>> are above the LAC of 100 are replayed to another bookie, meaning
> that
> > > the
> > > >>> entry e100 is not replayed to another bookie, causing this entry to
> > > meet
> > > >>> the rep factor of only AQ.
> > > >>
> > > >> This is an example of a scenario corresponding to what we suspect
> is a
> > > bug
> > > >> introduced earlier, but Enrico is arguing that this is not the
> > intended
> > > >> behavior, and at this point, I agree.
> > > >>
> > > >>> This is alluded to in the docs as they state
> > > >>> that AQ is also the minimum guaranteed replication factor.
> > > >>
> > > >> By the time a successful callback is received, the client might only
> > > have
> > > >> replicated AQ ways, so the guarantee can only be at that point of
> > being
> > > >> able to tolerate AQ - 1 crashes. The ledger configuration states
> that
> > > the
> > > >> application wants to have WQ copies of each entry, though. I'd
> expect
> > a
> > > >> ledger to have WQ copies of each entry up to the final entry number
> > > when it
> > > >> is closed. Do you see it differently?
> > > >>
> > > >>> I'd be happy to set up a meeting to discuss the spec and its
> > findings.
> > > >>
> > > >>
> > > >> That'd be great, I'm interested.
> > > >>
> > > >> -Flavio
> > > >>
> > > >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanlightly@splunk.com
> > > .INVALID>
> > > >> wrote:
> > > >>>
> > > >>>> No you cannot miss data, if the client is not able to find a
> bookie
> > > that
> > > >>> is
> > > >>>> able to answer with the entry it receives an error.
> > > >>>
> > > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> the
> > > >>> confirm callback to the client is called and the LAC is set to 100.
> > Now
> > > >> the
> > > >>> 3rd bookie times out. Ensemble change is executed and all pending
> > adds
> > > >> that
> > > >>> are above the LAC of 100 are replayed to another bookie, meaning
> that
> > > the
> > > >>> entry e100 is not replayed to another bookie, causing this entry to
> > > meet
> > > >>> the rep factor of only AQ. This is alluded to in the docs as they
> > state
> > > >>> that AQ is also the minimum guaranteed replication factor.
> > > >>>
> > > >>>> The recovery read fails if it is not possible to read every entry
> > from
> > > >> at
> > > >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > > >>>> and it does not hazard to "repair" (truncate) the ledger if it
> does
> > > not
> > > >>>> find enough bookies.
> > > >>>
> > > >>> This is not quite accurate. A single successful read is enough.
> > However
> > > >>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
> > with
> > > >>> explicit NoSuchEntry/Ledger, the read is considered failed and the
> > > ledger
> > > >>> recovery process ends there. This means that given the responses
> > > >>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
> is
> > > >>> considered successful is non-deterministic. If the response from b1
> > is
> > > >>> received last, then the read is already considered failed,
> otherwise
> > > the
> > > >>> read succeeds.
> > > >>>
> > > >>> I have come to the above conclusions through my reverse engineering
> > > >> process
> > > >>> for creating the TLA+ specification. I have yet to
> > > >>> reproduce the AQ rep factor behaviour via some tests, but have
> > verified
> > > >> via
> > > >>> tests the conclusion about ledger recovery reads.
> > > >>>
> > > >>> Note that I have found two defects with the BookKeeper protocol,
> most
> > > >>> notably data loss due to fencing not preventing further
> > > successful
> > > >>> adds. Currently the specification and associated documentation is
> on
> > a
> > > >>> private Splunk repo, but I'd be happy to set up a meeting to
> discuss
> > > the
> > > >>> spec and its findings.
> > > >>>
> > > >>> Best
> > > >>> Jack
> > > >>>
> > > >>>
> > > >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> > eolivelli@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>>
> > > >>>> Jonathan,
> > > >>>>
> > > >>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> > > >>>> jbellis@apache.org>
> > > >>>> ha scritto:
> > > >>>>
> > > >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > > confirm
> > > >>>>> that
> > > >>>>>> once confirmed, that an entry is not replayed to another bookie.
> > > This
> > > >>>>>> leaves a "hole" as the entry is now replicated only to 2
> bookies,
> > > >>>>> however,
> > > >>>>>> the new data integrity check that Ivan worked on, when run
> > > >> periodically
> > > >>>>>> will be able to repair that hole.
> > > >>>>>
> > > >>>>> Can I read from the bookie with a hole in the meantime, and
> > silently
> > > >> miss
> > > >>>>> data that it doesn't know about?
> > > >>>>>
> > > >>>>
> > > >>>> No you cannot miss data, if the client is not able to find a
> bookie
> > > >> that is
> > > >>>> able to answer with the entry it receives an error.
> > > >>>>
> > > >>>> The ledger has a known tail (LastAddConfirmed entry) and this
> value
> > is
> > > >>>> stored on ledger metadata once the ledger is "closed".
> > > >>>>
> > > >>>> When the ledger is still open, that is when the writer is writing
> to
> > > it,
> > > >>>> the reader is allowed to read only up to the LastAddConfirmed
> entry
> > > >>>> this LAC value is returned to the reader using a piggyback
> > mechanism,
> > > >>>> without reading from metadata.
> > > >>>> The reader cannot read beyond the latest position that has been
> > > >> confirmed
> > > >>>> to the writer by AQ bookies.
> > > >>>>
> > > >>>> We have a third case, the 'recovery read'.
> > > >>>> A reader starts a "recovery read" when you want to recover a
> ledger
> > > that
> > > >>>> has been abandoned by a dead writer
> > > >>>> or when you are a new leader (Pulsar Bundle Owner) or you want to
> > > fence
> > > >> out
> > > >>>> the old leader.
> > > >>>> In this case the reader merges the current status of the ledger on
> > ZK
> > > >> with
> > > >>>> the result of a scan of the whole ledger.
> > > >>>> Basically it reads the ledger from the beginning up to the tail,
> > until
> > > >> it
> > > >>>> is able to "read" entries, this way it is setting the 'fenced'
> flag
> > on
> > > >> the
> > > >>>> ledger
> > > >>>> on every bookie and also it is able to detect the actual tail of
> the
> > > >> ledger
> > > >>>> (because the writer died and it was not able to flush metadata to
> > ZK).
> > > >>>>
> > > >>>> The recovery read fails if it is not possible to read every entry
> > from
> > > >> at
> > > >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > > >>>> and it does not hazard to "repair" (truncate) the ledger if it
> does
> > > not
> > > >>>> find enough bookies.
> > > >>>>
> > > >>>> I hope that helps
> > > >>>> Enrico
> > > >>>>
> > > >>
> > > >>
> > > >>
> > >
> > >
> > >
> >
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Sijie Guo <gu...@gmail.com>.
Sorry for being late in this thread.

If I understand this correctly, the main topic is about the "hole" when
WQ > AQ.

> This leaves a "hole" as the entry is now replicated only to 2 bookies,

We do have one hole when ensemble change is enabled and WQ > AQ. That was a
known behavior. But the hole will be repaired by the ledger auditor as JV
said. Did you guys see any issues with the ledger auditor?

> I'd think that we guarantee that an entry that is acknowledged is
eventually written WQ ways and that it is observable by readers when the
ledger is closed.

To Flavio's question, we don't guarantee (and can't guarantee) that the
active writer will eventually write the entries to WQ. For the active
writers, we only guarantee entries are written to AQ. The ledger auditor is
to ensure all the entries are written to WQ.

The active writer can't guarantee writing entries to WQ because it can
crash while retrying the adds to the remaining (WQ - AQ) bookies.
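
In pseudo-code, the add path behaves roughly like this (a sketch, not the
actual client code):

    // Sketch: the application callback fires as soon as AQ bookies have
    // acked the add.
    void writeComplete(int bookieIndex, int rc) {
        if (rc == OK && ++positiveAcks == ackQuorumSize) {
            cb.addComplete(OK, entryId); // application sees success here
        }
        // Writes to the remaining (WQ - AQ) bookies continue best-effort
        // in the background; a writer crash at this point leaves the entry
        // at AQ replicas until the auditor repairs it.
    }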

>  A single successful read is enough. However
there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
recovery process ends there. This means that given the responses
b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
considered successful is non-deterministic.

Regarding recovery reads, recovery read doesn't need to be deterministic.
For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
either including it or excluding it in the sealed ledger is correct
behavior. The bookkeeper client guarantees that once a ledger is sealed,
the entries in the sealed ledger can always be read and can be read
consistently.

I am not sure it is a problem unless I misunderstand it.

- Sijie

On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
<jv...@splunk.com.invalid> wrote:

> Let's set up a call and create any issues from that. I have already created
> the patches in our (Splunk) fork and it may or may not be easiest to wait
> until we re-sync up with the open source repo. We can include the fixes in
> the discussion.
>
> Jack
>
> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org> wrote:
>
> >
> > Hi Jack,
> >
> > Thanks for getting back.
> >
> > > What's the best way to share the TLA+ findings?
> >
> > Would you be able to share the spec? I'm ok with reading TLA+.
> >
> > As for sharing your specific findings, I'd suggest one of the following:
> >
> > 1- Create an email thread describing the scenarios that trigger a bug.
> > 2- Create issues, one for each problem you found.
> > 3- Create a discussion on the project Slack, perhaps a channel specific
> > for it.
> > 4- Set up a zoom call to present and discuss with the community.
> >
> > Option 2 is ideal from a community perspective, but we can also set up a
> > call inviting everyone and create issues out of that discussion. We can
> in
> > fact set up a call even if we create the issues ahead of time.
> >
> > Does it make sense?
> >
> > -Flavio
> >
> > > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanlightly@splunk.com
> .INVALID>
> > wrote:
> > >
> > > Hi Flavio,
> > >
> > >>> This is an example of a scenario corresponding to what we suspect is
> a
> > > bug introduced earlier, but Enrico is arguing that this is not the
> > intended
> > > behavior, and at this point, I agree.
> > >
> > >>> By the time a successful callback is received, the client might only
> > > have replicated AQ ways, so the guarantee can only be at that point of
> > > being able to tolerate AQ - 1 crashes. The ledger configuration states
> > that
> > > the application wants to have WQ copies of each entry, though. I'd
> > > expect a ledger to have WQ copies of each entry up to the final entry
> > > number when it is closed. Do you see it differently?
> > >
> > > I also agree and was pretty surprised when I discovered the behaviour.
> It
> > > is not something that users expect and I think we need to correct it.
> So
> > > I'm with you.
> > >
> > > What's the best way to share the TLA+ findings?
> > >
> > > Jack
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org>
> wrote:
> > >
> > >>
> > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > >>> confirm callback to the client is called and the LAC is set to
> 100. Now
> > >> the
> > >>> 3rd bookie times out. Ensemble change is executed and all pending
> adds
> > >> that
> > >>> are above the LAC of 100 are replayed to another bookie, meaning that
> > the
> > >>> entry e100 is not replayed to another bookie, causing this entry to
> > meet
> > >>> the rep factor of only AQ.
> > >>
> > >> This is an example of a scenario corresponding to what we suspect is a
> > bug
> > >> introduced earlier, but Enrico is arguing that this is not the
> intended
> > >> behavior, and at this point, I agree.
> > >>
> > >>> This is alluded to in the docs as they state
> > >>> that AQ is also the minimum guaranteed replication factor.
> > >>
> > >> By the time a successful callback is received, the client might only
> > have
> > >> replicated AQ ways, so the guarantee can only be at that point of
> being
> > >> able to tolerate AQ - 1 crashes. The ledger configuration states that
> > the
> > >> application wants to have WQ copies of each entry, though. I'd expect
> a
> > >> ledger to have WQ copies of each entry up to the final entry number
> > when it
> > >> is closed. Do you see it differently?
> > >>
> > >>> I'd be happy to set up a meeting to discuss the spec and its
> findings.
> > >>
> > >>
> > >> That'd be great, I'm interested.
> > >>
> > >> -Flavio
> > >>
> > >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanlightly@splunk.com
> > .INVALID>
> > >> wrote:
> > >>>
> > >>>> No you cannot miss data, if the client is not able to find a bookie
> > that
> > >>> is
> > >>>> able to answer with the entry it receives an error.
> > >>>
> > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > >>> confirm callback to the client is called and the LAC is set to 100.
> Now
> > >> the
> > >>> 3rd bookie times out. Ensemble change is executed and all pending
> adds
> > >> that
> > >>> are above the LAC of 100 are replayed to another bookie, meaning that
> > the
> > >>> entry e100 is not replayed to another bookie, causing this entry to
> > meet
> > >>> the rep factor of only AQ. This is alluded to in the docs as they
> state
> > >>> that AQ is also the minimum guaranteed replication factor.
> > >>>
> > >>>> The recovery read fails if it is not possible to read every entry
> from
> > >> at
> > >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > >>>> and it does not hazard to "repair" (truncate) the ledger if it does
> > not
> > >>>> find enough bookies.
> > >>>
> > >>> This is not quite accurate. A single successful read is enough.
> However
> > >>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
> with
> > >>> explicit NoSuchEntry/Ledger, the read is considered failed and the
> > ledger
> > >>> recovery process ends there. This means that given the responses
> > >>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > >>> considered successful is non-deterministic. If the response from b1
> is
> > >>> received last, then the read is already considered failed, otherwise
> > the
> > >>> read succeeds.
> > >>>
> > >>> I have come to the above conclusions through my reverse engineering
> > >> process
> > >>> for creating the TLA+ specification. I have yet to
> > >>> reproduce the AQ rep factor behaviour via some tests, but have
> verified
> > >> via
> > >>> tests the conclusion about ledger recovery reads.
> > >>>
> > >>> Note that I have found two defects with the BookKeeper protocol, most
> > >>> notably data loss due to fencing not preventing
> > successful
> > >>> adds. Currently the specification and associated documentation is on
> a
> > >>> private Splunk repo, but I'd be happy to set up a meeting to discuss
> > the
> > >>> spec and its findings.
> > >>>
> > >>> Best
> > >>> Jack
> > >>>
> > >>>
> > >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> eolivelli@gmail.com>
> > >>> wrote:
> > >>>
> > >>>>
> > >>>> Jonathan,
> > >>>>
> > >>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> > >>>> jbellis@apache.org>
> > >>>> ha scritto:
> > >>>>
> > >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > confirm
> > >>>>> that
> > >>>>>> once confirmed, that an entry is not replayed to another bookie.
> > This
> > >>>>>> leaves a "hole" as the entry is now replicated only to 2 bookies,
> > >>>>> however,
> > >>>>>> the new data integrity check that Ivan worked on, when run
> > >> periodically
> > >>>>>> will be able to repair that hole.
> > >>>>>
> > >>>>> Can I read from the bookie with a hole in the meantime, and
> silently
> > >> miss
> > >>>>> data that it doesn't know about?
> > >>>>>
> > >>>>
> > >>>> No you cannot miss data, if the client is not able to find a bookie
> > >> that is
> > >>>> able to answer with the entry it receives an error.
> > >>>>
> > >>>> The ledger has a known tail (LastAddConfirmed entry) and this value
> is
> > >>>> stored on ledger metadata once the ledger is "closed".
> > >>>>
> > >>>> When the ledger is still open, that is when the writer is writing to
> > it,
> > >>>> the reader is allowed to read only up to the LastAddConfirmed entry
> > >>>> this LAC value is returned to the reader using a piggyback
> mechanism,
> > >>>> without reading from metadata.
> > >>>> The reader cannot read beyond the latest position that has been
> > >> confirmed
> > >>>> to the writer by AQ bookies.
> > >>>>
> > >>>> We have a third case, the 'recovery read'.
> > >>>> A reader starts a "recovery read" when you want to recover a ledger
> > that
> > >>>> has been abandoned by a dead writer
> > >>>> or when you are a new leader (Pulsar Bundle Owner) or you want to
> > fence
> > >> out
> > >>>> the old leader.
> > >>>> In this case the reader merges the current status of the ledger on
> ZK
> > >> with
> > >>>> the result of a scan of the whole ledger.
> > >>>> Basically it reads the ledger from the beginning up to the tail,
> until
> > >> it
> > >>>> is able to "read" entries, this way it is setting the 'fenced' flag
> on
> > >> the
> > >>>> ledger
> > >>>> on every bookie and also it is able to detect the actual tail of the
> > >> ledger
> > >>>> (because the writer died and it was not able to flush metadata to
> ZK).
> > >>>>
> > >>>> The recovery read fails if it is not possible to read every entry
> from
> > >> at
> > >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> > >>>> and it does not hazard to "repair" (truncate) the ledger if it does
> > not
> > >>>> find enough bookies.
> > >>>>
> > >>>> I hope that helps
> > >>>> Enrico
> > >>>>
> > >>
> > >>
> > >>
> >
> >
> >
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Jack Vanlightly <jv...@splunk.com.INVALID>.
Let's set up a call and create any issues from that. I have already created
the patches in our (Splunk) fork and it may or may not be easiest to wait
until we re-sync up with the open source repo. We can include the fixes in
the discussion.

Jack

On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <fp...@apache.org> wrote:

>
> Hi Jack,
>
> Thanks for getting back.
>
> > What's the best way to share the TLA+ findings?
>
> Would you be able to share the spec? I'm ok with reading TLA+.
>
> As for sharing your specific findings, I'd suggest one of the following:
>
> 1- Create an email thread describing the scenarios that trigger a bug.
> 2- Create issues, one for each problem you found.
> 3- Create a discussion on the project Slack, perhaps a channel specific
> for it.
> 4- Set up a zoom call to present and discuss with the community.
>
> Option 2 is ideal from a community perspective, but we can also set up a
> call inviting everyone and create issues out of that discussion. We can in
> fact set up a call even if we create the issues ahead of time.
>
> Does it make sense?
>
> -Flavio
>
> > On 15 Jan 2021, at 16:14, Jack Vanlightly <jv...@splunk.com.INVALID>
> wrote:
> >
> > Hi Flavio,
> >
> >>> This is an example of a scenario corresponding to what we suspect is a
> > bug introduced earlier, but Enrico is arguing that this is not the
> intended
> > behavior, and at this point, I agree.
> >
> >>> By the time a successful callback is received, the client might only
> > have replicated AQ ways, so the guarantee can only be at that point of
> > being able to tolerate AQ - 1 crashes. The ledger configuration states
> that
> > the application wants to have WQ copies of each entry, though. I'd
> > expect a ledger to have WQ copies of each entry up to the final entry
> > number when it is closed. Do you see it differently?
> >
> > I also agree and was pretty surprised when I discovered the behaviour. It
> > is not something that users expect and I think we need to correct it. So
> > I'm with you.
> >
> > What's the best way to share the TLA+ findings?
> >
> > Jack
> >
> >
> >
> >
> >
> > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org> wrote:
> >
> >>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> >>> confirm callback to the client is called and the LAC is set to 100. Now
> >> the
> >>> 3rd bookie times out. Ensemble change is executed and all pending adds
> >> that
> >>> are above the LAC of 100 are replayed to another bookie, meaning that
> the
> >>> entry e100 is not replayed to another bookie, causing this entry to
> meet
> >>> the rep factor of only AQ.
> >>
> >> This is an example of a scenario corresponding to what we suspect is a
> bug
> >> introduced earlier, but Enrico is arguing that this is not the intended
> >> behavior, and at this point, I agree.
> >>
> >>> This is alluded to in the docs as they state
> >>> that AQ is also the minimum guaranteed replication factor.
> >>
> >> By the time a successful callback is received, the client might only
> have
> >> replicated AQ ways, so the guarantee can only be at that point of being
> >> able to tolerate AQ - 1 crashes. The ledger configuration states that
> the
> >> application wants to have WQ copies of each entry, though. I'd expect a
> >> ledger to have WQ copies of each entry up to the final entry number
> when it
> >> is closed. Do you see it differently?
> >>
> >>> I'd be happy to set up a meeting to discuss the spec and its findings.
> >>
> >>
> >> That'd be great, I'm interested.
> >>
> >> -Flavio
> >>
> >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanlightly@splunk.com
> .INVALID>
> >> wrote:
> >>>
> >>>> No you cannot miss data, if the client is not able to find a bookie
> that
> >>> is
> >>>> able to answer with the entry it receives an error.
> >>>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> >>> confirm callback to the client is called and the LAC is set to 100. Now
> >> the
> >>> 3rd bookie times out. Ensemble change is executed and all pending adds
> >> that
> >>> are above the LAC of 100 are replayed to another bookie, meaning that
> the
> >>> entry e100 is not replayed to another bookie, causing this entry to
> meet
> >>> the rep factor of only AQ. This is alluded to in the docs as they state
> >>> that AQ is also the minimum guaranteed replication factor.
> >>>
> >>>> The recovery read fails if it is not possible to read every entry from
> >> at
> >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> >>>> and it does not hazard to "repair" (truncate) the ledger if it does
> not
> >>>> find enough bookies.
> >>>
> >>> This is not quite accurate. A single successful read is enough. However
> >>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> >>> explicit NoSuchEntry/Ledger, the read is considered failed and the
> ledger
> >>> recovery process ends there. This means that given the responses
> >>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> >>> considered successful is non-deterministic. If the response from b1 is
> >>> received last, then the read is already considered failed, otherwise
> the
> >>> read succeeds.
> >>>
> >>> I have come to the above conclusions through my reverse engineering
> >> process
> >>> for creating the TLA+ specification. I have yet to
> >>> reproduce the AQ rep factor behaviour via some tests, but have verified
> >> via
> >>> tests the conclusion about ledger recovery reads.
> >>>
> >>> Note that I have found two defects with the BookKeeper protocol, most
> >>> notably data loss due to fencing not preventing
> successful
> >>> adds. Currently the specification and associated documentation is on a
> >>> private Splunk repo, but I'd be happy to set up a meeting to discuss
> the
> >>> spec and its findings.
> >>>
> >>> Best
> >>> Jack
> >>>
> >>>
> >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eo...@gmail.com>
> >>> wrote:
> >>>
> >>>>
> >>>> Jonathan,
> >>>>
> >>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> >>>> jbellis@apache.org>
> >>>> ha scritto:
> >>>>
> >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> confirm
> >>>>> that
> >>>>>> once confirmed, that an entry is not replayed to another bookie.
> This
> >>>>>> leaves a "hole" as the entry is now replicated only to 2 bookies,
> >>>>> however,
> >>>>>> the new data integrity check that Ivan worked on, when run
> >> periodically
> >>>>>> will be able to repair that hole.
> >>>>>
> >>>>> Can I read from the bookie with a hole in the meantime, and silently
> >> miss
> >>>>> data that it doesn't know about?
> >>>>>
> >>>>
> >>>> No you cannot miss data, if the client is not able to find a bookie
> >> that is
> >>>> able to answer with the entry it receives an error.
> >>>>
> >>>> The ledger has a known tail (LastAddConfirmed entry) and this value is
> >>>> stored on ledger metadata once the ledger is "closed".
> >>>>
> >>>> When the ledger is still open, that is when the writer is writing to
> it,
> >>>> the reader is allowed to read only up to the LastAddConfirmed entry
> >>>> this LAC value is returned to the reader using a piggyback mechanism,
> >>>> without reading from metadata.
> >>>> The reader cannot read beyond the latest position that has been
> >> confirmed
> >>>> to the writer by AQ bookies.
> >>>>
> >>>> We have a third case, the 'recovery read'.
> >>>> A reader starts a "recovery read" when you want to recover a ledger
> that
> >>>> has been abandoned by a dead writer
> >>>> or when you are a new leader (Pulsar Bundle Owner) or you want to
> fence
> >> out
> >>>> the old leader.
> >>>> In this case the reader merges the current status of the ledger on ZK
> >> with
> >>>> the result of a scan of the whole ledger.
> >>>> Basically it reads the ledger from the beginning up to the tail, until
> >> it
> >>>> is able to "read" entries, this way it is setting the 'fenced' flag on
> >> the
> >>>> ledger
> >>>> on every bookie and also it is able to detect the actual tail of the
> >> ledger
> >>>> (because the writer died and it was not able to flush metadata to ZK).
> >>>>
> >>>> The recovery read fails if it is not possible to read every entry from
> >> at
> >>>> least AQ bookies (that is, it allows WQ-AQ read failures),
> >>>> and it does not hazard to "repair" (truncate) the ledger if it does
> not
> >>>> find enough bookies.
> >>>>
> >>>> I hope that helps
> >>>> Enrico
> >>>>
> >>
> >>
> >>
>
>
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
Hi Jack,

Thanks for getting back.

> What's the best way to share the TLA+ findings?

Would you be able to share the spec? I'm ok with reading TLA+. 

As for sharing your specific findings, I'd suggest one of the following:

1- Create an email thread describing the scenarios that trigger a bug.
2- Create issues, one for each problem you found.
3- Create a discussion on the project Slack, perhaps a channel specific for it.
4- Set up a zoom call to present and discuss with the community. 

Option 2 is ideal from a community perspective, but we can also set up a call inviting everyone and create issues out of that discussion. We can in fact set up a call even if we create the issues ahead of time.

Does it make sense?

-Flavio

> On 15 Jan 2021, at 16:14, Jack Vanlightly <jv...@splunk.com.INVALID> wrote:
> 
> Hi Flavio,
> 
>>> This is an example of a scenario corresponding to what we suspect is a
> bug introduced earlier, but Enrico is arguing that this is not the intended
> behavior, and at this point, I agree.
> 
>>> By the time a successful callback is received, the client might only
> have replicated AQ ways, so the guarantee can only be at that point of
> being able to tolerate AQ - 1 crashes. The ledger configuration states that
> the application wants to have WQ copies of each entry, though. I'd
> expect a ledger to have WQ copies of each entry up to the final entry
> number when it is closed. Do you see it differently?
> 
> I also agree and was pretty surprised when I discovered the behaviour. It
> is not something that users expect and I think we need to correct it. So
> I'm with you.
> 
> What's the best way to share the TLA+ findings?
> 
> Jack
> 
> 
> 
> 
> 
> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org> wrote:
> 
>> 
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
>>> confirm callback to the client is called and the LAC is set to 100. Now
>> the
>>> 3rd bookie times out. Ensemble change is executed and all pending adds
>> that
>>> are above the LAC of 100 are replayed to another bookie, meaning that the
>>> entry e100 is not replayed to another bookie, causing this entry to meet
>>> the rep factor of only AQ.
>> 
>> This is an example of a scenario corresponding to what we suspect is a bug
>> introduced earlier, but Enrico is arguing that this is not the intended
>> behavior, and at this point, I agree.
>> 
>>> This is alluded to in the docs as they state
>>> that AQ is also the minimum guaranteed replication factor.
>> 
>> By the time a successful callback is received, the client might only have
>> replicated AQ ways, so the guarantee can only be at that point of being
>> able to tolerate AQ - 1 crashes. The ledger configuration states that the
>> application wants to have WQ copies of each entry, though. I'd expect a
>> ledger to have WQ copies of each entry up to the final entry number when it
>> is closed. Do you see it differently?
>> 
>>> I'd be happy to set up a meeting to discuss the spec and its findings.
>> 
>> 
>> That'd be great, I'm interested.
>> 
>> -Flavio
>> 
>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jv...@splunk.com.INVALID>
>> wrote:
>>> 
>>>> No you cannot miss data, if the client is not able to find a bookie that
>>> is
>>>> able to answer with the entry it receives an error.
>>> 
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
>>> confirm callback to the client is called and the LAC is set to 100. Now
>> the
>>> 3rd bookie times out. Ensemble change is executed and all pending adds
>> that
>>> are above the LAC of 100 are replayed to another bookie, meaning that the
>>> entry e100 is not replayed to another bookie, causing this entry to meet
>>> the rep factor of only AQ. This is alluded to in the docs as they state
>>> that AQ is also the minimum guaranteed replication factor.
>>> 
>>>> The recovery read fails if it is not possible to read every entry from
>> at
>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
>>>> and it does not hazard to "repair" (truncate) the ledger if it does not
>>>> find enough bookies.
>>> 
>>> This is not quite accurate. A single successful read is enough. However
>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
>>> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
>>> recovery process ends there. This means that given the responses
>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>> considered successful is non-deterministic. If the response from b1 is
>>> received last, then the read is already considered failed, otherwise the
>>> read succeeds.
>>> 
>>> I have come to the above conclusions through my reverse engineering
>> process
>>> for creating the TLA+ specification. I have yet to
>>> reproduce the AQ rep factor behaviour via some tests, but have verified
>> via
>>> tests the conclusion about ledger recovery reads.
>>> 
>>> Note that I have found two defects with the BookKeeper protocol, most
>>> notably data loss due to fencing not preventing further successful
>>> adds. Currently the specification and associated documentation is on a
>>> private Splunk repo, but I'd be happy to set up a meeting to discuss the
>>> spec and its findings.
>>> 
>>> Best
>>> Jack
>>> 
>>> 
>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eo...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> Jonathan,
>>>> 
>>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
>>>> jbellis@apache.org>
>>>> ha scritto:
>>>> 
>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm
>>>>> that
>>>>>> once confirmed, that an entry is not replayed to another bookie. This
>>>>>> leaves a "hole" as the entry is now replicated only to 2 bookies,
>>>>> however,
>>>>>> the new data integrity check that Ivan worked on, when run
>> periodically
>>>>>> will be able to repair that hole.
>>>>> 
>>>>> Can I read from the bookie with a hole in the meantime, and silently
>> miss
>>>>> data that it doesn't know about?
>>>>> 
>>>> 
>>>> No you cannot miss data, if the client is not able to find a bookie
>> that is
>>>> able to answer with the entry it receives an error.
>>>> 
>>>> The ledger has a known tail (LastAddConfirmed entry) and this value is
>>>> stored on ledger metadata once the ledger is "closed".
>>>> 
>>>> When the ledger is still open, that is when the writer is writing to it,
>>>> the reader is allowed to read only up to the LastAddConfirmed entry
>>>> this LAC value is returned to the reader using a piggyback mechanism,
>>>> without reading from metadata.
>>>> The reader cannot read beyond the latest position that has been
>> confirmed
>>>> to the writer by AQ bookies.
>>>> 
>>>> We have a third case, the 'recovery read'.
>>>> A reader starts a "recovery read" when you want to recover a ledger that
>>>> has been abandoned by a dead writer
>>>> or when you are a new leader (Pulsar Bundle Owner) or you want to fence
>> out
>>>> the old leader.
>>>> In this case the reader merges the current status of the ledger on ZK
>> with
>>>> the result of a scan of the whole ledger.
>>>> Basically it reads the ledger from the beginning up to the tail, until
>> it
>>>> is able to "read" entries, this way it is setting the 'fenced' flag on
>> the
>>>> ledger
>>>> on every bookie and also it is able to detect the actual tail of the
>> ledger
>>>> (because the writer died and it was not able to flush metadata to ZK).
>>>> 
>>>> The recovery read fails if it is not possible to read every entry from
>> at
>>>> least AQ bookies (that is, it allows WQ-AQ read failures),
>>>> and it does not hazard to "repair" (truncate) the ledger if it does not
>>>> find enough bookies.
>>>> 
>>>> I hope that helps
>>>> Enrico
>>>> 
>> 
>> 
>> 


Re: Unbounded memory usage for WQ > AQ ?

Posted by Jack Vanlightly <jv...@splunk.com.INVALID>.
Hi Flavio,

>> This is an example of a scenario corresponding to what we suspect is a
bug introduced earlier, but Enrico is arguing that this is not the intended
behavior, and at this point, I agree.

>> By the time a successful callback is received, the client might only
have replicated AQ ways, so the guarantee can only be at that point of
being able to tolerate AQ - 1 crashes. The ledger configuration states that
the application wants to have WQ copies of each entry, though. I'd
expect a ledger to have WQ copies of each entry up to the final entry
number when it is closed. Do you see it differently?

I also agree and was pretty surprised when I discovered the behaviour. It
is not something that users expect and I think we need to correct it. So
I'm with you.

What's the best way to share the TLA+ findings?

Jack





On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <fp...@apache.org> wrote:

>
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> the
> > 3rd bookie times out. Ensemble change is executed and all pending adds
> that
> > are above the LAC of 100 are replayed to another bookie, meaning that the
> > entry e100 is not replayed to another bookie, causing this entry to meet
> > the rep factor of only AQ.
>
> This is an example of a scenario corresponding to what we suspect is a bug
> introduced earlier, but Enrico is arguing that this is not the intended
> behavior, and at this point, I agree.
>
> > This is alluded to in the docs as they state
> > that AQ is also the minimum guaranteed replication factor.
>
> By the time a successful callback is received, the client might only have
> replicated AQ ways, so the guarantee can only be at that point of being
> able to tolerate AQ - 1 crashes. The ledger configuration states that the
> application wants to have WQ copies of each entry, though. I'd expect a
> ledger to have WQ copies of each entry up to the final entry number when it
> is closed. Do you see it differently?
>
> >  I'd be happy to set up a meeting to discuss the spec and its findings.
>
>
> That'd be great, I'm interested.
>
> -Flavio
>
> > On 15 Jan 2021, at 15:30, Jack Vanlightly <jv...@splunk.com.INVALID>
> wrote:
> >
> >> No you cannot miss data, if the client is not able to find a bookie that
> > is
> >> able to answer with the entry it receives an error.
> >
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> the
> > 3rd bookie times out. Ensemble change is executed and all pending adds
> that
> > are above the LAC of 100 are replayed to another bookie, meaning that the
> > entry e100 is not replayed to another bookie, causing this entry to meet
> > the rep factor of only AQ. This is alluded to in the docs as they state
> > that AQ is also the minimum guaranteed replication factor.
> >
> >> The recovery read fails if it is not possible to read every entry from
> at
> >> least AQ bookies (that is, it allows WQ-AQ read failures),
> >> and it does not hazard to "repair" (truncate) the ledger if it does not
> >> find enough bookies.
> >
> > This is not quite accurate. A single successful read is enough. However
> > there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> > explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> > recovery process ends there. This means that given the responses
> > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > considered successful is non-deterministic. If the response from b1 is
> > received last, then the read is already considered failed, otherwise the
> > read succeeds.
> >
> > I have come to the above conclusions through my reverse engineering
> process
> > for creating the TLA+ specification. I have yet to
> > reproduce the AQ rep factor behaviour via some tests, but have verified
> via
> > tests the conclusion about ledger recovery reads.
> >
> > Note that I have found two defects with the BookKeeper protocol, most
> > notably data loss due to fencing not preventing further successful
> > adds. Currently the specification and associated documentation is on a
> > private Splunk repo, but I'd be happy to set up a meeting to discuss the
> > spec and its findings.
> >
> > Best
> > Jack
> >
> >
> > On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eo...@gmail.com>
> > wrote:
> >
> >>
> >> Jonathan,
> >>
> >> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> >> jbellis@apache.org>
> >> ha scritto:
> >>
> >>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>> Hi,
> >>>>
> >>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> >>> that
> >>>> once confirmed, that an entry is not replayed to another bookie. This
> >>>> leaves a "hole" as the entry is now replicated only to 2 bookies,
> >>> however,
> >>>> the new data integrity check that Ivan worked on, when run
> periodically
> >>>> will be able to repair that hole.
> >>>
> >>> Can I read from the bookie with a hole in the meantime, and silently
> miss
> >>> data that it doesn't know about?
> >>>
> >>
> >> No you cannot miss data, if the client is not able to find a bookie
> that is
> >> able to answer with the entry it receives an error.
> >>
> >> The ledger has a known tail (LastAddConfirmed entry) and this value is
> >> stored on ledger metadata once the ledger is "closed".
> >>
> >> When the ledger is still open, that is when the writer is writing to it,
> >> the reader is allowed to read only up to the LastAddConfirmed entry
> >> this LAC value is returned to the reader using a piggyback mechanism,
> >> without reading from metadata.
> >> The reader cannot read beyond the latest position that has been
> confirmed
> >> to the writer by AQ bookies.
> >>
> >> We have a third case, the 'recovery read'.
> >> A reader starts a "recovery read" when you want to recover a ledger that
> >> has been abandoned by a dead writer
> >> or when you are a new leader (Pulsar Bundle Owner) or you want to fence
> out
> >> the old leader.
> >> In this case the reader merges the current status of the ledger on ZK
> with
> >> the result of a scan of the whole ledger.
> >> Basically it reads the ledger from the beginning up to the tail, until
> it
> >> is able to "read" entries, this way it is setting the 'fenced' flag on
> the
> >> ledger
> >> on every bookie and also it is able to detect the actual tail of the
> ledger
> >> (because the writer died and it was not able to flush metadata to ZK).
> >>
> >> The recovery read fails if it is not possible to read every entry from
> at
> >> least AQ bookies (that is, it allows WQ-AQ read failures),
> >> and it does not hazard to "repair" (truncate) the ledger if it does not
> >> find enough bookies.
> >>
> >> I hope that helps
> >> Enrico
> >>
>
>
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ.

This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree. 

> This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.

By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?

>  I'd be happy to set up a meeting to discuss the spec and its findings. 


That'd be great, I'm interested.

-Flavio

> On 15 Jan 2021, at 15:30, Jack Vanlightly <jv...@splunk.com.INVALID> wrote:
> 
>> No you cannot miss data, if the client is not able to find a bookie that
> is
>> able to answer with the entry it receives an error.
> 
> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ. This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.
> 
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies (that is, it allows WQ-AQ read failures),
>> and it does not hazard to "repair" (truncate) the ledger if it does not
>> find enough bookies.
> 
> This is not quite accurate. A single successful read is enough. However
> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> recovery process ends there. This means that given the responses
> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> considered successful is non-deterministic. If the response from b1 is
> received last, then the read is already considered failed, otherwise the
> read succeeds.
> 
> I have come to the above conclusions through my reverse engineering process
> for creating the TLA+ specification. I have yet to
> reproduce the AQ rep factor behaviour via some tests, but have verified via
> tests the conclusion about ledger recovery reads.
> 
> Note that I have found two defects with the BookKeeper protocol, most
> notably data loss due to fencing not preventing further successful
> adds. Currently the specification and associated documentation is on a
> private Splunk repo, but I'd be happy to set up a meeting to discuss the
> spec and its findings.
> 
> Best
> Jack
> 
> 
> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eo...@gmail.com>
> wrote:
> 
>> 
>> Jonathan,
>> 
>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
>> jbellis@apache.org>
>> ha scritto:
>> 
>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>> Hi,
>>>> 
>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm
>>> that
>>>> once confirmed, that an entry is not replayed to another bookie. This
>>>> leaves a "hole" as the entry is now replicated only to 2 bookies,
>>> however,
>>>> the new data integrity check that Ivan worked on, when run periodically
>>>> will be able to repair that hole.
>>> 
>>> Can I read from the bookie with a hole in the meantime, and silently miss
>>> data that it doesn't know about?
>>> 
>> 
>> No you cannot miss data, if the client is not able to find a bookie that is
>> able to answer with the entry it receives an error.
>> 
>> The ledger has a known tail (LastAddConfirmed entry) and this value is
>> stored on ledger metadata once the ledger is "closed".
>> 
>> When the ledger is still open, that is when the writer is writing to it,
>> the reader is allowed to read only up to the LastAddConfirmed entry
>> this LAC value is returned to the reader using a piggyback mechanism,
>> without reading from metadata.
>> The reader cannot read beyond the latest position that has been confirmed
>> to the writer by AQ bookies.
>> 
>> We have a third case, the 'recovery read'.
>> A reader starts a "recovery read" when you want to recover a ledger that
>> has been abandoned by a dead writer
>> or when you are a new leader (Pulsar Bundle Owner) or you want to fence out
>> the old leader.
>> In this case the reader merges the current status of the ledger on ZK with
>> the result of a scan of the whole ledger.
>> Basically it reads the ledger from the beginning up to the tail, until it
>> is able to "read" entries, this way it is setting the 'fenced' flag on the
>> ledger
>> on every bookie and also it is able to detect the actual tail of the ledger
>> (because the writer died and it was not able to flush metadata to ZK).
>> 
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies (that is, it allows WQ-AQ read failures),
>> and it does not hazard to "repair" (truncate) the ledger if it does not
>> find enough bookies.
>> 
>> I hope that helps
>> Enrico
>> 


Re: Unbounded memory usage for WQ > AQ ?

Posted by Jack Vanlightly <jv...@splunk.com.INVALID>.
> No you cannot miss data, if the client is not able to find a bookie that
is
> able to answer with the entry it receives an error.

Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
confirm callback to the client is called and the LAC is set to 100. Now the
3rd bookie times out. Ensemble change is executed and all pending adds that
are above the LAC of 100 are replayed to another bookie, meaning that the
entry e100 is not replayed to another bookie, causing this entry to meet
the rep factor of only AQ. This is alluded to in the docs as they state
that AQ is also the minimum guaranteed replication factor.

> The recovery read fails if it is not possible to read every entry from at
> least AQ bookies (that is, it allows WQ-AQ read failures),
> and it does not attempt to "repair" (truncate) the ledger if it cannot
> find enough bookies.

This is not quite accurate. A single successful read is enough. However,
there is an odd behaviour: as soon as (WQ-AQ)+1 reads fail with an
explicit NoSuchEntry/Ledger response, the read is considered failed and
the ledger recovery process ends there. This means that, given the
responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the
read is considered successful is non-deterministic. If the response from
b1 is received last, the read has already been considered failed;
otherwise the read succeeds.
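
Under the behaviour described above, the accounting looks roughly like
this (a simplified Java sketch with hypothetical names, not the actual
client code):

    class RecoveryReadQuorum {
        private final int writeQuorum;
        private final int ackQuorum;
        private int negatives;

        RecoveryReadQuorum(int writeQuorum, int ackQuorum) {
            this.writeQuorum = writeQuorum;
            this.ackQuorum = ackQuorum;
        }

        enum Outcome { SUCCESS, FAILED, PENDING }

        Outcome onResponse(boolean noSuchEntry) {
            if (!noSuchEntry) {
                return Outcome.SUCCESS;  // a single good read is enough
            }
            negatives++;
            // With WQ=3, AQ=2 the threshold is (3-2)+1 = 2, so two
            // NoSuchEntry/Ledger responses fail the read even if a
            // success from the remaining bookie is still in flight.
            return negatives >= (writeQuorum - ackQuorum) + 1
                    ? Outcome.FAILED : Outcome.PENDING;
        }
    }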

I have come to the above conclusions through the reverse engineering
process of creating the TLA+ specification. I still have to reproduce the
AQ replication factor behaviour via tests, but I have verified the
conclusion about ledger recovery reads that way.

Note that I have found two defects with the BookKeeper protocol, most
notably data loss caused by fencing not preventing further successful
adds. Currently the specification and associated documentation are on a
private Splunk repo, but I'd be happy to set up a meeting to discuss the
spec and its findings.

Best
Jack


On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eo...@gmail.com>
wrote:

>
> Jonathan,
>
> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jbellis@apache.org>
> wrote:
>
> > On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > > Hi,
> > >
> > > I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> > that
> > > once confirmed, an entry is not replayed to another bookie. This
> > > leaves a "hole" as the entry is now replicated only to 2 bookies,
> > however,
> > > the new data integrity check that Ivan worked on, when run periodically
> > > will be able to repair that hole.
> >
> > Can I read from the bookie with a hole in the meantime, and silently miss
> > data that it doesn't know about?
> >
>
> No, you cannot miss data: if the client is not able to find a bookie that
> is able to answer with the entry, it receives an error.
>
> The ledger has a known tail (the LastAddConfirmed entry) and this value is
> stored in the ledger metadata once the ledger is "closed".
>
> When the ledger is still open, that is, while the writer is writing to it,
> the reader is allowed to read only up to the LastAddConfirmed entry.
> This LAC value is returned to the reader using a piggyback mechanism,
> without reading from the metadata.
> The reader cannot read beyond the latest position that has been confirmed
> to the writer by AQ bookies.
>
> We have a third case, the 'recovery read'.
> A reader starts a "recovery read" when it wants to recover a ledger that
> has been abandoned by a dead writer,
> or when a new leader (the Pulsar Bundle Owner, for instance) wants to
> fence out the old leader.
> In this case the reader merges the current status of the ledger on ZK with
> the result of a scan of the whole ledger.
> Basically it reads the ledger from the beginning up to the tail, as long
> as it is able to "read" entries; this way it sets the 'fenced' flag on the
> ledger on every bookie, and it is also able to detect the actual tail of
> the ledger (because the writer died and was not able to flush its metadata
> to ZK).
>
> The recovery read fails if it is not possible to read every entry from at
> least AQ bookies (that is, it allows WQ-AQ read failures),
> and it does not attempt to "repair" (truncate) the ledger if it cannot
> find enough bookies.
>
> I hope that helps
> Enrico
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Enrico Olivelli <eo...@gmail.com>.
Jonathan,

On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jb...@apache.org>
wrote:

> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > Hi,
> >
> > I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> that
> > once confirmed, that an entry is not replayed to another bookie. This
> > leaves a "hole" as the entry is now replicated only to 2 bookies,
> however,
> > the new data integrity check that Ivan worked on, when run periodically
> > will be able to repair that hole.
>
> Can I read from the bookie with a hole in the meantime, and silently miss
> data that it doesn't know about?
>

No, you cannot miss data: if the client is not able to find a bookie that
is able to answer with the entry, it receives an error.

The ledger has a known tail (the LastAddConfirmed entry) and this value is
stored in the ledger metadata once the ledger is "closed".

When the ledger is still open, that is, while the writer is writing to it,
the reader is allowed to read only up to the LastAddConfirmed entry.
This LAC value is returned to the reader using a piggyback mechanism,
without reading from the metadata.
The reader cannot read beyond the latest position that has been confirmed
to the writer by AQ bookies.
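
As an illustration of this rule, here is a minimal Java sketch of a
tailing reader; the Ledger interface is hypothetical, not the actual
BookKeeper API:

    class LacBoundedReader {

        interface Ledger {
            boolean isClosed();
            long lastAddConfirmed();    // piggybacked on read responses
            byte[] read(long entryId);  // only valid for entryId <= LAC
        }

        static void tail(Ledger ledger) throws InterruptedException {
            long next = 0;
            while (true) {
                long lac = ledger.lastAddConfirmed();
                while (next <= lac) {
                    // Safe: these entries reached AQ bookies.
                    byte[] entry = ledger.read(next++);
                    // ... hand the entry to the application ...
                }
                if (ledger.isClosed()) {
                    break;              // the LAC is final once closed
                }
                Thread.sleep(100);      // poll for a newer LAC
            }
        }
    }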

We have a third case, the 'recovery read'.
A reader starts a "recovery read" when it wants to recover a ledger that
has been abandoned by a dead writer,
or when a new leader (the Pulsar Bundle Owner, for instance) wants to
fence out the old leader.
In this case the reader merges the current status of the ledger on ZK with
the result of a scan of the whole ledger.
Basically it reads the ledger from the beginning up to the tail, as long
as it is able to "read" entries; this way it sets the 'fenced' flag on the
ledger on every bookie, and it is also able to detect the actual tail of
the ledger (because the writer died and was not able to flush its metadata
to ZK).

The recovery read fails if it is not possible to read every entry from at
least AQ bookies (that is, it allows WQ-AQ read failures),
and it does not attempt to "repair" (truncate) the ledger if it cannot
find enough bookies.

I hope that helps
Enrico

Re: Unbounded memory usage for WQ > AQ ?

Posted by Jonathan Ellis <jb...@apache.org>.
On 2021/01/11 08:31:03, Jack Vanlightly wrote: 
> Hi,
> 
> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, an entry is not replayed to another bookie. This
> leaves a "hole" as the entry is now replicated only to 2 bookies, however,
> the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.

Can I read from the bookie with a hole in the meantime, and silently miss data that it doesn't know about?

Re: Unbounded memory usage for WQ > AQ ?

Posted by Flavio Junqueira <fp...@apache.org>.
Hi Jack,

> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, an entry is not replayed to another bookie.

Should I assume that you modeled it after the code? Otherwise, what did you use as a reference? Is the TLA+ spec available anywhere? It sounds like a good development.

> once confirmed, an entry is not replayed to another bookie.


I'd like to understand this a bit better. I think this is saying that if I
have an entry e that is written to only AQ bookies (with AQ < WQ), and at
least one bookie b in the ledger ensemble crashes before it writes e, then
e is considered confirmed, and when b is replaced with b' for the ledger,
e is not replicated on b'.

If that's the case, then isn't it a bug?

>  the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.


This is good, but I'm not sure this is a replacement for a proper fix.

Please let me know if I'm missing anything.

-Flavio 

> On 11 Jan 2021, at 09:31, Jack Vanlightly <jv...@splunk.com.INVALID> wrote:
> 
> Hi,
> 
> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, an entry is not replayed to another bookie. This
> leaves a "hole" as the entry is now replicated only to 2 bookies, however,
> the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.
> 
> Jack
> 
> On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri <ju...@gmail.com>
> wrote:
> 
>> 
>> On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli <ma...@gmail.com>
>> wrote:
>> 
>>> On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
>>> <ju...@gmail.com> wrote:
>>>> 
>>>>> otherwise the write will timeout internally and it will get replayed
>>>>> to a new bookie.
>>>> If Qa is met and the writes of Qw-Qa fail after we send the success to
>>>> the client, why would the write be replayed on a new bookie?
>>> 
>>> I think the original intention was to avoid having 1 bookie with a
>>> "hole" in the entries sequence. If you then lose one of the 2 bookies,
>>> it would be difficult to know which entries need to be recovered.
>>> 
>> 
>> @Matteo Merli <ma...@gmail.com>  I don't believe we retry the write
>> on a bookie if Qa is satisfied and the write to a bookie timed out.
>> Once the entry is ack'ed to the client we move the LAC and can't
>> retroactively change the active segment's ensemble.
>> 
>>> will get replayed to a new bookie
>> This will happen only if we are not able to satisfy Qa and go through
>> ensemble changes.
>> We change the ensemble and retry the write only if a bookie write fails
>> before satisfying Qa.
>> We have added a new feature called handling "delayed write failure", but
>> that happens only for new entries, not retroactively.
>> 
>> I may be missing something here, and not understanding your point.
>> 
>> Thanks,
>> JV
>> 
>> 
>> 
>> 
>> --
>> Jvrao
>> ---
>> First they ignore you, then they laugh at you, then they fight you, then
>> you win. - Mahatma Gandhi
>> 


Re: Unbounded memory usage for WQ > AQ ?

Posted by Jack Vanlightly <jv...@splunk.com.INVALID>.
Hi,

I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
once confirmed, an entry is not replayed to another bookie. This
leaves a "hole" as the entry is now replicated only to 2 bookies, however,
the new data integrity check that Ivan worked on, when run periodically
will be able to repair that hole.

Jack

On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

>
> On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli <ma...@gmail.com>
> wrote:
>
> > On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> > <ju...@gmail.com> wrote:
> > >
> > > > otherwise the write will timeout internally and it will get replayed
> > > > to a new bookie.
> > > If Qa is met and the writes of Qw-Qa fail after we send the success to
> > > the client, why would the write be replayed on a new bookie?
> >
> > I think the original intention was to avoid having 1 bookie with a
> > "hole" in the entries sequence. If you then lose one of the 2 bookies,
> > it would be difficult to know which entries need to be recovered.
> >
>
> @Matteo Merli <ma...@gmail.com>  I don't believe we retry the write
> on a bookie if Qa is satisfied and the write to a bookie timed out.
> Once the entry is ack'ed to the client we move the LAC and can't
> retroactively change the active segment's ensemble.
>
> >  will get replayed to a new bookie
> This will happen only if we are not able to satisfy Qa and go through
> ensemble changes.
> We change the ensemble and retry the write only if a bookie write fails
> before satisfying Qa.
> We have added a new feature called handling "delayed write failure", but
> that happens only for new entries, not retroactively.
>
> I may be missing something here, and not understanding your point.
>
> Thanks,
> JV
>
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>

Re: Unbounded memory usage for WQ > AQ ?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli <ma...@gmail.com> wrote:

> On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> <ju...@gmail.com> wrote:
> >
> > > otherwise the write will timeout internally and it will get replayed
> > > to a new bookie.
> > If Qa is met and the writes of Qw-Qa fail after we send the success to
> > the client, why would the write be replayed on a new bookie?
>
> I think the original intention was to avoid having 1 bookie with a
> "hole" in the entries sequence. If you then lose one of the 2 bookies,
> it would be difficult to know which entries need to be recovered.
>

@Matteo Merli <ma...@gmail.com>  I don't believe we retry the write
on a bookie if Qa is satisfied and the write to a bookie timed out.
Once the entry is ack'ed to the client we move the LAC and can't
retroactively change the active segment's ensemble.

>  will get replayed to a new bookie
This will happen only if we are not able to satisfy Qa and go through
ensemble changes.
We change the ensemble and retry the write only if a bookie write fails
before satisfying Qa.
We have added a new feature called handling "delayed write failure", but
that happens only for new entries, not retroactively.

I may be missing something here, and not understanding your point.

Thanks,
JV




-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Unbounded memory usage for WQ > AQ ?

Posted by Matteo Merli <ma...@gmail.com>.
On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
<ju...@gmail.com> wrote:
>
> > otherwise the write will timeout internally and it will get replayed
> > to a new bookie.
> If Qa is met and the writes of Qw-Qa fail after we send the success to
> the client, why would the write be replayed on a new bookie?

I think the original intention was to avoid having 1 bookie with a
"hole" in the entries sequence. If you then lose one of the 2 bookies,
it would be difficult to know which entries need to be recovered.

Re: Unbounded memory usage for WQ > AQ ?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
> otherwise the write will timeout internally and it will get replayed to
> a new bookie.
If Qa is met and the writes of Qw-Qa fail after we send the success to the
client, why would the write be replayed on a new bookie?

On Fri, Jan 8, 2021 at 1:47 PM Matteo Merli <mm...@apache.org> wrote:

> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com>
> wrote:
> >
> > Hi Matteo,
> > in this comment you are talking about an issue you saw when WQ is
> greater that AQ
> > https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
> >
> > IIUC you are saying that if one bookie is slow the client continues to
> accumulate references to the entries that still have not received the
> confirmation from it.
> > I think that this is correct.
> >
> > Have you seen problems in production related to this scenario ?
> > Can you tell more about them ?
>
> Yes, for simplicity, assume e=3, w=3, a=2.
>
> If one bookie is slow (not down, just slow), the BK client will ack to
> the user that the entries are written after the first 2 bookie acks.
> In the meantime, it will keep waiting for the 3rd bookie to respond.
> If the bookie responds within the timeout, the entries can now be
> dropped from memory, otherwise the write will timeout internally and
> it will get replayed to a new bookie.
>
> In both cases, the amount of memory used in the client will max at
> "throughput" * "timeout". This can be a large amount of memory and
> easily cause OOM errors.
>
> Part of the problem is that it cannot be solved from outside the BK
> client, since there's no visibility on what entries have 2 or 3 acks
> and therefore it's not possible to apply backpressure. Instead,
> there should be a backpressure mechanism in the BK client itself to
> prevent this kind of issue.
> One possibility there could be to use the same approach as described in
> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
> giving a max memory limit per BK client instance and throttling
> everything after the quota is reached.
>
>
> Matteo
>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Unbounded memory usage for WQ > AQ ?

Posted by Matteo Merli <mm...@apache.org>.
On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eo...@gmail.com> wrote:
>
> Hi Matteo,
> in this comment you are talking about an issue you saw when WQ is greater that AQ
> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>
> IIUC you are saying that if one bookie is slow the client continues to accumulate references to the entries that still have not received the confirmation from it.
> I think that this is correct.
>
> Have you seen problems in production related to this scenario ?
> Can you tell more about them ?

Yes, for simplicity, assume e=3, w=3, a=2.

If one bookie is slow (not down, just slow), the BK client will ack to
the user that the entries are written after the first 2 bookie acks.
In the meantime, it will keep waiting for the 3rd bookie to respond.
If the bookie responds within the timeout, the entries can now be
dropped from memory, otherwise the write will timeout internally and
it will get replayed to a new bookie.

In both cases, the amount of memory used in the client will max at
"throughput" * "timeout". This can be a large amount of memory and
easily cause OOM errors.
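
For example, at 100 MB/s of sustained adds and a 30 second write timeout,
a single slow bookie can pin roughly 100 MB/s * 30 s = 3 GB of entry
buffers in the client before the timeout fires.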

Part of the problem is that it cannot be solved from outside the BK
client, since there's no visibility on what entries have 2 or 3 acks
and therefore it's not possible to apply backpressure. Instead,
there should be a backpressure mechanism in the BK client itself to
prevent this kind of issue.
One possibility there could be to use the same approach as described
in https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
giving a max memory limit per BK client instance and throttling
everything after the quota is reached.
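
As a rough illustration of that idea, here is a minimal Java sketch of
such a limiter (in the spirit of PIP-74; the names are hypothetical, not
an existing BookKeeper API). Writers block once the outstanding entry
bytes hit the quota, and permits are returned only once an entry no
longer needs to be retained:

    import java.util.concurrent.Semaphore;

    class ClientMemoryLimiter {
        private final Semaphore outstandingBytes;

        ClientMemoryLimiter(int maxOutstandingBytes) {
            this.outstandingBytes = new Semaphore(maxOutstandingBytes);
        }

        void beforeAddEntry(int entrySize) throws InterruptedException {
            // Throttles the caller once the quota is reached.
            outstandingBytes.acquire(entrySize);
        }

        void afterEntryReleased(int entrySize) {
            // Called when all WQ bookies have acked (or the write was
            // replayed) and the entry buffer can be dropped.
            outstandingBytes.release(entrySize);
        }
    }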


Matteo