Posted to user@bookkeeper.apache.org by Enrico Olivelli <eo...@gmail.com> on 2017/07/19 08:04:39 UTC

BookKeeper#openLedgerNoRecovery hangs

Hi,
in some internal benchmarks we are experiencing openLedgerNoRecovery calls
which remain hung.
As far as I can see, that function basically calls ZooKeeper#getData.

Does anyone have an idea of how this can happen?

Is there any implicit timeout on ZK.getData()? I did not find any way to set
one, and personally I have never run into this problem.

Maybe there is room for an improvement here: adding a timeout on openLedgerXXX
operations. In any case, it is strange that the callback is never called.
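
To make the timeout idea concrete, here is a minimal sketch of bounding the
wait on a callback-style open. It does not use the BookKeeper API: AsyncOpen
and openWithTimeout are hypothetical names made up for illustration, and the
"hung" operation simply never invokes its callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiConsumer;

public class OpenWithTimeout {

    /** Hypothetical stand-in for a callback-style open operation (not a BookKeeper type). */
    interface AsyncOpen<T> {
        void open(long ledgerId, BiConsumer<T, Throwable> callback);
    }

    /** Bridges the callback to a future and waits at most the given time for it. */
    static <T> T openWithTimeout(AsyncOpen<T> op, long ledgerId, long timeout, TimeUnit unit)
            throws TimeoutException, InterruptedException, ExecutionException {
        CompletableFuture<T> result = new CompletableFuture<>();
        op.open(ledgerId, (value, error) -> {
            if (error != null) {
                result.completeExceptionally(error);
            } else {
                result.complete(value);
            }
        });
        // Throws TimeoutException if the callback is not invoked in time.
        return result.get(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        AsyncOpen<String> healthy = (id, cb) -> cb.accept("ledger-" + id, null);
        AsyncOpen<String> hung = (id, cb) -> { /* simulates a callback that is never called */ };

        System.out.println(openWithTimeout(healthy, 7L, 1, TimeUnit.SECONDS));
        try {
            openWithTimeout(hung, 8L, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("open of ledger 8 timed out");
        }
    }
}
```

A guard like this only turns a silent hang into a visible error on the caller
side; it does not explain why the callback is missing.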

Unfortunately the problem happens only in integration tests; maybe I can
work on reproducing it in a BK-only test case.

The case is simple: start ZK + 1 Bookie + 1 BookKeeper, concurrently create
many ledgers, write to them, and concurrently open them with
openLedgerNoRecovery from other threads.
The fact is that there are no errors in the ZK logs or the BK logs.
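
The shape of such a test could be sketched roughly as below. This is only an
outline under assumptions: LedgerClient is a hypothetical interface that a
real test would back with a BookKeeper client, and the fake used in main
hangs on purpose for one ledger id so that the detection logic can be
exercised without ZK or a Bookie.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentOpenHarness {

    /** Hypothetical minimal client; a real test would delegate to BookKeeper. */
    interface LedgerClient {
        long createAndWrite(byte[] payload) throws Exception;
        void openNoRecovery(long ledgerId) throws Exception;
    }

    /** Returns the ids of ledgers whose open did not finish before the deadline. */
    static List<Long> run(LedgerClient client, int ledgers, int threads, long timeoutMs)
            throws Exception {
        ExecutorService writers = Executors.newFixedThreadPool(threads);
        ExecutorService readers = Executors.newFixedThreadPool(threads);
        try {
            // Create and write the ledgers concurrently.
            byte[] payload = "entry".getBytes(StandardCharsets.UTF_8);
            List<Future<Long>> created = new ArrayList<>();
            for (int i = 0; i < ledgers; i++) {
                Callable<Long> create = () -> client.createAndWrite(payload);
                created.add(writers.submit(create));
            }
            // Open each ledger from a different pool while other creations may still run.
            Map<Long, Future<Void>> opens = new LinkedHashMap<>();
            for (Future<Long> f : created) {
                long id = f.get();
                Callable<Void> open = () -> {
                    client.openNoRecovery(id);
                    return null;
                };
                opens.put(id, readers.submit(open));
            }
            // Wait for all opens against one shared deadline and record the stragglers.
            List<Long> hung = new ArrayList<>();
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            for (Map.Entry<Long, Future<Void>> e : opens.entrySet()) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    e.getValue().get(remaining, TimeUnit.NANOSECONDS);
                } catch (TimeoutException ex) {
                    hung.add(e.getKey());
                }
            }
            return hung;
        } finally {
            // Interrupts any task that is still blocked.
            writers.shutdownNow();
            readers.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicLong nextId = new AtomicLong();
        CountDownLatch never = new CountDownLatch(1);
        LedgerClient fake = new LedgerClient() {
            public long createAndWrite(byte[] payload) {
                return nextId.incrementAndGet();
            }
            public void openNoRecovery(long ledgerId) throws InterruptedException {
                if (ledgerId == 3) {
                    never.await(); // simulate one open that never returns
                }
            }
        };
        System.out.println("hung opens: " + run(fake, 10, 4, 500));
    }
}
```

Reporting the ledger ids that never completed makes it easier to match a hang
against the ZK and BK logs for the same ledger.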

Any suggestions?

Thanks

-- Enrico

Re: BookKeeper#openLedgerNoRecovery hangs

Posted by Enrico Olivelli <eo...@gmail.com>.
Looking at this PR from Sijie, I noticed that there is a rate limiter for
our internal subclass of the ZooKeeper client:
https://github.com/apache/bookkeeper/pull/264

The rate limiter is not enabled and currently cannot be enabled.
I wonder whether I hit a bug in our getData or ZkRetryRunnable, or whether it
would be enough to enable the rate limiter.

@Sijie
I left a comment on the PR. For me it is OK, but it seems to lack support
for the client-side BookKeeper: it enables the rate limiter only on the Bookie.

-- Enrico



2017-07-19 11:27 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:

>
>
> On Wed, 19 Jul 2017, 11:11 Sijie Guo <gu...@gmail.com> wrote:
>
>> On Wed, Jul 19, 2017 at 4:04 PM, Enrico Olivelli <eo...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> in some internal benchmarks we are experiencing openLedgerNoRecovery
>>> calls which remain hung.
> >>> I see that basically that function calls ZooKeeper#getData.
>>>
>>
>>> Does anyone have an idea of how it can happen ?
>>>
>>
> >> What version are you testing? Is it related to your recent change bumping
> >> the ZooKeeper version? If that's the case, we should consider rolling back
> >> the ZooKeeper version.
>>
>
> 3.5.1 and 3.5.3
>
>>
>>
>>>
>>> Is there any implicit timeout on ZK.getData() ? I did not find any way
>>> and personally I never got into this problem.
>>>
>>
> >> As far as I know, there is no timeout on ZooKeeper requests. It would be
> >> a good question for the ZooKeeper community.
>>
>
> I will do that
>
>>
>>
>>>
>>> Maybe there is space for an improvement to add a timeout on
>>> openLedgerXXX operations, but anyway it is strange that the callback is
>>> never called.
>>>
> >>> Unfortunately the problem happens only in integration tests, maybe I can
>>> work to reproduce it on a BK only test case.
>>>
>>> The case is simple: start ZK + 1 Bookie + 1 BookKeeper, create
> >>> concurrently many ledgers, write and concurrently open them with
>>> openLedgerNoRecovery from other threads.
>>> The fact is that no error is on ZK logs and BK logs
>>>
>>
> >> Can you turn on debug logging for the BookKeeper client and also for
> >> ZooKeeper? There might be useful logs to check.
>>
>
> Yes, I am logging at INFO; I will try at DEBUG
>
>>
> >> Another option is to take a TCP dump to trace the ZooKeeper calls and
> >> see whether the getData request and response are received on both sides.
>>
>>
>>>
>>> Any suggestion ?
>>>
>>
>
> Thank you again
> Enrico
>
>>
>>> Thanks
>>>
>>> -- Enrico
>>>
>>>
>>> --
>
>
> -- Enrico Olivelli
>

Re: BookKeeper#openLedgerNoRecovery hangs

Posted by Enrico Olivelli <eo...@gmail.com>.
On Wed, 19 Jul 2017, 11:11 Sijie Guo <gu...@gmail.com> wrote:

> On Wed, Jul 19, 2017 at 4:04 PM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
>> Hi,
>> in some internal benchmarks we are experiencing openLedgerNoRecovery
>> calls which remain hung.
>> I see that basically that function calls ZooKeeper#getData.
>>
>
>> Does anyone have an idea of how it can happen ?
>>
>
> What version are you testing? Is it related to your recent change bumping
> the ZooKeeper version? If that's the case, we should consider rolling back
> the ZooKeeper version.
>

3.5.1 and 3.5.3

>
>
>>
>> Is there any implicit timeout on ZK.getData() ? I did not find any way
>> and personally I never got into this problem.
>>
>
> As far as I know, there is no timeout on ZooKeeper requests. It would be a
> good question for the ZooKeeper community.
>

I will do that

>
>
>>
>> Maybe there is space for an improvement to add a timeout on openLedgerXXX
>> operations, but anyway it is strange that the callback is never called.
>>
>> Unfortunately the problem happens only in integration tests, maybe I can
>> work to reproduce it on a BK only test case.
>>
>> The case is simple: start ZK + 1 Bookie + 1 BookKeeper, create
>> concurrently many ledgers, write and concurrently open them with
>> openLedgerNoRecovery from other threads.
>> The fact is that no error is on ZK logs and BK logs
>>
>
> Can you turn on debug logging for the BookKeeper client and also for
> ZooKeeper? There might be useful logs to check.
>

Yes, I am logging at INFO; I will try at DEBUG

>
> Another option is to take a TCP dump to trace the ZooKeeper calls and
> see whether the getData request and response are received on both sides.
>
>
>>
>> Any suggestion ?
>>
>

Thank you again
Enrico

>
>> Thanks
>>
>> -- Enrico
>>
>>
>> --


-- Enrico Olivelli

Re: BookKeeper#openLedgerNoRecovery hangs

Posted by Sijie Guo <gu...@gmail.com>.
On Wed, Jul 19, 2017 at 4:04 PM, Enrico Olivelli <eo...@gmail.com>
wrote:

> Hi,
> in some internal benchmarks we are experiencing openLedgerNoRecovery calls
> which remain hung.
> I see that basically that function calls ZooKeeper#getData.
>

> Does anyone have an idea of how it can happen ?
>

What version are you testing? Is it related to your recent change bumping
the ZooKeeper version? If that's the case, we should consider rolling back
the ZooKeeper version.


>
> Is there any implicit timeout on ZK.getData() ? I did not find any way and
> personally I never got into this problem.
>

As far as I know, there is no timeout on ZooKeeper requests. It would be a
good question for the ZooKeeper community.


>
> Maybe there is space for an improvement to add a timeout on openLedgerXXX
> operations, but anyway it is strange that the callback is never called.
>
> Unfortunately the problem happens only in integration tests, maybe I can
> work to reproduce it on a BK only test case.
>
> The case is simple: start ZK + 1 Bookie + 1 BookKeeper, create
> concurrently many ledgers, write and concurrently open them with
> openLedgerNoRecovery from other threads.
> The fact is that no error is on ZK logs and BK logs
>

Can you turn on debug logging for the BookKeeper client and also for
ZooKeeper? There might be useful logs to check.

Another option is to take a TCP dump to trace the ZooKeeper calls and see
whether the getData request and response are received on both sides.


>
> Any suggestion ?
>
> Thanks
>
> -- Enrico
>
>
>