Posted to oak-dev@jackrabbit.apache.org by Stefan Eissing <st...@greenbytes.de> on 2017/01/12 13:19:37 UTC

oak lease handling and cluster recovery

If you are not into lease handling and cluster node recovery, this might not be for you.


In a cloud-based cluster app with Oak, we seem to encounter node VM lockups now and then. We are still drilling down with the cloud hosting provider on what is causing this, but whatever it is, nodes are experiencing it. So far it has always hit a single node at a time while the other cluster nodes ran unaffected. However, when the freeze lasts just long enough, lease renewal fails and Oak shuts down.

This is rather painful. And before you say "well, tell the hosting provider to fix it", think of DoS attacks or broken cables in the cloud; disks can be network-based storage, etc. Clocks can get out of sync too and jump ahead. The bigger the cluster, the higher the likelihood.

Two observations that are (hopefully) relevant for the people on this list concerned with this part of Oak:

  A. The freeze duration of no return is "leaseEndTime - failureMargin", i.e. 20 seconds earlier than it should be, due to a bug in the ClusterNodeInfo instantiation. I will file a ticket once Apache Jira is working again.
  B. If A gets fixed, a renewal attempted within the first 4 seconds after the node resumes activity will be successful, *no matter the duration of the freeze* (my reading of the code, unless the cluster node data in the store was changed in the meantime...). See the sketch below this list.
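
To make the timing in A and B concrete, here is a minimal, self-contained sketch of the kind of deadline check I mean. The class name, field names and the 120s/20s constants are my own assumptions for illustration; this is not the actual ClusterNodeInfo code.

    // Hypothetical model of a lease deadline check after a VM freeze.
    // Names and constants are illustrative only, not Oak's ClusterNodeInfo.
    public class LeaseCheckSketch {

        static final long LEASE_DURATION_MS = 120_000; // assumed lease length
        static final long FAILURE_MARGIN_MS = 20_000;  // assumed safety margin

        long leaseEndTime; // absolute time (ms) at which the current lease expires

        LeaseCheckSketch(long now) {
            this.leaseEndTime = now + LEASE_DURATION_MS;
        }

        // Returns true if a renewal attempted at 'now' is still considered in
        // time locally. Subtracting the margin here puts the effective point
        // of no return 20 seconds before the lease end (cf. observation A).
        boolean canStillRenew(long now) {
            return now < leaseEndTime - FAILURE_MARGIN_MS;
        }

        // A successful renewal simply pushes the lease end forward from 'now';
        // the length of the preceding freeze does not enter this calculation
        // (cf. observation B).
        void renew(long now) {
            leaseEndTime = now + LEASE_DURATION_MS;
        }

        public static void main(String[] args) {
            long t0 = System.currentTimeMillis();
            LeaseCheckSketch lease = new LeaseCheckSketch(t0);
            // Assume the freeze starts right after the lease was taken at t0:
            // with the margin subtracted in the check, the point of no return
            // is 100s into the freeze, not the full 120s lease duration.
            System.out.println("renewable after  90s freeze: " + lease.canStillRenew(t0 + 90_000));
            System.out.println("renewable after 110s freeze: " + lease.canStillRenew(t0 + 110_000));
        }
    }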

So, my questions to the experts are: will B work? Should it? What are the risks? Local index copies? Can this be mitigated?

I'm afraid such scenarios are moving from purely academic to realistic in cloud/VM hosting.

Cheers,

Stefan Eissing

<green/>bytes GmbH
Hafenstrasse 16
48155 Münster
www.greenbytes.de


Re: oak lease handling and cluster recovery

Posted by Julian Reschke <ju...@gmx.de>.
On 2017-01-12 14:19, Stefan Eissing wrote:
> If you are not into lease handling and cluster node recovery, this might not be for you.
>
>
> In a cloud-based cluster app with Oak, we seem to encounter node VM lockups now and then. We are still drilling down with the cloud hosting provider on what is causing this, but whatever it is, nodes are experiencing it. So far it has always hit a single node at a time while the other cluster nodes ran unaffected. However, when the freeze lasts just long enough, lease renewal fails and Oak shuts down.
>
> This is rather painful. And before you say "well, tell the hosting provider to fix it", think of DoS attacks or broken cables in the cloud; disks can be network-based storage, etc. Clocks can get out of sync too and jump ahead. The bigger the cluster, the higher the likelihood.
>
> Two observations that are (hopefully) relevant for the people on this list concerned with this part of Oak:
>
>   A. The freeze duration of no return is "leaseEndTime - failureMargin", i.e. 20 seconds earlier than it should be, due to a bug in the ClusterNodeInfo instantiation. I will file a ticket once Apache Jira is working again.
> ...

FWIW, this code will only help in single-node scenarios anyway. If there 
is another node running concurrently which can see the persistence, 
it'll declare that node as dead by updating the associated 
ClusterNodeInfo, and subsequently run LastRevRecovery on it.
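
For readers less familiar with that path, here is a rough, self-contained sketch of the peer-side behaviour described above. The class and field names are mine and purely illustrative; this is not Oak's actual ClusterNodeInfo or recovery code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: a healthy node notices that another node's lease
    // has expired in the shared store and flags it for recovery. Illustrative
    // only, not Oak's implementation.
    public class PeerRecoverySketch {

        // Minimal stand-in for a persisted cluster node entry.
        static class NodeEntry {
            final int clusterId;
            volatile long leaseEndTime;
            volatile boolean recoveryNeeded;

            NodeEntry(int clusterId, long leaseEndTime) {
                this.clusterId = clusterId;
                this.leaseEndTime = leaseEndTime;
            }
        }

        // Shared "persistence": cluster id -> node entry.
        final Map<Integer, NodeEntry> store = new ConcurrentHashMap<>();

        // Called periodically by an active node: any entry whose lease has
        // expired is declared dead and flagged, so that recovery (e.g. a
        // LastRevRecovery-style run) happens before the frozen node may rejoin.
        void checkPeers(long now) {
            for (NodeEntry e : store.values()) {
                if (!e.recoveryNeeded && e.leaseEndTime < now) {
                    e.recoveryNeeded = true;
                    System.out.println("node " + e.clusterId
                            + ": lease expired, marking for recovery");
                }
            }
        }

        public static void main(String[] args) {
            PeerRecoverySketch sketch = new PeerRecoverySketch();
            long now = System.currentTimeMillis();
            // Node 2's lease expired 30 seconds ago, e.g. because it is frozen.
            sketch.store.put(2, new NodeEntry(2, now - 30_000));
            sketch.checkPeers(now);
        }
    }

Once a peer has flagged the entry like this, a later renewal attempt by the woken-up node should notice that the stored cluster node data changed underneath it, which is exactly the caveat Stefan put in parentheses under observation B.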

I do agree that it seems the retry loop will never do the right 
thing, and something needs to be fixed here.

Best regards, Julian

(see also OAK-5446 once Jira is up again)