Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2018/12/06 14:41:13 UTC

[GitHub] ivakegg opened a new issue #802: Failing to recover WAL leases

ivakegg opened a new issue #802: Failing to recover WAL leases
URL: https://github.com/apache/accumulo/issues/802
 
 
   In 1.8 and 1.9 we are having issues when a tserver dies hard and the master needs to recover the leases (LeaseException).  With Hadoop 2.6.0, we get into a state where the lease cannot be recovered within a reasonable amount of time (that is, within minutes rather than hours).  To get out of this state, we have to manually copy the affected WAL out of the way and then move the copy into the original file's place.  When we came up with this procedure, we verified that the copy appears to contain all of the data.
   I am suggesting that we add a little more to the lease recovery so that, after a configured amount of time, if we cannot communicate with the tserver and its ZooKeeper lock does not exist, we force the lease recovery (copying the file if needed).  This would allow the system to continue operating without manual intervention every time.
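
   To make the proposed check concrete, here is a rough sketch of the decision only. It is not existing Accumulo code: the class, enum, and parameter names (LeaseRecoveryPolicy, Action, forceAfter, and so on) are all made up for illustration, and the actual copy-and-replace of the WAL and the calls that probe the tserver and ZooKeeper are left out.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of the proposed policy. None of these names exist in
 * Accumulo; they are assumptions made for illustration only.
 */
final class LeaseRecoveryPolicy {

    enum Action { KEEP_WAITING, FORCE_RECOVERY }

    /**
     * Decide whether to force lease recovery for a WAL.
     *
     * @param elapsed          how long normal lease recovery has been attempted
     * @param forceAfter       the configured amount of time to wait first (assumed setting)
     * @param tserverReachable whether the master can still communicate with the tserver
     * @param zooLockExists    whether the tserver's ZooKeeper lock still exists
     */
    static Action decide(Duration elapsed, Duration forceAfter,
                         boolean tserverReachable, boolean zooLockExists) {
        boolean waitedLongEnough = elapsed.compareTo(forceAfter) >= 0;
        // Force only when all three conditions from the proposal hold:
        // the wait has expired, the tserver is unreachable, and its lock is gone.
        if (waitedLongEnough && !tserverReachable && !zooLockExists) {
            return Action.FORCE_RECOVERY;
        }
        return Action.KEEP_WAITING;
    }

    public static void main(String[] args) {
        Duration forceAfter = Duration.ofMinutes(5); // assumed value, not a real default
        System.out.println(decide(Duration.ofMinutes(10), forceAfter, false, false));
        System.out.println(decide(Duration.ofMinutes(10), forceAfter, false, true));
    }
}
```

   Requiring both the failed communication and the missing lock is meant to avoid forcing recovery on a WAL that a live tserver might still be writing to; the timeout alone would not be enough.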

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services