Posted to notifications@accumulo.apache.org by "Eric Newton (JIRA)" <ji...@apache.org> on 2014/09/29 15:38:33 UTC
[jira] [Resolved] (ACCUMULO-2480) ha fail-failover failure
[ https://issues.apache.org/jira/browse/ACCUMULO-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Newton resolved ACCUMULO-2480.
-----------------------------------
Resolution: Fixed
> ha fail-failover failure
> ------------------------
>
> Key: ACCUMULO-2480
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2480
> Project: Accumulo
> Issue Type: Bug
> Components: master, tserver
> Environment: running continuous ingest on a 74-node HA NN hadoop 2.3 cluster, 1.6.0-SNAPSHOT.
> Reporter: Eric Newton
> Assignee: Eric Newton
> Fix For: 1.7.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Ran {{service network stop}} on the active NN. The service failed to switch over since the fencing script on the standby failed to run (sshfence).
> After the network interface was re-established, the standby took over.
> However, Accumulo ingest began to have very long hold times since the standby was not providing service for several minutes.
> The master attempted to shut down the tablet servers with long hold times.
> The filesystem hook closed the filesystem, and the servers got stuck endlessly trying to write to the WAL.
> Even after the NN was active, attempts to get a new WAL continued to fail because the filesystem was closed.
> * Why didn't the tablet servers stop?
> * The WAL loop should be able to terminate if it sees an IOException indicating that the filesystem is closed.
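
The second point above could be sketched roughly as follows. This is not the fix that was committed for this issue; it is a minimal illustration with no Hadoop or Accumulo dependency. The names WalRetrySketch, WalOpener, isFilesystemClosed and openWithRetry are hypothetical, and matching on the message text "Filesystem closed" is an assumption about how the closed filesystem surfaces to the caller. The bounded attempt count is also an illustrative choice; the loop described in the report had no such limit.

```java
import java.io.IOException;

public class WalRetrySketch {

    /** Hypothetical stand-in for whatever opens a new write-ahead log. */
    public interface WalOpener {
        void open() throws IOException;
    }

    /**
     * Returns true if the exception or any of its causes carries a message
     * containing "Filesystem closed" (assumed marker for a closed filesystem).
     */
    public static boolean isFilesystemClosed(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            String msg = cause.getMessage();
            if (msg != null && msg.contains("Filesystem closed")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Calls opener.open() until it succeeds, retrying other IOExceptions up to
     * maxAttempts times. A "Filesystem closed" failure is rethrown at once,
     * because retrying against a closed filesystem cannot succeed.
     *
     * @return the number of attempts that were needed
     */
    public static int openWithRetry(WalOpener opener, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                opener.open();
                return attempt;
            } catch (IOException e) {
                if (isFilesystemClosed(e)) {
                    throw e; // unrecoverable: stop looping
                }
                last = e; // possibly transient: remember it and retry
                Thread.sleep(sleepMillis);
            }
        }
        throw last; // every attempt failed with a retryable error
    }
}
```

With this shape, a caller that catches the rethrown exception can stop the server instead of spinning on a filesystem handle that will never reopen.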
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)