Posted to dev@accumulo.apache.org by "Keith Turner (Updated) (JIRA)" <ji...@apache.org> on 2012/01/27 22:56:09 UTC
[jira] [Updated] (ACCUMULO-327) master lost all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Turner updated ACCUMULO-327:
----------------------------------
Fix Version/s: 1.4.0
This may not be an issue in 1.3 because there is no merge operation in which the master asks a tablet server to split. I am not sure whether there are other tserver operations where synchronization on the connection could cause deadlock.
> master lost all tablet servers
> ------------------------------
>
> Key: ACCUMULO-327
> URL: https://issues.apache.org/jira/browse/ACCUMULO-327
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Environment: running the random walk test on a small cluster
> Reporter: Eric Newton
> Assignee: Keith Turner
> Fix For: 1.4.0
>
>
> Master would occasionally take a long time to collect status information from a tablet server. The connection would time out after the default 120-second RPC timeout. This probably left the connection in a bad state, because I am seeing
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 0
> at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:445)
> at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(TabletClientService.java:893)
> at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(TabletClientService.java:876)
> {noformat}
> If the master is unable to collect statistics on the tablet server, it attempts to halt it (as above) and then it removes its lock in zookeeper.
> Eventually, under the pressure of random walk operations, the master killed every tablet server.
> Guess: a lock in the tablet server is delaying status reporting.
> I wrote a script to process the master logs. It saves each line that refers to the IP address of a tablet server. When it sees the zookeeper lock has been deleted, it prints the last N lines that refer to that tablet server.
> In 7 out of the 10 cases, a split timed out prior to or during the status request failures.
> In 5 cases, the tablet server was hosting the root tablet (a necessary condition when the last server died).
> In 5 cases, the table_table info tablet was being hosted.
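The log-processing approach described in the report above (keep the lines that mention each tablet server, then print the last N of them once its ZooKeeper lock is deleted) could be sketched roughly as follows. This is not the reporter's script. The IPv4 pattern, the `lock deleted` marker string, and the function name `tail_before_lock_loss` are all assumptions made for illustration; the real master log wording would need to be substituted.

```python
import re
from collections import defaultdict, deque

# Assumption: tablet servers appear in master log lines as IPv4 addresses.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical marker; replace with the text the master actually logs
# when it removes a tablet server's ZooKeeper lock.
LOCK_DELETED_MARKER = "lock deleted"


def tail_before_lock_loss(lines, n=20, marker=LOCK_DELETED_MARKER):
    """Return (ip, last_n_lines) for every lock deletion seen in `lines`."""
    # One bounded buffer per tablet server: only the last n lines are kept.
    recent = defaultdict(lambda: deque(maxlen=n))
    reports = []
    for raw in lines:
        line = raw.rstrip("\n")
        ips = set(IP_RE.findall(line))
        for ip in ips:
            recent[ip].append(line)
        if marker in line:
            # Snapshot the history of each server named on the marker line.
            for ip in sorted(ips):
                reports.append((ip, list(recent[ip])))
    return reports


if __name__ == "__main__":
    import sys

    for ip, history in tail_before_lock_loss(sys.stdin, n=20):
        print("==== " + ip + " ====")
        for entry in history:
            print(entry)
```

Run as a filter over a master log on standard input, this prints one block per lock deletion, which makes it easy to check by eye whether a split timeout preceded the status request failures.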
--
This message is automatically generated by JIRA.