Posted to notifications@accumulo.apache.org by "Christopher Tubbs (JIRA)" <ji...@apache.org> on 2016/07/07 21:19:11 UTC

[jira] [Comment Edited] (ACCUMULO-4359) Accumulo client stuck in infinite loop when Kerberos ticket expires

    [ https://issues.apache.org/jira/browse/ACCUMULO-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366785#comment-15366785 ] 

Christopher Tubbs edited comment on ACCUMULO-4359 at 7/7/16 9:18 PM:
---------------------------------------------------------------------

I know there were some issues with older Hadoop versions... perhaps you need to update to 2.6.4 or later? Maybe [~elserj] can confirm, or can say whether this is a new issue.


was (Author: ctubbsii):
I know there were some issues with older Hadoop versions... perhaps you need to update to 2.6.4 or later? Maybe [~elserj] can confirm.

> Accumulo client stuck in infinite loop when Kerberos ticket expires
> -------------------------------------------------------------------
>
>                 Key: ACCUMULO-4359
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4359
>             Project: Accumulo
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.7.2
>         Environment: Problem only exists when Kerberos is turned on.
>            Reporter: Russ Weeks
>            Assignee: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> If an Accumulo client tries to send an RPC to a tserver but the client's token is expired, it will get stuck in an infinite loop [here|https://github.com/apache/accumulo/blob/1.7/core/src/main/java/org/apache/accumulo/core/client/impl/ServerClient.java#L102].
> I'm setting the priority to "minor" because it's actually pretty difficult to put the system into this state: you have to create the client with a valid token, let the token expire, and then try to use the client. We hit this by accident in the cleanup phase of a very long-running MR job; the workaround (a.k.a. the right way to do it) is to create a new client instead of re-using the old one.
> On the tserver side, we get an exception like this every 100ms:
> {noformat}
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
> 	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:360)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
> 	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> On the client side, no output is produced unless debug logging is turned on for o.a.a.core.client.impl.ServerClient, in which case you see a bunch of "Failed to find TGT" errors.
> I'm not sure of the best way to fix this (advice is welcome), but a binary exponential backoff (perhaps capped at 30s) instead of a retry every 100ms would at least lighten the load on the tservers.
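The capped exponential backoff suggested above could be sketched roughly as follows. This is only an illustration of the retry-interval math, not code from the Accumulo tree; the class and method names (`BackoffSketch`, `backoffMillis`) are made up, and the 100ms starting interval mirrors the current fixed retry period described in the report.

```java
// Hedged sketch of the proposed binary exponential backoff for the
// ServerClient retry loop. Assumption: start at the current 100ms
// interval and double per failed attempt, capped at 30 seconds.
public class BackoffSketch {
    static final long INITIAL_MS = 100;    // current fixed retry interval
    static final long CAP_MS = 30_000;     // suggested 30s cap

    // Wait time before retry number `attempt` (0-based): 100ms * 2^attempt,
    // never exceeding the cap. The shift is clamped to avoid long overflow.
    static long backoffMillis(int attempt) {
        long wait = INITIAL_MS * (1L << Math.min(attempt, 20));
        return Math.min(wait, CAP_MS);
    }

    public static void main(String[] args) {
        // Show how the interval grows: 100, 200, 400, ... then flattens at 30s.
        for (int i = 0; i < 12; i++) {
            System.out.println("attempt " + i + " -> wait " + backoffMillis(i) + " ms");
        }
    }
}
```

With these parameters the tserver would see at most a handful of failed GSS handshakes per client in the first few seconds, after which retries arrive no more than once every 30s instead of ten times per second.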



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)