Posted to dev@slider.apache.org by Steve Loughran <st...@hortonworks.com> on 2015/02/04 13:45:21 UTC

How do HBase and Accumulo handle failures during client RPC operations

Ted & Billie

I'm updating the YARN-2683 registry documentation, including some guidelines on how to handle failure to communicate with a remote server.

This is a problem which the Slider REST client has itself. Currently the REST client fails immediately; if a new attempt is made to look up the
AM service API URL, it will pick up a new value.

The API client is therefore not handling the rebinding itself, which is a bit of a get-out; I'm relying on the AM not failing very often. Provided
the clients don't hold onto their client API objects for long, they "should" be OK.
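
As a rough illustration of the behaviour described above (not the actual Slider client code), the sketch below resolves the service URL on every call and lets any failure propagate. The `lookup_url` and `send` callables are hypothetical stand-ins for a registry lookup and a REST call.

```python
from typing import Callable


def call_once(lookup_url: Callable[[], str], send: Callable[[str], str]) -> str:
    """Resolve the endpoint, make one call, and let any failure propagate.

    Nothing is cached between calls, so a later attempt by the caller
    re-resolves the URL and sees whatever the registry holds by then.
    """
    url = lookup_url()   # hypothetical registry lookup
    return send(url)     # hypothetical REST call; may raise ConnectionError
```

The rebinding here is left entirely to the caller: it only works if the caller comes back and tries again after the failure.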

What do HBase and Accumulo do here? I know they publish their binding info via ZK, but how do they fail over?

Re: How do HBase and Accumulo handle failures during client RPC operations

Posted by Ted Yu <yu...@gmail.com>.

What Josh described about clients failing over to a new server mostly
applies to HBase as well.

Cheers

Re: How do HBase and Accumulo handle failures during client RPC operations

Posted by Josh Elser <jo...@gmail.com>.

For Accumulo, like you said, information is published in ZK for clients 
to find. Thinking about just the Accumulo master process (tabletservers 
follow the same principle but in a slightly different way), clients will 
cache that location from ZK and then on some RPC transport failure (e.g. 
ConnectException -- Thrift exception in Accumulo's case), the client 
will invalidate that cache, refresh the location from ZK and try again.

The exit is either that the client gets a connection to the master after
retrying enough, or that the code just gives up. I think we tend to keep
spinning in Accumulo (perhaps more than we should), which hides
"expected" failures from clients completely (clients don't have to be
aware that they're talking to a different master than before), but
it can make transport issues harder to diagnose.
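
The cache, invalidate, refresh and retry cycle described above could be sketched roughly as follows. This is a minimal illustration, not Accumulo's or HBase's actual client code: `RebindingClient`, `lookup`, `max_attempts` and `backoff_seconds` are all hypothetical names, and Python's built-in `ConnectionError` stands in for whatever transport exception the RPC layer raises.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RebindingClient:
    """Caches a service location and re-resolves it after transport failures."""

    def __init__(self, lookup: Callable[[], str], max_attempts: int = 5,
                 backoff_seconds: float = 0.5) -> None:
        self._lookup = lookup              # hypothetical: reads the location from ZK
        self._max_attempts = max_attempts  # bound on attempts, so the loop can give up
        self._backoff_seconds = backoff_seconds
        self._cached: Optional[str] = None

    def call(self, rpc: Callable[[str], T]) -> T:
        last_error: Optional[Exception] = None
        for attempt in range(1, self._max_attempts + 1):
            if self._cached is None:
                self._cached = self._lookup()   # refresh from the registry
            try:
                return rpc(self._cached)
            except ConnectionError as err:      # transport failures only
                last_error = err
                self._cached = None             # invalidate the stale location
                if attempt < self._max_attempts:
                    time.sleep(self._backoff_seconds)
        raise ConnectionError(
            f"gave up after {self._max_attempts} attempts") from last_error
```

Bounding the attempts and chaining the last transport error into the final exception is one way to keep the failover invisible in the common case while still leaving something to diagnose when the client does give up.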
