Posted to user@cassandra.apache.org by Mohit Agarwal <co...@gmail.com> on 2012/08/17 11:07:15 UTC

Understanding UnavailableException

Hi guys,

I am trying to understand what happens when an UnavailableException is
thrown.

a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
My understanding is that if one of the nodes is down and the coordinator
node is aware of that (through gossip), then it will respond to the request
with an UnavailableException. Is this correct?

b) What happens if the coordinator isn't aware of a node being down,
sends the request to all the nodes, and never hears back from one of the
nodes? Would this result in a TimedOutException or an UnavailableException?

c) I am trying to understand the cases where the client receives an error,
but data could have been inserted into Cassandra. One such case is the
TimedOutException. Are there any other situations like these?

Thanks,
Mohit

Re: Understanding UnavailableException

Posted by Nick Bailey <ni...@datastax.com>.
> Last time I checked, this was not true for batch writes. The row
> mutations were started sequentially (i.e., for each mutation, check
> availability, then kick off an asynchronous write), so it was possible
> for the first to succeed, and the second to fail with an
> UnavailableException.
>

That's a good point, thanks for bringing that up. It is still the case
that if a batch mutate fails, you don't know which rows could have
succeeded and which row caused the failure. It is generally good
advice to optimize your application in other areas
(threadpools/concurrency) before attempting to optimize with batches.

Re: Understanding UnavailableException

Posted by Russell Haering <ru...@gmail.com>.
On Fri, Aug 17, 2012 at 8:00 AM, Nick Bailey <ni...@datastax.com> wrote:
> This is actually incorrect. If you get an UnavailableException, the
> write was rejected by the coordinator and was not written anywhere.

Last time I checked, this was not true for batch writes. The row
mutations were started sequentially (i.e., for each mutation, check
availability, then kick off an asynchronous write), so it was possible
for the first to succeed, and the second to fail with an
UnavailableException.

We had this exact thing happen to us with a custom secondary indexing
system, where we wrote the index but not the data, which at the time
broke a few assumptions we had made.

I would support changing this so that availability is evaluated for
all rows in an initial pass, and once that pass has completed there
would be no circumstances under which an UnavailableException would be
thrown. But the whole thing is of limited value because you could
still get a TimedOutException; there's no way around needing to handle
the "I don't know what got written" scenario.
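
The difference between the sequential behavior described above and the
proposed initial availability pass can be sketched in Python. This is a toy
model, not Cassandra's batch code: apply_sequentially, apply_with_precheck,
is_available, and the dict standing in for the cluster are all hypothetical
names chosen for illustration.

```python
class UnavailableException(Exception):
    """Raised when the replicas for a row are known to be unavailable."""


def apply_sequentially(mutations, is_available, store):
    """Check availability and write one mutation at a time.

    If a later row is unavailable, earlier rows have already been written.
    """
    for row_key, value in mutations:
        if not is_available(row_key):
            raise UnavailableException(row_key)
        store[row_key] = value


def apply_with_precheck(mutations, is_available, store):
    """Check availability for every row first, then write.

    With this ordering, an UnavailableException implies nothing was written.
    """
    unavailable = [row_key for row_key, _ in mutations if not is_available(row_key)]
    if unavailable:
        raise UnavailableException(unavailable)
    for row_key, value in mutations:
        store[row_key] = value


if __name__ == "__main__":
    # Hypothetical batch: an index entry followed by the data it points to.
    batch = [("index:1", "data:1"), ("data:1", "payload")]

    def only_index_available(row_key):
        return row_key.startswith("index:")

    for apply in (apply_sequentially, apply_with_precheck):
        store = {}
        try:
            apply(batch, only_index_available, store)
        except UnavailableException:
            pass
        print(apply.__name__, "wrote", sorted(store))
```

In this model the sequential version leaves the index row behind without its
data, which is the situation described above. The pre-check version avoids
that particular case, but it does nothing about timeouts, so a client still
has to handle writes whose outcome is unknown.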

Re: Understanding UnavailableException

Posted by Mohit Agarwal <co...@gmail.com>.
Thanks, Nick, for your answers. The blog post is very well written and was
much needed, I guess.

On Fri, Aug 17, 2012 at 8:30 PM, Nick Bailey <ni...@datastax.com> wrote:

> This blog post should help:
>
> http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
>
> But to answer your question:
>
> >> UnavailableException is a bit tricky. It means that not all replicas
> >> required by the CL received the update. You do not actually know whether
> >> the update was stored or not, or what went wrong.
> >>
>
> This is actually incorrect. If you get an UnavailableException, the
> write was rejected by the coordinator and was not written anywhere.
>
>
> >>> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
> >>> My understanding is that if one of the nodes is down and the coordinator
> >>> node is aware of that (through gossip), then it will respond to the request
> >>> with an UnavailableException. Is this correct?
>
> Correct
>
> >>>
> >>> b) What happens if the coordinator isn't aware of a node being down,
> >>> sends the request to all the nodes, and never hears back from one of the
> >>> nodes? Would this result in a TimedOutException or an UnavailableException?
> >>>
>
> You will get a TimedOutException
>
> >>> c) I am trying to understand the cases where the client receives an
> >>> error, but data could have been inserted into Cassandra. One such case is
> >>> the TimedOutException. Are there any other situations like these?
> >>>
>
> This should be the only case.
>

Re: Understanding UnavailableException

Posted by Nick Bailey <ni...@datastax.com>.
This blog post should help:

http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure

But to answer your question:

>> UnavailableException is a bit tricky. It means that not all replicas
>> required by the CL received the update. You do not actually know whether
>> the update was stored or not, or what went wrong.
>>

This is actually incorrect. If you get an UnavailableException, the
write was rejected by the coordinator and was not written anywhere.


>>> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
>>> My understanding is that if one of the nodes is down and the coordinator
>>> node is aware of that (through gossip), then it will respond to the request
>>> with an UnavailableException. Is this correct?

Correct

>>>
>>> b) What happens if the coordinator isn't aware of a node being down,
>>> sends the request to all the nodes, and never hears back from one of the
>>> nodes? Would this result in a TimedOutException or an UnavailableException?
>>>

You will get a TimedOutException

>>> c) I am trying to understand the cases where the client receives an
>>> error, but data could have been inserted into Cassandra. One such case is
>>> the TimedOutException. Are there any other situations like these?
>>>

This should be the only case.
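
The two failure modes in the answers above can be modeled in a few lines of
Python. This is only a sketch under simplifying assumptions (a single
coordinator, no hinted handoff, and a missing ack standing in for a real
timeout); it is not Cassandra's implementation. Replica, seen_alive,
required_acks, and coordinate_write are hypothetical names.

```python
from dataclasses import dataclass, field


class UnavailableException(Exception):
    """Coordinator rejected the write before contacting any replica."""


class TimedOutException(Exception):
    """The write was sent, but too few replicas acknowledged it."""


@dataclass
class Replica:
    name: str
    alive: bool = True        # real state of the node
    seen_alive: bool = True   # what the coordinator believes (via gossip)
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        """Store the value and ack with True; a dead node never acks."""
        if not self.alive:
            return False
        self.data[key] = value
        return True


def required_acks(consistency_level, replication_factor):
    """Number of acks the coordinator waits for at a given level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def coordinate_write(replicas, key, value, consistency_level):
    needed = required_acks(consistency_level, len(replicas))
    believed_alive = [r for r in replicas if r.seen_alive]

    # Case (a): the coordinator already knows it cannot satisfy the level,
    # so it fails fast and nothing is written anywhere.
    if len(believed_alive) < needed:
        raise UnavailableException(f"need {needed}, see {len(believed_alive)}")

    # Case (b): the write goes out; live replicas store it even if the
    # request as a whole later fails for lack of acks.
    acks = sum(r.write(key, value) for r in believed_alive)
    if acks < needed:
        raise TimedOutException(f"need {needed}, got {acks}")
    return acks
```

In this model, a CL.ALL write with one node down and known to be down raises
UnavailableException and leaves every replica untouched. If the coordinator
still believes the node is up, the same write raises TimedOutException while
the two live replicas keep the data, which is why a timeout leaves the client
unsure of what was stored.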

Re: Understanding UnavailableException

Posted by Mohit Agarwal <co...@gmail.com>.
Does this mean that the coordinator sends requests to all nodes, even when
it knows via gossip that a sufficient number of nodes is not available?

On Fri, Aug 17, 2012 at 4:49 PM, Maciej Miklas <ma...@gmail.com> wrote:

> UnavailableException is a bit tricky. It means that not all replicas
> required by the CL received the update. You do not actually know whether
> the update was stored or not, or what went wrong.
>
> This is why writing with CL.ALL can be problematic. It is enough for a
> single replica to be offline and you will get an exception. Remember also
> that CL.ALL means all replicas in all data centers, not only the local DC.
> Writing with LOCAL_QUORUM could be a better idea.
>
> There is only one CL where an exception guarantees that the data was really
> not stored: CL.ANY with hinted handoff enabled.
>
> One more thing: a write always goes to all replicas, regardless of the
> provided CL. The client request blocks only until the required replicas
> respond; however, the responses arrive asynchronously. This means that when
> you write with a lower CL, the replicas get the data at the same speed; your
> client just does not wait for acknowledgment from all of them.
>
> Ciao,
> Maciej
>
>
>

Re: Understanding UnavailableException

Posted by Maciej Miklas <ma...@gmail.com>.
UnavailableException is a bit tricky. It means that not all replicas
required by the CL received the update. You do not actually know whether
the update was stored or not, or what went wrong.

This is why writing with CL.ALL can be problematic. It is enough for a
single replica to be offline and you will get an exception. Remember also
that CL.ALL means all replicas in all data centers, not only the local DC.
Writing with LOCAL_QUORUM could be a better idea.

There is only one CL where an exception guarantees that the data was really
not stored: CL.ANY with hinted handoff enabled.

One more thing: a write always goes to all replicas, regardless of the
provided CL. The client request blocks only until the required replicas
respond; however, the responses arrive asynchronously. This means that when
you write with a lower CL, the replicas get the data at the same speed; your
client just does not wait for acknowledgment from all of them.
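
The last point can be illustrated with a small Python sketch. It assumes
every replica is up and only models timing: simulate_write and the latency
figures are hypothetical and are not measurements from a real cluster.

```python
def simulate_write(replica_latencies_ms, needed_acks):
    """Return (client_wait_ms, replicas_written) for one write.

    replica_latencies_ms maps a replica name to the time it takes to ack.
    The write is sent to every replica; the client only waits for the
    needed_acks fastest acknowledgments.
    """
    if not 1 <= needed_acks <= len(replica_latencies_ms):
        raise ValueError("needed_acks must be between 1 and the replica count")

    ack_times = sorted(replica_latencies_ms.values())
    client_wait_ms = ack_times[needed_acks - 1]

    # Every replica receives the write, whatever the consistency level.
    replicas_written = sorted(replica_latencies_ms)
    return client_wait_ms, replicas_written


if __name__ == "__main__":
    latencies = {"a": 5, "b": 20, "c": 80}  # made-up numbers
    for level, needed in (("ONE", 1), ("QUORUM", 2), ("ALL", 3)):
        wait, written = simulate_write(latencies, needed)
        print(f"CL.{level}: client waits {wait} ms, written to {written}")
```

The set of replicas written is the same at every level; only the time the
client spends waiting changes.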

Ciao,
Maciej

