Posted to user@cassandra.apache.org by Shalom Sagges <sh...@liveperson.com> on 2017/03/09 15:32:26 UTC

A Single Dropped Node Fails Entire Read Queries

Hi Cassandra Users,

I hope someone could help me understand the following scenario:

Version: 3.0.9
3 nodes per DC
3 DCs in the cluster.
Consistency Local_Quorum.

I did a small resiliency test and dropped a node to check the availability
of the data.
What I assumed would happen is nothing at all. If a node is down in a 3-node
DC, Local_Quorum should still be satisfied.
However, during the first ~10 seconds after stopping the service, I got
timeout errors (tried it both from the client and from cqlsh).
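
For reference, the cqlsh side of the test boils down to roughly the sketch
below (the keyspace, table and key value are the ones shown later in the
thread):

-- with RF=3 per DC, LOCAL_QUORUM needs floor(3/2)+1 = 2 replicas in the local DC
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM mykeyspace.test WHERE column1='xxxxx';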

This is the error I get:
*ServerError:
com.google.common.util.concurrent.UncheckedExecutionException:
com.google.common.util.concurrent.UncheckedExecutionException:
java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
received only 4 responses.*


After ~10 seconds, the same query is successful with no timeout errors. The
dropped node is still down.

Any idea what could cause this and how to fix it?

Thanks!


Re: A Single Dropped Node Fails Entire Read Queries

Posted by Shalom Sagges <sh...@liveperson.com>.
Upgrading to 3.0.12 solved the issue.
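
For anyone hitting the same thing, a quick sanity check that every node is
really running the new version (a rough sketch, assuming shell access to each
node):

nodetool version
cqlsh -e "SELECT release_version FROM system.local;"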

Thanks a lot for the help Joel!


Shalom Sagges
DBA
T: +972-74-700-4035
<http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
<http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
<https://liveperson.docsend.com/view/8iiswfp>


On Tue, Mar 14, 2017 at 10:44 AM, Shalom Sagges <sh...@liveperson.com>
wrote:

> Thanks a lot Joel!
>
> I'll go ahead and upgrade.
>
> Thanks again!
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035 <+972%2074-700-4035>
> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
> <https://liveperson.docsend.com/view/8iiswfp>
>
>
> On Mon, Mar 13, 2017 at 7:27 PM, Joel Knighton <joel.knighton@datastax.com
> > wrote:
>
>> It's possible that you're hitting https://issues.apache.
>> org/jira/browse/CASSANDRA-13009 .
>>
>> In (simplified) summary, the read query picks the right number of
>> endpoints fairly early in its execution. Because the down node has not been
>> detected as down yet, it may be one of the nodes. When this node doesn't
>> answer, it is likely that speculative retry will kick in after a certain
>> amount of time and query an up node. This feature is present and working in
>> the earlier releases you tested. Unfortunately, percentile-based
>> speculative retry wasn't working as intended in 2.2+ until fixed in
>> CASSANDRA-13009, which went into 2.2.9/3.0.11+.
>>
>> It may be worth evaluating the latest 3.0.x release.
>>
>> On Mon, Mar 13, 2017 at 11:48 AM, Shalom Sagges <sh...@liveperson.com>
>> wrote:
>>
>>> Just some more info, I've tried the same scenario on 2.0.14 and 2.1.15
>>> and didn't encounter such errors.
>>> What I did find is that the timeout errors appear only until the node is
>>> discovered as "DN" in nodetool status. Once the node is in DN status, the
>>> errors stop and the data is retrieved.
>>>
>>> Could this be a bug in 3.0.9? Or some sort of misconfiguration I missed?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Shalom Sagges
>>> DBA
>>> T: +972-74-700-4035 <+972%2074-700-4035>
>>> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
>>> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
>>> <https://liveperson.docsend.com/view/8iiswfp>
>>>
>>>
>>> On Sun, Mar 12, 2017 at 10:21 AM, Shalom Sagges <sh...@liveperson.com>
>>> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> If a node suddenly fails, and there are other replicas that can still
>>>> satisfy the consistency level, shouldn't the request succeed regardless of
>>>> the failed node?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Shalom Sagges
>>>> DBA
>>>> T: +972-74-700-4035 <+972%2074-700-4035>
>>>> <http://www.linkedin.com/company/164748>
>>>> <http://twitter.com/liveperson> <http://www.facebook.com/LivePersonInc> We
>>>> Create Meaningful Connections
>>>> <https://liveperson.docsend.com/view/8iiswfp>
>>>>
>>>>
>>>> On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler <michael@pbandjelly.org
>>>> > wrote:
>>>>
>>>>> I may be mistaken on the exact configuration option for the timeout
>>>>> you're hitting, but I believe this may be the general
>>>>> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>>>>>
>>>>> A reasonable timeout for a "node down" discovery/processing is needed
>>>>> to
>>>>> prevent random flapping of nodes with a super short timeout interval.
>>>>> Applications should also retry on a host unavailable exception like
>>>>> this, because in the long run, this should be expected from time to
>>>>> time
>>>>> for network partitions, node failure, maintenance cycles, etc.
>>>>>
>>>>> --
>>>>> Kind regards,
>>>>> Michael
>>>>>
>>>>> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
>>>>> > Hi daniel,
>>>>> >
>>>>> > I don't think that's a network issue, because ~10 seconds after the
>>>>> node
>>>>> > stopped, the queries were successful again without any timeout
>>>>> issues.
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> >
>>>>> > Shalom Sagges
>>>>> > DBA
>>>>> > T: +972-74-700-4035
>>>>> > <http://www.linkedin.com/company/164748>
>>>>> > <http://twitter.com/liveperson>       <http://www.facebook.com/Live
>>>>> PersonInc>
>>>>> >
>>>>> >       We Create Meaningful Connections
>>>>> >
>>>>> > <https://liveperson.docsend.com/view/8iiswfp>
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
>>>>> > <daniel.hoelbling-inzko@bitmovin.com
>>>>> > <ma...@bitmovin.com>> wrote:
>>>>> >
>>>>> >     Could there be network issues in connecting between the nodes? If
>>>>> >     node A gets to be the query coordinator but can't reach B, and C is
>>>>> >     obviously down, it won't get a quorum.
>>>>> >
>>>>> >     Greetings
>>>>> >
>>>>> >     Shalom Sagges <shaloms@liveperson.com
>>>>> >     <ma...@liveperson.com>> schrieb am Fr. 10. März 2017
>>>>> um 10:55:
>>>>> >
>>>>> >         @Ryan, my keyspace replication settings are as follows:
>>>>> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>>>> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>>>>> >          AND durable_writes = true;
>>>>> >
>>>>> >         CREATE TABLE mykeyspace.test (
>>>>> >             column1 text,
>>>>> >             column2 text,
>>>>> >             column3 text,
>>>>> >             PRIMARY KEY (column1, column2)
>>>>> >         );
>>>>> >
>>>>> >         The query is */select * from mykeyspace.test where
>>>>> >         column1='xxxxx';/*
>>>>> >
>>>>> >         @Daniel, the replication factor is 3. That's why I don't
>>>>> >         understand why I get these timeouts when only one node drops.
>>>>> >
>>>>> >         Also, when I enabled tracing, I got the following error:
>>>>> >         *Unable to fetch query trace: ('Unable to complete the
>>>>> operation
>>>>> >         against any hosts', {<Host: 127.0.0.1 DC1>:
>>>>> Unavailable('Error
>>>>> >         from server: code=1000 [Unavailable exception]
>>>>> message="Cannot
>>>>> >         achieve consistency level LOCAL_QUORUM"
>>>>> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>>>>> >         \'consistency\': \'LOCAL_QUORUM\'}',)})*
>>>>> >
>>>>> >         But nodetool status shows that only 1 replica was down:
>>>>> >         --  Address          Load       Tokens       Owns    Host ID
>>>>> >                                   Rack
>>>>> >         DN  x.x.x.235  134.32 MB  256          ?
>>>>> >         c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>>>>> >         UN  x.x.x.236  134.02 MB  256          ?
>>>>> >         2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>>>>> >         UN  x.x.x.237  134.34 MB  256          ?
>>>>> >         5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>>>>> >
>>>>> >
>>>>> >         I tried to run the same scenario on all 3 nodes, and only the
>>>>> >         3rd node didn't fail the query when I dropped it. The nodes
>>>>> were
>>>>> >         installed and configured with Puppet so the configuration is
>>>>> the
>>>>> >         same on all 3 nodes.
>>>>> >
>>>>> >
>>>>> >         Thanks!
>>>>> >
>>>>> >
>>>>> >
>>>>> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>>>>> >         <daniel.hoelbling-inzko@bitmovin.com
>>>>> >         <ma...@bitmovin.com>> wrote:
>>>>> >
>>>>> >             The LOCAL_QUORUM works on the available replicas in the
>>>>> dc.
>>>>> >             So if your replication factor is 2 and you have 10 nodes
>>>>> you
>>>>> >             can still only lose 1. With a replication factor of 3 you
>>>>> >             can lose one node and still satisfy the query.
>>>>> >             Ryan Svihla <rs@foundev.pro <ma...@foundev.pro>>
>>>>> schrieb
>>>>> >             am Do. 9. März 2017 um 18:09:
>>>>> >
>>>>> >                 whats your keyspace replication settings and what's
>>>>> your
>>>>> >                 query?
>>>>> >
>>>>> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>>>>> >                 <shaloms@liveperson.com <mailto:
>>>>> shaloms@liveperson.com>>
>>>>> >                 wrote:
>>>>> >
>>>>> >                     Hi Cassandra Users,
>>>>> >
>>>>> >                     I hope someone could help me understand the
>>>>> >                     following scenario:
>>>>> >
>>>>> >                     Version: 3.0.9
>>>>> >                     3 nodes per DC
>>>>> >                     3 DCs in the cluster.
>>>>> >                     Consistency Local_Quorum.
>>>>> >
>>>>> >                     I did a small resiliency test and dropped a node
>>>>> to
>>>>> >                     check the availability of the data.
>>>>> >                     What I assumed would happen is nothing at all.
>>>>> If a
>>>>> >                     node is down in a 3 nodes DC, Local_Quorum should
>>>>> >                     still be satisfied.
>>>>> >                     However, during the ~10 first seconds after
>>>>> stopping
>>>>> >                     the service, I got timeout errors (tried it both
>>>>> >                     from the client and from cqlsh.
>>>>> >
>>>>> >                     This is the error I get:
>>>>> >                     */ServerError:
>>>>> >                     com.google.common.util.concur
>>>>> rent.UncheckedExecutionException:
>>>>> >                     com.google.common.util.concur
>>>>> rent.UncheckedExecutionException:
>>>>> >                     java.lang.RuntimeException:
>>>>> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
>>>>> Operation
>>>>> >                     timed out - received only 4 responses./*
>>>>> >
>>>>> >
>>>>> >                     After ~10 seconds, the same query is successful
>>>>> with
>>>>> >                     no timeout errors. The dropped node is still
>>>>> down.
>>>>> >
>>>>> >                     Any idea what could cause this and how to fix it?
>>>>> >
>>>>> >                     Thanks!
>>>>> >
>>>>> >
>>>>> >                     This message may contain confidential and/or
>>>>> >                     privileged information.
>>>>> >                     If you are not the addressee or authorized to
>>>>> >                     receive this on behalf of the addressee you must
>>>>> not
>>>>> >                     use, copy, disclose or take action based on this
>>>>> >                     message or any information herein.
>>>>> >                     If you have received this message in error,
>>>>> please
>>>>> >                     advise the sender immediately by reply email and
>>>>> >                     delete this message. Thank you.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >                 --
>>>>> >
>>>>> >                 Thanks,
>>>>> >
>>>>> >                 Ryan Svihla
>>>>> >
>>>>> >
>>>>> >
>>>>> >         This message may contain confidential and/or privileged
>>>>> >         information.
>>>>> >         If you are not the addressee or authorized to receive this on
>>>>> >         behalf of the addressee you must not use, copy, disclose or
>>>>> take
>>>>> >         action based on this message or any information herein.
>>>>> >         If you have received this message in error, please advise the
>>>>> >         sender immediately by reply email and delete this message.
>>>>> Thank
>>>>> >         you.
>>>>> >
>>>>> >
>>>>> >
>>>>> > This message may contain confidential and/or privileged information.
>>>>> > If you are not the addressee or authorized to receive this on behalf
>>>>> of
>>>>> > the addressee you must not use, copy, disclose or take action based
>>>>> on
>>>>> > this message or any information herein.
>>>>> > If you have received this message in error, please advise the sender
>>>>> > immediately by reply email and delete this message. Thank you.
>>>>>
>>>>>
>>>>
>>>
>>> This message may contain confidential and/or privileged information.
>>> If you are not the addressee or authorized to receive this on behalf of
>>> the addressee you must not use, copy, disclose or take action based on this
>>> message or any information herein.
>>> If you have received this message in error, please advise the sender
>>> immediately by reply email and delete this message. Thank you.
>>>
>>
>>
>


Re: A Single Dropped Node Fails Entire Read Queries

Posted by Shalom Sagges <sh...@liveperson.com>.
Thanks a lot Joel!

I'll go ahead and upgrade.

Thanks again!


Shalom Sagges
DBA
T: +972-74-700-4035
<http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
<http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
<https://liveperson.docsend.com/view/8iiswfp>


On Mon, Mar 13, 2017 at 7:27 PM, Joel Knighton <jo...@datastax.com>
wrote:

> It's possible that you're hitting https://issues.apache.
> org/jira/browse/CASSANDRA-13009 .
>
> In (simplified) summary, the read query picks the right number of
> endpoints fairly early in its execution. Because the down node has not been
> detected as down yet, it may be one of the nodes. When this node doesn't
> answer, it is likely that speculative retry will kick in after a certain
> amount of time and query an up node. This feature is present and working in
> the earlier releases you tested. Unfortunately, percentile-based
> speculative retry wasn't working as intended in 2.2+ until fixed in
> CASSANDRA-13009, which went into 2.2.9/3.0.11+.
>
> It may be worth evaluating the latest 3.0.x release.
>
> On Mon, Mar 13, 2017 at 11:48 AM, Shalom Sagges <sh...@liveperson.com>
> wrote:
>
>> Just some more info, I've tried the same scenario on 2.0.14 and 2.1.15
>> and didn't encounter such errors.
>> What I did find is that the timeout errors appear only until the node is
>> discovered as "DN" in nodetool status. Once the node is in DN status, the
>> errors stop and the data is retrieved.
>>
>> Could this be a bug in 3.0.9? Or some sort of misconfiguration I missed?
>>
>> Thanks!
>>
>>
>>
>> Shalom Sagges
>> DBA
>> T: +972-74-700-4035 <+972%2074-700-4035>
>> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
>> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
>> <https://liveperson.docsend.com/view/8iiswfp>
>>
>>
>> On Sun, Mar 12, 2017 at 10:21 AM, Shalom Sagges <sh...@liveperson.com>
>> wrote:
>>
>>> Hi Michael,
>>>
>>> If a node suddenly fails, and there are other replicas that can still
>>> satisfy the consistency level, shouldn't the request succeed regardless of
>>> the failed node?
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> Shalom Sagges
>>> DBA
>>> T: +972-74-700-4035 <+972%2074-700-4035>
>>> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
>>> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
>>> <https://liveperson.docsend.com/view/8iiswfp>
>>>
>>>
>>> On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler <mi...@pbandjelly.org>
>>> wrote:
>>>
>>>> I may be mistaken on the exact configuration option for the timeout
>>>> you're hitting, but I believe this may be the general
>>>> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>>>>
>>>> A reasonable timeout for a "node down" discovery/processing is needed to
>>>> prevent random flapping of nodes with a super short timeout interval.
>>>> Applications should also retry on a host unavailable exception like
>>>> this, because in the long run, this should be expected from time to time
>>>> for network partitions, node failure, maintenance cycles, etc.
>>>>
>>>> --
>>>> Kind regards,
>>>> Michael
>>>>
>>>> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
>>>> > Hi daniel,
>>>> >
>>>> > I don't think that's a network issue, because ~10 seconds after the
>>>> node
>>>> > stopped, the queries were successful again without any timeout issues.
>>>> >
>>>> > Thanks!
>>>> >
>>>> >
>>>> > Shalom Sagges
>>>> > DBA
>>>> > T: +972-74-700-4035
>>>> > <http://www.linkedin.com/company/164748>
>>>> > <http://twitter.com/liveperson>       <http://www.facebook.com/Live
>>>> PersonInc>
>>>> >
>>>> >       We Create Meaningful Connections
>>>> >
>>>> > <https://liveperson.docsend.com/view/8iiswfp>
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
>>>> > <daniel.hoelbling-inzko@bitmovin.com
>>>> > <ma...@bitmovin.com>> wrote:
>>>> >
>>>> >     Could there be network issues in connecting between the nodes? If
>>>> >     node A gets to be the query coordinator but can't reach B, and C is
>>>> >     obviously down, it won't get a quorum.
>>>> >
>>>> >     Greetings
>>>> >
>>>> >     Shalom Sagges <shaloms@liveperson.com
>>>> >     <ma...@liveperson.com>> schrieb am Fr. 10. März 2017 um
>>>> 10:55:
>>>> >
>>>> >         @Ryan, my keyspace replication settings are as follows:
>>>> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>>> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>>>> >          AND durable_writes = true;
>>>> >
>>>> >         CREATE TABLE mykeyspace.test (
>>>> >             column1 text,
>>>> >             column2 text,
>>>> >             column3 text,
>>>> >             PRIMARY KEY (column1, column2)
>>>> >         );
>>>> >
>>>> >         The query is */select * from mykeyspace.test where
>>>> >         column1='xxxxx';/*
>>>> >
>>>> >         @Daniel, the replication factor is 3. That's why I don't
>>>> >         understand why I get these timeouts when only one node drops.
>>>> >
>>>> >         Also, when I enabled tracing, I got the following error:
>>>> >         *Unable to fetch query trace: ('Unable to complete the
>>>> operation
>>>> >         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
>>>> >         from server: code=1000 [Unavailable exception] message="Cannot
>>>> >         achieve consistency level LOCAL_QUORUM"
>>>> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>>>> >         \'consistency\': \'LOCAL_QUORUM\'}',)})*
>>>> >
>>>> >         But nodetool status shows that only 1 replica was down:
>>>> >         --  Address          Load       Tokens       Owns    Host ID
>>>> >                                   Rack
>>>> >         DN  x.x.x.235  134.32 MB  256          ?
>>>> >         c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>>>> >         UN  x.x.x.236  134.02 MB  256          ?
>>>> >         2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>>>> >         UN  x.x.x.237  134.34 MB  256          ?
>>>> >         5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>>>> >
>>>> >
>>>> >         I tried to run the same scenario on all 3 nodes, and only the
>>>> >         3rd node didn't fail the query when I dropped it. The nodes
>>>> were
>>>> >         installed and configured with Puppet so the configuration is
>>>> the
>>>> >         same on all 3 nodes.
>>>> >
>>>> >
>>>> >         Thanks!
>>>> >
>>>> >
>>>> >
>>>> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>>>> >         <daniel.hoelbling-inzko@bitmovin.com
>>>> >         <ma...@bitmovin.com>> wrote:
>>>> >
>>>> >             The LOCAL_QUORUM works on the available replicas in the
>>>> dc.
>>>> >             So if your replication factor is 2 and you have 10 nodes
>>>> you
>>>> >             can still only lose 1. With a replication factor of 3 you
>>>> >             can lose one node and still satisfy the query.
>>>> >             Ryan Svihla <rs@foundev.pro <ma...@foundev.pro>>
>>>> schrieb
>>>> >             am Do. 9. März 2017 um 18:09:
>>>> >
>>>> >                 whats your keyspace replication settings and what's
>>>> your
>>>> >                 query?
>>>> >
>>>> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>>>> >                 <shaloms@liveperson.com <mailto:
>>>> shaloms@liveperson.com>>
>>>> >                 wrote:
>>>> >
>>>> >                     Hi Cassandra Users,
>>>> >
>>>> >                     I hope someone could help me understand the
>>>> >                     following scenario:
>>>> >
>>>> >                     Version: 3.0.9
>>>> >                     3 nodes per DC
>>>> >                     3 DCs in the cluster.
>>>> >                     Consistency Local_Quorum.
>>>> >
>>>> >                     I did a small resiliency test and dropped a node
>>>> to
>>>> >                     check the availability of the data.
>>>> >                     What I assumed would happen is nothing at all. If
>>>> a
>>>> >                     node is down in a 3 nodes DC, Local_Quorum should
>>>> >                     still be satisfied.
>>>> >                     However, during the ~10 first seconds after
>>>> stopping
>>>> >                     the service, I got timeout errors (tried it both
>>>> >                     from the client and from cqlsh.
>>>> >
>>>> >                     This is the error I get:
>>>> >                     */ServerError:
>>>> >                     com.google.common.util.concur
>>>> rent.UncheckedExecutionException:
>>>> >                     com.google.common.util.concur
>>>> rent.UncheckedExecutionException:
>>>> >                     java.lang.RuntimeException:
>>>> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
>>>> Operation
>>>> >                     timed out - received only 4 responses./*
>>>> >
>>>> >
>>>> >                     After ~10 seconds, the same query is successful
>>>> with
>>>> >                     no timeout errors. The dropped node is still down.
>>>> >
>>>> >                     Any idea what could cause this and how to fix it?
>>>> >
>>>> >                     Thanks!
>>>> >
>>>> >
>>>> >                     This message may contain confidential and/or
>>>> >                     privileged information.
>>>> >                     If you are not the addressee or authorized to
>>>> >                     receive this on behalf of the addressee you must
>>>> not
>>>> >                     use, copy, disclose or take action based on this
>>>> >                     message or any information herein.
>>>> >                     If you have received this message in error, please
>>>> >                     advise the sender immediately by reply email and
>>>> >                     delete this message. Thank you.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >                 --
>>>> >
>>>> >                 Thanks,
>>>> >
>>>> >                 Ryan Svihla
>>>> >
>>>> >
>>>> >
>>>> >         This message may contain confidential and/or privileged
>>>> >         information.
>>>> >         If you are not the addressee or authorized to receive this on
>>>> >         behalf of the addressee you must not use, copy, disclose or
>>>> take
>>>> >         action based on this message or any information herein.
>>>> >         If you have received this message in error, please advise the
>>>> >         sender immediately by reply email and delete this message.
>>>> Thank
>>>> >         you.
>>>> >
>>>> >
>>>> >
>>>> > This message may contain confidential and/or privileged information.
>>>> > If you are not the addressee or authorized to receive this on behalf
>>>> of
>>>> > the addressee you must not use, copy, disclose or take action based on
>>>> > this message or any information herein.
>>>> > If you have received this message in error, please advise the sender
>>>> > immediately by reply email and delete this message. Thank you.
>>>>
>>>>
>>>
>>
>> This message may contain confidential and/or privileged information.
>> If you are not the addressee or authorized to receive this on behalf of
>> the addressee you must not use, copy, disclose or take action based on this
>> message or any information herein.
>> If you have received this message in error, please advise the sender
>> immediately by reply email and delete this message. Thank you.
>>
>
>


Re: A Single Dropped Node Fails Entire Read Queries

Posted by Joel Knighton <jo...@datastax.com>.
It's possible that you're hitting
https://issues.apache.org/jira/browse/CASSANDRA-13009 .

In (simplified) summary, the read query picks the right number of endpoints
fairly early in its execution. Because the down node has not been detected
as down yet, it may be one of the nodes. When this node doesn't answer, it
is likely that speculative retry will kick in after a certain amount of
time and query an up node. This feature is present and working in the
earlier releases you tested. Unfortunately, percentile-based speculative
retry wasn't working as intended in 2.2+ until fixed in CASSANDRA-13009,
which went into 2.2.9/3.0.11+.
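
If you want to confirm what the table is currently using, something along
these lines should work (mykeyspace.test is just the example table from
earlier in the thread, and '99PERCENTILE' is the usual default):

-- inspect the current per-table setting
SELECT speculative_retry FROM system_schema.tables
 WHERE keyspace_name = 'mykeyspace' AND table_name = 'test';

-- the knob itself, should you ever want to change it
ALTER TABLE mykeyspace.test WITH speculative_retry = '99PERCENTILE';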

It may be worth evaluating the latest 3.0.x release.

On Mon, Mar 13, 2017 at 11:48 AM, Shalom Sagges <sh...@liveperson.com>
wrote:

> Just some more info, I've tried the same scenario on 2.0.14 and 2.1.15 and
> didn't encounter such errors.
> What I did find is that the timeout errors appear only until the node is
> discovered as "DN" in nodetool status. Once the node is in DN status, the
> errors stop and the data is retrieved.
>
> Could this be a bug in 3.0.9? Or some sort of misconfiguration I missed?
>
> Thanks!
>
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035 <+972%2074-700-4035>
> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
> <https://liveperson.docsend.com/view/8iiswfp>
>
>
> On Sun, Mar 12, 2017 at 10:21 AM, Shalom Sagges <sh...@liveperson.com>
> wrote:
>
>> Hi Michael,
>>
>> If a node suddenly fails, and there are other replicas that can still
>> satisfy the consistency level, shouldn't the request succeed regardless of
>> the failed node?
>>
>> Thanks!
>>
>>
>>
>>
>>
>> Shalom Sagges
>> DBA
>> T: +972-74-700-4035 <+972%2074-700-4035>
>> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
>> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
>> <https://liveperson.docsend.com/view/8iiswfp>
>>
>>
>> On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler <mi...@pbandjelly.org>
>> wrote:
>>
>>> I may be mistaken on the exact configuration option for the timeout
>>> you're hitting, but I believe this may be the general
>>> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>>>
>>> A reasonable timeout for a "node down" discovery/processing is needed to
>>> prevent random flapping of nodes with a super short timeout interval.
>>> Applications should also retry on a host unavailable exception like
>>> this, because in the long run, this should be expected from time to time
>>> for network partitions, node failure, maintenance cycles, etc.
>>>
>>> --
>>> Kind regards,
>>> Michael
>>>
>>> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
>>> > Hi daniel,
>>> >
>>> > I don't think that's a network issue, because ~10 seconds after the
>>> node
>>> > stopped, the queries were successful again without any timeout issues.
>>> >
>>> > Thanks!
>>> >
>>> >
>>> > Shalom Sagges
>>> > DBA
>>> > T: +972-74-700-4035
>>> > <http://www.linkedin.com/company/164748>
>>> > <http://twitter.com/liveperson>       <http://www.facebook.com/Live
>>> PersonInc>
>>> >
>>> >       We Create Meaningful Connections
>>> >
>>> > <https://liveperson.docsend.com/view/8iiswfp>
>>> >
>>> >
>>> >
>>> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
>>> > <daniel.hoelbling-inzko@bitmovin.com
>>> > <ma...@bitmovin.com>> wrote:
>>> >
>>> >     Could there be network issues in connecting between the nodes? If
>>> >     node A gets to be the query coordinator but can't reach B, and C is
>>> >     obviously down, it won't get a quorum.
>>> >
>>> >     Greetings
>>> >
>>> >     Shalom Sagges <shaloms@liveperson.com
>>> >     <ma...@liveperson.com>> schrieb am Fr. 10. März 2017 um
>>> 10:55:
>>> >
>>> >         @Ryan, my keyspace replication settings are as follows:
>>> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>>> >          AND durable_writes = true;
>>> >
>>> >         CREATE TABLE mykeyspace.test (
>>> >             column1 text,
>>> >             column2 text,
>>> >             column3 text,
>>> >             PRIMARY KEY (column1, column2)
>>> >         );
>>> >
>>> >         The query is */select * from mykeyspace.test where
>>> >         column1='xxxxx';/*
>>> >
>>> >         @Daniel, the replication factor is 3. That's why I don't
>>> >         understand why I get these timeouts when only one node drops.
>>> >
>>> >         Also, when I enabled tracing, I got the following error:
>>> >         *Unable to fetch query trace: ('Unable to complete the
>>> operation
>>> >         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
>>> >         from server: code=1000 [Unavailable exception] message="Cannot
>>> >         achieve consistency level LOCAL_QUORUM"
>>> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>>> >         \'consistency\': \'LOCAL_QUORUM\'}',)})*
>>> >
>>> >         But nodetool status shows that only 1 replica was down:
>>> >         --  Address          Load       Tokens       Owns    Host ID
>>> >                                   Rack
>>> >         DN  x.x.x.235  134.32 MB  256          ?
>>> >         c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>>> >         UN  x.x.x.236  134.02 MB  256          ?
>>> >         2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>>> >         UN  x.x.x.237  134.34 MB  256          ?
>>> >         5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>>> >
>>> >
>>> >         I tried to run the same scenario on all 3 nodes, and only the
>>> >         3rd node didn't fail the query when I dropped it. The nodes
>>> were
>>> >         installed and configured with Puppet so the configuration is
>>> the
>>> >         same on all 3 nodes.
>>> >
>>> >
>>> >         Thanks!
>>> >
>>> >
>>> >
>>> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>>> >         <daniel.hoelbling-inzko@bitmovin.com
>>> >         <ma...@bitmovin.com>> wrote:
>>> >
>>> >             The LOCAL_QUORUM works on the available replicas in the dc.
>>> >             So if your replication factor is 2 and you have 10 nodes
>>> you
>>> >             can still only lose 1. With a replication factor of 3 you
>>> >             can lose one node and still satisfy the query.
>>> >             Ryan Svihla <rs@foundev.pro <ma...@foundev.pro>>
>>> schrieb
>>> >             am Do. 9. März 2017 um 18:09:
>>> >
>>> >                 whats your keyspace replication settings and what's
>>> your
>>> >                 query?
>>> >
>>> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>>> >                 <shaloms@liveperson.com <mailto:shaloms@liveperson.com
>>> >>
>>> >                 wrote:
>>> >
>>> >                     Hi Cassandra Users,
>>> >
>>> >                     I hope someone could help me understand the
>>> >                     following scenario:
>>> >
>>> >                     Version: 3.0.9
>>> >                     3 nodes per DC
>>> >                     3 DCs in the cluster.
>>> >                     Consistency Local_Quorum.
>>> >
>>> >                     I did a small resiliency test and dropped a node to
>>> >                     check the availability of the data.
>>> >                     What I assumed would happen is nothing at all. If a
>>> >                     node is down in a 3 nodes DC, Local_Quorum should
>>> >                     still be satisfied.
>>> >                     However, during the ~10 first seconds after
>>> stopping
>>> >                     the service, I got timeout errors (tried it both
>>> >                     from the client and from cqlsh.
>>> >
>>> >                     This is the error I get:
>>> >                     */ServerError:
>>> >                     com.google.common.util.concur
>>> rent.UncheckedExecutionException:
>>> >                     com.google.common.util.concur
>>> rent.UncheckedExecutionException:
>>> >                     java.lang.RuntimeException:
>>> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
>>> Operation
>>> >                     timed out - received only 4 responses./*
>>> >
>>> >
>>> >                     After ~10 seconds, the same query is successful
>>> with
>>> >                     no timeout errors. The dropped node is still down.
>>> >
>>> >                     Any idea what could cause this and how to fix it?
>>> >
>>> >                     Thanks!
>>> >
>>> >
>>> >                     This message may contain confidential and/or
>>> >                     privileged information.
>>> >                     If you are not the addressee or authorized to
>>> >                     receive this on behalf of the addressee you must
>>> not
>>> >                     use, copy, disclose or take action based on this
>>> >                     message or any information herein.
>>> >                     If you have received this message in error, please
>>> >                     advise the sender immediately by reply email and
>>> >                     delete this message. Thank you.
>>> >
>>> >
>>> >
>>> >
>>> >                 --
>>> >
>>> >                 Thanks,
>>> >
>>> >                 Ryan Svihla
>>> >
>>> >
>>> >
>>> >         This message may contain confidential and/or privileged
>>> >         information.
>>> >         If you are not the addressee or authorized to receive this on
>>> >         behalf of the addressee you must not use, copy, disclose or
>>> take
>>> >         action based on this message or any information herein.
>>> >         If you have received this message in error, please advise the
>>> >         sender immediately by reply email and delete this message.
>>> Thank
>>> >         you.
>>> >
>>> >
>>> >
>>> > This message may contain confidential and/or privileged information.
>>> > If you are not the addressee or authorized to receive this on behalf of
>>> > the addressee you must not use, copy, disclose or take action based on
>>> > this message or any information herein.
>>> > If you have received this message in error, please advise the sender
>>> > immediately by reply email and delete this message. Thank you.
>>>
>>>
>>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>

Re: A Single Dropped Node Fails Entire Read Queries

Posted by Shalom Sagges <sh...@liveperson.com>.
Just some more info: I've tried the same scenario on 2.0.14 and 2.1.15 and
didn't encounter such errors.
What I did find is that the timeout errors appear only until the node is
discovered as "DN" in nodetool status. Once the node is in DN status, the
errors stop and the data is retrieved.
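
A rough sketch of how to watch that detection window from the shell (the
1-second interval is arbitrary):

watch -n 1 nodetool status   # wait for the stopped node to flip from UN to DN
nodetool gossipinfo          # what gossip currently believes about each endpoint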

Could this be a bug in 3.0.9? Or some sort of misconfiguration I missed?

Thanks!



Shalom Sagges
DBA
T: +972-74-700-4035
<http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
<http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
<https://liveperson.docsend.com/view/8iiswfp>


On Sun, Mar 12, 2017 at 10:21 AM, Shalom Sagges <sh...@liveperson.com>
wrote:

> Hi Michael,
>
> If a node suddenly fails, and there are other replicas that can still
> satisfy the consistency level, shouldn't the request succeed regardless of
> the failed node?
>
> Thanks!
>
>
>
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035 <+972%2074-700-4035>
> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
> <https://liveperson.docsend.com/view/8iiswfp>
>
>
> On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler <mi...@pbandjelly.org>
> wrote:
>
>> I may be mistaken on the exact configuration option for the timeout
>> you're hitting, but I believe this may be the general
>> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>>
>> A reasonable timeout for a "node down" discovery/processing is needed to
>> prevent random flapping of nodes with a super short timeout interval.
>> Applications should also retry on a host unavailable exception like
>> this, because in the long run, this should be expected from time to time
>> for network partitions, node failure, maintenance cycles, etc.
>>
>> --
>> Kind regards,
>> Michael
>>
>> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
>> > Hi daniel,
>> >
>> > I don't think that's a network issue, because ~10 seconds after the node
>> > stopped, the queries were successful again without any timeout issues.
>> >
>> > Thanks!
>> >
>> >
>> > Shalom Sagges
>> > DBA
>> > T: +972-74-700-4035
>> > <http://www.linkedin.com/company/164748>
>> > <http://twitter.com/liveperson>       <http://www.facebook.com/Live
>> PersonInc>
>> >
>> >       We Create Meaningful Connections
>> >
>> > <https://liveperson.docsend.com/view/8iiswfp>
>> >
>> >
>> >
>> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
>> > <daniel.hoelbling-inzko@bitmovin.com
>> > <ma...@bitmovin.com>> wrote:
>> >
>> >     Could there be network issues in connecting between the nodes? If
>> >     node A gets to be the query coordinator but can't reach B, and C is
>> >     obviously down, it won't get a quorum.
>> >
>> >     Greetings
>> >
>> >     Shalom Sagges <shaloms@liveperson.com
>> >     <ma...@liveperson.com>> schrieb am Fr. 10. März 2017 um
>> 10:55:
>> >
>> >         @Ryan, my keyspace replication settings are as follows:
>> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>> >          AND durable_writes = true;
>> >
>> >         CREATE TABLE mykeyspace.test (
>> >             column1 text,
>> >             column2 text,
>> >             column3 text,
>> >             PRIMARY KEY (column1, column2)
>> >         );
>> >
>> >         The query is */select * from mykeyspace.test where
>> >         column1='xxxxx';/*
>> >
>> >         @Daniel, the replication factor is 3. That's why I don't
>> >         understand why I get these timeouts when only one node drops.
>> >
>> >         Also, when I enabled tracing, I got the following error:
>> >         *Unable to fetch query trace: ('Unable to complete the operation
>> >         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
>> >         from server: code=1000 [Unavailable exception] message="Cannot
>> >         achieve consistency level LOCAL_QUORUM"
>> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>> >         \'consistency\': \'LOCAL_QUORUM\'}',)})*
>> >
>> >         But nodetool status shows that only 1 replica was down:
>> >         --  Address          Load       Tokens       Owns    Host ID
>> >                                   Rack
>> >         DN  x.x.x.235  134.32 MB  256          ?
>> >         c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>> >         UN  x.x.x.236  134.02 MB  256          ?
>> >         2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>> >         UN  x.x.x.237  134.34 MB  256          ?
>> >         5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>> >
>> >
>> >         I tried to run the same scenario on all 3 nodes, and only the
>> >         3rd node didn't fail the query when I dropped it. The nodes were
>> >         installed and configured with Puppet so the configuration is the
>> >         same on all 3 nodes.
>> >
>> >
>> >         Thanks!
>> >
>> >
>> >
>> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>> >         <daniel.hoelbling-inzko@bitmovin.com
>> >         <ma...@bitmovin.com>> wrote:
>> >
>> >             The LOCAL_QUORUM works on the available replicas in the dc.
>> >             So if your replication factor is 2 and you have 10 nodes you
>> >             can still only lose 1. With a replication factor of 3 you
>> >             can lose one node and still satisfy the query.
>> >             Ryan Svihla <rs@foundev.pro <ma...@foundev.pro>>
>> schrieb
>> >             am Do. 9. März 2017 um 18:09:
>> >
>> >                 whats your keyspace replication settings and what's your
>> >                 query?
>> >
>> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>> >                 <shaloms@liveperson.com <mailto:shaloms@liveperson.com
>> >>
>> >                 wrote:
>> >
>> >                     Hi Cassandra Users,
>> >
>> >                     I hope someone could help me understand the
>> >                     following scenario:
>> >
>> >                     Version: 3.0.9
>> >                     3 nodes per DC
>> >                     3 DCs in the cluster.
>> >                     Consistency Local_Quorum.
>> >
>> >                     I did a small resiliency test and dropped a node to
>> >                     check the availability of the data.
>> >                     What I assumed would happen is nothing at all. If a
>> >                     node is down in a 3 nodes DC, Local_Quorum should
>> >                     still be satisfied.
>> >                     However, during the ~10 first seconds after stopping
>> >                     the service, I got timeout errors (tried it both
>> >                     from the client and from cqlsh.
>> >
>> >                     This is the error I get:
>> >                     */ServerError:
>> >                     com.google.common.util.concur
>> rent.UncheckedExecutionException:
>> >                     com.google.common.util.concur
>> rent.UncheckedExecutionException:
>> >                     java.lang.RuntimeException:
>> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
>> Operation
>> >                     timed out - received only 4 responses./*
>> >
>> >
>> >                     After ~10 seconds, the same query is successful with
>> >                     no timeout errors. The dropped node is still down.
>> >
>> >                     Any idea what could cause this and how to fix it?
>> >
>> >                     Thanks!
>> >
>> >
>> >                     This message may contain confidential and/or
>> >                     privileged information.
>> >                     If you are not the addressee or authorized to
>> >                     receive this on behalf of the addressee you must not
>> >                     use, copy, disclose or take action based on this
>> >                     message or any information herein.
>> >                     If you have received this message in error, please
>> >                     advise the sender immediately by reply email and
>> >                     delete this message. Thank you.
>> >
>> >
>> >
>> >
>> >                 --
>> >
>> >                 Thanks,
>> >
>> >                 Ryan Svihla
>> >
>> >
>> >
>> >         This message may contain confidential and/or privileged
>> >         information.
>> >         If you are not the addressee or authorized to receive this on
>> >         behalf of the addressee you must not use, copy, disclose or take
>> >         action based on this message or any information herein.
>> >         If you have received this message in error, please advise the
>> >         sender immediately by reply email and delete this message. Thank
>> >         you.
>> >
>> >
>> >
>> > This message may contain confidential and/or privileged information.
>> > If you are not the addressee or authorized to receive this on behalf of
>> > the addressee you must not use, copy, disclose or take action based on
>> > this message or any information herein.
>> > If you have received this message in error, please advise the sender
>> > immediately by reply email and delete this message. Thank you.
>>
>>
>


Re: A Single Dropped Node Fails Entire Read Queries

Posted by Shalom Sagges <sh...@liveperson.com>.
Hi Michael,

If a node suddenly fails, and there are other replicas that can still
satisfy the consistency level, shouldn't the request succeed regardless of
the failed node?
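
(For reference, one way to see which replicas the coordinator actually picks
during that window is cqlsh query tracing, roughly as below, using the table
from earlier in the thread:)

CONSISTENCY LOCAL_QUORUM;
TRACING ON;
SELECT * FROM mykeyspace.test WHERE column1='xxxxx';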

Thanks!





Shalom Sagges
DBA
T: +972-74-700-4035
<http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
<http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
<https://liveperson.docsend.com/view/8iiswfp>


On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler <mi...@pbandjelly.org>
wrote:

> I may be mistaken on the exact configuration option for the timeout
> you're hitting, but I believe this may be the general
> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>
> A reasonable timeout for a "node down" discovery/processing is needed to
> prevent random flapping of nodes with a super short timeout interval.
> Applications should also retry on a host unavailable exception like
> this, because in the long run, this should be expected from time to time
> for network partitions, node failure, maintenance cycles, etc.
>
> --
> Kind regards,
> Michael
>
> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
> > Hi daniel,
> >
> > I don't think that's a network issue, because ~10 seconds after the node
> > stopped, the queries were successful again without any timeout issues.
> >
> > Thanks!
> >
> >
> > Shalom Sagges
> > DBA
> > T: +972-74-700-4035
> > <http://www.linkedin.com/company/164748>
> > <http://twitter.com/liveperson>       <http://www.facebook.com/
> LivePersonInc>
> >
> >       We Create Meaningful Connections
> >
> > <https://liveperson.docsend.com/view/8iiswfp>
> >
> >
> >
> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
> > <daniel.hoelbling-inzko@bitmovin.com
> > <ma...@bitmovin.com>> wrote:
> >
> >     Could there be network issues in connecting between the nodes? If
> >     node A gets to be the query coordinator but can't reach B, and C is
> >     obviously down, it won't get a quorum.
> >
> >     Greetings
> >
> >     Shalom Sagges <shaloms@liveperson.com
> >     <ma...@liveperson.com>> schrieb am Fr. 10. März 2017 um
> 10:55:
> >
> >         @Ryan, my keyspace replication settings are as follows:
> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
> >          AND durable_writes = true;
> >
> >         CREATE TABLE mykeyspace.test (
> >             column1 text,
> >             column2 text,
> >             column3 text,
> >             PRIMARY KEY (column1, column2)
> >         );
> >
> >         The query is */select * from mykeyspace.test where
> >         column1='xxxxx';/*
> >
> >         @Daniel, the replication factor is 3. That's why I don't
> >         understand why I get these timeouts when only one node drops.
> >
> >         Also, when I enabled tracing, I got the following error:
> >         *Unable to fetch query trace: ('Unable to complete the operation
> >         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
> >         from server: code=1000 [Unavailable exception] message="Cannot
> >         achieve consistency level LOCAL_QUORUM"
> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
> >         \'consistency\': \'LOCAL_QUORUM\'}',)})*
> >
> >         But nodetool status shows that only 1 replica was down:
> >         --  Address          Load       Tokens       Owns    Host ID
> >                                   Rack
> >         DN  x.x.x.235  134.32 MB  256          ?
> >         c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
> >         UN  x.x.x.236  134.02 MB  256          ?
> >         2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
> >         UN  x.x.x.237  134.34 MB  256          ?
> >         5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
> >
> >
> >         I tried to run the same scenario on all 3 nodes, and only the
> >         3rd node didn't fail the query when I dropped it. The nodes were
> >         installed and configured with Puppet so the configuration is the
> >         same on all 3 nodes.
> >
> >
> >         Thanks!
> >
> >
> >
> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
> >         <daniel.hoelbling-inzko@bitmovin.com
> >         <ma...@bitmovin.com>> wrote:
> >
> >             The LOCAL_QUORUM works on the available replicas in the dc.
> >             So if your replication factor is 2 and you have 10 nodes you
> >             can still only lose 1. With a replication factor of 3 you
> >             can lose one node and still satisfy the query.
> >             Ryan Svihla <rs@foundev.pro <ma...@foundev.pro>> schrieb
> >             am Do. 9. März 2017 um 18:09:
> >
> >                 whats your keyspace replication settings and what's your
> >                 query?
> >
> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
> >                 <shaloms@liveperson.com <ma...@liveperson.com>>
> >                 wrote:
> >
> >                     Hi Cassandra Users,
> >
> >                     I hope someone could help me understand the
> >                     following scenario:
> >
> >                     Version: 3.0.9
> >                     3 nodes per DC
> >                     3 DCs in the cluster.
> >                     Consistency Local_Quorum.
> >
> >                     I did a small resiliency test and dropped a node to
> >                     check the availability of the data.
> >                     What I assumed would happen is nothing at all. If a
> >                     node is down in a 3 nodes DC, Local_Quorum should
> >                     still be satisfied.
> >                     However, during the ~10 first seconds after stopping
> >                     the service, I got timeout errors (tried it both
> >                     from the client and from cqlsh.
> >
> >                     This is the error I get:
> >                     */ServerError:
> >                     com.google.common.util.concurrent.
> UncheckedExecutionException:
> >                     com.google.common.util.concurrent.
> UncheckedExecutionException:
> >                     java.lang.RuntimeException:
> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
> Operation
> >                     timed out - received only 4 responses./*
> >
> >
> >                     After ~10 seconds, the same query is successful with
> >                     no timeout errors. The dropped node is still down.
> >
> >                     Any idea what could cause this and how to fix it?
> >
> >                     Thanks!
> >
> >
> >                     This message may contain confidential and/or
> >                     privileged information.
> >                     If you are not the addressee or authorized to
> >                     receive this on behalf of the addressee you must not
> >                     use, copy, disclose or take action based on this
> >                     message or any information herein.
> >                     If you have received this message in error, please
> >                     advise the sender immediately by reply email and
> >                     delete this message. Thank you.
> >
> >
> >
> >
> >                 --
> >
> >                 Thanks,
> >
> >                 Ryan Svihla
> >
> >
> >
> >         This message may contain confidential and/or privileged
> >         information.
> >         If you are not the addressee or authorized to receive this on
> >         behalf of the addressee you must not use, copy, disclose or take
> >         action based on this message or any information herein.
> >         If you have received this message in error, please advise the
> >         sender immediately by reply email and delete this message. Thank
> >         you.
> >
> >
> >
> > This message may contain confidential and/or privileged information.
> > If you are not the addressee or authorized to receive this on behalf of
> > the addressee you must not use, copy, disclose or take action based on
> > this message or any information herein.
> > If you have received this message in error, please advise the sender
> > immediately by reply email and delete this message. Thank you.
>
>


Re: A Single Dropped Node Fails Entire Read Queries

Posted by Michael Shuler <mi...@pbandjelly.org>.
I may be mistaken on the exact configuration option for the timeout
you're hitting, but I believe this may be the general
`request_timeout_in_ms: 10000` in conf/cassandra.yaml.

A reasonable timeout for a "node down" discovery/processing is needed to
prevent random flapping of nodes with a super short timeout interval.
Applications should also retry on a host unavailable exception like
this, because in the long run, this should be expected from time to time
for network partitions, node failure, maintenance cycles, etc.
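
The knobs in question look roughly like this in conf/cassandra.yaml (the
values shown are the usual defaults; check your own file rather than trusting
these):

# conf/cassandra.yaml
read_request_timeout_in_ms: 5000    # how long the coordinator waits for read replica responses
request_timeout_in_ms: 10000        # catch-all timeout for other request types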

-- 
Kind regards,
Michael

On 03/10/2017 04:07 AM, Shalom Sagges wrote:
> Hi daniel, 
> 
> I don't think that's a network issue, because ~10 seconds after the node
> stopped, the queries were successful again without any timeout issues.
> 
> Thanks!
> 
>  
> Shalom Sagges
> DBA
> T: +972-74-700-4035
> <http://www.linkedin.com/company/164748>
> <http://twitter.com/liveperson> 	<http://www.facebook.com/LivePersonInc>
> 
> 	We Create Meaningful Connections
> 
> <https://liveperson.docsend.com/view/8iiswfp>
> 
>  
> 
> On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
> <daniel.hoelbling-inzko@bitmovin.com
> <ma...@bitmovin.com>> wrote:
> 
>     Could there be network issues in connecting between the nodes? If
>     node A gets to be the query coordinator but can't reach B, and C is
>     obviously down, it won't get a quorum.
> 
>     Greetings
> 
>     Shalom Sagges <shaloms@liveperson.com
>     <ma...@liveperson.com>> schrieb am Fr. 10. März 2017 um 10:55:
> 
>         @Ryan, my keyspace replication settings are as follows:
>         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>          AND durable_writes = true;
> 
>         CREATE TABLE mykeyspace.test (
>             column1 text,
>             column2 text,
>             column3 text,
>             PRIMARY KEY (column1, column2)
>         );
> 
>         The query is */select * from mykeyspace.test where
>         column1='xxxxx';/*
> 
>         @Daniel, the replication factor is 3. That's why I don't
>         understand why I get these timeouts when only one node drops. 
> 
>         Also, when I enabled tracing, I got the following error:
>         *Unable to fetch query trace: ('Unable to complete the operation
>         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
>         from server: code=1000 [Unavailable exception] message="Cannot
>         achieve consistency level LOCAL_QUORUM"
>         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>         \'consistency\': \'LOCAL_QUORUM\'}',)})*
> 
>         But nodetool status shows that only 1 replica was down:
>         --  Address    Load       Tokens  Owns  Host ID                               Rack
>         DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>         UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>         UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
> 
> 
>         I tried to run the same scenario on all 3 nodes, and only the
>         3rd node didn't fail the query when I dropped it. The nodes were
>         installed and configured with Puppet so the configuration is the
>         same on all 3 nodes. 
> 
> 
>         Thanks!
> 
>           
> 
>         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>         <daniel.hoelbling-inzko@bitmovin.com
>         <ma...@bitmovin.com>> wrote:
> 
>             LOCAL_QUORUM works on the available replicas in the local DC.
>             So if your replication factor is 2 and you have 10 nodes, you
>             can still only lose 1. With a replication factor of 3 you can
>             lose one node and still satisfy the query.
>                 Ryan Svihla <rs@foundev.pro <ma...@foundev.pro>> wrote
>                 on Thu, 9 Mar 2017 at 18:09:
> 
>                 What are your keyspace replication settings, and what's
>                 your query?
> 
>                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>                 <shaloms@liveperson.com <ma...@liveperson.com>>
>                 wrote:
> 
>                     Hi Cassandra Users, 
> 
>                     I hope someone could help me understand the
>                     following scenario:
> 
>                     Version: 3.0.9
>                     3 nodes per DC
>                     3 DCs in the cluster. 
>                     Consistency Local_Quorum. 
> 
>                     I did a small resiliency test and dropped a node to
>                     check the availability of the data. 
>                     What I assumed would happen is nothing at all. If a
>                     node is down in a 3 nodes DC, Local_Quorum should
>                     still be satisfied. 
>                     However, during the first ~10 seconds after stopping
>                     the service, I got timeout errors (tried it both
>                     from the client and from cqlsh).
> 
>                     This is the error I get:
>                     */ServerError:
>                     com.google.common.util.concurrent.UncheckedExecutionException:
>                     com.google.common.util.concurrent.UncheckedExecutionException:
>                     java.lang.RuntimeException:
>                     org.apache.cassandra.exceptions.ReadTimeoutException: Operation
>                     timed out - received only 4 responses./*
> 
> 
>                     After ~10 seconds, the same query is successful with
>                     no timeout errors. The dropped node is still down. 
> 
>                     Any idea what could cause this and how to fix it? 
> 
>                     Thanks!
>                      
> 
>                     This message may contain confidential and/or
>                     privileged information. 
>                     If you are not the addressee or authorized to
>                     receive this on behalf of the addressee you must not
>                     use, copy, disclose or take action based on this
>                     message or any information herein. 
>                     If you have received this message in error, please
>                     advise the sender immediately by reply email and
>                     delete this message. Thank you.
> 
> 
> 
> 
>                 -- 
> 
>                 Thanks,
> 
>                 Ryan Svihla
> 
> 
> 
>         This message may contain confidential and/or privileged
>         information. 
>         If you are not the addressee or authorized to receive this on
>         behalf of the addressee you must not use, copy, disclose or take
>         action based on this message or any information herein. 
>         If you have received this message in error, please advise the
>         sender immediately by reply email and delete this message. Thank
>         you.
> 
> 
> 
> This message may contain confidential and/or privileged information. 
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on
> this message or any information herein. 
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.


Re: A Single Dropped Node Fails Entire Read Queries

Posted by Shalom Sagges <sh...@liveperson.com>.
Hi Daniel,

I don't think that's a network issue, because ~10 seconds after the node
stopped, the queries were successful again without any timeout issues.

Thanks!


Shalom Sagges
DBA
T: +972-74-700-4035
<http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
<http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
<https://liveperson.docsend.com/view/8iiswfp>


On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko <
daniel.hoelbling-inzko@bitmovin.com> wrote:

> Could there be network issues in connecting between the nodes? If node A
> gets to be the query coordinator but can't reach B, and C is obviously
> down, it won't get a quorum.
>
> Greetings
>
> Shalom Sagges <sh...@liveperson.com> wrote on Fri, 10 Mar 2017 at
> 10:55:
>
>> @Ryan, my keyspace replication settings are as follows:
>> CREATE KEYSPACE mykeyspace WITH replication = {'class':
>> 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}  AND
>> durable_writes = true;
>>
>> CREATE TABLE mykeyspace.test (
>>     column1 text,
>>     column2 text,
>>     column3 text,
>>     PRIMARY KEY (column1, column2)
>> );
>>
>> The query is *select * from mykeyspace.test where column1='xxxxx';*
>>
>> @Daniel, the replication factor is 3. That's why I don't understand why I
>> get these timeouts when only one node drops.
>>
>> Also, when I enabled tracing, I got the following error:
>> *Unable to fetch query trace: ('Unable to complete the operation against
>> any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error from server:
>> code=1000 [Unavailable exception] message="Cannot achieve consistency level
>> LOCAL_QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1,
>> \'consistency\': \'LOCAL_QUORUM\'}',)})*
>>
>> But nodetool status shows that only 1 replica was down:
>> --  Address    Load       Tokens  Owns  Host ID                               Rack
>> DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>> UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>> UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>>
>>
>> I tried to run the same scenario on all 3 nodes, and only the 3rd node
>> didn't fail the query when I dropped it. The nodes were installed and
>> configured with Puppet so the configuration is the same on all 3 nodes.
>>
>>
>> Thanks!
>>
>>
>>
>> On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko <
>> daniel.hoelbling-inzko@bitmovin.com> wrote:
>>
>> LOCAL_QUORUM works on the available replicas in the local DC. So if your
>> replication factor is 2 and you have 10 nodes, you can still only lose 1.
>> With a replication factor of 3 you can lose one node and still satisfy the
>> query.
>> Ryan Svihla <rs...@foundev.pro> wrote on Thu, 9 Mar 2017 at 18:09:
>>
>> What are your keyspace replication settings, and what's your query?
>>
>> On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges <sh...@liveperson.com>
>> wrote:
>>
>> Hi Cassandra Users,
>>
>> I hope someone could help me understand the following scenario:
>>
>> Version: 3.0.9
>> 3 nodes per DC
>> 3 DCs in the cluster.
>> Consistency Local_Quorum.
>>
>> I did a small resiliency test and dropped a node to check the
>> availability of the data.
>> What I assumed would happen is nothing at all. If a node is down in a 3
>> nodes DC, Local_Quorum should still be satisfied.
>> However, during the first ~10 seconds after stopping the service, I got
>> timeout errors (tried it both from the client and from cqlsh).
>>
>> This is the error I get:
>> *ServerError:
>> com.google.common.util.concurrent.UncheckedExecutionException:
>> com.google.common.util.concurrent.UncheckedExecutionException:
>> java.lang.RuntimeException:
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>> received only 4 responses.*
>>
>>
>> After ~10 seconds, the same query is successful with no timeout errors.
>> The dropped node is still down.
>>
>> Any idea what could cause this and how to fix it?
>>
>> Thanks!
>>
>>
>> This message may contain confidential and/or privileged information.
>> If you are not the addressee or authorized to receive this on behalf of
>> the addressee you must not use, copy, disclose or take action based on this
>> message or any information herein.
>> If you have received this message in error, please advise the sender
>> immediately by reply email and delete this message. Thank you.
>>
>>
>>
>>
>> --
>>
>> Thanks,
>> Ryan Svihla
>>
>>
>>
>> This message may contain confidential and/or privileged information.
>> If you are not the addressee or authorized to receive this on behalf of
>> the addressee you must not use, copy, disclose or take action based on this
>> message or any information herein.
>> If you have received this message in error, please advise the sender
>> immediately by reply email and delete this message. Thank you.
>>
>

-- 
This message may contain confidential and/or privileged information. 
If you are not the addressee or authorized to receive this on behalf of the 
addressee you must not use, copy, disclose or take action based on this 
message or any information herein. 
If you have received this message in error, please advise the sender 
immediately by reply email and delete this message. Thank you.

Re: A Single Dropped Node Fails Entire Read Queries

Posted by Daniel Hölbling-Inzko <da...@bitmovin.com>.
Could there be network issues in connecting between the nodes? If node A
gets to be the query coordinator but can't reach B, and C is obviously
down, it won't get a quorum.

Greetings
Shalom Sagges <sh...@liveperson.com> wrote on Fri, 10 Mar 2017 at
10:55:

> @Ryan, my keyspace replication settings are as follows:
> CREATE KEYSPACE mykeyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}  AND
> durable_writes = true;
>
> CREATE TABLE mykeyspace.test (
>     column1 text,
>     column2 text,
>     column3 text,
>     PRIMARY KEY (column1, column2)
> );
>
> The query is *select * from mykeyspace.test where column1='xxxxx';*
>
> @Daniel, the replication factor is 3. That's why I don't understand why I
> get these timeouts when only one node drops.
>
> Also, when I enabled tracing, I got the following error:
> *Unable to fetch query trace: ('Unable to complete the operation against
> any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error from server:
> code=1000 [Unavailable exception] message="Cannot achieve consistency level
> LOCAL_QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1,
> \'consistency\': \'LOCAL_QUORUM\'}',)})*
>
> But nodetool status shows that only 1 replica was down:
> --  Address    Load       Tokens  Owns  Host ID                               Rack
> DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
> UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
> UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>
>
> I tried to run the same scenario on all 3 nodes, and only the 3rd node
> didn't fail the query when I dropped it. The nodes were installed and
> configured with Puppet so the configuration is the same on all 3 nodes.
>
>
> Thanks!
>
>
>
> On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko <
> daniel.hoelbling-inzko@bitmovin.com> wrote:
>
> LOCAL_QUORUM works on the available replicas in the local DC. So if your
> replication factor is 2 and you have 10 nodes, you can still only lose 1.
> With a replication factor of 3 you can lose one node and still satisfy the
> query.
> Ryan Svihla <rs...@foundev.pro> wrote on Thu, 9 Mar 2017 at 18:09:
>
> What are your keyspace replication settings, and what's your query?
>
> On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges <sh...@liveperson.com>
> wrote:
>
> Hi Cassandra Users,
>
> I hope someone could help me understand the following scenario:
>
> Version: 3.0.9
> 3 nodes per DC
> 3 DCs in the cluster.
> Consistency Local_Quorum.
>
> I did a small resiliency test and dropped a node to check the availability
> of the data.
> What I assumed would happen is nothing at all. If a node is down in a 3
> nodes DC, Local_Quorum should still be satisfied.
> However, during the first ~10 seconds after stopping the service, I got
> timeout errors (tried it both from the client and from cqlsh).
>
> This is the error I get:
> *ServerError:
> com.google.common.util.concurrent.UncheckedExecutionException:
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
> received only 4 responses.*
>
>
> After ~10 seconds, the same query is successful with no timeout errors.
> The dropped node is still down.
>
> Any idea what could cause this and how to fix it?
>
> Thanks!
>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>
>
>
>
> --
>
> Thanks,
> Ryan Svihla
>
>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>

Re: A Single Dropped Node Fails Entire Read Queries

Posted by Shalom Sagges <sh...@liveperson.com>.
@Ryan, my keyspace replication settings are as follows:
CREATE KEYSPACE mykeyspace WITH replication = {'class':
'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}  AND
durable_writes = true;

CREATE TABLE mykeyspace.test (
    column1 text,
    column2 text,
    column3 text,
    PRIMARY KEY (column1, column2)
);

The query is *select * from mykeyspace.test where column1='xxxxx';*
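
For completeness, this is roughly how the same read is issued from the
application side at LOCAL_QUORUM with the DataStax Java driver 3.x (the
contact point and key value below are placeholders, not our real ones):

// Sketch of the client-side read at LOCAL_QUORUM; contact point and key
// value are placeholders.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class QuorumReadTest {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("x.x.x.236").build();
             Session session = cluster.connect("mykeyspace")) {
            Statement read = new SimpleStatement(
                    "SELECT * FROM test WHERE column1 = ?", "xxxxx")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            for (Row row : session.execute(read)) {
                System.out.println(row.getString("column2"));
            }
        }
    }
}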

@Daniel, the replication factor is 3. That's why I don't understand why I
get these timeouts when only one node drops.

Also, when I enabled tracing, I got the following error:
*Unable to fetch query trace: ('Unable to complete the operation against
any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error from server:
code=1000 [Unavailable exception] message="Cannot achieve consistency level
LOCAL_QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1,
\'consistency\': \'LOCAL_QUORUM\'}',)})*
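
As a side note, the trace can also be pulled through the driver instead of
cqlsh; a minimal sketch follows (the class and method names are mine, just
for illustration):

// Sketch only: fetch the server-side trace for a statement via the driver
// rather than cqlsh's TRACING ON. Session and statement come from the caller.
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

public class TraceHelper {
    static void printTrace(Session session, Statement stmt) {
        ResultSet rs = session.execute(stmt.enableTracing());
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
        for (QueryTrace.Event event : trace.getEvents()) {
            System.out.println(event.getSource() + "  " + event.getDescription());
        }
    }
}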

But nodetool status shows that only 1 replica was down:
--  Address    Load       Tokens  Owns  Host ID                               Rack
DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1


I tried to run the same scenario on all 3 nodes, and only the 3rd node
didn't fail the query when I dropped it. The nodes were installed and
configured with Puppet so the configuration is the same on all 3 nodes.


Thanks!



On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko <
daniel.hoelbling-inzko@bitmovin.com> wrote:

> LOCAL_QUORUM works on the available replicas in the local DC. So if your
> replication factor is 2 and you have 10 nodes, you can still only lose 1.
> With a replication factor of 3 you can lose one node and still satisfy the
> query.
> Ryan Svihla <rs...@foundev.pro> wrote on Thu, 9 Mar 2017 at 18:09:
>
>> What are your keyspace replication settings, and what's your query?
>>
>> On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges <sh...@liveperson.com>
>> wrote:
>>
>> Hi Cassandra Users,
>>
>> I hope someone could help me understand the following scenario:
>>
>> Version: 3.0.9
>> 3 nodes per DC
>> 3 DCs in the cluster.
>> Consistency Local_Quorum.
>>
>> I did a small resiliency test and dropped a node to check the
>> availability of the data.
>> What I assumed would happen is nothing at all. If a node is down in a 3
>> nodes DC, Local_Quorum should still be satisfied.
>> However, during the first ~10 seconds after stopping the service, I got
>> timeout errors (tried it both from the client and from cqlsh).
>>
>> This is the error I get:
>> *ServerError:
>> com.google.common.util.concurrent.UncheckedExecutionException:
>> com.google.common.util.concurrent.UncheckedExecutionException:
>> java.lang.RuntimeException:
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>> received only 4 responses.*
>>
>>
>> After ~10 seconds, the same query is successful with no timeout errors.
>> The dropped node is still down.
>>
>> Any idea what could cause this and how to fix it?
>>
>> Thanks!
>>
>>
>> This message may contain confidential and/or privileged information.
>> If you are not the addressee or authorized to receive this on behalf of
>> the addressee you must not use, copy, disclose or take action based on this
>> message or any information herein.
>> If you have received this message in error, please advise the sender
>> immediately by reply email and delete this message. Thank you.
>>
>>
>>
>>
>> --
>>
>> Thanks,
>> Ryan Svihla
>>
>>

-- 
This message may contain confidential and/or privileged information. 
If you are not the addressee or authorized to receive this on behalf of the 
addressee you must not use, copy, disclose or take action based on this 
message or any information herein. 
If you have received this message in error, please advise the sender 
immediately by reply email and delete this message. Thank you.

Re: A Single Dropped Node Fails Entire Read Queries

Posted by Daniel Hölbling-Inzko <da...@bitmovin.com>.
LOCAL_QUORUM works on the available replicas in the local DC. So if your
replication factor is 2 and you have 10 nodes, you can still only lose 1.
With a replication factor of 3 you can lose one node and still satisfy the
query.
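
Put differently (an illustration only, not Cassandra source code): the
per-DC quorum is floor(RF / 2) + 1, so RF = 3 tolerates one down replica
per DC, while RF = 2 tolerates none.

// Illustrative quorum arithmetic; not taken from the Cassandra code base.
public final class QuorumMath {
    static int localQuorum(int replicationFactor) {
        return replicationFactor / 2 + 1;   // floor(RF / 2) + 1
    }

    public static void main(String[] args) {
        System.out.println(localQuorum(3)); // 2 -> one replica per DC may be down
        System.out.println(localQuorum(2)); // 2 -> no replica may be down
    }
}
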
Ryan Svihla <rs...@foundev.pro> wrote on Thu, 9 Mar 2017 at 18:09:

> What are your keyspace replication settings, and what's your query?
>
> On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges <sh...@liveperson.com>
> wrote:
>
> Hi Cassandra Users,
>
> I hope someone could help me understand the following scenario:
>
> Version: 3.0.9
> 3 nodes per DC
> 3 DCs in the cluster.
> Consistency Local_Quorum.
>
> I did a small resiliency test and dropped a node to check the availability
> of the data.
> What I assumed would happen is nothing at all. If a node is down in a 3
> nodes DC, Local_Quorum should still be satisfied.
> However, during the first ~10 seconds after stopping the service, I got
> timeout errors (tried it both from the client and from cqlsh).
>
> This is the error I get:
> *ServerError:
> com.google.common.util.concurrent.UncheckedExecutionException:
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
> received only 4 responses.*
>
>
> After ~10 seconds, the same query is successful with no timeout errors.
> The dropped node is still down.
>
> Any idea what could cause this and how to fix it?
>
> Thanks!
>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>
>
>
>
> --
>
> Thanks,
> Ryan Svihla
>
>

Re: A Single Dropped Node Fails Entire Read Queries

Posted by Ryan Svihla <rs...@foundev.pro>.
What are your keyspace replication settings, and what's your query?

On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges <sh...@liveperson.com>
wrote:

> Hi Cassandra Users,
>
> I hope someone could help me understand the following scenario:
>
> Version: 3.0.9
> 3 nodes per DC
> 3 DCs in the cluster.
> Consistency Local_Quorum.
>
> I did a small resiliency test and dropped a node to check the availability
> of the data.
> What I assumed would happen is nothing at all. If a node is down in a 3
> nodes DC, Local_Quorum should still be satisfied.
> However, during the first ~10 seconds after stopping the service, I got
> timeout errors (tried it both from the client and from cqlsh).
>
> This is the error I get:
> *ServerError:
> com.google.common.util.concurrent.UncheckedExecutionException:
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
> received only 4 responses.*
>
>
> After ~10 seconds, the same query is successful with no timeout errors.
> The dropped node is still down.
>
> Any idea what could cause this and how to fix it?
>
> Thanks!
>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>



-- 

Thanks,
Ryan Svihla