Posted to user@cassandra.apache.org by Chris Burroughs <ch...@gmail.com> on 2013/09/05 15:14:38 UTC
Multi-dc restart impact
We have a 2 DC cluster running cassandra 1.2.9. They are in actual
physically separate DCs on opposite coasts of the US, not just logical
ones. The primary use of this cluster is CL.ONE reads out of a single
column family. My expectation was that in such a scenario restarts
would have minimal impact in the DC where the restart occurred, and no
impact in the remote DC.
We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs "A" and "B").
Test scenario on a node in DC "A":
* disablegossip: no change
* drain: no change
* stop node: no change
* start node again: Large increase in latency in both DCs A *and* B
This is a graph showing the increase in latency
(org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile)
from DC *B*: http://i.imgur.com/OkIQyXI.png (actual clients report
similar numbers that agree with this server-side measurement). Latency
jumps by over an order of magnitude and out of SLAs. (I would prefer
restarting to not cause a latency spike in either DC, but the one
induced in the remote DC is particularly concerning.)
However, the node that was restarted reports only a minor increase in
latency: http://i.imgur.com/KnGEJrE.png
This is confusing from several different angles:
* I would not expect any cross-dc reads to normally be occurring
* If there were cross-DC reads, they would take 50+ ms instead of the
< 5 ms normally reported
* If the node that was restarted was still somehow involved in reads,
its reporting shows it can account for only a small part of the
latency increase.
Some possible relevant configurations:
* GossipingPropertyFileSnitch
* dynamic_snitch_update_interval_in_ms: 100
* dynamic_snitch_reset_interval_in_ms: 600000
* dynamic_snitch_badness_threshold: 0.1
* read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (same
type of behavior was observed with just read_repair_chance=0.1)
Has anyone else observed similar behavior and found a way to limit it?
This seems like something that ought not to happen, but without knowing
why it is occurring I'm not sure how to stop it.
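For what it's worth, the dynamic snitch settings listed above can be
sketched with a simplified, hypothetical model (not the actual
DynamicEndpointSnitch code): with dynamic_snitch_badness_threshold: 0.1,
the snitch keeps routing to the "natural" closest replica until its
latency score is more than 10% worse than the best alternative.

```python
def preferred_replica(scores, badness_threshold=0.1):
    """Simplified model of dynamic-snitch replica choice. `scores` maps
    replica -> latency score (lower is better); the first key is the
    subsnitch's natural choice. The natural replica is kept unless its
    score exceeds the best score by more than badness_threshold."""
    natural = next(iter(scores))
    best = min(scores, key=scores.get)
    if scores[natural] > scores[best] * (1 + badness_threshold):
        return best
    return natural

# A freshly restarted node with cold caches scores worse, but is only
# routed around once it is >10% slower than the best peer:
print(preferred_replica({"restarted": 1.05, "peer": 1.00}))  # restarted
print(preferred_replica({"restarted": 2.00, "peer": 1.00}))  # peer
```

Under this model a just-restarted node can keep receiving reads while it
warms up, and dynamic_snitch_reset_interval_in_ms (600000, i.e. every 10
minutes) periodically wipes the scores, which could briefly send traffic
back to a slow node.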
Re: Multi-dc restart impact
Posted by Chris Burroughs <ch...@gmail.com>.
Thanks, double checked; reads are CL.ONE.
On 10/10/2013 11:15 AM, J. Ryan Earl wrote:
> Are you doing QUORUM reads instead of LOCAL_QUORUM reads?
>
> [...]
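For reference on the QUORUM vs. LOCAL_QUORUM distinction raised above,
here is a small sketch of how many replica responses a coordinator
blocks for at each level (assuming a hypothetical RF of 3 per DC; a
simplified model, not driver or server code):

```python
def quorum(n):
    # a quorum is a strict majority of the replicas counted
    return n // 2 + 1

def blocked_for(cl, rf_per_dc, local_dc):
    """Replica responses the coordinator waits for at each consistency
    level. QUORUM counts replicas across *all* DCs, so it can force a
    cross-coast round trip; LOCAL_QUORUM and ONE need not leave the
    coordinator's DC."""
    if cl == "QUORUM":
        return quorum(sum(rf_per_dc.values()))
    if cl == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])
    if cl == "ONE":
        return 1
    raise ValueError(cl)

rf = {"A": 3, "B": 3}                        # hypothetical RF 3 per DC
print(blocked_for("QUORUM", rf, "A"))        # 4 -> must include remote DC
print(blocked_for("LOCAL_QUORUM", rf, "A"))  # 2 -> stays in DC A
print(blocked_for("ONE", rf, "A"))           # 1
```

With RF 3 in each of two DCs, QUORUM needs 4 of 6 responses, so at least
one must cross the WAN; that is why the question matters, though it does
not apply to CL.ONE reads.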
Re: Multi-dc restart impact
Posted by "J. Ryan Earl" <os...@jryanearl.us>.
Are you doing QUORUM reads instead of LOCAL_QUORUM reads?
On Wed, Oct 9, 2013 at 7:41 PM, Chris Burroughs
<ch...@gmail.com> wrote:
> I have not been able to do the test with the 2nd cluster, but have been
> given a disturbing data point. We had a disk slowly fail causing a
> significant performance degradation that was only resolved when the "sick"
> node was killed.
> * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1
> * perf in other DC: http://i.imgur.com/gEMrLyF.png?1
>
> Not only was a single slow node able to cause an order-of-magnitude
> performance hit in a DC, but the other DC fared *worse*.
>
>
> [...]
Re: Multi-dc restart impact
Posted by Chris Burroughs <ch...@gmail.com>.
I have not been able to do the test with the 2nd cluster, but have been
given a disturbing data point. We had a disk slowly fail causing a
significant performance degradation that was only resolved when the
"sick" node was killed.
* Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1
* perf in other DC: http://i.imgur.com/gEMrLyF.png?1
Not only was a single slow node able to cause an order-of-magnitude
performance hit in a DC, but the other DC fared *worse*.
On 09/18/2013 08:50 AM, Chris Burroughs wrote:
> On 09/17/2013 04:44 PM, Robert Coli wrote:
>> [...]
>>
>> Did you end up filing a JIRA on this, or some other outcome?
>>
>> =Rob
>>
>
>
> No. I am currently in the process of taking a 2nd cluster from being
> single to dual DC. Once that is done I was going to repeat the test
> with each cluster and gather as much information as reasonable.
Re: Multi-dc restart impact
Posted by Chris Burroughs <ch...@gmail.com>.
On 09/17/2013 04:44 PM, Robert Coli wrote:
> [...]
> Did you end up filing a JIRA on this, or some other outcome?
>
> =Rob
>
No. I am currently in the process of taking a 2nd cluster from being
single to dual DC. Once that is done I was going to repeat the test
with each cluster and gather as much information as reasonable.
Re: Multi-dc restart impact
Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
<ch...@gmail.com> wrote:
> [...]
Did you end up filing a JIRA on this, or some other outcome?
=Rob