Posted to user@cassandra.apache.org by Venkat Rama <ve...@gmail.com> on 2012/09/06 05:22:43 UTC

Monitoring replication lag/latency in multi DC setup

Hi,

We have a multi-DC Cassandra ring with two data centers, and we use LOCAL_QUORUM
for both writes and reads.  The network between the DCs is sometimes flaky, with
outages lasting from a few minutes to a few tens of minutes.

I would like to know the best way to measure or monitor the lag or replication
latency between the data centers.  Are there any metrics I can monitor to find
the backlog of data that still needs to be transferred?

Thanks in advance.

VR

Re: Monitoring replication lag/latency in multi DC setup

Posted by aaron morton <aa...@thelastpickle.com>.
> Is there a specific metric you can recommend?


The not entirely correct but very lightweight approach would be to look at the size of the HintsColumnFamily in the system keyspace.
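
For example, an untested sketch like the one below reads those size attributes over JMX. It assumes the default JMX port 7199 and the 1.x naming for the ColumnFamilyStore MBean and its attributes (LiveDiskSpaceUsed, MemtableDataSize); both can differ in other versions, so check what JConsole shows for your build.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HintsCfSize {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Cassandra's default JMX port is 7199.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // ColumnFamilyStore MBean for the hints CF in the system keyspace
            // (1.x-style name; adjust keyspace/columnfamily for your version).
            ObjectName hintsCf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=HintsColumnFamily");
            Object onDisk = mbs.getAttribute(hintsCf, "LiveDiskSpaceUsed");
            Object inMemtable = mbs.getAttribute(hintsCf, "MemtableDataSize");
            System.out.println("Hints on disk (bytes):     " + onDisk);
            System.out.println("Hints in memtable (bytes): " + inMemtable);
        } finally {
            connector.close();
        }
    }
}

Run it against each node in the local DC; a number that keeps growing while the inter-DC link is down gives a rough feel for the backlog (rough because deleted hints only disappear from disk after compaction).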

If you want an exact number, use the functions on the HintedHandOffManager MBean: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/HintedHandOffManagerMBean.java

Note that counting the number of hints involves reading through all of the stored hints, so it can take a while. 
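
If you want to poke at that MBean without coding against the interface directly, an untested sketch like this one connects over JMX, finds the hinted handoff MBean by name pattern, prints whatever operations your build actually exposes, and calls listEndpointsPendingHints if it is among them (the exact operation set varies by version, so check the linked interface):

import java.util.Set;
import javax.management.MBeanOperationInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingHints {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Look the MBean up by pattern rather than hard-coding the exact
            // name, since it can differ slightly between versions.
            Set<ObjectName> names = mbs.queryNames(
                    new ObjectName("org.apache.cassandra.db:type=Hinted*,*"), null);
            for (ObjectName name : names) {
                System.out.println("Found: " + name);
                for (MBeanOperationInfo op : mbs.getMBeanInfo(name).getOperations()) {
                    System.out.println("  operation: " + op.getName());
                    if (op.getName().equals("listEndpointsPendingHints")) {
                        // No-arg operation: list the endpoints we hold hints for.
                        Object endpoints = mbs.invoke(name, op.getName(),
                                new Object[0], new String[0]);
                        System.out.println("  endpoints with pending hints: " + endpoints);
                    }
                }
            }
        } finally {
            connector.close();
        }
    }
}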

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


Re: Monitoring replication lag/latency in multi DC setup

Posted by Venkat Rama <ve...@gmail.com>.
Is there a specific metric you can recommend?

VR


Re: Monitoring replication lag/latency in multi DC setup

Posted by Mohit Anchlia <mo...@gmail.com>.
Cassandra exposes a lot of metrics through JMX; you might be able to get some of
the information you need by browsing them in JConsole.
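
If you want to see everything that is there without clicking around, an untested sketch like the one below lists every MBean Cassandra registers under org.apache.cassandra.* (essentially what JConsole shows in its tree); it assumes the default JMX port 7199:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListCassandraMBeans {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Everything registered under the org.apache.cassandra.* domains.
            Set<ObjectName> names = mbs.queryNames(
                    new ObjectName("org.apache.cassandra.*:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}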


Re: Monitoring replication lag/latency in multi DC setup

Posted by Venkat Rama <ve...@gmail.com>.
Thanks for the quick reply, Mohit.  Can we measure/monitor the size of the
hinted handoffs?  Would that be a good enough indicator of the backlog?

Although we know when the network is flaky, we are interested in knowing how
much data is piling up in the local DC that still needs to be transferred.

Greatly appreciate your help.

VR



Re: Monitoring replication lag/latency in multi DC setup

Posted by Mohit Anchlia <mo...@gmail.com>.
As far as I know, Cassandra doesn't use an internal queueing mechanism specific
to replication.  Cassandra sends the write to the remote DC, and after that it's
up to the TCP/IP stack to deal with buffering.  If requests start to time out,
Cassandra stores hinted handoffs (HH) for up to a configured window; for a longer
outage you would have to run repair.

Also look at TCP/IP tuning parameters that can help in your scenario:

http://kaivanov.blogspot.com/2010/09/linux-tcp-tuning.html

Run iperf and test the latency between the DCs.
