Posted to user@cassandra.apache.org by James Masson <ja...@opigram.com> on 2013/01/02 12:28:09 UTC

Re: Cassandra read throughput with little/no caching.

On 31/12/12 18:45, Tyler Hobbs wrote:
> On Mon, Dec 31, 2012 at 11:24 AM, James Masson
> <james.masson@opigram.com> wrote:
>
>
>     Well, it turns out the Read-Request Latency graph in Ops-Center is
>     highly misleading.
>
>     Using jconsole, the read-latency for the column family in question
>     is actually normally around 800 microseconds, punctuated by
>     occasional big spikes that drive up the averages.
>
>     Towards the end of the batch process, the Opscenter reported average
>     latency is up above 4000 microsecs, and forced compactions no longer
>     help drive the latency down again.
>
>     I'm going to stop relying on OpsCenter for data for performance
>     analysis metrics, it just doesn't have the resolution.
>
>
> James, it's worth pointing out that Read Request Latency in OpsCenter is
> measuring at the coordinator level, so it includes the time spent
> sending requests to replicas and waiting for a response.  There's
> another latency metric that is per-column family named Local Read
> Latency; it sounds like this is the equivalent number that you were
> looking at in jconsole.  This metric basically just includes the time to
> read local caches/memtables/sstables.
>
> We are looking to rename one or both of the metrics for clarity; any
> input here would be helpful. For example, we're considering "Coordinated
> Read Request Latency" or "Client Read Request Latency" in place of just
> "Read Request Latency".
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>


Hi Tyler,

thanks for clarifying this. So you're saying the difference between the 
global Read Request latency in opscenter, and the column family specific 
one is in the effort coordinating a validated read across multiple 
replicas? Is this not part of what Hector does for itself?

Essentially, I'm looking to see whether I can use this to derive where 
any extra latency from a client request comes from.

As for names, I'd suggest "cluster coordinated read request latency"; a
bit of a mouthful, I know.

Is there anywhere I can find concrete definitions of what the stats in 
OpsCenter, and raw Cassandra via JMX mean? The docs I've found seem 
quite ambiguous.

I still think that the data resolution that OpsCenter gives makes it
more suitable for trending/alerting rather than chasing down tricky
performance issues. This sort of investigation work is what I do for a
living; I typically use intervals of 10 seconds or lower, and don't
average my data. Although, storing your data inside the database you're
measuring does restrict your options a little :-)
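
To illustrate why I avoid averages, here's a quick sketch (Python, with
made-up numbers loosely based on the ~800 microsecond reads and
occasional spikes I mentioned earlier):

```python
import statistics

def summarize(samples_us):
    """Summarize raw latency samples (microseconds) without averaging
    away spikes: report median and p99 alongside the mean."""
    ordered = sorted(samples_us)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }

# Mostly ~800us reads, with a few big compaction-style spikes mixed in.
samples = [800] * 95 + [20000] * 5
stats = summarize(samples)
print(int(stats["median"]))   # 800   -- the typical read
print(round(stats["mean"]))   # 1760  -- the spikes drag the average up
print(stats["p99"])           # 20000 -- the spikes themselves
```

The median tells you what a normal read costs; the mean, over a long
enough window, tells you almost nothing.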

regards

James M



Re: Cassandra read throughput with little/no caching.

Posted by Tyler Hobbs <ty...@datastax.com>.
>
> Your description above was much better :-) I'm more interested in docs for
> the raw metrics provided in JMX.


I don't think there are any good docs for what is exposed directly through
JMX.  Most of the OpsCenter metrics map closely to one exposed JMX item, so
that's a start.  Other than that, your best bet at the moment for accurate
descriptions is to read the source.  Methods and attributes exposed through
JMX follow a particular format (MBean classes, naming conventions) that
makes them pretty easy to find.
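
For example, the per-column-family MBeans can be addressed by building
the ObjectName string from the keyspace and column family. A small
sketch (the domain and key names below follow the Cassandra 1.x
convention as I understand it; check your version's source if they
differ):

```python
def cf_mbean_name(keyspace, column_family):
    """Build the JMX ObjectName string for a per-column-family MBean,
    assuming the Cassandra 1.x naming convention
    (org.apache.cassandra.db:type=ColumnFamilies,...)."""
    return ("org.apache.cassandra.db:type=ColumnFamilies,"
            "keyspace=%s,columnfamily=%s" % (keyspace, column_family))

# You would then point jconsole (or any JMX client) at this name and
# read attributes such as the recent read latency.
print(cf_mbean_name("demo", "users"))
```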

> That would be a great feature, but it's quite difficult to capture
> high-resolution data without disturbing the system you're trying to
> measure.
>

True, but fortunately Cassandra doesn't tend to be CPU bound, so simply
sampling JMX data is unlikely to impact normal performance.

Perhaps worth taking the data-capture points off-list?


Sure, I'd love to hear your ideas.


On Wed, Jan 2, 2013 at 11:41 AM, James Masson <ja...@opigram.com> wrote:

>
>
> On 02/01/13 16:18, Tyler Hobbs wrote:
>
>> On Wed, Jan 2, 2013 at 5:28 AM, James Masson
>> <james.masson@opigram.com> wrote:
>>
>
>> 1) Hector sends a request to some node in the cluster, which will act as
>> the coordinator.
>> 2) The coordinator then sends the actual read requests out to each of
>> the (RF) replicas.
>> 3a) The coordinator waits for responses from the replicas; how many it
>> waits for depends on the consistency level.
>> 3b) The replicas perform actual cache/memtable/sstable reads and respond
>> to the coordinator when complete
>> 4) Once the required number of replicas have responded, the coordinator
>> replies to the client (Hector).
>>
>> The Read Request Latency metric is measuring the time taken in steps 2
>> through 4.  The CF Local Read Latency metric is only capturing the time
>> taken in step 3b.
>>
>>
>>
> Great, that's exactly the level of detail I'm looking for.
>
>
>
>>
>>     Is there anywhere I can find concrete definitions of what the stats
>>     in OpsCenter, and raw Cassandra via JMX mean? The docs I've found
>>     seem quite ambiguous.
>>
>>
>> This has pretty good writeups of each:
>> http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics
>>
>
> Your description above was much better :-) I'm more interested in docs for
> the raw metrics provided in JMX.
>
>
>
>>
>>     I still think that the data resolution that OpsCenter gives makes it
>>     more suitable for trending/alerting rather than chasing down tricky
>>     performance issues. This sort of investigation work is what I do for
>>     a living; I typically use intervals of 10 seconds or lower, and
>>     don't average my data. Although, storing your data inside the
>>     database you're measuring does restrict your options a little :-)
>>
>>
>> True, there's a limit to what you can detect with 60 second resolution.
>> We've considered being able to report metrics at a finer resolution
>> without durably storing them anywhere, which would be useful for when
>> you're actively watching the cluster.
>>
>
> That would be a great feature, but it's quite difficult to capture
> high-resolution data without disturbing the system you're trying to
> measure.
>
> Perhaps worth taking the data-capture points off-list?
>
> James M
>



-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.

On 02/01/13 16:18, Tyler Hobbs wrote:
> On Wed, Jan 2, 2013 at 5:28 AM, James Masson
> <james.masson@opigram.com> wrote:
 >
> 1) Hector sends a request to some node in the cluster, which will act as
> the coordinator.
> 2) The coordinator then sends the actual read requests out to each of
> the (RF) replicas.
> 3a) The coordinator waits for responses from the replicas; how many it
> waits for depends on the consistency level.
> 3b) The replicas perform actual cache/memtable/sstable reads and respond
> to the coordinator when complete
> 4) Once the required number of replicas have responded, the coordinator
> replies to the client (Hector).
>
> The Read Request Latency metric is measuring the time taken in steps 2
> through 4.  The CF Local Read Latency metric is only capturing the time
> taken in step 3b.
>
>

Great, that's exactly the level of detail I'm looking for.

>
>
>     Is there anywhere I can find concrete definitions of what the stats
>     in OpsCenter, and raw Cassandra via JMX mean? The docs I've found
>     seem quite ambiguous.
>
>
> This has pretty good writeups of each:
> http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics

Your description above was much better :-) I'm more interested in docs 
for the raw metrics provided in JMX.

>
>
>     I still think that the data resolution that OpsCenter gives makes it
>     more suitable for trending/alerting rather than chasing down tricky
>     performance issues. This sort of investigation work is what I do for
>     a living; I typically use intervals of 10 seconds or lower, and
>     don't average my data. Although, storing your data inside the
>     database you're measuring does restrict your options a little :-)
>
>
> True, there's a limit to what you can detect with 60 second resolution.
> We've considered being able to report metrics at a finer resolution
> without durably storing them anywhere, which would be useful for when
> you're actively watching the cluster.

That would be a great feature, but it's quite difficult to capture
high-resolution data without disturbing the system you're trying to
measure.

Perhaps worth taking the data-capture points off-list?

James M

Re: Cassandra read throughput with little/no caching.

Posted by Tyler Hobbs <ty...@datastax.com>.
On Wed, Jan 2, 2013 at 5:28 AM, James Masson <ja...@opigram.com> wrote:

>
> thanks for clarifying this. So you're saying the difference between the
> global Read Request latency in opscenter, and the column family specific
> one is in the effort coordinating a validated read across multiple replicas?


Yes.


> Is this not part of what Hector does for itself?
>

No.  Here's the basic order of events:

1) Hector sends a request to some node in the cluster, which will act as
the coordinator.
2) The coordinator then sends the actual read requests out to each of the
(RF) replicas.
3a) The coordinator waits for responses from the replicas; how many it
waits for depends on the consistency level.
3b) The replicas perform actual cache/memtable/sstable reads and respond to
the coordinator when complete
4) Once the required number of replicas have responded, the coordinator
replies to the client (Hector).

The Read Request Latency metric is measuring the time taken in steps 2
through 4.  The CF Local Read Latency metric is only capturing the time
taken in step 3b.
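
To make the relationship concrete, here's a toy model of steps 2-4
(Python; a sketch only, not Cassandra's actual code path, and it
ignores network and queueing overhead):

```python
def coordinator_latency(replica_latencies_us, responses_needed):
    """Steps 2-4 in miniature: the coordinator fires read requests at
    all replicas in parallel and replies to the client once
    `responses_needed` replicas (determined by the consistency level)
    have answered, so its latency is the k-th fastest replica's local
    read latency (step 3b)."""
    if not 1 <= responses_needed <= len(replica_latencies_us):
        raise ValueError("need between 1 and RF responses")
    return sorted(replica_latencies_us)[responses_needed - 1]

# RF=3; local reads take 800us, 900us, 5000us (one slow replica).
print(coordinator_latency([800, 5000, 900], 1))  # ONE    -> 800
print(coordinator_latency([800, 5000, 900], 2))  # QUORUM -> 900
print(coordinator_latency([800, 5000, 900], 3))  # ALL    -> 5000
```

This is also why the coordinator-level metric can sit well above the
per-CF local read latency: at higher consistency levels one slow
replica dominates.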


> Essentially, I'm looking to see whether I can use this to derive where any
> extra latency from a client request comes from.
>

Yes, using the two numbers in conjunction can be very informative.  Also,
you might be interested in the new query tracing feature in 1.2, which
shows very detailed steps and their latencies.


>
> As for names, I'd suggest "cluster coordinated read request latency"; a
> bit of a mouthful, I know.
>

Awesome, thanks for your input.


>
> Is there anywhere I can find concrete definitions of what the stats in
> OpsCenter, and raw Cassandra via JMX mean? The docs I've found seem quite
> ambiguous.
>

This has pretty good writeups of each:
http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics


>
> I still think that the data resolution that OpsCenter gives makes it more
> suitable for trending/alerting rather than chasing down tricky performance
> issues. This sort of investigation work is what I do for a living; I
> typically use intervals of 10 seconds or lower, and don't average my data.
> Although, storing your data inside the database you're measuring does
> restrict your options a little :-)


True, there's a limit to what you can detect with 60 second resolution.
We've considered being able to report metrics at a finer resolution without
durably storing them anywhere, which would be useful for when you're
actively watching the cluster.
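
One simple shape for that (a Python sketch of the idea, not anything
OpsCenter actually does): a fixed-size in-memory ring buffer, so
fine-grained samples are available while you watch but never hit disk.

```python
from collections import deque

class RecentMetrics:
    """Keep only the last N samples in memory: finer-grained reporting
    with no durable storage."""
    def __init__(self, max_samples=360):  # e.g. one hour at 10s intervals
        self._samples = deque(maxlen=max_samples)

    def record(self, value):
        # deque with maxlen silently evicts the oldest entry when full.
        self._samples.append(value)

    def snapshot(self):
        return list(self._samples)

m = RecentMetrics(max_samples=3)
for v in [1, 2, 3, 4]:
    m.record(v)
print(m.snapshot())  # [2, 3, 4] -- the oldest sample was dropped
```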

Thanks!

-- 
Tyler Hobbs
DataStax <http://datastax.com/>