Posted to user@cassandra.apache.org by Alexandru Dan Sicoe <si...@googlemail.com> on 2011/10/11 16:44:10 UTC

CompletedTasks attribute exposed via JMX

Hello everyone,
 I was trying to get some cluster-wide statistics on the total insertions
performed in my 3-node Cassandra 0.8.6 cluster. So I wrote a nice little
program that gets the CompletedTasks attribute of
org.apache.cassandra.db:type=Commitlog from every node, sums up the values
and records them in a .csv every 10 sec or so. Everything works and I get my
stats, but later I realized that I am not really sure what this measure
means. I think it is the number of individual column insertions performed!
Am I correct?
 In the meantime I installed the trial version of the DataStax Operations
Center. The cluster-wide dashboard, showing writes performed as a function
of time, gives me much smaller rates than the measurement I described
above. The DataStax writes/sec are of the same order of magnitude as the
batch writes I perform on the cluster. But somehow I cannot relate this rate
to the rate of my CompletedTasks measurement.

How do people usually measure insertion rates for their clusters? Per batch,
per single column, or is the actual data rate more important to know?

Cheers,
Alexandru
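
A minimal sketch of the kind of polling program described in this message, assuming Python. The `read_counter` callable is a hypothetical stand-in for whatever JMX client performs the actual attribute read (the Python standard library has none), and the node names and counter values are made up for illustration.

```python
import csv
import io
import time


def record_sample(writer, read_counter, nodes, now=None):
    """Read one counter value per node, sum them, and append a CSV row."""
    timestamp = time.time() if now is None else now
    values = [read_counter(node) for node in nodes]
    total = sum(values)
    # One row per poll: timestamp, each node's value, then the cluster total.
    writer.writerow([timestamp, *values, total])
    return total


# Hypothetical stand-in for a real JMX read; fixed numbers for illustration.
fake_counters = {"node1": 120, "node2": 95, "node3": 130}

buffer = io.StringIO()  # a real program would open a .csv file instead
writer = csv.writer(buffer)
writer.writerow(["timestamp", "node1", "node2", "node3", "total"])
total = record_sample(writer, fake_counters.get, list(fake_counters), now=0)
print(total)  # prints 345
```

Calling `record_sample` in a loop with a 10-second sleep would reproduce the sampling cadence mentioned above. The recorded values are cumulative counters, so a rate still has to be derived from the difference between consecutive rows.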

Re: CompletedTasks attribute exposed via JMX

Posted by aaron morton <aa...@thelastpickle.com>.
StorageProxy will give you the total writes through the server, for all CFs.

The CommitLog thread pool is not what you want. It's not designed to measure column or row throughput; it just counts how many tasks have run through the thread pool.

The closest thing to a count of columns is MemtableColumnCount in the per-CF stats in JMX (and cfstats in nodetool). It is updated here: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/Memtable.java#L196

Note:
- it only counts top-level columns, not subcolumns
- it includes deletes
- it is per Memtable, so it is cleared when a new memtable is switched in
- the number is also included in the logs when the memtable is flushed
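
Because the counter is per Memtable, a poller has to cope with it dropping back toward zero. A rough sketch in Python of one way to do that, assuming the MemtableColumnCount values have already been sampled into a list; the function name and the sample numbers are illustrative only. A sample lower than the previous one is treated as a freshly switched-in memtable.

```python
def columns_written(samples):
    """Estimate columns written from successive MemtableColumnCount samples.

    A value lower than the previous sample is assumed to mean the memtable
    was switched, so the new value is counted from zero.
    """
    total = 0
    previous = None
    for value in samples:
        if previous is None:
            pass  # the first sample only sets the baseline
        elif value >= previous:
            total += value - previous  # same memtable, count the increase
        else:
            total += value  # counter was reset by a memtable switch
        previous = value
    return total


samples = [100, 250, 400, 30, 90]
print(columns_written(samples))  # prints 390
```

This is only an estimate: columns written between the last sample and a flush are missed, and a memtable that is switched and refilled past the previous value within one polling interval looks like an ordinary increase. The caveats in the list above (top-level columns only, deletes included) still apply to the result.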

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 12/10/2011, at 8:31 PM, Alexandru Dan Sicoe wrote:

> Thanks for the quick replies guys!
> 
> Just to explain why I wanted to understand these two measures: I do batch inserts to Cassandra, but the batches are not fixed in size, i.e. the number of columns in a batch varies, and the data type of the values placed in the columns also varies (the column name is always a long, namely a timestamp). This makes it hard to predict the actual data rate I am sending to Cassandra. I thought that if I could get a cluster-wide measurement of the batch insertions per second and also of the individual column insertions per second, I could better understand what's happening.
> 
> So, from what you guys said I understand that:
> - the StorageProxy WriteOperations attribute gives me the batch insertions per second sent to the cluster (so this is fine)
> - the Commitlog CompletedTasks attribute is definitely a closer measurement to the single column insertions, but it is not accurate (i.e. it will be higher) because several types of row mutations can happen when any column is inserted. How close is this measurement to the single column insertions per second I want to obtain? Is there anything I can use to get a more accurate measurement of the single column insertions per second, or is it good enough?
> 
> Cheers,
> Alexandru


Re: CompletedTasks attribute exposed via JMX

Posted by Alexandru Dan Sicoe <si...@googlemail.com>.
Thanks for the quick replies guys!

Just to explain why I wanted to understand these two measures: I do batch
inserts to Cassandra, but the batches are not fixed in size, i.e. the number
of columns in a batch varies, and the data type of the values placed in the
columns also varies (the column name is always a long, namely a timestamp).
This makes it hard to predict the actual data rate I am sending to
Cassandra. I thought that if I could get a cluster-wide measurement of the
batch insertions per second and also of the individual column insertions per
second, I could better understand what's happening.

So, from what you guys said I understand that:
- the StorageProxy WriteOperations attribute gives me the batch insertions
per second sent to the cluster (so this is fine)
- the Commitlog CompletedTasks attribute is definitely a closer measurement
to the single column insertions, but it is not accurate (i.e. it will be
higher) because several types of row mutations can happen when any column is
inserted. How close is this measurement to the single column insertions per
second I want to obtain? Is there anything I can use to get a more accurate
measurement of the single column insertions per second, or is it good enough?

Cheers,
Alexandru

On Wed, Oct 12, 2011 at 4:18 AM, Tyler Hobbs <ty...@datastax.com> wrote:

> The OpsCenter graph you're referring to basically does the following:
>
> 1. For each node, find out how much the WriteOperations attribute of the
> StorageProxy increased during the last minute.
> 2. Sum these values to get a total for the cluster.
> 3. Divide by 60 to get an average number of WriteOperations per second for
> the cluster.


-- 
Alexandru Dan Sicoe
MEng, CERN Marie Curie ACEOLE Fellow

Re: CompletedTasks attribute exposed via JMX

Posted by Tyler Hobbs <ty...@datastax.com>.
The OpsCenter graph you're referring to basically does the following:

1. For each node, find out how much the WriteOperations attribute of the
StorageProxy increased during the last minute.
2. Sum these values to get a total for the cluster.
3. Divide by 60 to get an average number of WriteOperations per second for
the cluster.
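
The three steps above can be sketched in Python as follows. This is an illustration of the arithmetic only, not OpsCenter's actual code; the function name, node names and counter values are assumptions made for the example.

```python
def cluster_writes_per_second(previous, current, interval_seconds=60):
    """Average cluster-wide write rate between two polls.

    previous / current map node name -> cumulative WriteOperations value
    read at the start and at the end of the interval.
    """
    total_increase = 0
    for node, value_now in current.items():
        # Step 1: how much this node's counter increased over the interval.
        increase = value_now - previous[node]
        # Step 2: sum the increases across all nodes.
        total_increase += increase
    # Step 3: divide by the interval length to get writes per second.
    return total_increase / interval_seconds


before = {"node1": 1000, "node2": 2000, "node3": 1500}
after = {"node1": 1600, "node2": 2900, "node3": 1800}
print(cluster_writes_per_second(before, after))  # prints 30.0
```

Here the three nodes gained 600, 900 and 300 write operations over the minute, 1800 in total, which averages to 30 per second.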

On Tue, Oct 11, 2011 at 3:55 PM, aaron morton <aa...@thelastpickle.com> wrote:

> It's the number of mutations; a mutation is a collection of changes for a
> single row across one or more column families.
>
> Take a look at nodetool cfstats; I assume this is where OpsCenter gets its
> data from.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com


-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

Re: CompletedTasks attribute exposed via JMX

Posted by aaron morton <aa...@thelastpickle.com>.
It's the number of mutations; a mutation is a collection of changes for a single row across one or more column families.

Take a look at nodetool cfstats; I assume this is where OpsCenter gets its data from.

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 12/10/2011, at 3:44 AM, Alexandru Dan Sicoe wrote:

> Hello everyone,
>  I was trying to get some cluster-wide statistics on the total insertions performed in my 3-node Cassandra 0.8.6 cluster. So I wrote a nice little program that gets the CompletedTasks attribute of org.apache.cassandra.db:type=Commitlog from every node, sums up the values and records them in a .csv every 10 sec or so. Everything works and I get my stats, but later I realized that I am not really sure what this measure means. I think it is the number of individual column insertions performed! Am I correct?
>  In the meantime I installed the trial version of the DataStax Operations Center. The cluster-wide dashboard, showing writes performed as a function of time, gives me much smaller rates than the measurement I described above. The DataStax writes/sec are of the same order of magnitude as the batch writes I perform on the cluster. But somehow I cannot relate this rate to the rate of my CompletedTasks measurement.
> 
> How do people usually measure insertion rates for their clusters? Per batch, per single column, or is the actual data rate more important to know?
> 
> Cheers,
> Alexandru
>