Posted to user@cassandra.apache.org by James Masson <ja...@opigram.com> on 2012/12/21 13:27:23 UTC

Cassandra read throughput with little/no caching.

Hi list-users,

We have an application that has a relatively unusual access pattern in 
cassandra 1.1.6

Essentially we read an entire multi hundred megabyte column family 
sequentially (little chance of a cassandra cache hit), perform some 
operations on the data, and write the data back to another column family 
in the same keyspace.

We do about 250 writes/sec and 100 reads/sec during this process. Write 
request latency is about 900 microsecs, read request latency is about 
4000 microsecs.

* First Question: Do these numbers make sense?

read-request latency seems a little high to me; cassandra hasn't had a 
chance to cache this data, but it's likely in the Linux disk cache, 
given the sizing of the node/data/jvm.

thanks

James M

Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.
Hi Aaron,

On 23/12/12 20:18, aaron morton wrote:
> First, the non-helpful advice: I strongly suggest changing the data
> model so you do not have 100MB+ rows. They will make life harder.

I don't think we have 100MB+ rows. Column families, yes - but not rows.

>
>> Write request latency is about 900 microsecs, read request latency
>> is about 4000 microsecs.
>
> 4 milliseconds to drag 100 to 300 MB data off a SAN, through your
> network, into C* and out to the client does not sound terrible at first
> glance. Can you benchmark an individual request to get an idea of the
> throughput?

It's large numbers of small requests - 250 writes/sec - about 100 
reads/sec. I might look at some tcpdumps, to see what it's actually doing...

With a total volume of approx 400Mb, split over 3 nodes, it takes about 
30 minutes to run through the complete data-set. There's near zero disk I/O, 
and disk-wait. It's definitely coming out of the Linux disk cache.

That works out at about 0.2Mb/sec in data crunching terms - and about 
0.6Mb/sec network I/O.
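A minimal sketch of timing an individual read through Hector, along the 
lines Aaron suggests (cluster/keyspace/column family names are taken from 
elsewhere in this thread; the host and row key are placeholders):

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class SingleReadTimer {
    public static void main(String[] args) {
        // Placeholder host; 9160 is the default Thrift port.
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster",
                "cassandra-node1:9160");
        Keyspace ksp = HFactory.createKeyspace("mykeyspace", cluster);

        SliceQuery<String, byte[], byte[]> q = HFactory.createSliceQuery(
                ksp, StringSerializer.get(),
                BytesArraySerializer.get(), BytesArraySerializer.get());
        q.setColumnFamily("entities");
        q.setKey("some-row-key");            // hypothetical key
        q.setRange(null, null, false, 100);  // first 100 columns of the row

        long start = System.nanoTime();
        q.execute();
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println("single read took " + micros + " microsecs");
    }
}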

>
> I would recommend removing the SAN from the equation, cassandra will run
> better with local disks. It also introduces a single point of failure
> into a distributed system.

Understood about the SPoF, but that's mitigated by good SAN fabric design. I 
think a single local disk or two is going to find it hard to compete 
with an FC-attached SAN with gigabytes of dedicated DRAM cache and SSD tiering.
This is all on VMware anyway, so there's no option of local disks.

>
>> but it's likely in the Linux disk cache, given the sizing of the
>> node/data/jvm.
> Are you sure that the local Linux machine is going to cache files stored
> on the SAN ?

Yes, Linux doesn't care (and isn't aware) at the filesystem level if 
the volume is 'local' or not, everything goes through the same caching 
strategy. Again, because this is VMware, it appears as a 'local' disk 
anyway.

In short, disk isn't the limiting factor here.

thanks

James M

Re: Cassandra read throughput with little/no caching.

Posted by aaron morton <aa...@thelastpickle.com>.
First, the non-helpful advice: I strongly suggest changing the data model so you do not have 100MB+ rows. They will make life harder. 

> Write request latency is about 900 microsecs, read request latency
> is about 4000 microsecs.

4 milliseconds to drag 100 to 300 MB data off a SAN, through your network, into C* and out to the client does not sound terrible at first glance. Can you benchmark an individual request to get an idea of the throughput? 

I would recommend removing the SAN from the equation; cassandra will run better with local disks. It also introduces a single point of failure into a distributed system. 

> but it's likely in the Linux disk cache, given the sizing of the node/data/jvm.

Are you sure that the local Linux machine is going to cache files stored on the SAN ? 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/12/2012, at 6:56 AM, Yiming Sun <yi...@gmail.com> wrote:

> James, you could experiment with Row cache, with off-heap JNA cache, and see if it helps.  My own experience with row cache was not good, and the OS cache seemed to be most useful, but in my case, our data space was big, over 10TB.  Your sequential access pattern certainly doesn't play well with LRU, but given the small data space you have, you may be able to fit the data from one column family entirely into the row cache.


Re: Cassandra read throughput with little/no caching.

Posted by Tyler Hobbs <ty...@datastax.com>.
>
> Your description above was much better :-) I'm more interested in docs for
> the raw metrics provided in JMX.


I don't think there are any good docs for what is exposed directly through
JMX.  Most of the OpsCenter metrics map closely to one exposed JMX item, so
that's a start.  Other than that, your best bet at the moment for accurate
descriptions is to read the source.  Methods and attributes exposed through
JMX follow a particular format (MBean classes, naming conventions) that
makes them pretty easy to find.
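For example, a quick sketch that enumerates everything a node exposes, 
following those naming conventions (host/port are placeholders; adjust for 
your own nodes):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListCassandraMBeans {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-node1:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Cassandra's MBeans live under org.apache.cassandra.* domains.
            Set<ObjectName> names =
                    mbs.queryNames(new ObjectName("org.apache.cassandra.*:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        } finally {
            jmxc.close();
        }
    }
}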

> That would be a great feature, but it's quite difficult to capture
> high-resolution data without disturbing the system you're trying to
> measure.
>

True, but fortunately Cassandra doesn't tend to be CPU bound, so simply
sampling JMX data doesn't tend to impact normal performance metrics.

> Perhaps worth taking the data-capture points off-list?


Sure, I'd love to hear your ideas.





-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.

On 02/01/13 16:18, Tyler Hobbs wrote:
> On Wed, Jan 2, 2013 at 5:28 AM, James Masson <james.masson@opigram.com
> <ma...@opigram.com>> wrote:
 >
> 1) Hector sends a request to some node in the cluster, which will act as
> the coordinator.
> 2) The coordinator then sends the actual read requests out to each of
> the (RF) replicas.
> 3a) The coordinator waits for responses from the replicas; how many it
> waits for depends on the consistency level.
> 3b) The replicas perform actual cache/memtable/sstable reads and respond
> to the coordinator when complete
> 4) Once the required number of replicas have responded, the coordinator
> replies to the client (Hector).
>
> The Read Request Latency metric is measuring the time taken in steps 2
> through 4.  The CF Local Read Latency metric is only capturing the time
> taken in step 3b.
>
>

Great, that's exactly the level of detail I'm looking for.

>
>
>     Is there anywhere I can find concrete definitions of what the stats
>     in OpsCenter, and raw Cassandra via JMX mean? The docs I've found
>     seem quite ambiguous.
>
>
> This has pretty good writeups of each:
> http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics

Your description above was much better :-) I'm more interested in docs 
for the raw metrics provided in JMX.

>
>
>     I still think that the data resolution that OpsCenter gives makes it
>     more suitable for trending/alerting rather than chasing down tricky
>     performance issues. This sort of investigation work is what I do for
>     a living; I typically use intervals of 10 seconds or lower, and
>     don't average my data. Although, storing your data inside the
>     database you're measuring does restrict your options a little :-)
>
>
> True, there's a limit to what you can detect with 60 second resolution.
> We've considered being able to report metrics at a finer resolution
> without durably storing them anywhere, which would be useful for when
> you're actively watching the cluster.

That would be a great feature, but it's quite difficult to capture 
high-resolution data without disturbing the system you're trying 
to measure.

Perhaps worth taking the data-capture points off-list?

James M

Re: Cassandra read throughput with little/no caching.

Posted by Tyler Hobbs <ty...@datastax.com>.
On Wed, Jan 2, 2013 at 5:28 AM, James Masson <ja...@opigram.com> wrote:

>
> thanks for clarifying this. So you're saying the difference between the
> global Read Request Latency in OpsCenter and the column-family-specific
> one is the effort of coordinating a validated read across multiple replicas?


Yes.


> Is this not part of what Hector does for itself?
>

No.  Here's the basic order of events:

1) Hector sends a request to some node in the cluster, which will act as
the coordinator.
2) The coordinator then sends the actual read requests out to each of the
(RF) replicas.
3a) The coordinator waits for responses from the replicas; how many it
waits for depends on the consistency level.
3b) The replicas perform actual cache/memtable/sstable reads and respond to
the coordinator when complete
4) Once the required number of replicas have responded, the coordinator
replies to the client (Hector).

The Read Request Latency metric is measuring the time taken in steps 2
through 4.  The CF Local Read Latency metric is only capturing the time
taken in step 3b.
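A minimal sketch of sampling the two numbers side by side over JMX (MBean 
and attribute names as they appear via jconsole on 1.1.x nodes; worth 
verifying on your own cluster before relying on them):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadLatencySampler {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-node1:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Steps 2-4: coordinator-level read latency (StorageProxy).
            ObjectName proxy = new ObjectName(
                    "org.apache.cassandra.db:type=StorageProxy");
            // Step 3b only: local read latency for one column family.
            ObjectName cf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,"
                    + "keyspace=mykeyspace,columnfamily=entities");
            for (int i = 0; i < 60; i++) {
                Object coord = mbs.getAttribute(proxy, "RecentReadLatencyMicros");
                Object local = mbs.getAttribute(cf, "RecentReadLatencyMicros");
                System.out.println("coordinator=" + coord
                        + "us local=" + local + "us");
                Thread.sleep(10000); // 10s sampling interval
            }
        } finally {
            jmxc.close();
        }
    }
}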


> Essentially, I'm looking to see whether I can use this to derive where any
> extra latency from a client request comes from.
>

Yes, using the two numbers in conjunction can be very informative.  Also,
you might be interested in the new query tracing feature in 1.2, which
shows very detailed steps and their latencies.


>
> As for names, I'd suggest "cluster coordinated read request latency", bit
> of a mouthful, I know.
>

Awesome, thanks for your input.


>
> Is there anywhere I can find concrete definitions of what the stats in
> OpsCenter, and raw Cassandra via JMX mean? The docs I've found seem quite
> ambiguous.
>

This has pretty good writeups of each:
http://www.datastax.com/docs/opscenter/online_help/performance/index#opscenter-performance-metrics


>
> I still think that the data resolution that OpsCenter gives makes it more
> suitable for trending/alerting rather than chasing down tricky performance
> issues. This sort of investigation work is what I do for a living; I
> typically use intervals of 10 seconds or lower, and don't average my data.
> Although, storing your data inside the database you're measuring does
> restrict your options a little :-)


True, there's a limit to what you can detect with 60 second resolution.
We've considered being able to report metrics at a finer resolution without
durably storing them anywhere, which would be useful for when you're
actively watching the cluster.

Thanks!

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.
On 31/12/12 18:45, Tyler Hobbs wrote:
> On Mon, Dec 31, 2012 at 11:24 AM, James Masson <james.masson@opigram.com
> <ma...@opigram.com>> wrote:
>
>
>     Well, it turns out the Read-Request Latency graph in Ops-Center is
>     highly misleading.
>
>     Using jconsole, the read-latency for the column family in question
>     is actually normally around 800 microseconds, punctuated by
>     occasional big spikes that drive up the averages.
>
>     Towards the end of the batch process, the Opscenter reported average
>     latency is up above 4000 microsecs, and forced compactions no longer
>     help drive the latency down again.
>
>     I'm going to stop relying on OpsCenter for data for performance
>     analysis metrics, it just doesn't have the resolution.
>
>
> James, it's worth pointing out that Read Request Latency in OpsCenter is
> measuring at the coordinator level, so it includes the time spent
> sending requests to replicas and waiting for a response.  There's
> another latency metric that is per-column family named Local Read
> Latency; it sounds like this is the equivalent number that you were
> looking at in jconsole.  This metric basically just includes the time to
> read local caches/memtables/sstables.
>
> We are looking to rename one or both of the metrics for clarity; any
> input here would be helpful. For example, we're considering "Coordinated
> Read Request Latency" or "Client Read Request Latency" in place of just
> "Read Request Latency".
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>


Hi Tyler,

thanks for clarifying this. So you're saying the difference between the 
global Read Request Latency in OpsCenter and the column-family-specific 
one is the effort of coordinating a validated read across multiple 
replicas? Is this not part of what Hector does for itself?

Essentially, I'm looking to see whether I can use this to derive where 
any extra latency from a client request comes from.

As for names, I'd suggest "cluster coordinated read request latency", 
bit of a mouthful, I know.

Is there anywhere I can find concrete definitions of what the stats in 
OpsCenter, and raw Cassandra via JMX mean? The docs I've found seem 
quite ambiguous.

I still think that the data resolution that OpsCenter gives makes it 
more suitable for trending/alerting rather than chasing down tricky 
performance issues. This sort of investigation work is what I do for a 
living; I typically use intervals of 10 seconds or lower, and don't 
average my data. Although, storing your data inside the database you're 
measuring does restrict your options a little :-)

regards

James M



Re: Cassandra read throughput with little/no caching.

Posted by Tyler Hobbs <ty...@datastax.com>.
On Mon, Dec 31, 2012 at 11:24 AM, James Masson <ja...@opigram.com> wrote:

>
> Well, it turns out the Read-Request Latency graph in Ops-Center is highly
> misleading.
>
> Using jconsole, the read-latency for the column family in question is
> actually normally around 800 microseconds, punctuated by occasional big
> spikes that drive up the averages.
>
> Towards the end of the batch process, the Opscenter reported average
> latency is up above 4000 microsecs, and forced compactions no longer help
> drive the latency down again.
>
> I'm going to stop relying on OpsCenter for data for performance analysis
> metrics, it just doesn't have the resolution.


James, it's worth pointing out that Read Request Latency in OpsCenter is
measuring at the coordinator level, so it includes the time spent sending
requests to replicas and waiting for a response.  There's another latency
metric that is per-column family named Local Read Latency; it sounds like
this is the equivalent number that you were looking at in jconsole.  This
metric basically just includes the time to read local
caches/memtables/sstables.

We are looking to rename one or both of the metrics for clarity; any input
here would be helpful. For example, we're considering "Coordinated Read
Request Latency" or "Client Read Request Latency" in place of just "Read
Request Latency".

-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Cassandra read throughput with little/no caching.

Posted by Keith Wright <kw...@nanigans.com>.
Following up on this, I was hoping to get everyone's take on my use case
for Cassandra and see if everyone agrees it can meet the requirements:

I have a very tight SLA around get times.  These are almost always single
row fetches for 20-50 columns on a row that is likely under 200 columns.
The request must return in under 1 ms (this includes any latency between
the machines though they are in the same rack with at least 1 Gbs NIC).
My configuration currently is 3 nodes with RF: 2 with each node having
dual quad cores, 48 Gbs of RAM, and SSD for the data.  The data set will
certainly exceed the RAM size but will always fit in SSD and I see a key
cache hit rate currently of around 95%.  Assuming I can grow out nodes
linearly, is it reasonable to assume I can achieve sub 1 ms fetch times?
Note that the system will be processing increments which will be part of
the fetches, and the write-to-read ratio will be around 2 to 1 (however I
programmatically batch increments which flush as a single command every
second to reduce load on the cluster).  I am using Astyanax 1.3 with
Cassandra 1.1.7.  I have definitely seen decreases in performance when
compactions are running but was hoping Astyanax's latency load balancing
would mitigate this assuming compactions do not happen on multiple nodes
simultaneously.

Thanks!



Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.
Well, it turns out the Read-Request Latency graph in Ops-Center is 
highly misleading.

Using jconsole, the read-latency for the column family in question is 
actually normally around 800 microseconds, punctuated by occasional big 
spikes that drive up the averages.

Towards the end of the batch process, the Opscenter reported average 
latency is up above 4000 microsecs, and forced compactions no longer 
help drive the latency down again.

I'm going to stop relying on OpsCenter data for performance analysis 
metrics; it just doesn't have the resolution.

The only things left on my list for investigation are memtable sizes / 
eviction and JNA - and trying to capture some of the requests that are 
causing the spikes.

James M



Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.
Hi Yiming,

I've had the chance to observe what happens to cassandra read response 
time over time.

It starts out with fast 1ms reads, until the first compaction starts, 
then the CPUs are maxed out for a period, and read latency rises to 4ms. 
After compaction finishes, the system returns to 1ms reads and low cpu use.

This cycle repeats a few more times, but eventually, compactions become 
more and more infrequent and read-latency is stuck at 4ms for the rest 
of the batch operation.

I understand why compaction occurs, but not why it takes so long for our 
dataset, or why it eventually seems to not return to the original 
performance levels.

Our dataset just about fits in each node's disk-cache. Doing compaction 
should be a matter of memory and CPU bandwidth, bottlenecked by disk 
writes. I see near zero disk I/O, and the SAN is capable of sustained 
100Mb/s writes easily.

I'm using a fairly stock cassandra config.

I'm tempted to just set this to unlimited:

# Throttles compaction to the given total throughput across the entire
# system. The faster you insert data, the faster you need to compact in
# order to keep the sstable count down, but in general, setting this to
# 16 to 32 times the rate you are inserting data is more than sufficient.
# Setting this to 0 disables throttling. Note that this accounts for all
# types of compaction, including validation compaction.
compaction_throughput_mb_per_sec: 16
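For what it's worth, the throttle can also be changed on a live node without 
a restart; a sketch over JMX (attribute name matching the StorageService 
MBean as I understand 1.1.x, worth confirming in jconsole first - nodetool 
setcompactionthroughput should do the same thing):

import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionThrottle {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-node1:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName ss = new ObjectName(
                    "org.apache.cassandra.db:type=StorageService");
            // 0 disables throttling entirely, as in the yaml comment above.
            mbs.setAttribute(ss,
                    new Attribute("CompactionThroughputMbPerSec", 0));
            System.out.println("now: "
                    + mbs.getAttribute(ss, "CompactionThroughputMbPerSec"));
        } finally {
            jmxc.close();
        }
    }
}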

About the only thing I have changed is this:

# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them.
#
# On the other hand, since writes are almost never IO bound, the ideal
# number of "concurrent_writes" is dependent on the number of cores in
# your system; (8 * number_of_cores) is a good rule of thumb.
concurrent_reads: 128
concurrent_writes: 32


On 28/12/12 14:02, Yiming Sun wrote:

> Is there any chance to increase the VM configuration specs?  I couldn't
> pinpoint in exactly which message you mentioned the VMs are 2GB mem and
> 2 cores, which is a bit meager.

The data-set pretty much all fits in RAM, and using 4GHz of CPU time to 
serve about 500 key-value pairs per second is pretty poor performance 
compared to Cassandra's competitors, no? I'd rather understand why 
performance is bad, rather than throw hardware into a black hole!

>  Also is it possible to batch the writes together?
>

I'll ask.

thanks for persevering!

James M

Re: Cassandra read throughput with little/no caching.

Posted by Yiming Sun <yi...@gmail.com>.
James, sorry I was out for a few days.  Yes, if the row cache doesn't give
a good hit rate then it should be disabled.

Is there any chance to increase the VM configuration specs?  I couldn't
pinpoint in exactly which message you mentioned the VMs are 2GB mem and 2
cores, which is a bit meager.  Also is it possible to batch the writes
together?

-- Y.



Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.

On 21/12/12 17:56, Yiming Sun wrote:
> James, you could experiment with Row cache, with off-heap JNA cache, and
> see if it helps.  My own experience with row cache was not good, and the
> OS cache seemed to be most useful, but in my case, our data space was
> big, over 10TB.  Your sequential access pattern certainly doesn't play
> well with LRU, but given the small data space you have, you may be able
> to fit the data from one column family entirely into the row cache.
>
>

I've done some experimenting today with JNA/row cache. Extra 500Mb of 
heap, 300Mb row cache, latest JNA, set caching=ALL in the schema for all 
column families in this keyspace.

Getting average 5% row cache hit rate - no increase in cassandra 
throughput, and increased disk read I/O, basically because I've 
sacrificed Linux disk cache for the cassandra row-cache.

Load average was 4 (2cpu boxes) for the duration of the cycle, where it 
was about 2 before, basically because of the disk I/O I think.

So, I think I'll disable row caching again...

James M

Re: Cassandra read throughput with little/no caching.

Posted by Yiming Sun <yi...@gmail.com>.
James, you could experiment with Row cache, with off-heap JNA cache, and
see if it helps.  My own experience with row cache was not good, and the OS
cache seemed to be most useful, but in my case, our data space was big,
over 10TB.  Your sequential access pattern certainly doesn't play well with
LRU, but given the small data space you have, you may be able to fit the
data from one column family entirely into the row cache.


Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.

On 21/12/12 16:27, Yiming Sun wrote:
> James, using RandomPartitioner, the order of the rows is random, so when
> you request these rows in "Sequential" order (sort by the date?),
> Cassandra is not reading them sequentially.

Yes, I understand the "next" row to be retrieved in sequence is likely 
to be on a different node, and the ordering is random. I'm using the 
word sequential to try to explain that the data being requested is in an 
order, and not repeated, until the next cycle. The data is not 
guaranteed to be of a size that is cache-able as a whole.

>
> The size of the data, 200Mb, 300Mb , and 40Mb, are these the size for
> each column? Or are these the total size of the entire column family?
>   It wasn't too clear to me.  But if these are the total size of the
> column families, you will be able to fit them mostly in memory, so you
> should enable row cache.

Size of the column family, on a single node. Row caching is off at the 
moment.

Are you saying that I should increase the JVM heap to fit some data in 
the row cache, at the expense of linux disk caching?

Bear in mind that the data is only going to be re-requested in sequence 
again - I'm not sure what the value is in the cassandra native caching 
if rows are not re-requested before being evicted.

My current key-cache hit-rates are near zero on this workload, hence I'm 
interested in cassandra's zero-cache performance. Unless I can guarantee 
to fit the entire data-set in memory, it's difficult to justify using 
memory on a cassandra cache if LRU and workload means it's not actually 
a benefit.

>
> I happen to have done some performance tests of my own on cassandra,
> mostly on the read, and was also only able to get less than 6MB/sec read
> rate out of a cluster of 6 nodes RF2 using a single threaded client.
>   But it made a huge difference when I changed the client to an
> asynchronous multi-threaded structure.
>

Yes, I've been talking to the developers about having a separate thread 
or two that keeps cassandra busy, keeping Disruptor 
(http://lmax-exchange.github.com/disruptor/) fed to do the processing work.

But this all doesn't change the fact that under this zero-cache 
workload, cassandra seems to be very CPU expensive for throughput.

thanks

James M

>
>
>
> On Fri, Dec 21, 2012 at 10:36 AM, James Masson <james.masson@opigram.com
> <ma...@opigram.com>> wrote:
>
>
>     Hi,
>
>     thanks for the reply
>
>
>     On 21/12/12 14:36, Yiming Sun wrote:
>
>         I have a few questions for you, James,
>
>         1. how many nodes are in your Cassandra ring?
>
>
>     2 or 3 - depending on environment - it doesn't seem to make a
>     difference to throughput very much. What is a 30 minute task on a 2
>     node environment is a 30 minute task on a 3 node environment.
>
>
>         2. what is the replication factor?
>
>
>     1
>
>         3. when you say sequentially, what do you mean?  what
>         Partitioner do you
>         use?
>
>
>     The data is organised by date - the keys are read sequentially in
>     order, only once.
>
>     Random partitioner - the data is equally spread across the nodes to
>     avoid hotspots.
>
>
>         4. how many columns per row?  how much data per row?  per column?
>
>
>     varies - described in the schema.
>
>     create keyspace mykeyspace
>        with placement_strategy = 'SimpleStrategy'
>        and strategy_options = {replication_factor : 1}
>        and durable_writes = true;
>
>
>     create column family entities
>        with column_type = 'Standard'
>        and comparator = 'BytesType'
>        and default_validation_class = 'BytesType'
>        and key_validation_class = 'AsciiType'
>        and read_repair_chance = 0.0
>        and dclocal_read_repair_chance = 0.0
>        and gc_grace = 0
>        and min_compaction_threshold = 4
>        and max_compaction_threshold = 32
>        and replicate_on_write = false
>        and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>        and caching = 'NONE'
>        and column_metadata = [
>          {column_name : '64656c65746564',
>          validation_class : BytesType,
>          index_name : 'deleted_idx',
>          index_type : 0},
>          {column_name : '6576656e744964',
>          validation_class : TimeUUIDType,
>          index_name : 'eventId_idx',
>          index_type : 0},
>          {column_name : '7061796c6f6164',
>          validation_class : UTF8Type}];
>
>     2 columns per row here - about 200Mb of data in total
>
>
>     create column family events
>        with column_type = 'Standard'
>        and comparator = 'BytesType'
>        and default_validation_class = 'BytesType'
>        and key_validation_class = 'TimeUUIDType'
>        and read_repair_chance = 0.0
>        and dclocal_read_repair_chance = 0.0
>        and gc_grace = 0
>        and min_compaction_threshold = 4
>        and max_compaction_threshold = 32
>        and replicate_on_write = false
>        and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>        and caching = 'NONE';
>
>     1 column per row - about 300Mb of data
>
>     create column family intervals
>        with column_type = 'Standard'
>        and comparator = 'BytesType'
>        and default_validation_class = 'BytesType'
>        and key_validation_class = 'AsciiType'
>        and read_repair_chance = 0.0
>        and dclocal_read_repair_chance = 0.0
>        and gc_grace = 0
>        and min_compaction_threshold = 4
>        and max_compaction_threshold = 32
>        and replicate_on_write = false
>        and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>        and caching = 'NONE';
>
>     variable columns per row - about 40Mb of data.
>
>
>
>         5. what client library do you use to access Cassandra?
>           (Hector?).  Is
>         your client code single threaded?
>
>
>     Hector - yes, the processing side of the client is single threaded,
>     but is largely waiting for cassandra responses and has plenty of CPU
>     headroom.
>
>
>     I guess what I'm most interested in is why the discrepancy in
>     between read/write latency - although I understand the data volume
>     is much larger in reads, even though the request rate is lower.
>
>     Network usage on a cassandra box barely gets above 20Mbit, including
>     inter-cluster comms. Averages 5mbit client<>cassandra
>
>     There is near zero disk I/O, and what little there is is served sub
>     1ms. Storage is backed by a very fast SAN, but like I said earlier,
>     the dataset just about fits in the Linux disk cache. 2Gb VM, 512Mb
>     cassandra heap - GCs are nice and quick, no JVM memory problems,
>     used heap oscillates between 280-350Mb.
>
>     Basically, I'm just puzzled because cassandra doesn't behave as I
>     would expect: huge CPU use for very little throughput. I'm struggling
>     to find anything wrong with the environment - there's no bottleneck
>     that I can see.
>
>     thanks
>
>     James M
>
>
>
>
>
>         On Fri, Dec 21, 2012 at 7:27 AM, James Masson
>         <james.masson@opigram.com> wrote:
>
>
>              Hi list-users,
>
>              We have an application that has a relatively unusual access
>              pattern in cassandra 1.1.6
>              in cassandra 1.1.6
>
>              Essentially we read an entire multi hundred megabyte column
>              family sequentially (little chance of a cassandra cache
>              hit), perform some operations on the data, and write the
>              data back to another column family in the same keyspace.
>
>              We do about 250 writes/sec and 100 reads/sec during this
>              process. Write request latency is about 900 microsecs, read
>              request latency is about 4000 microsecs.
>
>              * First Question: Do these numbers make sense?
>
>              read-request latency seems a little high to me, cassandra
>              hasn't had a chance to cache this data, but it's likely in
>              the Linux disk cache, given the sizing of the node/data/jvm.
>
>              thanks
>
>              James M
>
>
>

Re: Cassandra read throughput with little/no caching.

Posted by Yiming Sun <yi...@gmail.com>.
James, with RandomPartitioner the placement of rows is random, so when you
request these rows in "sequential" (date-sorted?) order, Cassandra is not
reading them sequentially - each read is effectively a random point lookup.
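
The only scan that is truly sequential under RandomPartitioner is one that
walks the ring in token order with a range query. A minimal sketch of what I
mean, assuming Hector 1.x - the cluster/host names are placeholders, and only
the 'entities' column family name comes from your schema:

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;

public class TokenOrderScan {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("my-cluster", "cass-host:9160");
        Keyspace ksp = HFactory.createKeyspace("mykeyspace", cluster);
        StringSerializer ss = StringSerializer.get();
        BytesArraySerializer bs = BytesArraySerializer.get();

        String startKey = "";
        int pageSize = 1000;
        while (true) {
            // Rows come back in token order, which under RandomPartitioner
            // has nothing to do with date order.
            OrderedRows<String, byte[], byte[]> rows =
                HFactory.createRangeSlicesQuery(ksp, ss, bs, bs)
                    .setColumnFamily("entities")
                    .setKeys(startKey, "")
                    .setRange(null, null, false, Integer.MAX_VALUE)
                    .setRowCount(pageSize)
                    .execute().get();
            for (Row<String, byte[], byte[]> row : rows) {
                // process(row) is hypothetical; the start key is inclusive,
                // so skip the first row on every page after the first
            }
            if (rows.getCount() < pageSize) break;
            startKey = rows.peekLast().getKey();
        }
    }
}

If the processing genuinely needs date order, a token-order scan only helps
if you can re-sort afterwards; otherwise you are stuck with point reads.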

The size of the data - 200MB, 300MB, and 40MB - is that per column, or the
total size of the entire column family? It wasn't clear to me. But if these
are the totals for the column families, you will be able to fit them mostly
in memory, so you should enable the row cache.

I happen to have done some performance tests of my own on cassandra, mostly
on the read side, and was also only able to get a read rate of less than
6MB/sec out of a 6-node cluster at RF=2 using a single-threaded client. But
it made a huge difference when I changed the client to an asynchronous,
multi-threaded structure.
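
Roughly what I mean by multi-threaded, as a sketch rather than our actual
test code - this assumes Hector 1.x, where a Keyspace can be shared across
threads, and the 'entities' column family name is just an example:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;

public class ParallelReader {
    // Fan a list of keys out over a fixed thread pool so several reads
    // are in flight at once instead of one at a time.
    static void readAll(final Keyspace ksp, List<String> keys, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String key : keys) {
            pool.submit(new Runnable() {
                public void run() {
                    ColumnSlice<byte[], byte[]> slice =
                        HFactory.createSliceQuery(ksp, StringSerializer.get(),
                                BytesArraySerializer.get(), BytesArraySerializer.get())
                            .setColumnFamily("entities")
                            .setKey(key)
                            .setRange(null, null, false, Integer.MAX_VALUE)
                            .execute().get();
                    // process(key, slice) is hypothetical; ordering across
                    // keys is no longer guaranteed, so downstream work has
                    // to tolerate that
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

With a single-threaded loop, every read pays the full round trip before the
next one starts, so ~4ms read latency caps a single thread at ~250 reads/sec
no matter how fast the cluster is; N workers give you N reads in flight.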




On Fri, Dec 21, 2012 at 10:36 AM, James Masson <ja...@opigram.com> wrote:

>
> Hi,
>
> thanks for the reply
>
>
> On 21/12/12 14:36, Yiming Sun wrote:
>
>> I have a few questions for you, James,
>>
>> 1. how many nodes are in your Cassandra ring?
>>
>
> 2 or 3, depending on the environment - it doesn't seem to make much
> difference to throughput. A 30-minute task on a 2-node environment is
> still a 30-minute task on a 3-node environment.
>
>
>> 2. what is the replication factor?
>>
>
> 1
>
>> 3. when you say sequentially, what do you mean?  what Partitioner do you
>> use?
>>
>
> The data is organised by date - the keys are read sequentially in date
> order, and each key is read only once.
>
> Random partitioner - the data is equally spread across the nodes to avoid
> hotspots.
>
>
>> 4. how many columns per row?  how much data per row?  per column?
>>
>
> varies - described in the schema.
>
> create keyspace mykeyspace
>   with placement_strategy = 'SimpleStrategy'
>   and strategy_options = {replication_factor : 1}
>   and durable_writes = true;
>
>
> create column family entities
>   with column_type = 'Standard'
>   and comparator = 'BytesType'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'AsciiType'
>   and read_repair_chance = 0.0
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 0
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = false
>   and compaction_strategy =
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'NONE'
>   and column_metadata = [
>     {column_name : '64656c65746564',
>     validation_class : BytesType,
>     index_name : 'deleted_idx',
>     index_type : 0},
>     {column_name : '6576656e744964',
>     validation_class : TimeUUIDType,
>     index_name : 'eventId_idx',
>     index_type : 0},
>     {column_name : '7061796c6f6164',
>     validation_class : UTF8Type}];
>
> 2 columns per row here - about 200MB of data in total
>
>
> create column family events
>   with column_type = 'Standard'
>   and comparator = 'BytesType'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'TimeUUIDType'
>   and read_repair_chance = 0.0
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 0
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = false
>   and compaction_strategy =
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'NONE';
>
> 1 column per row - about 300MB of data
>
> create column family intervals
>   with column_type = 'Standard'
>   and comparator = 'BytesType'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'AsciiType'
>   and read_repair_chance = 0.0
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 0
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = false
>   and compaction_strategy =
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'NONE';
>
> variable columns per row - about 40MB of data.
>
>
>
>> 5. what client library do you use to access Cassandra?  (Hector?).  Is
>> your client code single threaded?
>>
>
> Hector - yes, the processing side of the client is single-threaded, but it
> is largely waiting for cassandra responses and has plenty of CPU headroom.
>
>
> What I'm most interested in is the discrepancy between read and write
> latency - although I understand the data volume is much larger for reads,
> even though the request rate is lower.
>
> Network usage on a cassandra box barely gets above 20Mbit/s, including
> inter-cluster comms. It averages 5Mbit/s client<>cassandra.
>
> There is near-zero disk I/O, and what little there is is served in under
> 1ms. Storage is backed by a very fast SAN, but as I said earlier, the
> dataset just about fits in the Linux disk cache. 2GB VM, 512MB cassandra
> heap - GCs are nice and quick, no JVM memory problems, and used heap
> oscillates between 280-350MB.
>
> Basically, I'm just puzzled because cassandra doesn't behave as I would
> expect: huge CPU use for very little throughput. I'm struggling to find
> anything wrong with the environment - there's no bottleneck that I can
> see.
>
> thanks
>
> James M
>
>
>
>
>>
>> On Fri, Dec 21, 2012 at 7:27 AM, James Masson
>> <james.masson@opigram.com> wrote:
>>
>>
>>     Hi list-users,
>>
>>     We have an application that has a relatively unusual access pattern
>>     in cassandra 1.1.6
>>
>>     Essentially we read an entire multi hundred megabyte column family
>>     sequentially (little chance of a cassandra cache hit), perform some
>>     operations on the data, and write the data back to another column
>>     family in the same keyspace.
>>
>>     We do about 250 writes/sec and 100 reads/sec during this process.
>>     Write request latency is about 900 microsecs, read request latency
>>     is about 4000 microsecs.
>>
>>     * First Question: Do these numbers make sense?
>>
>>     read-request latency seems a little high to me, cassandra hasn't had
>>     a chance to cache this data, but it's likely in the Linux disk
>>     cache, given the sizing of the node/data/jvm.
>>
>>     thanks
>>
>>     James M
>>
>>
>>

Re: Cassandra read throughput with little/no caching.

Posted by James Masson <ja...@opigram.com>.
Hi,

thanks for the reply

On 21/12/12 14:36, Yiming Sun wrote:
> I have a few questions for you, James,
>
> 1. how many nodes are in your Cassandra ring?

2 or 3, depending on the environment - it doesn't seem to make much
difference to throughput. A 30-minute task on a 2-node environment is
still a 30-minute task on a 3-node environment.

> 2. what is the replication factor?

1

> 3. when you say sequentially, what do you mean?  what Partitioner do you
> use?

The data is organised by date - the keys are read sequentially in date
order, and each key is read only once.

Random partitioner - the data is equally spread across the nodes to 
avoid hotspots.
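
Since these are date-ordered point reads, one thing we could try is batching
a window of keys into a single multiget to amortise the per-request latency.
A rough Hector 1.x sketch - the 'entities' column family name is real, but
the batching around it is illustrative only:

import java.util.List;

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Rows;
import me.prettyprint.hector.api.factory.HFactory;

public class BatchedReader {
    // Fetch a window of date-ordered keys in one round trip instead of one
    // read per key, so the request latency is paid once per batch.
    static Rows<String, byte[], byte[]> readBatch(Keyspace ksp, List<String> keys) {
        return HFactory.createMultigetSliceQuery(ksp, StringSerializer.get(),
                    BytesArraySerializer.get(), BytesArraySerializer.get())
                .setColumnFamily("entities")
                .setKeys(keys.toArray(new String[keys.size()]))
                .setRange(null, null, false, Integer.MAX_VALUE)
                .execute().get();
    }
}

Multiget results come back unordered, so the caller would have to re-sort
each batch against its own key list.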

> 4. how many columns per row?  how much data per row?  per column?

varies - described in the schema.

create keyspace mykeyspace
   with placement_strategy = 'SimpleStrategy'
   and strategy_options = {replication_factor : 1}
   and durable_writes = true;


create column family entities
   with column_type = 'Standard'
   and comparator = 'BytesType'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'AsciiType'
   and read_repair_chance = 0.0
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = false
   and compaction_strategy = 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'NONE'
   and column_metadata = [
     {column_name : '64656c65746564',
     validation_class : BytesType,
     index_name : 'deleted_idx',
     index_type : 0},
     {column_name : '6576656e744964',
     validation_class : TimeUUIDType,
     index_name : 'eventId_idx',
     index_type : 0},
     {column_name : '7061796c6f6164',
     validation_class : UTF8Type}];

2 columns per row here - about 200MB of data in total


create column family events
   with column_type = 'Standard'
   and comparator = 'BytesType'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'TimeUUIDType'
   and read_repair_chance = 0.0
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = false
   and compaction_strategy = 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'NONE';

1 column per row - about 300MB of data

create column family intervals
   with column_type = 'Standard'
   and comparator = 'BytesType'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'AsciiType'
   and read_repair_chance = 0.0
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = false
   and compaction_strategy = 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'NONE';

variable columns per row - about 40MB of data.


> 5. what client library do you use to access Cassandra?  (Hector?).  Is
> your client code single threaded?

Hector - yes, the processing side of the client is single-threaded, but it
is largely waiting for cassandra responses and has plenty of CPU headroom.


What I'm most interested in is the discrepancy between read and write
latency - although I understand the data volume is much larger for reads,
even though the request rate is lower.

Network usage on a cassandra box barely gets above 20Mbit/s, including
inter-cluster comms. It averages 5Mbit/s client<>cassandra.

There is near-zero disk I/O, and what little there is is served in under
1ms. Storage is backed by a very fast SAN, but as I said earlier, the
dataset just about fits in the Linux disk cache. 2GB VM, 512MB cassandra
heap - GCs are nice and quick, no JVM memory problems, and used heap
oscillates between 280-350MB.

Basically, I'm just puzzled because cassandra doesn't behave as I would
expect: huge CPU use for very little throughput. I'm struggling to find
anything wrong with the environment - there's no bottleneck that I can see.

thanks

James M



>
>
> On Fri, Dec 21, 2012 at 7:27 AM, James Masson
> <james.masson@opigram.com> wrote:
>
>
>     Hi list-users,
>
>     We have an application that has a relatively unusual access pattern
>     in cassandra 1.1.6
>
>     Essentially we read an entire multi hundred megabyte column family
>     sequentially (little chance of a cassandra cache hit), perform some
>     operations on the data, and write the data back to another column
>     family in the same keyspace.
>
>     We do about 250 writes/sec and 100 reads/sec during this process.
>     Write request latency is about 900 microsecs, read request latency
>     is about 4000 microsecs.
>
>     * First Question: Do these numbers make sense?
>
>     read-request latency seems a little high to me, cassandra hasn't had
>     a chance to cache this data, but it's likely in the Linux disk
>     cache, given the sizing of the node/data/jvm.
>
>     thanks
>
>     James M
>
>

Re: Cassandra read throughput with little/no caching.

Posted by Yiming Sun <yi...@gmail.com>.
I have a few questions for you, James,

1. how many nodes are in your Cassandra ring?
2. what is the replication factor?
3. when you say sequentially, what do you mean?  what Partitioner do you
use?
4. how many columns per row?  how much data per row?  per column?
5. what client library do you use to access Cassandra?  (Hector?).  Is your
client code single threaded?


On Fri, Dec 21, 2012 at 7:27 AM, James Masson <ja...@opigram.com> wrote:

>
> Hi list-users,
>
> We have an application that has a relatively unusual access pattern in
> cassandra 1.1.6
>
> Essentially we read an entire multi hundred megabyte column family
> sequentially (little chance of a cassandra cache hit), perform some
> operations on the data, and write the data back to another column family in
> the same keyspace.
>
> We do about 250 writes/sec and 100 reads/sec during this process. Write
> request latency is about 900 microsecs, read request latency is about 4000
> microsecs.
>
> * First Question: Do these numbers make sense?
>
> read-request latency seems a little high to me, cassandra hasn't had a
> chance to cache this data, but it's likely in the Linux disk cache, given
> the sizing of the node/data/jvm.
>
> thanks
>
> James M
>