Posted to user@cassandra.apache.org by Alaa Zubaidi <al...@pdf.com> on 2010/11/03 20:32:58 UTC

SSD vs. HDD

Hi,
we have continuous high-throughput writes, reads, and deletes, and we are
trying to find the best hardware.
Does using SSDs for Cassandra improve performance? Did anyone compare SSD
vs. HDD? And any recommendations on SSDs?

Thanks,
Alaa


Re: SSD vs. HDD

Posted by Alaa Zubaidi <al...@pdf.com>.
Around 1800 columns/sec per node, 3 KB columns; the read rate is the same.
Data will be deleted after 4 hours.
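
(For scale: 1800 columns/sec x 3 KB x 14,400 s comes to roughly 78 GB of
live data per node, which lines up with the "about 80GB of hot data"
estimate later in the thread.)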

On 11/3/2010 5:00 PM, Terje Marthinussen wrote:
> How high is high, and how much data do you have (Cassandra disk usage)?
>
> Regards,
> Terje
>
> On 4 Nov 2010, at 04:32, Alaa Zubaidi<al...@pdf.com>  wrote:
>
>> Hi,
>> we have continuous high-throughput writes, reads, and deletes, and we are trying to find the best hardware.
>> Does using SSDs for Cassandra improve performance? Did anyone compare SSD vs. HDD? And any recommendations on SSDs?
>>
>> Thanks,
>> Alaa
>>
>

-- 
Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 700
San Jose, CA 95110  USA
Tel: 408-283-5639 (or 408-280-7900 x5639)
fax: 408-938-6479
email: alaa.zubaidi@pdf.com



Re: SSD vs. HDD

Posted by Terje Marthinussen <tm...@gmail.com>.
How high is high, and how much data do you have (Cassandra disk usage)?

Regards,
Terje

On 4 Nov 2010, at 04:32, Alaa Zubaidi <al...@pdf.com> wrote:

> Hi,
> we have continuous high-throughput writes, reads, and deletes, and we are trying to find the best hardware.
> Does using SSDs for Cassandra improve performance? Did anyone compare SSD vs. HDD? And any recommendations on SSDs?
> 
> Thanks,
> Alaa
> 

Re: SSD vs. HDD

Posted by Jonathan Shook <js...@gmail.com>.
Ah. Point taken on the random access SSD performance. I was trying to
emphasize the relative failure rates given the two scenarios. I didn't
mean to imply that SSD random access performance was not a likely
improvement here, just that it was a complicated trade-off in the
grand scheme of things. Thanks for catching my goof.


On Wed, Nov 3, 2010 at 3:58 PM, Tyler Hobbs <ty...@riptano.com> wrote:
> SSD will not generally improve your write performance very much, but they
> can significantly improve read performance.
>
> You do *not* want to waste an SSD on the commitlog drive, as even a slow HDD
> can write sequentially very quickly.  For the data drive, they might make
> sense.
>
> As Jonathan talks about, it has a lot to do with your access patterns.  If
> you either: (1) delete parts of rows, (2) update parts of rows, or (3) insert
> new columns into existing rows frequently, you'll end up with rows spread
> across several SSTables (which are on disk).  This means that each read may
> require several seeks, which are very slow for HDDs, but are very quick for
> SSDs.
>
> Of course, the randomness of what rows you access is also important, but
> Jonathan did a good job of covering that.  Don't forget about the effects of
> caching here, too.
>
> The only way to tell if it is cost-effective is to test your particular
> access patterns (using a configured stress.py test or, preferably, your
> actual application).
>
> - Tyler
>
> On Wed, Nov 3, 2010 at 3:44 PM, Jonathan Shook <js...@gmail.com> wrote:
>>
>> SSDs become unreliable after a number of writes that is relatively
>> low compared to spinning disks.
>> They may significantly boost performance if used on the "journal"
>> storage, but will suffer short lifetimes for highly-random write
>> patterns.
>>
>> In general, plan to replace them frequently. Whether they are worth
>> it, given the performance improvement over the cost of replacement x
>> hardware x logistics is generally a calculus problem. It's difficult
>> to make a generic rationale for or against them.
>>
>> You might be better off in general by throwing more memory at your
>> servers, and isolating your random access from your journaled data.
>> Is there any pattern to your reads and writes/deletes? If it is fully
>> random across your keys, then you have the worst-case scenario.
>> Sometimes you can impose access patterns or structural patterns in
>> your app which make caching more effective.
>>
>> Good questions to ask about your data access:
>> Is there a "user session" which shows an access pattern to proximal data?
>> Are there sets of access which always happen close together?
>> Are there keys or maps which add extra indirection?
>>
>> I'm not familiar with your situation. I was just providing some general
>> ideas.
>>
>> Jonathan Shook
>>
>> On Wed, Nov 3, 2010 at 2:32 PM, Alaa Zubaidi <al...@pdf.com> wrote:
>> > Hi,
>> > we have continuous high-throughput writes, reads, and deletes, and we are
>> > trying to find the best hardware.
>> > Does using SSDs for Cassandra improve performance? Did anyone compare SSD
>> > vs. HDD? And any recommendations on SSDs?
>> >
>> > Thanks,
>> > Alaa
>> >
>> >
>
>

Re: latest rows

Posted by Alaa Zubaidi <al...@pdf.com>.
Thank you guys ...

On 2/16/2011 1:36 PM, Matthew Dennis wrote:
> +1 on avoiding OPP
>
> On Wed, Feb 16, 2011 at 3:27 PM, Tyler Hobbs<ty...@datastax.com>  wrote:
>
>> Thanks for your input, but we have a set key that consists of name:timestamp
>>> that we are using.. and we need to also retrieve the oldest data as well..
>>>
>> Then you'll need to denormalize and store every row three ways:  timestamp,
>> inverted timestamp, and normal, if you want to be able to access them in all
>> three ways using OPP.
>>
>> I would recommend not using OPP and just using timeline rows.  Here's a
>> fantastic discussion of OrderPreservingPartitioner vs RandomPartitioner
>> <http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/>.
>>
>>
>> --
>> Tyler Hobbs
>> Software Engineer, DataStax<http://datastax.com/>
>> Maintainer of the pycassa<http://github.com/pycassa/pycassa>  Cassandra
>> Python client library
>>
>>

-- 
Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 700
San Jose, CA 95110  USA
Tel: 408-283-5639 (or 408-280-7900 x5639)
fax: 408-938-6479
email: alaa.zubaidi@pdf.com



Re: latest rows

Posted by Matthew Dennis <md...@datastax.com>.
+1 on avoiding OPP

On Wed, Feb 16, 2011 at 3:27 PM, Tyler Hobbs <ty...@datastax.com> wrote:

>
> Thanks for your input, but we have a set key that consists of name:timestamp
>> that we are using.. and we need to also retrieve the oldest data as well..
>>
>
> Then you'll need to denormalize and store every row three ways:  timestamp,
> inverted timestamp, and normal, if you want to be able to access them in all
> three ways using OPP.
>
> I would recommend not using OPP and just using timeline rows.  Here's a
> fantastic discussion of OrderPreservingPartitioner vs RandomPartitioner
> <http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/>.
>
>
> --
> Tyler Hobbs
> Software Engineer, DataStax <http://datastax.com/>
> Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
> Python client library
>
>

Re: latest rows

Posted by Tyler Hobbs <ty...@datastax.com>.
> Thanks for your input, but we have a set key that consists of name:timestamp
> that we are using.. and we need to also retrieve the oldest data as well..
>

Then you'll need to denormalize and store every row three ways:  timestamp,
inverted timestamp, and normal, if you want to be able to access them in all
three ways using OPP.

I would recommend not using OPP and just using timeline rows.  Here's a
fantastic discussion of OrderPreservingPartitioner vs
RandomPartitioner
<http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/>.
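
As a rough sketch of that denormalization (a pycassa-style sketch; the
keyspace and column family names are made up for illustration):

    import pycassa

    # Keyspace and CF names below are invented for illustration.
    pool = pycassa.ConnectionPool('MyKeyspace')
    by_name = pycassa.ColumnFamily(pool, 'ByName')
    by_time = pycassa.ColumnFamily(pool, 'ByTimestamp')
    by_inv_time = pycassa.ColumnFamily(pool, 'ByInvertedTimestamp')

    def store(name, ts, columns):
        # One logical row written under three key orderings, so it can be
        # range-scanned oldest-first, newest-first, or looked up by name.
        by_name.insert('%s:%d' % (name, ts), columns)
        by_time.insert('%020d' % ts, columns)
        by_inv_time.insert('%020d' % (2**64 - ts), columns)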

-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

Re: latest rows

Posted by Alaa Zubaidi <al...@pdf.com>.
Hi Tyler,

Thanks for your input, but we have a set key that consists of
name:timestamp that we are using, and we need to also retrieve the
oldest data as well.

Thanks

On 2/15/2011 9:07 PM, Tyler Hobbs wrote:
>> But wouldn't using timestamps as row keys cause conflicts?
>>
> Depending on client behavior, yes.  If that's an issue for you, make your
> own UUIDs by appending something random or client-specific to the timestamp.
>

-- 
Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 700
San Jose, CA 95110  USA
Tel: 408-283-5639 (or 408-280-7900 x5639)
fax: 408-938-6479
email: alaa.zubaidi@pdf.com



Re: latest rows

Posted by Tyler Hobbs <ty...@datastax.com>.
>
> But wouldn't using timestamps as row keys cause conflicts?
>

Depending on client behavior, yes.  If that's an issue for you, make your
own UUIDs by appending something random or client-specific to the timestamp.
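
For example (a sketch; the exact suffix scheme doesn't matter, only that
it makes keys from concurrent writers distinct):

    import random

    def unique_key(ts_micros):
        # Append a random suffix so that two clients writing in the same
        # microsecond still produce distinct row keys.
        return '%020d:%04d' % (ts_micros, random.randint(0, 9999))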

-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

Re: latest rows

Posted by Tan Yeh Zheng <ye...@chartnexus.com>.
But wouldn't using timestamps as row keys cause conflicts?
On Tue, 2011-02-15 at 19:11 -0600, Tyler Hobbs wrote:
> 
>         What is the best way to retrieve the latest rows from a CF
>         with OPP?
> 
> Use inverted timestamps (for example, 2^64 - timestamp) with zeros for
> padding as the row keys.
> 
> This way you can do a normal forward range scan and get the N latest
> rows.
> 
> -- 
> Tyler Hobbs
> Software Engineer, DataStax
> Maintainer of the pycassa Cassandra Python client library
> 

-- 
Best Regards,

Tan Yeh Zheng
Software Programmer

____________ ChartNexus® :: Chart Your Success ____________

ChartNexus Pte. Ltd.

15 Enggor Street #10-01
Realty Center
Singapore 079716
Tel:  (65) 6491 1456
Website: www.chartnexus.com



Re: latest rows

Posted by Tyler Hobbs <ty...@datastax.com>.
> What is the best way to retrieve the latest rows from a CF with OPP?
>

Use inverted timestamps (for example, 2^64 - timestamp) with zeros for
padding as the row keys.

This way you can do a normal forward range scan and get the N latest rows.
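
A minimal sketch of such keys (plain Python; assumes microsecond
timestamps, and the zero-padded width is just illustrative):

    import time

    KEY_WIDTH = 20  # all keys must be padded to the same width to sort correctly

    def inverted_key(ts_micros=None):
        # Newer rows get numerically smaller keys, so under OPP a normal
        # forward range scan returns the newest rows first.
        if ts_micros is None:
            ts_micros = int(time.time() * 1e6)
        return str(2**64 - ts_micros).zfill(KEY_WIDTH)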

-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

latest rows

Posted by Alaa Zubaidi <al...@pdf.com>.
Hi,

What is the best way to retrieve the latest rows from a CF with OPP?

We are using OPP and key range queries, but I cannot find an easy way to
get the latest 10 keys, for example, from a column family with 1000s of keys.
I really don't want to create another CF to store row key names as 
columns and then retrieve the latest columns from this CF and use the 
row keys to retrieve the latest data.

Regards and Thanks,
Alaa


Re: SSD vs. HDD

Posted by Nick Telford <ni...@gmail.com>.
If you're experiencing high I/O load and not getting any Java OutOfMemory
(OOM) errors, you should try to keep your heap size as low as possible as
this provides the OS filesystem cache with more memory, which will reduce
read I/O load significantly. I'm not familiar with the performance of Windows
filesystems, but I imagine NTFS is somewhat on a par with what we're
familiar with in Linux.

The row cache will be useful in cases where you have a high read/write ratio
(more reads than writes), especially if most of those reads are confined to a
specific subset of data. The key cache will also improve read performance
(which will be your main I/O bottleneck) with much less of a memory impact,
so in your case I would recommend enabling it for as many keys as possible.
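
In 0.6 the caches are per-CF settings in storage-conf.xml, along these
lines (the CF name and values here are only illustrative):

    <ColumnFamily Name="CF1"
                  CompareWith="BytesType"
                  KeysCached="100%"
                  RowsCached="1%"/>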

Riptano have a pretty decent explanation of tuning Cassandra that I highly
recommend you read: http://www.riptano.com/docs/0.6.5/operations/tuning

Regards,

Nick Telford

On 4 November 2010 22:20, Alaa Zubaidi <al...@pdf.com> wrote:

> Thanks for the advice...
> We are running on Windows, and I just added more memory to my system, 16G. I
> will run the test again with an 8G heap.
> The load is continuous; however, the CPU usage is around 40% with a max of
> 70%.
> As for cache, I am not using cache, because I am under the impression that
> cache in my case, where the data keeps changing very quickly in and out of
> cache, is not a good idea?
> Thanks
>
>
> On 11/4/2010 3:14 AM, Nick Telford wrote:
>
>> If you're bottle-necking on read I/O, making proper use of Cassandra's key
>> cache and row cache will improve things dramatically.
>>
>> A little maths using the numbers you've provided tells me that you have
>> about 80GB of "hot" data (data valid in a 4 hour period). That's obviously
>> too much to directly cache, but you can probably cache some or all of the
>> row keys, depending on your column distribution among keys. This will
>> prevent reads from having to hit the indexes for the relevant sstables -
>> eliminating a seek per sstable.
>>
>> If you have a subset of this data that is read more than the rest, the row
>> cache will help you out a lot too. Have a look at your access patterns and
>> see if it's worthwhile caching some rows.
>>
>> If you make progress using the various caches, but don't have enough
>> memory,
>> I'd explore the costs of expanding the available memory compared to
>> switching to SSDs as I imagine it'd be cheaper and would last longer.
>>
>> Finally, given your particular deletion pattern, it's probably worth
>> looking
>> at 0.7 and upgrading once it is released as stable. CASSANDRA-699[1] adds
>> support for TTL columns that automatically expire and get removed (during
>> compaction) without the need for a manual deletion mechanism. Failing
>> this,
>> since data older than 4 hours is no longer relevant, you should reduce
>> your
>> GCGraceSeconds to >= 4 hours. This will ensure deleted data is removed faster,
>> keeping your sstables smaller and allowing the fs cache to operate more
>> effectively.
>>
>> 1: https://issues.apache.org/jira/browse/CASSANDRA-699
>>
>> On 4 November 2010 08:18, Peter Schuller <peter.schuller@infidyne.com> wrote:
>>
>>>> I am having timeout errors while reading.
>>>> I have 5 CFs but two CFs with high write/read.
>>>> The data is organized in time series rows; in CF1 the new rows are read
>>>> every 10 seconds and then the whole rows are deleted, while in CF2 the
>>>> rows are read in different time range slices and eventually deleted,
>>>> maybe after a few hours.
>>> So the first thing to do is to confirm what the bottleneck is. If
>>> you're having timeouts on reads, and assuming you're not doing reads of
>>> hot-in-cache data so fast that CPU is the bottleneck (and given that
>>> you ask about SSD), the hypothesis then is that you're disk bound due
>>> to seeking.
>>>
>>> Observe the node(s) and in particular use "iostat -x -k 1" (or an
>>> equivalent graph) and look at the %util and %avgqu-sz columns to
>>> confirm that you are indeed disk-bound. Unless you're doing large
>>> reads, you will likely see, on average, small reads in amounts that
>>> simply saturate the underlying storage, %util at 100%, and the avgqu-sz will
>>> probably be approaching the level of concurrency of your read traffic.
>>>
>>> Now, assuming that is true, the question is why. So:
>>>
>>> (1) Are you continually saturating disk or just periodically?
>>> (2) If periodically, do the periods of saturation correlate with
>>> compaction being done by Cassandra (or for that matter something
>>> else)?
>>> (3) What is your data set size relative to system memory? What is your
>>> system memory and JVM heap size? (Relevant because it is important to
>>> look at how much memory the kernel will use for page caching.)
>>>
>>> As others have mentioned, the amount of reads done on disk for each
>>> read from the database (assuming data is not in cache) can be affected
>>> by how data is written (e.g., partial row writes etc). That is one
>>> thing that can be addressed, as is re-structuring data to allow
>>> reading more sequentially (if possible). That only helps along one
>>> dimension though - lessening, somewhat, the cost of cold reads. The
>>> gains may be limited and the real problem may be that you simply need
>>> more memory for caching and/or more IOPS from your storage (i.e., more
>>> disks, maybe SSD, etc).
>>>
>>> If on the other hand you're normally completely fine and you're just
>>> seeing periods of saturation associated with compaction, this may be
>>> mitigated by software improvements, such as rate limiting reads
>>> and/or writes during compaction and avoiding buffer cache thrashing.
>>> There's a JIRA ticket for direct I/O
>>> (https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
>>> there's a JIRA ticket for rate limiting, but I suspect, since you're
>>> doing time series data, that you're not storing very large values -
>>> and I would expect compaction to be CPU bound rather than close to
>>> saturating the disk.
>>>
>>> In either case, please do report back as it's interesting to figure
>>> out what kind of performance issues people are seeing.
>>>
>>> --
>>> / Peter Schuller
>>>
>>>
> --
> Alaa Zubaidi
> PDF Solutions, Inc.
> 333 West San Carlos Street, Suite 700
> San Jose, CA 95110  USA
> Tel: 408-283-5639 (or 408-280-7900 x5639)
> fax: 408-938-6479
> email: alaa.zubaidi@pdf.com
>
>
>

Re: SSD vs. HDD

Posted by Alaa Zubaidi <al...@pdf.com>.
Thanks for the advice...
We are running on Windows, and I just added more memory to my system,
16G. I will run the test again with an 8G heap.
The load is continuous; however, the CPU usage is around 40% with a max of 70%.
As for cache, I am not using cache, because I am under the impression
that cache in my case, where the data keeps changing very quickly in and
out of cache, is not a good idea?
Thanks

On 11/4/2010 3:14 AM, Nick Telford wrote:
> If you're bottle-necking on read I/O, making proper use of Cassandra's key
> cache and row cache will improve things dramatically.
>
> A little maths using the numbers you've provided tells me that you have
> about 80GB of "hot" data (data valid in a 4 hour period). That's obviously
> too much to directly cache, but you can probably cache some or all of the
> row keys, depending on your column distribution among keys. This will
> prevent reads from having to hit the indexes for the relevant sstables -
> eliminating a seek per sstable.
>
> If you have a subset of this data that is read more than the rest, the row
> cache will help you out a lot too. Have a look at your access patterns and
> see if it's worthwhile caching some rows.
>
> If you make progress using the various caches, but don't have enough memory,
> I'd explore the costs of expanding the available memory compared to
> switching to SSDs as I imagine it'd be cheaper and would last longer.
>
> Finally, given your particular deletion pattern, it's probably worth looking
> at 0.7 and upgrading once it is released as stable. CASSANDRA-699[1] adds
> support for TTL columns that automatically expire and get removed (during
> compaction) without the need for a manual deletion mechanism. Failing this,
> since data older than 4 hours is no longer relevant, you should reduce your
> GCGraceSeconds to >= 4 hours. This will ensure deleted data is removed faster,
> keeping your sstables smaller and allowing the fs cache to operate more
> effectively.
>
> 1: https://issues.apache.org/jira/browse/CASSANDRA-699
>
> On 4 November 2010 08:18, Peter Schuller <pe...@infidyne.com> wrote:
>
>>> I am having timeout errors while reading.
>>> I have 5 CFs but two CFs with high write/read.
>>> The data is organized in time series rows; in CF1 the new rows are read
>>> every 10 seconds and then the whole rows are deleted, while in CF2 the
>>> rows are read in different time range slices and eventually deleted,
>>> maybe after a few hours.
>> So the first thing to do is to confirm what the bottleneck is. If
>> you're having timeouts on reads, and assuming you're not doing reads of
>> hot-in-cache data so fast that CPU is the bottleneck (and given that
>> you ask about SSD), the hypothesis then is that you're disk bound due
>> to seeking.
>>
>> Observe the node(s) and in particular use "iostat -x -k 1" (or an
>> equivalent graph) and look at the %util and %avgqu-sz columns to
>> confirm that you are indeed disk-bound. Unless you're doing large
>> reads, you will likely see, on average, small reads in amounts that
>> simply saturate the underlying storage, %util at 100%, and the avgqu-sz will
>> probably be approaching the level of concurrency of your read traffic.
>>
>> Now, assuming that is true, the question is why. So:
>>
>> (1) Are you continually saturating disk or just periodically?
>> (2) If periodically, do the periods of saturation correlate with
>> compaction being done by Cassandra (or for that matter something
>> else)?
>> (3) What is your data set size relative to system memory? What is your
>> system memory and JVM heap size? (Relevant because it is important to
>> look at how much memory the kernel will use for page caching.)
>>
>> As others have mentioned, the amount of reads done on disk for each
>> read from the database (assuming data is not in cache) can be affected
>> by how data is written (e.g., partial row writes etc). That is one
>> thing that can be addressed, as is re-structuring data to allow
>> reading more sequentially (if possible). That only helps along one
>> dimension though - lessening, somewhat, the cost of cold reads. The
>> gains may be limited and the real problem may be that you simply need
>> more memory for caching and/or more IOPS from your storage (i.e., more
>> disks, maybe SSD, etc).
>>
>> If on the other hand you're normally completely fine and you're just
>> seeing periods of saturation associated with compaction, this may be
>> mitigated by software improvements, such as rate limiting reads
>> and/or writes during compaction and avoiding buffer cache thrashing.
>> There's a JIRA ticket for direct I/O
>> (https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
>> there's a JIRA ticket for rate limiting, but I suspect, since you're
>> doing time series data, that you're not storing very large values -
>> and I would expect compaction to be CPU bound rather than close to
>> saturating the disk.
>>
>> In either case, please do report back as it's interesting to figure
>> out what kind of performance issues people are seeing.
>>
>> --
>> / Peter Schuller
>>

-- 
Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 700
San Jose, CA 95110  USA
Tel: 408-283-5639 (or 408-280-7900 x5639)
fax: 408-938-6479
email: alaa.zubaidi@pdf.com



Re: SSD vs. HDD

Posted by Alaa Zubaidi <al...@pdf.com>.
It's a little bit different from what most people use it for, and that's
why we are trying to test it, to see if we can benefit from the speed of
writing/reading, scalability when and if we need it, and also the cost.
Part of the testing we are doing is trying to see how many nodes we
need in our cluster, since we know the data volume. So far, it's
almost double what we were calculating and hoping for, which is not so
good.

On 11/4/2010 4:18 AM, Juho Mäkinen wrote:
> Do you really need Cassandra to store just 80 GB of data for just four
> hours? It might be just me, but this sounds quite far removed
> from normal Cassandra usage. Cassandra isn't happy unless you run
> enough nodes to cover one or two nodes doing compaction (which hurts
> the node's performance). Are you ready to run at least two, preferably
> three C* nodes to store just 80GB of data?
>
>   - Garo
>
> On Thu, Nov 4, 2010 at 12:14 PM, Nick Telford<ni...@gmail.com>  wrote:
>> If you're bottle-necking on read I/O, making proper use of Cassandra's key
>> cache and row cache will improve things dramatically.
>> A little maths using the numbers you've provided tells me that you have
>> about 80GB of "hot" data (data valid in a 4 hour period). That's obviously
>> too much to directly cache, but you can probably cache some or all of the
>> row keys, depending on your column distribution among keys. This will
>> prevent reads from having to hit the indexes for the relevant sstables -
>> eliminating a seek per sstable.
>> If you have a subset of this data that is read more than the rest, the row
>> cache will help you out a lot too. Have a look at your access patterns and
>> see if it's worthwhile caching some rows.
>> If you make progress using the various caches, but don't have enough memory,
>> I'd explore the costs of expanding the available memory compared to
>> switching to SSDs as I imagine it'd be cheaper and would last longer.
>> Finally, given your particular deletion pattern, it's probably worth looking
>> at 0.7 and upgrading once it is released as stable. CASSANDRA-699[1] adds
>> support for TTL columns that automatically expire and get removed (during
>> compaction) without the need for a manual deletion mechanism. Failing this,
>> since data older than 4 hours is no longer relevant, you should reduce your
>> GCGraceSeconds to >= 4 hours. This will ensure deleted data is removed faster,
>> keeping your sstables smaller and allowing the fs cache to operate more
>> effectively.
>> 1: https://issues.apache.org/jira/browse/CASSANDRA-699
>> On 4 November 2010 08:18, Peter Schuller<pe...@infidyne.com>
>> wrote:
>>>> I am having timeout errors while reading.
>>>> I have 5 CFs but two CFs with high write/read.
>>>> The data is organized in time series rows; in CF1 the new rows are read
>>>> every 10 seconds and then the whole rows are deleted, while in CF2 the
>>>> rows are read in different time range slices and eventually deleted,
>>>> maybe after a few hours.
>>> So the first thing to do is to confirm what the bottleneck is. If
>>> you're having timeouts on reads, and assuming you're not doing reads of
>>> hot-in-cache data so fast that CPU is the bottleneck (and given that
>>> you ask about SSD), the hypothesis then is that you're disk bound due
>>> to seeking.
>>>
>>> Observe the node(s) and in particular use "iostat -x -k 1" (or an
>>> equivalent graph) and look at the %util and %avgqu-sz columns to
>>> confirm that you are indeed disk-bound. Unless you're doing large
>>> reads, you will likely see, on average, small reads in amounts that
>>> simply saturate the underlying storage, %util at 100%, and the avgqu-sz will
>>> probably be approaching the level of concurrency of your read traffic.
>>>
>>> Now, assuming that is true, the question is why. So:
>>>
>>> (1) Are you continually saturating disk or just periodically?
>>> (2) If periodically, do the periods of saturation correlate with
>>> compaction being done by Cassandra (or for that matter something
>>> else)?
>>> (3) What is your data set size relative to system memory? What is your
>>> system memory and JVM heap size? (Relevant because it is important to
>>> look at how much memory the kernel will use for page caching.)
>>>
>>> As others have mentioned, the amount of reads done on disk for each
>>> read from the database (assuming data is not in cache) can be affected
>>> by how data is written (e.g., partial row writes etc). That is one
>>> thing that can be addressed, as is re-structuring data to allow
>>> reading more sequentially (if possible). That only helps along one
>>> dimension though - lessening, somewhat, the cost of cold reads. The
>>> gains may be limited and the real problem may be that you simply need
>>> more memory for caching and/or more IOPS from your storage (i.e., more
>>> disks, maybe SSD, etc).
>>>
>>> If on the other hand you're normally completely fine and you're just
>>> seeing periods of saturation associated with compaction, this may be
>>> mitigated by software improvements, such as rate limiting reads
>>> and/or writes during compaction and avoiding buffer cache thrashing.
>>> There's a JIRA ticket for direct I/O
>>> (https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
>>> there's a JIRA ticket for rate limiting, but I suspect, since you're
>>> doing time series data, that you're not storing very large values -
>>> and I would expect compaction to be CPU bound rather than close to
>>> saturating the disk.
>>>
>>> In either case, please do report back as it's interesting to figure
>>> out what kind of performance issues people are seeing.
>>>
>>> --
>>> / Peter Schuller
>>
>

-- 
Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 700
San Jose, CA 95110  USA
Tel: 408-283-5639 (or 408-280-7900 x5639)
fax: 408-938-6479
email: alaa.zubaidi@pdf.com



Re: SSD vs. HDD

Posted by Juho Mäkinen <ju...@gmail.com>.
Do you really need Cassandra to store just 80 GB of data for just four
hours? It might be just me, but this sounds quite far removed
from normal Cassandra usage. Cassandra isn't happy unless you run
enough nodes to cover one or two nodes doing compaction (which hurts
the node's performance). Are you ready to run at least two, preferably
three C* nodes to store just 80GB of data?

 - Garo

On Thu, Nov 4, 2010 at 12:14 PM, Nick Telford <ni...@gmail.com> wrote:
> If you're bottle-necking on read I/O, making proper use of Cassandra's key
> cache and row cache will improve things dramatically.
> A little maths using the numbers you've provided tells me that you have
> about 80GB of "hot" data (data valid in a 4 hour period). That's obviously
> too much to directly cache, but you can probably cache some or all of the
> row keys, depending on your column distribution among keys. This will
> prevent reads from having to hit the indexes for the relevant sstables -
> eliminating a seek per sstable.
> If you have a subset of this data that is read more than the rest, the row
> cache will help you out a lot too. Have a look at your access patterns and
> see if it's worthwhile caching some rows.
> If you make progress using the various caches, but don't have enough memory,
> I'd explore the costs of expanding the available memory compared to
> switching to SSDs as I imagine it'd be cheaper and would last longer.
> Finally, given your particular deletion pattern, it's probably worth looking
> at 0.7 and upgrading once it is released as stable. CASSANDRA-699[1] adds
> support for TTL columns that automatically expire and get removed (during
> compaction) without the need for a manual deletion mechanism. Failing this,
> since data older than 4 hours is no longer relevant, you should reduce your
> GCGraceSeconds to >= 4 hours. This will ensure deleted data is removed faster,
> keeping your sstables smaller and allowing the fs cache to operate more
> effectively.
> 1: https://issues.apache.org/jira/browse/CASSANDRA-699
> On 4 November 2010 08:18, Peter Schuller <pe...@infidyne.com>
> wrote:
>>
>> > I am having timeout errors while reading.
>> > I have 5 CFs but two CFs with high write/read.
>> > The data is organized in time series rows; in CF1 the new rows are read
>> > every 10 seconds and then the whole rows are deleted, while in CF2 the
>> > rows are read in different time range slices and eventually deleted,
>> > maybe after a few hours.
>>
>> So the first thing to do is to confirm what the bottleneck is. If
>> you're having timeouts on reads, and assuming you're not doing reads of
>> hot-in-cache data so fast that CPU is the bottleneck (and given that
>> you ask about SSD), the hypothesis then is that you're disk bound due
>> to seeking.
>>
>> Observe the node(s) and in particular use "iostat -x -k 1" (or an
>> equivalent graph) and look at the %util and %avgqu-sz columns to
>> confirm that you are indeed disk-bound. Unless you're doing large
>> reads, you will likely see, on average, small reads in amounts that
>> simply saturate the underlying storage, %util at 100%, and the avgqu-sz will
>> probably be approaching the level of concurrency of your read traffic.
>>
>> Now, assuming that is true, the question is why. So:
>>
>> (1) Are you continually saturating disk or just periodically?
>> (2) If periodically, do the periods of saturation correlate with
>> compaction being done by Cassandra (or for that matter something
>> else)?
>> (3) What is your data set size relative to system memory? What is your
>> system memory and JVM heap size? (Relevant because it is important to
>> look at how much memory the kernel will use for page caching.)
>>
>> As others have mentioned, the amount of reads done on disk for each
>> read from the database (assuming data is not in cache) can be affected
>> by how data is written (e.g., partial row writes etc). That is one
>> thing that can be addressed, as is re-structuring data to allow
>> reading more sequentially (if possible). That only helps along one
>> dimension though - lessening, somewhat, the cost of cold reads. The
>> gains may be limited and the real problem may be that you simply need
>> more memory for caching and/or more IOPS from your storage (i.e., more
>> disks, maybe SSD, etc).
>>
>> If on the other hand you're normally completely fine and you're just
>> seeing periods of saturation associated with compaction, this may be
>> mitigated by software improvements, such as rate limiting reads
>> and/or writes during compaction and avoiding buffer cache thrashing.
>> There's a JIRA ticket for direct I/O
>> (https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
>> there's a JIRA ticket for rate limiting, but I suspect, since you're
>> doing time series data, that you're not storing very large values -
>> and I would expect compaction to be CPU bound rather than close to
>> saturating the disk.
>>
>> In either case, please do report back as it's interesting to figure
>> out what kind of performance issues people are seeing.
>>
>> --
>> / Peter Schuller
>
>

Re: SSD vs. HDD

Posted by Nick Telford <ni...@gmail.com>.
If you're bottle-necking on read I/O, making proper use of Cassandra's key
cache and row cache will improve things dramatically.

A little maths using the numbers you've provided tells me that you have
about 80GB of "hot" data (data valid in a 4 hour period). That's obviously
too much to directly cache, but you can probably cache some or all of the
row keys, depending on your column distribution among keys. This will
prevent reads from having to hit the indexes for the relevant sstables -
eliminating a seek per sstable.

If you have a subset of this data that is read more than the rest, the row
cache will help you out a lot too. Have a look at your access patterns and
see if it's worthwhile caching some rows.

If you make progress using the various caches, but don't have enough memory,
I'd explore the costs of expanding the available memory compared to
switching to SSDs as I imagine it'd be cheaper and would last longer.

Finally, given your particular deletion pattern, it's probably worth looking
at 0.7 and upgrading once it is released as stable. CASSANDRA-699[1] adds
support for TTL columns that automatically expire and get removed (during
compaction) without the need for a manual deletion mechanism. Failing this,
since data older than 4 hours is no longer relevant, you should reduce your
GCGraceSeconds to >= 4 hours. This will ensure deleted data is removed faster,
keeping your sstables smaller and allowing the fs cache to operate more
effectively.

1: https://issues.apache.org/jira/browse/CASSANDRA-699
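
For illustration, a TTL write against 0.7 looks roughly like this through
a client such as pycassa (keyspace/CF names are made up; check your
client for the exact API):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')  # names are illustrative
    cf = pycassa.ColumnFamily(pool, 'CF2')

    # The column expires 4 hours after the write and is purged during
    # compaction; no explicit delete is needed.
    cf.insert('row-key', {'col': 'value'}, ttl=4 * 60 * 60)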

On 4 November 2010 08:18, Peter Schuller <pe...@infidyne.com> wrote:

> > I am having time out errors while reading.
> > I have 5 CFs but two CFs with high write/read.
> > The data is organized in time series rows, in CF1 the new rows are read
> > every 10 seconds and then the whole rows are deleted, While in CF2 the
> rows
> > are read in different time range slices and eventually deleted may be
> after
> > few hours.
>
> So the first thing to do is to confirm what the bottleneck is. If
> you're having timeouts on reads, and assuming your not doing reads of
> hot-in-cache data so fast that CPU is the bottleneck (and given that
> you ask about SSD), the hypothesis then is that you're disk bound due
> to seeking.
>
> Observe the node(s) and in particular use "iostat -x -k 1" (or an
> equivalent graph) and look at the %util and %avgqu-sz columns to
> confirm that you are indeed disk-bound. Unless you're doing large
> reads, you will likely see, on average, small reads in amounts that
> simply saturate underlying storage, %util at 100% and the avgu-sz will
> probably be approaching the level of concurrency of your read traffic.
>
> Now, assuming that is true, the question is why. So:
>
> (1) Are you continually saturating disk or just periodically?
> (2) If periodically, does the periods of saturation correlate with
> compaction being done by Cassandra (or for that matter something
> else)?
> (3) What is your data set size relative to system memory? What is your
> system memory and JVM heap size? (Relevant because it is important to
> look at how much memory the kernel will use for page caching.)
>
> As others have mentioned, the amount of reads done on disk for each
> read form the database (assuming data is not in cache) can be affected
> by how data is written (e.g., partial row writes etc). That is one
> thing that can be addressed, as is re-structuring data to allow
> reading more sequentially (if possible). That only helps along one
> dimension though - lessening, somewhat, the cost of cold reads. The
> gains may be limited and the real problem may be that you simply need
> more memory for caching and/or more IOPS from your storage (i.e., more
> disks, maybe SSD, etc).
>
> If on the other hand you're normally completely fine and you're just
> seeing periods of saturation associated with compaction, this may be
> mitigated by software improvements by possibly rate limiting reads
> and/or writes during compaction and avoiding buffer cache thrashing.
> There's a JIRA ticket for direct I/O
> (https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
> there's a JIRA ticket for rate limiting, but I suspect, since you're
> doing time series data, that you're not storing very large values -
> and I would expect compaction to be CPU bound rather than being close
> to saturate disk.
>
> In either case, please do report back as it's interesting to figure
> out what kind of performance issues people are seeing.
>
> --
> / Peter Schuller
>

Re: SSD vs. HDD

Posted by Peter Schuller <pe...@infidyne.com>.
> I am having timeout errors while reading.
> I have 5 CFs but two CFs with high write/read.
> The data is organized in time series rows; in CF1 the new rows are read
> every 10 seconds and then the whole rows are deleted, while in CF2 the rows
> are read in different time range slices and eventually deleted, maybe after
> a few hours.

So the first thing to do is to confirm what the bottleneck is. If
you're having timeouts on reads, and assuming you're not doing reads of
hot-in-cache data so fast that CPU is the bottleneck (and given that
you ask about SSD), the hypothesis then is that you're disk bound due
to seeking.

Observe the node(s) and in particular use "iostat -x -k 1" (or an
equivalent graph) and look at the %util and %avgqu-sz columns to
confirm that you are indeed disk-bound. Unless you're doing large
reads, you will likely see, on average, small reads in amounts that
simply saturate the underlying storage, %util at 100%, and the avgqu-sz will
probably be approaching the level of concurrency of your read traffic.

Now, assuming that is true, the question is why. So:

(1) Are you continually saturating disk or just periodically?
(2) If periodically, do the periods of saturation correlate with
compaction being done by Cassandra (or for that matter something
else)?
(3) What is your data set size relative to system memory? What is your
system memory and JVM heap size? (Relevant because it is important to
look at how much memory the kernel will use for page caching.)

As others have mentioned, the amount of reads done on disk for each
read from the database (assuming data is not in cache) can be affected
by how data is written (e.g., partial row writes etc). That is one
thing that can be addressed, as is re-structuring data to allow
reading more sequentially (if possible). That only helps along one
dimension though - lessening, somewhat, the cost of cold reads. The
gains may be limited and the real problem may be that you simply need
more memory for caching and/or more IOPS from your storage (i.e., more
disks, maybe SSD, etc).

If on the other hand you're normally completely fine and you're just
seeing periods of saturation associated with compaction, this may be
mitigated by software improvements, such as rate limiting reads
and/or writes during compaction and avoiding buffer cache thrashing.
There's a JIRA ticket for direct I/O
(https://issues.apache.org/jira/browse/CASSANDRA-1470). I don't think
there's a JIRA ticket for rate limiting, but I suspect, since you're
doing time series data, that you're not storing very large values -
and I would expect compaction to be CPU bound rather than close to
saturating the disk.

In either case, please do report back as it's interesting to figure
out what kind of performance issues people are seeing.

-- 
/ Peter Schuller

Re: SSD vs. HDD

Posted by Alaa Zubaidi <al...@pdf.com>.
Thanks for the reply.
I am having timeout errors while reading.
I have 5 CFs but two CFs with high write/read.
The data is organized in time series rows; in CF1 the new rows are read
every 10 seconds and then the whole rows are deleted, while in CF2 the
rows are read in different time range slices and eventually deleted,
maybe after a few hours.

Thanks

On 11/3/2010 1:58 PM, Tyler Hobbs wrote:
> SSD will not generally improve your write performance very much, but they
> can significantly improve read performance.
>
> You do *not* want to waste an SSD on the commitlog drive, as even a slow HDD
> can write sequentially very quickly.  For the data drive, they might make
> sense.
>
> As Jonathan talks about, it has a lot to do with your access patterns.  If
> you either: (1) delete parts of rows, (2) update parts of rows, or (3) insert
> new columns into existing rows frequently, you'll end up with rows spread
> across several SSTables (which are on disk).  This means that each read may
> require several seeks, which are very slow for HDDs, but are very quick for
> SSDs.
>
> Of course, the randomness of what rows you access is also important, but
> Jonathan did a good job of covering that.  Don't forget about the effects of
> caching here, too.
>
> The only way to tell if it is cost-effective is to test your particular
> access patterns (using a configured stress.py test or, preferably, your
> actual application).
>
> - Tyler
>
> On Wed, Nov 3, 2010 at 3:44 PM, Jonathan Shook<js...@gmail.com>  wrote:
>
>> SSDs become unreliable after a number of writes that is relatively
>> low compared to spinning disks.
>> They may significantly boost performance if used on the "journal"
>> storage, but will suffer short lifetimes for highly-random write
>> patterns.
>>
>> In general, plan to replace them frequently. Whether they are worth
>> it, given the performance improvement over the cost of replacement x
>> hardware x logistics is generally a calculus problem. It's difficult
>> to make a generic rationale for or against them.
>>
>> You might be better off in general by throwing more memory at your
>> servers, and isolating your random access from your journaled data.
>> Is there any pattern to your reads and writes/deletes? If it is fully
>> random across your keys, then you have the worst-case scenario.
>> Sometimes you can impose access patterns or structural patterns in
>> your app which make caching more effective.
>>
>> Good questions to ask about your data access:
>> Is there a "user session" which shows an access pattern to proximal data?
>> Are there sets of access which always happen close together?
>> Are there keys or maps which add extra indirection?
>>
>> I'm not familiar with your situation. I was just providing some general
>> ideas.
>>
>> Jonathan Shook
>>
>> On Wed, Nov 3, 2010 at 2:32 PM, Alaa Zubaidi<al...@pdf.com>  wrote:
>>> Hi,
>>> we have continuous high-throughput writes, reads, and deletes, and we are
>>> trying to find the best hardware.
>>> Does using SSDs for Cassandra improve performance? Did anyone compare SSD
>>> vs. HDD? And any recommendations on SSDs?
>>>
>>> Thanks,
>>> Alaa
>>>
>>>

-- 
Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 700
San Jose, CA 95110  USA
Tel: 408-283-5639 (or 408-280-7900 x5639)
fax: 408-938-6479
email: alaa.zubaidi@pdf.com



Re: SSD vs. HDD

Posted by Tyler Hobbs <ty...@riptano.com>.
SSD will not generally improve your write performance very much, but they
can significantly improve read performance.

You do *not* want to waste an SSD on the commitlog drive, as even a slow HDD
can write sequentially very quickly.  For the data drive, they might make
sense.

As Jonathan talks about, it has a lot to do with your access patterns.  If
you either: (1) delete parts of rows, (2) update parts of rows, or (3) insert
new columns into existing rows frequently, you'll end up with rows spread
across several SSTables (which are on disk).  This means that each read may
require several seeks, which are very slow for HDDs, but are very quick for
SSDs.

Of course, the randomness of what rows you access is also important, but
Jonathan did a good job of covering that.  Don't forget about the effects of
caching here, too.

The only way to tell if it is cost-effective is to test your particular
access patterns (using a configured stress.py test or, preferably, your
actual application).
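
If you don't want to start from stress.py, even a crude timing loop can
be telling; a pycassa-style sketch (all names are made up, and the key
distribution should mimic your real access pattern):

    import random
    import time

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')  # all names here are made up
    cf = pycassa.ColumnFamily(pool, 'CF1')

    # Read random keys and measure throughput; run the same loop against an
    # HDD-backed node and an SSD-backed node and compare the numbers.
    keys = ['row-%d' % random.randrange(1000000) for _ in range(10000)]
    start = time.time()
    for key in keys:
        try:
            cf.get(key, column_count=10)
        except pycassa.NotFoundException:
            pass
    print '%.0f reads/sec' % (len(keys) / (time.time() - start))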

- Tyler

On Wed, Nov 3, 2010 at 3:44 PM, Jonathan Shook <js...@gmail.com> wrote:

> SSDs become unreliable after a number of writes that is relatively
> low compared to spinning disks.
> They may significantly boost performance if used on the "journal"
> storage, but will suffer short lifetimes for highly-random write
> patterns.
>
> In general, plan to replace them frequently. Whether they are worth
> it, given the performance improvement over the cost of replacement x
> hardware x logistics is generally a calculus problem. It's difficult
> to make a generic rationale for or against them.
>
> You might be better off in general by throwing more memory at your
> servers, and isolating your random access from your journaled data.
> Is there any pattern to your reads and writes/deletes? If it is fully
> random across your keys, then you have the worst-case scenario.
> Sometimes you can impose access patterns or structural patterns in
> your app which make caching more effective.
>
> Good questions to ask about your data access:
> Is there a "user session" which shows an access pattern to proximal data?
> Are there sets of access which always happen close together?
> Are there keys or maps which add extra indirection?
>
> I'm not familiar with your situation. I was just providing some general
> ideas.
>
> Jonathan Shook
>
> On Wed, Nov 3, 2010 at 2:32 PM, Alaa Zubaidi <al...@pdf.com> wrote:
> > Hi,
> > we have continuous high-throughput writes, reads, and deletes, and we are
> > trying to find the best hardware.
> > Does using SSDs for Cassandra improve performance? Did anyone compare SSD
> > vs. HDD? And any recommendations on SSDs?
> >
> > Thanks,
> > Alaa
> >
> >
>

Re: SSD vs. HDD

Posted by Eric Rosenberry <er...@rosenberry.org>.
Some comments inline...

On Wed, Nov 3, 2010 at 1:44 PM, Jonathan Shook <js...@gmail.com> wrote:

> SSDs become unreliable after a number of writes that is relatively
> low compared to spinning disks.
> They may significantly boost performance if used on the "journal"
> storage, but will suffer short lifetimes for highly-random write
> patterns.
>

I agree with this statement in general; however, my understanding is
that Cassandra NEVER does random writes.  It only ever does large sequential
writes.  Cassandra could potentially be the perfect use case for MLC
(multi-level-cell) SSDs.

On Wed, Nov 3, 2010 at 1:58 PM, Tyler Hobbs <ty...@riptano.com> wrote:
>
> You do *not* want to waste an SSD on the commitlog drive, as even a slow
> HDD can write sequentially very quickly.  For the data drive, they might
> make sense.
>

Totally agreed, we do a few thousand writes per second on a single 7200rpm
SATA disk.

On Wed, Nov 3, 2010 at 5:20 PM, Alaa Zubaidi <al...@pdf.com> wrote:

> Around 1800 columns/sec per node, 3 KB columns; the read rate is the same.
> Data will be deleted after 4 hours.


Hmm, only keeping the data for 4 hours could present some unique challenges
with Cassandra since it does not actually delete the data (it only
tombstones the data).  There are several factors that play into when exactly
the data actually goes away.

-Eric

Re: SSD vs. HDD

Posted by Jonathan Shook <js...@gmail.com>.
SSDs become unreliable after a number of writes that is relatively
low compared to spinning disks.
They may significantly boost performance if used on the "journal"
storage, but will suffer short lifetimes for highly-random write
patterns.

In general, plan to replace them frequently. Whether they are worth
it, given the performance improvement over the cost of replacement x
hardware x logistics is generally a calculus problem. It's difficult
to make a generic rationale for or against them.

You might be better off in general by throwing more memory at your
servers, and isolating your random access from your journaled data.
Is there any pattern to your reads and writes/deletes? If it is fully
random across your keys, then you have the worst-case scenario.
Sometimes you can impose access patterns or structural patterns in
your app which make caching more effective.

Good questions to ask about your data access:
Is there a "user session" which shows an access pattern to proximal data?
Are there sets of access which always happen close together?
Are there keys or maps which add extra indirection?

I'm not familiar with your situation. I was just providing some general ideas.

Jonathan Shook

On Wed, Nov 3, 2010 at 2:32 PM, Alaa Zubaidi <al...@pdf.com> wrote:
> Hi,
> we have continuous high-throughput writes, reads, and deletes, and we are
> trying to find the best hardware.
> Does using SSDs for Cassandra improve performance? Did anyone compare SSD vs.
> HDD? And any recommendations on SSDs?
>
> Thanks,
> Alaa
>
>