Posted to user@cassandra.apache.org by Thomas Whiteway <Th...@metaswitch.com> on 2014/10/22 13:34:02 UTC

Performance Issue: Keeping rows in memory

Hi,

I'm working on an application using a Cassandra (2.1.0) cluster where

- our entire dataset is around 22GB

- each node has 48GB of memory but only a single (mechanical) hard disk

- in normal operation we have a low level of writes and no reads

- very occasionally we need to read rows very fast (>1.5K rows/second), and only read each row once.

When we try to read the rows it takes up to five minutes before Cassandra is able to keep up.  The problem seems to be that it takes a while to get the data into the page cache, and until then Cassandra can't retrieve the data from disk fast enough (e.g. if I drop the page cache mid-test then Cassandra slows down for the next five minutes).

Given that the total amount of data should fit comfortably in memory, I've been trying to find a way to keep the rows cached in memory, but there doesn't seem to be a particularly great way to achieve this.

I've tried enabling the row cache and pre-populating it by querying every row before starting the load, which gives good performance, but the row cache isn't really intended to be used this way and we'd be fighting the row cache to keep the rows in (e.g. by cyclically reading through all the rows during normal operation).
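
For concreteness, the pre-populating pass is roughly the following (a sketch using the DataStax Python driver; keyspace, table, and column names are placeholders):

    from cassandra.cluster import Cluster

    # Walk every partition once so the row cache is warm before the
    # real read load starts.  my_ks/my_table/id are placeholder names.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("my_ks")
    session.default_fetch_size = 1000  # paged scan, bounded client memory

    for row in session.execute("SELECT id FROM my_table"):
        # Reading the whole row pulls it through the row cache.
        session.execute("SELECT * FROM my_table WHERE id = %s", (row.id,))

    cluster.shutdown()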

Keeping the page cache warm by running a background task that keeps accessing the sstable files would be simpler, and currently this is the solution we're leaning towards, but we have less control over the page cache: it would be vulnerable to other processes knocking Cassandra's files out, and it generally feels like a bit of a hack.
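
The background warmer we're considering is along these lines (a sketch; the data directory path is an assumption about the install):

    import os
    import time

    DATA_DIR = "/var/lib/cassandra/data"  # assumed install path

    def touch_pages(path, chunk=1 << 20):
        # Sequentially read the file so the kernel keeps its pages cached.
        with open(path, "rb") as f:
            while f.read(chunk):
                pass

    while True:
        for dirpath, _, filenames in os.walk(DATA_DIR):
            for name in filenames:
                if name.endswith("-Data.db"):  # sstable data components
                    touch_pages(os.path.join(dirpath, name))
        time.sleep(300)  # re-walk every five minutes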

Has anyone had any success doing something similar, or any suggestions for possible solutions?

Thanks,
Thomas


RE: Performance Issue: Keeping rows in memory

Posted by Thomas Whiteway <Th...@metaswitch.com>.
I was using the pre-2.1.0 configuration scheme of setting caching to ‘rows_only’ on the column family.  I’ve tried runs with row_cache_size_in_mb set to both 16384 and 32768.

I don’t think the new settings would have helped in my case.  My understanding of the rows_per_partition setting is that it lets you restrict how many rows are cached per partition compared to the pre-2.1.0 behaviour, whereas we want to cache as much as possible.
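
For anyone comparing the two schemes, they look roughly like this (a sketch via the Python driver; keyspace/table names are made up, and you’d use whichever form your server version accepts):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()

    # Pre-2.1 style: cache whole rows for the column family.
    session.execute("ALTER TABLE my_ks.my_table WITH caching = 'rows_only'")

    # 2.1 style: the row cache is configured per partition; 'ALL' caches
    # every row, the closest equivalent to the old rows_only setting.  The
    # overall size is still bounded by row_cache_size_in_mb in cassandra.yaml.
    session.execute(
        "ALTER TABLE my_ks.my_table "
        "WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}"
    )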

From: DuyHai Doan [mailto:doanduyhai@gmail.com]
Sent: 22 October 2014 16:59
To: user@cassandra.apache.org
Cc: James Lee
Subject: Re: Performance Issue: Keeping rows in memory

If you're using 2.1.0, the row cache has been redesigned. How did you configure it? There are some new parameters to specify how many "CQL rows" you want to keep in the cache: http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1

On Wed, Oct 22, 2014 at 1:34 PM, Thomas Whiteway <Th...@metaswitch.com> wrote:
> [snip: original message quoted in full]



Re: Performance Issue: Keeping rows in memory

Posted by DuyHai Doan <do...@gmail.com>.
If you're using 2.1.0, the row cache has been redesigned. How did you
configure it? There are some new parameters to specify how many "CQL rows"
you want to keep in the cache:
http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1

On Wed, Oct 22, 2014 at 1:34 PM, Thomas Whiteway <Thomas.Whiteway@metaswitch.com> wrote:

> [snip: original message quoted in full]

Re: Performance Issue: Keeping rows in memory

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Oct 22, 2014 at 4:34 AM, Thomas Whiteway <Thomas.Whiteway@metaswitch.com> wrote:

> [snip: original message quoted in full]

Use:

populate_io_cache_on_flush

It's designed for this case. "flush" here also includes the "flush"
that comes at the end of compaction.
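
It's a table property, so setting it looks something like this (a sketch via the Python driver; the table name is a placeholder, and I haven't verified that 2.1.0 still accepts the property):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()

    # Hint Cassandra to pull newly flushed/compacted sstables into the
    # page cache.  my_ks.my_table is a placeholder; check that your
    # version still supports this property before relying on it.
    session.execute(
        "ALTER TABLE my_ks.my_table WITH populate_io_cache_on_flush = true"
    )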

Kevin Burton's (hi! :D) https://code.google.com/p/linux-ftools/ will help
you keep the SSTables in the page cache when, e.g., rebooting nodes.
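
If you'd rather script it than shell out to the ftools binaries, the same advice can be issued from Python 3.3+ (the sstable path below is illustrative):

    import os

    def willneed(path):
        # Ask the kernel to fault the file's pages into the page cache,
        # similar in spirit to linux-ftools' fadvise tool.
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)

    # Placeholder path; point this at the sstables you care about.
    willneed("/var/lib/cassandra/data/my_ks/my_table/my_table-ka-1-Data.db")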

=Rob

RE: Performance Issue: Keeping rows in memory

Posted by Thomas Whiteway <Th...@metaswitch.com>.
I haven't tried running a query trace.  I'm pretty confident that the difference in performance during the test comes down to whether the files are cached or not, as:
- if I explicitly empty the page cache before the test, I get a 5 minute slow period
- if I leave a few hours between tests but don't touch the page cache explicitly, I get a 2-3 minute slow period
- if I warm the page cache by reading all the files in the Cassandra data directory before the test, I don't get any slow period.
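
(For reference, "empty the page cache" above is just the usual drop_caches knob, e.g.:

    import subprocess

    # Flush dirty pages, then drop the page cache (needs root).
    subprocess.run("sync && echo 3 > /proc/sys/vm/drop_caches",
                   shell=True, check=True)

and warming is a plain sequential read of everything under the data directory, as in the sketch in my first mail.)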

-----Original Message-----
From: jonathan.haddad@gmail.com [mailto:jonathan.haddad@gmail.com] On Behalf Of Jonathan Haddad
Sent: 22 October 2014 17:20
To: user@cassandra.apache.org
Cc: James Lee
Subject: Re: Performance Issue: Keeping rows in memory

First, did you run a query trace?

I recommend Al Tobey's pcstat util to determine if your files are in the buffer cache: https://github.com/tobert/pcstat



On Wed, Oct 22, 2014 at 4:34 AM, Thomas Whiteway <Th...@metaswitch.com> wrote:
> [snip: original message quoted in full]



--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: Performance Issue: Keeping rows in memory

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
First, did you run a query trace?

I recommend Al Tobey's pcstat util to determine if your files are in
the buffer cache: https://github.com/tobert/pcstat
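
pcstat just takes file paths, and the driver can request a per-query
trace, e.g. (paths and names below are illustrative):

    import glob
    import subprocess

    from cassandra.cluster import Cluster

    # Show how much of each sstable is resident in the page cache.
    for path in glob.glob("/var/lib/cassandra/data/my_ks/*/*-Data.db"):
        subprocess.run(["pcstat", path], check=True)

    # Request a server-side trace for one read to see where the time goes.
    session = Cluster(["10.0.0.1"]).connect("my_ks")
    result = session.execute("SELECT * FROM my_table WHERE id = 42",
                             trace=True)
    for event in result.get_query_trace().events:
        print(event.source_elapsed, event.description)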



On Wed, Oct 22, 2014 at 4:34 AM, Thomas Whiteway
<Th...@metaswitch.com> wrote:
> [snip: original message quoted in full]



-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade