Posted to user@cassandra.apache.org by Artie Copeland <ye...@gmail.com> on 2010/08/07 01:17:49 UTC

row cache during bootstrap

The way I understand row caches, each node has an
independent cache, in that nodes do not share their cache contents with other
nodes.  If that is the case, is it also true that when a new node is added to
the cluster it has to build up its own cache?  If so, I see that
as a possible performance bottleneck once the node starts to accept
requests, since there is no way I know of to warm the cache without adding
the node to the cluster.  Would it be feasible to have the
bootstrap process stream not only data from other nodes but also the cached
rows associated with those same keys?  That would allow the new node to
provide the best performance as soon as the bootstrap process finishes.

-- 
http://yeslinux.org
http://yestech.org

Re: row cache during bootstrap

Posted by Artie Copeland <ye...@gmail.com>.
On Sun, Aug 8, 2010 at 5:24 AM, aaron morton <aa...@thelastpickle.com> wrote:

> Not sure how feasible it is or if it's planned. But it would probably
> require that the nodes are able to share the state of their row cache so as
> to know which parts to warm. Otherwise it sounds like you're assuming the
> node can hold the entire data set in memory.
>

I'm not assuming the node can hold the entire Cassandra data set in
memory, if that's what you meant.  I was thinking of sharing the state of the
row cache, but only for those keys that are being moved for the token.  The
other keys can stay hidden from the node.
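
To make the idea concrete, here is a minimal sketch of selecting only the cached keys whose tokens fall in the range being handed to the new node. It is not how Cassandra's bootstrap works. The names token_for, in_range and cached_keys_to_share are hypothetical, and the MD5-based token is a simplified stand-in for a hash partitioner.

```python
import hashlib
from typing import Iterable, List


def token_for(key: bytes) -> int:
    """Simplified, hypothetical token: the MD5 digest read as an integer."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


def in_range(token: int, left: int, right: int) -> bool:
    """True if token lies in (left, right], allowing wrap-around on the ring."""
    if left < right:
        return left < token <= right
    # The range wraps past the top of the ring (left == right covers it all).
    return token > left or token <= right


def cached_keys_to_share(cached_keys: Iterable[bytes],
                         left: int, right: int) -> List[bytes]:
    """Keep only the cached keys owned by the range moving to the new node."""
    return [k for k in cached_keys if in_range(token_for(k), left, right)]
```

The existing node would send only this filtered list, so the rest of its cache stays private.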


> If you know in your application when you would like data to be in the
> cache, you can send a query like get_range_slices to the cluster and ask for
> 0 columns. That will warm the row cache for the keys it hits.
>

This is a tough one, as our row cache is over 20 million and takes a while to
reach a high hit ratio, so the node is taking requests while we try to preload
it.  If it were possible to bring up a node that doesn't announce its
availability to the cluster, that would help us warm the cache manually.  I
know this feature is in the issue tracker currently, but it didn't look like
it would come out any time before 0.8.

>
> I have heard it mentioned that the coordinator node will take action
> when one node is considered to be running slow. So it may be able to work
> around the new node until it gets warmed up.
>

That is interesting, I hadn't heard that one.  With the parallel
reads that are happening, it makes sense that it would be possible.  That is,
unless the data is local.  I believe in that case it always prefers to read
locally rather than over the network, so if the local machine is the slow
node that wouldn't help.

>
> Are you adding nodes often?
>
Currently not that often.  The main issue is that we have very stringent
latency requirements, and for anything that would affect them we have to
understand the worst-case cost to see if we can avoid it.

>
> Aaron


-- 
http://yeslinux.org
http://yestech.org

Re: row cache during bootstrap

Posted by aaron morton <aa...@thelastpickle.com>.
Not sure how feasible it is or if it's planned. But it would probably require that the nodes are able to share the state of their row cache so as to know which parts to warm. Otherwise it sounds like you're assuming the node can hold the entire data set in memory. 

If you know in your application when you would like data to be in the cache, you can send a query like get_range_slices to the cluster and ask for 0 columns. That will warm the row cache for the keys it hits. 
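
As a rough sketch of that warming pass, the loop below pages through row keys with a range query that asks for no columns. The range_query callable is a hypothetical stand-in for a client call such as get_range_slices, not the real Thrift signature. warm_row_cache and make_fake_cluster are also made-up names, and the fake cluster returns keys in sorted order only so the example runs without a cluster.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical stand-in for a range query such as get_range_slices:
# given (start_key, count) it returns up to `count` row keys, in the
# cluster's key order, starting at start_key (inclusive), with 0 columns.
RangeQuery = Callable[[str, int], List[str]]


def warm_row_cache(range_query: RangeQuery, start_key: str = "",
                   page_size: int = 100,
                   max_rows: Optional[int] = None) -> int:
    """Page through row keys so each row is read once; return rows touched."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    touched = 0
    start = start_key
    first_page = True
    while True:
        keys = range_query(start, page_size)
        fresh = keys
        # Later pages start at the previous page's last key, so skip the repeat.
        if not first_page and keys and keys[0] == start:
            fresh = keys[1:]
        first_page = False
        if max_rows is not None:
            fresh = fresh[: max_rows - touched]
        touched += len(fresh)
        done = max_rows is not None and touched >= max_rows
        if len(keys) < page_size or not fresh or done:
            return touched
        start = keys[-1]


def make_fake_cluster(all_keys: Iterable[str]) -> RangeQuery:
    """In-memory fake for trying the loop; assumes plain sorted key order."""
    ordered = sorted(all_keys)

    def range_query(start: str, count: int) -> List[str]:
        return [k for k in ordered if k >= start][:count]

    return range_query
```

For example, warm_row_cache(make_fake_cluster(my_keys), page_size=1000, max_rows=20_000_000) would stop after roughly the number of rows the cache can hold. The max_rows cap is there so the pass does not walk the entire data set.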

I have heard it mentioned that the coordinator node will take action when one node is considered to be running slow. So it may be able to work around the new node until it gets warmed up. 

Are you adding nodes often? 

Aaron
