Posted to user@cassandra.apache.org by Jan Algermissen <ja...@nordsc.com> on 2013/07/31 07:51:26 UTC

VM dimensions for running Cassandra and Hadoop

Hi,

thanks for the helpful replies last week.

It looks as if I will deploy Cassandra on a bunch of VMs, and I am now in the process of working out what the dimensions of those VMs should be.

So far, I understand the following:

- I need at least 3 VMs for a minimal Cassandra setup
- I should get another VM to run the Hadoop job controller, or
  can that run on one of the Cassandra VMs?
- there is no point in giving the Cassandra JVMs more than
  8-12 GB of heap space because of GC pauses, so it seems going
  beyond 16 GB of RAM per VM makes no sense
- Each VM needs two disks, to separate the commit log from data
  storage (see the sketch below)
- I must make sure the disks are directly attached, to prevent
  problems when multiple nodes flush the commit log at the
  same time
- I'll have rather few writes and intend to hold most of the
  data in memory, so spinning disks are fine for the moment
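
To make the heap and disk points concrete, here is roughly what I
have in mind for the configuration (mount points are placeholders
and the numbers are just my current guess, not a recommendation):

  # cassandra-env.sh -- cap the JVM heap; the rest of the VM's RAM
  # is left to the OS page cache
  MAX_HEAP_SIZE="8G"
  HEAP_NEWSIZE="400M"

  # cassandra.yaml -- commit log and data directories on separate disks
  commitlog_directory: /mnt/disk1/cassandra/commitlog
  data_file_directories:
      - /mnt/disk2/cassandra/data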

Does that seem reasonable?

How should I plan the disk sizes and number of CPU cores?

Are there any other configuration mistakes to avoid?

Is there online documentation that discusses such VM sizing questions in more detail?

Jan






Re: VM dimensions for running Cassandra and Hadoop

Posted by Jan Algermissen <ja...@nordsc.com>.
Hi Shahab,

On 31.07.2013, at 15:59, Shahab Yunus <sh...@gmail.com> wrote:

> Hi Jan,
> 
> One question...you say
> "- I must make sure the disks are directly attached, to prevent
>   problems when multiple nodes flush the commit log at the
>   same time"

I read that using Cassandra with SANs can cause problems, because the way Cassandra works is likely to make the nodes compete for I/O (unless each node has dedicated storage).

Jan



> 
> What do you mean by that? 
> 
> Thanks,
> Shahab

Re: VM dimensions for running Cassandra and Hadoop

Posted by Shahab Yunus <sh...@gmail.com>.
Hi Jan,

One question...you say
"- I must make sure the disks are directly attached, to prevent
  problems when multiple nodes flush the commit log at the
  same time"

What do you mean by that?

Thanks,
Shahab



Re: VM dimensions for running Cassandra and Hadoop

Posted by Jan Algermissen <ja...@nordsc.com>.
Jon,

On 31.07.2013, at 08:15, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> Having just enough RAM to hold the JVM's heap generally isn't a good idea unless you're not planning on doing much with the machine.  

Yes, I agree. Two questions though:

- Do you think that using a JVM heap of, for example, 12 GB and having 16 available is a bad ratio
  for a 'simple' Cassandra node?

- As all of my queries will likely ask for most of a row's data, it seems
  enabling the row cache will be a good thing.
  AFAIU the row cache is held outside the JVM heap, so I should then probably
  get loads more RAM to account for the row cache?

Hmm, having said that, I wonder what goes into the JVM heap anyhow. As far as
caches are concerned, it seems only the key cache lives inside the JVM heap.
Does it make sense to have a heap size that is much larger than the amount of
storage necessary for all my keys (plus some overhead, of course)?
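
To make that concrete, this is how I currently read the cache knobs
(a sketch against the 1.2-era cassandra.yaml and CQL3 docs -- please
correct me if I have this wrong; the table name is made up):

  # cassandra.yaml
  key_cache_size_in_mb:        # on-heap; empty means auto
                               # (min(5% of heap, 100 MB))
  row_cache_size_in_mb: 2048   # example figure only
  row_cache_provider: SerializingCacheProvider   # off-heap rows

  -- and per table (CQL3), since row caching is off by default:
  ALTER TABLE mykeyspace.mytable WITH caching = 'rows_only';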


Jan




Re: VM dimensions for running Cassandra and Hadoop

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Having just enough RAM to hold the JVM's heap generally isn't a good idea
unless you're not planning on doing much with the machine.

Any memory not allocated to a process will generally be put to good use
serving as page cache. See here: http://en.wikipedia.org/wiki/Page_cache
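
You can actually watch this on a running node; on Linux the page cache
shows up in the "cached" column of free:

  $ free -m
  # the "cached" column is the page cache; on a 16 GB node with a
  # 12 GB heap, several GB typically end up there once Cassandra
  # has been reading SSTables for a while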

Jon




-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade