Posted to user@cassandra.apache.org by Mick Semb Wever <mc...@apache.org> on 2011/01/20 21:13:07 UTC

Cassandra on iSCSI?

Does anyone have any experience with Cassandra on iSCSI?

I'm currently testing a (soon-to-be) production server using both local
raid-5 and iSCSI disks. Our hosting provider is pushing us hard towards
the iSCSI disks because it is easier for them to run (and to meet our
needs for increasing disk capacity over time).

I'm worried that iSCSI is a non-scalable solution for an otherwise
scalable application (all Cassandra nodes will have separate partitions
on the one iSCSI target).

To go with raid-5 disks, our hosting provider requires proof that iSCSI
won't work. I tried various things (e.g. `nodetool cleanup` on a 12 GB
load, giving 5k IOPS) but iSCSI seems to keep up with the performance of
the local raid-5 disks...

Should I be worried about using iSCSI?
Are there better tests I should be running?
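For what it's worth, beyond `nodetool cleanup` a crude way to watch sustained IOPS during such a test is to sample the kernel's disk counters. A minimal sketch, assuming the standard Linux /proc/diskstats layout (the device name you pass in, e.g. `sdb`, is just a placeholder):

```python
import time

def completed_ios(diskstats_text, device):
    # /proc/diskstats lines: major minor name, then field 1 = reads
    # completed and field 5 = writes completed (per the kernel's
    # iostats documentation).
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]) + int(fields[7])
    raise ValueError("device not found: " + device)

def measure_iops(device, interval=5.0):
    # Sample twice and divide the delta of completed I/Os by the interval.
    with open("/proc/diskstats") as f:
        before = completed_ios(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = completed_ios(f.read(), device)
    return (after - before) / interval
```

Running this on each node while cleanup or a stress run is in flight gives a number you can compare directly between the local raid-5 and iSCSI configurations.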

~mck

-- 
"The turtle only makes progress when its neck is stuck out" Rollo May 
| http://semb.wever.org | http://sesat.no
| http://finn.no       | Java XSS Filter

Re: Cassandra on iSCSI?

Posted by Mick Semb Wever <mc...@apache.org>.
> So if one is forced to use a SAN, how should you set up Cassandra is
> the interesting question - to me! Here are some thoughts:- 
> 1. Ensure that each node gets dedicated - not shared - LUNs 
> 2. Ensure that these LUNs do not share spindles, or nodes will cease to be
> isolatable (this will be tough to get, given how SAN administrators
> think about this) 
> 3. Most SANs deliver performance by striping (RAID 0) - sacrifice
> striping for isolation if push comes to shove 
> 4. Do not share data directories from multiple nodes onto a single
> location via NFS or CFS for example. They are cool in shared resource
> environments, but break the premise behind Cassandra. All data
> storage should be private to the cassandra node, even when on shared
> storage 
> 5. Do not change any assumption around Replication Factor (RF) or
> Consistency Level (CL) due to the shared storage - in fact if
> anything, increase your replication factor because you now have
> potential SPOF storage.  

That was gold, and led to a direct conversation between provider and
developer. Various tests showed that IOPS will often reach 5k per node,
so the iSCSI solution would need to be tailored to handle that.

As mentioned above, our provider simply couldn't provide that much
disk per server. But after a good discussion it became obvious (doh!)
that the application can actually save a lot of disk by using different
keyspaces with different RF. We have raw data that needs to be
collected, but can be temporarily unavailable for reading, hence RF=1
makes sense. This raw data is the vast bulk of the data so this saves
lots of disk space. The aggregated data, which is relatively small in
comparison, is critical for the application to read, so we keep it in a
separate keyspace with a higher RF...
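To illustrate the saving with hypothetical numbers (these are not our actual sizes):

```python
# Raw data is the vast bulk; aggregated data is comparatively tiny.
raw_gb = 500   # raw keyspace, RF=1: a replica outage only delays reads
agg_gb = 20    # aggregated keyspace, RF=3: must stay readable

# One keyspace with everything replicated at RF=3:
single_keyspace_gb = (raw_gb + agg_gb) * 3     # 1560 GB

# Split keyspaces: raw at RF=1, aggregates at RF=3:
split_keyspace_gb = raw_gb * 1 + agg_gb * 3    # 560 GB
```

With these (made-up) sizes the split cuts total disk by roughly two thirds, since the replication multiplier only applies to the small aggregated keyspace.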

~mck

-- 
“Anyone who lives within their means suffers from a lack of
imagination.” - Oscar Wilde 
| http://semb.wever.org | http://sesat.no
| http://finn.no       | Java XSS Filter


Re: Cassandra on iSCSI?

Posted by Anthony John <ch...@gmail.com>.
Sort of - do not agree!!

This is the Shared Nothing vs. Shared Disk debate. There are many mainstream
RDBMS products that pretend to do horizontal scalability with Shared Disks.
They have the kinds of problems that Cassandra is specifically architected
to avoid!

The original question here has 2 aspects to it:-
1. Is an iSCSI SAN good enough - my take is that it is still the poor man's SAN
as compared to FC-based SANs. Having said that, they have found increasing
adoption and the performance penalty is really marginal. Couple that with
the fact that Cassandra is architected to reduce the need for high
performance storage systems via features like reducing random writes, etc.
So net net - a reasonable iSCSI SAN should work.
2. Does it make sense to use a SPOF SAN - again this militates against the
architectural underpinnings of Cassandra, that relies on the shared nothing
idea to ensure that problems - say a bad disk - are easily isolated to a
particular node. On a SAN, depending on RAID configs, and how LUNs are
carved out and so on, a few disk outages could affect multiple nodes. A
performance problem with the SAN could now affect your entire Cassandra
cluster, and so on. Cassandra is not meant to be set up this way!

But but but...in the real world today - Large storage volumes are available
only with SANs. Rackable machines do not leave a lot of space - typically -
for a bunch of HDDs. On top of that, SANs provide all kinds of admin
capabilities that supposedly help with uptime and performance guarantees and
so on. So a Colo DC might not have any other option but shared storage!

So if one is forced to use a SAN, how should you set up Cassandra is the
interesting question - to me! Here are some thoughts:-
1. Ensure that each node gets dedicated - not shared - LUNs
2. Ensure that these LUNs do not share spindles, or nodes will cease to be
isolatable (this will be tough to get, given how SAN administrators think
about this)
3. Most SANs deliver performance by striping (RAID 0) - sacrifice striping
for isolation if push comes to shove
4. Do not share data directories from multiple nodes onto a single location
via NFS or CFS for example. They are cool in shared resource environments,
but break the premise behind Cassandra. All data storage should be private
to the cassandra node, even when on shared storage
5. Do not change any assumption around Replication Factor (RF) or
Consistency Level (CL) due to the shared storage - in fact if anything,
increase your replication factor because you now have potential SPOF
storage.
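Point 4 can even be enforced mechanically. A sketch of a start-up sanity check that refuses network filesystems for a node's data directory (the paths and mount table are hypothetical, and the filesystem-type list is illustrative, not exhaustive):

```python
import os

# Filesystem types that imply storage shareable by other nodes.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "glusterfs"}

def mount_of(path, mounts):
    # Return the (mountpoint, fstype) pair whose mountpoint is the longest
    # prefix of `path`; `mounts` is a list of (mountpoint, fstype) pairs,
    # e.g. parsed from /proc/mounts.
    path = os.path.abspath(path)
    best = ("", "unknown")
    for mp, fstype in mounts:
        prefix = mp.rstrip("/") + "/"
        if (path == mp or path.startswith(prefix)) and len(mp) > len(best[0]):
            best = (mp, fstype)
    return best

def is_private_storage(path, mounts):
    # A node's data directory must not live on a filesystem that other
    # nodes could also mount.
    return mount_of(path, mounts)[1] not in NETWORK_FS
```

A node could run this against its data and commitlog directories at start-up and refuse to join the ring if either resolves to a shared mount.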

My two - or maybe more - cents on the issue,

HTH,

-JA
On Fri, Jan 21, 2011 at 1:15 PM, Edward Capriolo <ed...@gmail.com> wrote:

> On Fri, Jan 21, 2011 at 12:07 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> > On Fri, Jan 21, 2011 at 2:19 AM, Mick Semb Wever <mc...@apache.org> wrote:
> >>
> >>> Of course with a SAN you'd want RF=1 since it's replicating
> >>> internally.
> >>
> >> Isn't this the same case for raid-5 as well?
> >
> > No, because the replication is (mainly) to protect you from machine
> > failures; if the SAN is a SPOF then putting more replicas on it
> > doesn't help.
> >
> >> And we want RF=2 if we need to keep reading while doing rolling
> >> restarts?
> >
> > Yes.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> > http://riptano.com
> >
>
> If you are using Cassandra with a SAN, RF=1 makes sense because we are
> assuming the SAN is already replicating your data. RF=2
> makes good sense so you are not affected by outages. Another alternative is
> something like Linux-HA, managing each Cassandra instance as a
> resource. This way, if a head goes down, Linux-HA would
> detect the failure and bring up that instance on another physical
> piece of hardware.
>
> Using Linux-HA + SAN + Cassandra would actually bring Cassandra closer to
> the HBase model, in which you have a distributed file system but the
> front-end Cassandra acts like a region server.
>

Re: Cassandra on iSCSI?

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Jan 21, 2011 at 12:07 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Fri, Jan 21, 2011 at 2:19 AM, Mick Semb Wever <mc...@apache.org> wrote:
>>
>>> Of course with a SAN you'd want RF=1 since it's replicating
>>> internally.
>>
>> Isn't this the same case for raid-5 as well?
>
> No, because the replication is (mainly) to protect you from machine
> failures; if the SAN is a SPOF then putting more replicas on it
> doesn't help.
>
>> And we want RF=2 if we need to keep reading while doing rolling
>> restarts?
>
> Yes.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

If you are using Cassandra with a SAN, RF=1 makes sense because we are
assuming the SAN is already replicating your data. RF=2
makes good sense so you are not affected by outages. Another alternative is
something like Linux-HA, managing each Cassandra instance as a
resource. This way, if a head goes down, Linux-HA would
detect the failure and bring up that instance on another physical
piece of hardware.

Using Linux-HA + SAN + Cassandra would actually bring Cassandra closer to
the HBase model, in which you have a distributed file system but the
front-end Cassandra acts like a region server.

Re: Cassandra on iSCSI?

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Jan 21, 2011 at 2:19 AM, Mick Semb Wever <mc...@apache.org> wrote:
>
>> Of course with a SAN you'd want RF=1 since it's replicating
>> internally.
>
> Isn't this the same case for raid-5 as well?

No, because the replication is (mainly) to protect you from machine
failures; if the SAN is a SPOF then putting more replicas on it
doesn't help.

> And we want RF=2 if we need to keep reading while doing rolling
> restarts?

Yes.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra on iSCSI?

Posted by Mick Semb Wever <mc...@apache.org>.
> Of course with a SAN you'd want RF=1 since it's replicating
> internally. 

Isn't this the same case for raid-5 as well?

And we want RF=2 if we need to keep reading while doing rolling
restarts?

~mck

-- 
“Anyone who lives within their means suffers from a lack of
imagination.” - Oscar Wilde 
| http://semb.wever.org | http://sesat.no
| http://finn.no       | Java XSS Filter

Re: Cassandra on iSCSI?

Posted by Mick Semb Wever <mc...@apache.org>.
>         [OT] They're quoting roughly the same price for both (claiming
>         that the
>         extra cost goes into having for each node a separate disk
>         cabinet to run
>         local raid-5).
> 
> You might not need raid-5 for local attached storage. 

Yes, we did ask. But raid-5 is the minimum being offered by our hosting
provider... We could go to raid 10, but raid 0 is out of the question...

~mck

-- 
"To be young, really young, takes a very long time." Picasso 
| http://semb.wever.org | http://sesat.no
| http://finn.no       | Java XSS Filter

Re: Cassandra on iSCSI?

Posted by Zhu Han <sc...@gmail.com>.
On Fri, Jan 21, 2011 at 3:00 PM, Mick Semb Wever <mc...@apache.org> wrote:

> > It should work fine; the main reason to go with local storage is the
> > huge cost advantage.
>
> [OT] They're quoting roughly the same price for both (claiming that the
> extra cost goes into having for each node a separate disk cabinet to run
> local raid-5).
>

You might not need raid-5 for local attached storage. Refer [1] for more
information.

[1]  http://wiki.apache.org/cassandra/CassandraHardware

>
> > *I just committed a README for contrib/stress to the 0.7 svn branch
>
> Thanks! I'll check it out.
>
> ~mck
>
> --
> “An invasion of armies can be resisted, but not an idea whose time has
> come.” - Victor Hugo
> | www.semb.wever.org | www.sesat.no
> | www.finn.no | http://xss-http-filter.sf.net
>

Re: Cassandra on iSCSI?

Posted by Mick Semb Wever <mc...@apache.org>.
> It should work fine; the main reason to go with local storage is the
> huge cost advantage.

[OT] They're quoting roughly the same price for both (claiming that the
extra cost goes into having for each node a separate disk cabinet to run
local raid-5).

> *I just committed a README for contrib/stress to the 0.7 svn branch 

Thanks! I'll check it out.

~mck

-- 
“An invasion of armies can be resisted, but not an idea whose time has
come.” - Victor Hugo 
| www.semb.wever.org | www.sesat.no 
| www.finn.no | http://xss-http-filter.sf.net

Re: Cassandra on iSCSI?

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Jan 20, 2011 at 2:13 PM, Mick Semb Wever <mc...@apache.org> wrote:
> To go with raid-5 disks our hosting provider requires proof that iSCSI
> won't work. I tried various things (e.g. `nodetool cleanup` on a 12 GB
> load, giving 5k IOPS) but iSCSI seems to keep up with the performance of
> the local raid-5 disks...
>
> Should I be worried about using iSCSI?

It should work fine; the main reason to go with local storage is the
huge cost advantage.

Of course with a SAN you'd want RF=1 since it's replicating internally.

> Are there better tests I should be running?

I would test write scalability going from 1 machine, to half your
planned cluster size, to your full cluster size, or as close as is
feasible, using enough client machines running contrib/stress* (much
faster than contrib/py_stress) to saturate it.

Writes should be CPU bound, so you expect those to scale roughly
linearly as you add Cassandra nodes.

Reads (once your data set can't be cached in RAM) will be i/o bound,
so I imagine with a SAN you'll be able to max that out at some number
of machines and adding more Cassandra nodes won't help.  What that
limit is depends on your SAN iops and how much of it is being consumed
by other applications.
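The argument above can be put into a toy model (the per-node rates and the SAN budget are made-up numbers, just to show the shape of the curves):

```python
# Writes are CPU-bound per node: cluster throughput grows with node count.
def write_throughput(nodes, per_node_ops=10000):
    return nodes * per_node_ops

# Uncached reads are i/o-bound: once the cluster's demand exceeds the
# SAN's shared IOPS budget, adding Cassandra nodes no longer helps.
def read_throughput(nodes, per_node_ops=10000, san_iops_budget=30000):
    return min(nodes * per_node_ops, san_iops_budget)
```

With these numbers, write throughput keeps growing with the cluster while read throughput flatlines at the 30k budget, which is exactly the plateau described above.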

*I just committed a README for contrib/stress to the 0.7 svn branch

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com