Posted to user@cassandra.apache.org by Oleg Anastasjev <ol...@gmail.com> on 2010/04/21 16:08:15 UTC

Cassandra's bad behavior on disk failure

Hello,

I am testing how Cassandra behaves on single-node disk failures, to know what
to expect when things go bad.
I had a cluster of 4 Cassandra nodes, stress-loaded it with a client, and ran 2
tests:
1. emulated disk failure of the /data volume during a read-only stress test
2. emulated disk failure of the /commitlog volume during a write-intensive test

1. On the read test with the data volume down, a lot of
"org.apache.thrift.TApplicationException: Internal error processing get_slice"
errors were logged on the client side. The Cassandra server logged a lot of
IOExceptions while reading every *.db file it has. The node continued to show
as UP in the ring.

OK, the behavior is not ideal, but it can still be worked around on the client
side by throwing out a node as soon as a TApplicationException is received from
it (see the sketch below).
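
Something like this on the client side (just a sketch; FailoverReader and
ReadOp are made-up names, only org.apache.thrift.TApplicationException is the
real Thrift class, and any Thrift read such as get_slice would be wrapped in a
ReadOp):

import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.thrift.TApplicationException;

// Illustrative client-side failover: stop routing reads to a node as
// soon as it answers with a TApplicationException ("Internal error
// processing get_slice") and retry the read against the next node.
public class FailoverReader<T> {

    // ReadOp stands in for any Thrift read, e.g. a get_slice call
    // executed against a connection to the given host.
    public interface ReadOp<T> {
        T run(String host) throws Exception;
    }

    private final Deque<String> healthy = new ArrayDeque<String>();

    public FailoverReader(Iterable<String> hosts) {
        for (String h : hosts) {
            healthy.add(h);
        }
    }

    public T read(ReadOp<T> op) throws Exception {
        while (!healthy.isEmpty()) {
            String host = healthy.peekFirst();
            try {
                return op.run(host);
            } catch (TApplicationException e) {
                // The node still shows as UP in the ring, but its data
                // disk is gone: drop it from the rotation and retry.
                healthy.removeFirst();
            }
        }
        throw new IllegalStateException("no healthy nodes left");
    }
}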

2. Much worse was the write test:
no exception was seen at the client, and writes appeared to go through
normally, but PERIODIC-COMMIT-LOG-SYNCER failed to sync the commit logs, the
node's heap quickly filled up, and the node froze in a GC loop. Still, it
continued to show as UP in the ring.

This, I believe, is bad, because no quick workaround can be done on the client
side (no exceptions come from the failed node), and in a real system it will
lead to a dramatic slowdown of the whole cluster: clients, not knowing that the
node is actually dead, will direct 1/4th of their requests to it and time out.

I think more correct behavior here would be to halt the Cassandra server on any
disk IO error, so that clients can quickly detect this and fail over to healthy
servers; a sketch of the idea follows below.
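
Purely as an illustration of what I mean (DiskErrorPolicy and the hook point
are invented names, not actual Cassandra code), the idea is one choke point
that every disk read/write path calls when it hits an IOException:

// Illustrative only: a central handler for disk errors that stops the
// process instead of letting the node limp along while showing UP.
public final class DiskErrorPolicy {

    public static void onDiskError(java.io.IOException e) {
        System.err.println("FATAL: disk error, halting the node: " + e);
        // Halt rather than shut down gracefully: shutdown hooks may
        // themselves need the failed disk. Clients then see connection
        // errors immediately and can fail over to healthy replicas.
        Runtime.getRuntime().halt(1);
    }
}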

What do you think?

Did you guys experience disk failures in production, and how did it go?



Re: Cassandra's bad behavior on disk failure

Posted by Schubert Zhang <zs...@gmail.com>.
On Wed, Apr 21, 2010 at 10:08 PM, Oleg Anastasjev <ol...@gmail.com> wrote:

> Hello,
>
> I am testing how Cassandra behaves on single-node disk failures, to know
> what to expect when things go bad.
> I had a cluster of 4 Cassandra nodes, stress-loaded it with a client, and
> ran 2 tests:
> 1. emulated disk failure of the /data volume during a read-only stress test
> 2. emulated disk failure of the /commitlog volume during a write-intensive test
>

[schubert] Good test.


> 1. On the read test with the data volume down, a lot of
> "org.apache.thrift.TApplicationException: Internal error processing
> get_slice"
> errors were logged on the client side. The Cassandra server logged a lot of
> IOExceptions while reading every *.db file it has. The node continued to
> show as UP in the ring.
>
> OK, the behavior is not ideal, but it can still be worked around on the
> client side by throwing out a node as soon as a TApplicationException is
> received from it.
>
[schubert] Usually, we should use RAID to avoid disk failures,
and add some system monitoring to maintain or shut down the node.


> 2. Much worse was the write test:
> no exception was seen at the client, and writes appeared to go through
> normally, but PERIODIC-COMMIT-LOG-SYNCER failed to sync the commit logs,
> the node's heap quickly filled up, and the node froze in a GC loop. Still,
> it continued to show as UP in the ring.
>
[schubert] I think this is also a flaw in the current Cassandra
implementation of CommitLogSync.
The default config is <CommitLogSync>periodic</CommitLogSync>.
A write is acknowledged immediately, but the commit-log entry is only
buffered in memory and synced to disk periodically, according to
<CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>.

In your case, since the sync can never complete, the buffer grows without
bound and consumes more and more heap.

In 0.6.x, you can use batch CommitLogSync to alleviate this issue, since in
batch mode a write is not acknowledged until the commit log has been synced:
<CommitLogSync>batch</CommitLogSync>
<CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>

I think the right design should be to bound the amount of buffered commit
log. Both buffer size and buffer time should act as sync triggers:
When the commit-log buffer reaches a size threshold, sync.
When the oldest buffered entry reaches an age threshold, sync.
A sketch of this dual trigger follows below.
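
A minimal sketch of that dual trigger (the names and thresholds here are made
up for illustration; this is not the actual CommitLog code):

import java.io.IOException;

// Illustrative dual-trigger syncer: flush when either the buffered byte
// count or the age of the oldest unsynced entry crosses a threshold.
// If the disk is dead, the size bound makes sync() fail on the write
// path instead of letting the buffer grow the heap without limit.
public class BoundedCommitLogBuffer {

    private static final long MAX_BUFFER_BYTES = 32 * 1024 * 1024;
    private static final long MAX_BUFFER_AGE_MS = 10000;

    private long bufferedBytes = 0;
    private long oldestEntryAt = -1;

    public synchronized void append(byte[] entry) throws IOException {
        if (bufferedBytes + entry.length > MAX_BUFFER_BYTES) {
            // Size trigger: the writer waits on the sync, so a failed
            // disk surfaces as an IOException instead of silent buffering.
            sync();
        }
        if (oldestEntryAt < 0) {
            oldestEntryAt = System.currentTimeMillis();
        }
        bufferedBytes += entry.length;
        writeToBuffer(entry);
    }

    // Time trigger: called periodically by the syncer thread.
    public synchronized void maybeSync() throws IOException {
        if (oldestEntryAt >= 0
                && System.currentTimeMillis() - oldestEntryAt >= MAX_BUFFER_AGE_MS) {
            sync();
        }
    }

    private void sync() throws IOException {
        flushAndFsync(); // throws on a dead disk rather than being swallowed
        bufferedBytes = 0;
        oldestEntryAt = -1;
    }

    private void writeToBuffer(byte[] entry) { /* append to in-memory buffer */ }

    private void flushAndFsync() throws IOException { /* force buffer to disk */ }
}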



> This, I believe, is bad, because no quick workaround can be done on the
> client side (no exceptions come from the failed node), and in a real system
> it will lead to a dramatic slowdown of the whole cluster: clients, not
> knowing that the node is actually dead, will direct 1/4th of their requests
> to it and time out.
>

> I think more correct behavior here would be to halt the Cassandra server on
> any disk IO error, so that clients can quickly detect this and fail over to
> healthy servers.
>
> What do you think?
>
> Did you guys experience disk failures in production, and how did it go?
>

Re: Cassandra's bad behavior on disk failure

Posted by Oleg Anastasjev <ol...@gmail.com>.
> 
> Ideally I think we'd like to leave the node up to serve reads if a disk is
> erroring out on writes but is still readable. In my experience this is very
> common when a disk first begins to fail, as well as in the "disk is full"
> case where there is nothing actually wrong with the disk per se.

This depends on the hardware/drivers in use, as well as on which part is
failing. On some failures the disk just disappears completely (controller
failures, SAN links, etc.).
And the easiest way to get the operations team's attention on a node is to
shut it down; then people know something has to be done about it.
Furthermore, a single-node shutdown should not hurt the cluster's performance
much in production, since everyone plans capacity to survive a single node
failure.


Re: Cassandra's bad behavior on disk failure

Posted by Jonathan Ellis <jb...@gmail.com>.
We have a ticket open for this:
https://issues.apache.org/jira/browse/CASSANDRA-809

Ideally I think we'd like to leave the node up to serve reads if a disk is
erroring out on writes but is still readable. In my experience this is very
common when a disk first begins to fail, as well as in the "disk is full"
case where there is nothing actually wrong with the disk per se. A rough
sketch of that policy is below.
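
Roughly this shape (invented names, not Cassandra's actual error handling):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: on a write-path disk error, stop accepting writes
// but keep serving reads, since a freshly failing or merely full disk
// is often still readable.
public class DegradedModePolicy {

    private final AtomicBoolean writable = new AtomicBoolean(true);

    // Called by write paths when they hit a disk error.
    public void onWriteError(IOException e) {
        if (writable.compareAndSet(true, false)) {
            System.err.println("Disk write error, entering read-only mode: " + e);
        }
    }

    // Called at the top of every write; reads are unaffected.
    public void checkWritable() {
        if (!writable.get()) {
            throw new IllegalStateException("node is read-only due to disk errors");
        }
    }
}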

On Wed, Apr 21, 2010 at 9:08 AM, Oleg Anastasjev <ol...@gmail.com> wrote:
> Hello,
>
> I am testing how Cassandra behaves on single-node disk failures, to know
> what to expect when things go bad.
> I had a cluster of 4 Cassandra nodes, stress-loaded it with a client, and
> ran 2 tests:
> 1. emulated disk failure of the /data volume during a read-only stress test
> 2. emulated disk failure of the /commitlog volume during a write-intensive test
>
> 1. On the read test with the data volume down, a lot of
> "org.apache.thrift.TApplicationException: Internal error processing get_slice"
> errors were logged on the client side. The Cassandra server logged a lot of
> IOExceptions while reading every *.db file it has. The node continued to
> show as UP in the ring.
>
> OK, the behavior is not ideal, but it can still be worked around on the
> client side by throwing out a node as soon as a TApplicationException is
> received from it.
>
> 2. Much worse was the write test:
> no exception was seen at the client, and writes appeared to go through
> normally, but PERIODIC-COMMIT-LOG-SYNCER failed to sync the commit logs,
> the node's heap quickly filled up, and the node froze in a GC loop. Still,
> it continued to show as UP in the ring.
>
> This, I believe, is bad, because no quick workaround can be done on the
> client side (no exceptions come from the failed node), and in a real system
> it will lead to a dramatic slowdown of the whole cluster: clients, not
> knowing that the node is actually dead, will direct 1/4th of their requests
> to it and time out.
>
> I think more correct behavior here would be to halt the Cassandra server on
> any disk IO error, so that clients can quickly detect this and fail over to
> healthy servers.
>
> What do you think?
>
> Did you guys experience disk failures in production, and how did it go?
>