
Posted to user@cassandra.apache.org by Preston Chang <zh...@gmail.com> on 2011/06/01 08:34:08 UTC

Re: sync commitlog in batch mode lose data

I disabled the disk cache of the RAID controller; unfortunately it still lost
some data.

2011/6/1 Peter Schuller <pe...@infidyne.com>

> > 1). set commitlog sync to batch mode and the sync batch window to 0 ms
> > 2). one client wrote random keys in an infinite loop with consistency level
> > QUORUM and recorded the keys in a file after the insert() method returned
> > normally
> > 3). unplugged the power cord of one server (node A)
> > 4). restarted the server and the cassandra service
> > 5). read the key list generated in step 2) with consistency level ONE
>
> How sure are you that the system is honoring fsync() properly,
> including flushing any caches on underlying drives? Or is this with
> battery backed caching RAID controllers?
>
> --
> / Peter Schuller
>
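
For readers who want to reproduce the procedure quoted above, here is a
minimal sketch of the bookkeeping in steps 2) and 5). It is not the test that
was actually run. InMemoryStore is a hypothetical stand-in for a Cassandra
client (it is not a real driver API), and write_and_record and find_missing
are illustrative names. The point is only that a key is recorded strictly
after insert() returns normally, so every recorded key is an acknowledged
write.

```python
import random


def write_and_record(insert, n_keys, rng):
    """Insert random keys; record a key only after insert() returns normally."""
    acknowledged = []
    for _ in range(n_keys):
        key = "key-%016x" % rng.getrandbits(64)
        try:
            insert(key, "value")
        except Exception:
            # A failed or timed-out write was never acknowledged: do not record it.
            continue
        acknowledged.append(key)
    return acknowledged


def find_missing(read, acknowledged):
    """Return the acknowledged keys that can no longer be read back."""
    return [key for key in acknowledged if read(key) is None]


class InMemoryStore:
    """Hypothetical stand-in for a Cassandra client, for illustration only."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows.get(key)
```

Against a real cluster, insert and read would wrap the client calls made at
QUORUM and ONE respectively, and any key returned by find_missing() after the
restart is an acknowledged write that was lost.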



-- 
by Preston Chang

Re: sync commitlog in batch mode lose data

Posted by Edward Capriolo <ed...@gmail.com>.

You're losing data because QUORUM with 2 nodes becomes ALL.
Cassandra will not even try to write data after the node goes down.
The client should see UnavailableException. For a small window after the
failure you will see TimedOutException, and those writes should hit the
commitlog.

Re: sync commitlog in batch mode lose data

Posted by leon hong <co...@gmail.com>.

Waiting for "geili" to reply.

Re: sync commitlog in batch mode lose data

Posted by Peter Schuller <pe...@infidyne.com>.

> But I have another question: if I disable the disk cache but leave the cache write mode as write-back, how does sync work? Does it still write the data into the cache? This issue may not belong to the scope of discussion here.

I'm not sure; it depends on at what level of abstraction you changed
to write-back and how it's implemented. Generally, the contract of an
fsync() is that whatever was written up to that point must be
persistent (i.e., readable by subsequent reads, even in case of a
power outage/crash) when the call returns. This usually means:

(1) the userland app must flush buffers and write data to kernel (this
is done prior to fsync())
(2) the OS file system code needs to write whatever is necessary to
underlying block device(s)
(3) the underlying block device(s) need to be told to insert a write
barrier or flush caches, as appropriate
(4) the underlying block device itself must handle this correctly
  (a) for a non-battery-backed disk it means flushing the cache and
you have to wait for that to happen - at minimum seek + rotational
delay
  (b) for a battery-backed RAID device it typically is a no-op if the
battery backup unit is working, as the RAID controller cache is
considered persistent
  (c) for a RAID device with caching turned off or the BBU being
inoperable, it usually means asking individual real drives to flush
their caches
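
As an illustration of step (1) followed by the fsync() call itself, here is a
minimal Python sketch. The function name durable_append is made up for this
example. Only the userland part is under the program's control; whether steps
(2) through (4) really happen depends on the kernel, driver, and hardware
configuration discussed in this thread.

```python
import os


def durable_append(path, data):
    """Append bytes to path, returning only after fsync() has returned."""
    with open(path, "ab") as f:
        f.write(data)         # data lands in the userland buffer
        f.flush()             # step (1): hand the buffered data to the kernel
        os.fsync(f.fileno())  # ask the OS to carry out steps (2) through (4)
    # Reaching this point means fsync() returned without error. It does NOT
    # prove durability if some lower layer ignores barriers or cache flushes.
```

Calling f.flush() before os.fsync() matters: without it, data still sitting in
the userland buffer is not covered by the fsync() call.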

However, in general I advise care, since all sorts of little details
can derail this from working. For example, if you have the kernel
driver configured not to propagate write barriers to the RAID
controller, but the RAID controller has the BBU turned off and is still
caching, an fsync() would not work for the power outage case. Using
LVM in certain configurations can break write barriers at the OS level
(at least until not very long ago; maybe fixed in newer kernels) - and
the list goes on.

--
/ Peter Schuller

Re: sync commitlog in batch mode lose data

Posted by Preston Chang <zh...@gmail.com>.

Thank you very much, Peter!

After I disabled the disk cache and changed the cache write mode from
write-back to "write-through", I saw the result I expected.

It seems fsync() only synced the data to the disk cache, not to the storage
devices, while the cache write mode was write-back.

But I have another question: if I disable the disk cache but leave the
cache write mode as write-back, how does sync work? Does it still write the
data into the cache? This issue may not belong to the scope of discussion here.

Thank you all!

-- 
by Preston Chang

Re: sync commitlog in batch mode lose data

Posted by Peter Schuller <pe...@infidyne.com>.

> I disable the disk cache of RAID controller,  unfortunately it still lost
> some data.

Disabling caching shouldn't be necessary so much as ensuring that all
layers honor write barriers properly. A battery backed cache that
survives a power outage need not be disabled (and usually if you have
battery backed caching you don't want to, since disabling it has a
considerable performance impact).

To re-address your original post: Yes, given QUORUM @ RF=2 (meaning
that QUORUM is equivalent to ALL), any *successful* write is supposed
to be guaranteed to be visible by a subsequent read. In this case even
at CL.ONE since RF was 2 and QUORUM was equivalent to ALL.
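
To make the arithmetic explicit, here is a small sketch. It assumes the usual
quorum size of floor(RF / 2) + 1 replicas and the usual overlap condition
(write replicas + read replicas > RF) for a read to be guaranteed to touch a
replica that took the write. The function names are illustrative only.

```python
def quorum(replication_factor):
    """Number of replicas that make up a quorum for the given RF."""
    return replication_factor // 2 + 1


def read_overlaps_write(write_replicas, read_replicas, replication_factor):
    """True if every read set must intersect every acknowledged write set."""
    return write_replicas + read_replicas > replication_factor
```

With RF=2, quorum(2) is 2, which is every replica, so a QUORUM write behaves
like ALL, and read_overlaps_write(2, 1, 2) holds, which is why a read at
CL.ONE is still expected to see the write. With RF=3 the same QUORUM write
reaches only 2 of 3 replicas and a CL.ONE read would carry no such guarantee.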

If this is not what you're seeing, likely causes are either (a) a
problem with your test, (b) a cassandra bug, or (c) a kernel/hardware
misconfiguration or bug that causes fsync() to be broken with respect
to power outages.

In order to eliminate (a), can you share the actual test? Even if (a)
looks good, you'd be surprised as to how often (c) can be the case.

If you are satisfied that the test is correct, one way to eliminate
Cassandra as a cause for the problem may be to restart your server by
a reset instead of cutting power, so that power supply never
disappears from your storage device. If you are no longer able to
reproduce the problem, it would indicate that fsync() is at least
causing I/O to reach a device (exit the operating system). If it still
fails, you're none the wiser.

Whether you're running with or without a battery backed cache, one test
you can do is to run this (on a system which is otherwise idle):

   http://distfiles.scode.org/mlref/fsynctime.py

The first argument is a filename which will be created/over-written.
It will then start printing the number of milliseconds each fsync()
takes. If you do not have battery backed caching, you should be seeing
numbers in the 5-25 ms range depending on circumstances. If you see
very low values, that indicates that fsync() is not working and the
writes are not forced to persistent storage.

(If battery backed caching exists, you will legitimately get very low
values without it indicating anything is wrong.)
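
In case the script is not at hand, the measurement described above can be
approximated with a few lines of Python. This is a sketch of the idea, not
the contents of fsynctime.py; the function name time_fsyncs, the default
payload size, and the default count are all assumptions made for this example.

```python
import os
import time


def time_fsyncs(path, count=10, payload=b"x" * 512):
    """Write a small payload `count` times; return each fsync() time in ms."""
    timings_ms = []
    with open(path, "wb") as f:  # creates or overwrites the file
        for _ in range(count):
            f.write(payload)
            f.flush()  # hand the data to the kernel before timing fsync()
            start = time.perf_counter()
            os.fsync(f.fileno())
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms
```

Printing the result of time_fsyncs("/some/path/on/the/disk/under/test") gives
numbers to compare against the ranges mentioned above; the path here is a
placeholder.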


-- 
/ Peter Schuller