Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2010/04/14 14:52:01 UTC

Fwd: optimal settings for [couchdb] fsync_options?

Initially posted on user@, but maybe it got lost in the noise.  Does anyone know why we call fsync when we open a file?

Adam

Begin forwarded message:

> From: Adam Kocoloski <ko...@apache.org>
> Date: April 11, 2010 10:44:03 PM EDT
> To: user@couchdb.apache.org
> Subject: optimal settings for [couchdb] fsync_options?
> 
> Hi folks, I wanted to assemble some concrete information about the purpose of each of the three fsync_options available in CouchDB and under what conditions they should be enabled/disabled.  These options are
> 
> 1) before_header - calls file:sync(Fd) before writing a DB header to disk.  I believe the goal here is to prevent DB corruption by ensuring that all the data referred to by the header is durably stored before the header is written.  A system that preserves write ordering could safely disable this option.  Does anyone know an example of such a system? Perhaps a combination of a noop IO scheduler and a write-through or nonvolatile disk cache?
> 
> 2) after_header - calls file:sync(Fd) immediately after writing the DB header.  I think this one is done so that we don't lose too much data following a CouchDB restart, and so that a client can ensure that stored data will be retrievable after a restart by POSTing to /db/_ensure_full_commit.  It might make sense to disable this option if e.g. you're relying on replication for durability, although that's dicey: the replicator calls ensure_full_commit on both DBs before writing its own checkpoint record*, so with after_header disabled you'd run the risk of skipping updates on the target in the face of a power failure.
> 
> 3) on_file_open - calls file:sync(Fd) immediately after opening a DB file.  I really don't know the purpose of this one.  Anyone?
> 
> Best, Adam
> 
> * The reason the replicator calls ensure_full_commit on the source is to detect situations where update_seqs might be reused.  I wonder if we could engineer a way around that ever happening, for example by ensuring that on restart the update sequence jumps by a large number.  But that's a discussion for dev@.
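
For readers who want to see where the two header-related options would fire, here is a minimal sketch. It is written in Python rather than Erlang, with os.fsync standing in for file:sync(Fd). The function name commit, the byte strings, and the FSYNC_OPTIONS set are hypothetical illustrations, not CouchDB's actual code or on-disk format.

```python
import os

# Hypothetical stand-in for the fsync_options setting (not CouchDB's config parser).
FSYNC_OPTIONS = {"before_header", "after_header"}


def commit(fd, data, header, options=FSYNC_OPTIONS):
    """Append data, then a header, syncing around the header as configured."""
    os.write(fd, data)
    if "before_header" in options:
        # Everything the header will point at is flushed before the header is written.
        os.fsync(fd)
    os.write(fd, header)
    if "after_header" in options:
        # The header itself is flushed before the commit is reported as done.
        os.fsync(fd)
```

Passing options=set() skips both syncs, which corresponds to relying entirely on the OS and hardware to preserve write ordering.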

Re: optimal settings for [couchdb] fsync_options?

Posted by Adam Kocoloski <ko...@apache.org>.

Bah, that's quite right.  Thanks for the step-by-step, I'm not sure how I missed it before.

Adam

On Apr 14, 2010, at 11:04 AM, Robert Newson wrote:

> I think Damien is right here. Consider this sequence:
> 
> 1) update btree
> 2) fsync
> 3) write new header
> 4) fsync
> 5) more updates
> 6) fsync
> 7) write new header
> 8) process terminates
> 
> On open, the header at 7) might or might not have been flushed all the
> way to disk, but couchdb would update views to include the changes made
> at 5). Since the header at 7) isn't guaranteed to be fsync'ed, a second
> crash (say, a kernel panic) could revert the .couch file itself to the
> state at 4), leaving the views permanently wrong. It's hard to see this
> in practice because the header is 4k and almost always gets to disk soon
> enough anyway, especially if you do more i/o on the view indexes.
> 
> B.

Re: optimal settings for [couchdb] fsync_options?

Posted by Robert Newson <ro...@gmail.com>.

I think Damien is right here. Consider this sequence:

1) update btree
2) fsync
3) write new header
4) fsync
5) more updates
6) fsync
7) write new header
8) process terminates

On open, the header at 7) might or might not have been flushed all the
way to disk, but couchdb would update views to include the changes made
at 5). Since the header at 7) isn't guaranteed to be fsync'ed, a second
crash (say, a kernel panic) could revert the .couch file itself to the
state at 4), leaving the views permanently wrong. It's hard to see this
in practice because the header is 4k and almost always gets to disk soon
enough anyway, especially if you do more i/o on the view indexes.
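
The sequence can be replayed with a toy model. This is a Python sketch, not CouchDB code: SimFile and run are made-up names, and the model assumes that a process crash keeps whatever sits in the OS cache while a kernel panic discards anything that was not fsync'ed.

```python
class SimFile:
    """Toy file: writes sit in a cache until fsync makes them durable."""

    def __init__(self):
        self.durable = []  # survives a kernel panic or power loss
        self.cached = []   # what readers of the file currently see

    def write(self, item):
        self.cached.append(item)

    def fsync(self):
        self.durable = list(self.cached)

    def kernel_panic(self):
        # Anything that was never fsync'ed is gone.
        self.cached = list(self.durable)

    def last_header(self):
        headers = [i for i in self.cached if i.startswith("header")]
        return headers[-1] if headers else None


def run(fsync_on_open):
    """Replay steps 1-8, reopen, update the view, then crash a second time."""
    db, view = SimFile(), SimFile()
    db.write("btree-1")   # 1) update btree
    db.fsync()            # 2) fsync
    db.write("header-1")  # 3) write new header
    db.fsync()            # 4) fsync
    db.write("btree-2")   # 5) more updates
    db.fsync()            # 6) fsync
    db.write("header-2")  # 7) write new header (not yet synced)
    # 8) process terminates; the cached header is still visible on reopen.
    if fsync_on_open:
        db.fsync()        # the on_file_open behaviour
    view.write("indexed-up-to-" + db.last_header())
    view.fsync()          # the view update reaches disk
    db.kernel_panic()     # second crash
    view.kernel_panic()
    return db.last_header(), view.cached[-1]
```

In this model run(False) returns ("header-1", "indexed-up-to-header-2"), i.e. the view is ahead of the .couch file, while run(True) returns ("header-2", "indexed-up-to-header-2").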

B.

On Wed, Apr 14, 2010 at 3:46 PM, Adam Kocoloski <ko...@apache.org> wrote:
> Thanks Damien.  I'm thinking that the situation you describe cannot occur if before_header is enabled in the fsync_options, since any data pointed to by the #db_header that the server found after the restart was already synced.  Is that correct?
>
> Adam

Re: optimal settings for [couchdb] fsync_options?

Posted by Adam Kocoloski <ko...@apache.org>.

Thanks Damien.  I'm thinking that the situation you describe cannot occur if before_header is enabled in the fsync_options, since any data pointed to by the #db_header that the server found after the restart was already synced.  Is that correct?

Adam

On Apr 14, 2010, at 10:26 AM, Damien Katz wrote:

> The reason for fsync on open is the server doesn't know if the data it's reading off the file is committed fully to the disk. It's possible the server wrote to file and crashed before fsync, then restarted. Then it could refresh view indexes on the non-fsynced storage data, for example, and crash again, losing data in the storage file, but not the updates to the index file. Now the index is permanently out of date with the storage file. But if you fsync on opening the storage file, that can't happen.
> 
> -Damien

Re: optimal settings for [couchdb] fsync_options?

Posted by Damien Katz <da...@apache.org>.

The reason for fsync on open is that the server doesn't know whether the data it's reading off the file is fully committed to disk. It's possible that the server wrote to the file and crashed before fsync, then restarted. Then it could refresh view indexes on the non-fsynced storage data, for example, and crash again, losing data in the storage file, but not the updates to the index file. Now the index is permanently out of sync with the storage file. But if you fsync on opening the storage file, that can't happen.
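
As a minimal illustration of the idea (a Python sketch with os.fsync in place of file:sync(Fd); open_storage is a hypothetical name, not a CouchDB function), the open path would sync before anything is read:

```python
import os


def open_storage(path):
    """Open a storage file and fsync it before anything reads from it.

    Once this returns, the contents a reader sees have been fsync'ed,
    so an index built from them should not get ahead of what the
    storage file holds after a later crash.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.fsync(fd)
    return fd
```

The cost is one extra sync per open, paid even when the file was closed cleanly, since the server cannot tell the two cases apart.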

-Damien


On Apr 14, 2010, at 5:52 AM, Adam Kocoloski wrote:

> Initially posted on user@, but maybe it got lost in the noise.  Does anyone know why we call fsync when we open a file?
> 
> Adam