Posted to derby-dev@db.apache.org by Olav Sandstaa <Ol...@Sun.COM> on 2006/08/30 11:06:59 UTC

Options for syncing of log to disk

For writing the transaction log to disk Derby uses a
RandomAccessFile. If it is supported by the JVM, the log files are
opened in "rws" mode making the file system take care of syncing
writes to disk. "rws" mode ensures that both the data and the file
meta-data are updated for every write to the file. On most operating
systems this leads to two write operations to the disk for every write
issued by Derby, which limits the throughput of update-intensive
applications.
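For illustration, the difference boils down to the mode string passed to RandomAccessFile. A minimal sketch (file name and record contents are made up; this is not Derby's actual log code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class LogSyncDemo {
    // Append one record to a file opened in the given mode.
    // With "rws" every write syncs both file content and metadata;
    // with "rwd" only the content is synced, which can save one
    // device I/O per write.
    static void writeRecord(String path, String mode, byte[] record)
            throws IOException {
        try (RandomAccessFile log = new RandomAccessFile(path, mode)) {
            log.seek(log.length());  // append at the current end of the log
            log.write(record);       // returns after the sync implied by the mode
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] record = "commit record".getBytes();
        writeRecord("demo.log", "rws", record); // content + metadata synced
        writeRecord("demo.log", "rwd", record); // content only
    }
}
```

Whether "rwd" actually avoids the second device I/O is the OS- and JVM-specific question this mail is about.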

I have run some simple tests where I have changed mode from "rws" to
"rwd" for the Derby log file. When running a small number of
concurrent client threads the throughput is almost doubled and the
response time is almost halved. I am enclosing two graphs that show
this when running a given number of concurrent "tpc-b" clients. The
graphs show the throughput when running with "rws" and "rwd" mode when the
disk's write cache has been enabled and disabled.

This change should also have a positive impact on the Derby startup
time (DERBY-1664) and derbyall. With this change the time for running
derbyall goes down by about 10-15 minutes (approximately 10%) :-)

Is anyone aware of any issues caused by not updating the file
meta-data for every write? Are there any recovery scenarios where this
can make recovery fail? Derby seems to preallocate the log file
before starting to use it, so I think this should not influence
the ability to find the last data written to the file after a power
failure.
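A rough sketch of what such preallocation could look like (hypothetical names; not Derby's actual implementation):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class Preallocate {
    // Grow the file to its full size once, up front, so that later
    // "rwd" writes land inside the already-allocated region and never
    // change the file length. Hypothetical illustration, not Derby's code.
    static void preallocate(String path, long size) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            f.setLength(size);  // extend the file to its final length
            f.getFD().sync();   // one full sync covers the allocation metadata
        }
    }

    public static void main(String[] args) throws IOException {
        preallocate("prealloc.log", 1024 * 1024); // e.g. a 1 MB log file
    }
}
```

(Note that setLength may create a sparse file on some file systems, in which case blocks can still be allocated lazily on first write; that caveat is part of why the metadata question is hard to answer portably.)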

Any comments?

Thanks,
Olav

Re: Options for syncing of log to disk

Posted by Olav Sandstaa <Ol...@Sun.COM>.
Suresh Thalamati wrote:
> Olav Sandstaa wrote:
>> I have run some simple tests where I have changed mode from "rws" to
>> "rwd" for the Derby log file. When running a small numbers of
>> concurrent client threads the throughput is almost doubled and the
>> response time is almost halved. I am enclosing two graphs that show
>> this when running a given number of concurrent "tpc-b" clients. The
>> graphs show the throughput when running with "rws" and "rwd" mode 
>> when the
>> disk's write cache has been enabled and disabled.
>
> I am really surprised to notice the difference between rws / rwd 
> throughput in your graphs,  Especially when the write cache is 
> enabled.  When I ran small tests long time ago, did not notice any 
> substantial difference between rws/rwd, on windows for sure.

I was not surprised by the results with the disk's write cache disabled, 
but very surprised by the large effect when the write cache was enabled. 
I had expected to see hardly any difference in that case. This was on 
Solaris 10; on Linux 2.6 I did not see any difference between these cases 
(both with and without a write cache on the disk).

> Could u please post more details on the hardware/OS/JVM ..etc.  and 

The hardware is a dual Opteron server with two reasonably fast Fujitsu 
SCSI disks (one for the log and one for the database). If I remember 
correctly the disks have a 6 MB cache (shared between reading and 
writing). The OS is Solaris 10 x86 and the JVM is Sun JVM 1.5 update 4.

> What is the work done in each transaction ?

The transactions are similar to TPC-B transactions. Each consists of 
three updates, one insert and one select.

Olav


Re: Options for syncing of log to disk

Posted by Suresh Thalamati <su...@gmail.com>.
Olav Sandstaa wrote:

> 
> I have run some simple tests where I have changed mode from "rws" to
> "rwd" for the Derby log file. When running a small numbers of
> concurrent client threads the throughput is almost doubled and the
> response time is almost halved. I am enclosing two graphs that show
> this when running a given number of concurrent "tpc-b" clients. The
> graphs show the throughput when running with "rws" and "rwd" mode when the
> disk's write cache has been enabled and disabled.

I am really surprised to see the difference between rws/rwd 
throughput in your graphs, especially when the write cache is 
enabled. When I ran small tests a long time ago, I did not notice any 
substantial difference between rws/rwd, at least on Windows.

Could you please post more details on the hardware/OS/JVM, etc.? And 
what is the work done in each transaction?


> 
> This change should also have a positive impact on the Derby startup
> time (DERBY-1664) and derbyall. With this change the time for running
> derbyall goes down by about 10-15 minutes (approximately 10%) :-)


Another interesting thing to explore here might be whether it is really 
necessary to flush the log for the table creates during database 
creation. Some time ago, when I dumped the log for a database create, 
the only thing I noticed was create table log records; most of the 
other actions are done in unlogged mode during creation. Just for the 
record, these log flushes are not specific to write sync mode; they 
used to happen when the log was synced using file sync.

I was almost tempted to disable logging for those creates, but then 
thought: why add extra code to the regular create table path for a 
database-creation special case, which is supposed to happen only once :-)



> 
> Is there anyone that is aware of any issues by not updating the file
> meta-data for every write? Is there any recovery scenarios where this
> can make recovery fail? Derby seems to preallocates the log file
> before starting using the file, so I think this should not influence
> the ability to fine the last data written to the file after a power
> failure.
> 


The log is not preallocated dynamically; preallocation happens only on 
a log file switch. A log file switch is triggered during log flush 
calls (commit/abort). If a transaction is a long one, it is possible to 
write log records to the non-preallocated portion of a log file.

In the non-preallocated log file scenario, if the meta-data (I never 
managed to figure out what that exactly means on various operating 
systems) is not updated, the last data written to the log file might 
be lost.



Thanks
-suresht



Re: Options for syncing of log to disk

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Suresh Thalamati <su...@gmail.com> writes:

> Knut Anders Hatlen wrote:
>> Mike Matrigali <mi...@sbcglobal.net> writes:
>>
>>
>>>I actually don't even know if
>>>we have ever run a set of tests on freebsd. I wonder if anyone in
>>>the comunity has tried out the 10.2 beta on feebsd?  Just the
>>>results of a nightly test run would be interesting.
>>
>>
>> I frequently run derbyall on FreeBSD 6.1. With Diablo JDK 1.5.0-7
>> (native JDK built on J2SE 5.0 update release 7), the upgrade test
>> fails to boot the database because of a FileNotFoundException (but the
>> error message says "file exists").
>
>
> Thanks for running tests on FreeBsd, Knut. It is a good addtion  to
> the list of platforms the derby is tested on,
>
> The error you are seeing in the uprade test failures seems to be
> similar to the ones seen on Mac
> (DERBY-1(http://issues.apache.org/jira/browse/DERBY-1)) in version
> 10.0 while creating a database on some versions of JVMS. . I vaguely
> remember reading some where on the internet that  "rws" fails on
> FreeBSD too, on some jvm versions.  DERBY-1 was addressed in 10.1,
> with an workaround fix that changes the defaullt log file writes to
> file sync, if a FileNotFoundException occurs during a file creation
> with write sync.
>
> May be the upgrade test is failing while creating a database in
> version 10.0 using write sync ("rws"), while testing upgrade from 10.0
> to 10.1 ; just a wild guess.

Thank you, Suresh; the error I see is indeed DERBY-1. The upgrade test
fails when it tries to create a 10.1 database. I was under the
impression that the upgrade test used 10.1.2.1 (in which DERBY-1 is
fixed), but it is the 10.1.1.0 jars that are used. When I ran the
upgrade test with derbyTesting.jar.path pointing to a location which
contained 10.1.2.1 jars, it ran cleanly.

I have reported the issue to the FreeBSD Java team
(http://www.freebsd.org/cgi/query-pr.cgi?pr=102888).

-- 
Knut Anders

Re: Options for syncing of log to disk

Posted by Suresh Thalamati <su...@gmail.com>.
Knut Anders Hatlen wrote:
> Mike Matrigali <mi...@sbcglobal.net> writes:
> 
> 
>>I actually don't even know if
>>we have ever run a set of tests on freebsd. I wonder if anyone in
>>the comunity has tried out the 10.2 beta on feebsd?  Just the
>>results of a nightly test run would be interesting.
> 
> 
> I frequently run derbyall on FreeBSD 6.1. With Diablo JDK 1.5.0-7
> (native JDK built on J2SE 5.0 update release 7), the upgrade test
> fails to boot the database because of a FileNotFoundException (but the
> error message says "file exists"). 


Thanks for running tests on FreeBSD, Knut. It is a good addition to 
the list of platforms that Derby is tested on.

The error you are seeing in the upgrade test failures seems to be 
similar to the one seen on Mac 
(DERBY-1, http://issues.apache.org/jira/browse/DERBY-1) in version 
10.0 while creating a database on some versions of JVMs. I vaguely 
remember reading somewhere on the internet that "rws" fails on 
FreeBSD too, on some JVM versions. DERBY-1 was addressed in 10.1 
with a workaround fix that changes the default log file writes to 
file sync if a FileNotFoundException occurs during a file creation 
with write sync.

Maybe the upgrade test is failing while creating a database in 
version 10.0 using write sync ("rws") while testing upgrade from 10.0 
to 10.1; just a wild guess.


Thanks
-suresh






Re: Options for syncing of log to disk

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Mike Matrigali <mi...@sbcglobal.net> writes:

> This seems like an important performance issue for the platforms
> you note (solaris and freebsd).  Prior to opensource performance
> work concentrated on windows and linux, so there definitely may
> be work to do on other platforms.
>
> I actually don't even know if
> we have ever run a set of tests on freebsd. I wonder if anyone in
> the comunity has tried out the 10.2 beta on feebsd?  Just the
> results of a nightly test run would be interesting.

I frequently run derbyall on FreeBSD 6.1. With Diablo JDK 1.5.0-7
(native JDK built on J2SE 5.0 update release 7), the upgrade test
fails to boot the database because of a FileNotFoundException (but the
error message says "file exists"). All other tests pass. With JDK 1.6
running under Linux binary compatibility mode, all of the tests pass.

-- 
Knut Anders

Re: Options for syncing of log to disk

Posted by Olav Sandstå <ol...@gmail.com>.
Thanks for your suggestions, Mike. I will log a JIRA issue for this when
I get back from vacation in about a week. I will also try to look a bit
more into how some of these OSs behave when using "rwd" and the write
operation extends the size of the file.

Olav


On 9/1/06, Mike Matrigali <mi...@sbcglobal.net> wrote:
> This seems like an important performance issue for the platforms
> you note (solaris and freebsd).  Prior to opensource performance
> work concentrated on windows and linux, so there definitely may
> be work to do on other platforms.
>
> I actually don't even know if
> we have ever run a set of tests on freebsd. I wonder if anyone in
> the comunity has tried out the 10.2 beta on feebsd?  Just the
> results of a nightly test run would be interesting.
>
> I suggest you log a JIRA issue with your observations on log
> I/O on solaris.  The 2 I/O overhead is serious, it is why we
> did a lot of work with the logging system to fix it on windows.
>
> There are a number of possible solutions, depending on information
> about what exactly rws and rwd mean on solaris.  I wonder if we
> could get implementation specific answers from the JVM engineers
> as SUN about this?  I think the 2 key questions are:
> 1) does rwd guarantee I/O safety on an I/O that does not extend the file.
> 2) does rwd guarantee I/O saftey on an I/O that does extend the file.
>
> Unfortunately I don't know of a test program we can run to determine
> this beyond a doubt.  And the JVM spec does not make it clear.  Maybe
> we can get JVM providers to at least say what their implementation
> does.
>
> If 1 is true but 2 is not, then we could look at derby changes to
> the logging system.  A couple of ideas:
> 1) switch I/O modes when we start extending the log file (a kludge in
>     my opinion but probably works).  Requires synchronization in a tricky
>     area of logging and closing the file and reopening it.
> 2) Change the logging system to never extend the file.  probably a
> medium level logging system project.  Suresh probably could size this
> one better.  I think issues here are:
>     o spliting a single log record
>     o changing log system assumptions about when log switches are possible
>     o others???
>
> Note that the actual implementation of reading/writing log records is
> pretty low level and should be plugable.  The rest of the system just
> assumes the log is one single stream of records.
>
> I also believe there is a JSR that is looking at I/O improvement to
> future java releases.  We should get someone from Derby involved in
> that.
>
> Olav Sandstaa wrote:
> > Mike,
> >
> > Thanks for your thought on this, see my comments further down.
> >
> > Mike Matrigali wrote:
> >
> >> Initial writes to the log are to a preallocated log, but once it
> >> is filled it is possible for there to be writes that extend the log
> >> and thus it is not safe to not sync metadata that tracks the
> >> length of the file.
> >
> >
> > I agree that this is a problem that needs to be looked into if we would
> > consider to change from using "rws" to "rwd" for the log files. My main
> > purpose with sending out the performance numbers was to illustrate that
> > there on some platforms is a potential for performance improvements and
> > to get feedback on what issues that needed to be looked into in order to
> > take advantage of this.
> >
> >> Unfortunately this behavior is hardware, OS and JVM specific, and
> >> the exact
> >> meaning of rws and rwd is left vague in the javadoc that I have read.
> >> The javadoc usually says syncs "metadata" but does not explain what
> >> metadata.
> >
> >
> > I certainly agree that the javadoc is vague. For RandomAcceFile the
> > javadoc says [1]:
> >
> > "The "rws" and "rwd" modes work much like the |force(boolean)|
> > <http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html#force%28boolean%29>
> > method of the |FileChannel|
> > <http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html>
> > class, passing arguments of true and false, respectively, except that
> > they always apply to every I/O operation and are therefore often more
> > efficient. If the file resides on a local storage device then when an
> > invocation of a method of this class returns it is guaranteed that all
> > changes made to the file by that invocation will have been written to
> > that device. This is useful for ensuring that critical information is
> > not lost in the event of a system crash."
> >
> > It is very unclear if the last sentence refer to only files opened with
> > "rws" mode or also holds true when opened with "rwd" mode.
> >
> >> When I worked on this issue for another db vendor, direct
> >> OS access usually provide 3 rather than 2 options.  The 3 options
> >> were:
> >> 1) no metadata sync
> >> 2) only sync file allocation metadata
> >> 3) sync file allocation metadata and other metadata.  The problem
> >>    is that other metadata includes the last modified time info
> >>    which is updated every write to the file.
> >>
> >> What do you mean by "most OS"?
> >
> >
> > Solaris and FreeBSD :-) I have also tried this on Linux 2.6 (Red Hat
> > 4.0?) , but Linux seems to handle "rwd" the same as "rws" (or rather
> > files where the application requests to be opened with the O_DSYNC flag
> > is actually opened with just O_SYNC).
> >
> > The JavaDoc for RandomAccessFile is also indicating that using rws needs
> > two updates to the disk: "using "rws" requires updates to both the
> > file's content and its metadata to be written, which generally requires
> > at least one more low-level I/O operation." [1] Based on this I also
> > assumed that this would be a penalty on "most OS".
> >
> >> What OS/JVM are your numbers from?
> >
> >
> > My numbers where from a machine running Solaris 10 x86 and Sun JVM 1.5.
> >
> >> When the sync option on the log was switched from using full file
> >> sync to "rws" mode tests were run which I believe included linux
> >> (probably only a single version of linux - not sure which) and
> >> XP with sun and ibm jvms (probably 1.4.2 as I think that was the latest
> >> JVM at the time), I think apple OS was also tested but I am not sure.
> >> The first implementation simply switched the
> >> to the "rws" mode but left the log file to grow as needed, "rws" mode
> >> was picked because it is impossible to tell if file allocation metadata
> >> is synced as part of "rwd" so in order to guarantee transaction
> >> consistency the safest mode was picked.  Tests were run which observed
> >> if we preallocated the log file then I/O to a preallocated file that
> >> did not extend the file only paid 1 I/O per sync.  So work was done
> >> to make most log I/O only happen to a preallocated file, but the logging
> >> system was not changed to guarantee all I/O was to a preallocated file.
> >
> >
> > Thanks for the background for how and why "rws" was selected. Based on
> > your observations and the unclear semantics for "rwd" and metadata I
> > agree that this was a good choice. Still, I think paying the extra cost
> > of having to do two disk operations per log write on "some OSs" is high
> > and can make Derby perform worse than some of the other open-source
> > databases on these OSs.
> >
> >> It is probably worth resurrecting the simple I/O test program, to let
> >> people run on their various JVM/OS combinations.  As has been noted in
> >> the past the results of such a test can be thrown way off by the
> >> hardware involved.  If the hardware/filesystem has had write cache
> >> enabled then none of these syncs can do their job and transactions are
> >> at risk no matter what option is picked.
> >
> >
> > Sounds like a very good idea to get data for how various JVM/OS
> > combinations are handling this.
> >
> >> Also it is more common nowadays for higher end hardware to have
> >> battery backed cache to
> >> optimize the sync case, which then provides instantaneous return from
> >> the sync request but provides safe transaction as it guarantees the
> >> write on failure (I've seen this as part of the disk and as part of
> >> the controller).  This particular hardware feature works VERY well for
> >> the derby log I/O case as the block being synced for the log file
> >> metadata tends to be the same block over and over again so basically
> >> the cache space for it is on the order of 8k.
> >
> >
> > And it is very common for lower end hardware to have the disk's write
> > cache enabled to get similar performance. And most users will be very
> > happy with this and unaware of the consequences until one day their
> > favorite database is unable to recover after a power failure....
> >
> > Regards,
> > Olav
> >
> > [1] http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html
> >
> >
> >>
> >> Olav Sandstaa wrote:
> >>
> >>> For writing the transaction log to disk Derby uses a
> >>> RandomAccessFile. If it is supported by the JVM, the log files are
> >>> opened in "rws" mode making the file system take care of syncing
> >>> writes to disk. "rws" mode will ensure that both the data and the file
> >>> meta-data is updated for every write to the file. On most operating
> >>> system this leads to two write operation to the disk for every write
> >>> issued by Derby. This is limiting the throughput of update intensive
> >>> applications.
> >>>
> >>> I have run some simple tests where I have changed mode from "rws" to
> >>> "rwd" for the Derby log file. When running a small numbers of
> >>> concurrent client threads the throughput is almost doubled and the
> >>> response time is almost halved. I am enclosing two graphs that show
> >>> this when running a given number of concurrent "tpc-b" clients. The
> >>> graphs show the throughput when running with "rws" and "rwd" mode
> >>> when the
> >>> disk's write cache has been enabled and disabled.
> >>>
> >>> This change should also have a positive impact on the Derby startup
> >>> time (DERBY-1664) and derbyall. With this change the time for running
> >>> derbyall goes down by about 10-15 minutes (approximately 10%) :-)
> >>>
> >>> Is there anyone that is aware of any issues by not updating the file
> >>> meta-data for every write? Is there any recovery scenarios where this
> >>> can make recovery fail? Derby seems to preallocates the log file
> >>> before starting using the file, so I think this should not influence
> >>> the ability to fine the last data written to the file after a power
> >>> failure.
> >>>
> >>> Any comments?
> >>>
> >>> Thanks,
> >>> Olav
> >>>
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>
> >
> >
> >
>
>

Re: Options for syncing of log to disk

Posted by Bryan Pendleton <bp...@amberpoint.com>.
> I also believe there is a JSR that is looking at I/O improvement to
> future java releases.  We should get someone from Derby involved in
> that.

This looks like it could be the JSR you're thinking of:
http://jcp.org/en/jsr/detail?id=203

JSR 203 was planned to be part of JDK 1.6 but has moved to JDK 1.7.
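For what it's worth, a sketch of what JSR 203 (NIO.2) exposes: the two sync levels become explicit open options, with DSYNC roughly matching "rwd" and SYNC matching "rws" (file name made up; requires a JDK with NIO.2):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.*;

public class Nio2SyncDemo {
    public static void main(String[] args) throws IOException {
        // DSYNC ~ "rwd" (sync file content only);
        // SYNC  ~ "rws" (sync content and metadata).
        try (FileChannel ch = FileChannel.open(Paths.get("nio2.log"),
                CREATE, WRITE, TRUNCATE_EXISTING, DSYNC)) {
            ch.write(ByteBuffer.wrap("log record".getBytes()));
        }
    }
}
```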

thanks,

bryan


Re: Options for syncing of log to disk

Posted by Mike Matrigali <mi...@sbcglobal.net>.
This seems like an important performance issue for the platforms
you note (Solaris and FreeBSD).  Prior to open-sourcing, performance
work concentrated on Windows and Linux, so there definitely may
be work to do on other platforms.

I actually don't even know if we have ever run a set of tests on
FreeBSD. I wonder if anyone in the community has tried out the 10.2
beta on FreeBSD?  Just the results of a nightly test run would be
interesting.

I suggest you log a JIRA issue with your observations on log
I/O on Solaris.  The two-I/O overhead is serious; it is why we
did a lot of work on the logging system to fix it on Windows.

There are a number of possible solutions, depending on information
about what exactly rws and rwd mean on Solaris.  I wonder if we
could get implementation-specific answers from the JVM engineers
at Sun about this?  I think the two key questions are:
1) does rwd guarantee I/O safety on an I/O that does not extend the file?
2) does rwd guarantee I/O safety on an I/O that does extend the file?

Unfortunately I don't know of a test program we can run to determine
this beyond a doubt.  And the JVM spec does not make it clear.  Maybe
we can get JVM providers to at least say what their implementation
does.

If 1 is true but 2 is not, then we could look at Derby changes to
the logging system.  A couple of ideas:
1) switch I/O modes when we start extending the log file (a kludge in
    my opinion, but it probably works).  Requires synchronization in a
    tricky area of logging, and closing the file and reopening it.
2) change the logging system to never extend the file.  Probably a 
medium-sized logging system project.  Suresh could probably size this 
one better.  I think the issues here are:
    o splitting a single log record
    o changing log system assumptions about when log switches are possible
    o others???
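The first idea could be sketched roughly like this (names and structure are hypothetical, not Derby's actual logging code; assumes callers serialize on the object):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SwitchingLogFile {
    private final String path;
    private final long preallocatedLength;
    private RandomAccessFile log;
    private boolean metadataSync;  // false = "rwd", true = "rws"

    SwitchingLogFile(String path, long preallocatedLength) throws IOException {
        this.path = path;
        this.preallocatedLength = preallocatedLength;
        this.log = new RandomAccessFile(path, "rwd");
        this.log.setLength(preallocatedLength);  // preallocate up front
    }

    // Stay in "rwd" while the write fits inside the preallocated region;
    // reopen in "rws" once a write would extend the file (the kludge part).
    synchronized void write(long offset, byte[] record) throws IOException {
        if (offset + record.length > preallocatedLength && !metadataSync) {
            log.close();
            log = new RandomAccessFile(path, "rws");
            metadataSync = true;
        }
        log.seek(offset);
        log.write(record);
    }

    synchronized void close() throws IOException {
        log.close();
    }
}
```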

Note that the actual implementation of reading/writing log records is
pretty low level and should be pluggable.  The rest of the system just
assumes the log is one single stream of records.

I also believe there is a JSR that is looking at I/O improvements in
future Java releases.  We should get someone from Derby involved in
that.

Olav Sandstaa wrote:
> Mike,
> 
> Thanks for your thought on this, see my comments further down.
> 
> Mike Matrigali wrote:
> 
>> Initial writes to the log are to a preallocated log, but once it
>> is filled it is possible for there to be writes that extend the log
>> and thus it is not safe to not sync metadata that tracks the
>> length of the file.
> 
> 
> I agree that this is a problem that needs to be looked into if we would 
> consider to change from using "rws" to "rwd" for the log files. My main 
> purpose with sending out the performance numbers was to illustrate that 
> there on some platforms is a potential for performance improvements and 
> to get feedback on what issues that needed to be looked into in order to 
> take advantage of this.
> 
>> Unfortunately this behavior is hardware, OS and JVM specific, and
>> the exact
>> meaning of rws and rwd is left vague in the javadoc that I have read.
>> The javadoc usually says syncs "metadata" but does not explain what
>> metadata.  
> 
> 
> I certainly agree that the javadoc is vague. For RandomAcceFile the 
> javadoc says [1]:
> 
> "The "rws" and "rwd" modes work much like the |force(boolean)| 
> <http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html#force%28boolean%29> 
> method of the |FileChannel| 
> <http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html> 
> class, passing arguments of true and false, respectively, except that 
> they always apply to every I/O operation and are therefore often more 
> efficient. If the file resides on a local storage device then when an 
> invocation of a method of this class returns it is guaranteed that all 
> changes made to the file by that invocation will have been written to 
> that device. This is useful for ensuring that critical information is 
> not lost in the event of a system crash."
> 
> It is very unclear if the last sentence refer to only files opened with 
> "rws" mode or also holds true when opened with "rwd" mode.
> 
>> When I worked on this issue for another db vendor, direct
>> OS access usually provide 3 rather than 2 options.  The 3 options
>> were:
>> 1) no metadata sync
>> 2) only sync file allocation metadata
>> 3) sync file allocation metadata and other metadata.  The problem
>>    is that other metadata includes the last modified time info
>>    which is updated every write to the file.
>>
>> What do you mean by "most OS"?
> 
> 
> Solaris and FreeBSD :-) I have also tried this on Linux 2.6 (Red Hat 
> 4.0?) , but Linux seems to handle "rwd" the same as "rws" (or rather 
> files where the application requests to be opened with the O_DSYNC flag 
> is actually opened with just O_SYNC).
> 
> The JavaDoc for RandomAccessFile is also indicating that using rws needs 
> two updates to the disk: "using "rws" requires updates to both the 
> file's content and its metadata to be written, which generally requires 
> at least one more low-level I/O operation." [1] Based on this I also 
> assumed that this would be a penalty on "most OS".
> 
>> What OS/JVM are your numbers from?
> 
> 
> My numbers where from a machine running Solaris 10 x86 and Sun JVM 1.5.
> 
>> When the sync option on the log was switched from using full file
>> sync to "rws" mode tests were run which I believe included linux 
>> (probably only a single version of linux - not sure which) and
>> XP with sun and ibm jvms (probably 1.4.2 as I think that was the latest
>> JVM at the time), I think apple OS was also tested but I am not sure. 
>> The first implementation simply switched the
>> to the "rws" mode but left the log file to grow as needed, "rws" mode
>> was picked because it is impossible to tell if file allocation metadata
>> is synced as part of "rwd" so in order to guarantee transaction
>> consistency the safest mode was picked.  Tests were run which observed
>> if we preallocated the log file then I/O to a preallocated file that
>> did not extend the file only paid 1 I/O per sync.  So work was done
>> to make most log I/O only happen to a preallocated file, but the logging
>> system was not changed to guarantee all I/O was to a preallocated file.
> 
> 
> Thanks for the background for how and why "rws" was selected. Based on 
> your observations and the unclear semantics for "rwd" and metadata I 
> agree that this was a good choice. Still, I think paying the extra cost 
> of having to do two disk operations per log write on "some OSs" is high 
> and can make Derby perform worse than some of the other open-source 
> databases on these OSs.
> 
>> It is probably worth resurrecting the simple I/O test program, to let
>> people run on their various JVM/OS combinations.  As has been noted in
>> the past the results of such a test can be thrown way off by the
>> hardware involved.  If the hardware/filesystem has had write cache
>> enabled then none of these syncs can do their job and transactions are
>> at risk no matter what option is picked.
> 
> 
> Sounds like a very good idea to get data for how various JVM/OS 
> combinations are handling this.
> 
>> Also it is more common nowadays for higher end hardware to have 
>> battery backed cache to
>> optimize the sync case, which then provides instantaneous return from
>> the sync request but provides safe transaction as it guarantees the
>> write on failure (I've seen this as part of the disk and as part of
>> the controller).  This particular hardware feature works VERY well for
>> the derby log I/O case as the block being synced for the log file
>> metadata tends to be the same block over and over again so basically
>> the cache space for it is on the order of 8k.
> 
> 
> And it is very common for lower end hardware to have the disk's write 
> cache enabled to get similar performance. And most users will be very 
> happy with this and unaware of the consequences until one day their 
> favorite database is unable to recover after a power failure....
> 
> Regards,
> Olav
> 
> [1] http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html
> 
> 


Re: Options for syncing of log to disk

Posted by Olav Sandstaa <Ol...@Sun.COM>.
Mike,

Thanks for your thoughts on this; see my comments further down.

Mike Matrigali wrote:
> Initial writes to the log are to a preallocated log, but once it
> is filled it is possible for there to be writes that extend the log,
> and thus it is not safe to skip syncing the metadata that tracks the
> length of the file.

I agree that this is a problem that needs to be looked into if we
consider changing from "rws" to "rwd" for the log files. My main
purpose in sending out the performance numbers was to illustrate that
on some platforms there is potential for performance improvement, and
to get feedback on which issues need to be resolved in order to take
advantage of it.

> Unfortunately this behavior is hardware, OS and JVM specific, and
> the exact
> meaning of rws and rwd is left vague in the javadoc that I have read.
> The javadoc usually says it syncs "metadata" but does not explain what
> metadata.  

I certainly agree that the javadoc is vague. For RandomAccessFile the
javadoc says [1]:

 "The "rws" and "rwd" modes work much like the force(boolean) method
of the FileChannel class, passing arguments of true and false,
respectively, except that they always apply to every I/O operation and
are therefore often more efficient. If the file resides on a local
storage device then when an invocation of a method of this class
returns it is guaranteed that all changes made to the file by that
invocation will have been written to that device. This is useful for
ensuring that critical information is not lost in the event of a
system crash."

It is unclear whether the last sentence refers only to files opened in
"rws" mode or also holds true for files opened in "rwd" mode.
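As a concrete illustration, the open modes under discussion can be
exercised like this (a minimal sketch; the file names are made up, and
the force(true)/force(false) correspondence is the one the javadoc
describes):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class SyncModes {
    public static void main(String[] args) throws IOException {
        byte[] record = {1, 2, 3};

        // "rws": every write syncs file content and metadata,
        // much like FileChannel.force(true) after each write.
        RandomAccessFile rws = new RandomAccessFile("log.rws", "rws");
        rws.write(record);
        rws.close();

        // "rwd": every write syncs file content only,
        // much like FileChannel.force(false) after each write.
        RandomAccessFile rwd = new RandomAccessFile("log.rwd", "rwd");
        rwd.write(record);
        rwd.close();

        // Explicit variant: plain "rw" plus a force() call, which
        // lets the caller choose when and what to sync.
        RandomAccessFile rw = new RandomAccessFile("log.rw", "rw");
        FileChannel ch = rw.getChannel();
        ch.write(ByteBuffer.wrap(record));
        ch.force(false); // false = data only, analogous to "rwd"
        rw.close();
    }
}
```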

> When I worked on this issue for another db vendor, direct
> OS access usually provides 3 rather than 2 options.  The 3 options
> were:
> 1) no metadata sync
> 2) only sync file allocation metadata
> 3) sync file allocation metadata and other metadata.  The problem
>    is that other metadata includes the last modified time info
>    which is updated every write to the file.
>
> What do you mean by "most OS"?

Solaris and FreeBSD :-) I have also tried this on Linux 2.6 (Red Hat
4.0?), but Linux seems to handle "rwd" the same as "rws" (or rather,
files that the application requests be opened with the O_DSYNC flag
are actually opened with just O_SYNC).

The javadoc for RandomAccessFile also indicates that using "rws" needs
two updates to the disk: "using "rws" requires updates to both the
file's content and its metadata to be written, which generally requires
at least one more low-level I/O operation." [1] Based on this I also
assumed that this would be a penalty on "most OSs".

> What OS/JVM are your numbers from?

My numbers were from a machine running Solaris 10 x86 and Sun JVM 1.5.

> When the sync option on the log was switched from using full file
> sync to "rws" mode tests were run which I believe included linux 
> (probably only a single version of linux - not sure which) and
> XP with sun and ibm jvms (probably 1.4.2 as I think that was the latest
> JVM at the time), I think apple OS was also tested but I am not sure. 
> The first implementation simply switched to
> the "rws" mode but left the log file to grow as needed; "rws" mode
> was picked because it is impossible to tell if file allocation metadata
> is synced as part of "rwd" so in order to guarantee transaction
> consistency the safest mode was picked.  Tests were run which observed
> if we preallocated the log file then I/O to a preallocated file that
> did not extend the file only paid 1 I/O per sync.  So work was done
> to make most log I/O only happen to a preallocated file, but the logging
> system was not changed to guarantee all I/O was to a preallocated file.

Thanks for the background on how and why "rws" was selected. Based on
your observations and the unclear semantics of "rwd" and metadata, I
agree that this was a good choice. Still, I think paying the extra
cost of two disk operations per log write on "some OSs" is high and
can make Derby perform worse than some of the other open-source
databases on these OSs.

> It is probably worth resurrecting the simple I/O test program, to let
> people run on their various JVM/OS combinations.  As has been noted in
> the past the results of such a test can be thrown way off by the
> hardware involved.  If the hardware/filesystem has had write cache
> enabled then none of these syncs can do their job and transactions are
> at risk no matter what option is picked.

Sounds like a very good idea to get data for how various JVM/OS 
combinations are handling this.
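A sketch of what such a test program might look like (the file name,
record size and iteration count are arbitrary choices, not from the
original test; and, as noted above, the numbers only mean something
when the disk's write cache is disabled):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class LogSyncBench {
    public static void main(String[] args) throws IOException {
        String mode = args.length > 0 ? args[0] : "rwd"; // or "rws"
        int writes = 1000;
        byte[] record = new byte[512]; // one small log record

        RandomAccessFile log = new RandomAccessFile("bench.log", mode);
        log.setLength((long) writes * record.length); // preallocate

        long start = System.nanoTime();
        for (int i = 0; i < writes; i++) {
            log.write(record); // synced according to the open mode
        }
        long elapsed = System.nanoTime() - start;
        log.close();

        System.out.println(mode + ": " + elapsed / writes / 1000
                + " us per synced write");
    }
}
```

Running it once with "rws" and once with "rwd" on a given JVM/OS
combination would show directly whether the extra metadata sync costs
a second I/O there.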

> Also it is more common nowadays for higher end hardware to have 
> battery backed cache to
> optimize the sync case, which then provides instantaneous return from
> the sync request but provides safe transactions as it guarantees the
> write on failure (I've seen this as part of the disk and as part of
> the controller).  This particular hardware feature works VERY well for
> the derby log I/O case as the block being synced for the log file
> metadata tends to be the same block over and over again so basically
> the cache space for it is on the order of 8k.

And it is very common for lower-end hardware to have the disk's write
cache enabled to get similar performance. Most users will be very
happy with this and unaware of the consequences until one day their
favorite database is unable to recover after a power failure....

Regards,
Olav

[1] http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html




Re: Options for syncing of log to disk

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Initial writes to the log are to a preallocated log, but once it
is filled it is possible for there to be writes that extend the log,
and thus it is not safe to skip syncing the metadata that tracks the
length of the file.

Unfortunately this behavior is hardware, OS and JVM specific, and the
exact meaning of rws and rwd is left vague in the javadoc that I have
read. The javadoc usually says it syncs "metadata" but does not
explain what metadata.  When I worked on this issue for another db
vendor, direct OS access usually provided 3 rather than 2 options.
The 3 options were:
1) no metadata sync
2) only sync file allocation metadata
3) sync file allocation metadata and other metadata.  The problem
    is that other metadata includes the last modified time info,
    which is updated on every write to the file.

What do you mean by "most OS"?
What OS/JVM are your numbers from?

When the sync option on the log was switched from using full file
sync to "rws" mode, tests were run which I believe included linux
(probably only a single version of linux - not sure which) and
XP with sun and ibm jvms (probably 1.4.2, as I think that was the
latest JVM at the time); I think apple OS was also tested but I am
not sure.  The first implementation simply switched to the "rws" mode
but left the log file to grow as needed; "rws" mode was picked because
it is impossible to tell whether file allocation metadata is synced as
part of "rwd", so in order to guarantee transaction consistency the
safest mode was picked.  Tests showed that if we preallocated the log
file, then I/O that did not extend the file paid only 1 I/O per sync.
So work was done to make most log I/O happen to a preallocated file,
but the logging system was not changed to guarantee that all I/O is
to a preallocated file.
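The preallocation idea described above can be sketched as follows.
This is an illustrative sketch, not Derby's actual logging code; the
class and method names are made up. The point is that extending the
file to its full size up front means later record writes never change
the file length:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of log preallocation: the file is extended to its full size
// before use, so appended records stay inside the allocated region.
public class PreallocatedLog {
    private final RandomAccessFile file;

    public PreallocatedLog(String name, long size) throws IOException {
        file = new RandomAccessFile(name, "rws");
        // Writing the last byte forces allocation of the whole file
        // before any log record is appended.
        file.seek(size - 1);
        file.write(0);
        file.seek(0);
    }

    public void append(byte[] record) throws IOException {
        // A write inside the preallocated region does not extend the
        // file, so no file-length metadata has to change on sync.
        file.write(record);
    }

    public void close() throws IOException {
        file.close();
    }
}
```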

It is probably worth resurrecting the simple I/O test program, to let
people run on their various JVM/OS combinations.  As has been noted in
the past the results of such a test can be thrown way off by the
hardware involved.  If the hardware/filesystem has had write cache
enabled then none of these syncs can do their job and transactions are
at risk no matter what option is picked.

Also it is more common nowadays for higher-end hardware to have
battery-backed cache to optimize the sync case, which then provides
instantaneous return from the sync request but provides safe
transactions as it guarantees the write on failure (I've seen this as
part of the disk and as part of
the controller).  This particular hardware feature works VERY well for
the derby log I/O case as the block being synced for the log file
metadata tends to be the same block over and over again so basically
the cache space for it is on the order of 8k.

