Posted to derby-dev@db.apache.org by Øystein Grøvlen <Oy...@Sun.COM> on 2005/10/31 22:52:01 UTC

Derby I/O issues during checkpointing

Some test runs we have done show very long transaction response times
during checkpointing.  This has been seen on several platforms.  The
load is TPC-B-like transactions, and the write cache is turned off, so
the system is I/O bound.  There seem to be two major issues:

1. Derby does checkpointing by writing all dirty pages with
   RandomAccessFile.write() and then doing a file sync when the entire
   cache has been scanned.  When the page cache is large, the file
   system buffer will overflow during checkpointing, and occasionally
   the writes will take very long.  I have observed single write
   operations that took almost 12 seconds.  What is even worse is that
   during this period, read performance on other files can also be very
   bad.  For example, reading an index page from disk can take close
   to 10 seconds while the base table is checkpointed.  Hence,
   transactions are severely slowed down.

   I have managed to improve response times by flushing every file for
   every 100th write.  Is this something we should consider including
   in the code?  Do you have better suggestions?
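
   A minimal sketch of the idea (illustrative names only; the real
   change would go into the checkpoint write path of RAFContainer, and
   it needs java.io.RandomAccessFile and java.io.IOException):

        private static final int WRITES_PER_SYNC = 100;
        private int writeCount = 0;

        // Write one page during checkpoint, forcing the file system
        // buffer to disk after every 100th write so the backlog of
        // unwritten pages never gets large.
        private void checkpointWrite(RandomAccessFile file, long pageOffset,
                                     byte[] pageData, int pageSize)
            throws IOException
        {
            file.seek(pageOffset);
            file.write(pageData, 0, pageSize);
            if (++writeCount >= WRITES_PER_SYNC) {
                file.getFD().sync();   // flush before the buffer overflows
                writeCount = 0;
            }
        }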

2. What makes things even worse is that only a single thread can read a
   page from a file at a time.  (Note that Derby has one file per
   table.)  This is because the implementation of RAFContainer.readPage
   is as follows:

        synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
            fileData.seek(pageOffset);  // fileData is a RandomAccessFile
            fileData.readFully(pageData, 0, pageSize);
        }

   During checkpoints, when I/O is slow, this creates long queues of
   readers.  In my run with 20 clients, I observed read requests that
   took more than 20 seconds.

   This behavior will also limit throughput and can partly explain
   why I get low CPU utilization with 20 clients.  All my TPC-B
   clients are serialized since most will need 1-2 disk accesses
   (an index leaf page and one page of the account table).

   Generally, in order to allow the OS to optimize I/O, one should
   have many outstanding I/O calls at a time.  (See Frederiksen,
   Bonnet: "Getting Priorities Straight: Improving Linux Support for
   Database I/O", VLDB 2005.)

   I have attached a patch where I have introduced several file
   descriptors (RandomAccessFile objects) per RAFContainer.  These are
   used for reading.  The principle is that when all readers are busy,
   a readPage request will create a new reader.  (There is a maximum
   number of readers.)  With this patch, throughput was improved by
   50% on Linux.  The combination of this patch and the syncing for
   every 100th write reduced maximum transaction response times by
   90%.

   The patch is not ready for inclusion into Derby, but I would like
   to hear whether you think this is a viable approach.

-- 
Øystein

Index: java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java
===================================================================
--- java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java   (revision 312819)
+++ java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java (working copy)
@@ -45,7 +45,8 @@
 import org.apache.derby.io.StorageFile;
 import org.apache.derby.io.StorageRandomAccessFile;
 
-import java.util.Vector;
+import java.util.ArrayList;
+import java.util.List;
 
 import java.io.DataInput;
 import java.io.IOException;
@@ -66,12 +67,15 @@
       * Immutable fields
      */
     protected StorageRandomAccessFile fileData;
-
+        
   /* 
     ** Mutable fields, only valid when the identity is valid.
       */
      protected boolean                       needsSync;
 
+    private int openReaders;
+    private List freeReaders;
+
     /* privileged actions */
     private int actionCode;
     private static final int GET_FILE_NAME_ACTION = 1;
@@ -79,6 +83,7 @@
     private static final int REMOVE_FILE_ACTION = 3;
     private static final int OPEN_CONTAINER_ACTION = 4;
     private static final int STUBBIFY_ACTION = 5;
+    private static final int OPEN_READONLY_ACTION = 6;
     private ContainerKey actionIdentity;
     private boolean actionStub;
     private boolean actionErrorOK;
@@ -86,12 +91,15 @@
     private StorageFile actionFile;
     private LogInstant actionInstant;
     
+
   /*
       * Constructors
          */
 
    RAFContainer(BaseDataFileFactory factory) {
             super(factory);
+        openReaders = 0;
+        freeReaders = new ArrayList();
         }
 
      /*
@@ -193,12 +201,25 @@
 
                long pageOffset = pageNumber * pageSize;
 
-              synchronized (this) {
+        
+        StorageRandomAccessFile reader = null;
+        for (;;) {
+            synchronized(freeReaders) {
+                if (freeReaders.size() > 0) {
+                    reader = (StorageRandomAccessFile)freeReaders.remove(0);
+                    break;
+                }
+            }
+            openNewReader();
+        } 
 
-                        fileData.seek(pageOffset);
 
-                    fileData.readFully(pageData, 0, pageSize);
-             }
+        reader.seek(pageOffset);
+        reader.readFully(pageData, 0, pageSize);
+        synchronized(freeReaders) {
+            freeReaders.add(reader);
+            freeReaders.notify();
+        }
 
               if (dataFactory.databaseEncrypted() &&
                  pageNumber != FIRST_ALLOC_PAGE_NUMBER)
@@ -769,6 +790,21 @@
         finally{ actionIdentity = null; }
     }
 
+
+   synchronized boolean openNewReader()
+        throws StandardException
+    {
+        actionCode = OPEN_READONLY_ACTION;
+         actionIdentity = (ContainerKey)getIdentity();
+        try
+        {
+            return AccessController.doPrivileged( this) != null;
+        }
+        catch( PrivilegedActionException pae){ throw (StandardException) pae.getException();}
+        finally{ actionIdentity = null; }
+    }
+
+
  private synchronized void stubbify(LogInstant instant)
         throws StandardException
         {
@@ -1112,6 +1148,52 @@
              dataFactory.stubFileToRemoveAfterCheckPoint(stub,actionInstant, getIdentity());
              return null;
          } // end of case STUBBIFY_ACTION
+         case OPEN_READONLY_ACTION:
+         {
+             try {
+                 synchronized(freeReaders) {
+                     if (openReaders > 20) {
+                         freeReaders.wait();
+                         return null;
+                     } else {
+                         ++openReaders;
+                     }
+                 }
+             } catch (InterruptedException ie) {
+                 throw StandardException.newException(
+                     SQLState.DATA_UNEXPECTED_EXCEPTION, ie);
+             }
+             
+             StorageFile file = privGetFileName(actionIdentity, false, true, true);
+             if (file == null)
+                 return null;
+
+             try {
+                 if (!file.exists()) {
+                     return null;
+                 }
+             } catch (SecurityException se) {
+                 throw StandardException.newException(
+                     SQLState.DATA_UNEXPECTED_EXCEPTION, se);
+             }
+
+             try {
+
+                 StorageRandomAccessFile reader = file.getRandomAccessFile("r");
+                 synchronized(freeReaders) {
+                     freeReaders.add(reader);
+                 }
+//                 SanityManager.DEBUG_PRINT("RAFContainer", "Opens reader no. " + openReaders);
+
+             } catch (IOException ioe) {
+                 throw dataFactory.markCorrupt(
+                     StandardException.newException(
+                         SQLState.FILE_CONTAINER_EXCEPTION, ioe, this));
+             }
+
+             return this;
+         } // end of case OPEN_READONLY_ACTION
+
          }
          return null;
      } // end of run


Re: Derby I/O issues during checkpointing

Posted by Daniel John Debrunner <dj...@debrunners.com>.
Øystein Grøvlen wrote:

> Some test runs we have done show very long transaction response times
> during checkpointing.  This has been seen on several platforms.  The
> load is TPC-B-like transactions, and the write cache is turned off, so
> the system is I/O bound.  There seem to be two major issues:

Nice investigation, I think I have seen similar problems on Windows.

> 1. Derby does checkpointing by writing all dirty pages with
>    RandomAccessFile.write() and then doing a file sync when the entire
>    cache has been scanned.  When the page cache is large, the file
>    system buffer will overflow during checkpointing, and occasionally
>    the writes will take very long.  I have observed single write
>    operations that took almost 12 seconds.  What is even worse is that
>    during this period, read performance on other files can also be very
>    bad.  For example, reading an index page from disk can take close
>    to 10 seconds while the base table is checkpointed.  Hence,
>    transactions are severely slowed down.
> 
>    I have managed to improve response times by flushing every file for
>    every 100th write.  Is this something we should consider including
>    in the code?  Do you have better suggestions?

Sounds reasonable.


> 
> 2. What makes things even worse is that only a single thread can read a
>    page from a file at a time.  (Note that Derby has one file per
>    table.)  This is because the implementation of RAFContainer.readPage
>    is as follows:
> 
>         synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
>             fileData.seek(pageOffset);  // fileData is a RandomAccessFile
>             fileData.readFully(pageData, 0, pageSize);
>         }
> 
>    During checkpoints, when I/O is slow, this creates long queues of
>    readers.  In my run with 20 clients, I observed read requests that
>    took more than 20 seconds.


Hmmm, I think that code was written assuming the call would not take
that long!


> 
>    This behavior will also limit throughput and can partly explain
>    why I get low CPU utilization with 20 clients.  All my TPC-B
>    clients are serialized since most will need 1-2 disk accesses
>    (an index leaf page and one page of the account table).
> 
>    Generally, in order to allow the OS to optimize I/O, one should
>    have many outstanding I/O calls at a time.  (See Frederiksen,
>    Bonnet: "Getting Priorities Straight: Improving Linux Support for
>    Database I/O", VLDB 2005.)
> 
>    I have attached a patch where I have introduced several file
>    descriptors (RandomAccessFile objects) per RAFContainer.  These are
>    used for reading.  The principle is that when all readers are busy,
>    a readPage request will create a new reader.  (There is a maximum
>    number of readers.)  With this patch, throughput was improved by
>    50% on Linux.  The combination of this patch and the syncing for
>    every 100th write reduced maximum transaction response times by
>    90%.

Only concern would be the number of open file descriptors, as others have
pointed out.  Might want to scavenge open descriptors from containers
that are no longer heavily used.

>    The patch is not ready for inclusion into Derby, but I would like
>    to hear whether you think this is a viable approach.

It seems like these changes are low risk and enable worthwhile
performance increases without completely changing the I/O system.
Such changes could then provide the performance that a full async
re-write would have to better (or at least match).

Dan.



Re: Derby I/O issues during checkpointing

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Øystein Grøvlen wrote:
> (Any reason this did not go to derby-dev?)
> 
> 
>>>>>>"MM" == Mike Matrigali <mi...@sbcglobal.net> writes:
> 
> 
>     MM> Your change to checkpoint seems low risk and, from your tests,
>     MM> high benefit.  My only worry is those systems with a bad
>     MM> implementation of "sync" whose cost is linearly related to the size
>     MM> of the file or the size of the OS disk cache (basically, I have seen
>     MM> implementations where the OS does not have a data structure to track
>     MM> the dirty pages associated with a file, so it has two choices:
>     MM> 1) search every page in the disk cache, or 2) probe the disk cache
>     MM> for every page in the file - it chooses which approach to use based
>     MM> on file size vs. cache size).  I was willing to pay the cost of one
>     MM> of these calls per big file, but I think I would lean toward just
>     MM> using sync write for checkpoint given the problems you are seeing,
>     MM> but not very strongly.  With reasonable implementations of file
>     MM> sync, I like your approach.
> 
>     MM> If you go with syncing every 100, I wonder if it might make sense to
>     MM> "slow" the checkpoint even more in a busy system.  Since the writes
>     MM> are not really doing I/O, it might make sense to give other threads
>     MM> in the system a chance at an I/O slot more often by throwing in a
>     MM> give-up-my-time-slice call every N writes, with N being a relatively
>     MM> small number like 1-5.
> 
> Maybe I should try to see what happens if I just make the checkpoint
> sleep for a few seconds every N writes instead of doing a sync.  It
> could be that the positive effect is mainly from slowing down the
> checkpoint when the I/O system is overloaded.

Yes, that would be interesting.  If that helps, then I think there are
better things than sleep, but they are not worth coding if sleep doesn't help.

Do you think your system will see the same issues if running in
durability=test mode (i.e., no syncs)?  Someday I would like to produce
a non-sync system which would guarantee consistent db recovery (it just
might lose transactions, but not half of a transaction), so it would
be interesting to understand whether the problem is the sync or just
the blast of unsynced writes.

> 
> ...
> 
>     MM> What is your log rate (bytes/sec to the log)?  I think you are just
>     MM> saying that the default of a checkpoint per 10 meg of log is a bad
>     MM> default for these kinds of apps.
> 
> The first checkpoint occurs after about 5 minutes.  I guess that should
> indicate a log rate of about 33 kbytes/sec (10 MB of log per 300 seconds).
> 
> I do not think changing the checkpoint interval would help much with the
> high response times unless you make it very short so that the number
> of pages per checkpoint is much smaller.
ok, that is not too bad - I was worried that you were generating a
checkpoint every few seconds.  Though in such an application I might
set the checkpoint rate to be more like once per hour.  Again, this is
a separate issue; no matter what the checkpoint rate, the problem should
still be fixed to avoid the hits you are seeing.
> 


Re: Derby I/O issues during checkpointing

Posted by Øystein Grøvlen <Oy...@Sun.COM>.
>>>>> "MM" == Mike Matrigali <mi...@sbcglobal.net> writes:

    MM> Øystein Grøvlen wrote:
    >> Some test runs we have done show very long transaction response times
    >> during checkpointing.  This has been seen on several platforms.  The
    >> load is TPC-B-like transactions, and the write cache is turned off, so
    >> the system is I/O bound.  There seem to be two major issues:
    >> 1. Derby does checkpointing by writing all dirty pages with
    >> RandomAccessFile.write() and then doing a file sync when the entire
    >> cache has been scanned.  When the page cache is large, the file
    >> system buffer will overflow during checkpointing, and occasionally
    >> the writes will take very long.  I have observed single write
    >> operations that took almost 12 seconds.  What is even worse is that
    >> during this period, read performance on other files can also be very
    >> bad.  For example, reading an index page from disk can take close
    >> to 10 seconds while the base table is checkpointed.  Hence,
    >> transactions are severely slowed down.
    >> I have managed to improve response times by flushing every file for
    >> every 100th write.  Is this something we should consider including
    >> in the code?  Do you have better suggestions?

    MM> probably the first thing to do is make sure we are doing a reasonable
    MM> number of checkpoints; most people who run these benchmarks configure
    MM> the system such that it either does 0 or 1 checkpoints during the run.

We do not do this for benchmarking.  We have just chosen a TPC-B load
because it represents a typical update-intensive load where a single
table represents most of the data volume.

    MM> This goes to the ongoing discussion on how best to automatically
    MM> configure the checkpoint interval - the current defaults don't make
    MM> much sense for an OLTP system.

I agree.


    MM> I had hoped that with the current checkpoint design, usually by the
    MM> time the file sync happened, all the pages would have already made
    MM> it to disk.  The hope was that while holding the write semaphore we
    MM> would not do any I/O and thus not cause much interruption to the
    MM> rest of the system.

Since the checkpoint does buffered writes of all dirty pages, its
write rate will be much higher than the write rate of the disk.  There
is no way all the pages can make it to disk before sync is called.
Since the writes are buffered, the write semaphore will not be held very
long for each write.  (I am not quite sure what you mean by the write
semaphore.  Something in the OS, or the synchronization on the file
container?)

Anyhow, I do not feel the problem is that writes or sync take very
long.  The problem is that this impacts read performance in two ways:
    - The OS gives long response times on reads when the checkpoint
      stresses the file system.
    - Reads of a file will have to wait for a write request to
      complete.  (Only one I/O per file at a time.)

The solution seems to be to reduce the I/O utilization of
checkpointing (i.e., reduce the write rate).
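
One possible shape of such a throttled checkpoint writer (a sketch
only; CachedPage, writePage, and the batch/pause numbers are
illustrative, not Derby code):

    private static final int PAGES_PER_BATCH = 20;
    private static final long PAUSE_MILLIS = 50;

    // Write dirty pages in small batches and pause between batches so
    // that concurrent readers regularly get a chance at an I/O slot.
    void writeDirtyPages(java.util.List dirtyPages) throws IOException {
        int inBatch = 0;
        for (java.util.Iterator it = dirtyPages.iterator(); it.hasNext(); ) {
            CachedPage page = (CachedPage) it.next();
            writePage(page);                    // buffered write, as today
            if (++inBatch >= PAGES_PER_BATCH) {
                inBatch = 0;
                try {
                    Thread.sleep(PAUSE_MILLIS); // cap the write rate
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
    }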


    MM> What OS/filesystem are you seeing these results on?  Any idea why a
    MM> write would take 10 seconds?  Do you think the write blocks when the
    MM> sync is called?  If so, do you think the block is at a Derby sync
    MM> point or an OS-internal sync point?

I have seen this both on Linux and Solaris.  My hypothesis is that a
write may take 10 seconds when the file system buffer is full.  I am
not sure why it is this way, but it seems like it helps to sync
regularly.  My guess is that this is because we avoid filling the
buffer.  We will try to investigate further.

I do not think the long write blocks on a Derby sync point.  What I
measured was just the time to call two RandomAccessFile methods (seek
and write).


    MM> We moved away from using the write-then-sync approach for log files
    MM> because we found that on some OS/filesystems the performance of the
    MM> sync was linearly related to the size of the file rather than the
    MM> number of modified pages.  I left it for checkpoint as it seemed an
    MM> easy way to do async writes, which I thought would then provide the
    MM> OS with basically the equivalent of many concurrent writes to do.

    MM> Another approach may be to change checkpoint to use the direct sync
    MM> write, but make it get its own open on the file, similar to what you
    MM> describe below - that would mean other readers/writers would never
    MM> block on checkpoint read/write - at least from the Derby level.
    MM> Whether this would increase or decrease overall checkpoint elapsed
    MM> time is probably system dependent - I am pretty sure it would
    MM> increase time on Windows, but I continue to believe elapsed time of
    MM> checkpoint is not important - as you point out, it is more important
    MM> to make sure it interferes with "real" work as little as possible.

I agree that the elapsed time of checkpoint is not that important, but
scheduling only one page at a time for write will reduce the bandwidth
of the I/O system since it will not be possible to reorder the
operations for optimal performance.  I would rather suggest doing a
bunch of unbuffered writes, then waiting for a while before writing more
pages.  Alternatively, one could use a pool of threads that do direct
I/O in parallel.

    >> 2. What makes things even worse is that only a single thread can read
    >> a page from a file at a time.  (Note that Derby has one file per
    >> table.)  This is because the implementation of RAFContainer.readPage
    >> is as follows:
    >>         synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
    >>             fileData.seek(pageOffset);  // fileData is a RandomAccessFile
    >>             fileData.readFully(pageData, 0, pageSize);
    >>         }
    >> During checkpoints, when I/O is slow, this creates long queues of
    >> readers.  In my run with 20 clients, I observed read requests that
    >> took more than 20 seconds.
    >> This behavior will also limit throughput and can partly explain
    >> why I get low CPU utilization with 20 clients.  All my TPC-B
    >> clients are serialized since most will need 1-2 disk accesses
    >> (an index leaf page and one page of the account table).
    >> Generally, in order to allow the OS to optimize I/O, one should
    >> have many outstanding I/O calls at a time.  (See Frederiksen,
    >> Bonnet: "Getting Priorities Straight: Improving Linux Support for
    >> Database I/O", VLDB 2005.)
    >> I have attached a patch where I have introduced several file
    >> descriptors (RandomAccessFile objects) per RAFContainer.  These are
    >> used for reading.  The principle is that when all readers are busy,
    >> a readPage request will create a new reader.  (There is a maximum
    >> number of readers.)  With this patch, throughput was improved by
    >> 50% on Linux.  The combination of this patch and the syncing for
    >> every 100th write reduced maximum transaction response times by
    >> 90%.
    >> The patch is not ready for inclusion into Derby, but I would like
    >> to hear whether you think this is a viable approach.

    MM> I now see what you were talking about; I was thinking at too high a
    MM> level.  In your test, is the data spread across more than a single
    MM> disk?

No, data is on a single disk.  Log is on a separate disk.

    MM> Especially with data spread across multiple disks it would make sense
    MM> to allow multiple concurrent reads.  That config was just not the
    MM> target of the original Derby code - so especially as we target more
    MM> processors and more disks, changes will need to be made.

    MM> I wonder if java's new async interfaces may be more appropriate;
    MM> maybe we just need to change every read into an async read followed
    MM> by a wait, and the same for write.  I have not used the interfaces;
    MM> does anyone have experience with them, and is there any downside to
    MM> using them vs. the current RandomAccessFile interfaces?

I have looked at Java NIO and could not find anything about
asynchronous I/O for random access files.  There is a FileChannel
class, but it does not seem to offer asynchronous operations.  Have I
missed something?
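
(FileChannel does have a positional read, read(ByteBuffer, long), which
does not use a shared file pointer.  It is blocking rather than
asynchronous, but it would remove the need to serialize the
seek()+readFully() pair on the container.  A sketch, assuming the
underlying java.io.RandomAccessFile is reachable so getChannel() can be
called on it; needs java.nio.ByteBuffer, java.nio.channels.FileChannel,
and java.io.EOFException:)

     void readPage(FileChannel channel, long pageOffset,
                   byte[] pageData, int pageSize) throws IOException {
         ByteBuffer buf = ByteBuffer.wrap(pageData, 0, pageSize);
         long pos = pageOffset;
         while (buf.hasRemaining()) {
             // read(buf, pos) reads at an absolute offset; no seek, so
             // concurrent readers need no synchronization on the file.
             int n = channel.read(buf, pos);
             if (n < 0)
                 throw new EOFException("unexpected end of container file");
             pos += n;
         }
     }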

On the other hand, there is a JSR 203 for this.  This will
unfortunately not make it into Mustang (Java 6).

    MM> Your approach may be fine; one consideration may be the number of file
    MM> descriptors necessary to run the system.  On some very small platforms
    MM> the only way to run the original Cloudscape was to change the size
    MM> of the container cache to limit the number of file descriptors.

Maybe we could have a property to limit the number of file
descriptors.



-- 
Øystein


Re: Derby I/O issues during checkpointing

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Øystein Grøvlen wrote:
> Some test runs we have done show very long transaction response times
> during checkpointing.  This has been seen on several platforms.  The
> load is TPC-B-like transactions, and the write cache is turned off, so
> the system is I/O bound.  There seem to be two major issues:
> 
> 1. Derby does checkpointing by writing all dirty pages with
>    RandomAccessFile.write() and then doing a file sync when the entire
>    cache has been scanned.  When the page cache is large, the file
>    system buffer will overflow during checkpointing, and occasionally
>    the writes will take very long.  I have observed single write
>    operations that took almost 12 seconds.  What is even worse is that
>    during this period, read performance on other files can also be very
>    bad.  For example, reading an index page from disk can take close
>    to 10 seconds while the base table is checkpointed.  Hence,
>    transactions are severely slowed down.
> 
>    I have managed to improve response times by flushing every file for
>    every 100th write.  Is this something we should consider including
>    in the code?  Do you have better suggestions?

probably the first thing to do is make sure we are doing a reasonable
number of checkpoints; most people who run these benchmarks configure
the system such that it either does 0 or 1 checkpoints during the run.
This goes to the ongoing discussion on how best to automatically
configure the checkpoint interval - the current defaults don't make much
sense for an OLTP system.

I had hoped that with the current checkpoint design, usually by the
time the file sync happened, all the pages would have already made
it to disk.  The hope was that while holding the write semaphore we
would not do any I/O and thus not cause much interruption to the rest of
the system.

What OS/filesystem are you seeing these results on?  Any idea why a write
would take 10 seconds?  Do you think the write blocks when the sync is
called?  If so, do you think the block is at a Derby sync point or an
OS-internal sync point?

We moved away from using the write-then-sync approach for log files
because we found that on some OS/filesystems the performance of the sync
was linearly related to the size of the file rather than the number
of modified pages.  I left it for checkpoint as it seemed an easy
way to do async writes, which I thought would then provide the OS with
basically the equivalent of many concurrent writes to do.

Another approach may be to change checkpoint to use the direct sync
write, but make it get its own open on the file, similar to what you
describe below - that would mean other readers/writers would never block
on checkpoint read/write - at least from the Derby level.  Whether
this would increase or decrease overall checkpoint elapsed time is
probably system dependent - I am pretty sure it would increase time
on Windows, but I continue to believe elapsed time of checkpoint is
not important - as you point out, it is more important to make sure
it interferes with "real" work as little as possible.
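
Roughly this shape (a sketch only; the names and the page-array
plumbing are illustrative): the checkpoint keeps a private descriptor
opened in one of the synchronous modes, so each write is on disk when
it returns and no reader ever contends for that descriptor.

    // "rws"/"rwd" (java.io.RandomAccessFile, Java 1.4+) make every
    // write() force the data to the device, so no separate sync pass
    // is needed, and the descriptor is private to the checkpoint.
    void checkpointContainer(File containerFile, long[] pageOffsets,
                             byte[][] pages, int pageSize)
        throws IOException
    {
        RandomAccessFile ckptFd = new RandomAccessFile(containerFile, "rwd");
        try {
            for (int i = 0; i < pageOffsets.length; i++) {
                ckptFd.seek(pageOffsets[i]);
                ckptFd.write(pages[i], 0, pageSize);
            }
        } finally {
            ckptFd.close();
        }
    }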
> 
> 2. What makes things even worse is that only a single thread can read a
>    page from a file at a time.  (Note that Derby has one file per
>    table.)  This is because the implementation of RAFContainer.readPage
>    is as follows:
> 
>         synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
>             fileData.seek(pageOffset);  // fileData is a RandomAccessFile
>             fileData.readFully(pageData, 0, pageSize);
>         }
> 
>    During checkpoints, when I/O is slow, this creates long queues of
>    readers.  In my run with 20 clients, I observed read requests that
>    took more than 20 seconds.
> 
>    This behavior will also limit throughput and can partly explain
>    why I get low CPU utilization with 20 clients.  All my TPC-B
>    clients are serialized since most will need 1-2 disk accesses
>    (an index leaf page and one page of the account table).
> 
>    Generally, in order to allow the OS to optimize I/O, one should
>    have many outstanding I/O calls at a time.  (See Frederiksen,
>    Bonnet: "Getting Priorities Straight: Improving Linux Support for
>    Database I/O", VLDB 2005.)
> 
>    I have attached a patch where I have introduced several file
>    descriptors (RandomAccessFile objects) per RAFContainer.  These are
>    used for reading.  The principle is that when all readers are busy,
>    a readPage request will create a new reader.  (There is a maximum
>    number of readers.)  With this patch, throughput was improved by
>    50% on Linux.  The combination of this patch and the syncing for
>    every 100th write reduced maximum transaction response times by
>    90%.
> 
>    The patch is not ready for inclusion into Derby, but I would like
>    to hear whether you think this is a viable approach.
> 
I now see what you were talking about; I was thinking at too high a
level.  In your test, is the data spread across more than a single disk?
Especially with data spread across multiple disks it would make sense
to allow multiple concurrent reads.  That config was just not the target
of the original Derby code - so especially as we target more processors
and more disks, changes will need to be made.

I wonder if java's new async interfaces may be more appropriate; maybe
we just need to change every read into an async read followed by a wait,
and the same for write.  I have not used the interfaces; does anyone
have experience with them, and is there any downside to using them vs.
the current RandomAccessFile interfaces?

Your approach may be fine; one consideration may be the number of file
descriptors necessary to run the system.  On some very small platforms
the only way to run the original Cloudscape was to change the size
of the container cache to limit the number of file descriptors.



Re: Derby I/O issues during checkpointing

Posted by Francois Orsini <fr...@gmail.com>.
In order for a thread to generate many outstanding I/O calls at a time,
it should *not* block on an I/O in the first place if it does not have
to - this is what you observed.  Typically, we would want to be able to
issue asynchronous I/Os so that a given thread at the low level does not
block but rather is allowed to check for I/O completion at a later time
as appropriate, while producing additional I/O requests (i.e.,
read-ahead).  Asynchronous I/O in Java is not something you get out of
the box; people have implemented it via I/O worker threads (simulated
async I/Os) and/or via JNI (calling into the OS's proprietary
asynchronous I/O driver on Unix file systems and Windows (NT)).
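
A minimal sketch of the worker-thread flavor (illustrative names; uses
java.util.concurrent from Java 5): the caller submits a read and keeps
issuing further requests, claiming the Future only when the page is
actually needed.

    ExecutorService ioWorkers = Executors.newFixedThreadPool(8);

    // Simulated asynchronous read: the worker does the blocking
    // seek+readFully; the caller does not block until Future.get().
    Future submitRead(final RandomAccessFile file, final long offset,
                      final byte[] page) {
        return ioWorkers.submit(new Callable() {
            public Object call() throws IOException {
                synchronized (file) {  // one request per descriptor at a time
                    file.seek(offset);
                    file.readFully(page);
                }
                return page;
            }
        });
    }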

I think the approach you have taken is good in terms of principles and
prototyping, but I would think we need to implement something more
sophisticated, with an implementation of worker threads simulating
asynchronous I/Os (whether we end up using Java asynchronous I/O in NIO
or not).  I think we could even see additional performance gains.

Just my 0.02 cents...

--francois

Re: Derby I/O issues during checkpointing

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Sync happens as part of calling 
BaseDataFileFactory.java!checkpoint()!containerCache.cleanAll();

See comments for checkpoint() routine in that file.

Raymond Raymond wrote:
> Dear Oystein,
> 
> In your mail, "Derby I/O issues during checkpointing", you wrote:
> OO:  Some test runs we have done show very long transaction response times
> OO:  during checkpointing.  This has been seen on several platforms.  The
> OO:  load is TPC-B-like transactions, and the write cache is turned off, so
> OO:  the system is I/O bound.  There seem to be two major issues:
> OO:
> OO:  1. Derby does checkpointing by writing all dirty pages with
> OO:      RandomAccessFile.write() and then doing a file sync when the entire
> OO:      cache has been scanned.  When the page cache is large, the file
> OO:      system buffer will overflow during checkpointing, and occasionally
> OO:      the writes will take very long.  I have observed single write
> OO:      operations that took almost 12 seconds.  What is even worse is that
> OO:      during this period, read performance on other files can also be very
> OO:      bad.  For example, reading an index page from disk can take close
> OO:      to 10 seconds while the base table is checkpointed.  Hence,
> OO:      transactions are severely slowed down.
> OO:
> OO:      I have managed to improve response times by flushing every file for
> OO:      every 100th write.  Is this something we should consider including
> OO:      in the code?  Do you have better suggestions?
> 
> Has this been implemented in the latest derby version?  I am trying to
> spread out the disk I/O of a checkpoint over the checkpoint interval.
> If the RandomAccessFile still does a sync only after the entire cache
> has been scanned, my approach will not make much difference from the
> current implementation.  If it has not been implemented in the latest
> derby version, would you please attach your solution?  I would like to
> use it in my working copy.
> 
> I am also curious to know: in the current implementation, when will the
> sync of a RandomAccessFile be executed?
> 
> Thanks.
> 
> 
> 
> Raymond
> 


Re: Derby I/O issues during checkpointing

Posted by Øystein Grøvlen <Oy...@Sun.COM>.
Raymond Raymond wrote:
> Dear Oystein,
> 
> In your mail, "Derby I/O issues during checkpointing", you wrote:
> OO:  Some test runs we have done show very long transaction response times
> OO:  during checkpointing.  This has been seen on several platforms.  The
> OO:  load is TPC-B-like transactions, and the write cache is turned off, so
> OO:  the system is I/O bound.  There seem to be two major issues:
> OO:
> OO:  1. Derby does checkpointing by writing all dirty pages with
> OO:      RandomAccessFile.write() and then doing a file sync when the entire
> OO:      cache has been scanned.  When the page cache is large, the file
> OO:      system buffer will overflow during checkpointing, and occasionally
> OO:      the writes will take very long.  I have observed single write
> OO:      operations that took almost 12 seconds.  What is even worse is that
> OO:      during this period, read performance on other files can also be very
> OO:      bad.  For example, reading an index page from disk can take close
> OO:      to 10 seconds while the base table is checkpointed.  Hence,
> OO:      transactions are severely slowed down.
> OO:
> OO:      I have managed to improve response times by flushing every file for
> OO:      every 100th write.  Is this something we should consider including
> OO:      in the code?  Do you have better suggestions?
> 
> Has this been implemented in the latest derby version?

No, not yet.  I wanted to look at some other options first.

> I am trying to spread out the disk I/O of a checkpoint over the
> checkpoint interval.  If the RandomAccessFile still does a sync only
> after the entire cache has been scanned, my approach will not make much
> difference from the current implementation.

I do not understand why syncing only once should affect your approach 
more than syncing many times.

> If it has not been implemented in the latest derby version, would you
> please attach your solution?  I would like to use it in my working copy.

I do not have access to it now.  I can send it tomorrow.  Basically, I
incremented a counter on every write in RAFContainer.  When the counter
reached 100, I did a sync and reset the counter.

> 
> I am also curious to know: in the current implementation, when will the
> sync of a RandomAccessFile be executed?

I think Mike already answered this.

--
Øystein

RE: Derby I/O issues during checkpointing

Posted by Raymond Raymond <ra...@hotmail.com>.
Dear Oystein,

In your mail, "Derby I/O issues during checkpointing", you wrote:
OO:  Some test runs we have done show very long transaction response times
OO:  during checkpointing.  This has been seen on several platforms.  The
OO:  load is TPC-B-like transactions, and the write cache is turned off, so
OO:  the system is I/O bound.  There seem to be two major issues:
OO:
OO:  1. Derby does checkpointing by writing all dirty pages with
OO:      RandomAccessFile.write() and then doing a file sync when the entire
OO:      cache has been scanned.  When the page cache is large, the file
OO:      system buffer will overflow during checkpointing, and occasionally
OO:      the writes will take very long.  I have observed single write
OO:      operations that took almost 12 seconds.  What is even worse is that
OO:      during this period, read performance on other files can also be very
OO:      bad.  For example, reading an index page from disk can take close
OO:      to 10 seconds while the base table is checkpointed.  Hence,
OO:      transactions are severely slowed down.
OO:
OO:      I have managed to improve response times by flushing every file for
OO:      every 100th write.  Is this something we should consider including
OO:      in the code?  Do you have better suggestions?

Has this been implemented in the latest derby version?  I am trying to
spread out the disk I/O of a checkpoint over the checkpoint interval.
If the RandomAccessFile still does a sync only after the entire cache
has been scanned, my approach will not make much difference from the
current implementation.  If it has not been implemented in the latest
derby version, would you please attach your solution?  I would like to
use it in my working copy.

I am also curious to know: in the current implementation, when will the
sync of a RandomAccessFile be executed?

Thanks.



Raymond



Re: Derby I/O issues during checkpointing

Posted by Oystein Grovlen - Sun Norway <Oy...@Sun.COM>.
Raymond Raymond wrote:

>> I ran a variant of the TPC-B benchmark with a large database (20 GB)
>> and a large page cache (1 GB).  I do not think you need TPC-B
>> transactions to see this, but I think you need update-intensive
>> transactions that frequently need to load pages from disk.  For
>> example, single record updates where an index is used to locate the
>> record.  You will also need several connections in parallel (e.g., 20).
>>
>> -- 
>> Øystein
> 
> Thanks for your answer.  Did you use some tools to measure things like
> "single write operations that took almost 12 seconds" and "read requests
> that took more than 20 seconds"?  I downloaded a TPC-B benchmark program,
> but it didn't give me that kind of information.  I want to know how you
> measured the time for reads and writes. ^_^
> 

No special tools.  Our TPC-B client writes the number of committed
transactions for each 10-second interval.  That way, I observed regular
drops in throughput.  I/O response times were measured by instrumenting
RAFContainer to measure the time it takes to do seek and read/write.
This was written to derby.log and inspected manually.  (See attached
diff from RAFContainer.readPage for an example.)
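
For reference, the instrumentation was of this general shape (a sketch,
not the actual diff; the one-second threshold is illustrative):

     long start = System.currentTimeMillis();
     synchronized (this) {
         fileData.seek(pageOffset);
         fileData.readFully(pageData, 0, pageSize);
     }
     long elapsed = System.currentTimeMillis() - start;
     if (elapsed > 1000) {   // report reads slower than one second
         SanityManager.DEBUG_PRINT("RAFContainer",
             "slow read: " + elapsed + " ms for page at offset " + pageOffset);
     }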



-- 
Øystein Grøvlen, Senior Staff Engineer
Sun Microsystems, Database Technology Group
Trondheim, Norway

Re: Derby I/O issues during checkpointing

Posted by Raymond Raymond <ra...@hotmail.com>.
>From: Øystein Grøvlen <Oy...@Sun.COM>
>
>Raymond Raymond wrote:
>>Dear Oystein,
>>  In one of your former mails, you said:
>>
>>>From: Oystein.Grovlen@Sun.COM (Øystein Grøvlen)
>
>>>    During checkpoints, when I/O is slow, this creates long queues of
>>>    readers.  In my run with 20 clients, I observed read requests that
>>>    took more than 20 seconds ......
>>
>>
>>I am working on automatic checkpointing and incremental checkpointing
>>as we discussed before.  I have finished part of it and am trying to
>>run some tests on it.  I am really interested in how you observed:
>>
>>>    I have observed single write
>>>    operations that took almost 12 seconds ......
>>
>>and
>>
>>>    readers.  In my run with 20 clients, I observed read requests that
>>>    took more than 20 seconds ......
>>
>
>I ran a variant of the TPC-B benchmark with a large database (20 GB) and a
>large page cache (1 GB).  I do not think you need TPC-B transactions to see
>this, but I think you need update-intensive transactions that frequently
>need to load pages from disk.  For example, single record updates where an
>index is used to locate the record.  You will also need several connections
>in parallel (e.g., 20).
>
>--
>Øystein

Thanks for your answer.  Did you use some tools to measure things like
"single write operations that took almost 12 seconds" and "read requests
that took more than 20 seconds"?  I downloaded a TPC-B benchmark program,
but it didn't give me that kind of information.  I want to know how you
measured the time for reads and writes. ^_^

Thanks.

Raymond



Re: Derby I/O issues during checkpointing

Posted by Øystein Grøvlen <Oy...@Sun.COM>.
Raymond Raymond wrote:
> Dear Oystein,
>  In one of your former mails, you said:
> 
>> From: Oystein.Grovlen@Sun.COM (Øystein Grøvlen)

>>    During checkpoints, when I/O is slow, this creates long queues of
>>    readers.  In my run with 20 clients, I observed read requests that
>>    took more than 20 seconds ......
> 
> 
> I am working on automatic checkpointing and incremental checkpointing
> as we discussed before.  I have finished part of it and am trying to
> run some tests on it.  I am really interested in how you observed:
> 
>>    I have observed single write
>>    operations that took almost 12 seconds ......
> 
> and
> 
>>    readers.  In my run with 20 clients, I observed read requests that
>>    took more than 20 seconds ......
> 

I ran a variant of the TPC-B benchmark with a large database (20 GB) and
a large page cache (1 GB).  I do not think you need TPC-B transactions
to see this, but I think you need update-intensive transactions that
frequently need to load pages from disk.  For example, single record
updates where an index is used to locate the record.  You will also need
several connections in parallel (e.g., 20).

--
Øystein

RE: Derby I/O issues during checkpointing

Posted by Raymond Raymond <ra...@hotmail.com>.
Dear Oystein,
  In one of your former mails, you said:

>From: Oystein.Grovlen@Sun.COM (Øystein Grøvlen)

>Some test runs we have done show very long transaction response times
>during checkpointing.  This has been seen on several platforms.  The
>load is TPC-B-like transactions, and the write cache is turned off, so
>the system is I/O bound.  There seem to be two major issues:
>
>1. Derby does checkpointing by writing all dirty pages with
>    RandomAccessFile.write() and then doing a file sync when the entire
>    cache has been scanned.  When the page cache is large, the file
>    system buffer will overflow during checkpointing, and occasionally
>    the writes will take very long.  I have observed single write
>    operations that took almost 12 seconds ......
>
>2. What makes things even worse is that only a single thread can read a
>    page from a file at a time.  (Note that Derby has one file per
>    table.)  This is because the implementation of RAFContainer.readPage
>    is as follows:
>
>         synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
>             fileData.seek(pageOffset);  // fileData is a RandomAccessFile
>             fileData.readFully(pageData, 0, pageSize);
>         }
>
>    During checkpoints, when I/O is slow, this creates long queues of
>    readers.  In my run with 20 clients, I observed read requests that
>    took more than 20 seconds ......

I am working on automatic checkpointing and incremental checkpointing
as we discussed before.  I have finished part of it and am trying to
run some tests on it.  I am really interested in how you observed:
>    I have observed single write
>    operations that took almost 12 seconds ......
and
>    readers.  In my run with 20 clients, I observed read requests that
>    took more than 20 seconds ......

Thanks.


Raymond
