Posted to dev@subversion.apache.org by mark benedetto king <mb...@boredom.org> on 2003/02/19 20:45:51 UTC

Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

On Wed, Feb 19, 2003 at 02:36:07PM -0500, Brandon Ehle wrote:
> 
> Index: subversion/libsvn_fs/fs.c
> ===================================================================
> --- subversion/libsvn_fs/fs.c   (revision 4721)
> +++ subversion/libsvn_fs/fs.c   (working copy)
> @@ -163,7 +163,7 @@
> 
>   /* Checkpoint any changes.  */
>   {
> -    int db_err = env->txn_checkpoint (env, 0, 0, 0);
> +    int db_err = env->txn_checkpoint (env, 8000, 60, 0);
> 
> #if SVN_BDB_HAS_DB_INCOMPLETE
>     while (db_err == DB_INCOMPLETE)
> 
> 

I'm in favor of committing this change.  I even volunteer to test it.

Without it, my ra_svn tests frequently hang.

--ben
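
[Editor's sketch, not part of the original message.]  The patch changes
only the second and third arguments of txn_checkpoint, which Berkeley DB
documents as a kilobytes-of-log threshold and a minutes threshold.  The
model below is a hypothetical stand-in for that decision, assuming that
with both thresholds set either one being reached triggers a checkpoint;
would_checkpoint and its parameters are illustrative names, and the
exact boundary behaviour (>= versus >) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not Berkeley DB code: would a call such as
   env->txn_checkpoint (env, kbyte, min, 0) actually write a checkpoint,
   given how much log was written and how long ago the last one was?  */
bool
would_checkpoint (unsigned kbyte, unsigned min,
                  unsigned kb_logged, unsigned min_elapsed)
{
  /* Both thresholds zero (the old code): checkpoint on every call.  */
  if (kbyte == 0 && min == 0)
    return true;

  /* Otherwise either threshold being reached is assumed sufficient.  */
  if (kbyte != 0 && kb_logged >= kbyte)
    return true;
  if (min != 0 && min_elapsed >= min)
    return true;

  return false;
}

/* Usage: compare the old arguments (0, 0) with the patched (8000, 60).  */
void
checkpoint_examples (void)
{
  assert (would_checkpoint (0, 0, 12, 0));        /* old: always */
  assert (!would_checkpoint (8000, 60, 12, 0));   /* new: small commit skipped */
  assert (would_checkpoint (8000, 60, 9000, 1));  /* new: enough log written */
  assert (would_checkpoint (8000, 60, 12, 61));   /* new: enough time passed */
}
```

Under this reading the patch turns a checkpoint per trail commit into one
roughly every 8000 KB of log or every 60 minutes, whichever comes first.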

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by mark benedetto king <mb...@boredom.org>.

On Wed, Feb 19, 2003 at 03:54:05PM -0500, Garrett Rooney wrote:
> >>  {
> >>-    int db_err = env->txn_checkpoint (env, 0, 0, 0);
> >>+    int db_err = env->txn_checkpoint (env, 8000, 60, 0);
> >
> 
> isn't that just masking whatever the real bug is?  i mean checkpointing 
> more often shouldn't be causing a problem, and if it is, we need to 
> figure out why, not ignore it and hope it goes away.
> 

That's true, but in the meantime, I'd like svncheck to run to completion,
which it doesn't for me, without this patch.

It's possible that all-zeroes tickles a BDB bug; that with the new
values there is no bug.

I'll investigate this a little further tonight.

--ben



Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by "Glenn A. Thompson" <gt...@cdr.net>.

Hey,

Forgive me if I'm out of whack here.  I'm still getting caught up.
Boy have you guys been busy.

Buuuutttt

Karl Fogel wrote:

>Branko Čibej <br...@xbc.nu> writes:
>  
>
>>I think *the* major task for 0.19 is:
>>
>>    * Create a DB monitor that can detect crashed sessions and
>>      automagically unwedge the DB.
>>
If this were to become formalized, say in the fs API, I fear it presumes
too much about the DB backend.  I hope that it can/would be hidden down
in the DB-specific functions.

>>    * Stop creating transactions for read-only requests, and use
>>      ordinary locks instead.
>>
Are you talking about BDB transactions or Subversion transactions?  I'm
assuming BDB.

>>    * Reduce the number of txn_checkpoint calls in our code, or even
>>      eliminate them completely.
>>    
>>
There are only two that I recall: one in a cleanup function and one in
the trail commit function.
I have always believed the trail call to be excessive, but I don't
fully understand BDB recovery, so I have never mentioned it.
Like someone else said, you still have logs to recover from.  Right?
I don't see any SQL impl doing such a thing.  These types of things are
handled via DB settings on all the SQL DBs I've worked with.

>Could you expand a little on point number 2?  
>
Yes please.

Thanks,
gat

Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by Branko Čibej <br...@xbc.nu>.

Karl Fogel wrote:

>Branko Čibej <br...@xbc.nu> writes:
>  
>
>>I think *the* major task for 0.19 is:
>>
>>    * Create a DB monitor that can detect crashed sessions and
>>      automagically unwedge the DB.
>>    * Stop creating transactions for read-only requests, and use
>>      ordinary locks instead.
>>    * Reduce the number of txn_checkpoint calls in our code, or even
>>      eliminate them completely.
>>    
>>
>
>All of these sound like good ideas (though I have some questions about
>the second one), but aren't they independent?
>
Oh, of course they're independent.

>  We can reduce the
>frequency of txn_checkpoint calls without reducing the frequency with
>which we create transactions in the first place, and vice versa.
>
>Oh, I think I see: We can't switch to a locking system without a DB
>monitor to detect a deadlocked database and break the cycles?  (Or am
>I just missing the point?)
>
No, we don't need a monitor for that. Failing to unlock an object is no
worse (or better) than crashing or ^C-ing while the client holds an
uncommitted DB transaction.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Branko Čibej <br...@xbc.nu> writes:
> I think *the* major task for 0.19 is:
> 
>     * Create a DB monitor that can detect crashed sessions and
>       automagically unwedge the DB.
>     * Stop creating transactions for read-only requests, and use
>       ordinary locks instead.
>     * Reduce the number of txn_checkpoint calls in our code, or even
>       eliminate them completely.

All of these sound like good ideas (though I have some questions about
the second one), but aren't they independent?  We can reduce the
frequency of txn_checkpoint calls without reducing the frequency with
which we create transactions in the first place, and vice versa.

Oh, I think I see: We can't switch to a locking system without a DB
monitor to detect a deadlocked database and break the cycles?  (Or am
I just missing the point?)

In any case, the only 0.19 issue affected by these proposals would be
#995, "Large imports and checkouts over DAV can timeout".  In any
case, 0.19 will not be the last milestone concentrating on scalability
issues, you can be sure :-).

> Before anyone starts wondering if I'm off my rocker, 

I'm on the same rocker you are.  However, a few questions:

Could you expand a little on point number 2?  I'm not sure exactly how
you're proposing to use locks, and how they're supposed to replace
some of the functionality we get from transactions.  For example, in
Subversion, read-only requests are usually reading from committed
revisions.  So let's say we don't create a BDB transaction.  How would
locking work?

   'revisions':
      Well, only the revprops might be changing.  I guess one wants a
      consistent picture of those.  So we'd lock just the revision
      record we're reading from, for the duration of the read.  During
      that time, someone changing a revprop on that revision would be
      blocked, but that wouldn't be very long, so it's okay.

   'nodes', 'representations', 'strings':
      What do we lock here?  Would the locking interfere with
      deltification?

   'changes':
      No need to lock this for read-only operations, right?

I'm sort of thinking out loud here, but I get the feeling you have a
much more specific plan in mind...


fcntl locks (was: Checkpoint less frequently)

Posted by Greg Stein <gs...@lyra.org>.

On Fri, Feb 21, 2003 at 01:41:36AM -0500, Greg Hudson wrote:
> On Fri, 2003-02-21 at 00:18, Branko Cibej wrote:
> > Justin, we'll need a watcher anyway -- it's the only means we have to
> > automatically unwedge a repository if a client crashes. D'you really
> think we can release 1.0 without fixing this totally unacceptable bug?
> 
> ("If a client crashes?"  If we're using ra_svn or ra_dav, the server
> should have a chance to clean up.  As I understand it, the issue arises
> when a server process terminates uncleanly--such as when you interrupt
> an svn command using ra_local, since in that case the "client" and
> "server" are in the same process.)

Yah, ra_local or the server process. "Client of the FS" maybe :-)

> On Unix, anyway, it seems like a fcntl-locked guard around the database
> would do the trick without a separate process.  Get a read lock for
> normal operation, or a write lock to recover.  fcntl locks are
> automatically terminated on process exit, so there is no issue of stale
> locks.

Heh. Funny that you should mention that. That is exactly what
REPOS/lock/db.lock is for. Problem is, that we don't seem to be using it
properly.

Second, an application gets a read lock, but blocks inside of BDB. Thus, the
recovery process can't get in there to do the recover. The administrator has
to go and kill that blocked client.


Really... it seems that we should solve the fcntl thing, or just rip it out
of the SVN codebase.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Checkpoint less frequently

Posted by Michael Price <mp...@atl.lmco.com>.

Branko Čibej writes:
 > Greg Hudson wrote:
 > >On Fri, 2003-02-21 at 01:50, Branko Čibej wrote:
 > >>The BDB docs recommend having a server or monitor process that runs
 > >>recovery when necessary.
 > >
 > >I tried hunting down this reference (I've seen it before) and failed. 
 > >If you could find it, I'd appreciate it.
 > 
 > I can't seem to find it right now, either.

http://www.sleepycat.com/docs/ref/transapp/app.html

Found using 'find . -type f -print | xargs grep monitor' in my local
copy.

Michael Price               Member of the Engineering Staff
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
A&E 3W; 1 Federal Street; Camden, NJ 08102
856-338-4021, fax 856-338-4144  email: mprice@atl.lmco.com


Re: Checkpoint less frequently

Posted by Branko Čibej <br...@xbc.nu>.

Greg Hudson wrote:

>On Fri, 2003-02-21 at 01:50, Branko Čibej wrote:
>  
>
>>The BDB docs recommend having a server or monitor process that runs
>>recovery when necessary.
>>    
>>
>
>I tried hunting down this reference (I've seen it before) and failed. 
>If you could find it, I'd appreciate it.
>

I can't seem to find it right now, either.

>Honestly, I'm with Justin here.  If it were just me making the
>decisions, I'd say that the point at which we need a monitor process is
>the point at which we should give up on using Berkeley DB, however
>painful that might be at this stage of the game.  (Perhaps more
>realistically, we could try to produce a change to Berkeley DB which
>would make it actually work, and convince Sleepycat to adopt it.)  I'm
>tired of passing our design errors on to the user.
>  
>
I didn't mean that the user would have to start the monitor. The server
(any server) can do that itself, unless the monitor is already started.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Checkpoint less frequently

Posted by "Glenn A. Thompson" <gt...@cdr.net>.

Hey,

>Honestly, I'm with Justin here.  If it were just me making the
>decisions, I'd say that the point at which we need a monitor process is
>the point at which we should give up on using Berkeley DB, however
>painful that might be at this stage of the game.  (Perhaps more
>realistically, we could try to produce a change to Berkeley DB which
>would make it actually work, and convince Sleepycat to adopt it.)  I'm
>tired of passing our design errors on to the user.
>  
>
I'm for looking at BDB.  However, in their defense, it's an embedded DB.
They fully expect the linker to deal with these sorts of things.  The
data layer should be in its own process space.  Gat awaits the arrows :-)

gat



Re: Checkpoint less frequently

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2003-02-21 at 01:50, Branko Čibej wrote:
> The BDB docs recommend having a server or monitor process that runs
> recovery when necessary.

I tried hunting down this reference (I've seen it before) and failed. 
If you could find it, I'd appreciate it.

Honestly, I'm with Justin here.  If it were just me making the
decisions, I'd say that the point at which we need a monitor process is
the point at which we should give up on using Berkeley DB, however
painful that might be at this stage of the game.  (Perhaps more
realistically, we could try to produce a change to Berkeley DB which
would make it actually work, and convince Sleepycat to adopt it.)  I'm
tired of passing our design errors on to the user.


Re: Checkpoint less frequently

Posted by Branko Čibej <br...@xbc.nu>.

Greg Hudson wrote:

>On Fri, 2003-02-21 at 00:18, Branko Čibej wrote:
>  
>
>>Justin, we'll need a watcher anyway -- it's the only means we have to
>>automatically unwedge a repository if a client crashes. D'you really
>>think we can release 1.0 without fixing this totally unacceptable bug?
>>    
>>
>
>("If a client crashes?"  If we're using ra_svn or ra_dav, the server
>should have a chance to clean up.  As I understand it, the issue arises
>when a server process terminates uncleanly--such as when you interrupt
>an svn command using ra_local, since in that case the "client" and
>"server" are in the same process.)
>
Ah, right -- I meant server, of course.

>On Unix, anyway, it seems like a fcntl-locked guard around the database
>would do the trick without a separate process.  Get a read lock for
>normal operation, or a write lock to recover.  fcntl locks are
>automatically terminated on process exit, so there is no issue of stale
>locks.
>
That doesn't work, unfortunately, because you don't know that you have
to db_recover after an aborted session until you're already blocked on a
stale lock.

>(It seems like Berkeley DB should take care of this under the covers,
>really.)
>  
>
Yes, it should, but unfortunately it doesn't. The BDB docs recommend
having a server or monitor process that runs recovery when necessary.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Checkpoint less frequently

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2003-02-21 at 00:18, Branko Čibej wrote:
> Justin, we'll need a watcher anyway -- it's the only means we have to
> automatically unwedge a repository if a client crashes. D'you really
> think we can release 1.0 without fixing this totally unacceptable bug?

("If a client crashes?"  If we're using ra_svn or ra_dav, the server
should have a chance to clean up.  As I understand it, the issue arises
when a server process terminates uncleanly--such as when you interrupt
an svn command using ra_local, since in that case the "client" and
"server" are in the same process.)

On Unix, anyway, it seems like a fcntl-locked guard around the database
would do the trick without a separate process.  Get a read lock for
normal operation, or a write lock to recover.  fcntl locks are
automatically terminated on process exit, so there is no issue of stale
locks.

(It seems like Berkeley DB should take care of this under the covers,
really.)
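
[Editor's sketch, not part of the original message.]  The guard-file idea
can be tried out with plain POSIX fcntl record locks.  The code below is
a minimal sketch under stated assumptions: guard_lock and guard_demo are
hypothetical names, the guard is a throwaway temporary file rather than
REPOS/lock/db.lock, and nothing here touches Berkeley DB.  It checks the
two properties the proposal relies on: a writer is refused while a reader
holds the guard, and a lock held by a process that exits does not linger.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a whole-file advisory lock on the guard file, or drop it when
   TYPE is F_UNLCK.  Use F_RDLCK for normal access and F_WRLCK for
   recovery.  If WAIT is nonzero, block until the lock is granted.
   Returns 0 on success, -1 on failure with errno set by fcntl.  */
int
guard_lock (int fd, int type, int wait)
{
  struct flock fl;

  assert (fd >= 0);
  memset (&fl, 0, sizeof fl);
  fl.l_type = (short) type;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;                 /* zero length means "to end of file" */
  return fcntl (fd, wait ? F_SETLKW : F_SETLK, &fl);
}

/* Returns 1 if both properties hold, 0 if not, -1 on setup failure.  */
int
guard_demo (void)
{
  char path[] = "/tmp/guard-XXXXXX";
  int fd = mkstemp (path);
  int status = 0, upgraded;
  pid_t pid;

  if (fd < 0)
    return -1;
  unlink (path);
  if (guard_lock (fd, F_RDLCK, 0) != 0)
    {
      close (fd);
      return -1;
    }

  pid = fork ();
  if (pid == 0)
    {
      /* Child: recovery (write lock) must be refused while the parent
         reads; a second read lock is fine.  Exit without unlocking.  */
      int refused = guard_lock (fd, F_WRLCK, 0) == -1;
      int shared = guard_lock (fd, F_RDLCK, 0) == 0;
      _exit (refused && shared ? 0 : 1);
    }
  if (pid < 0 || waitpid (pid, &status, 0) < 0)
    {
      close (fd);
      return -1;
    }

  /* The child is gone; if its read lock were stale this would fail.  */
  upgraded = guard_lock (fd, F_WRLCK, 0) == 0;
  guard_lock (fd, F_UNLCK, 0);
  close (fd);
  return WIFEXITED (status) && WEXITSTATUS (status) == 0 && upgraded;
}
```

Calling guard_demo () should return 1 on a local POSIX filesystem.  It
does not address the objection raised elsewhere in this thread, that a
process can hold the read lock while blocked inside BDB.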


auto recovery (was: Checkpoint less frequently)

Posted by Greg Stein <gs...@lyra.org>.

On Fri, Feb 21, 2003 at 01:20:55PM -0500, Greg Hudson wrote:
> On Fri, 2003-02-21 at 13:03, Branko Cibej wrote:
> [When the monitor fails to keep a process from hitting a stale lock:]
> > So we wait for a bit, then kill it.
> 
> If the monitor process is started automatically, then it may have been
> started by a different user than the one whose process hung.  So we
> can't kill it.

Not to mention all the other crap that can happen by arbitrarily whacking
processes. In the DAV case, this would be shooting down an Apache process,
and that could imply that you leave a bunch of shared memory stuffs sitting
around. Yes, Apache does try to clean up in cases like that, but let's not
plan to make it work too hard.

Just say "no" to killing processes :-)

> The following discipline would seem to work, without the need for a
> monitor process:
> 
>   * Wrap a guard file around the database, per my earlier idea.
>     (fcntl-locked, read-locked for normal access, write-locked for
>     recovery.)
> 
>   * Set the lock timeout (at db creation time).

Ah! Key item. Yes, this solves the whole ball of wax.

>   * If we time out on a lock, fail the transaction, grab a write lock on
>     the guard file, run recovery, and retry.

Well, we can change this a bit:

    * If we time out on a lock:
      - retry the transaction (maybe there are other reasons for a timeout,
        such as the database is simply *busy*)
      - if we get DB_RUNRECOVER, then:
        - fail the transaction (well, fail the *trail*, right?)
        - grab a write lock on REPOS/lock/db.lock
        - run recovery
        - unlock the guard
        - retry if we haven't exhausted the retry count
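
[Editor's sketch, not part of the original message.]  The steps above can
be sketched as a retry loop.  Everything here is hypothetical scaffolding:
the trail_status codes stand in for Berkeley DB results such as
DB_RUNRECOVER, the recover callback is assumed to take the write lock on
the guard, run recovery, and unlock, and fake_db only simulates a wedged
database so the loop can be exercised without BDB.

```c
#include <assert.h>

/* Hypothetical status codes standing in for Berkeley DB results.  */
enum trail_status
{
  TRAIL_OK,
  TRAIL_LOCK_TIMEOUT,
  TRAIL_RUNRECOVER,
  TRAIL_ERROR
};

typedef enum trail_status (*trail_fn) (void *baton);
typedef int (*recover_fn) (void *baton);        /* 0 on success */

/* Run TRAIL, retrying on lock timeouts and running RECOVER when the
   database reports that recovery is needed.  At most MAX_RETRIES
   retries follow the first attempt.  */
enum trail_status
run_with_recovery (trail_fn trail, recover_fn recover,
                   void *baton, int max_retries)
{
  enum trail_status st = TRAIL_ERROR;
  int attempt;

  assert (max_retries >= 0);
  for (attempt = 0; attempt <= max_retries; attempt++)
    {
      st = trail (baton);
      if (st == TRAIL_OK || st == TRAIL_ERROR)
        return st;
      /* On TRAIL_LOCK_TIMEOUT the database may simply be busy, so we
         just go around again.  Only an explicit request triggers
         recovery: write-lock guard, recover, unlock (all in RECOVER). */
      if (st == TRAIL_RUNRECOVER && recover (baton) != 0)
        return TRAIL_ERROR;
    }
  return st;                    /* retries exhausted */
}

/* Simulated database used only to exercise the loop.  */
struct fake_db
{
  int wedged;                   /* nonzero: trails report RUNRECOVER */
  int timeouts_left;            /* trails that time out first */
  int calls;
  int recoveries;
};

enum trail_status
fake_trail (void *baton)
{
  struct fake_db *db = baton;

  db->calls++;
  if (db->timeouts_left > 0)
    {
      db->timeouts_left--;
      return TRAIL_LOCK_TIMEOUT;
    }
  return db->wedged ? TRAIL_RUNRECOVER : TRAIL_OK;
}

int
fake_recover (void *baton)
{
  struct fake_db *db = baton;

  db->wedged = 0;
  db->recoveries++;
  return 0;
}
```

With a fake_db that times out once and is wedged, a call with three
retries makes three trail attempts and runs recovery once.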

> But it may be inefficient in some cases:
> 
>   * If we erroneously time out on a lock, we will still succeed
>     eventually, but it may take much longer than it would if we had
>     waited.  But that problem should be rare.

Berkeley DB should be able to tell us that we need to run the recovery, so
we can just look for that instead of assuming the need.

>   * If multiple processes hit the stale lock, they will all run
>     recovery.  We could avoid that by putting a timestamp in the guard
>     file saying when recovery was last run, or we could hypothesize that
>     N recoveries doesn't take much longer than one recovery.

The timestamp would be nice. Each process could record when it attempts to
acquire the write lock. When it finally gets the lock, it reads the file,
sees that the recovery finished *after* its acquisition time, and just
releases the write lock and retries the operation.
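
[Editor's sketch, not part of the original message.]  The timestamp check
reduces to one comparison.  The function name and the use of time_t are
assumptions; how the timestamp is stored in the guard file is left open.
Equal timestamps are treated conservatively (recover again), since with
one-second resolution the order of the two events cannot be known.

```c
#include <assert.h>
#include <time.h>

/* Called once the write lock on the guard file has been acquired.
   REQUESTED_AT is when this process started waiting for the lock;
   LAST_RECOVERY_AT is the completion time read from the guard file.
   Returns nonzero if this process should still run recovery.  */
int
recovery_still_needed (time_t requested_at, time_t last_recovery_at)
{
  /* A recovery that finished after we started waiting already covers
     the stale lock we hit; anything older (or simultaneous) does not. */
  return difftime (last_recovery_at, requested_at) <= 0.0;
}

/* Usage: three processes queue for the write lock at t = 100.  */
void
recovery_examples (void)
{
  assert (recovery_still_needed (100, 40));     /* first one recovers */
  assert (!recovery_still_needed (100, 105));   /* later ones skip it */
  assert (recovery_still_needed (100, 100));    /* tie: be conservative */
}
```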

> I also wonder how many of these problems go away if you instruct
> Berkeley DB to use fcntl locks.  (That's possible, right?)  And what the
> cost is in everyday performance, of course.

Hmm. Interesting, but I think the timeout is key, and should be able to get
us what we need.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Checkpoint less frequently

Posted by Philip Martin <ph...@codematters.co.uk>.

Greg Hudson <gh...@MIT.EDU> writes:

> On Fri, 2003-02-21 at 13:03, Branko Čibej wrote:
> [When the monitor fails to keep a process from hitting a stale lock:]
> > So we wait for a bit, then kill it.
> 
> If the monitor process is started automatically, then it may have been
> started by a different user than the one whose process hung.  So we
> can't kill it.

Even if you get round that, there are other problems.  Subversion
provides libraries to encourage alternative database clients.  We
can't go blindly killing those, it may do more harm than good.  You
might kill my fancy Subversion-aware editor.  You might kill a process
that is accessing multiple repositories, in which case you may well be
the cause of other repositories hanging.

-- 
Philip Martin


Re: Checkpoint less frequently

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2003-02-21 at 13:03, Branko Čibej wrote:
[When the monitor fails to keep a process from hitting a stale lock:]
> So we wait for a bit, then kill it.

If the monitor process is started automatically, then it may have been
started by a different user than the one whose process hung.  So we
can't kill it.

The following discipline would seem to work, without the need for a
monitor process:

  * Wrap a guard file around the database, per my earlier idea.
    (fcntl-locked, read-locked for normal access, write-locked for
    recovery.)

  * Set the lock timeout (at db creation time).

  * If we time out on a lock, fail the transaction, grab a write lock on
    the guard file, run recovery, and retry.

But it may be inefficient in some cases:

  * If we erroneously time out on a lock, we will still succeed
    eventually, but it may take much longer than it would if we had
    waited.  But that problem should be rare.

  * If multiple processes hit the stale lock, they will all run
    recovery.  We could avoid that by putting a timestamp in the guard
    file saying when recovery was last run, or we could hypothesize that
    N recoveries doesn't take much longer than one recovery.

I also wonder how many of these problems go away if you instruct
Berkeley DB to use fcntl locks.  (That's possible, right?)  And what the
cost is in everyday performance, of course.



Re: Checkpoint less frequently

Posted by Branko Čibej <br...@xbc.nu>.

Greg Hudson wrote:

>On Fri, 2003-02-21 at 12:52, Branko Čibej wrote:
>  
>
>>You don't have to stop any servers, or anything. Each server only has to
>>ask the monitor if it may open the database, and to notify it when it
>>closes it. When the monitor detects a crashed process, it starts denying
>>access to the database until all other processes have backed out, runs
>>recovery, then allows access again.
>>    
>>
>
>  
>
>>At least, that's the general idea.
>>    
>>
>
>That doesn't seem very general.
>
>  Process A opens the database
>  Process A acquires many fine locks
>  Process B opens the database
>  Process A crashes
>
>Process B is just as likely to hit a stale lock and hang as if it had
>opened the database after the crash.
>  
>
So we wait for a bit, then kill it. We know which processes were active
(i.e., fiddling with the database) at the time of the crash.

Now it's possible that there's another way to solve this problem:
setting the lock timeout. A process will block forever on a stale
lock _unless_ a timeout has been set (say, in the DB_CONFIG file). Some
time ago when I was testing different ways to avoid the wedged
DB/blocked process problem, I tried this method and it worked within my
limited test cases. But I don't understand it well enough, nor have I
stressed it enough, to be able to say whether this is an acceptable
solution or not.
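
[Editor's sketch, not part of the original message.]  For readers who
want to try this, the helper below formats the line such a DB_CONFIG
entry would contain.  It assumes the directive is spelled
set_lock_timeout and takes a value in microseconds, which is how the
Berkeley DB 4.x documentation for DB_ENV->set_timeout describes it;
verify against the documentation for the BDB version in use.
format_lock_timeout is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a DB_CONFIG line setting the lock timeout to USEC microseconds
   into BUF.  Returns 0 on success, -1 if BUF is too small.  */
int
format_lock_timeout (char *buf, size_t size, unsigned long usec)
{
  int n;

  assert (buf != NULL);
  n = snprintf (buf, size, "set_lock_timeout %lu\n", usec);
  return (n > 0 && (size_t) n < size) ? 0 : -1;
}
```

For a five-second timeout, format_lock_timeout (buf, sizeof buf, 5000000UL)
yields the line "set_lock_timeout 5000000", which would be appended to
the DB_CONFIG file in the repository's database directory.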

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


Re: Checkpoint less frequently

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2003-02-21 at 12:52, Branko Čibej wrote:
> You don't have to stop any servers, or anything. Each server only has to
> ask the monitor if it may open the database, and to notify it when it
> closes it. When the monitor detects a crashed process, it starts denying
> access to the database until all other processes have backed out, runs
> recovery, then allows access again.

> At least, that's the general idea.

That doesn't seem very general.

  Process A opens the database
  Process A acquires many fine locks
  Process B opens the database
  Process A crashes

Process B is just as likely to hit a stale lock and hang as if it had
opened the database after the crash.



Re: Checkpoint less frequently

Posted by Branko Čibej <br...@xbc.nu>.

Brandon Ehle wrote:

>>
>>
>>> No, it's not.  Requiring me to have yet another process running so
>>> that the database can be checkpointed is incredibly lame.
>>>
>>> I won't even get to the issue of what happens when the checkpoint code
>>> crashes.  We'll need a watcher.  Then, another watcher.  No.   
>>
>>
>> Justin, we'll need a watcher anyway -- it's the only means we have to
>> automatically unwedge a repository if a client crashes. D'you really
>> think we can release 1.0 without fixing this totally unacceptable bug?
>>  
>>
> I don't even think this is possible.  When the database needs recovery while
> apache is running, I usually have to log on with root privileges and do
> a killall -KILL httpd, killall svnserve, then run ipcs and delete all
> the leftover locks, then I will be able to run db_recover or svnadmin
> recover.  Then restart httpd & svnserve -d. We'd need one hell of a
> monitor to be able to accomplish all that after you take security into
> consideration.

You don't have to stop any servers, or anything. Each server only has to
ask the monitor if it may open the database, and to notify it when it
closes it. When the monitor detects a crashed process, it starts denying
access to the database until all other processes have backed out, runs
recovery, then allows access again.

At least, that's the general idea.
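
[Editor's sketch, not part of the original message.]  The monitor's
bookkeeping as described can be modelled in a few lines.  This is a
hypothetical in-memory model only: the struct and function names are
invented, crash detection and inter-process communication are out of
scope, and "running recovery" is just a counter.  It also assumes the
surviving processes do back out, which is the assumption Greg Hudson
questions in his reply in this thread.

```c
#include <assert.h>

struct monitor
{
  int open_count;               /* processes holding the DB open */
  int recovery_pending;         /* a crash was seen, recovery not yet run */
  int recoveries_run;
};

/* Run recovery once every other process has backed out.  */
static void
maybe_recover (struct monitor *m)
{
  if (m->recovery_pending && m->open_count == 0)
    {
      m->recoveries_run++;      /* stand-in for the actual recovery */
      m->recovery_pending = 0;
    }
}

/* A server asks whether it may open the database: 1 = yes, 0 = no.  */
int
monitor_request_open (struct monitor *m)
{
  if (m->recovery_pending)
    return 0;
  m->open_count++;
  return 1;
}

/* A server tells the monitor it has closed the database.  */
void
monitor_notify_close (struct monitor *m)
{
  assert (m->open_count > 0);
  m->open_count--;
  maybe_recover (m);
}

/* The monitor has detected that a process holding the DB crashed.  */
void
monitor_notify_crash (struct monitor *m)
{
  assert (m->open_count > 0);
  m->open_count--;
  m->recovery_pending = 1;
  maybe_recover (m);
}
```

In the scenario where A and B have the database open and A crashes, new
opens are denied until B closes, at which point recovery runs and access
is allowed again.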

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Checkpoint less frequently

Posted by Greg Stein <gs...@lyra.org>.

On Fri, Feb 21, 2003 at 05:53:18PM +0200, Jani Monoses wrote:
> 
> > Show me a database product that doesn't need a babysitter. I'm not aware
> > of one.
> But should subversion be a database? Ok, in a way yes.
> But CVS with all its drawbacks did not need anyone keeping a constant eye out for logfiles eating
> the whole disk and such. Most of the babysitting should be automated.

Euh... have you ever tried to maintain a *large* CVS repository with LOTS of
activity on it? Heh. Why do you think CollabNet has been sponsoring
Subversion development? :-)

I've seen CVS knock over a box. Took the whole damn thing down. I don't
think we had to send somebody physically to the box, but we did have to
reboot the darned thing. CVS consumed all available memory and the swap. It
came to a screaming halt.

And don't get me started on stale CVS locks...

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Checkpoint less frequently

Posted by Jani Monoses <ja...@iv.ro>.

> Show me a database product that doesn't need a babysitter. I'm not aware
> of one.
But should subversion be a database? Ok, in a way yes.
But CVS with all its drawbacks did not need anyone keeping a constant eye out for logfiles eating
the whole disk and such. Most of the babysitting should be automated.



Re: Checkpoint less frequently

Posted by Michael <mi...@ispwest.com>.

Jani Monoses writes:
 > > I don't even think this is possible.  When the database needs recovery while
 > > apache is running, I usually have to log on with root privileges and do a
 > > killall -KILL httpd, killall svnserve, then run ipcs and delete all the 
 > > leftover locks, then I will be able to run db_recover or svnadmin 
 > > recover.  Then restart httpd & svnserve -d. We'd need one hell of a 
 > > monitor to be able to accomplish all that after you take security into 
 > > consideration.
 > 
 > This might be the way to do it now but IMHO a svn needing a babysitter to
 > do all that should not be called 1.0 

Show me a database product that doesn't need a babysitter. I'm not aware
of one.

Michael



Re: Checkpoint less frequently

Posted by Jani Monoses <ja...@iv.ro>.

> I don't even think this is possible.  When the database needs recovery while
> apache is running, I usually have to log on with root privileges and do a
> killall -KILL httpd, killall svnserve, then run ipcs and delete all the 
> leftover locks, then I will be able to run db_recover or svnadmin 
> recover.  Then restart httpd & svnserve -d. We'd need one hell of a 
> monitor to be able to accomplish all that after you take security into 
> consideration.

This might be the way to do it now but IMHO a svn needing a babysitter to
do all that should not be called 1.0 




Re: Checkpoint less frequently

Posted by Brandon Ehle <az...@yahoo.com>.

> 
>
>>No, it's not.  Requiring me to have yet another process running so
>>that the database can be checkpointed is incredibly lame.
>>
>>I won't even get to the issue of what happens when the checkpoint code
>>crashes.  We'll need a watcher.  Then, another watcher.  No. 
>>    
>>
>
>Justin, we'll need a watcher anyway -- it's the only means we have to
>automatically unwedge a repository if a client crashes. D'you really
>think we can release 1.0 without fixing this totally unacceptable bug?
>  
>
I don't even think this is possible.  When the database needs recovery while
apache is running, I usually have to log on with root privileges and do a
killall -KILL httpd, killall svnserve, then run ipcs and delete all the 
leftover locks, then I will be able to run db_recover or svnadmin 
recover.  Then restart httpd & svnserve -d. We'd need one hell of a 
monitor to be able to accomplish all that after you take security into 
consideration.



Re: Checkpoint less frequently

Posted by Branko Čibej <br...@xbc.nu>.

Justin Erenkrantz wrote:

> --On Thursday, February 20, 2003 17:45:30 +0100 Branko Čibej
> <br...@xbc.nu> wrote:
>
>> Yup. But it would be even better to move the checkpointing into a
>> separate process so that it's asynchronous with regard to the real
>> business managing versions.
>
>
> No, it's not.  Requiring me to have yet another process running so
> that the database can be checkpointed is incredibly lame.
>
> I won't even get to the issue of what happens when the checkpoint code
> crashes.  We'll need a watcher.  Then, another watcher.  No. 

Justin, we'll need a watcher anyway -- it's the only means we have to
automatically unwedge a repository if a client crashes. D'you really
think we can release 1.0 without fixing this totally unacceptable bug?

> Please don't go this route.  I can't express my animosity towards this
> approach loud enough.  -- justin

Yes, I expect you can't. :-)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

Re: Checkpoint less frequently

Posted by Greg Stein <gs...@lyra.org>.

On Thu, Feb 20, 2003 at 06:12:18PM -0800, Justin Erenkrantz wrote:
> --On Thursday, February 20, 2003 17:45:30 +0100 Branko Čibej <br...@xbc.nu>
> wrote:
> 
> > Yup. But it would be even better to move the checkpointing into a
> > separate process so that it's asynchronous with regard to the real
> > business managing versions.
> 
> No, it's not.  Requiring me to have yet another process running so that the 
> database can be checkpointed is incredibly lame.
> 
> I won't even get to the issue of what happens when the checkpoint code 
> crashes.  We'll need a watcher.  Then, another watcher.  No.
> 
> Please don't go this route.  I can't express my animosity towards this 
> approach loud enough.  -- justin

Yah. It sucks. Quite hard. Golf balls and hoses hard.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: Checkpoint less frequently

Posted by Justin Erenkrantz <je...@apache.org>.

--On Thursday, February 20, 2003 17:45:30 +0100 Branko Čibej <br...@xbc.nu>
wrote:

> Yup. But it would be even better to move the checkpointing into a
> separate process so that it's asynchronous with regard to the real
> business managing versions.

No, it's not.  Requiring me to have yet another process running so that the 
database can be checkpointed is incredibly lame.

I won't even get to the issue of what happens when the checkpoint code 
crashes.  We'll need a watcher.  Then, another watcher.  No.

Please don't go this route.  I can't express my animosity towards this 
approach loud enough.  -- justin

---------------------------------------------------------------------

Re: Checkpoint less frequently

Posted by Branko Čibej <br...@xbc.nu>.

William Uther wrote:

>
> On Thursday, February 20, 2003, at 04:37  PM, Branko Čibej wrote:
>
>> Thanks, Brandon, this is a very good analysis. And it confirms my
>> suspicions that we're using _way_ too many transactions, and issuing far
>> too many txn_checkpoint calls.
>>
>> I think *the* major task for 0.19 is:
>>
>>     * Stop creating transactions for read-only requests, and use
>>       ordinary locks instead.
>
>
> Would this stop the log files growing on read-only requests? 

That's sort of the point, yes -- but it would also reduce the number of
fsyncs on txn commits, which is one of the major slowdowns.

> % ls -l repos/db/log.*
> -rw-r--r--  1 willu  staff  81442 Feb 20 21:12 repos/db/log.0000000001
> % svn up wc
> At revision 1.
> % ls -l repos/db/log.*
> -rw-r--r--  1 willu  staff  86589 Feb 20 21:12 repos/db/log.0000000001
> % svn up wc
> At revision 1.
> % ls -l repos/db/log.*
> -rw-r--r--  1 willu  staff  89135 Feb 20 21:13 repos/db/log.0000000001
>
> Here it doesn't grow much, and so it isn't a major problem, but if it
> were to go away I wouldn't mind. :) 

Imagine serving a web site from the repository. Log files will grow on
every hit -- for no good reason at all.

>>     * Reduce the number of txn_checkpoint calls in our code, or even
>>       eliminate them completely.
>>
>> Before anyone starts wondering if I'm off my rocker, consider this: you
>> only really need a txn_checkpoint when you're doing a hot backup of the
>> database, or removing old log files. Therefore, checkpoints should be
>> issued by the backup/cleanup scripts, definitely not in the critical
>> path.
>
>
> Reading http://www.sleepycat.com/docs/ref/transapp/checkpoint.html
>
> it looks like the database is safe with less frequent checkpointing. 
> (checkpointing just syncs the database files.  The log files are
> already on disk.)  Note that sleepycat mention checkpointing every 60
> seconds, and "Because checkpoints can be quite expensive, choosing how
> often to perform a checkpoint is a common tuning parameter for
> Berkeley DB applications." 

Yup. But it would be even better to move the checkpointing into a
separate process so that it's asynchronous with regard to the real
business of managing versions.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

RE: Checkpoint less frequently

Posted by Sander Striker <st...@apache.org>.

> From: Sander Striker [mailto:striker@apache.org]
> Sent: Thursday, February 20, 2003 11:26 AM

>> Reading http://www.sleepycat.com/docs/ref/transapp/checkpoint.html
> 
> Reading this I wonder why we don't checkpoint only at the time right
> before we run post-commit (before we tell the client the commit succeeded).
> And only then.  Any reason to checkpoint more often?

Hmm, maybe an exception for operations that only touch (and modify!) transactions.
I'm thinking about the lock strategy notes (the full impl.) here, where 'commits'
happen on a transaction until the lock is released.

Sander

---------------------------------------------------------------------

RE: Checkpoint less frequently

Posted by Sander Striker <st...@apache.org>.

> From: William Uther [mailto:willu.mailingLists@cse.unsw.edu.au]
> Sent: Thursday, February 20, 2003 11:19 AM

>>     * Reduce the number of txn_checkpoint calls in our code, or even
>>       eliminate them completely.
>>
>> Before anyone starts wondering if I'm off my rocker, consider this: you
>> only really need a txn_checkpoint when you're doing a hot backup of the
>> database, or removing old log files. Therefore, checkpoints should be
>> issued by the backup/cleanup scripts, definitely not in the critical 
>> path.
> 
> Reading http://www.sleepycat.com/docs/ref/transapp/checkpoint.html

Reading this I wonder why we don't checkpoint only at the time right
before we run post-commit (before we tell the client the commit succeeded).
And only then.  Any reason to checkpoint more often?

Sander

---------------------------------------------------------------------

Re: Checkpoint less frequently

Posted by William Uther <wi...@cse.unsw.edu.au>.

On Thursday, February 20, 2003, at 04:37  PM, Branko Čibej wrote:

> Thanks, Brandon, this is a very good analysis. And it confirms my
> suspicions that we're using _way_ too many transactions, and issuing
> far too many txn_checkpoint calls.
>
> I think *the* major task for 0.19 is:
>
>     * Stop creating transactions for read-only requests, and use
>       ordinary locks instead.

Would this stop the log files growing on read-only requests?

% ls -l repos/db/log.*
-rw-r--r--  1 willu  staff  81442 Feb 20 21:12 repos/db/log.0000000001
% svn up wc
At revision 1.
% ls -l repos/db/log.*
-rw-r--r--  1 willu  staff  86589 Feb 20 21:12 repos/db/log.0000000001
% svn up wc
At revision 1.
% ls -l repos/db/log.*
-rw-r--r--  1 willu  staff  89135 Feb 20 21:13 repos/db/log.0000000001

Here it doesn't grow much, and so it isn't a major problem, but if it 
were to go away I wouldn't mind. :)
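To put a number on the listing above, here is a minimal sketch in C that averages the growth between consecutive readings. The helper name avg_log_growth is made up for illustration; the only data used are the three file sizes shown above.

```c
#include <stddef.h>

/* Hypothetical helper: average number of bytes a log file grew per
   operation, given n successive size readings taken between operations.
   Uses integer division, so the result is rounded toward zero.  */
long avg_log_growth(const long *sizes, size_t n)
{
    if (n < 2)
        return 0;   /* need at least two readings to measure growth */
    return (sizes[n - 1] - sizes[0]) / (long)(n - 1);
}
```

With the readings 81442, 86589 and 89135, the two no-op updates added 5147 and 2546 bytes, so the helper reports roughly 3.8 KB per update.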

>     * Reduce the number of txn_checkpoint calls in our code, or even
>       eliminate them completely.
>
> Before anyone starts wondering if I'm off my rocker, consider this: you
> only really need a txn_checkpoint when you're doing a hot backup of the
> database, or removing old log files. Therefore, checkpoints should be
> issued by the backup/cleanup scripts, definitely not in the critical 
> path.

Reading http://www.sleepycat.com/docs/ref/transapp/checkpoint.html

it looks like the database is safe with less frequent checkpointing.  
(checkpointing just syncs the database files.  The log files are 
already on disk.)  Note that sleepycat mention checkpointing every 60 
seconds, and "Because checkpoints can be quite expensive, choosing how 
often to perform a checkpoint is a common tuning parameter for Berkeley 
DB applications."
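As I read the Berkeley DB documentation, the two numeric arguments in the patch under discussion, txn_checkpoint (env, 8000, 60, 0), are a log-size threshold in kilobytes and a time threshold in minutes. The sketch below is a stand-alone model of that decision, not Berkeley DB code: the function name ckp_should_run is made up, and the exact boundary behaviour (">=" versus ">") is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the kbyte/min thresholds of txn_checkpoint().
   log_kb:  kilobytes of log written since the last checkpoint
   minutes: minutes elapsed since the last checkpoint
   kbyte, min: the thresholds passed by the caller (0 means "not set")  */
bool ckp_should_run(uint64_t log_kb, uint64_t minutes,
                    uint32_t kbyte, uint32_t min)
{
    /* No thresholds at all: checkpoint on every call, which is what
       the unpatched (env, 0, 0, 0) call asks for.  */
    if (kbyte == 0 && min == 0)
        return true;

    /* Either threshold being reached is enough (assumed semantics).  */
    if (kbyte != 0 && log_kb >= kbyte)
        return true;
    if (min != 0 && minutes >= min)
        return true;

    return false;
}
```

Under this model, a call made one minute after the previous checkpoint with only a few kilobytes of new log does real work with (0, 0) but is skipped with (8000, 60).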

later,

Will        :-}

--
Dr William Uther                            National ICT Australia
Phone: +61 2 9385 6926             School of Computer Science and Engineering
Email: willu@cse.unsw.edu.au             University of New South Wales
Jabber: willu@jabber.cse.unsw.edu.au          Sydney, Australia


---------------------------------------------------------------------

Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by Branko Čibej <br...@xbc.nu>.

Thanks, Brandon, this is a very good analysis. And it confirms my
suspicions that we're using _way_ too many transactions, and issuing far
too many txn_checkpoint calls.

I think *the* major task for 0.19 is:

    * Create a DB monitor that can detect crashed sessions and
      automagically unwedge the DB.
    * Stop creating transactions for read-only requests, and use
      ordinary locks instead.
    * Reduce the number of txn_checkpoint calls in our code, or even
      eliminate them completely.

Before anyone starts wondering if I'm off my rocker, consider this: you
only really need a txn_checkpoint when you're doing a hot backup of the
database, or removing old log files. Therefore, checkpoints should be
issued by the backup/cleanup scripts, definitely not in the critical path.

I actually think moving the checkpointing out of the main code is the
simplest of the three.

Brandon Ehle wrote:

>>> I'm in favor of committing this change.  I even volunteer to test it.
>>>
>>> Without it, my ra_svn tests frequently hang.
>>
>> isn't that just masking whatever the real bug is?  i mean
>> checkpointing more often shouldn't be causing a problem, and if it
>> is, we need to figure out why, not ignore it and hope it goes away.
>
> I've been tracking down this issue for about 3 months now and here is
> my guess on what's happening.
>
> Pretty much every svn operation touches the database in some way or
> another.  Even an svn update when nothing has changed in either your
> working copy or the repository, so every operation will put the
> repository in a state where txn_checkpoint() has something to do. 
> Therefore, txn_checkpoint() will get run after every single operation
> (in ra_dav mode this includes every PUT).
>
> Normally this isn't too bad, but as your repository grows, the
> checkpoint times will get larger and larger and eventually you could
> get to the point where my 15GB repository is at and a txn_checkpoint()
> takes 5 minutes or more.
>
> Any operations on the database after this point will wait in
> __os_yield() for a short period of time until the checkpoint has
> released its lock on the shared memory for the last log file, which is
> needed for quite a few operations.  This is why the Subversion call
> stack appears to get stuck in __os_yield().  If it takes more than
> 90 seconds for txn_checkpoint() to release its locks, that's when you
> see the neon timeouts over ra_dav.
>
> As a lot of small operations are running on the database in ra_dav
> mode, the repository can get into a state where it needs to run 2 or 3
> txn_checkpoints() in a row.   This will easily cause the 90 second
> neon timeout.  The txn limiting patch attempts to limit the number of
> checkpoints that will run in a row under these circumstances, although
> it is still very possible to get timeouts if it takes your machine
> more than 90 seconds for a single txn_checkpoint() to release its locks.
>
> For a multi-user ra_dav server, another fun part of this problem is
> that only one txn_checkpoint() can run at a time, so as each operation
> wants to run txn_checkpoint(), and if you have enough users,
> eventually every apache thread will be waiting for a turn to run
> txn_checkpoint() so apache will have to spawn some more processes (if
> it can).  If your apache server is stuck in this mode and you attempt
> to shut it down, it could take on the order of several hours until
> apache finishes shutting down.  The txn limiting patch helps, but
> does not completely address this issue (you should be able to run
> about 4x as many users on your server with the patch applied).



-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by Brandon Ehle <az...@yahoo.com>.

>> I'm in favor of committing this change.  I even volunteer to test it.
>>
>> Without it, my ra_svn tests frequently hang.
>
> isn't that just masking whatever the real bug is?  i mean 
> checkpointing more often shouldn't be causing a problem, and if it is, 
> we need to figure out why, not ignore it and hope it goes away.


I've been tracking down this issue for about 3 months now and here is my
guess on what's happening.

Pretty much every svn operation touches the database in some way or
another, even an svn update when nothing has changed in either your
working copy or the repository.  So every operation will put the
repository in a state where txn_checkpoint() has something to do, and
txn_checkpoint() will therefore get run after every single operation
(in ra_dav mode this includes every PUT).

Normally this isn't too bad, but as your repository grows, the
checkpoint times will get larger and larger, and eventually you could get
to the point my 15GB repository is at, where a txn_checkpoint() takes
5 minutes or more.

Any operations on the database after this point will wait in
__os_yield() for a short period of time until the checkpoint has
released its lock on the shared memory for the last log file, which is
needed for quite a few operations.  This is why the Subversion call
stack appears to get stuck in __os_yield().  If it takes more than
90 seconds for txn_checkpoint() to release its locks, that's when you
see the neon timeouts over ra_dav.

As a lot of small operations are running on the database in ra_dav mode,
the repository can get into a state where it needs to run 2 or 3
txn_checkpoint() calls in a row.  This will easily cause the 90 second
neon timeout.  The txn limiting patch attempts to limit the number of
checkpoints that will run in a row under these circumstances, although
it is still very possible to get timeouts if it takes your machine more
than 90 seconds for a single txn_checkpoint() to release its locks.

For a multi-user ra_dav server, another fun part of this problem is that
only one txn_checkpoint() can run at a time.  As each operation wants
to run txn_checkpoint(), if you have enough users, eventually every
apache thread will be waiting for a turn to run txn_checkpoint(), so
apache will have to spawn some more processes (if it can).  If your
apache server is stuck in this mode and you attempt to shut it down, it
could take on the order of several hours until apache finishes shutting
down.  The txn limiting patch helps, but does not completely address
this issue (you should be able to run about 4x as many users on your
server with the patch applied).
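
The queueing effect described above can be illustrated with a toy model in C. It assumes that checkpoints run strictly one after another and that each takes the same fixed number of seconds; neither the function name waiters_within_timeout nor the fixed-duration assumption comes from Berkeley DB or neon, and only the 90 second timeout is taken from the observations above.

```c
#include <stddef.h>

/* Toy model: n operations each want to run a checkpoint, checkpoints are
   serialized, and each one takes ckp_secs seconds.  The i-th waiter
   (counting from 1) finishes after i * ckp_secs seconds.  Returns how
   many of the n waiters finish within timeout_secs.  */
size_t waiters_within_timeout(size_t n, unsigned ckp_secs,
                              unsigned timeout_secs)
{
    size_t fit;

    if (ckp_secs == 0)
        return n;               /* instantaneous checkpoints never time out */

    fit = timeout_secs / ckp_secs;   /* waiters that fit into the timeout */
    return fit < n ? fit : n;
}
```

In this model, three queued 40-second checkpoints against a 90 second timeout let only the first two operations finish in time, and a single 5 minute (300 second) checkpoint already times out every waiter.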



---------------------------------------------------------------------

Re: Checkpoint less frequently (was Re: Still hang on svn 4951 RedHat 7.3 SMP)

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

mark benedetto king wrote:

>On Wed, Feb 19, 2003 at 02:36:07PM -0500, Brandon Ehle wrote:
>
>>Index: subversion/libsvn_fs/fs.c
>>===================================================================
>>--- subversion/libsvn_fs/fs.c   (revision 4721)
>>+++ subversion/libsvn_fs/fs.c   (working copy)
>>@@ -163,7 +163,7 @@
>>
>>  /* Checkpoint any changes.  */
>>  {
>>-    int db_err = env->txn_checkpoint (env, 0, 0, 0);
>>+    int db_err = env->txn_checkpoint (env, 8000, 60, 0);
>>
>>#if SVN_BDB_HAS_DB_INCOMPLETE
>>    while (db_err == DB_INCOMPLETE)
>
>I'm in favor of committing this change.  I even volunteer to test it.
>
>Without it, my ra_svn tests frequently hang.
>

isn't that just masking whatever the real bug is?  i mean checkpointing 
more often shouldn't be causing a problem, and if it is, we need to 
figure out why, not ignore it and hope it goes away.

-garrett


---------------------------------------------------------------------