You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Greg Hudson <gh...@MIT.EDU> on 2004/08/28 22:25:31 UTC

mbk's recovery lock change

mbk nominated r10467 and r10734 for backport to 1.1.  I've decided I
have a little disagreement with how it works, and want to bring up the
issue on the list so we can resolve it and the get the change into
1.1.

What we do now is print "This may take a while...", silently attempt
to grab the lock, and then recover.  This is bad because if we're
taking a long time, the user has no idea whether it's because of the
lock (which the user probably doesn't even know about) or because DB
recovery is slow.

mbk's change adds a nonblocking flag to svn_repos_recover (via a new
function yadda yadda).  The recover flag tries svn_repos_recover2 with
the nonblocking flag; it it fails, it prints a warning about the lock
and tries again without the nonblocking flag.

Although this is definitely better than the status quo, my
disagreement is that the user doesn't find out whether recovery has
started, and thus doesn't know whether the recovery process can be
safely interrupted.

The easiest alternative is to just bomb out if we can't get the lock
immediately.  A harder alternative (because it requires a tricky API
decision) is to retry with a blocking lock, but print a message once
we get the lock.

I kind of like the first alternative, because even if we print a
message once we get the lock, the user has to perform a little dance
in order to safely kill off the recovery attempt (^Z it, see if the
message was displayed before the process was stopped, then continue).
However, mbk points out that if svnserve is running out of inetd, it
can be convenient to have the recovery process waiting in line to get
the lock, since it's hard to ensure that inetd won't kick off an
svnserve process and grab a read lock.

So, to summarize:

  What we have now:  No diagnostic, blocking lock attempt
  mbk's alternative: Diagnostic if we can't get lock, then block
  My alternative:    Fail if we can't get lock
  Other alternative: Diagnostic if we can't get lock, then block,
                      and print another message when we do get lock

I think this is a straightforward weighing of tradeoffs, so we can
just collect preferences, but feel free to make new arguments if you
have them.  If you want to side with the "other alternative," please
mention what kind of API you'd like to see for it--either a
notification callback to svn_repos_recover2, or separating out the API
to grab the exclusive lock (so svn_repos_recover2 would have the same
signature as svn_repos_recover, but would assume that you've already
grabbed the lock).

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Command-line option for changing block-on-recover behavior considered unnecessary.

Posted by Mark Benedetto King <mb...@lowlatency.com>.

On Mon, Aug 30, 2004 at 11:19:32AM -0400, Eric S. Raymond wrote:
> Mark Benedetto King <mb...@lowlatency.com>:
> > This sounds like a job for a command-line option!
> 
> That wouldn't solve the problem, just move the debate to "which mode 
> should be default?"

That's true, except that that debate might be one more easily solved.

The fact that there's a camp of people who want the non-blocking
behaviour, and a camp of people (well, me at least) who like the
idea of not having to completely kill apache to recover a single
svn repository, indicate that perhaps we need to support both modes
of operation.

I am perfectly happy with the non-blocking mode as the default.
If my camp really contains only me, then the debate is over; not
every design decision has to be hard-fought.

--ben

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Command-line option for changing block-on-recover behavior considered unnecessary.

Posted by kf...@collab.net.

> It's all good.  But *don't wimp out on listing the lockers*.

IMHO the proposal overly complex, and would be more maintenance burden
than its worth.  I wouldn't veto it, if the other developers also
think it's a good idea.  But that little script has to get installed
somewhere, it has to be documented, we have to make sure it does not
cause problems on systems where it won't be invoked, make sure it
invokes lsof or whatever in the right way for the particular system
its on, etc.  The estimate of "two, three hours' work at the outside"
seems very low to me.  Of course, that's your time you're
volunteering, so the estimate doesn't really matter -- I'm mainly
concerned about the ongoing maintenance and comprehensibility burden.

And for how much benefit?  Suppose it's httpd, or svnserve being run
from inetd?  In the former case, httpd is probably a child process.
Killing it might not help -- Apache will start up another one when the
next access comes in, which might well happen between the time when
lsof is called and the admin-invoked kill is issued.  Unpredictably,
then, our helpful techniques might fail anyway.  Sometimes they'll
work, sometimes they won't.  That'll be hard to explain.  Far better
for the admin to *understand* what is going on, and restart apache
after editing the appropriate conf file.  Similar arguments apply to
svnserve.

I think the best balance of our effort vs admin convenience would be
to print out a message that says what's going on, speculates about the
sorts of daemon-style processes that might be responsible (i.e., tells
the admin to check Apache and/or svnserve), and then lets the admin Do
The Right Thing.

I don't deny that it's a burden on admins to ask that they know what
"The Right Thing" is for so many different systems, of which
Subversion is only one.  But all my instincts are saying that we
really can't solve this well anyway, and that we'd spend far too much
time trying.

Best,
-Karl

"Eric S. Raymond" <es...@thyrsus.com> writes:
> kfogel@collab.net <kf...@collab.net>:
> > We can't portably determine the other locking processes; attempting to
> > do so would be too much extra complexity. 
> 
> The demand for perfect portability is a red herring in this case.
> You're a Unix programmer; think like one and apply the following
> rules:
> 
> 1. Never write what you can reuse.
> 2. Better to do 85% of the job than do 0% of it because you can't handle 15%.
> 3. When in doubt, use brute force.
> 
> Here's the brute-force solution.  You know the list of files with
> critical locks on them. When svnadmin recover detecys that the
> database is locked, it should shell out to a dumb little script --
> we'll call it "list-all-lockers" -- that does this:
> 
> 1. Determine if fuser is in a command directory.  If so, shell out to
>    fuser called on the critical file list.  It will will either generate 
>    the right report to stdout or an error status.  On error status,
>    complain.  You're done.
> 
> 2. Determine if lsof is in a command directory.  If so, shell out to
>    lsof called on the critical file list. It will either generate 
>    the right report to stdout or an error status. On error status,
>    complain.  You're done.
> 
> 3. If neither of these works, punt with a complaint message. You're done.
> 
> This is the patch I was going to write if nobody else got moving.  Two
> dozen lines of shell.  Maybe three lines of C.  Two, three hours' work
> at the outside.
> 
> This addition would handle Linux, Mac OS/X and other Unixes, all of whom
> can run bash scripts.  svnadmin runs on the *repository* machine, of
> course, so the only people it won't help are people hosting
> repositories on Windows boxes -- and they'd be no worse off than they
> were before.
> 
> >                                          Plus, some people do run
> > repositories over very reliable remote filesystems, in which case
> > lsof/fuser is no help anyway (thanks to ghudson for pointing this out
> > in IRC).
> 
> So lsof/fuser sez "Oops! Boss, I can't stat that file!", returns an
> error status, and you're at case 3.  Yes, this will happen
> occasionally.  Is it a good reason not to help the vast majority of
> admins for whom the simple shellout to fuser/lsof will work?
> 
> Heck, no.
> 
> Here's why it's our responsibility. *The admin doesn't know what the
> critical files are*.  So even if he groks fuser/lsof, he can't know
> the proper mystic incantation to generate to find out what the locking
> processes are.
> 
> Our choices, therefore are simple:
> 
> 1. Punt the problem.  But I've flamed at length about why this would be 
>    wrong, and the list seems to have taken that critique to heart.
> 
> 2. Expose svn's guts by uttering a list of the critical files so the
>    sdmin can run fsuser/lsof by hand.
> 
> 3. Delegate the job of finding the lockers to plugins that may vary 
>    depending on the host OS (what I've suggested).
> 
> Really, is there any reason the distribution shouldn't include 
> a `list-all-lockers` script for each port?
> 
> >       Detecting and auto-killing competing processes, for --force,
> > would be even harder and more error-prone.  "I ran 'svnadmin recover'
> > and _what_ died??"
> 
> Granted.  I had more or less come to the conclusion that this wasn't
> practical myself.
>  
> > But we can make 'svnadmin recover' much friendlier than it is now (and
> > IMHO, friendly enough) without going into such complexity:
> > 
> >    By default, 'svnadmin recover' fails immediately if it can't get a
> >    lock, otherwise takes the lock and begins recovering, in
> >    non-interruptible mode.
> > 
> >    But when passed the '--wait' option, if it can't get the lock, it
> >    prints out a message saying:
> > 
> >    "Waiting for lock.  Currently, at least one other process has the
> >     repository open; recovery will start when the repository is
> >     released.  Press ^C to give up."
> > 
> >    Then when it gets the lock, it prints out a second message:
> > 
> >    "Lock acquired, starting recovery.  This is non-interruptible."
> 
> It's all good.  But *don't wimp out on listing the lockers*.
> -- 
> 		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Command-line option for changing block-on-recover behavior considered unnecessary.

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

kfogel@collab.net <kf...@collab.net>:
> We can't portably determine the other locking processes; attempting to
> do so would be too much extra complexity. 

The demand for perfect portability is a red herring in this case.
You're a Unix programmer; think like one and apply the following
rules:

1. Never write what you can reuse.
2. Better to do 85% of the job than do 0% of it because you can't handle 15%.
3. When in doubt, use brute force.

Here's the brute-force solution.  You know the list of files with
critical locks on them. When svnadmin recover detecys that the
database is locked, it should shell out to a dumb little script --
we'll call it "list-all-lockers" -- that does this:

1. Determine if fuser is in a command directory.  If so, shell out to
   fuser called on the critical file list.  It will will either generate 
   the right report to stdout or an error status.  On error status,
   complain.  You're done.

2. Determine if lsof is in a command directory.  If so, shell out to
   lsof called on the critical file list. It will either generate 
   the right report to stdout or an error status. On error status,
   complain.  You're done.

3. If neither of these works, punt with a complaint message. You're done.

This is the patch I was going to write if nobody else got moving.  Two
dozen lines of shell.  Maybe three lines of C.  Two, three hours' work
at the outside.

This addition would handle Linux, Mac OS/X and other Unixes, all of whom
can run bash scripts.  svnadmin runs on the *repository* machine, of
course, so the only people it won't help are people hosting
repositories on Windows boxes -- and they'd be no worse off than they
were before.

>                                          Plus, some people do run
> repositories over very reliable remote filesystems, in which case
> lsof/fuser is no help anyway (thanks to ghudson for pointing this out
> in IRC).

So lsof/fuser sez "Oops! Boss, I can't stat that file!", returns an
error status, and you're at case 3.  Yes, this will happen
occasionally.  Is it a good reason not to help the vast majority of
admins for whom the simple shellout to fuser/lsof will work?

Heck, no.

Here's why it's our responsibility. *The admin doesn't know what the
critical files are*.  So even if he groks fuser/lsof, he can't know
the proper mystic incantation to generate to find out what the locking
processes are.

Our choices, therefore are simple:

1. Punt the problem.  But I've flamed at length about why this would be 
   wrong, and the list seems to have taken that critique to heart.

2. Expose svn's guts by uttering a list of the critical files so the
   sdmin can run fsuser/lsof by hand.

3. Delegate the job of finding the lockers to plugins that may vary 
   depending on the host OS (what I've suggested).

Really, is there any reason the distribution shouldn't include 
a `list-all-lockers` script for each port?

>       Detecting and auto-killing competing processes, for --force,
> would be even harder and more error-prone.  "I ran 'svnadmin recover'
> and _what_ died??"

Granted.  I had more or less come to the conclusion that this wasn't
practical myself.
 
> But we can make 'svnadmin recover' much friendlier than it is now (and
> IMHO, friendly enough) without going into such complexity:
> 
>    By default, 'svnadmin recover' fails immediately if it can't get a
>    lock, otherwise takes the lock and begins recovering, in
>    non-interruptible mode.
> 
>    But when passed the '--wait' option, if it can't get the lock, it
>    prints out a message saying:
> 
>    "Waiting for lock.  Currently, at least one other process has the
>     repository open; recovery will start when the repository is
>     released.  Press ^C to give up."
> 
>    Then when it gets the lock, it prints out a second message:
> 
>    "Lock acquired, starting recovery.  This is non-interruptible."

It's all good.  But *don't wimp out on listing the lockers*.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Command-line option for changing block-on-recover behavior considered unnecessary.

Posted by kf...@collab.net.

I'd like to make a proposal that incorporates some of the best ideas
from recent posts.

We can't portably determine the other locking processes; attempting to
do so would be too much extra complexity.  Plus, some people do run
repositories over very reliable remote filesystems, in which case
lsof/fuser is no help anyway (thanks to ghudson for pointing this out
in IRC).  Detecting and auto-killing competing processes, for --force,
would be even harder and more error-prone.  "I ran 'svnadmin recover'
and _what_ died??"

But we can make 'svnadmin recover' much friendlier than it is now (and
IMHO, friendly enough) without going into such complexity:

   By default, 'svnadmin recover' fails immediately if it can't get a
   lock, otherwise takes the lock and begins recovering, in
   non-interruptible mode.

   But when passed the '--wait' option, if it can't get the lock, it
   prints out a message saying:

   "Waiting for lock.  Currently, at least one other process has the
    repository open; recovery will start when the repository is
    released.  Press ^C to give up."

   Then when it gets the lock, it prints out a second message:

   "Lock acquired, starting recovery.  This is non-interruptible."

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Command-line option for changing block-on-recover behavior considered unnecessary.

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

Mark Benedetto King <mb...@lowlatency.com>:
> This sounds like a job for a command-line option!

That wouldn't solve the problem, just move the debate to "which mode 
should be default?"

Unix programmers have a reflex to make behavior configurable when there is
a controversy about the right thing.  This is a good reflex, most of the time.
It's especially appropriate for tools like svnadmin that will be used by
experts to react to unusual situations.

However...making things configurable should not be a substitute for thinking
the design problem all the way through.  Every option you add is overhead;
it complicates testing, and has to be documented, and puts an additional
mnemonic load on the user.

In this case, I think we can avoid having an extra command-line option fairly
simply by thinking about the use cases for the tool.  It is almost always
going to be used interactively (I'll have more to say about non-interactive
use at the end of this note).  Therefore we can count on having a user's
finger above the interrupt button.

So, here's a strawman design that I think might satisfy the run-to-completion 
fans without another option:

1. If svn recover sees on startup that it's going to hit a lock, it says
   this:

	The database is locked by processes xxx, yyy, zzz
	Attempting to acquire lock, you can interrupt this with ^C

   Note: listing the blocking processes is not a frill or an option, 
   it is necessary.  It is how you give the user a way out of the jam
   rather than leaving him suspended wondering what to do next.

2. When it gets the lock, it frobs the terminal mode so DEL and QUIT 
   don't work, and says:

	Lock acquired, interrupts will be ignored until completion.

Let's look at how this stacks up against the current proposals as
mbk listed them:

1. What we have now: No diagnostic, blocking lock attempt

It's better than the current state of affairs.  Anything would be.

2. mbk's alternative: Diagnostic if we can't get lock, then block

My proposal blocks, but it blocks soft -- it can be interrupted 
until you get to the critical phase.

3. Greg Hudson's alternative:    Fail if we can't get lock

In the interactive case, I don't think there is much to be gained 
by Greg's proposal, because on seeing the lock diagnostic the user
can hit control-C.  

But the reverse also applies -- it's easy enough to re-invoke the tool
if it errors out with a lock diagnostic.  In fact, his proposal has
the slight advantage that the user doesn't have to hit ^C, but is 
dumped straight to a shell prompt and can issue some kill commands
on the blocking processes.

Which design is better depends on how much value you think there is
in having the tool be able to run to completion hands-off.

4. Other alternative: Diagnostic if we can't get lock, then block,
                       and print another message when we do get lock

This is what I proposed above.  I think it's a clear improvement over
blocking hard with just one diagnostic.

However, my real point is that adding a command-line option to control
whether or not "svn recover" errors out on startup when it detects a
lock does not really buy you anything in the interactive case -- either
failing out (as Greg proposes) or being interruptible during the attempt 
to acquire lock (as I propose) is good enough.

In fact, now that I think about it, I may like Greg's proposal
slightly better than mine.  But I'll leave mine on the table in case
the run-to-completion fans like Mark win the argument.

I promised to address the non-interactive case.  I think Mark is right
that if we think this one will be at all frequent, it makes a
run-to-completion design like his or mine look better.

But let's ask the next question.  If recover is going to be run without
a human watching, do we really want it to block on lock at all?

Probably not.  So there may be a good argument for an option after all -- 
not to control interactive behavior but to control how it works from scripts.
Here's the option:

-f, --force

If this option is specified and svnadmin detects database lockers on recovery
startup, svnadmin will check to see if the calling user has permission to 
kill all the blockers.  If so, all will be summarily executed and the
caller so notified with a diagnostic.  If not, svnadin will complain
about the processes it cannot kill and not kill any blockrs.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: mbk's recovery lock change

Posted by Mark Benedetto King <mb...@lowlatency.com>.

On Sat, Aug 28, 2004 at 06:25:31PM -0400, Greg Hudson wrote:
> 
> So, to summarize:
> 
>   What we have now:  No diagnostic, blocking lock attempt
>   mbk's alternative: Diagnostic if we can't get lock, then block
>   My alternative:    Fail if we can't get lock
>   Other alternative: Diagnostic if we can't get lock, then block,
>                       and print another message when we do get lock
> 

I think Greg's suggestion is reasonable.

Some points in favor of the "blocking" approaches are:

1.) Unless there is an actual unrecoverable problem with the repository,
    they can run to completion without error.

2.) There is no starvation potential (presuming that the locking primitive
    guarantees no starvation).  This is offset somewhat by the fact that
    the repository administrator almost always has at least the priveledge,
    if not the desire, to shutdown svnserve/apache/inetd in the case where
    they are having trouble obtaining the lock.

Hmm... maybe there's some middle ground...

The difference between Greg's suggestion and the trunk code is that
Greg is proposing that we remove the blocking call to svn_repos_recover2.

This sounds like a job for a command-line option!

--ben


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: mbk's recovery lock change

Posted by Tobias Ringström <to...@ringstrom.mine.nu>.

Greg Hudson wrote:

>I'd still like to see more people's opinions on what the
>correct behavior is.
>  
>
Yes, of course. I got carried away while thinking about the issue. 
That's why my opinion came with a patch attached. :-)

/Tobias


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: mbk's recovery lock change

Posted by Greg Hudson <gh...@MIT.EDU>.

On Sun, 2004-08-29 at 10:10, Tobias Ringström wrote:
> The attached patch implements your "other alternative" above, and also 
> makes it possible to interrupt svnadmin while waiting for the lock. 
> Opinions?

Your patch looks correct to me (I'm surprised that APR's apr_signal
layer is so thin that it's appropriate to use SIG_DFL, but it seems to
be that way).  I'd still like to see more people's opinions on what the
correct behavior is.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Re: mbk's recovery lock change

Posted by Max Bowsher <ma...@ukf.net>.

Tobias Ringström wrote:
> Greg Hudson wrote:
>
>>So, to summarize:
>>
>>  What we have now:  No diagnostic, blocking lock attempt
>>  mbk's alternative: Diagnostic if we can't get lock, then block
>>  My alternative:    Fail if we can't get lock
>>  Other alternative: Diagnostic if we can't get lock, then block,
>>                      and print another message when we do get lock
>>
>>
> A problem with the current solution that you didn't mention is that it's
> really hard to kill svnadmin while it's waiting to acquire the lock
> because we ignore SIGINT and SIGKILL. I don't like the "fail if we can't
> get lock" approach for the reason you quoted from mbk, but it's not only
> a problem when running svnserve from inetd. You can get the same
> problems with file:// and svn+ssh://. If you have a busy server that
> serves multiple repositories, you don't want to shut down the whole
> server just to recover a single repository.
>
> The attached patch implements your "other alternative" above, and also
> makes it possible to interrupt svnadmin while waiting for the lock.
> Opinions?

+1 (concept), especially about being able to Ctrl-C whilst waiting on the 
lock.

Nothing against the patch, just haven't had time to review in detail.

Max.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

[PATCH] Re: mbk's recovery lock change

Posted by Tobias Ringström <to...@ringstrom.mine.nu>.

Greg Hudson wrote:

>So, to summarize:
>
>  What we have now:  No diagnostic, blocking lock attempt
>  mbk's alternative: Diagnostic if we can't get lock, then block
>  My alternative:    Fail if we can't get lock
>  Other alternative: Diagnostic if we can't get lock, then block,
>                      and print another message when we do get lock
>  
>
A problem with the current solution that you didn't mention is that it's 
really hard to kill svnadmin while it's waiting to acquire the lock 
because we ignore SIGINT and SIGKILL. I don't like the "fail if we can't 
get lock" approach for the reason you quoted from mbk, but it's not only 
a problem when running svnserve from inetd. You can get the same 
problems with file:// and svn+ssh://. If you have a busy server that 
serves multiple repositories, you don't want to shut down the whole 
server just to recover a single repository.

The attached patch implements your "other alternative" above, and also 
makes it possible to interrupt svnadmin while waiting for the lock. 
Opinions?

/Tobias

[[[
Make 'svnadmin recover' print a warning message when it's waiting for
the exclusive lock and a message when the recovery has started.  Allow
svnadmin to be interrupted while waiting for the lock.

* svnadmin/main.c
  (setup_cancellation_signals): New helper function.

  (recovery_started): New function to inform the user that the recovery
  has started.

  (subcommand_recover): Restore default signal handling before calling
  svn_repos_recover2.  Call the updated svn_repos_recover2 with
  recovery_started as argument.

  (main): Use the new setup_cancellation_signals.

* include/svn_repos.h
  (svn_repos_recover2): Add a callback to be called when the recovery
  starts and a baton to be supplied to the callback.

* libsvn_repos/repos.c
  (svn_repos_recover2): Call the callback after acquiring the lock.

* bindings/java/javahl/native/SVNAdmin.cpp
  (SVNAdmin::recover): Update to match updated svn_repos_recover2.
]]]