Posted to dev@subversion.apache.org by Peter Howard <pj...@coastal.net.au> on 2002/12/18 11:56:34 UTC

Repository "hung"

I say "hung" as I can't think of a better one-word description.

The long version.

On Tuesday the repository was fine.  On Wednesday night (now) I went to do
an update (remote, different machine to Tuesday).  Can't access repository.
(Apologies: the exact error is long gone, and at that point I thought I
was dealing with a simple problem.)

Go to repository machine.  To be safe do a recover.

"svnadmin recover {path to repos}"
"Acquiring exclusive lock on repository db, and running recovery procedures.
Please stand by . . ."

Some 10 minutes later with nothing more, I ctrl-c the job.  Think.  DAMN!
apache is still running.  Stop apache.  Now try to recover.  Same result
"Please stand by . . ." then nothing.  Checking the process usage, svnadmin
is sitting at 10-15% of CPU.

Check filesystem.  It's full.  Bother.  Remove heaps of stuff. Filesystem
now down to 60%.  Try recover again.  Same result "Please stand by . . ."
then nothing more after 5-10 minutes.
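
For anyone retracing these steps, a full filesystem is worth ruling out before running recover at all.  A minimal Python sketch, assuming Python 3 on the repository machine; the 90% threshold and the "." path are illustrative placeholders, not anything defined by Subversion or Berkeley DB:

```python
import shutil


def percent_used(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def safe_to_recover(path: str, max_percent: float = 90.0) -> bool:
    """True if the filesystem has headroom (the threshold is an assumption)."""
    return percent_used(path) < max_percent


if __name__ == "__main__":
    repos_db = "."  # hypothetical: point this at the repository's db directory
    print(f"{percent_used(repos_db):.1f}% used")
```

The figure can differ slightly from what "df" prints, since "df" usually excludes space reserved for root.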

Is there anything there?  Try "svnadmin lsrevs {path to repos}".  This seems
to work fine; after 50 or so revisions have gone by, ctrl-c out.  Try
recover, as above.

OK, let's try a dump.  Similar behaviour to recover, i.e. nothing happens.

Try a remote checkout (with apache restarted)
"svn: RA layer request failed
svn: PROPFIND of /: 405 Method Not Allowed"

Try a local checkout (file:///)
At one point during this process this didn't work.  Of course, now that I've
run it again to get the error message, it works <SIGH/>

I've left out other minor details such as all the stops/restarts of apache,
removing & recreating the repository lockfile, access_log & error_log
messages.

So I'm left with:
- Local checkout works
- Remote checkout (on local or remote machine) fails
- svnadmin "hangs" on recover & dump, lsrevs works

This is on 0.15, on Solaris 8, apache 2.0.43, bdb 4.0.14


Suggestions?

PJH


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Repository "hung"

Posted by Brandon Ehle <az...@yahoo.com>.
> 
>
>One last question - lsof ?  did you mean ls -of ?  Or is lsof some other
>utility I'm unaware of?  (note, -f has different meanings on linux and
>solaris)
>  
>
No, the command is "lsof" (list open files); it shows which processes have
these files open.



Re: Repository "hung"

Posted by Branko Čibej <br...@xbc.nu>.
Peter Howard wrote:

>I was wrong there.  Going back through the lsof output, it consistently gets
>to log.0000000094, then spends a long time with just the files above (or a
>subset) open, before hitting a segmentation fault.
>
>I built 0.16.1 this afternoon and the behaviour of svnadmin is identical,
>and repeatable; always to log.0000000094.
>

That's not surprising, since "svnadmin recover" is really just a
reimplementation of BDB's "db_recover", tuned to fit Subversion. The
segfault is most likely in BDB code.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



RE: Repository "hung"

Posted by Peter Howard <pj...@coastal.net.au>.

> -----Original Message-----
> From: Philip Martin [mailto:pm@home.coastal.net.au]On Behalf Of Philip
> Martin
> Sent: Friday, January 10, 2003 2:57 PM
> To: dev@subversion.tigris.org
> Subject: Re: Repository "hung"
> 
> 
> "Peter Howard" <pj...@coastal.net.au> writes:
> 
> > 2) svnadmin dump stops after revision 106 as well, though without error.  I
> > have my WC, but that means I lose revisions 107-164.  In this instance,
> > nothing to lose sleep over, but not very confidence-inspiring.  Any
> > suggestions on how to recover further?
> 
> Have you tried catastrophic (the -c option to db_recover) recovery?
> 

Gets to revision 100, then jumps to revision 150, then segfaults.

PJH

> -- 
> Philip Martin
> 
> 


Re: Repository "hung"

Posted by Philip Martin <ph...@codematters.co.uk>.
"Peter Howard" <pj...@coastal.net.au> writes:

> 2) svnadmin dump stops after revision 106 as well, though without error.  I
> have my WC, but that means I lose revisions 107-164.  In this instance,
> nothing to lose sleep over, but not very confidence-inspiring.  Any
> suggestions on how to recover further?

Have you tried catastrophic (the -c option to db_recover) recovery?

-- 
Philip Martin


Phoney revisions (was: Repository "hung")

Posted by Peter Howard <pj...@coastal.net.au>.
I've given up on the existing repository, started a new one, loaded in
the old dump and then committed my working copy over the top.  During the
process I made the following discoveries:

- svnadmin dump actually made it (with both svn and db built in debug) up to
revision 109.
- Going back through the log files (and cross checking against my memory as
to what I had been doing back on Jan 6 when it died) I discovered that
revision 109 was either the very last or second last revision that I had
committed before the repository went ga-ga.  The log files for revisions
110/111-165 do not correspond to any real checkins.
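
The rebuild described above can be sketched as a short command sequence.  The Python below only builds the shell command strings and does not run anything; the function name and the paths are hypothetical, and it assumes the usual behaviour that "svnadmin dump" writes to stdout and "svnadmin load" reads from stdin:

```python
import shlex
from typing import List


def rebuild_commands(old_repos: str, new_repos: str, dump_file: str) -> List[str]:
    """Return shell commands to dump an old repository and load it into a new one."""
    old_q = shlex.quote(old_repos)
    new_q = shlex.quote(new_repos)
    dump_q = shlex.quote(dump_file)
    return [
        # Dump whatever revisions are still readable from the damaged repository.
        f"svnadmin dump {old_q} > {dump_q}",
        # Create a fresh, empty repository.
        f"svnadmin create {new_q}",
        # Replay the dumped revisions into the new repository.
        f"svnadmin load {new_q} < {dump_q}",
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in rebuild_commands("/srv/old-repos", "/srv/new-repos", "/tmp/old.dump"):
        print(cmd)
```

Committing the working copy over the top is left out, since how that is done depends on the state of the working copy.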

So I've kept nearly all/all of my work (yay!).  But where the fsck did those
other log files come from?  I'm now running svn 0.16.1 and db4.1.24 (hit the
problem with 0.15 and db4.0.14) and hoping it never reappears.

This is now just information in case anyone else hits something similar.

PJH

(PS - I too have noticed the slowdown going from db 4.0 to 4.1 - but as I'm
feeling slightly paranoid about database integrity right now I'm living with
it)





RE: Repository "hung"

Posted by Peter Howard <pj...@coastal.net.au>.

> -----Original Message-----
> From: Brandon Ehle [mailto:azverkan@yahoo.com]
> Sent: Thursday, January 09, 2003 1:44 AM
> To: Peter Howard
> Cc: Subversion Dev list
> Subject: Re: Repository "hung"
>
>
> >
> >
> >I was wrong there.  Going back through the lsof output, it consistently gets
> >to log.0000000094, then spends a long time with just the files above (or a
> >subset) open, before hitting a segmentation fault.
> >
> >I built 0.16.1 this afternoon and the behaviour of svnadmin is identical,
> >and repeatable; always to log.0000000094.
> >
> >Anyone with an idea of what could be going on?  Next step? (other than a
> >debug build and running it in gdb, which I'll do next)
> >
> >
> If you have the ability to do so, I'd recompile db in DIAGNOSTIC mode
> and run the process under the debugger.  If not a stack trace would
> probably be a good place to start.

First thing to note: building db with --with-debug meant it now gets to
log.0000000106.  The stack trace is:

Program received signal SIGSEGV, Segmentation fault.
0xff273de4 in __log_register_recover (dbenv=0x31228, dbtp=0x1a7cf8,
    lsnp=0x0, op=DB_TXN_BACKWARD_ROLL, info=0x19ac18) at ../log/log_rec.c:139
139					__db_err(dbenv,
(gdb) bt
#0  0xff273de4 in __log_register_recover (dbenv=0x31228, dbtp=0x1a7cf8,
    lsnp=0x0, op=DB_TXN_BACKWARD_ROLL, info=0x19ac18)
    at ../log/log_rec.c:139
#1  0xff23b8a0 in __db_dispatch (dbenv=0x31228, dtab=0x0, db=0xffbef5c0,
    lsnp=0xffbef598, redo=DB_TXN_BACKWARD_ROLL, info=0x19ac18)
    at ../db/db_dispatch.c:202
#2  0xff250ecc in __db_apprec (dbenv=0x31228, max_lsn=0x0, flags=4290704832)
    at ../env/env_recover.c:385
#3  0xff24ec1c in __dbenv_open (dbenv=0x31228,
    db_home=0x2db70 "/usr/local/apache2/htdocs/ARMTechWin/db", flags=161825,
    mode=438) at ../env/env_open.c:288
#4  0xff340a9c in svn_fs_berkeley_recover (
    path=0x2d900 "/usr/local/apache2/htdocs/ARMTechWin/db", pool=0x2d230)
    at subversion/libsvn_fs/fs.c:637
#5  0xff38fe84 in svn_repos_recover (
    path=0x2b430 "/usr/local/apache2/htdocs/ARMTechWin", pool=0x2b228)
    at subversion/libsvn_repos/repos.c:1003
#6  0x120d8 in subcommand_recover (os=0x2b350, baton=0xffbefb68,
    pool=0x2b228)

Line 139 is:
139					__db_err(dbenv,
140					    "Improper file close. LSN: %lu/%lu.",
141					    (u_long)lsnp->file, (u_long)lsnp->of

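and lsnp is NULL (frame #0 of the stack trace shows lsnp=0x0), so building
the error message dereferences a NULL pointer.  Frame #1 shows a non-NULL
lsnp (0xffbef598) in __db_dispatch(), so it seems that lsnp is being
cleared in __db_dispatch().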
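
One way to double-check which frames were handed a NULL pointer is to scan the backtrace text for arguments equal to 0x0.  A rough Python sketch; the function name and the regular expressions are assumptions about the layout of gdb's "bt" output as pasted above, not part of gdb or Berkeley DB:

```python
import re
from typing import Dict, List

# "#0  0xff273de4 in func (" or "#0  func (": capture the function name.
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\w+) \(")
# An argument printed as exactly 0x0, e.g. "lsnp=0x0".
NULL_ARG = re.compile(r"(\w+)=0x0\b")


def null_pointer_args(backtrace: str) -> Dict[str, List[str]]:
    """Map each function in the backtrace to its arguments shown as 0x0."""
    found: Dict[str, List[str]] = {}
    current = None
    for line in backtrace.splitlines():
        match = FRAME_START.match(line)
        if match:
            current = match.group(2)
            found.setdefault(current, [])
        if current is not None:
            # Continuation lines belong to the most recent frame.
            found[current].extend(NULL_ARG.findall(line))
    return {func: args for func, args in found.items() if args}
```

Run over the trace above, this would flag lsnp in __log_register_recover but not in __db_dispatch, which is consistent with the pointer being lost between those two frames.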

Two questions from this:
1) Is this now an issue for subversion, or should it be referred to the
Berkeley DB mob?  I'm using 4.0.14

2) svnadmin dump stops after revision 106 as well, though without error.  I
have my WC, but that means I lose revisions 107-164.  In this instance,
nothing to lose sleep over, but not very confidence-inspiring.  Any
suggestions on how to recover further?

PJH



Re: Repository "hung"

Posted by Brandon Ehle <az...@yahoo.com>.
> 
>
>I was wrong there.  Going back through the lsof output, it consistently gets
>to log.0000000094, then spends a long time with just the files above (or a
>subset) open, before hitting a segmentation fault.
>
>I built 0.16.1 this afternoon and the behaviour of svnadmin is identical,
>and repeatable; always to log.0000000094.
>
>Anyone with an idea of what could be going on?  Next step? (other than a
>debug build and running it in gdb, which I'll do next)
>  
>
If you have the ability to do so, I'd recompile db in DIAGNOSTIC mode 
and run the process under the debugger.  If not a stack trace would 
probably be a good place to start.



RE: Repository "hung"

Posted by Peter Howard <pj...@coastal.net.au>.

> -----Original Message-----
> From: Peter Howard [mailto:pjh@coastal.net.au]
> Sent: Monday, January 06, 2003 6:38 PM
> To: Brandon Ehle
> Cc: Subversion Dev list
> Subject: RE: Repository "hung"
>
>
>
>
> > -----Original Message-----
> > From: Peter Howard [mailto:pjh@coastal.net.au]
> > Sent: Thursday, December 19, 2002 7:07 AM
> > To: Brandon Ehle
> > Cc: Subversion Dev list
> > Subject: RE: Repository "hung"
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Brandon Ehle [mailto:azverkan@yahoo.com]
> > > Sent: Thursday, December 19, 2002 4:18 AM
> > > To: Peter Howard
> > > Cc: Subversion Dev list
> > > Subject: Re: Repository "hung"
> > >
> > > Before running recover, go to the repository/db directory and run "lsof
> > > *".  If you see anything accessing the files kill it.  Then run
> > > recover.  While recover is running you can use "lsof *" to watch its
> > > progress (in a catastrophic recovery, it will walk through the log files
> > > a few at a time).
> >
> > Tried to do that, but it had fixed itself :-)  I shut the
> server down last
> > night in frustration and only started it again now (9 hours
> later) and the
> > recover took about 2 seconds.  Remote access works now.  I _had_ done a
> > reboot last night but that didn't get things working.  There were also a
> > couple of files with access permissions in the db dir, but again
> > I had fixed
> > that as I went last night, so why now?
> >
> > So that leaves me moderately bemused as to the exact problem.  But if it
> > happens again, I'll check the semaphore situation.
>
> Guess what?  It did it to me again today. This time I've checked the
> semaphore state prior to "fixing" (which I have failed to magically do yet).
> No semaphores, one shared memory segment.  If I leave svnadmin recover
> running long enough it dies with a seg fault.
>
> I had lsof running on a 20 second loop (+r 20) while svnadmin was running.
> The final loop listed the following files being accessed:
>
>
> ./db/log.0000000001
> ./db/nodes
> ./db/revisions
> ./db/transactions
> ./db/copies
> ./db/changes
> ./db/representations
> ./db/strings
>
> There's 165 revisions in the repository, but it never got past the first
> logfile.
>

I was wrong there.  Going back through the lsof output, it consistently gets
to log.0000000094, then spends a long time with just the files above (or a
subset) open, before hitting a segmentation fault.

I built 0.16.1 this afternoon and the behaviour of svnadmin is identical,
and repeatable; always to log.0000000094.

Anyone with an idea of what could be going on?  Next step? (other than a
debug build and running it in gdb, which I'll do next)

PJH



RE: Repository "hung"

Posted by Peter Howard <pj...@coastal.net.au>.

> -----Original Message-----
> From: Peter Howard [mailto:pjh@coastal.net.au]
> Sent: Thursday, December 19, 2002 7:07 AM
> To: Brandon Ehle
> Cc: Subversion Dev list
> Subject: RE: Repository "hung"
>
>
>
>
> > -----Original Message-----
> > From: Brandon Ehle [mailto:azverkan@yahoo.com]
> > Sent: Thursday, December 19, 2002 4:18 AM
> > To: Peter Howard
> > Cc: Subversion Dev list
> > Subject: Re: Repository "hung"
> >
> >
> > >
> > >
> > >Some 10 minutes later with nothing more, I ctrl-c the job.
> Think.  DAMN!
> > >apache is still running.  Stop apache.  Now try to recover.
> Same result
> > >"Please stand by . . ." then nothing.  Checking the process
> > usage, svnadmin
> > >is sitting at 10-15% of CPU.
> > >
> > >
> > Verify that ALL the "httpd" processes are gone, as shutting down apache
> > normally when something wedges the repository rarely ever results in all
> > apache processes exiting, then manually remove the leaked semaphores
> > with "ipcs" and "ipcrm -s".  Any semaphores under the user/group
> > "apache" that are still alive after apache exited need to be removed, if
> > not, then eventually your machine will run out of semaphores and need a
> > reboot.
> >
> > >Check filesystem.  It's full.  Bother.  Remove heaps of stuff.
> Filesystem
> > >now down to 60%.  Try recover again.  Same result "Please
> stand by . . ."
> > >then nothing more after 5-10 minutes.
> > >
> > >
> > Before running recover, go to the repository/db directory and run "lsof
> > *".  If you see anything accessing the files kill it.  Then run
> > recover.  While recover is running you can use "lsof *" to watch its
> > progress (in a catastrophic recovery, it will walk through the log files
> > a few at a time).
>
> Tried to do that, but it had fixed itself :-)  I shut the server down last
> night in frustration and only started it again now (9 hours later) and the
> recover took about 2 seconds.  Remote access works now.  I _had_ done a
> reboot last night but that didn't get things working.  There were also a
> couple of files with access permissions in the db dir, but again
> I had fixed
> that as I went last night, so why now?
>
> So that leaves me moderately bemused as to the exact problem.  But if it
> happens again, I'll check the semaphore situation.

Guess what?  It did it to me again today. This time I've checked the
semaphore state prior to "fixing" (which I have failed to magically do yet).
No semaphores, one shared memory segment.  If I leave svnadmin recover
running long enough it dies with a seg fault.

I had lsof running on a 20 second loop (+r 20) while svnadmin was running.
The final loop listed the following files being accessed:


./db/log.0000000001
./db/nodes
./db/revisions
./db/transactions
./db/copies
./db/changes
./db/representations
./db/strings

There's 165 revisions in the repository, but it never got past the first
logfile.

Suggestions?  Ideas?  Note: this is still using 0.15, so does it sound like
anything subsequently fixed?

Thanks

PJH



RE: Repository "hung"

Posted by Peter Howard <pj...@coastal.net.au>.

> -----Original Message-----
> From: Brandon Ehle [mailto:azverkan@yahoo.com]
> Sent: Thursday, December 19, 2002 4:18 AM
> To: Peter Howard
> Cc: Subversion Dev list
> Subject: Re: Repository "hung"
>
>
> >
> >
> >Some 10 minutes later with nothing more, I ctrl-c the job.  Think.  DAMN!
> >apache is still running.  Stop apache.  Now try to recover.  Same result
> >"Please stand by . . ." then nothing.  Checking the process
> usage, svnadmin
> >is sitting at 10-15% of CPU.
> >
> >
> Verify that ALL the "httpd" processes are gone, as shutting down apache
> normally when something wedges the repository rarely ever results in all
> apache processes exiting, then manually remove the leaked semaphores
> with "ipcs" and "ipcrm -s".  Any semaphores under the user/group
> "apache" that are still alive after apache exited need to be removed, if
> not, then eventually your machine will run out of semaphores and need a
> reboot.
>
> >Check filesystem.  It's full.  Bother.  Remove heaps of stuff. Filesystem
> >now down to 60%.  Try recover again.  Same result "Please stand by . . ."
> >then nothing more after 5-10 minutes.
> >
> >
> Before running recover, go to the repository/db directory and run "lsof
> *".  If you see anything accessing the files kill it.  Then run
> recover.  While recover is running you can use "lsof *" to watch its
> progress (in a catastrophic recovery, it will walk through the log files
> a few at a time).

Tried to do that, but it had fixed itself :-)  I shut the server down last
night in frustration and only started it again now (9 hours later) and the
recover took about 2 seconds.  Remote access works now.  I _had_ done a
reboot last night but that didn't get things working.  There were also a
couple of files with access permissions in the db dir, but again I had fixed
that as I went last night, so why now?

So that leaves me moderately bemused as to the exact problem.  But if it
happens again, I'll check the semaphore situation.

One last question - lsof ?  did you mean ls -of ?  Or is lsof some other
utility I'm unaware of?  (note, -f has different meanings on linux and
solaris)

PJH



Re: Repository "hung"

Posted by Brandon Ehle <az...@yahoo.com>.
> 
>
>Some 10 minutes later with nothing more, I ctrl-c the job.  Think.  DAMN!
>apache is still running.  Stop apache.  Now try to recover.  Same result
>"Please stand by . . ." then nothing.  Checking the process usage, svnadmin
>is sitting at 10-15% of CPU.
>  
>
Verify that ALL the "httpd" processes are gone: shutting down apache
normally when something has wedged the repository rarely results in all
apache processes exiting.  Then manually remove the leaked semaphores
with "ipcs" and "ipcrm -s".  Any semaphores under the user/group
"apache" that are still alive after apache has exited need to be removed;
otherwise your machine will eventually run out of semaphores and need a
reboot.
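
The semaphore cleanup above can be sketched in Python.  This is only a sketch under assumptions: the two column layouts it recognises (a Solaris/BSD-style "T ID KEY MODE OWNER GROUP" listing and a Linux-style "key semid owner perms nsems" listing) are guesses at what "ipcs -s" prints on a given system, so check them against real output first.  The function names are hypothetical, and nothing here executes "ipcrm"; it only builds the argument lists.

```python
from typing import List


def semaphore_ids_for_owner(ipcs_output: str, owner: str) -> List[str]:
    """Return the IDs of semaphores owned by `owner` in `ipcs -s` style text."""
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "s":
            # Assumed Solaris/BSD layout: T ID KEY MODE OWNER GROUP
            sem_id, sem_owner = fields[1], fields[4]
        elif len(fields) >= 3 and fields[0].startswith("0x"):
            # Assumed Linux layout: key semid owner perms nsems
            sem_id, sem_owner = fields[1], fields[2]
        else:
            continue  # header, blank line, or an unrecognised format
        if sem_owner == owner:
            ids.append(sem_id)
    return ids


def ipcrm_commands(sem_ids: List[str]) -> List[List[str]]:
    """Build one `ipcrm -s <id>` argument list per leaked semaphore."""
    return [["ipcrm", "-s", sem_id] for sem_id in sem_ids]
```

Review the generated commands by hand before running any of them, and only after confirming that no httpd process is still alive.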

>Check filesystem.  It's full.  Bother.  Remove heaps of stuff. Filesystem
>now down to 60%.  Try recover again.  Same result "Please stand by . . ."
>then nothing more after 5-10 minutes.
>  
>
Before running recover, go to the repository/db directory and run "lsof 
*".  If you see anything accessing the files, kill it.  Then run 
recover.  While recover is running you can use "lsof *" to watch its 
progress (in a catastrophic recovery, it will walk through the log files 
a few at a time).
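
To follow that progress without reading every lsof listing by eye, the highest log file number seen so far can be pulled out of the output.  A minimal Python sketch, assuming the log files follow the log.NNNNNNNNNN (ten digit) naming that appears in this thread; the function name is hypothetical and the lsof output is passed in as plain text:

```python
import re
from typing import Optional

# Berkeley DB log files as they appear in this thread, e.g. log.0000000094
LOG_NAME = re.compile(r"log\.(\d{10})\b")


def highest_log_number(lsof_output: str) -> Optional[int]:
    """Return the largest log file number mentioned in the text, if any."""
    numbers = [int(m.group(1)) for m in LOG_NAME.finditer(lsof_output)]
    return max(numbers) if numbers else None
```

If the number stops advancing across several listings, recovery is probably stuck on that log file.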


