Posted to dev@subversion.apache.org by Michel Jouvin <jo...@lal.in2p3.fr> on 2005/02/19 09:32:49 UTC

Frequent database corruption

Hi,

I am forwarding my original message to users@ on to the dev@ list, following Karl's answer.

As I told Karl, I am ready to do some testing if it can help to analyse
and solve this issue. This is not very urgent for me if switching to FSFS
(as suggested by Karl) is an acceptable workaround in terms of performance
and reliability. I'll do it next week.

I also have some material (HTTP logs, db backup, recover outputs...)
available, as mentioned, and can send it if useful.

I also understand that 1.2 is not far off and that it brings some changes
to the BDB integration. If that sounds more productive, I can wait for 1.2
before starting further investigations.

Best regards,

Michel

---------- Forwarded Message ----------
Date: Friday 18 February 2005 11:08 -0600
From: kfogel@collab.net
To: Michel Jouvin <jo...@lal.in2p3.fr>
Subject: Re: Frequent database corruption

Michel Jouvin <jo...@lal.in2p3.fr> writes:
> We started a production Subversion server a couple of months ago. We are
> now running Apache 2.0.52 + Subversion 1.1.3 + Db 4.2.52 on Tru64 Unix
> 5.1B.
>
> We quite frequently experience repository database corruption on all of
> our repositories (7, with very different sizes). In previous versions of
> Subversion (before 1.1.2 I would say), we were generally able to fix these
> corruptions with svnadmin recover. We are now experiencing more and more
> corruptions that can't be fixed (svnadmin recover fails), where the only
> solution is a repository restore from backup.
>
> The first corruptions we experienced generally occurred during commit,
> especially on large repositories. When we looked at possible causes for
> these corruptions, we found that one reason was we were running 2 Apache
> servers on 2 different nodes in a cluster configuration (cluster file
> system, no NFS involved). We shut down one of the servers and it more or
> less solved the corruption during commits. This remains strange as the
> cluster file system has a pure local file system semantics and we never
> experienced such problems with other databases or other Db usage.
>
> Now we experienced corruptions not related to any repository write. We
> have log files showing successful repository access through HTTP GET
> followed by a GET failure due to database corruption without any
> repository modification in between and without any Apache
> problem/restart. We suspected that these corruptions were related to
> Apache restart during a transaction but we now have evidence that
> corruption can occur at any time without any repository modification. We
> have Apache log files and corrupted repository copies.
>
> Generally svnadmin recover fails on these corruptions. Sometimes we were
> able to fix corruptions by recover + verify as documented in a note. We
> also have a directory that we restored from backup and needed to repair
> before having it accessible again. In this case we had to use recover +
> verify. And verify + recover definitely corrupts the repository.
>
> Please could you let us know if this is a known problem (I saw a couple of
> issue entries related to similar problems, but it is unclear whether they
> are really the same) and whether there is any workaround? Is FSFS an
> alternative to consider?
>
> Thanks in advance for any help. Let us know what material we could provide
> to help in troubleshooting, if this seems necessary.

Hmmm.  I don't know why you're having these problems, but they are not
unfamiliar to us.  Maybe it has something to do with being on a 64-bit
system, though that's just a wild guess, I have nothing to back it up
with.

Yes, I suggest using FSFS at least for now.  We're working on
improving Subversion's usage of Berkeley DB (the problems are with how
we use it, not with BDB itself).  Very few people have problems as
severe as you are experiencing, and these problems have been hard for
us to reproduce reliably.  You sound like you can reproduce them
pretty reliably, though, so if you want to resend your description to
the dev@subversion.tigris.org list, there might (can't promise) be a
developer interested in using you as a reproduction environment, if
you're willing.  (I wish I could, but my personal stack is full right
now.)

Sorry for the troubles.  I hope the situation improves for you,
-Karl

P.S.  By the way, we try not to say "corruption" if data has not been
      corrupted.  The issue is that your data is not accessible, but
      it has not been corrupted, from your description.
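
For reference, the switch to FSFS that Karl suggests is usually done by
dumping the Berkeley DB repository and loading the dump into a freshly
created FSFS repository. The following Python sketch is illustrative only:
the function names and any paths passed in are hypothetical, and it assumes
the standard `svnadmin create --fs-type fsfs`, `svnadmin dump` and
`svnadmin load` subcommands are available on the PATH. Hook scripts and
files under conf/ are not carried over by dump/load and would have to be
copied separately.

```python
import subprocess
from typing import List


def fsfs_migration_commands(old_repo: str, new_repo: str) -> List[List[str]]:
    """Return the three svnadmin invocations for a BDB -> FSFS dump/load."""
    return [
        ["svnadmin", "create", "--fs-type", "fsfs", new_repo],
        ["svnadmin", "dump", old_repo],
        ["svnadmin", "load", new_repo],
    ]


def migrate_to_fsfs(old_repo: str, new_repo: str) -> None:
    """Create new_repo as FSFS and pipe a dump of old_repo into it."""
    create, dump, load = fsfs_migration_commands(old_repo, new_repo)
    subprocess.run(create, check=True)

    # Equivalent of: svnadmin dump OLD | svnadmin load NEW
    dumper = subprocess.Popen(dump, stdout=subprocess.PIPE)
    try:
        subprocess.run(load, stdin=dumper.stdout, check=True)
    finally:
        dumper.stdout.close()
        returncode = dumper.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, dump)
```

The old repository has to be readable for the dump to succeed, so a BDB
repository that is currently inaccessible may need `svnadmin recover` (or a
restore from backup) before this can be attempted.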

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org


---------- End Forwarded Message ----------



     *************************************************************
     * Michel Jouvin                 Email : jouvin@lal.in2p3.fr *
     * LAL / CNRS                    Tel : +33 1 64468932        *
     * B.P. 34                       Fax : +33 1 69079404        *
     * 91898 Orsay Cedex                                         *
     * France                                                    *
     *************************************************************



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: Frequent database corruption

Posted by Philip Martin <ph...@codematters.co.uk>.
Michel Jouvin <jo...@lal.in2p3.fr> writes:

> As I said, it could (maybe) explain the problem we had when running
> two instances. But this is NO LONGER the case.
>
> It's not clear to me what you call shared mapping. If you are talking
> about mmap(), it is synchronized cluster wide and this should not be
> the problem (for example we have an IMAP server relying on mmap() and
> we never experienced any corruption in our cluster config). But I
> suggest concentrating on the corruption that occurs in the single
> Apache instance configuration during GET operations.

If you build Berkeley DB from source it includes a test suite.  You
might like to try running that to see how well BDB works on your
filesystem.

-- 
Philip Martin


Re: Frequent database corruption

Posted by Michel Jouvin <jo...@lal.in2p3.fr>.

--On Saturday 19 February 2005 21:12 +0100 Bastian Blank 
<ba...@waldi.eu.org> wrote:

> On Sat, Feb 19, 2005 at 08:06:23PM +0100, Michel Jouvin wrote:
>> > Which cluster filesystem?
>> Tru64/TruCluster Cluster File System. I am not sure about Florian's
>> question: the file system (including locks) has the full semantics of
>> a local file system, but I am not sure about system mutexes. In fact
>> it is possible to create shared memory between cluster nodes, but it
>> has to be done explicitly.
>>
>> Anyway, as I explained below, we now run a configuration where only
>> one Apache server handles Subversion requests, and I have httpd logs
>> to prove that. I understand that there may be something wrong in our
>> configuration if we run several concurrent Apache instances on
>> several nodes accessing the same repositories. But the database
>> corruptions we have now do not occur in this configuration and are
>> not even related to commits or other repository modifications.
>
> You have to use fsfs. Berkeley DB shared memory needs shared mapping
> where updates show up in any other mapping. This behavior is not
> required by POSIX and is not possible to do with cluster fs without
> large overhead.
>
> Bastian

As I said, it could (maybe) explain the problem we had when running two
instances. But this is NO LONGER the case.

It's not clear to me what you call shared mapping. If you are talking
about mmap(), it is synchronized cluster wide and this should not be the
problem (for example we have an IMAP server relying on mmap() and we never
experienced any corruption in our cluster config). But I suggest
concentrating on the corruption that occurs in the single Apache instance
configuration during GET operations.

Michel

>
> --
> I'm a soldier, not a diplomat.  I can only tell the truth.
> 		-- Kirk, "Errand of Mercy", stardate 3198.9
>





Re: Frequent database corruption

Posted by Bastian Blank <ba...@waldi.eu.org>.
On Sat, Feb 19, 2005 at 08:06:23PM +0100, Michel Jouvin wrote:
> >Which cluster filesystem?
> Tru64/TruCluster Cluster File System. I am not sure about Florian's
> question: the file system (including locks) has the full semantics of a
> local file system, but I am not sure about system mutexes. In fact it is
> possible to create shared memory between cluster nodes, but it has to be
> done explicitly.
> 
> Anyway, as I explained below, we now run a configuration where only one
> Apache server handles Subversion requests, and I have httpd logs to prove
> that. I understand that there may be something wrong in our configuration
> if we run several concurrent Apache instances on several nodes accessing
> the same repositories. But the database corruptions we have now do not
> occur in this configuration and are not even related to commits or other
> repository modifications.

You have to use fsfs. Berkeley DB shared memory needs shared mapping
where updates show up in any other mapping. This behavior is not
required by POSIX and is not possible to do with cluster fs without
large overhead.

Bastian

-- 
I'm a soldier, not a diplomat.  I can only tell the truth.
		-- Kirk, "Errand of Mercy", stardate 3198.9
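
The "shared mapping" Bastian refers to is a shared mmap() of the same file:
a write made through one mapping is expected to become visible through
every other mapping of that file. Below is a minimal Python sketch of that
property. It assumes a single host and a local filesystem, uses a made-up
function name, and says nothing about how the Tru64 cluster file system
behaves across nodes.

```python
import mmap
import os
import tempfile


def shared_mapping_demo() -> bytes:
    """Write through one shared mapping, read the bytes back through another."""
    fd, path = tempfile.mkstemp()
    try:
        # The file must have a non-zero size before it can be mapped.
        os.write(fd, b"\x00" * mmap.PAGESIZE)

        # Two independent shared mappings of the same file.
        writer = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
        reader = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        try:
            writer[0:5] = b"hello"
            # With coherent shared mappings the reader sees the update
            # without any explicit flush or re-read of the file.
            return bytes(reader[0:5])
        finally:
            reader.close()
            writer.close()
    finally:
        os.close(fd)
        os.remove(path)
```

As Bastian describes it, Berkeley DB's shared memory relies on this kind of
coherence between all processes that open the environment. The sketch only
exercises two mappings inside one process; checking the same property
between two cluster nodes would need a separate experiment.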


Re: Frequent database corruption

Posted by Florian Weimer <fw...@deneb.enyo.de>.
* Mark Benedetto King:

> I suspect, then, that one of these other web applications is doing
> something evil (like calling exit()).

This does not cause repository corruption (unless you specified the
DB_TXN_NOSYNC flag in DB_CONFIG, but you really shouldn't do this).
Processes accessing the repository may hang after you do this, but
"svnadmin recover" is certainly able to restore a consistent
repository.
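
The flag Florian mentions would show up as a `set_flags DB_TXN_NOSYNC` line
in the repository's db/DB_CONFIG file. A small Python sketch to check for
it follows; the function name is hypothetical, and lines starting with `#`
are treated as comments.

```python
def has_txn_nosync(db_config_text: str) -> bool:
    """Return True if a DB_CONFIG body sets DB_TXN_NOSYNC via set_flags."""
    for raw_line in db_config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        tokens = line.split()
        if tokens[:2] == ["set_flags", "DB_TXN_NOSYNC"]:
            return True
    return False


# Example use (the path is a placeholder for a real repository):
#   with open("/path/to/repo/db/DB_CONFIG") as handle:
#       print(has_txn_nosync(handle.read()))
```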


Re: Frequent database corruption

Posted by Michel Jouvin <jo...@lal.in2p3.fr>.
In fact I have several repositories, with only a subset accessed through
WebSVN. The only ones to be corrupted are those accessed by WebSVN. This is
the reason for my guess.

Michel

--On Monday 21 February 2005 11:40 -0500 Mark Benedetto King 
<mb...@lowlatency.com> wrote:

> On Mon, Feb 21, 2005 at 04:05:58PM +0100, Michel Jouvin wrote:
>> Trying to think about the circumstances in which corruptions occur, I have
>> the feeling that the problem affects only repositories accessed by svn
>> clients and WebSVN (which is a client too!). Could that make sense?
>>
>> Michel
>>
>
> Well, anything is possible. :-)
>
> Could you try disabling the WebSVN for a little while to see if the
> problem goes away?
>
> --ben
>
>





Re: Frequent database corruption

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Mon, Feb 21, 2005 at 04:05:58PM +0100, Michel Jouvin wrote:
> Trying to think about the circumstances in which corruptions occur, I have
> the feeling that the problem affects only repositories accessed by svn
> clients and WebSVN (which is a client too!). Could that make sense?
> 
> Michel
> 

Well, anything is possible. :-)

Could you try disabling the WebSVN for a little while to see if the
problem goes away?

--ben



Re: Frequent database corruption

Posted by Michel Jouvin <jo...@lal.in2p3.fr>.
Trying to think about the circumstances in which corruptions occur, I have
the feeling that the problem affects only repositories accessed by svn
clients and WebSVN (which is a client too!). Could that make sense?

Michel

--On Sunday 20 February 2005 11:56 +0100 Michel Jouvin 
<jo...@lal.in2p3.fr> wrote:

>
>
> --On Saturday 19 February 2005 16:23 -0500 Mark Benedetto King
> <mb...@lowlatency.com> wrote:
>
>> On Sat, Feb 19, 2005 at 08:06:23PM +0100, Michel Jouvin wrote:
>>> >
>>> >>> Now we experienced corruptions not related to any repository write.
>>> >>> We have log files showing successful repository access through HTTP
>>> >>> GET followed by a GET failure due to database corruption without any
>>> >>> repository modification in between and without any Apache
>>> >>> problem/restart. We suspected that these corruptions were related to
>>> >>> Apache restart during a transaction but we now have evidence that
>>> >>> corruption can occur at any time without any repository
>>> >>> modification. We have Apache log files and corrupted repository
>>> >>> copies.
>>> >
>>> > Is this with two httpd nodes or one?
>>> >
>>> > Is the httpd being used for anything else (CGI, PHP, etc?), or is it
>>> > dedicated for mod_dav_svn use?
>>> >
>>>
>>> Subversion server is implemented through a virtual host that is handled
>>> by  only one node running only one Apache instance (with threads
>>> worker). This  server also handles requests for a lot of other Web apps,
>>> including CGI,  PHP.
>>>
>>
>> I suspect, then, that one of these other web applications is doing
>> something evil (like calling exit()).
>>
>> Could you try running mod_dav_svn on a dedicated Apache instance (no
>> other web applications) and see whether the problems go away?
>>
>> If that fixes it, and you have reasons why they need to seem like the
>> same instance, perhaps you could try using mod_proxy to integrate the
>> two.
>>
>> --ben
>>
>
> Setting up such a test config could be possible, except we don't have the
> time and hardware required right now. But I don't see any apps doing an
> exit() or something like that. It would imply an Apache restart, and
> there was no such restart (as confirmed by the Apache logs).
>
> Michel
>
>
>
>





Re: Frequent database corruption

Posted by Michel Jouvin <jo...@lal.in2p3.fr>.

--On Saturday 19 February 2005 16:23 -0500 Mark Benedetto King 
<mb...@lowlatency.com> wrote:

> On Sat, Feb 19, 2005 at 08:06:23PM +0100, Michel Jouvin wrote:
>> >
>> >>> Now we experienced corruptions not related to any repository write.
>> >>> We have log files showing successful repository access through HTTP
>> >>> GET followed by a GET failure due to database corruption without any
>> >>> repository modification in between and without any Apache
>> >>> problem/restart. We suspected that these corruptions were related to
>> >>> Apache restart during a transaction but we now have evidence that
>> >>> corruption can occur at any time without any repository modification.
>> >>> We have Apache log files and corrupted repository copies.
>> >
>> > Is this with two httpd nodes or one?
>> >
>> > Is the httpd being used for anything else (CGI, PHP, etc?), or is it
>> > dedicated for mod_dav_svn use?
>> >
>>
>> Subversion server is implemented through a virtual host that is handled
>> by  only one node running only one Apache instance (with threads
>> worker). This  server also handles requests for a lot of other Web apps,
>> including CGI,  PHP.
>>
>
> I suspect, then, that one of these other web applications is doing
> something evil (like calling exit()).
>
> Could you try running mod_dav_svn on a dedicated Apache instance (no
> other web applications) and see whether the problems go away?
>
> If that fixes it, and you have reasons why they need to seem like the
> same instance, perhaps you could try using mod_proxy to integrate the
> two.
>
> --ben
>

Setting up such a test config could be possible, except we don't have the
time and hardware required right now. But I don't see any apps doing an
exit() or something like that. It would imply an Apache restart, and there
was no such restart (as confirmed by the Apache logs).

Michel





Re: Frequent database corruption

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Sat, Feb 19, 2005 at 08:06:23PM +0100, Michel Jouvin wrote:
> >
> >>> Now we experienced corruptions not related to any repository write. We
> >>> have log files showing successful repository access through HTTP GET
> >>> followed by a GET failure due to database corruption without any
> >>> repository modification in between and without any Apache
> >>> problem/restart. We suspected that these corruptions were related to
> >>> Apache restart during a transaction but we now have evidence that
> >>> corruption can occur at any time without any repository modification.
> >>> We have Apache log files and corrupted repository copies.
> >
> >Is this with two httpd nodes or one?
> >
> >Is the httpd being used for anything else (CGI, PHP, etc?), or is it
> >dedicated for mod_dav_svn use?
> >
> 
> Subversion server is implemented through a virtual host that is handled by 
> only one node running only one Apache instance (with threads worker). This 
> server also handles requests for a lot of other Web apps, including CGI, 
> PHP.
> 

I suspect, then, that one of these other web applications is doing
something evil (like calling exit()).

Could you try running mod_dav_svn on a dedicated Apache instance (no
other web applications) and see whether the problems go away?

If that fixes it, and you have reasons why they need to seem like the
same instance, perhaps you could try using mod_proxy to integrate the
two.

--ben



Re: Frequent database corruption

Posted by Michel Jouvin <jo...@lal.in2p3.fr>.

--On Saturday 19 February 2005 10:11 -0500 Mark Benedetto King 
<mb...@lowlatency.com> wrote:

> On Sat, Feb 19, 2005 at 10:32:49AM +0100, Michel Jouvin wrote:
>> >
>> > The first corruptions we experienced generally occurred during commit,
>> > especially on large repositories. When we looked at possible causes for
>> > these corruptions, we found that one reason was we were running 2
>> > Apache servers on 2 different nodes in a cluster configuration
>> > (cluster file system, no NFS involved). We shut down one of the servers
>> > and it more or less solved the corruption during commits. This remains
>> > strange as the cluster file system has a pure local file system
>> > semantics and we never experienced such problems with other databases
>> > or other Db usage.
>> >
>
> Which cluster filesystem?

Tru64/TruCluster Cluster File System. I am not sure about Florian's
question: the file system (including locks) has the full semantics of a
local file system, but I am not sure about system mutexes. In fact it is
possible to create shared memory between cluster nodes, but it has to be
done explicitly.

Anyway, as I explained below, we now run a configuration where only one
Apache server handles Subversion requests, and I have httpd logs to prove
that. I understand that there may be something wrong in our configuration
if we run several concurrent Apache instances on several nodes accessing
the same repositories. But the database corruptions we have now do not
occur in this configuration and are not even related to commits or other
repository modifications.

>
>> > Now we experienced corruptions not related to any repository write. We
>> > have log files showing successful repository access through HTTP GET
>> > followed by a GET failure due to database corruption without any
>> > repository modification in between and without any Apache
>> > problem/restart. We suspected that these corruptions were related to
>> > Apache restart during a transaction but we now have evidence that
>> > corruption can occur at any time without any repository modification.
>> > We have Apache log files and corrupted repository copies.
>
> Is this with two httpd nodes or one?
>
> Is the httpd being used for anything else (CGI, PHP, etc?), or is it
> dedicated for mod_dav_svn use?
>

Subversion server is implemented through a virtual host that is handled by 
only one node running only one Apache instance (with threads worker). This 
server also handles requests for a lot of other Web apps, including CGI, 
PHP.


>> >
>> > Generally svnadmin recover fails on these corruptions. Sometimes we
>> > were able to fix corruptions by recover + verify as documented in a
>> > note. We also have a directory that we restored from backup and needed
>> > to repair before having it accessible again. In this case we had to
>> > use recover + verify. And verify + recover definitely corrupts the
>> > repository.
>
> What error message do you get when recover fails?
>

Attached are 2 recover outputs for the same database. The first one
(*-success) comes from running recover (which fails), followed by verify
(which succeeds), followed by another recover (which succeeds without
doing any recovery).

The second log file (*-error) corresponds to running verify first. In this
case the database is impossible to recover.

I have a backup of the database before doing recovery, I can send it if 
this is useful.
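
The order of operations reported above as working (recover, then verify,
then recover again) can be written down as a small Python sketch. The
assumptions are: the paths and function names are hypothetical, the
repository is copied aside before anything touches it, and nothing else
(Apache included) is accessing the repository while `svnadmin recover`
runs. Exit codes are recorded rather than treated as fatal, since the first
recover is reported to fail.

```python
import shutil
import subprocess
from typing import List, Tuple


def recovery_sequence(repo: str) -> List[List[str]]:
    """svnadmin invocations in the order reported to work: recover, verify, recover."""
    return [
        ["svnadmin", "recover", repo],
        ["svnadmin", "verify", repo],
        ["svnadmin", "recover", repo],
    ]


def run_recovery(repo: str, backup_dir: str) -> List[Tuple[str, int]]:
    """Back up the repository, then run the sequence and collect exit codes."""
    # Keep an untouched copy first; backup_dir must not exist yet.
    shutil.copytree(repo, backup_dir)

    results = []
    for command in recovery_sequence(repo):
        completed = subprocess.run(command)
        results.append((" ".join(command), completed.returncode))
    return results
```

This only reproduces the reported order of commands; it is not a claim that
the sequence repairs every repository in this state.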


>
> --ben
>

Michel

>





Re: Frequent database corruption

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Sat, Feb 19, 2005 at 10:32:49AM +0100, Michel Jouvin wrote:
> >
> >The first corruptions we experienced generally occurred during commit,
> >especially on large repositories. When we looked at possible causes for
> >these corruptions, we found that one reason was we were running 2 Apache
> >servers on 2 different nodes in a cluster configuration (cluster file
> >system, no NFS involved). We shut down one of the servers and it more or
> >less solved the corruption during commits. This remains strange as the
> >cluster file system has a pure local file system semantics and we never
> >experienced such problems with other databases or other Db usage.
> >

Which cluster filesystem?

> >Now we experienced corruptions not related to any repository write. We
> >have log files showing successful repository access through HTTP GET
> >followed by a GET failure due to database corruption without any
> >repository modification in between and without any Apache
> >problem/restart. We suspected that these corruptions were related to
> >Apache restart during a transaction but we now have evidence that
> >corruption can occur at any time without any repository modification. We
> >have Apache log files and corrupted repository copies.

Is this with two httpd nodes or one?

Is the httpd being used for anything else (CGI, PHP, etc?), or is it
dedicated for mod_dav_svn use?

> >
> >Generally svnadmin recover fails on these corruptions. Sometimes we were
> >able to fix corruptions by recover + verify as documented in a note. We
> >also have a directory that we restored from backup and needed to repair
> >before having it accessible again. In this case we had to use recover +
> >verify. And verify + recover definitely corrupts the repository.

What error message do you get when recover fails?


--ben

