Posted to dev@subversion.apache.org by Denis Kovalchuk <de...@visualsvn.com> on 2020/05/01 21:28:27 UTC

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

> So the failure mode is that rep-cache.db entries are not able to be
> added while rep-cache.db is being verified.  I am not convinced that
> this is a problem.

I think that this case leads to multiple problems:
1) Commits happening at the same time as verify will hang for the duration of
the SQLite busy timeout.
2) These commits will complete with a post-commit error.
3) No entries will be added to rep-cache.db. Unless this is fixed later,
subsequent commits may work without deduplication.

Repository verification can be a routine procedure, and for large repositories
it can take a long time. In that case, the problems above can also occur
regularly.

> The obvious workaround, in case someone runs into that, is for the
> admin to run build-repcache afterwards.  (Incidentally, we might want
> have the "post-commit FS processing failed" error message namedrop
> «svnadmin build-repcache» when the underlying error code indicates an
> SQLite error.)

I believe the patch fixes the problem in a way that requires no additional
workarounds and does not worsen any other characteristics. At the same time,
I would like to note that the suggested workaround does not fix all of the
problems: it fixes only 3), and only if it is run at the right time.

> Yes, build-repcache will error out in this situation.  Why is that a problem?

I think this is a problem for similar reasons. It may be unexpected that the
read-only verify operation can cause an error in another maintenance operation
such as 'svnadmin build-repcache'. To completely avoid such failures, it may be
necessary to separate the operations in time, and that is not always possible.

> What effect does the patch have on _correctness_ of verification?  What
> kinds of corruptions can be detected by the incumbent code but not with
> your patch?
>
> Isn't there an alternative approach that does not affect the correctness
> of verification as much as this approach?

The patch does not change the verification process itself, only the way
entries are fetched. In the previous approach, entries were fetched in one
large query; now they are fetched in small batches.

In this way:
1) If rep-cache.db is not changed during its verification, the new approach
should be equivalent to the previous one.
2) If new entries are added to rep-cache.db during its verification, the new
approach guarantees that all entries that existed at the time 'svnadmin verify'
was invoked will be verified.

So I think that it should not affect the correctness of verification.
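
To illustrate only the batching idea (this is a conceptual sketch, not the
patch itself, which does the batching inside libsvn_fs_fs; the table and
column names follow the FSFS rep-cache schema but should be treated as
assumptions here), a keyset-paginated read over rep-cache.db looks roughly
like:

[[[
# Sketch: read rep-cache entries in small, ordered batches so that no
# single read transaction spans the whole verification run.
DB=repo/db/rep-cache.db
BATCH=1000
last=''
while :; do
  rows=$(sqlite3 "$DB" \
    "SELECT hash FROM rep_cache WHERE hash > '$last' ORDER BY hash LIMIT $BATCH;")
  [ -z "$rows" ] && break
  # ... verify the representations behind these hashes here ...
  last=$(printf '%s\n' "$rows" | tail -n 1)
done
]]]

Each batch is its own short-lived read, so concurrent INSERTs can proceed
between batches, while every entry that existed at the start is still visited.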

Regards,
Denis Kovalchuk

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Shahaf wrote on Thu, 07 May 2020 21:58 +0000:
> What I'm trying to do here is to get a _problem statement_ to which your
> patch is a solution, so we can consider other solutions.  You wrote
> above
> .
> > rep-cache verification leads to errors for other writes to the rep-cache  
> .
> which is accurate, but doesn't explain _why_ that is a problem (other
> than in terms of the three listed consequences, but you also said fixing
> at least one of _those_ wouldn't address your use-case); and you also
> wrote
> .
> > rep-cache verification runs with holding a SQLite lock for a long time  
> .
> which is an accurate description of the implementation, but is not
> a problem statement.
> 
> I hope I don't sound like a broken record here, but I do think getting
> a clear problem statement is important.  Importantly, the problem
> statement should be made _without reference to the patch_.  Explain
> where the incumbent code falls short and in what way.

The way to get a problem statement is:

1. Write a sentence that explains the problem.
2. Write a sentence that is what you would say if someone read the last
sentence you wrote and responded by enquiring "Why is that a problem?".
3. Goto 2.

The termination condition (not spelled out) is "unless, in your
professional opinion, the explanation is sufficiently detailed".  That
is admittedly nebulous, though I suspect it's closer to being in
consensus than to being subjective.

> > The limitation seems too strong and so I think it would be better to
> > fix the problem with holding a SQLite lock during entire verification.
> >   
> > > For example, how about sharding rep-cache.db to multiple separate
> > > databases, according to the most significant nibbles of the hash value?
> > > This way, instead of having one big database, we'd have 2**(4N) smaller
> > > databases.    
> > 
> > If I am not mistaken, the sizes of such shards will still be O(N), but not
> > O(1), as in the patch. Therefore, this problem can still occur at some
> > rep-cache sizes.  
> 
> (Please don't overload variables; that makes conversation needlessly
> difficult.)
> 
> With the proposal, the number of shards will be O(number of reps /
> 2**(number of significant bits used to identify a shard)).  That's an
> inverse exponential in the number of bits used to identify a shard, so
> if the shards grow too large in the admin's opinion, the admin simply
> has to double the number of bits used to identify a shard.

s/double/increase/

(Exponential functions usually go with O(1) _additive_ increases, as
opposed to multiplicative.  My bad.)
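
For concreteness, shard selection under that proposal could look like the
sketch below (the number of nibbles, the file naming, and the use of the
representation's SHA-1 are illustrative assumptions, not a worked-out design):

[[[
# Hypothetical shard selection: the first N hex nibbles of the
# representation's SHA-1 choose one of 2**(4N) database files.
N=2
hash=8843d7f92416211de9ebb963ff4ce28125932878    # placeholder SHA-1
shard=$(printf '%s' "$hash" | cut -c1-"$N")      # e.g. "88"
db="rep-cache-$shard.db"                         # one of 256 shards for N=2
]]]

Growing N by one adds a nibble to the prefix and multiplies the shard count
by 16, which is the O(1) additive increase mentioned above.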

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Fri, 22 May 2020 10:02 +0200:
> On Fri, May 22, 2020 at 12:16 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >
> > Johan Corveleyn wrote on Thu, 21 May 2020 17:53 +0200:  
> > > Trying to summarize this thread a bit. I apologize in advance if I
> > > forgot something, or have misrepresented any of the points that were
> > > raised (feel free to correct / add).
> > >  
> >
> > I have additions, below.
> >  
> > > Denis summed up the following problems that might happen while
> > > 'verify' locks the repcache.db:
> > >  
> > > > 1. a post-commit error "database is locked"
> > > > 2. new representations will not be added in the rep-cache.db
> > > > 3. deduplication does not work for new data committed at this time
> > > > 4. commits work with delays.  
> > >
> > > We have also established that the new tool build-repcache is not
> > > suitable for post-factum fixing of 3). It does not reprocess already
> > > committed revisions.  
> >
> > Note, however, that (3) is simply an observation that Denis made about
> > the semantics of the incumbent code.  We do not know of a user who needs
> > rep-cache.db entries to be added so soon after the commit that running
> > «build-repcache» after «verify» finishes would be too late (or impractical).
> >  
> > > We are currently considering two approaches to address these issues:
> > >  
> >
> > Hold on.  Those issues are what happens *whenever* the post-commit FS
> > processing INSERTs are starved.  They happen when a «verify» starves
> > INSERTs, but they are known to have at least one other case: when two
> > commits happen in quick succession, the INSERTs that happen during
> > post-commit FS processing of the first commit can starve the second
> > commit's INSERTs.  (Don't we have a jira issue for this variant?  I
> > thought we did, but my searches came up dry.)
> >
> > Which is to say, the presentation of the matter as "four issues to be
> > fixed" isn't accurate.  There are _eight_ issues to consider: (1) to (4)
> > when they are caused by a «verify» starving INSERTs, and (1) to (4) when
> > they are caused by an «INSERT» starving INSERTs.
> >
> > Thus, your analysis of the pros and cons is incomplete: it glosses over
> > the case of an «INSERT» starving other INSERTs.  
> 
> Good point. So those "4 problems" are all consequences of "rep-cache
> insertion starvation". One possible cause of this might be 'verify',
> but there are other known causes, such as 'locked by another commit'.
> I'm not sure how likely that last form of starvation is in practice,
> though (but if you've seen a jira issue of it, hmmm, maybe).
> 

It would happen on /repos/asf every time certain PMCs commit a new
release's javadocs.  (I'm not sure whether it still does; those PMCs
may have switched to git since I last looked.)

> > Furthermore, over the course of the thread several other ideas have been
> > floated, intended to address various subsets of the eight issues.  The
> > "Perform INSERTs asynchronously" idea, for example (raised by me, but
> > inspired by a remark of Brane's), can basically fix all eight issues by
> > itself.  There was also the idea to stop marshalling post-commit FS
> > processing errors to the client, which would fix both variants of (1)
> > but none of the other six.  I think there were other ideas as well, but
> > I've run out of time for this thread for tonight, sorry.  
> 
> Ack.
> So IIUC these are addressing the consequences of the starvation (which
> might help regardless of the cause), rather than taking away the
> locking-causes.
> 

Some and some.

- The OP is preventative for the «verify» cause and a no-op for other
  causes.

- Sharding rep-cache is preventative: it effectively makes the critical
  section shorter in duration, for all existing use-cases.
  
- Performing INSERTs asynchronously (and polylithically, i.e., in
  N O(1)-sized transactions) is preventative for the cases of INSERTs
  starving either other INSERTs or readers, and a no-op for the case of
  readers starving INSERTs.
  
  (Speaking of which, here's another thought: add an option flag to
  build-repcache that makes it wait indefinitely in case it's locked out
  of rep-cache.db by a concurrent process; a rough sketch of the intended
  behaviour follows this list.  Denis, WDYT?)
  
- Not marshalling post-commit FS processing errors to the client is
  purely a failure mode handling thing.
  
- ⋮
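
To make that aside concrete, here is a rough admin-side approximation of the
proposed "wait until unlocked" behaviour (the flag itself does not exist;
this loop retries on any failure, not just on a lock error, and the paths
are placeholders):

[[[
REPO=/srv/svn/repo
OLD=$(svnlook youngest "$REPO")   # or the revision noted before «verify» began
until svnadmin build-repcache "$REPO" -r "$OLD:HEAD"; do
  sleep 10                        # rep-cache.db presumably still locked; retry
done
]]]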

> I think both 'angles' are quite useful. We might even want to combine
> ... mitigating the consequences, as well as avoiding some of the
> causes for the starvation (because even with consequence-mitigation
> things might not turn out ideal, or only be mitigated at the cost of
> X, so fixing some known causes can still be valuable).
> 

Yeah, for example, the "don't marshal" thing probably makes sense
regardless of anything else.  There's a reason GitHub doesn't page its
users when one of the disks in its RAID arrays needs replacement…

Cheers,

Daniel

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, May 22, 2020 at 12:16 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>
> Johan Corveleyn wrote on Thu, 21 May 2020 17:53 +0200:
> > Trying to summarize this thread a bit. I apologize in advance if I
> > forgot something, or have misrepresented any of the points that were
> > raised (feel free to correct / add).
> >
>
> I have additions, below.
>
> > Denis summed up the following problems that might happen while
> > 'verify' locks the repcache.db:
> >
> > > 1. a post-commit error "database is locked"
> > > 2. new representations will not be added in the rep-cache.db
> > > 3. deduplication does not work for new data committed at this time
> > > 4. commits work with delays.
> >
> > We have also established that the new tool build-repcache is not
> > suitable for post-factum fixing of 3). It does not reprocess already
> > committed revisions.
>
> Note, however, that (3) is simply an observation that Denis made about
> the semantics of the incumbent code.  We do not know of a user who needs
> rep-cache.db entries to be added so soon after the commit that running
> «build-repcache» after «verify» finishes would be too late (or impractical).
>
> > We are currently considering two approaches to address these issues:
> >
>
> Hold on.  Those issues are what happens *whenever* the post-commit FS
> processing INSERTs are starved.  They happen when a «verify» starves
> INSERTs, but they are known to have at least one other case: when two
> commits happen in quick succession, the INSERTs that happen during
> post-commit FS processing of the first commit can starve the second
> commit's INSERTs.  (Don't we have a jira issue for this variant?  I
> thought we did, but my searches came up dry.)
>
> Which is to say, the presentation of the matter as "four issues to be
> fixed" isn't accurate.  There are _eight_ issues to consider: (1) to (4)
> when they are caused by a «verify» starving INSERTs, and (1) to (4) when
> they are caused by an «INSERT» starving INSERTs.
>
> Thus, your analysis of the pros and cons is incomplete: it glosses over
> the case of an «INSERT» starving other INSERTs.

Good point. So those "4 problems" are all consequences of "rep-cache
insertion starvation". One possible cause of this might be 'verify',
but there are other known causes, such as 'locked by another commit'.
I'm not sure how likely that last form of starvation is in practice,
though (but if you've seen a jira issue of it, hmmm, maybe).

Denis' patch tries to fix the 'verify' cause of starvation.

Focusing on avoiding / fixing the *consequences* of the starvation
might be another approach, yes.

> Furthermore, over the course of the thread several other ideas have been
> floated, intended to address various subsets of the eight issues.  The
> "Perform INSERTs asynchronously" idea, for example (raised by me, but
> inspired by a remark of Brane's), can basically fix all eight issues by
> itself.  There was also the idea to stop marshalling post-commit FS
> processing errors to the client, which would fix both variants of (1)
> but none of the other six.  I think there were other ideas as well, but
> I've run out of time for this thread for tonight, sorry.

Ack.
So IIUC these are addressing the consequences of the starvation (which
might help regardless of the cause), rather than taking away the
locking-causes.

I think both 'angles' are quite useful. We might even want to combine
... mitigating the consequences, as well as avoiding some of the
causes for the starvation (because even with consequence-mitigation
things might not turn out ideal, or only be mitigated at the cost of
X, so fixing some known causes can still be valuable).

-- 
Johan

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Thu, 21 May 2020 17:53 +0200:
> Trying to summarize this thread a bit. I apologize in advance if I
> forgot something, or have misrepresented any of the points that were
> raised (feel free to correct / add).
> 

I have additions, below.

> Denis summed up the following problems that might happen while
> 'verify' locks the repcache.db:
> 
> > 1. a post-commit error "database is locked"
> > 2. new representations will not be added in the rep-cache.db
> > 3. deduplication does not work for new data committed at this time
> > 4. commits work with delays.  
> 
> We have also established that the new tool build-repcache is not
> suitable for post-factum fixing of 3). It does not reprocess already
> committed revisions.

Note, however, that (3) is simply an observation that Denis made about
the semantics of the incumbent code.  We do not know of a user who needs
rep-cache.db entries to be added so soon after the commit that running
«build-repcache» after «verify» finishes would be too late (or impractical).

> We are currently considering two approaches to address these issues:
> 

Hold on.  Those issues are what happens *whenever* the post-commit FS
processing INSERTs are starved.  They happen when a «verify» starves
INSERTs, but they are known to have at least one other case: when two
commits happen in quick succession, the INSERTs that happen during
post-commit FS processing of the first commit can starve the second
commit's INSERTs.  (Don't we have a jira issue for this variant?  I
thought we did, but my searches came up dry.)

Which is to say, the presentation of the matter as "four issues to be
fixed" isn't accurate.  There are _eight_ issues to consider: (1) to (4)
when they are caused by a «verify» starving INSERTs, and (1) to (4) when
they are caused by an «INSERT» starving INSERTs.

Thus, your analysis of the pros and cons is incomplete: it glosses over
the case of an «INSERT» starving other INSERTs.

Furthermore, over the course of the thread several other ideas have been
floated, intended to address various subsets of the eight issues.  The
"Perform INSERTs asynchronously" idea, for example (raised by me, but
inspired by a remark of Brane's), can basically fix all eight issues by
itself.  There was also the idea to stop marshalling post-commit FS
processing errors to the client, which would fix both variants of (1)
but none of the other six.  I think there were other ideas as well, but
I've run out of time for this thread for tonight, sorry.

> I'll add one more concern of my own here, regarding the 'sharding' approach:
> I'd like to warn for the NIHS (Not Invented Here Syndrome) that comes
> peeking around the corner if we say "SQLite might have subtle bugs
> that might hurt us if we do X, but rolling our own solution might be
> better". Why would "rolling our own solution" like sharding
> repcache.db be less susceptible to such subtle bugs than SQLite? Okay,
> on the one hand SQLite is more complex, because it's generic database
> software. But on the other hand it presumably has a lot more users /
> audience than just Subversion. I have no clear answer here.

Thanks, Johan.

Daniel

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Nathan Hartman <ha...@gmail.com>.
On Thu, May 21, 2020 at 11:53 AM Johan Corveleyn <jc...@gmail.com> wrote:
>
> Trying to summarize this thread a bit. I apologize in advance if I
> forgot something, or have misrepresented any of the points that were
> raised (feel free to correct / add).
>
> Denis summed up the following problems that might happen while
> 'verify' locks the repcache.db:
>
> > 1. a post-commit error "database is locked"
> > 2. new representations will not be added in the rep-cache.db
> > 3. deduplication does not work for new data committed at this time
> > 4. commits work with delays.
>
> We have also established that the new tool build-repcache is not
> suitable for post-factum fixing of 3). It does not reprocess already
> committed revisions.
>
> We are currently considering two approaches to address these issues:

Snip.

Both come with various pros/cons.

Another possibility: Handle this with documentation.

Early in the thread, Daniel points out:

> (Also, there's the option of rsync'ing the repository somewhere and
> running «verify» on the copy.

Although that doesn't solve the above-mentioned concerns for in-place
hot verification, it could be added to our docs as a recommendation.
Something along these lines:

[[[

Please note that while the representation cache is locked for
verification, any commits made during that time may be subject to the
following:

(Explain 1, 2, 3, 4)

Depending on your needs, these may or may not represent an actual
problem at your site. However, two workarounds are possible:

1. Disable access to the server for the duration of the 'svnadmin
verify' operation, or

2. Make a hotcopy of the repository and run 'svnadmin verify' on the
copy.

]]]

We might recommend that admins do this as part of their regular backup
routine, i.e., back up the repo and then immediately run verify on the
backup. Solves two problems at the same time.
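
For the docs, that recommendation could even carry a short example along
these lines (paths are placeholders):

[[[
BACKUP=/backups/repo-$(date +%F)
svnadmin hotcopy /srv/svn/repo "$BACKUP"
svnadmin verify "$BACKUP"
]]]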

Nathan

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Johan Corveleyn <jc...@gmail.com>.
Trying to summarize this thread a bit. I apologize in advance if I
forgot something, or have misrepresented any of the points that were
raised (feel free to correct / add).

Denis summed up the following problems that might happen while
'verify' locks the repcache.db:

> 1. a post-commit error "database is locked"
> 2. new representations will not be added in the rep-cache.db
> 3. deduplication does not work for new data committed at this time
> 4. commits work with delays.

We have also established that the new tool build-repcache is not
suitable for post-factum fixing of 3). It does not reprocess already
committed revisions.

We are currently considering two approaches to address these issues:

1) Let verify process the repcache entries in small batches, without
holding an SQLite lock (Denis' patch).

pro:
+ Fixes #1 through #4.

con:
- Relies more heavily on SQLite guarantees that all rows that were
present at the start of 'verify' are readable and correct after
verify has finished. SQLite might have subtle bugs in this area, and
verify should be as conservative / careful as possible.


2) Shard repcache.db to make the locking window smaller (Daniel's proposal).

pro:
+ Fixes #1 through #4 (if the shard size is 'small enough')

con:
- If we ever need atomic operations on the entire repcache, we need to
forbid rep-cache.db shards from using WAL mode and use the ATTACH
DATABASE statement with the master journal (which is rarely used and
is not supported by all journaling modes).
- Requires a format bump, which means it will only work if the admin has
run 'svnadmin upgrade'.
- May not fully fix the problem if the shard size is too large and
verification of a single shard still takes too much time (e.g. because
it's located on a network drive).


I'll add one more concern of my own here, regarding the 'sharding' approach:
I'd like to warn against the NIHS (Not Invented Here Syndrome) that comes
peeking around the corner if we say "SQLite might have subtle bugs
that might hurt us if we do X, but rolling our own solution might be
better". Why would "rolling our own solution" like sharding
repcache.db be less susceptible to such subtle bugs than SQLite? Okay,
on the one hand SQLite is more complex, because it's generic database
software. But on the other hand it presumably has a lot more users /
audience than just Subversion. I have no clear answer here.

HTH,
-- 
Johan

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
> > > 1. a post-commit error "database is locked"
> > > 2. new representations will not be added in the rep-cache.db
> > > 3. deduplication does not work for new data committed at this time
> > > 4. commits work with delays.

> > I look to hearing Denis's concerns with the sharding approach.
> 
>   In addition, if we ever need atomic operations on the entire
>   rep-cache, we will have to use ATTACH DATABASE statement [1] with the master
>   journal [2], which I think is not used anywhere now, and is not supported by
>   all journaling modes.
> 

Fair point.  Let's see the pros and cons:

- pro: fixes issues #1 through #4, and the "large commits starve the following
  commit" issue as well

- con: if someday we have a need for cross-shard atomic transactions, we'll
  have to either (1) forbid rep-cache.db shards from using WAL mode; or
  (2) figure out another way to solve whichever problem we would use atomic
  cross-shard transactions to solve if we had an unsharded rep-cache.db; or
  (3) revert to a monolithic rep-cache.db and figure out another way to solve
  the problems this thread is about.

> - If I'm not mistaken, this approach requires a format bump. So this does not
>   fix the problem for existing repositories. It is also necessary to perform
>   the division into shards in some form, which means that a fast in-place
>   upgrade also probably will not fix the problem.

An in-place upgrade would be straightforward.

> 

I've snipped most of your points as I already answered them upthread.  If
I didn't respond to some point which _hasn't_ already been addressed upthread,
just ask it again.

My concerns stand.

Daniel

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Denis Kovalchuk <de...@visualsvn.com>.
> > if a commit occurs during the rep-cache.db verification, this can lead to:
> > 1. a post-commit error "database is locked"
> > 2. new representations will not be added in the rep-cache.db
> > 3. deduplication does not work for new data committed at this time
> > 4. commits work with delays.
>
> As I said, you accurately describe the observed behaviour.  However,
> given the misunderstanding upthread, I would still like to ask you to
> make it unambiguously clear which of those four items are requisites of
> your use-case.

I'm not sure that I understand the question about the requisites of the use
case correctly. The use case I am talking about is running verify for a hot
repository. And the points listed above are negative consequences that occur in
this case. I think that the best way is to fix all these consequences, if
possible. The proposed patch does exactly that.

> I look to hearing Denis's concerns with the sharding approach.

It seems to me that this approach has at least the following potential
problems:

- There may be considerable difficulties associated with supporting multiple
  databases. For example, it may be necessary to open not just one, but up to
  all existing databases during a commit, which may affect the performance of
  the commit. In addition, if we ever need atomic operations on the entire
  rep-cache, we will have to use the ATTACH DATABASE statement [1] with the
  master journal [2], which I think is not used anywhere now and is not
  supported by all journaling modes. (A sketch of what that would look like
  follows after this list.)

- I assume that reading and verification of all entries in one shard will be
  performed while holding a SQLite lock, because otherwise we return to the
  variations of the proposed patch and the sharding approach will not be
  necessary at all. Then, if the main part of the verification (for example,
  reading the revision content) is performed while holding the lock, the
  problem may still occur in some cases, because this verification part can
  potentially take a long time (for example, if repositories are located on a
  network share). So the problem will not be completely fixed.

- If I'm not mistaken, this approach requires a format bump. So this does not
  fix the problem for existing repositories. It is also necessary to perform
  the division into shards in some form, which means that a fast in-place
  upgrade also probably will not fix the problem.
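
To make the ATTACH DATABASE point concrete, a cross-shard transaction would
look roughly like the sketch below (the shard file names and values are
hypothetical, and it assumes each shard file carries the usual rep_cache
schema; atomicity across the two files relies on the master journal, which
is why WAL mode would have to be forbidden for the shards):

[[[
sqlite3 rep-cache-0.db "
  ATTACH DATABASE 'rep-cache-1.db' AS shard1;
  BEGIN;
  -- one INSERT per shard; committing both atomically needs a master journal
  INSERT INTO rep_cache        VALUES ('00aa11', 10, 0, 4, 4);
  INSERT INTO shard1.rep_cache VALUES ('1bcd22', 10, 4, 8, 8);
  COMMIT;"
]]]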

> In this case, the tradeoff would seem to be among:
>
> - ship 1.14's «verify» and require «build-repcache» to be run afterwards;
>
> - ship the «verify» in the OP, about whose correctness we are less
>   certain, but which doesn't require running «build-repcache» afterwards;

Speaking about tradeoffs, I would like to note that these cases are not
equivalent, when it comes to visible behavior, because the requirement to run
build-repcache does not fix 1), 3) and 4).

[1] https://www.sqlite.org/lang_attach.html
[2] https://www.sqlite.org/tempfiles.html#master_journal_files

Regards,
Denis Kovalchuk

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Sun, 10 May 2020 21:39 +0200:
> On Sun, May 10, 2020 at 8:26 PM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> > Denis Kovalchuk wrote on Sat, 09 May 2020 23:37 +0300:  
> ...
> > > Regarding the problem with verification depending on the guarantees
> > > provided by SQLite: If we cannot rely on SQLite guarantees, then I
> > > think we should not rely on the guarantee that the table does not
> > > change when it is read using one statement (which the current
> > > implementation relies on). In other words we should either rely on all
> > > SQLite guarantees or not rely on any.  
> >
> > I don't see things quite the same way as you.
> >
> > For starters, while I'm not familiar with SQLite's implementation,
> > I expect that an implementation that locks out writers entirely will be
> > less susceptible to bugs than one that permits concurrent writes.  In
> > other words, the assumption that «verify» in trunk makes is presumably
> > less likely to be violated than the assumption that the patch would have
> > it make.  
> ...
> > > Speaking about other approaches, such as sharding rep-cache.db: It may be a
> > > good approach, but I think there may and probably will be other associated
> > > problems that are currently unknown.  
> >
> > You have just committed FUD.  
> 
> From the peanut gallery: I don't see this as FUD. Not unless we also
> consider your statement about SQLite's potential lesser guarantees
> FUD,

I will reply to this part offlist.  I shall leave it to Johan to bring
back conclusions of the offlist discussion to the list, should and
insofar as he may choose to do so.  In the meantime, let's get the
thread back on topic.  I look forward to hearing Denis's concerns with the
sharding approach.
 
> Just another peanut: perhaps we could make the
> non-write-blocking-verify-behaviour optional? That is: if we have
> (even a theoretical) worry that it might lessen the guarantees of
> verify, let the admin decide? --allow-concurrent-writes or something
> (just an idea).

In general, we should push decisions onto administrators when they
involve tradeoffs wherein we are ill-placed to make a choice.

In this case, the tradeoff would seem to be among:

- ship 1.14's «verify» and require «build-repcache» to be run afterwards;

- ship the «verify» in the OP, about whose correctness we are less
  certain, but which doesn't require running «build-repcache» afterwards;

- ship some other solution.

Firstly, we can research the correctness of the patch in the OP's
approach: that is something we are _better_ placed to research than
admins.  Secondly, we should brainstorm what other approaches may be
possible, and consider the ones that have already been proposed.

I'll not post any further to this thread for a few days.

Daniel

P.S. When I say "the ones", the plural is correct.  The sharding
solution isn't the only alternative on the table.

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Johan Corveleyn <jc...@gmail.com>.
On Sun, May 10, 2020 at 8:26 PM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Denis Kovalchuk wrote on Sat, 09 May 2020 23:37 +0300:
...
> > Regarding the problem with verification depending on the guarantees
> > provided by SQLite: If we cannot rely on SQLite guarantees, then I
> > think we should not rely on the guarantee that the table does not
> > change when it is read using one statement (which the current
> > implementation relies on). In other words we should either rely on all
> > SQLite guarantees or not rely on any.
>
> I don't see things quite the same way as you.
>
> For starters, while I'm not familiar with SQLite's implementation,
> I expect that an implementation that locks out writers entirely will be
> less susceptible to bugs than one that permits concurrent writes.  In
> other words, the assumption that «verify» in trunk makes is presumably
> less likely to be violated than the assumption that the patch would have
> it make.
...
> > Speaking about other approaches, such as sharding rep-cache.db: It may be a
> > good approach, but I think there may and probably will be other associated
> > problems that are currently unknown.
>
> You have just committed FUD.

From the peanut gallery: I don't see this as FUD. Not unless we also
consider your statement about SQLite's potential lesser guarantees
FUD, were we to implement Denis' solution. In reality, anything we do
carries a bit of risk. Whether the risk is greater by leaning on more
SQLite guarantees, or rolling our own solution, is not clear to me.

FWIW, I do consider it an improvement if 'verify' would not hinder
commits. We should consider the different possible approaches, of
course, without exclusion / prejudice.

Just another peanut: perhaps we could make the
non-write-blocking-verify-behaviour optional? That is: if we have
(even a theoretical) worry that it might lessen the guarantees of
verify, let the admin decide? --allow-concurrent-writes or something
(just an idea).

-- 
Johan

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Denis Kovalchuk wrote on Sat, 09 May 2020 23:37 +0300:
> > What I'm trying to do here is to get a _problem statement_ to which your
> > patch is a solution, so we can consider other solutions.  
> 
> Partially repeating the first letter from this thread, the problem is the
> following: if a commit occurs during the rep-cache.db verification, this can
> lead to a post-commit error "database is locked" and new representations will
> not be added in the rep-cache.db. So a verification of a hot repository leads
> to user visible errors. Furthermore, deduplication does not work for new data
> committed at this time. And commits work with delays.

Okay.  Let me quote this again with annotations added:

> if a commit occurs during the rep-cache.db verification, this can lead to:
> 1. a post-commit error "database is locked"
> 2. new representations will not be added in the rep-cache.db
> 3. deduplication does not work for new data committed at this time
> 4. commits work with delays.

As I said, you accurately describe the observed behaviour.  However,
given the misunderstanding upthread, I would still like to ask you to
make it unambiguously clear which of those four items are requisites of
your use-case.

Would your use-case be adequately addressed if #1, #2, and #3 were fixed
(whether by committing your patch or otherwise), but #4 were not?

Would your use-case be adequately addressed if #1, #2, and #4 were fixed
(whether by committing your patch or otherwise), but #3 were not?

Would your use-case be adequately addressed if #1, #3, and #4 were fixed
(whether by committing your patch or otherwise), but #2 were not?

Would your use-case be adequately addressed if #2, #3, and #4 were fixed
(whether by committing your patch or otherwise), but #1 were not?

Thanks in advance for clarifying this.

My overall understanding from your previous emails was that only #2 is
a requisite.

Once this is settled, we can begin to consider other solutions which
will be acceptable to all involved parties.

> So running verify for a hot repository is problematic.

To be more precise, verifying a live repository can cause concurrent
processes to incur delays and report errors due to being unable to
read/write rep-cache.db, which in turn can necessitate build-repcache
runs.

> in reality repositories can always be hot and so it is not clear when
> to verify them at all.

The admin could deploy a workaround (e.g., verify a copy of the
repository, or split the repository into smaller repositories, etc.).

Alternatively, the admin might bite the bullet and verify the repository
hot anyway, incurring issues #1 through #4 above.  That is, they might
assess that course of action as preferable to the risk of not detecting
a corruption.

I agree that it would be nice to allow the admin to verify their
repository without incurring issues #1 through #4.  However, I don't
agree that the patch in the OP is the best way to address those issues.

> Speaking about correctness, I think the implementation in the patch is correct
> and as far as I understand there are no concerns about this.

I have no concerns about how faithfully the patch implements the
approach you had chosen.  I have concerns about the soundness of that
approach.

> Regarding the problem with verification depending on the guarantees
> provided by SQLite: If we cannot rely on SQLite guarantees, then I
> think we should not rely on the guarantee that the table does not
> change when it is read using one statement (which the current
> implementation relies on). In other words we should either rely on all
> SQLite guarantees or not rely on any.

I don't see things quite the same way as you.

For starters, while I'm not familiar with SQLite's implementation,
I expect that an implementation that locks out writers entirely will be
less susceptible to bugs than one that permits concurrent writes.  In
other words, the assumption that «verify» in trunk makes is presumably
less likely to be violated than the assumption that the patch would have
it make.

In general, the extents to which «verify» and «commit» should rely on SQLite
are not _necessarily_ the same.  They might _happen_ to be the same by
chance, but the fact that «commit» makes some assumption doesn't _imply_
that «verify» should make the same assumption.

However, it's certainly fair to question how much each use-case should
rely on SQLite's guarantees.  We have many options, such as:

- Rely on SQLite guarantees freely in both «verify» and «commit».
  (that's what the patch does)

- Rely on SQLite guarantees to a lesser extent in «verify» than
  in «commit».  (that's what trunk does)

- Rely on SQLite guarantees in «verify» and/or «commit» to a lesser
  extent than trunk does.  (Unclear what this means exactly.  Sharding
  rep-cache.db and using our own out-of-band locking?  Replacing
  rep-cache.db with some non-SQLite storage?  Something else?)

- …

> All alternative approaches where an administrator takes additional steps do not
> solve the problem in my opinion. Because it turns out that it’s better not to
> verify the hot repository at all, because additional steps will be required.

First of all, let's be clear: by "additional steps" you refer to running
«build-repcache» after «verify».

In most use-cases, an administrator who had chosen not to verify a repository
_because that would have required them to run build-repcache afterwards_
would have been fired.

> Speaking about other approaches, such as sharding rep-cache.db: It may be a
> good approach, but I think there may and probably will be other associated
> problems that are currently unknown.

You have just committed FUD.

Explain what problems those might be.

> I think the new proposed approach is an improvement of the current one, which
> also solves the problem. But if there are still significant concerns about that
> approach, I do not have anything else to offer at the moment. I may try to
> return to this topic if I have any additional thoughts.

You are welcome to return at any time.  However, whether your return
will be fruitful will depend on what approach you take.  If you continue
to refrain from openly considering solutions other than the one you
proposed, you will continue to experience dev@ as a brick wall too high
to throw patches over.  On the other hand, if you participate in the
discussion not with the intent to sell us on a particular, predetermined
solution but with the intent to jointly improve a collaborated-upon
product, you'll find we're a welcoming bunch.

Daniel

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Denis Kovalchuk <de...@visualsvn.com>.
> What I'm trying to do here is to get a _problem statement_ to which your
> patch is a solution, so we can consider other solutions.

Partially repeating the first message in this thread, the problem is the
following: if a commit occurs during the rep-cache.db verification, this can
lead to a post-commit error "database is locked" and new representations will
not be added in the rep-cache.db. So a verification of a hot repository leads
to user visible errors. Furthermore, deduplication does not work for new data
committed at this time. And commits work with delays. So running verify for a
hot repository is problematic. But then recommending not to verify hot
repositories is also a problem, because in reality repositories can always be
hot and so it is not clear when to verify them at all.

Speaking about correctness, I think the implementation in the patch is correct
and as far as I understand there are no concerns about this. Regarding the
problem with verification depending on the guarantees provided by SQLite: If we
cannot rely on SQLite guarantees, then I think we should not rely on the
guarantee that the table does not change when it is read using one statement
(which the current implementation relies on). In other words we should either
rely on all SQLite guarantees or not rely on any.

All alternative approaches where an administrator takes additional steps do
not, in my opinion, solve the problem, because they imply that it is better
not to verify a hot repository at all, since additional steps will be required.

Speaking about other approaches, such as sharding rep-cache.db: It may be a
good approach, but I think there may and probably will be other associated
problems that are currently unknown.

I think the new proposed approach is an improvement of the current one, which
also solves the problem. But if there are still significant concerns about that
approach, I do not have anything else to offer at the moment. I may try to
return to this topic if I have any additional thoughts.

Regards,
Denis Kovalchuk

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Denis Kovalchuk wrote on Tue, 05 May 2020 14:35 +0300:
> > Hang on.  In your previous reply you wrote that commits that "complete
> > with a post-commit error" was a problem the patch was designed to solve,
> > but now you say that's not an immediate concern.  Which is it?  We can't
> > design a solution if it's not clear what set of problems the solution is
> > supposed to solve.  
> 
> I don't think there is a contradiction somewhere. What I was trying to say is:
> - The problem is that rep-cache verification leads to errors for other writes
> to the rep-cache. For example, this happens during commit post-processing.
> - One visible consequence of this problem is that commits can hang.
> - Another visible consequence is that commits complete with post-commit errors.
> - Another visible consequence is that new data will not be deduplicated.
> - I think the root cause of problem is that entire rep-cache verification runs
> with holding a SQLite lock for a long time.
> - I don't think that there are issues with the existing way to report
> post-commit errors to clients. Also, I don't think that changing it would be
> relevant to the problem.

Thanks for clarifying this.  However, given that you simultaneously say
that commits terminating with errors is a "consequence" of the problem,
and that you think fixing that "[wouldn't] be relevant to the problem",
I don't understand exactly which problems you had _set out_ to fix; i.e.,
which problems' fixings were _a priori_ design goals of the patch, as
opposed to other effects of the patch, which, nice-to-have though they
may be, are dispensable insofar as your use-case is concerned.

For example, my understanding that "don't spam clients with post-commit
FS processing errors when a commit is concurrent with a «verify»" falls
into the latter class.

What I'm trying to do here is to get a _problem statement_ to which your
patch is a solution, so we can consider other solutions.  You wrote
above
.
> rep-cache verification leads to errors for other writes to the rep-cache
.
which is accurate, but doesn't explain _why_ that is a problem (other
than in terms of the three listed consequences, but you also said fixing
at least one of _those_ wouldn't address your use-case); and you also
wrote
.
> rep-cache verification runs with holding a SQLite lock for a long time
.
which is an accurate description of the implementation, but is not
a problem statement.

I hope I don't sound like a broken record here, but I do think getting
a clear problem statement is important.  Importantly, the problem
statement should be made _without reference to the patch_.  Explain
where the incumbent code falls short and in what way.

> > > For example, 'svnadmin verify' can be checking rep-cache for a long time and
> > > 'svnadmin build-repcache' will keep completing with an error in this case. All
> > > commits during this time will work without deduplication for new data.  
> >
> > I take it that in your use-case it's important that a commit's rep-cache rows
> > should be added as soon after the commit as possible, then?  
> 
> I think it is important that entries don't get lost and do not make
> deduplication worse.

Nobody is proposing that entries be lost.  The question is simply how
soon after the commit entries are required to be added to the rep-cache.
The question is _when_, not _whether_.

> The rep-cache verification can take a long time for real-world
> repositories, for example an hour or more. And all commits during
> this time will work without deduplication for new data and I think
> that currently there is no way to fix it.

Are you being facetious?  *You* wrote the build-repcache command, which
is the "way to fix it".

> In case of regular verifications, this can be a significant issue.
> 
> > Yes, and it's up to the repository administrator to provide such
> > guarantees when scheduling «verify» and «build-repcache» operations on
> > their repository.  
> 
> I think that in general it may be difficult to have such guarantees. For
> example, verification can be performed with the API or by third-party tools.

I'm sorry, but I am not convinced that a repository administrator is
unable to control when their own repository gets verified — even if they
use third-party tools.  Furthermore, were I so convinced, my next
question would be why locking can't be arranged (either by the
administrator, or in libsvn) to prevent «verify» and «build-repcache»
from running concurrently.

As to your implication that an administrator would be unable to control
access to their own repository because they use the API as opposed to
the CLI, I am having trouble imagining a scenario in which that would be
the case.  Would you illuminate me?

> And based on the above it would mean that it is necessary to also prohibit
> commits during verification.

Commits need not be prohibited during verifications.  (Besides, such
a requirement would constitute an incompatible API change unless
accompanied by a format bump.)

> The limitation seems too strong and so I think it would be better to
> fix the problem with holding a SQLite lock during entire verification.
> 
> > For example, how about sharding rep-cache.db to multiple separate
> > databases, according to the most significant nibbles of the hash value?
> > This way, instead of having one big database, we'd have 2**(4N) smaller
> > databases.  
> 
> If I am not mistaken, the sizes of such shards will still be O(N), but not
> O(1), as in the patch. Therefore, this problem can still occur at some
> rep-cache sizes.

(Please don't overload variables; that makes conversation needlessly
difficult.)

With the proposal, the number of shards will be O(number of reps /
2**(number of significant bits used to identify a shard)).  That's an
inverse exponential in the number of bits used to identify a shard, so
if the shards grow too large in the admin's opinion, the admin simply
has to double the number of bits used to identify a shard.

> > I understand this argument.  However, at the end of the day, the patch
> > makes «verify» read a database as it's being written to, which is more
> > likely to run into bugs (either in our code or in SQLite's), so I'm not
> > a fan of it.
> >
> > Maybe I'm being overly careful.  
> 
> I think that concurrent reading and writing to rep-cache.db is already possible
> in the case of concurrent commits. It is a bit unclear to me why the (optional)
> verification has to be different from the commit in this aspect.

Because «verify» exists to verify that invariants do in fact hold, and
it is not obvious that this can be accomplished (without false negatives)
by reading the database as it's being modified.

> Especially considering that the current approach leads to the problem.

I'm not disputing the problem.  I'm concerned about the proposed
solution and asking to consider alternatives.

> > P.S. How about using creating a temporary copy of rep-cache.db, and then
> > have «verify» iterate that at its leisure and delete it when done with it?  
> 
> I think that compared to the patch this approach has a problem that every
> verify will lead to a lot of write operations. For example, the size of
> rep-cache.db in the ASF repository is 1.1 GB.
> 

Yes, but on the other hand:

- The writes can be done on a RAM disk
  (or maybe an atomic filesystem snapshot can be taken?)

- The chances of «verify» returning a false negative will be lower

(Also, there's the option of rsync'ing the repository somewhere and
running «verify» on the copy.  Or the shards solution mentioned above.)

> Just in case, I will be happy to work on any specific things in the patch, if
> there are some things you think can be improved.

I read the patch when you posted it.  IIRC, it correctly implements the
approach you chose.  However, I am sceptical about that approach.  As
mentioned above, I don't understand why you would be certain that, once
«verify» ends, every single row in the database would be readable and
would emit correct values, even granting that every single row in the
database had that property at some point during the run.

Yes, SQLite promises us that that'd be the case, but why should
«verify», of all places, rely on that promise?

Cheers,

Daniel

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Denis Kovalchuk <de...@visualsvn.com>.
> Hang on.  In your previous reply you wrote that commits that "complete
> with a post-commit error" was a problem the patch was designed to solve,
> but now you say that's not an immediate concern.  Which is it?  We can't
> design a solution if it's not clear what set of problems the solution is
> supposed to solve.

I don't think there is a contradiction anywhere. What I was trying to say is:
- The problem is that rep-cache verification leads to errors for other writes
to the rep-cache. For example, this happens during commit post-processing.
- One visible consequence of this problem is that commits can hang.
- Another visible consequence is that commits complete with post-commit errors.
- Another visible consequence is that new data will not be deduplicated.
- I think the root cause of problem is that entire rep-cache verification runs
with holding a SQLite lock for a long time.
- I don't think that there are issues with the existing way to report
post-commit errors to clients. Also, I don't think that changing it would be
relevant to the problem.

> > For example, 'svnadmin verify' can be checking rep-cache for a long time and
> > 'svnadmin build-repcache' will keep completing with an error in this case. All
> > commits during this time will work without deduplication for new data.
>
> I take it that in your use-case it's important that a commit's rep-cache rows
> should be added as soon after the commit as possible, then?

I think it is important that entries don't get lost and do not make
deduplication worse. The rep-cache verification can take a long time for
real-world repositories, for example an hour or more. And all commits during
this time will work without deduplication for new data and I think that
currently there is no way to fix it. In case of regular verifications, this can
be a significant issue.

> Yes, and it's up to the repository administrator to provide such
> guarantees when scheduling «verify» and «build-repcache» operations on
> their repository.

I think that in general it may be difficult to have such guarantees. For
example, verification can be performed with the API or by third-party tools.

And based on the above it would mean that it is necessary to also prohibit
commits during verification. The limitation seems too strong and so I think it
would be better to fix the problem with holding a SQLite lock during entire
verification.

> For example, how about sharding rep-cache.db to multiple separate
> databases, according to the most significant nibbles of the hash value?
> This way, instead of having one big database, we'd have 2**(4N) smaller
> databases.

If I am not mistaken, the sizes of such shards will still be O(N), but not
O(1), as in the patch. Therefore, this problem can still occur at some
rep-cache sizes.

> I understand this argument.  However, at the end of the day, the patch
> makes «verify» read a database as it's being written to, which is more
> likely to run into bugs (either in our code or in SQLite's), so I'm not
> a fan of it.
>
> Maybe I'm being overly careful.

I think that concurrent reading and writing to rep-cache.db is already possible
in the case of concurrent commits. It is a bit unclear to me why the (optional)
verification has to be different from the commit in this aspect. Especially
considering that the current approach leads to the problem.

> P.S. How about using creating a temporary copy of rep-cache.db, and then
> have «verify» iterate that at its leisure and delete it when done with it?

I think that compared to the patch this approach has a problem that every
verify will lead to a lot of write operations. For example, the size of
rep-cache.db in the ASF repository is 1.1 GB.

Just in case, I will be happy to work on any specific things in the patch, if
there are some things you think can be improved.

Regards,
Denis Kovalchuk

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Denis Kovalchuk wrote on Sun, 03 May 2020 21:04 +0300:
> > For that matter, now that build-repcache is a thing, why should commits
> > populate rep-cache synchronously at all?  Why shouldn't we provide
> > a mode that says "svn_fs_commit_txn() won't populate the rep-cache for
> > you; you're expected to do it yourself via a post-commit hook"?  That'll
> > sidestep the problem entirely.  (Brane made a similar point a few days
> > ago.)
> >  
> > > 2) These commits will complete with a post-commit error.  
> >
> > There are other ways to address this.  For example, whenever the
> > post-commit insertion to rep-cache.db fails, instead of reporting the
> > error to the client, we could record the error in the server log but not
> > report it to the client.  After all, generally the client doesn't care
> > about that error and can't do anything to fix it.  
> 
> I am not too sure that the moment when the rep-cache is populated and the way
> errors are reported are currently causing problems.

Hang on.  In your previous reply you wrote that commits that "complete
with a post-commit error" was a problem the patch was designed to solve,
but now you say that's not an immediate concern.  Which is it?  We can't
design a solution if it's not clear what set of problems the solution is
supposed to solve.

Would you please state clearly the _complete_ set of problems which you
would like addressed (whether via the patch in the OP or otherwise)?
Once that is clarified, we can consider what options are available to
address them.

> I think the root cause of the problem is that the entire rep-cache
> verification runs with holding a SQLite lock. And I propose to fix it
> by slightly modified approach in the patch so that errors do not
> happen at all.
> 
> I think that with asynchronous rep-cache population the same errors will
> happen, but at another time.

Yes, they will.  Populating rep-cache asynchronously is not a silver
bullet: it will address root problems (1) and (2) [using the numbering
from your previous email], but not (3).  To solve (3), additional steps
will have to be taken (whether ones I have proposed or others).

There is also a (4): populating rep-cache asynchronously will cause
commits to take less time, as measured by clients.

> Even if errors are hidden from clients, administrators may not have a
> way to fix them.

"Administrators may not have a way to fix them"?  How might this happen?

You mentioned below that administrators may not be able to run
«build-repcache» because they won't be able to ascertain a window of
time during which «verify» won't run.  If that's what you refer to here,
then see my response below.

> > (There are things other than «verify» that can cause the "post-commit FS
> > processing" rep-cache INSERT to time out and error out.  For example,
> > that's known to happen to someone who makes a small commit that just
> > happens to occur right after a commit that has released the write lock
> > but hasn't finished inserting to rep-cache.db.  I don't have the issue
> > number handy, sorry.)  
> 
> If I understand correctly, this is the case when a large number of entries are
> added to the rep-cache during the post-commit processing stage, which leads to
> starvation for other additions.

Yes.

> It looks like a separate problem that can be fixed. And I assume that
> a similar approach can be used for this: add entries in small batches,
> instead of adding all the entries in one transaction.

That's one option.  Making all INSERTs asynchronous is another option
(and is not mutually exclusive of the first option).  There may be yet
other options.

> Also, I think that this problem should occur less often than the one mentioned
> above for verification, because:
> - During a commit, the lock is held only for the duration of adding new entries
> to the rep-cache.
> - During a verification, the lock is held for the duration of all content
> verification, which can take a lot of time.
> 

Sure.

> In general, it seems to me that committing a large number of changes may be
> rarer than simply verifying a repository with a large rep-cache. So I propose
> to fix the problem with verification first.

Sure.

(To be clear, I don't think that focussing on the issue of «verify»
starving INSERTs should rule out of consideration solutions that fix
_both_ that issue and the issue of INSERTs starving other INSERTs.  It
would be fine to fix either issue without the other, certainly, but
solutions that fix both problems simultaneously should be considered
too.)

> > > 3) No entries will be added to rep-cache.db. Unless fixed, following commits
> > > may work without deduplication.  
> >
> > Sure, but why is it not sufficient to run build-repcache afterwards?  
> 
> For example, 'svnadmin verify' can be checking the rep-cache for a long time,
> and in that case 'svnadmin build-repcache' will keep failing with an error. All
> commits made during this time will work without deduplication of new data.

I take it that in your use-case it's important that a commit's rep-cache rows
should be added as soon after the commit as possible, then?

> > What do you mean, "at the right time"?  Under what circumstances would
> > the following not suffice? —
> >
> >     set -e
> >     old=`svnlook youngest r`
> >     svnadmin verify r
> >     svnadmin build-repcache r -r ${old}:HEAD  
> 
> If I am not mistaken, this assumes that only one maintenance operation can
> happen on the repository at a time. But in general this approach requires a
> guarantee that no other verification is running, because otherwise
> 'svnadmin build-repcache' may fail with an error.

Yes, and it's up to the repository administrator to provide such
guarantees when scheduling «verify» and «build-repcache» operations on
their repository.

On our part, we don't administer the repository; we merely provide tools
that allow the repository to be administered.  We should make these
tools impose as few constraints as possible on the repository
administrator.  At the same time, however, we should ensure that
«verify» detects any corruptions that may be present.

> > Fair point, but the semantics of SQLite are what they are, and we have
> > to work with that.  (Unless you propose to replace rep-cache.db by some
> > non-SQLite storage?  Maybe SQLite isn't the right tool for the job after
> > all, even if it's the right tool for wc.db?  Or, alternatively, can we
> > achieve finer-grained locking within SQLite?)  
> 
> I do not think that there is a problem with using SQLite for these purposes,
> and I do not propose to replace it. I think the problem is how SQLite is used
> during verification, where an unbounded statement currently leads to the
> problem with concurrent writes to the rep-cache.

I understand, and I stand by my questions.  I would say that the root
problem here is that rep-cache.db is a monolithic design, which our
users outgrow as time passes.  So, should we change to a more polylithic
design, in order to allow finer-grained locking?

For example, how about sharding rep-cache.db to multiple separate
databases, according to the most significant nibbles of the hash value?
This way, instead of having one big database, we'd have 2**(4N) smaller
databases.
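
(Purely illustrative: a sketch of the shard selection I have in mind.  The
shard count and the naming scheme below are made up, not anything that
exists today.)

[[[
/* Hypothetical illustration: choose a shard database for a given SHA-1
   digest by its most significant nibbles.  With N_NIBBLES == 2 this
   gives 2**(4*2) == 256 small databases instead of one big one.  Both
   the constant and the file naming are invented for this sketch. */
#include <stdio.h>

#define N_NIBBLES 2

static void
shard_path_for_digest(const char *sha1_hex, char *buf, size_t bufsize)
{
  /* e.g. "d3beadf0..." -> "rep-cache-d3.db" */
  snprintf(buf, bufsize, "rep-cache-%.*s.db", N_NIBBLES, sha1_hex);
}
]]]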

> > When is it not possible to run «build-repcache» and «verify» sequentially?  
> 
> For example, when there is no way to guarantee that another verification is
> not running at the same time.

See above.

> > > 2) If new entries are added to the rep-cache.db during its verification, the
> > > new approach has a guarantee that all entries that existed at the time of the
> > > call of 'svnadmin verify' will be verified.  
> >
> > Yes, this too.  However, why would you assume in this case that just
> > because every row was valid when it was accessed during verification,
> > every row that was present when verification began is valid when the
> > verification finishes?
> >
> > In the current code, the answer to that question is "Because the
> > verification process took an SQLite lock, which made the table-reading
> > atomic".
> >
> > I don't know what the answer to that question is with this patch.  How
> > can you be sure that, at a particular moment in time, every row in the
> > table was valid simultaneously, if you don't read the table atomically?  
> 
> We maintain a separate cursor (sha1_digest) into the entries and walk them in
> order. The row values are immutable after they are added. So with the new
> approach we will walk at least all rows that existed at the time
> 'svnadmin verify' was called, and with the same values.
>
> The verification of a single row depends only on the values in that row. Those
> values are immutable, and the revision content in FSFS is also immutable. Based
> on this, I assume that if we find no errors in the rows we access, there are no
> errors in the rows that existed at the time 'svnadmin verify' was called.

I understand this argument.  However, at the end of the day, the patch
makes «verify» read a database as it's being written to, which is more
likely to run into bugs (either in our code or in SQLite's), so I'm not
a fan of it.

Maybe I'm being overly careful.

Cheers,

Daniel

P.S. How about creating a temporary copy of rep-cache.db, and then having
«verify» iterate that at its leisure and delete it when done with it?
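
Something along these lines, say (an untested sketch that uses SQLite's online
backup API directly rather than our svn_sqlite__ wrappers; error handling
elided):

[[[
/* Untested sketch: snapshot rep-cache.db into a temporary copy using
   SQLite's online backup API.  «verify» can then iterate the copy at
   its leisure; the live database is only read-locked for the duration
   of the copy itself, not for the whole verification. */
#include <sqlite3.h>

static int
snapshot_rep_cache(const char *src_path, const char *dst_path)
{
  sqlite3 *src = NULL, *dst = NULL;
  sqlite3_backup *backup;
  int rc;

  sqlite3_open_v2(src_path, &src, SQLITE_OPEN_READONLY, NULL);
  sqlite3_open(dst_path, &dst);

  backup = sqlite3_backup_init(dst, "main", src, "main");
  if (backup)
    {
      sqlite3_backup_step(backup, -1);     /* copy all pages in one pass */
      rc = sqlite3_backup_finish(backup);  /* SQLITE_OK if the copy succeeded */
    }
  else
    rc = sqlite3_errcode(dst);

  sqlite3_close(src);
  sqlite3_close(dst);
  return rc;
}
]]]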

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Denis Kovalchuk <de...@visualsvn.com>.
> For that matter, now that build-repcache is a thing, why should commits
> populate rep-cache synchronously at all?  Why shouldn't we provide
> a mode that says "svn_fs_commit_txn() won't populate the rep-cache for
> you; you're expected to do it yourself via a post-commit hook"?  That'll
> sidestep the problem entirely.  (Brane made a similar point a few days
> ago.)
>
> > 2) These commits will complete with a post-commit error.
>
> There are other ways to address this.  For example, whenever the
> post-commit insertion to rep-cache.db fails, instead of reporting the
> error to the client, we could record the error in the server log but not
> report it to the client.  After all, generally the client doesn't care
> about that error and can't do anything to fix it.

I am not too sure that the moment when the rep-cache is populated and the way
errors are reported are currently causing problems. I think the root cause of
the problem is that the entire rep-cache verification runs while holding a
SQLite lock, and I propose to fix it with the slightly modified approach in the
patch so that the errors do not happen at all.

I think that with asynchronous rep-cache population the same errors will
happen, but at another time. Even if errors are hidden from clients,
administrators may not have a way to fix them.

> (There are things other than «verify» that can cause the "post-commit FS
> processing" rep-cache INSERT to time out and error out.  For example,
> that's known to happen to someone who makes a small commit that just
> happens to occur right after a commit that has released the write lock
> but hasn't finished inserting to rep-cache.db.  I don't have the issue
> number handy, sorry.)

If I understand correctly, this is the case when a large number of entries are
added to the rep-cache during the post-commit processing stage, which leads to
starvation for other additions. It looks like a separate problem that can be
fixed. And I assume that a similar approach can be used for this: add entries
in small batches, instead of adding all the entries in one transaction.
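
For illustration, something along these lines (not actual code, just a sketch
with the raw SQLite C API; error handling is elided and the column names assume
the usual rep_cache layout):

[[[
/* Rough illustration (raw SQLite C API, error handling elided): insert
   rows in small batches, committing every BATCH_SIZE rows so that a
   concurrent writer never has to wait for one huge transaction. */
#include <sqlite3.h>

#define BATCH_SIZE 64

struct rep_row { const char *hash; long rev, off, size, expanded_size; };

static void
insert_in_batches(sqlite3 *db, const struct rep_row *rows, int n)
{
  sqlite3_stmt *stmt;
  int i;

  sqlite3_prepare_v2(db,
      "INSERT OR IGNORE INTO rep_cache "
      "(hash, revision, offset, size, expanded_size) "
      "VALUES (?1, ?2, ?3, ?4, ?5)",
      -1, &stmt, NULL);

  sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
  for (i = 0; i < n; i++)
    {
      sqlite3_bind_text(stmt, 1, rows[i].hash, -1, SQLITE_STATIC);
      sqlite3_bind_int64(stmt, 2, rows[i].rev);
      sqlite3_bind_int64(stmt, 3, rows[i].off);
      sqlite3_bind_int64(stmt, 4, rows[i].size);
      sqlite3_bind_int64(stmt, 5, rows[i].expanded_size);
      sqlite3_step(stmt);
      sqlite3_reset(stmt);

      /* Commit every BATCH_SIZE rows so the write lock is only ever
         held for one small batch at a time. */
      if ((i + 1) % BATCH_SIZE == 0)
        {
          sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
          sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        }
    }
  sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
  sqlite3_finalize(stmt);
}
]]]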

Also, I think that this problem should occur less often than the one mentioned
above for verification, because:
- During a commit, the lock is held only for the duration of adding new entries
to the rep-cache.
- During a verification, the lock is held for the duration of all content
verification, which can take a lot of time.

In general, it seems to me that committing a large number of changes may be
rarer than simply verifying a repository with a large rep-cache. So I propose
to fix the problem with verification first.

> > 3) No entries will be added to rep-cache.db. Unless fixed, following commits
> > may work without deduplication.
>
> Sure, but why is it not sufficient to run build-repcache afterwards?

For example, 'svnadmin verify' can be checking the rep-cache for a long time,
and in that case 'svnadmin build-repcache' will keep failing with an error. All
commits made during this time will work without deduplication of new data.

> What do you mean, "at the right time"?  Under what circumstances would
> the following not suffice? —
>
>     set -e
>     old=`svnlook youngest r`
>     svnadmin verify r
>     svnadmin build-repcache r -r ${old}:HEAD

If I am not mistaken, this assumes that only one maintenance operation can
happen on the repository at a time. But in general this approach requires a
guarantee that no other verification is running, because otherwise
'svnadmin build-repcache' may fail with an error.

> Fair point, but the semantics of SQLite are what they are, and we have
> to work with that.  (Unless you propose to replace rep-cache.db by some
> non-SQLite storage?  Maybe SQLite isn't the right tool for the job after
> all, even if it's the right tool for wc.db?  Or, alternatively, can we
> achieve finer-grained locking within SQLite?)

I do not think that there is a problem with using SQLite for these purposes,
and I do not propose to replace it. I think the problem is how SQLite is used
during verification, where an unbounded statement currently leads to the
problem with concurrent writes to the rep-cache.

> When is it not possible to run «build-repcache» and «verify» sequentially?

For example, when there is no way to guarantee that another verification is
not running at the same time.

> > 2) If new entries are added to the rep-cache.db during its verification, the
> > new approach has a guarantee that all entries that existed at the time of the
> > call of 'svnadmin verify' will be verified.
>
> Yes, this too.  However, why would you assume in this case that just
> because every row was valid when it was accessed during verification,
> every row that was present when verification began is valid when the
> verification finishes?
>
> In the current code, the answer to that question is "Because the
> verification process took an SQLite lock, which made the table-reading
> atomic".
>
> I don't know what the answer to that question is with this patch.  How
> can you be sure that, at a particular moment in time, every row in the
> table was valid simultaneously, if you don't read the table atomically?

We maintain a separate cursor (sha1_digest) into the entries and walk them in
order. The row values are immutable after they are added. So with the new
approach we will walk at least all rows that existed at the time
'svnadmin verify' was called, and with the same values.

The verification of a single row depends only on the values in that row. Those
values are immutable, and the revision content in FSFS is also immutable. Based
on this, I assume that if we find no errors in the rows we access, there are no
errors in the rows that existed at the time 'svnadmin verify' was called.
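
For illustration, the batched fetching is roughly equivalent to re-running a
small ordered query with the last seen hash as the cursor (a simplified sketch
with the raw SQLite C API, not the actual patch code; error handling elided):

[[[
/* Simplified sketch (raw SQLite C API, error handling elided): fetch
   rep_cache rows in small ordered batches, keeping the last hash seen
   as a cursor.  Each batch is a short-lived read, so no lock is held
   for the whole duration of the verification. */
#include <sqlite3.h>
#include <stdio.h>

#define BATCH_SIZE 1000
#define STR_(x) #x
#define STR(x) STR_(x)

static void
walk_rep_cache(sqlite3 *db)
{
  char last_hash[64] = "";   /* the empty string sorts before any digest */
  sqlite3_stmt *stmt;
  int got_rows;

  sqlite3_prepare_v2(db,
      "SELECT hash, revision, offset, size, expanded_size "
      "FROM rep_cache WHERE hash > ?1 "
      "ORDER BY hash LIMIT " STR(BATCH_SIZE),
      -1, &stmt, NULL);

  do
    {
      got_rows = 0;
      sqlite3_bind_text(stmt, 1, last_hash, -1, SQLITE_TRANSIENT);
      while (sqlite3_step(stmt) == SQLITE_ROW)
        {
          const char *hash = (const char *)sqlite3_column_text(stmt, 0);

          /* ... verify this row against the revision data here ... */

          snprintf(last_hash, sizeof(last_hash), "%s", hash);
          got_rows++;
        }
      sqlite3_reset(stmt);   /* ends the implicit read transaction */
    }
  while (got_rows == BATCH_SIZE);

  sqlite3_finalize(stmt);
}
]]]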

Regards,
Denis Kovalchuk

Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Denis Kovalchuk wrote on Sat, 02 May 2020 00:28 +0300:
> > So the failure mode is that rep-cache.db entries are not able to be
> > added while rep-cache.db is being verified.  I am not convinced that
> > this is a problem.  
> 
> I think that this case leads to multiple problems:
> 1) Commits happening at the same time as verify will hang for the duration of
> sqlite busy timeout.

Yeah, they hang for 10 seconds:

[[[
subversion/libsvn_subr/sqlite.c|218 col 22| #define BUSY_TIMEOUT 10000
subversion/libsvn_subr/config_file.c|1552 col 50| "### returning an error.  The default is 10000, i.e. 10 seconds."    NL
subversion/libsvn_subr/config_file.c|1554 col 27| "# busy-timeout = 10000"                                             NL
]]]

Would it help to make this configurable?  The busy timeout is set
per-connection by the process itself, so the admin can't just set it out of
band, and the above "busy-timeout" config file knob is used by libsvn_wc only,
so this is not currently configurable.
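
(For reference, the SQLite side of such a knob would be trivial.  The plumbing
below is hypothetical; sqlite3_busy_timeout() itself is the real API:)

[[[
/* For reference: how a configurable timeout would reach SQLite.  The
   plumbing is hypothetical; sqlite3_busy_timeout() itself is the real
   API and takes a value in milliseconds. */
#include <sqlite3.h>

static void
apply_busy_timeout(sqlite3 *db, int configured_timeout_ms)
{
  if (configured_timeout_ms <= 0)
    configured_timeout_ms = 10000;   /* the current BUSY_TIMEOUT default */

  sqlite3_busy_timeout(db, configured_timeout_ms);
}
]]]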

For that matter, now that build-repcache is a thing, why should commits
populate rep-cache synchronously at all?  Why shouldn't we provide
a mode that says "svn_fs_commit_txn() won't populate the rep-cache for
you; you're expected to do it yourself via a post-commit hook"?  That'll
sidestep the problem entirely.  (Brane made a similar point a few days
ago.)

> 2) These commits will complete with a post-commit error.

There are other ways to address this.  For example, whenever the
post-commit insertion to rep-cache.db fails, instead of reporting the
error to the client, we could record the error in the server log but not
report it to the client.  After all, generally the client doesn't care
about that error and can't do anything to fix it.

>

Note: All the above suggestions, for both (1) and (2), would fix _all_
causes of (1) and (2), whereas your proposal would prevent (1) and (2)
only when they are caused by a concurrent «verify» operation, but not
otherwise.

(There are things other than «verify» that can cause the "post-commit FS
processing" rep-cache INSERT to time out and error out.  For example,
that's known to happen to someone who makes a small commit that just
happens to occur right after a commit that has released the write lock
but hasn't finished inserting to rep-cache.db.  I don't have the issue
number handy, sorry.)

> 3) No entries will be added to rep-cache.db. Unless fixed, following commits
> may work without deduplication.

Sure, but why is it not sufficient to run build-repcache afterwards?

> Repository verification can be a regular procedure and can take a long time for
> large repositories. And in that case the mentioned problems can also happen
> regularly.
> 
> > The obvious workaround, in case someone runs into that, is for the
> > admin to run build-repcache afterwards.  (Incidentally, we might want
> > have the "post-commit FS processing failed" error message namedrop
> > «svnadmin build-repcache» when the underlying error code indicates an
> > SQLite error.)  
> 
> I assume that the patch fixes the problem in such a way that it does not
> require any additional workarounds and does not worsen any other
> characteristics.

> At the same time, I would like to note that the workaround does not
> fix all problems. It fixes only 3) and only if it is called at the
> right time.

What do you mean, "at the right time"?  Under what circumstances would
the following not suffice? —

    set -e
    old=`svnlook youngest r`
    svnadmin verify r
    svnadmin build-repcache r -r ${old}:HEAD

> > Yes, build-repcache will error out in this situation.  Why is that a problem?  
> 
> I think this is a problem for similar reasons. It may be unexpected that the
> read-only verify operation can lead to an error when another maintenance
> operation such as 'svnadmin build-repcache' is called.

Fair point, but the semantics of SQLite are what they are, and we have
to work with that.  (Unless you propose to replace rep-cache.db by some
non-SQLite storage?  Maybe SQLite isn't the right tool for the job after
all, even if it's the right tool for wc.db?  Or, alternatively, can we
achieve finer-grained locking within SQLite?)

However, the foremost consideration for «svnadmin verify» is correctness.
Unintuitive behaviour can be dealt with via documentation, informative
error messages, and so on.

> To completely avoid failures, it may be necessary to separate
> operations by time, and this is not always possible.

When is it not possible to run «build-repcache» and «verify» sequentially?

> > What effect does the patch have on _correctness_ of verification?  What
> > kinds of corruptions can be detected by the incumbent code but not with
> > your patch?
> >
> > Isn't there an alternative approach that does not affect the correctness
> > of verification as much as this approach?  
> 
> The patch does not change the verification process, only the process of
> fetching entries. In the previous approach entries were fetched during one
> large query and now they are fetched in small batches.
> 
> In this way:
> 1) If rep-cache.db is not changed during its verification, the new approach
> should be equivalent to the previous one.

Agreed.

> 2) If new entries are added to the rep-cache.db during its verification, the
> new approach has a guarantee that all entries that existed at the time of the
> call of 'svnadmin verify' will be verified.

Yes, this too.  However, why would you assume in this case that just
because every row was valid when it was accessed during verification,
every row that was present when verification began is valid when the
verification finishes?

In the current code, the answer to that question is "Because the
verification process took an SQLite lock, which made the table-reading
atomic".

I don't know what the answer to that question is with this patch.  How
can you be sure that, at a particular moment in time, every row in the
table was valid simultaneously, if you don't read the table atomically?

> So I think that it should not affect the correctness of verification.

I hope it doesn't, but I also know that it's all too easy to overlook
bugs in concurrent code that uses non-atomic operations.

For example, back when 1.7.0 was in the pre-release stage we had to
remove an entire new SQLite-using feature because we all overlooked
a concurrency bug in it until it was pointed out to us.  Nowadays our
concurrency correctness requirements are just as stringent as then, but
we have far fewer eyeballs around to find bugs with.

Cheers,

Daniel