You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Julian Foad <ju...@apache.org> on 2020/07/28 13:44:49 UTC
Re: FSFS commit failure should release txn proto-rev lock

Ping: can anyone comment on this proposed patch?

- Julian


Julian Foad wrote:
> CC'ing more possible experts Dmitry, Evgeny: any thoughts about whether 
> this change makes sense for FSFS commit?
> 
> - Julian
> 
> 
> Julian Foad wrote:
>> TL;DR: I propose a change to the FSFS commit-transaction function, to 
>> release the proto-rev write lock if an error occurs while it has this 
>> lock.
>>
>> The practical applications of this change are rather obscure, which 
>> perhaps explains why it has not been needed before.  In particular, it 
>> apparently is not needed for the way the rest of standard Subversion 
>> drives FSFS, but may be needed for other users of FSFS.  I have come 
>> across this case in WANdisco's replicator, but as there are other 
>> peculiarities in how that drives FSFS, let us not confuse the issue by 
>> looking too closely at it.  It appears the issue would apply to other 
>> users of FSFS too.
>>
>> In the FSFS commit-transaction code path (in svn_fs_fs__commit) there 
>> is a region where it acquires an exclusive write lock on the prototype 
>> revision (proto-rev).  There are cases where code in this region can 
>> fail, and there is no release of the lock in the error return path. 
>> That means if the calling process re-tries, the "writing" flag is 
>> still set in the transaction object in memory, and this causes an 
>> "already locked" error.
>>
>> In regular Subversion we abandon a transaction if it fails at this 
>> stage, and so never hit the problem.  There are failure modes where a 
>> re-try could not succeed, notably after we move the proto-rev file 
>> into its final location, breaking the transaction; this case is called 
>> out in comments in the function and will remain after this change.  
>> Abandoning the transaction is a safe and effective way to use FSFS.  
>> However, other users of FSFS may prefer to re-try in certain other cases.
>>
>> The case WANdisco encountered is where some old repository corruption 
>> (SVN-4858) was detected and reported at some point in this code 
>> region.   Although the commit would not be able to succeed, it was 
>> important to them that the same error should be reported again during 
>> a re-try, and what was observed instead was that the "already locked" 
>> error was thrown instead.
>>
>> I suppose disk being temporarily inaccessible is one class of error 
>> where a re-try might be successful.
>>
>> The attached test and patch demonstrate and fix the problem.
>>
>> This patch does not attempt to make it possible to re-try a failed 
>> commit in all cases.  Some remaining cases are noted in the patch log 
>> message which is repeated here:
>>
>> [[[
>> Roll back the transaction lock state so re-trying a failing commit 
>> results in the same error again instead of an "already locked" error.
>>
>> The problem was that when a commit returned a failure from within the 
>> code region where it held a transaction proto-rev lock, it did not 
>> unlock it.  Although the FSFSWD replicator replaces the transaction 
>> files on disk, the lock status remained on the transaction object in 
>> memory and a subsequent retry then failed with "already locked" error 
>> instead of the same failure mode as the first attempt.
>>
>> The solution here is to reset the lock status of the in-memory 
>> transaction object before returning a commit failure.
>>
>> This implementation addresses cases where the commit fails and returns 
>> an error (e.g. due to detecting repository corruption at this point, 
>> as in the case reported in NV-7983), and the lock can be successfully 
>> unlocked using the regular unlock code path.
>>
>> Cases not addressed:
>>
>>    * There are conceivable failure modes where this regular unlocking 
>> would not succeed, e.g. disk files becoming inaccessible, and this 
>> patch would not address those cases.  These could perhaps be addressed 
>> by adding a lock clean-up function that ignores errors in clean-up, 
>> and using that instead of the regular unlocking code.
>>
>>    * This implementation does not address cases where the process 
>> crashes in this code region.  (In such cases the in-memory 'is 
>> writable' flag would not be preserved anyway so that is out of scope.)
>>
>> ### NOT YET TESTED:
>> For FSFSWD, this implementation should also work where a failure 
>> occurs after moving the proto-rev file to the final revision-file 
>> location. FSFSWD re-copies the transaction directory before re-trying, 
>> and so this should succeed.  For regular FSFS, it does not address 
>> this case: a re-try in this case will fail to open the transaction 
>> proto-rev file.
>>
>> ### This patch includes debugging code.
>>
>> * subversion/libsvn_fs_fs/transaction.c
>>    (TEST_COMMIT_FAIL): Debug code.
>>    (ERR_PROTO_REV_LOCKED): New macro.
>>    (commit_body): Use the new macro to handle errors in the region 
>> where a proto-rev lock is held, and unlock it in those cases.
>>
>> ]]]
>>
>> My question is, does this change (without the debugging code) make 
>> sense as an improvement to FSFS?
>>
>> - Julian
>