Posted to dev@subversion.apache.org by Evgeny Kotkov <ev...@visualsvn.com> on 2014/06/17 18:31:59 UTC

[PATCH V1] Fix the hotcopy messing up the destination db/current for old repositories

Hi,

Here is something I was working on at the hackathon.  This is the problem
I reported in http://svn.haxx.se/dev/archive-2014-02/0187.shtml
In brief, the problem lies in the way we use the recovery code when hotcopying
these old repositories.  We grab the YOUNGEST revision from the source, do the
hotcopy and then use the recovery procedure in order to determine the node-id
and the copy-id to stamp in the destination db/current file.  The problem is
that proper recovery requires a *full* scan of the FS, which we do not do.
As a consequence, the db/current in the destination is messed up and that is
shown by the XFail'ing fsfs_hotcopy_old_with_propchanges() test.
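
For reference, in FSFS formats 1 and 2 the db/current file carries all
three values on a single line, so a bad recovery result corrupts the
node-id and copy-id fields and not just the youngest revision.  An
old-format db/current might read (the concrete values are made up for
illustration):

    42 j7 15

that is, youngest revision 42, next node-id "j7" and next copy-id "15",
both given in base36.  Newer formats keep only the youngest revision in
this file, which is why they are unaffected.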

Anyway, this "use recover to obtain db/current contents" approach seems broken
from the very start.  I would like to fix this problem by switching to
something more obvious.  We could atomically read the (YOUNGEST, NODE-ID,
COPY-ID) tuple from the destination and write it to the destination at the
very moment the hotcopy finished working.  Here is a series of three patches,
which I am willing to commit one-by-one.  Please note that *I did not yet write
the log messages* for this patch series (probably they are going to include
parts of what I've written below).  I am working on them at this very moment.
And sorry for a lot of text :)

In brief, I have the following plan:

- Extend the existing fsfs_hotcopy_old_with_propchanges() test.  We now know
  that the problem is not actually in the property changes, but in the node-
  and copy-id changes.  We can extend the test (which would still be an
  @XFail on trunk) in order to test *more*.

  This is the first patch of the series.

- Fix the problem itself.  The idea of the fix is pretty simple -- we stop
  using the recover()-related functions when doing the hotcopy for repositories
  which have the old FS format.  Previously we did use the recovery mechanism
  in order to recover the next-node-id and the next-copy-id for the hotcopy
  target, but did it *wrong*:

  * The thing is that proper recovery actually needs a full rev-by-rev
    scan of the filesystem; otherwise it might miss some of the node- or
    copy-ids.  That's exactly what we encountered here: on trunk we
    currently run the recovery *only* for the youngest revision of the
    destination and hence mess up the node- and copy-ids in the target
    'current' file.

  * We could probably switch to a full recovery scan for this case (that
    would make for a simpler patch), but that actually sucks.  You do not
    want to run a full recovery scan just for the backup (!) to work.
    So, I've fixed this by atomically reading the 'db/current' contents
    of the source and writing them as-is to the destination.  The diff
    might look a bit scary, but this is just a consequence of me
    extracting the revision copying logic into two separate functions
    (for old and new FS formats respectively).

    The new logic is pretty simple -- if we encounter an FS with a new
    format, we just stick to the old code.  If we see FS format 1 or 2,
    we use the new approach, which is much simpler.  We do not have to
    deal with packs, shards and the like; we simply copy the revisions
    and revprops (sticking to the old approach, which copies only the
    files that differ in terms of timestamps).  Then we atomically
    update the 'db/current' in the destination with the source
    'db/current' contents we read a bit earlier (see the sketch after
    this list).

  This is the second patch of the series.

- Clean up a bit.  Now that the hotcopy code does not need the recovery
  code (which seems logical), we can remove the corresponding private API,
  namely svn_fs_fs__find_max_ids().  As a consequence, the id recovery
  code is now "private" to the subversion/libsvn_fs_fs/recovery.c file,
  which makes sense from the design point of view.

  This is the third patch of the series.
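
To make the idea behind the second patch concrete, here is a minimal
sketch in plain C of the "read the source db/current first, stamp the
destination last" flow mentioned above.  The helper names, fixed-size
buffers and missing APR/svn_error_t plumbing are simplifications for
illustration, not the actual libsvn_fs_fs code:

    #include <stdio.h>

    /* Read the single-line db/current of the filesystem at SRC into
       BUF.  For FS formats 1 and 2 the line looks like
       "<youngest> <node-id> <copy-id>\n". */
    static int
    read_current(const char *src, char *buf, size_t bufsize)
    {
      char path[1024];
      FILE *f;

      snprintf(path, sizeof(path), "%s/db/current", src);
      if ((f = fopen(path, "rb")) == NULL)
        return -1;
      if (fgets(buf, (int)bufsize, f) == NULL)
        {
          fclose(f);
          return -1;
        }
      fclose(f);
      return 0;
    }

    /* Write CONTENTS into the db/current of the filesystem at DST via
       a temporary file plus rename(), so a concurrent reader never
       observes a half-written file. */
    static int
    write_current(const char *dst, const char *contents)
    {
      char path[1024], tmp[1032];
      FILE *f;

      snprintf(path, sizeof(path), "%s/db/current", dst);
      snprintf(tmp, sizeof(tmp), "%s.tmp", path);
      if ((f = fopen(tmp, "wb")) == NULL)
        return -1;
      fputs(contents, f);
      if (fclose(f) != 0)
        return -1;
      return rename(tmp, path);  /* atomic replace on POSIX */
    }

The ordering is the point: capture the tuple first, copy the revision
and revprop files it refers to, and only then stamp the destination, so
db/current never names ids that the copied revisions cannot back.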

In case this might be interesting to you, here is a checklist I've run against
the trunk with all three patches applied:

- Run svnadmin_tests.py with Debug / Windows
- Run svnadmin_tests.py with Debug and --fsfs-packing / Windows
- Run svnadmin_tests.py with Debug and --fsfs-sharding / Windows
- Run svnadmin_tests.py with Debug and --server-minor-version=3 / Windows
- Run svnadmin_tests.py with Debug and --server-minor-version=6 / Windows
- Run svnadmin_tests.py with Debug and --server-minor-version=8 / Windows
- Run svnadmin_tests.py with Release / Windows
- Run svnadmin_tests.py with Release and --fs-type=bdb / Windows
- Run svnadmin_tests.py with Release and --fs-type=fsx / Windows
- Run svnadmin_tests.py with Release and --parallel / Windows
- Run svnadmin_tests.py with Release and --http-dir / Windows
- Run all tests / Ubuntu
- Run hotcopy for the new format serf repository, verify it, check db/current
- Run hotcopy for the new format googletest repository, verify it,
  check db/current
- Run hotcopy for the old format serf repository, verify it, check db/current
- Run hotcopy for the old format googletest repository, verify it,
  check db/current
- Run hotcopy for the PACKED new format serf repository, verify it,
  check db/current
- Run hotcopy for the PACKED new format googletest repository, verify it,
  check db/current
- Run hotcopy for the PACKED old format serf repository, verify it,
  check db/current
- Run hotcopy for the PACKED old format googletest repository, verify it,
  check db/current
- Somehow test the incremental hotcopy (say, abort the hotcopy for a
  sharded repository in the middle of it...)
- Do the previous test with packed shards
- Test that the extended test still fails on unpatched trunk
- Patch the test in order to run it with a normal repository
  version (> 3) / Windows

What do you think?

Regards,
Evgeny Kotkov

Re: [PATCH V1] Fix the hotcopy messing up the destination db/current for old repositories

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
I committed these three patches in r1603482, r1603485 and r1603486.

> That patch in itself is o.k.  However, you should update
> get_next_revision_ids() in transaction.c to use the new
> utility function (or entirely replace it).

Done as a follow-up in r1603487.
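
For the curious, the duplicated logic was essentially the parsing of the
db/current line.  A rough sketch of that step (the helper name and error
handling are invented for illustration, not the actual transaction.c
code):

    #include <stdio.h>  /* sscanf */

    /* Split "<youngest> <node-id> <copy-id>" out of an old-format
       db/current line; NODE_ID and COPY_ID must each hold at least
       10 bytes. */
    static int
    parse_current(const char *line, long *youngest,
                  char *node_id, char *copy_id)
    {
      if (sscanf(line, "%ld %9s %9s", youngest, node_id, copy_id) != 3)
        return -1;  /* malformed db/current */
      return 0;
    }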

> Removing the internal API is fine; however you might have
> kept the code as a static function without the svn_fs_fs__
> name prefix. There is no need to manually inline it into the
> caller - but that's just me giving an opinion. I'd be fine with
> committing that patch as is.

I committed the patch as is.  Otherwise we would end up with an unwanted name
clash (there already is a recover_find_max_ids() function) and, basically,
I cannot think of an appropriate "private" name for this function.  What it
essentially does is something like open_file_and_find_max_ids(), but that is
ugly.  I think the inlined version is fine.

Thank you for the review!


Regards,
Evgeny Kotkov

Re: [PATCH V1] Fix the hotcopy messing up the destination db/current for old repositories

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Tue, Jun 17, 2014 at 6:31 PM, Evgeny Kotkov <ev...@visualsvn.com>
wrote:

> Hi,
>
> Here is something I was working on at the hackathon.  This is the problem
> I reported in http://svn.haxx.se/dev/archive-2014-02/0187.shtml
> In brief, the problem lies in the way we use the recovery code when
> hotcopying these old repositories.  We grab the YOUNGEST revision from
> the source, do the hotcopy and then use the recovery procedure in order
> to determine the node-id and the copy-id to stamp in the destination
> db/current file.  The problem is that proper recovery requires a *full*
> scan of the FS, which we do not do.  As a consequence, the db/current
> in the destination is messed up and that is shown by the XFail'ing
> fsfs_hotcopy_old_with_propchanges() test.
>

Yes, the problem is real: HEAD might not touch a path
that belongs to the latest copy / copied sub-tree. Hence,
we won't catch the highest copy ID by simply looking
at the youngest revision.  (For example, if r2 copied /A to /B,
allocating a fresh copy-id, and the youngest revision r3 only tweaked
a property under /A, a scan of r3 alone never sees that copy-id.)

Anyway, this "use recover to obtain db/current contents" approach seems
> broken
> from the very start.  I would like to fix this problem by switching to
> something more obvious.  We could atomically read the (YOUNGEST, NODE-ID,
> COPY-ID) tuple from the destination and write it to the destination at the
> very moment the hotcopy finished working.  Here is a series of three
> patches,
> which I am willing to commit one-by-one.  Please note that *I did not yet
> write
> the log messages* for this patch series (probably they are going to include
> parts of what I've written below).  I am working on them at this very
> moment.
> And sorry for a lot of text :)
>
> In brief, I have the following plan:
>
> - Extend the existing fsfs_hotcopy_old_with_propchanges() test.  We now
>   know that the problem is not actually in the property changes, but in
>   the node- and copy-id changes.  We can extend the test (which would
>   still be an @XFail on trunk) in order to test *more*.
>
>   This is the first patch of the series.
>

extend-the-existing-xfail-test v1 looks good to commit.


> - Fix the problem itself.  The idea of the fix is pretty simple -- we
>   stop using the recover()-related functions when doing the hotcopy for
>   repositories which have the old FS format.  Previously we did use the
>   recovery mechanism in order to recover the next-node-id and the
>   next-copy-id for the hotcopy target, but did it *wrong*:
>
>   * The thing is that proper recovery actually needs a full rev-by-rev
>     scan of the filesystem; otherwise it might miss some of the node-
>     or copy-ids.  That's exactly what we encountered here: on trunk we
>     currently run the recovery *only* for the youngest revision of the
>     destination and hence mess up the node- and copy-ids in the target
>     'current' file.
>
>   * We could probably switch to a full recovery scan for this case
>     (that would make for a simpler patch), but that actually sucks.
>     You do not want to run a full recovery scan just for the backup
>     (!) to work.  So, I've fixed this by atomically reading the
>     'db/current' contents of the source and writing them as-is to the
>     destination.  The diff might look a bit scary, but this is just a
>     consequence of me extracting the revision copying logic into two
>     separate functions (for old and new FS formats respectively).
>

So, in effect, we are actually doing *less* work than before.
I like.


>     The new logic is pretty simple -- if we encounter an FS with a new
>     format, we just stick to the old code.  If we see FS format 1 or 2,
>     we use the new approach, which is much simpler.  We do not have to
>     deal with packs, shards and the like; we simply copy the revisions
>     and revprops (sticking to the old approach, which copies only the
>     files that differ in terms of timestamps).  Then we atomically
>     update the 'db/current' in the destination with the source
>     'db/current' contents we read a bit earlier.
>
>   This is the second patch of the series.
>

That patch in itself is o.k.  However, you should update
get_next_revision_ids() in transaction.c to use the new
utility function (or entirely replace it).


> - Clean up a bit.  Now that the hotcopy code does not need the recovery
>   code (which seems logical), we can remove the corresponding private
>   API, namely svn_fs_fs__find_max_ids().  As a consequence, the id
>   recovery code is now "private" to the
>   subversion/libsvn_fs_fs/recovery.c file, which makes sense from the
>   design point of view.
>
>   This is the third patch of the series.
>

Removing the internal API is fine; however you might have
kept the code as a static function without the svn_fs_fs__
name prefix. There is no need to manually inline it into the
caller - but that's just me giving an opinion. I'd be fine with
committing that patch as is.


> In case this might be interesting to you, here is a checklist I've run
> against the trunk with all three patches applied:
>
> - Run svnadmin_tests.py with Debug / Windows
> - Run svnadmin_tests.py with Debug and --fsfs-packing / Windows
> - Run svnadmin_tests.py with Debug and --fsfs-sharding / Windows
> - Run svnadmin_tests.py with Debug and --server-minor-version=3 / Windows
> - Run svnadmin_tests.py with Debug and --server-minor-version=6 / Windows
> - Run svnadmin_tests.py with Debug and --server-minor-version=8 / Windows
> - Run svnadmin_tests.py with Release / Windows
> - Run svnadmin_tests.py with Release and --fs-type=bdb / Windows
> - Run svnadmin_tests.py with Release and --fs-type=fsx / Windows
> - Run svnadmin_tests.py with Release and --parallel / Windows
> - Run svnadmin_tests.py with Release and --http-dir / Windows
> - Run all tests / Ubuntu
> - Run hotcopy for the new format serf repository, verify it, check
>   db/current
> - Run hotcopy for the new format googletest repository, verify it,
>   check db/current
> - Run hotcopy for the old format serf repository, verify it, check
>   db/current
> - Run hotcopy for the old format googletest repository, verify it,
>   check db/current
> - Run hotcopy for the PACKED new format serf repository, verify it,
>   check db/current
> - Run hotcopy for the PACKED new format googletest repository, verify
>   it, check db/current
> - Run hotcopy for the PACKED old format serf repository, verify it,
>   check db/current
> - Run hotcopy for the PACKED old format googletest repository, verify
>   it, check db/current
> - Somehow test the incremental hotcopy (say, abort the hotcopy for a
>   sharded repository in the middle of it...)
> - Do the previous test with packed shards
> - Test that the extended test still fails on unpatched trunk
> - Patch the test in order to run it with a normal repository
>   version (> 3) / Windows
>
> What do you think?
>

Wow, my total response is shorter than the mere list of tests
you ran ;)  Good job.

Cheers!

-- Stefan^2.