You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Philip Martin <ph...@wandisco.com> on 2012/07/30 18:47:20 UTC

Deterministic FSFS revision files

When writing an FSFS revision file some parts are written in hash order
and so with a recent APR the order is not predictable.  Thus loading the
same dumpfile into two separate empty repositories produces different
revision files.  The things that vary are the order of the nodes in the
file and the order of the change lines at the end of the file.

As with the dumpfile format we have never formally specified the order
of a revision file so the randomn order is not strictly a bug but it
might be useful if the order was at least repeatable.

Things get more complicated in 1.8 as a side-effect of directory
deltification.  Directory deltification works best if the directory
order is stable and so some hashes now use a non-APR hash function to
produce a stable order.  Whether or not the revision file is repeatable
depends on which hashes are used.  The change lines hash is still
unstable but the hash returned by svn_fs_fs__rep_contents_dir can be
stable or unstable depending on whether or not the has was found in the
cache or created on demand.  It makes me even more uneasy that how much
variation is present in the revision file depends on our caching
strategy.

I'm considering changing the commit code so that hashes are written in a
stable order and the revision files are repeatable.  Does anyone think
this would be a bad idea?

Index: subversion/libsvn_fs_fs/fs_fs.c
===================================================================
--- subversion/libsvn_fs_fs/fs_fs.c	(revision 1367025)
+++ subversion/libsvn_fs_fs/fs_fs.c	(working copy)
@@ -7367,16 +7367,19 @@ write_final_rev(const svn_fs_id_t **new_id_p,
     {
       apr_pool_t *subpool;
       apr_hash_t *entries, *str_entries;
-      apr_hash_index_t *hi;
+      apr_array_header_t *sorted_entries;
+      int i;
 
       /* This is a directory.  Write out all the children first. */
       subpool = svn_pool_create(pool);
 
       SVN_ERR(svn_fs_fs__rep_contents_dir(&entries, fs, noderev, pool));
-
-      for (hi = apr_hash_first(pool, entries); hi; hi = apr_hash_next(hi))
+      sorted_entries = svn_sort__hash(entries, svn_sort_compare_items_lexically,
+                                      pool);
+      for (i = 0; i < sorted_entries->nelts; ++i)
         {
-          svn_fs_dirent_t *dirent = svn__apr_hash_index_val(hi);
+          svn_fs_dirent_t *dirent = APR_ARRAY_IDX(sorted_entries, i,
+                                                  svn_sort__item_t).value;
 
           svn_pool_clear(subpool);
           SVN_ERR(write_final_rev(&new_id, file, rev, fs, dirent->id,
@@ -7546,19 +7549,22 @@ write_final_changed_path_info(apr_off_t *offset_p,
 {
   apr_hash_t *changed_paths;
   apr_off_t offset;
-  apr_hash_index_t *hi;
   apr_pool_t *iterpool = svn_pool_create(pool);
   fs_fs_data_t *ffd = fs->fsap_data;
   svn_boolean_t include_node_kinds =
       ffd->format >= SVN_FS_FS__MIN_KIND_IN_CHANGED_FORMAT;
+  apr_array_header_t *sorted_changed_paths;
+  int i;
 
   SVN_ERR(get_file_offset(&offset, file, pool));
 
   SVN_ERR(svn_fs_fs__txn_changes_fetch(&changed_paths, fs, txn_id, pool));
+  sorted_changed_paths = svn_sort__hash(changed_paths,
+                                        svn_sort_compare_items_lexically, pool);
 
   /* Iterate through the changed paths one at a time, and convert the
      temporary node-id into a permanent one for each change entry. */
-  for (hi = apr_hash_first(pool, changed_paths); hi; hi = apr_hash_next(hi))
+  for (i = 0; i < sorted_changed_paths->nelts; ++i)
     {
       node_revision_t *noderev;
       const svn_fs_id_t *id;
@@ -7567,8 +7573,8 @@ write_final_changed_path_info(apr_off_t *offset_p,
 
       svn_pool_clear(iterpool);
 
-      change = svn__apr_hash_index_val(hi);
-      path = svn__apr_hash_index_key(hi);
+      change = APR_ARRAY_IDX(sorted_changed_paths, i, svn_sort__item_t).value;
+      path = APR_ARRAY_IDX(sorted_changed_paths, i, svn_sort__item_t).key;
 
       id = change->node_rev_id;
 

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Deterministic FSFS revision files

Posted by Hyrum K Wright <hy...@hyrumwright.org>.

On Mon, Jul 30, 2012 at 1:19 PM, C. Michael Pilato <cm...@collab.net> wrote:
> On 07/30/2012 12:47 PM, Philip Martin wrote:
>> I'm considering changing the commit code so that hashes are written in a
>> stable order and the revision files are repeatable.  Does anyone think
>> this would be a bad idea?
>
> I can't think of any reason not to do this.  The performance overhead will
> be minimal, I should think, and you've already outlined the benefits.

As long as the performance overhead is negligible, I think having
deterministic ordering is very desirable.  There's just something in
me that longs to have a One True Ordering in dumpfiles and revfiles
and output and everything else.  I can't really say why, but it feels
like by having a canonical ordering, we open ourselves up for more
opportunities to cache work moving forward.

(How's that for nebulous and hand wavy?)

-Hyrum

Re: Deterministic FSFS revision files

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 07/30/2012 12:47 PM, Philip Martin wrote:
> I'm considering changing the commit code so that hashes are written in a
> stable order and the revision files are repeatable.  Does anyone think
> this would be a bad idea?

I can't think of any reason not to do this.  The performance overhead will
be minimal, I should think, and you've already outlined the benefits.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development

Re: Deterministic FSFS revision files

Posted by Stefan Fuhrmann <st...@wandisco.com>.

On Mon, Jul 30, 2012 at 6:47 PM, Philip Martin
<ph...@wandisco.com>wrote:

> When writing an FSFS revision file some parts are written in hash order
> and so with a recent APR the order is not predictable.  Thus loading the
> same dumpfile into two separate empty repositories produces different
> revision files.  The things that vary are the order of the nodes in the
> file and the order of the change lines at the end of the file.
>
> As with the dumpfile format we have never formally specified the order
> of a revision file so the randomn order is not strictly a bug but it
> might be useful if the order was at least repeatable.
>
> Things get more complicated in 1.8 as a side-effect of directory
> deltification.  Directory deltification works best if the directory
> order is stable and so some hashes now use a non-APR hash function to
> produce a stable order.  Whether or not the revision file is repeatable
> depends on which hashes are used.  The change lines hash is still
> unstable but the hash returned by svn_fs_fs__rep_contents_dir can be
> stable or unstable depending on whether or not the has was found in the
> cache or created on demand.  It makes me even more uneasy that how much
> variation is present in the revision file depends on our caching
> strategy.
>
> I'm considering changing the commit code so that hashes are written in a
> stable order and the revision files are repeatable.  Does anyone think
> this would be a bad idea?
>

+1 but I haven't done an in-depth review of the patch.

Reproducible revision file content is nice. The runtime overhead
should be dwarfed by the other computations (delta, checksum).

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download