You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@subversion.apache.org by cm...@apache.org on 2011/05/25 21:33:21 UTC

svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Author: cmpilato
Date: Wed May 25 19:33:21 2011
New Revision: 1127646

URL: http://svn.apache.org/viewvc?rev=1127646&view=rev
Log:
For issue #3890 ("'svnrdump load' does not map revisions like
'svnadmin load' does"), implement the requisite revision map.  This
fixes the standalone recipe script(s) attached to the issue, as well
as one of the committed regression tests, but there remain two more
tests still XFail-ing.

* subversion/svnrdump/load_editor.c
  (struct parse_baton): Add 'rev_map', 'last_rev_mapped', and
    'oldest_dumpstream_rev' members.
  (struct revision_baton): Add 'rev_offset' member.
  (add_revision_mapping, renumber_mergeinfo_revs): New helper functions.
  (commit_callback): Populate the revision map and update the
    last_rev_mapped parse baton member.
  (new_revision_record): Calculate the revision offset.  Pass the
    revision_baton as the commit callback baton to svn_ra_get_commit_editor3().
  (new_node_record): Use the calculated revision offset as the base
    revision when driving the editor (for out-of-dateness checks).
    Use the revision offset and revision map to correct copyfrom
    revisions, too.
  (set_revision_property): Fix the "revision 0" special case in light
    of possible revision offsets.
  (set_node_property): Apply the same mergeinfo revision renumbering
    logic that 'svnadmin load' applies.
  (close_revision): Update logic, and fix the "revision 0" special
    case, in light of possible revision offsets.
  (svn_rdump__load_dumpstream): Initialize new parse_baton members.

* subversion/tests/cmdline/svnrdump_tests.py
  (svnrdump_load_partial_incremental_dump): Remove @XFail decorator.

Modified:
    subversion/trunk/subversion/svnrdump/load_editor.c
    subversion/trunk/subversion/tests/cmdline/svnrdump_tests.py

Modified: subversion/trunk/subversion/svnrdump/load_editor.c
URL: http://svn.apache.org/viewvc/subversion/trunk/subversion/svnrdump/load_editor.c?rev=1127646&r1=1127645&r2=1127646&view=diff
==============================================================================
--- subversion/trunk/subversion/svnrdump/load_editor.c (original)
+++ subversion/trunk/subversion/svnrdump/load_editor.c Wed May 25 19:33:21 2011
@@ -29,10 +29,12 @@
 #include "svn_props.h"
 #include "svn_path.h"
 #include "svn_ra.h"
+#include "svn_subst.h"
 #include "svn_io.h"
 #include "svn_private_config.h"
 #include "private/svn_repos_private.h"
 #include "private/svn_ra_private.h"
+#include "private/svn_mergeinfo_private.h"
 
 #include "svnrdump.h"
 
@@ -65,6 +67,18 @@ struct parse_baton
 
   /* Root URL of the target repository. */
   const char *root_url;
+
+  /* A mapping of svn_revnum_t * dump stream revisions to their
+     corresponding svn_revnum_t * target repository revisions. */
+  apr_hash_t *rev_map;
+
+  /* The most recent (youngest) revision from the dump stream mapped in
+     REV_MAP, or SVN_INVALID_REVNUM if no revisions have been mapped. */
+  svn_revnum_t last_rev_mapped;
+
+  /* The oldest revision loaded from the dump stream, or
+     SVN_INVALID_REVNUM if none have been loaded. */
+  svn_revnum_t oldest_dumpstream_rev;
 };
 
 /**
@@ -106,6 +120,7 @@ struct revision_baton
 {
   svn_revnum_t rev;
   apr_hash_t *revprop_table;
+  apr_int32_t rev_offset;
 
   const svn_string_t *datestamp;
   const svn_string_t *author;
@@ -117,14 +132,187 @@ struct revision_baton
 
 
 
+/* Record the mapping of FROM_REV to TO_REV in REV_MAP, ensuring that
+   anything added to the hash is allocated in the hash's pool. */
+static void
+add_revision_mapping(apr_hash_t *rev_map,
+                     svn_revnum_t from_rev,
+                     svn_revnum_t to_rev)
+{
+  svn_revnum_t *mapped_revs = apr_palloc(apr_hash_pool_get(rev_map),
+                                         sizeof(svn_revnum_t) * 2);
+  mapped_revs[0] = from_rev;
+  mapped_revs[1] = to_rev;
+  apr_hash_set(rev_map, mapped_revs,
+               sizeof(svn_revnum_t), mapped_revs + 1);
+}
+                     
+
+/* Examine the mergeinfo in INITIAL_VAL, renumber revisions in rangelists
+   as appropriate, and return the (possibly new) mergeinfo in *FINAL_VAL
+   (allocated from POOL). */
+/* ### FIXME:  Consider somehow sharing code with
+   ### libsvn_repos/load-fs-vtable.c:renumber_mergeinfo_revs() */
+static svn_error_t *
+renumber_mergeinfo_revs(svn_string_t **final_val,
+                        const svn_string_t *initial_val,
+                        struct revision_baton *rb,
+                        apr_pool_t *pool)
+{
+  apr_pool_t *subpool = svn_pool_create(pool);
+  svn_mergeinfo_t mergeinfo, predates_stream_mergeinfo;
+  svn_mergeinfo_t final_mergeinfo = apr_hash_make(subpool);
+  apr_hash_index_t *hi;
+
+  SVN_ERR(svn_mergeinfo_parse(&mergeinfo, initial_val->data, subpool));
+
+  /* Issue #3020
+     http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc16
+     Remove mergeinfo older than the oldest revision in the dump stream
+     and adjust its revisions by the difference between the head rev of
+     the target repository and the current dump stream rev. */
+  if (rb->pb->oldest_dumpstream_rev > 1)
+    {
+      SVN_ERR(svn_mergeinfo__filter_mergeinfo_by_ranges(
+        &predates_stream_mergeinfo, mergeinfo,
+        rb->pb->oldest_dumpstream_rev - 1, 0,
+        TRUE, subpool, subpool));
+      SVN_ERR(svn_mergeinfo__filter_mergeinfo_by_ranges(
+        &mergeinfo, mergeinfo,
+        rb->pb->oldest_dumpstream_rev - 1, 0,
+        FALSE, subpool, subpool));
+      SVN_ERR(svn_mergeinfo__adjust_mergeinfo_rangelists(
+        &predates_stream_mergeinfo, predates_stream_mergeinfo,
+        -rb->rev_offset, subpool, subpool));
+    }
+  else
+    {
+      predates_stream_mergeinfo = NULL;
+    }
+
+  for (hi = apr_hash_first(subpool, mergeinfo); hi; hi = apr_hash_next(hi))
+    {
+      const char *merge_source;
+      apr_array_header_t *rangelist;
+      struct parse_baton *pb = rb->pb;
+      int i;
+      const void *key;
+      void *val;
+
+      apr_hash_this(hi, &key, NULL, &val);
+      merge_source = key;
+      rangelist = val;
+
+      /* Possibly renumber revisions in merge source's rangelist. */
+      for (i = 0; i < rangelist->nelts; i++)
+        {
+          svn_revnum_t *rev_from_map;
+          svn_merge_range_t *range = APR_ARRAY_IDX(rangelist, i,
+                                                   svn_merge_range_t *);
+          rev_from_map = apr_hash_get(pb->rev_map, &range->start,
+                                      sizeof(svn_revnum_t));
+          if (rev_from_map && SVN_IS_VALID_REVNUM(*rev_from_map))
+            {
+              range->start = *rev_from_map;
+            }
+          else if (range->start == pb->oldest_dumpstream_rev - 1)
+            {
+              /* Since the start revision of svn_merge_range_t are not
+                 inclusive there is one possible valid start revision that
+                 won't be found in the PB->REV_MAP mapping of load stream
+                 revsions to loaded revisions: The revision immediately
+                 preceeding the oldest revision from the load stream.
+                 This is a valid revision for mergeinfo, but not a valid
+                 copy from revision (which PB->REV_MAP also maps for) so it
+                 will never be in the mapping.
+
+                 If that is what we have here, then find the mapping for the
+                 oldest rev from the load stream and subtract 1 to get the
+                 renumbered, non-inclusive, start revision. */
+              rev_from_map = apr_hash_get(pb->rev_map,
+                                          &pb->oldest_dumpstream_rev,
+                                          sizeof(svn_revnum_t));
+              if (rev_from_map && SVN_IS_VALID_REVNUM(*rev_from_map))
+                range->start = *rev_from_map - 1;
+            }
+          else
+            {
+              /* If we can't remap the start revision then don't even bother
+                 trying to remap the end revision.  It's possible we might
+                 actually succeed at the latter, which can result in invalid
+                 mergeinfo with a start rev > end rev.  If that gets into the
+                 repository then a world of bustage breaks loose anytime that
+                 bogus mergeinfo is parsed.  See
+                 http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc16.
+                 */
+              continue;
+            }
+
+          rev_from_map = apr_hash_get(pb->rev_map, &range->end,
+                                      sizeof(svn_revnum_t));
+          if (rev_from_map && SVN_IS_VALID_REVNUM(*rev_from_map))
+            range->end = *rev_from_map;
+        }
+      apr_hash_set(final_mergeinfo, merge_source,
+                   APR_HASH_KEY_STRING, rangelist);
+    }
+
+  if (predates_stream_mergeinfo)
+      SVN_ERR(svn_mergeinfo_merge(final_mergeinfo, predates_stream_mergeinfo,
+                                  subpool));
+
+  SVN_ERR(svn_mergeinfo_sort(final_mergeinfo, subpool));
+
+  /* Mergeinfo revision sources for r0 and r1 are invalid; you can't merge r0
+     or r1.  However, svndumpfilter can be abused to produce r1 merge source
+     revs.  So if we encounter any, then strip them out, no need to put them
+     into the load target. */
+  SVN_ERR(svn_mergeinfo__filter_mergeinfo_by_ranges(&final_mergeinfo,
+                                                    final_mergeinfo,
+                                                    1, 0, FALSE,
+                                                    subpool, subpool));
+
+  SVN_ERR(svn_mergeinfo_to_string(final_val, final_mergeinfo, pool));
+  svn_pool_destroy(subpool);
+
+  return SVN_NO_ERROR;
+}
+
+
 static svn_error_t *
 commit_callback(const svn_commit_info_t *commit_info,
                 void *baton,
                 apr_pool_t *pool)
 {
+  struct revision_baton *rb = baton;
+  struct parse_baton *pb = rb->pb;
+
   /* ### Don't print directly; generate a notification. */
   SVN_ERR(svn_cmdline_printf(pool, "* Loaded revision %ld.\n",
                              commit_info->revision));
+
+  /* Add the mapping of the dumpstream revision to the committed revision. */
+  add_revision_mapping(pb->rev_map, rb->rev, commit_info->revision);
+
+  /* If the incoming dump stream has non-contiguous revisions (e.g. from
+     using svndumpfilter --drop-empty-revs without --renumber-revs) then
+     we must account for the missing gaps in PB->REV_MAP.  Otherwise we
+     might not be able to map all mergeinfo source revisions to the correct
+     revisions in the target repos. */
+  if ((pb->last_rev_mapped != SVN_INVALID_REVNUM)
+      && (rb->rev != pb->last_rev_mapped + 1))
+    {
+      svn_revnum_t i;
+
+      for (i = pb->last_rev_mapped + 1; i < rb->rev; i++)
+        {
+          add_revision_mapping(pb->rev_map, i, pb->last_rev_mapped);
+        }
+    }
+
+  /* Update our "last revision mapped". */
+  pb->last_rev_mapped = rb->rev;
+  
   return SVN_NO_ERROR;
 }
 
@@ -186,6 +374,7 @@ new_revision_record(void **revision_bato
   struct revision_baton *rb;
   struct parse_baton *pb;
   apr_hash_index_t *hi;
+  svn_revnum_t head_rev;
 
   rb = apr_pcalloc(pool, sizeof(*rb));
   pb = parse_baton;
@@ -201,6 +390,14 @@ new_revision_record(void **revision_bato
         rb->rev = atoi(hval);
     }
 
+  SVN_ERR(svn_ra_get_latest_revnum(pb->session, &head_rev, pool));
+
+  /* FIXME: This is a lame fallback loading multiple segments of dump in
+     several separate operations. It is highly susceptible to race conditions.
+     Calculate the revision 'offset' for finding copyfrom sources.
+     It might be positive or negative. */
+  rb->rev_offset = (apr_int32_t) (rb->rev) - (head_rev + 1);
+
   /* Set the commit_editor/ commit_edit_baton to NULL and wait for
      them to be created in new_node_record */
   rb->pb->commit_editor = NULL;
@@ -263,13 +460,14 @@ new_node_record(void **node_baton,
 
       SVN_ERR(svn_ra_get_commit_editor3(rb->pb->session, &commit_editor,
                                         &commit_edit_baton, rb->revprop_table,
-                                        commit_callback, NULL, NULL, FALSE,
-                                        rb->pool));
+                                        commit_callback, revision_baton,
+                                        NULL, FALSE, rb->pool));
 
       rb->pb->commit_editor = commit_editor;
       rb->pb->commit_edit_baton = commit_edit_baton;
 
-      SVN_ERR(commit_editor->open_root(commit_edit_baton, rb->rev - 1,
+      SVN_ERR(commit_editor->open_root(commit_edit_baton,
+                                       rb->rev - rb->rev_offset - 1,
                                        rb->pool, &child_baton));
 
       LDR_DBG(("Opened root %p\n", child_baton));
@@ -358,7 +556,7 @@ new_node_record(void **node_baton,
                              rb->pool);
           SVN_ERR(commit_editor->open_directory(relpath_compose,
                                                 rb->db->baton,
-                                                rb->rev - 1,
+                                                rb->rev - rb->rev_offset - 1,
                                                 rb->pool, &child_baton));
           LDR_DBG(("Opened dir %p\n", child_baton));
           child_db = apr_pcalloc(rb->pool, sizeof(*child_db));
@@ -370,12 +568,30 @@ new_node_record(void **node_baton,
         }
     }
 
+  /* Fix up the copyfrom revision with our revision mapping. */
+  if (SVN_IS_VALID_REVNUM(nb->copyfrom_rev))
+    {
+      svn_revnum_t copyfrom_rev = nb->copyfrom_rev - rb->rev_offset;
+      svn_revnum_t *src_rev_from_map;
+
+      if ((src_rev_from_map = apr_hash_get(rb->pb->rev_map, &nb->copyfrom_rev,
+                                           sizeof(nb->copyfrom_rev))))
+        copyfrom_rev = *src_rev_from_map;
+
+      if (! SVN_IS_VALID_REVNUM(copyfrom_rev))
+        return svn_error_createf(SVN_ERR_FS_NO_SUCH_REVISION, NULL,
+                                 _("Relative source revision %ld is not"
+                                   " available in current repository"),
+                                 copyfrom_rev);
+      nb->copyfrom_rev = copyfrom_rev;
+    }
+
   switch (nb->action)
     {
     case svn_node_action_delete:
     case svn_node_action_replace:
       LDR_DBG(("Deleting entry %s in %p\n", nb->path, rb->db->baton));
-      SVN_ERR(commit_editor->delete_entry(nb->path, rb->rev,
+      SVN_ERR(commit_editor->delete_entry(nb->path, rb->rev - rb->rev_offset,
                                           rb->db->baton, rb->pool));
       if (nb->action == svn_node_action_delete)
         break;
@@ -441,13 +657,18 @@ set_revision_property(void *baton,
   SVN_ERR(svn_repos__validate_prop(name, value, rb->pool));
 
   if (rb->rev > 0)
-    apr_hash_set(rb->revprop_table, apr_pstrdup(rb->pool, name),
-                 APR_HASH_KEY_STRING, svn_string_dup(value, rb->pool));
-  else
-    /* Special handling for revision 0; this is safe because the
-       commit_editor hasn't been created yet. */
-    SVN_ERR(svn_ra_change_rev_prop2(rb->pb->session, rb->rev,
-                                    name, NULL, value, rb->pool));
+    {
+      apr_hash_set(rb->revprop_table, apr_pstrdup(rb->pool, name),
+                   APR_HASH_KEY_STRING, svn_string_dup(value, rb->pool));
+    }
+  else if (rb->rev_offset == -1)
+    {
+      /* Special case: set revision 0 properties directly (which is
+         safe because the commit_editor hasn't been created yet), but
+         only when loading into an 'empty' filesystem. */
+      SVN_ERR(svn_ra_change_rev_prop2(rb->pb->session, 0,
+                                      name, NULL, value, rb->pool));
+    }
 
   /* Remember any datestamp/ author that passes through (see comment
      in close_revision). */
@@ -468,6 +689,39 @@ set_node_property(void *baton,
   const struct svn_delta_editor_t *commit_editor = nb->rb->pb->commit_editor;
   apr_pool_t *pool = nb->rb->pool;
 
+  if (strcmp(name, SVN_PROP_MERGEINFO) == 0)
+    {
+      svn_string_t *renumbered_mergeinfo;
+      svn_string_t *prop_val = (svn_string_t *)value;
+
+      /* Tolerate mergeinfo with "\r\n" line endings because some
+         dumpstream sources might contain as much.  If so normalize
+         the line endings to '\n' and make a notification to
+         PARSE_BATON->FEEDBACK_STREAM that we have made this
+         correction. */
+      if (strstr(prop_val->data, "\r"))
+        {
+          const char *prop_eol_normalized;
+
+          SVN_ERR(svn_subst_translate_cstring2(prop_val->data,
+                                               &prop_eol_normalized,
+                                               "\n",  /* translate to LF */
+                                               FALSE, /* no repair */
+                                               NULL,  /* no keywords */
+                                               FALSE, /* no expansion */
+                                               pool));
+          prop_val->data = prop_eol_normalized;
+          prop_val->len = strlen(prop_eol_normalized);
+
+          /* ### TODO: notify? */
+        }
+
+      /* Renumber mergeinfo as appropriate. */
+      SVN_ERR(renumber_mergeinfo_revs(&renumbered_mergeinfo, prop_val,
+                                      nb->rb, pool));
+      value = renumbered_mergeinfo;
+    }
+
   SVN_ERR(svn_repos__validate_prop(name, value, pool));
 
   switch (nb->kind)
@@ -604,11 +858,14 @@ close_revision(void *baton)
   struct revision_baton *rb = baton;
   const svn_delta_editor_t *commit_editor = rb->pb->commit_editor;
   void *commit_edit_baton = rb->pb->commit_edit_baton;
+  svn_revnum_t committed_rev = SVN_INVALID_REVNUM;
 
   /* Fake revision 0 */
   if (rb->rev == 0)
-    /* ### Don't print directly; generate a notification. */
-    SVN_ERR(svn_cmdline_printf(rb->pool, "* Loaded revision 0.\n"));
+    {
+      /* ### Don't print directly; generate a notification. */
+      SVN_ERR(svn_cmdline_printf(rb->pool, "* Loaded revision 0.\n"));
+    }
   else if (commit_editor)
     {
       /* Close all pending open directories, and then close the edit
@@ -631,10 +888,11 @@ close_revision(void *baton)
       /* Legitimate revision with no node information */
       SVN_ERR(svn_ra_get_commit_editor3(rb->pb->session, &commit_editor,
                                         &commit_edit_baton, rb->revprop_table,
-                                        commit_callback, NULL, NULL, FALSE,
-                                        rb->pool));
+                                        commit_callback, baton,
+                                        NULL, FALSE, rb->pool));
 
-      SVN_ERR(commit_editor->open_root(commit_edit_baton, rb->rev - 1,
+      SVN_ERR(commit_editor->open_root(commit_edit_baton,
+                                       rb->rev - rb->rev_offset - 1,
                                        rb->pool, &child_baton));
 
       LDR_DBG(("Opened root %p\n", child_baton));
@@ -644,17 +902,33 @@ close_revision(void *baton)
     }
 
   /* svn_fs_commit_txn() rewrites the datestamp and author properties;
-     we'll rewrite them again by hand after closing the commit_editor. */
-  SVN_ERR(svn_repos__validate_prop(SVN_PROP_REVISION_DATE,
-                                   rb->datestamp, rb->pool));
-  SVN_ERR(svn_ra_change_rev_prop2(rb->pb->session, rb->rev,
-                                  SVN_PROP_REVISION_DATE,
-                                  NULL, rb->datestamp, rb->pool));
-  SVN_ERR(svn_repos__validate_prop(SVN_PROP_REVISION_AUTHOR,
-                                   rb->author, rb->pool));
-  SVN_ERR(svn_ra_change_rev_prop2(rb->pb->session, rb->rev,
-                                  SVN_PROP_REVISION_AUTHOR,
-                                  NULL, rb->author, rb->pool));
+     we'll rewrite them again by hand after closing the commit_editor.
+     The only time we don't do this is for revision 0 when loaded into
+     a non-empty repository.  */
+  if (rb->rev > 0)
+    {
+      svn_revnum_t *rev_from_map =
+        apr_hash_get(rb->pb->rev_map, &rb->rev, sizeof(rb->rev));
+      committed_rev = *rev_from_map;
+    }
+  else if (rb->rev_offset == -1)
+    {
+      committed_rev = 0;
+    }
+
+  if (SVN_IS_VALID_REVNUM(committed_rev))
+    {
+      SVN_ERR(svn_repos__validate_prop(SVN_PROP_REVISION_DATE,
+                                       rb->datestamp, rb->pool));
+      SVN_ERR(svn_ra_change_rev_prop2(rb->pb->session, committed_rev,
+                                      SVN_PROP_REVISION_DATE,
+                                      NULL, rb->datestamp, rb->pool));
+      SVN_ERR(svn_repos__validate_prop(SVN_PROP_REVISION_AUTHOR,
+                                       rb->author, rb->pool));
+      SVN_ERR(svn_ra_change_rev_prop2(rb->pb->session, committed_rev,
+                                      SVN_PROP_REVISION_AUTHOR,
+                                      NULL, rb->author, rb->pool));
+    }
 
   svn_pool_destroy(rb->pool);
 
@@ -699,6 +973,9 @@ svn_rdump__load_dumpstream(svn_stream_t 
   parse_baton->session = session;
   parse_baton->aux_session = aux_session;
   parse_baton->root_url = root_url;
+  parse_baton->rev_map = apr_hash_make(pool);
+  parse_baton->last_rev_mapped = SVN_INVALID_REVNUM;
+  parse_baton->oldest_dumpstream_rev = SVN_INVALID_REVNUM;
 
   err = svn_repos_parse_dumpstream2(stream, parser, parse_baton,
                                     cancel_func, cancel_baton, pool);

Modified: subversion/trunk/subversion/tests/cmdline/svnrdump_tests.py
URL: http://svn.apache.org/viewvc/subversion/trunk/subversion/tests/cmdline/svnrdump_tests.py?rev=1127646&r1=1127645&r2=1127646&view=diff
==============================================================================
--- subversion/trunk/subversion/tests/cmdline/svnrdump_tests.py (original)
+++ subversion/trunk/subversion/tests/cmdline/svnrdump_tests.py Wed May 25 19:33:21 2011
@@ -723,7 +723,6 @@ def dont_drop_valid_mergeinfo_during_inc
                                      sbox.repo_url)
 
 #----------------------------------------------------------------------
-@XFail()
 @Issue(3890)
 def svnrdump_load_partial_incremental_dump(sbox):
   "svnrdump load partial incremental dump"



Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 05/25/2011 05:41 PM, Daniel Shahaf wrote:
> Daniel Shahaf wrote on Thu, May 26, 2011 at 00:36:48 +0300:
>> I've not read the code, but would an array of 'svn_revnum_t[2]' be
>> a better representation?
>>
>> Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
>> by from_rev.  That's less overhead (and we could take advantage of the
>> sorting to store less data), at the cost of O(log(n)) lookup.
>>
> 
> And the numbers... well, for a plain C array it would be 70% less than
> in Greg's analysis of the hash case, since the additional 20 bytes per
> mapping entry are gone.   (and that's before I suggest to not store
> [N+1, M+1] if [N,M] is already stored)

Great ideas.  I've filed issue #3903 to track this enhancement.  Not, IMO, a
1.7 blocker (any more than it was a 1.0 blocker).

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Shahaf wrote on Thu, May 26, 2011 at 00:36:48 +0300:
> I've not read the code, but would an array of 'svn_revnum_t[2]' be
> a better representation?
> 
> Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
> by from_rev.  That's less overhead (and we could take advantage of the
> sorting to store less data), at the cost of O(log(n)) lookup.
> 

And the numbers... well, for a plain C array it would be 70% less than
in Greg's analysis of the hash case, since the additional 20 bytes per
mapping entry are gone.   (and that's before I suggest to not store
[N+1, M+1] if [N,M] is already stored)

> Greg Stein wrote on Wed, May 25, 2011 at 17:21:44 -0400:
> > On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cm...@collab.net> wrote:
> > > On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> > >> On 05/25/2011 03:49 PM, Greg Stein wrote:
> > >>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
> > >>>> ...
> > >>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> > >>>> +     corresponding svn_revnum_t * target repository revisions. */
> > >>>> +  apr_hash_t *rev_map;
> > >>>
> > >>> How big can this grow? ie. what happens when there are several million
> > >>> revisions.
> > >>
> > >> It gets big.  (This logic and approach are copied from 'svnadmin load',
> > >> which doesn't excuse it, but might explain it.)
> > >
> > > Actually, I don't really know for sure how big it gets.  It's a mapping of
> > > of sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> > > internals.  Anybody have any guesses?
> > 
> > struct apr_hash_entry_t is generally 20 bytes. Add in the two revnums
> > (4 bytes each), and you get 28 bytes for each *used* entry.
> > 
> > Now we also have to account for unused entries. APR has a pretty poor
> > hash table implementation. It allocates *upwards* to the nearest power
> > of two. So the internal size will grow like:
> > 
> > 1048576
> > 2097152
> > 4194304
> > 
> > One saving grace is that APR only grows when the entry count matches
> > the internal table size. It uses a "closed hash" algorithm with linked
> > lists at each bucket, so the actual load on the buckets is not
> > possible to compute. The hand-wave means that you can put in 4 million
> > mappings before it grows it up to 8 million buckets.
> > 
> > So... 4 million buckets (pointers) at 4 bytes each is 80 megabytes.
> > Each mapping will add another 28 bytes. So: 4 million mappings is
> > about 134 megabytes. But also recognize that *reaching* that point
> > will use and toss approx the same amount of memory. So about 260 meg
> > total.
> > 
> > On a 64-bit architecture, all these values are likely to be doubled.
> > 
> > Not a machine crusher, in retrospect. But not exactly a winner either.
> > 
> > Cheers,
> > -g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
I've not read the code, but would an array of 'svn_revnum_t[2]' be
a better representation?

Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
by from_rev.  That's less overhead (and we could take advantage of the
sorting to store less data), at the cost of O(log(n)) lookup.

Greg Stein wrote on Wed, May 25, 2011 at 17:21:44 -0400:
> On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cm...@collab.net> wrote:
> > On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> >> On 05/25/2011 03:49 PM, Greg Stein wrote:
> >>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
> >>>> ...
> >>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> >>>> +     corresponding svn_revnum_t * target repository revisions. */
> >>>> +  apr_hash_t *rev_map;
> >>>
> >>> How big can this grow? ie. what happens when there are several million
> >>> revisions.
> >>
> >> It gets big.  (This logic and approach are copied from 'svnadmin load',
> >> which doesn't excuse it, but might explain it.)
> >
> > Actually, I don't really know for sure how big it gets.  It's a mapping of
> > of sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> > internals.  Anybody have any guesses?
> 
> struct apr_hash_entry_t is generally 20 bytes. Add in the two revnums
> (4 bytes each), and you get 28 bytes for each *used* entry.
> 
> Now we also have to account for unused entries. APR has a pretty poor
> hash table implementation. It allocates *upwards* to the nearest power
> of two. So the internal size will grow like:
> 
> 1048576
> 2097152
> 4194304
> 
> One saving grace is that APR only grows when the entry count matches
> the internal table size. It uses a "closed hash" algorithm with linked
> lists at each bucket, so the actual load on the buckets is not
> possible to compute. The hand-wave means that you can put in 4 million
> mappings before it grows it up to 8 million buckets.
> 
> So... 4 million buckets (pointers) at 4 bytes each is 80 megabytes.
> Each mapping will add another 28 bytes. So: 4 million mappings is
> about 134 megabytes. But also recognize that *reaching* that point
> will use and toss approx the same amount of memory. So about 260 meg
> total.
> 
> On a 64-bit architecture, all these values are likely to be doubled.
> 
> Not a machine crusher, in retrospect. But not exactly a winner either.
> 
> Cheers,
> -g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Greg Stein <gs...@gmail.com>.
On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cm...@collab.net> wrote:
> On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
>> On 05/25/2011 03:49 PM, Greg Stein wrote:
>>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>>>> ...
>>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
>>>> +     corresponding svn_revnum_t * target repository revisions. */
>>>> +  apr_hash_t *rev_map;
>>>
>>> How big can this grow? ie. what happens when there are several million
>>> revisions.
>>
>> It gets big.  (This logic and approach are copied from 'svnadmin load',
>> which doesn't excuse it, but might explain it.)
>
> Actually, I don't really know for sure how big it gets.  It's a mapping of
> of sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> internals.  Anybody have any guesses?

struct apr_hash_entry_t is generally 20 bytes. Add in the two revnums
(4 bytes each), and you get 28 bytes for each *used* entry.

Now we also have to account for unused entries. APR has a pretty poor
hash table implementation. It allocates *upwards* to the nearest power
of two. So the internal size will grow like:

1048576
2097152
4194304

One saving grace is that APR only grows when the entry count matches
the internal table size. It uses a "closed hash" algorithm with linked
lists at each bucket, so the actual load on the buckets is not
possible to compute. The hand-wave means that you can put in 4 million
mappings before it grows it up to 8 million buckets.

So... 4 million buckets (pointers) at 4 bytes each is 80 megabytes.
Each mapping will add another 28 bytes. So: 4 million mappings is
about 134 megabytes. But also recognize that *reaching* that point
will use and toss approx the same amount of memory. So about 260 meg
total.

On a 64-bit architecture, all these values are likely to be doubled.

Not a machine crusher, in retrospect. But not exactly a winner either.

Cheers,
-g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> On 05/25/2011 03:49 PM, Greg Stein wrote:
>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>>> ...
>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
>>> +     corresponding svn_revnum_t * target repository revisions. */
>>> +  apr_hash_t *rev_map;
>>
>> How big can this grow? ie. what happens when there are several million
>> revisions.
> 
> It gets big.  (This logic and approach are copied from 'svnadmin load',
> which doesn't excuse it, but might explain it.)

Actually, I don't really know for sure how big it gets.  It's a mapping of
of sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
internals.  Anybody have any guesses?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 05/25/2011 03:49 PM, Greg Stein wrote:
> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>> ...
>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
>> +     corresponding svn_revnum_t * target repository revisions. */
>> +  apr_hash_t *rev_map;
> 
> How big can this grow? ie. what happens when there are several million
> revisions.

It gets big.  (This logic and approach are copied from 'svnadmin load',
which doesn't excuse it, but might explain it.)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Greg Stein <gs...@gmail.com>.
On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>...
> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> +     corresponding svn_revnum_t * target repository revisions. */
> +  apr_hash_t *rev_map;

How big can this grow? ie. what happens when there are several million
revisions.

Cheers,
-g