You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@subversion.apache.org by st...@apache.org on 2014/01/03 15:34:42 UTC

svn commit: r1555109 - /subversion/trunk/subversion/libsvn_fs_fs/pack.c

Author: stefan2
Date: Fri Jan  3 14:34:41 2014
New Revision: 1555109

URL: http://svn.apache.org/r1555109
Log:
Refine the FSFS f7 pack heuristics away from the container-centric
order used by FSX.  The changes only affect the order of text reps
and noderevs; changed paths lists and properties remain as are.

Sort nodes and the associated reps in ascending path order (used to
be descending order) since this is the "natural" traversal order for
directories and it puts parent folders in front of their children.
Tweak the key for "trunk" and "branch" such that they appear before
any other node.  Now, trunk is the easiest to reconstructand branches
are quicker to access than tags or user-defined paths like "vendor".

Skip-deltas implicitly use a concept of "roundness":  In the non-
linear sections, we always deltify against the rep of the node whose
predecessor count has one more 0 LSBs, e.g. x1B0->x1A0->x180->x100->0.
Thus, the "roundest" node found for each path is the one most
likely to be referenced by deltas in future pack files.  Keeping them
in one section of the pack file will make that section a hot spot for
"foreign" references.

Finally, don't concatenate all noderevs for a given path but put the
noderevs immediately in front of their respective representations,
so the noderev->rep access is less likely to cross a block boundary.

* subversion/libsvn_fs_fs/pack.c
  (path_order_t): Also store the predecessor count as an indicator
                  for "roundness".
  (pack_context_t): Use separate containers for noderev->rep and
                    rep->delta base rep references.
  (initialize_pack_context,
   reset_pack_context): Construct and clean new containers as well.
  (copy_rep_to_temp): Write to the right container.
  (tweak_path_for_ordering): New path meddling function.
  (copy_node_to_temp): Write to the right container, provide the
                       extra path_order_t member and use the tweaked
                       path as sort key.
  (compare_path_order): Yield an ascending path order now.
  (compare_references,
   compare_ref_to_item): Replaced by ...
  (compare_references_to,
   compare_references_from,
   compare_ref_to_item_to,
   compare_ref_to_item_from): ... these two variants with basically the
                             same logic.
  (sort_reps): Implement the roundness heuristics and resolve the
               delta chains, which used to be done in select_reps().
  (find_first_reference,
   is_reference_match): Use to the right container.
  (store_item): Factored out from ...
  (store_items): ... this one.
  (select_reps): Drop.
  (copy_reps_from_temp): Copy nodes in exactly the order suggested
                         by sort_reps().

Modified:
    subversion/trunk/subversion/libsvn_fs_fs/pack.c

Modified: subversion/trunk/subversion/libsvn_fs_fs/pack.c
URL: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_fs_fs/pack.c?rev=1555109&r1=1555108&r2=1555109&view=diff
==============================================================================
--- subversion/trunk/subversion/libsvn_fs_fs/pack.c (original)
+++ subversion/trunk/subversion/libsvn_fs_fs/pack.c Fri Jan  3 14:34:41 2014
@@ -20,6 +20,7 @@
  * ====================================================================
  */
 #include <assert.h>
+#include <string.h>
 
 #include "svn_pools.h"
 #include "svn_dirent_uri.h"
@@ -72,7 +73,8 @@
  * - changed paths lists
  * - node property
  * - directory properties
- * - noderevs and representations, reverse lexical path order
+ * - noderevs and representations, lexical path order with special
+ *   treatment of "trunk" and "branches"
  *
  * Step 4 copies the items from the temporary buckets into the final
  * pack file and writes the temporary index files.
@@ -101,6 +103,9 @@ typedef struct path_order_t
   /* when this change happened */
   svn_revnum_t revision;
 
+  /* noderev predecessor count */
+  int predecessor_count;
+
   /* length of the expanded representation content */
   apr_int64_t expanded_size;
 
@@ -201,9 +206,15 @@ typedef struct pack_context_t
    * after each revision range.  Sorted by PATH, NODE_ID. */
   apr_array_header_t *path_order;
 
-  /* array of reference_t *.  Will be filled in phase 2 and be cleared
-   * after each revision range.  It will be sorted by the TO members. */
-  apr_array_header_t *references;
+  /* array of reference_t* linking noderevs to representations.  Will be
+   * filled in phase 2 and be cleared after each revision range.  It will
+   * be sorted by the TO members (i.e. the allow rep->noderev lookup). */
+  apr_array_header_t *node_references;
+
+  /* array of reference_t* linking representations to their delta bases.
+   * Will be filled in phase 2 and be cleared after each revision range.
+   * It will be sorted by the FROM members (for rep->base rep lookup). */
+  apr_array_header_t *rep_references;
 
   /* array of svn_fs_fs__p2l_entry_t*.  Will be filled in phase 2 and be
    * cleared after each revision range.  During phase 3, we will set items
@@ -301,8 +312,12 @@ initialize_pack_context(pack_context_t *
 
   /* noderev and representation item bucket */
   context->rev_offsets = apr_array_make(pool, max_revs, sizeof(int));
-  context->path_order = apr_array_make(pool, max_items, sizeof(path_order_t *));
-  context->references = apr_array_make(pool, max_items, sizeof(reference_t *));
+  context->path_order = apr_array_make(pool, max_items,
+                                       sizeof(path_order_t *));
+  context->node_references = apr_array_make(pool, max_items,
+                                            sizeof(reference_t *));
+  context->rep_references = apr_array_make(pool, max_items,
+                                           sizeof(reference_t *));
   context->reps = apr_array_make(pool, max_items,
                                  sizeof(svn_fs_fs__p2l_entry_t *));
   SVN_ERR(svn_io_open_unique_file3(&context->reps_file, NULL, temp_dir,
@@ -331,7 +346,8 @@ reset_pack_context(pack_context_t *conte
 
   apr_array_clear(context->rev_offsets);
   apr_array_clear(context->path_order);
-  apr_array_clear(context->references);
+  apr_array_clear(context->node_references);
+  apr_array_clear(context->rep_references);
   apr_array_clear(context->reps);
   SVN_ERR(svn_io_file_trunc(context->reps_file, 0, pool));
 
@@ -577,7 +593,7 @@ copy_rep_to_temp(pack_context_t *context
       reference->from = entry->item;
       reference->to.revision = rep_header->base_revision;
       reference->to.number = rep_header->base_item_index;
-      APR_ARRAY_PUSH(context->references, reference_t *) = reference;
+      APR_ARRAY_PUSH(context->rep_references, reference_t *) = reference;
     }
 
   /* copy the whole rep (including header!) to our temp file */
@@ -658,6 +674,37 @@ svn_fs_fs__order_dir_entries(svn_fs_t *f
   return result;
 }
 
+/* Return a duplicate of the the ORIGINAL path and with special sub-strins
+ * (e.g. "trunk") modified in such a way that have a lower lexicographic
+ * value than any other "normal" file name.
+ */
+static const char *
+tweak_path_for_ordering(const char *original,
+                        apr_pool_t *pool)
+{
+  /* We may add further special cases as needed. */
+  enum {SPECIAL_COUNT = 2};
+  static const char *special[SPECIAL_COUNT] = {"trunk", "branch"};
+  char *pos;
+  char *path = apr_pstrdup(pool, original);
+  int i;
+
+  /* Replace the first char of any "special" sub-string we find by
+   * a control char, i.e. '\1' .. '\31'.  In the rare event that this
+   * would clash with existing paths, no data will be lost but merely
+   * the node ordering will be sub-optimal.
+   */
+  for (i = 0; i < SPECIAL_COUNT; ++i)
+    for (pos = strstr(path, special[i]);
+         pos;
+         pos = strstr(pos + 1, special[i]))
+      {
+        *pos = (char)(i + '\1');
+      }
+
+   return path;
+}
+
 /* Copy node revision item identified by ENTRY from the current position
  * in REV_FILE into CONTEXT->REPS_FILE.  Add all tracking into needed by
  * our placement algorithm to CONTEXT.  Use POOL for temporary allocations.
@@ -671,6 +718,7 @@ copy_node_to_temp(pack_context_t *contex
   path_order_t *path_order = apr_pcalloc(context->info_pool,
                                          sizeof(*path_order));
   node_revision_t *noderev;
+  const char *sort_path;
   svn_stream_t *stream;
   apr_off_t source_offset = entry->offset;
 
@@ -703,7 +751,7 @@ copy_node_to_temp(pack_context_t *contex
       reference->from = entry->item;
       reference->to.revision = noderev->data_rep->revision;
       reference->to.number = noderev->data_rep->item_index;
-      APR_ARRAY_PUSH(context->references, reference_t *) = reference;
+      APR_ARRAY_PUSH(context->node_references, reference_t *) = reference;
 
       path_order->rep_id = reference->to;
       path_order->expanded_size = noderev->data_rep->expanded_size
@@ -711,10 +759,13 @@ copy_node_to_temp(pack_context_t *contex
                                 : noderev->data_rep->size;
     }
 
-  path_order->path = svn_prefix_string__create(context->paths,
-                                               noderev->created_path);
+  /* Sort path is the key used for ordering noderevs and associated reps.
+   * It will not be stored in the final pack file. */
+  sort_path = tweak_path_for_ordering(noderev->created_path, pool);
+  path_order->path = svn_prefix_string__create(context->paths, sort_path);
   path_order->node_id = *svn_fs_fs__id_node_id(noderev->id);
   path_order->revision = svn_fs_fs__id_rev(noderev->id);
+  path_order->predecessor_count = noderev->predecessor_count;
   path_order->noderev_id = *svn_fs_fs__id_rev_item(noderev->id);
   APR_ARRAY_PUSH(context->path_order, path_order_t *) = path_order;
 
@@ -730,8 +781,8 @@ compare_path_order(const path_order_t * 
   const path_order_t * lhs = *lhs_p;
   const path_order_t * rhs = *rhs_p;
 
-  /* reverse lexicographic order on path and node (i.e. latest first) */
-  int diff = svn_prefix_string__compare(rhs->path, lhs->path);
+  /* lexicographic order on path and node (i.e. latest first) */
+  int diff = svn_prefix_string__compare(lhs->path, rhs->path);
   if (diff)
     return diff;
 
@@ -747,11 +798,11 @@ compare_path_order(const path_order_t * 
   return 0;
 }
 
-/* implements compare_fn_t.  Sort ascending by TO, FROM.
+/* implements compare_fn_t.  Sort ascendingly by TO, FROM.
  */
 static int
-compare_references(const reference_t * const * lhs_p,
-                   const reference_t * const * rhs_p)
+compare_references_to(const reference_t * const * lhs_p,
+                      const reference_t * const * rhs_p)
 {
   const reference_t * lhs = *lhs_p;
   const reference_t * rhs = *rhs_p;
@@ -760,18 +811,190 @@ compare_references(const reference_t * c
   return diff ? diff : svn_fs_fs__id_part_compare(&lhs->from, &rhs->from);
 }
 
+/* implements compare_fn_t.  Sort ascendingly by FROM, TO.
+ */
+static int
+compare_references_from(const reference_t * const * lhs_p,
+                        const reference_t * const * rhs_p)
+{
+  const reference_t * lhs = *lhs_p;
+  const reference_t * rhs = *rhs_p;
+
+  int diff = svn_fs_fs__id_part_compare(&lhs->from, &rhs->from);
+  return diff ? diff : svn_fs_fs__id_part_compare(&lhs->to, &rhs->to);
+}
+
+/* implements compare_fn_t.  Assume ascending order by TO.
+ */
+static int
+compare_ref_to_item_to(const reference_t * const * lhs_p,
+                       const svn_fs_fs__id_part_t * rhs_p)
+{
+  return svn_fs_fs__id_part_compare(&(*lhs_p)->to, rhs_p);
+}
+
+/* implements compare_fn_t.  Assume ascending order by FROM.
+ */
+static int
+compare_ref_to_item_from(const reference_t * const * lhs_p,
+                         const svn_fs_fs__id_part_t * rhs_p)
+{
+  return svn_fs_fs__id_part_compare(&(*lhs_p)->from, rhs_p);
+}
+
+/* Look for the least significant bit set in VALUE and return the smallest
+ * number with the same property, i.e. the largest power of 2 that is a
+ * factor in VALUE. */
+static int
+roundness(int value)
+{
+  return value ? value - (value & (value - 1)) : INT_MAX;
+}
+
 /* Order the data collected in CONTEXT such that we can place them in the
  * desired order.
  */
 static void
 sort_reps(pack_context_t *context)
 {
+  apr_pool_t *temp_pool;
+  const path_order_t **temp, **path_order;
+  const svn_prefix_string__t *path;
+  int i, dest, count, best;
+  svn_fs_fs__id_part_t rep_id;
+  fs_fs_data_t *ffd = context->fs->fsap_data;
+
+  /* We will later assume that there is at least one node / path.
+   */
+  if (context->path_order->nelts == 0)
+    {
+      assert(context->node_references->nelts == 0);
+      assert(context->rep_references->nelts == 0);
+      return;
+    }
+
+  /* Sort containers by path and IDs, respectively.
+   */
   qsort(context->path_order->elts, context->path_order->nelts,
         context->path_order->elt_size,
         (int (*)(const void *, const void *))compare_path_order);
-  qsort(context->references->elts, context->references->nelts,
-        context->references->elt_size,
-        (int (*)(const void *, const void *))compare_references);
+  qsort(context->rep_references->elts, context->rep_references->nelts,
+        context->rep_references->elt_size,
+        (int (*)(const void *, const void *))compare_references_from);
+  qsort(context->node_references->elts, context->node_references->nelts,
+        context->node_references->elt_size,
+        (int (*)(const void *, const void *))compare_references_to);
+
+  /* Re-order noderevs like this:
+   *
+   * (1) Most likely to be referenced by future pack files, in path order.
+   * (2) highest revision rep per path + dependency chain
+   * (3) Remaining reps in path, rev order
+   *
+   * We simply pick & chose from the existing path, rev order.
+   */
+
+  temp_pool = svn_pool_create(context->info_pool);
+  count = context->path_order->nelts;
+  temp = apr_pcalloc(temp_pool, count * sizeof(*temp));
+  path_order = (void *)context->path_order->elts;
+
+  dest = 0;
+  path = path_order[0]->path;
+  best = 0;
+
+  /* (1) For each path, pick the "roundest" representation and put it in
+   * front of all other nodes in the pack file.  The "roundest" rep is
+   * the one most likely to be referenced from future pack files, i.e. we
+   * concentrate those potential "foreign link targets" in one section of
+   * the pack file.
+   *
+   * And we only apply this to reps outside the linear deltification
+   * sections because references *into* linear deltification ranges are
+   * much less likely.
+   */
+  for (i = 0; i < count; ++i)
+    {
+      /* Investigated all nodes for the current path? */
+      if (svn_prefix_string__compare(path, path_order[i]->path))
+        {
+          /* next path */
+          path = path_order[i]->path;
+          rep_id = path_order[i]->rep_id;
+
+          /* Pick roundest non-linear deltified node. */
+          if (roundness(path_order[best]->predecessor_count)
+              >= ffd->max_linear_deltification)
+            {
+              temp[dest++] = path_order[best];
+              path_order[best] = NULL;
+              best = i;
+            }
+        }
+
+      /* next entry */
+      if (  roundness(path_order[best]->predecessor_count)
+          < roundness(path_order[i]->predecessor_count))
+        best = i;
+    }
+
+  /* Treat the last path the same as all others. */
+  if (roundness(path_order[best]->predecessor_count)
+      >= ffd->max_linear_deltification)
+    {
+      temp[dest++] = path_order[best];
+      path_order[best] = NULL;
+    }
+
+  /* (2) For each (remaining) path, pick the nodes along the delta chain
+   * for the highest revision.  Due to our ordering, this is the first
+   * node we encounter for any path.
+   *
+   * Most references that don't hit a delta base picked in (1), will
+   * access HEAD of the respective path.  Keeping all its dependency chain
+   * in one place turns reconstruction into a linear scan of minimal length.
+   */
+  for (i = 0; i < count; ++i)
+    if (path_order[i])
+      {
+        /* New path? */
+        if (svn_prefix_string__compare(path, path_order[i]->path))
+          {
+            path = path_order[i]->path;
+            rep_id = path_order[i]->rep_id;
+          }
+
+        /* Pick nodes along the deltification chain.  Skip side-branches. */
+        if (svn_fs_fs__id_part_eq(&path_order[i]->rep_id, &rep_id))
+          {
+            reference_t **reference;
+
+            temp[dest++] = path_order[i];
+            path_order[i] = 0;
+
+            reference = svn_sort__array_lookup(context->rep_references,
+                                               &rep_id, NULL,
+              (int (*)(const void *, const void *))compare_ref_to_item_from);
+            if (reference)
+              rep_id = (*reference)->to;
+          }
+      }
+
+  /* (3) All remaining nodes in path, rev order.  Linear deltification
+   * makes HEAD delta chains from (2) cover all or most of their deltas
+   * in a given pack file.  So, this is just a few remnants that we put
+   * at the end of the pack file.
+   */
+  for (i = 0; i < count; ++i)
+    if (path_order[i])
+      temp[dest++] = path_order[i];
+
+  /* We now know the final ordering. */
+  assert(dest == count);
+  for (i = 0; i < count; ++i)
+    path_order[i] = temp[i];
+  
+  svn_pool_destroy(temp_pool);
 }
 
 /* implements compare_fn_t. Place LHS before RHS, if the latter is older.
@@ -851,6 +1074,48 @@ auto_pad_block(pack_context_t *context,
   return SVN_NO_ERROR;
 }
 
+/* Read the contents of ITEM, if not empty, from TEMP_FILE and write it
+ * to CONTEXT->PACK_FILE.  Use POOL for allocations.
+ */
+static svn_error_t *
+store_item(pack_context_t *context,
+           apr_file_t *temp_file,
+           svn_fs_fs__p2l_entry_t *item,
+           apr_pool_t *pool)
+{
+  apr_off_t safety_margin;
+
+  /* skip empty entries */
+  if (item->type == SVN_FS_FS__ITEM_TYPE_UNUSED)
+    return SVN_NO_ERROR;
+
+  /* If the next item does not fit into the current block, auto-pad it.
+      Take special care of textual noderevs since their parsers may
+      prefetch up to 80 bytes and we don't want them to cross block
+      boundaries. */
+  safety_margin = item->type == SVN_FS_FS__ITEM_TYPE_NODEREV
+                ? SVN__LINE_CHUNK_SIZE
+                : 0;
+  SVN_ERR(auto_pad_block(context, item->size + safety_margin, pool));
+
+  /* select the item in the source file and copy it into the target
+    * pack file */
+  SVN_ERR(svn_io_file_seek(temp_file, SEEK_SET, &item->offset, pool));
+  SVN_ERR(copy_file_data(context, context->pack_file, temp_file,
+                         item->size, pool));
+
+  /* write index entry and update current position */
+  item->offset = context->pack_offset;
+  context->pack_offset += item->size;
+
+  SVN_ERR(svn_fs_fs__p2l_proto_index_add_entry(context->proto_p2l_index,
+                                               item, pool));
+
+  APR_ARRAY_PUSH(context->reps, svn_fs_fs__p2l_entry_t *) = item;
+
+  return SVN_NO_ERROR;
+}
+
 /* Read the contents of the non-empty items in ITEMS from TEMP_FILE and
  * write them to CONTEXT->PACK_FILE.  Use POOL for allocations.
  */
@@ -866,40 +1131,10 @@ store_items(pack_context_t *context,
   /* copy all items in strict order */
   for (i = 0; i < items->nelts; ++i)
     {
-      apr_off_t safety_margin;
-
-      /* skip empty entries */
-      svn_fs_fs__p2l_entry_t *entry
-        = APR_ARRAY_IDX(items, i, svn_fs_fs__p2l_entry_t *);
-      if (entry->type == SVN_FS_FS__ITEM_TYPE_UNUSED)
-        continue;
-
       svn_pool_clear(iterpool);
-
-      /* If the next item does not fit into the current block, auto-pad it.
-         Take special care of textual noderevs since their parsers may
-         prefetch up to 80 bytes and we don't want them to cross block
-         boundaries. */
-      safety_margin = entry->type == SVN_FS_FS__ITEM_TYPE_NODEREV
-                    ? SVN__LINE_CHUNK_SIZE
-                    : 0;
-      SVN_ERR(auto_pad_block(context, entry->size + safety_margin, iterpool));
-
-      /* select the item in the source file and copy it into the target
-       * pack file */
-      SVN_ERR(svn_io_file_seek(temp_file, SEEK_SET, &entry->offset,
-                               iterpool));
-      SVN_ERR(copy_file_data(context, context->pack_file, temp_file,
-                             entry->size, iterpool));
-
-      /* write index entry and update current position */
-      entry->offset = context->pack_offset;
-      context->pack_offset += entry->size;
-
-      SVN_ERR(svn_fs_fs__p2l_proto_index_add_entry(
-                   context->proto_p2l_index, entry, iterpool));
-
-      APR_ARRAY_PUSH(context->reps, svn_fs_fs__p2l_entry_t *) = entry;
+      SVN_ERR(store_item(context, temp_file,
+                         APR_ARRAY_IDX(items, i, svn_fs_fs__p2l_entry_t *),
+                         iterpool));
     }
 
   svn_pool_destroy(iterpool);
@@ -907,15 +1142,6 @@ store_items(pack_context_t *context,
   return SVN_NO_ERROR;
 }
 
-/* implements compare_fn_t.  Sort ascending by TO.
- */
-static int
-compare_ref_to_item(const reference_t * const * lhs_p,
-                    const svn_fs_fs__id_part_t * rhs_p)
-{
-  return svn_fs_fs__id_part_compare(&(*lhs_p)->to, rhs_p);
-}
-
 /* Return the index of the first entry in CONTEXT->REFERENCES that
  * references ITEM if such entries exist.  All matching items will be
  * consecutive.
@@ -924,8 +1150,8 @@ static int
 find_first_reference(pack_context_t *context,
                      svn_fs_fs__p2l_entry_t *item)
 {
-  return svn_sort__bsearch_lower_bound(context->references, &item->item,
-                (int (*)(const void *, const void *))compare_ref_to_item);
+  return svn_sort__bsearch_lower_bound(context->node_references, &item->item,
+                (int (*)(const void *, const void *))compare_ref_to_item_to);
 }
 
 /* Check whether entry number IDX in CONTEXT->REFERENCES references ITEM.
@@ -936,84 +1162,13 @@ is_reference_match(pack_context_t *conte
                    svn_fs_fs__p2l_entry_t *item)
 {
   reference_t *reference;
-  if (context->references->nelts <= idx)
+  if (context->node_references->nelts <= idx)
     return FALSE;
 
-  reference = APR_ARRAY_IDX(context->references, idx, reference_t *);
+  reference = APR_ARRAY_IDX(context->node_references, idx, reference_t *);
   return svn_fs_fs__id_part_eq(&reference->to, &item->item);
 }
 
-/* Starting at IDX in CONTEXT->PATH_ORDER, select all representations and
- * noderevs that should be placed into the same container, respectively.
- * Append the svn_fs_fs__p2l_entry_t * of the representations that to
- * REP_PARTS and apend the svn_fs_fs__p2l_entry_t * of the noderevs
- * referencing those reps will to NODE_PARTS.
- *
- * Remove all returned items from the CONTEXT->REPS container and prevent
- * them from being placed a second time later on.  That also means that the
- * caller has to place all items returned.
- */
-static svn_error_t *
-select_reps(pack_context_t *context,
-            int idx,
-            apr_array_header_t *node_parts,
-            apr_array_header_t *rep_parts)
-{
-  apr_array_header_t *path_order = context->path_order;
-  path_order_t *start_path = APR_ARRAY_IDX(path_order, idx, path_order_t *);
-
-  svn_fs_fs__p2l_entry_t *node_part;
-  svn_fs_fs__p2l_entry_t *rep_part;
-  svn_fs_fs__p2l_entry_t *depending;
-  int i, k;
-
-  /* collect all path_order records as well as rep and noderev items
-   * that occupy the same path with the same node. */
-  for (; idx < path_order->nelts; ++idx)
-    {
-      path_order_t *current_path
-        = APR_ARRAY_IDX(path_order, idx, path_order_t *);
-
-      if (!svn_fs_fs__id_part_eq(&start_path->node_id,
-                                 &current_path->node_id))
-        break;
-
-      APR_ARRAY_IDX(path_order, idx, path_order_t *) = NULL;
-      node_part = get_item(context, &current_path->noderev_id, TRUE);
-      rep_part = get_item(context, &current_path->rep_id, TRUE);
-
-      if (node_part)
-        APR_ARRAY_PUSH(node_parts, svn_fs_fs__p2l_entry_t *) = node_part;
-      if (rep_part)
-        APR_ARRAY_PUSH(rep_parts, svn_fs_fs__p2l_entry_t *) = rep_part;
-    }
-
-  /* collect depending reps and noderevs that reference any of the collected
-   * reps */
-  for (i = 0; i < rep_parts->nelts; ++i)
-    {
-      rep_part = APR_ARRAY_IDX(rep_parts, i, svn_fs_fs__p2l_entry_t*);
-      for (k = find_first_reference(context, rep_part);
-           is_reference_match(context, k, rep_part);
-           ++k)
-        {
-          reference_t *reference
-            = APR_ARRAY_IDX(context->references, k, reference_t *);
-
-          depending = get_item(context, &reference->from, TRUE);
-          if (!depending)
-            continue;
-
-          if (depending->type == SVN_FS_FS__ITEM_TYPE_NODEREV)
-            APR_ARRAY_PUSH(node_parts, svn_fs_fs__p2l_entry_t *) = depending;
-          else
-            APR_ARRAY_PUSH(rep_parts, svn_fs_fs__p2l_entry_t *) = depending;
-        }
-    }
-
-  return SVN_NO_ERROR;
-}
-
 /* Copy (append) the items identified by svn_fs_fs__p2l_entry_t * elements
  * in ENTRIES strictly in order from TEMP_FILE into CONTEXT->PACK_FILE.
  * Use POOL for temporary allocations.
@@ -1025,28 +1180,29 @@ copy_reps_from_temp(pack_context_t *cont
 {
   apr_pool_t *iterpool = svn_pool_create(pool);
   apr_array_header_t *path_order = context->path_order;
-  apr_array_header_t *node_parts = apr_array_make(pool, 16, sizeof(void*));
-  apr_array_header_t *rep_parts = apr_array_make(pool, 16, sizeof(void*));
+  apr_array_header_t *parts = apr_array_make(pool, 16, sizeof(void*));
   int i;
 
-  /* copy items in path order. Create block-sized containers. */
+  /* copy items in path order. */
   for (i = 0; i < path_order->nelts; ++i)
     {
-      if (APR_ARRAY_IDX(path_order, i, path_order_t *) == NULL)
-        continue;
+      path_order_t *current_path;
+      svn_fs_fs__p2l_entry_t *node_part;
+      svn_fs_fs__p2l_entry_t *rep_part;
 
       svn_pool_clear(iterpool);
 
-      /* Collect reps to combine and all noderevs referencing them */
-      SVN_ERR(select_reps(context, i, node_parts, rep_parts));
+      current_path = APR_ARRAY_IDX(path_order, i, path_order_t *);
+      node_part = get_item(context, &current_path->noderev_id, TRUE);
+      rep_part = get_item(context, &current_path->rep_id, TRUE);
+
+      if (node_part)
+        SVN_ERR(store_item(context, temp_file, node_part, iterpool));
+      if (rep_part)
+        SVN_ERR(store_item(context, temp_file, rep_part, iterpool));
 
-      /* store the noderevs container in front of the reps */
-      SVN_ERR(store_items(context, temp_file, node_parts, iterpool));
-      SVN_ERR(store_items(context, temp_file, rep_parts, iterpool));
-      
       /* processed all items */
-      apr_array_clear(node_parts);
-      apr_array_clear(rep_parts);
+      apr_array_clear(parts);
     }
 
   svn_pool_destroy(iterpool);



Re: svn commit: r1555109 - /subversion/trunk/subversion/libsvn_fs_fs/pack.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
>> Now, in order to distinguish different clusters, parts (1) and (2) remember
>> (like, "lock on") the (PATH, REP_ID) for the first noderev from the current
>> cluster.   Iff the PATH changes (this means we have just entered the next
>> cluster), we update the stored (PATH, REP_ID) values.  We need to remember
>> the REP_ID in order to reconstruct those delta chains in (2), because they
>> start from this REP_ID.
>>
>> As far as I can tell, the problem lies in the way we handle the first path.
>> For any non-empty path set, the first path always marks the beginning of the
>> corresponding cluster.  However, we only remember the PATH part of the (PATH,
>> REP_ID) pair for the first path and omit the REP_ID.  In case when the path
>> set consists of *a single cluster* (i.e. all path_order_t objects share the
>> same PATH), phase (2) will end up in an attempt to reconstruct the delta
>> chain starting from the uninitialized REP_ID.  There are scenarios that lead
>> to this behavior in fs-fs-pack-test (see create_packed_filesystem() with
>> multiple /iota content tweaks falling into the same cluster) and this can
>> easily happen with real-world repositories.
>
> Well, chances are that PATH mismatches for the first
> non-NULL iteration in (2), which gets REP_ID set.
> However, if there is only one cluster with more than
> a single change in the pack file, the path PATH does
> match and REP_ID will have some random value.

Aha, for some reason I assumed that you need to carry the REP_ID from
(1) to (2) in order to be able to reconstruct the delta chain for this single
cluster (i.e. single path) scenario.  That's clearly incorrect and now I
understand it.

> I committed a fix in r1572512. It removes the unnecessary
> write to REP_ID during (1) and makes sure that (PATH, REP_ID)
> get initialized to the first *relevant* entry at the being of (2).

Thank you for fixing this.


Regards,
Evgeny Kotkov

Re: svn commit: r1555109 - /subversion/trunk/subversion/libsvn_fs_fs/pack.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Thu, Feb 27, 2014 at 1:53 AM, Evgeny Kotkov
<ev...@visualsvn.com>wrote:

> Hi Stefan,
>
> > Author: stefan2
> > Date: Fri Jan  3 14:34:41 2014
> > New Revision: 1555109
> >
> > URL: http://svn.apache.org/r1555109
> > Log:
> > Refine the FSFS f7 pack heuristics away from the container-centric
> > order used by FSX.  The changes only affect the order of text reps
> > and noderevs; changed paths lists and properties remain as are.
>
> It looks like there is a problem is the implementation of the reps sorting
> algorithm.  The sort_reps_range() function (which essentially is the core
> of
> the new heuristics) can access uninitialized memory under certain
> circumstances.
> There is a bunch of valgrind warnings for trunk@1572250 ...
> [[[
>   (valgrind fs-fs-pack-test)
>
>   Conditional jump or move depends on uninitialised value(s)
>      at 0x42C9BF: svn_fs_fs__id_part_eq (id.c:145)
>      by 0x43505B: sort_reps_range (pack.c:932)
>      by 0x4340CD: sort_reps (pack.c:1007)
>      by 0x432B38: pack_range (pack.c:1391)
>      by 0x4319BF: pack_log_addressed (pack.c:1570)
>      by 0x43151B: pack_rev_shard (pack.c:1758)
>      by 0x431023: pack_shard (pack.c:1817)
>      by 0x430E33: pack_body (pack.c:1958)
>      by 0x425DAF: with_some_lock_file (fs_fs.c:185)
>      by 0x425C21: svn_fs_fs__with_write_lock (fs_fs.c:203)
>      by 0x430B15: svn_fs_fs__pack (pack.c:1986)
>      by 0x425491: fs_pack (fs.c:364)
> ]]]
>
> ...and this behavior can also be witnessed in some real-world scenarios.
> I patched this function to make it crash upon the uninitialized variable
> usage
> and received 11/14 coredumps in the fs-fs-pack-test and a 100%-reproducible
> coredump when packing the svnrdump'ed
> http://googletest.googlecode.com/svn/
> repository with "layout sharded 5":



> ...
>

Well, the good thing is that this would never lead to a
broken repo but only affect the item placement order
within a pack file. It's still a bug that needs fixing.

The algorithm reorders the noderevs in three scans (please correct me if I
> misunderstood some details).  The first scan (1) bubbles up a single
> noderev
> from every cluster, which is likely to be referenced from the following
> packs.
> Cluster is a term that refers to a sequential noderev range, where every
> element shares the same path.  After that, the second scan (2) bubbles up
> reconstructible delta chains per every cluster.  The last scan (3) is the
> least interesting one, because it simply grabs everything what was left
> after
> (1) and (2).
>

Correct. And that catch-all scan (3) makes the whole
heuristics robust against a wide range of bugs.


> Now, in order to distinguish different clusters, parts (1) and (2) remember
> (like, "lock on") the (PATH, REP_ID) for the first noderev from the current
> cluster.   Iff the PATH changes (this means we have just entered the next
> cluster), we update the stored (PATH, REP_ID) values.  We need to remember
> the
> REP_ID in order to reconstruct those delta chains in (2), because they
> start
> from this REP_ID.
>
> As far as I can tell, the problem lies in the way we handle the first path.
> For any non-empty path set, the first path always marks the beginning of
> the
> corresponding cluster.  However, we only remember the PATH part of the
> (PATH,
> REP_ID) pair for the first path and omit the REP_ID.  In case when the
> path set
> consists of *a single cluster* (i.e. all path_order_t objects share the
> same
> PATH), phase (2) will end up in an attempt to reconstruct the delta chain
> starting from the uninitialized REP_ID.  There are scenarios that lead to
> this
> behavior in fs-fs-pack-test (see create_packed_filesystem() with multiple
> /iota
> content tweaks falling into the same cluster) and this can easily happen
> with
> real-world repositories.
>

Well, chances are that PATH mismatches for the first
non-NULL iteration in (2), which gets REP_ID set.
However, if there is only one cluster with more than
a single change in the pack file, the path PATH does
match and REP_ID will have some random value.


> So, what do you think about fixing it this way?
> [[[
> Index: subversion/libsvn_fs_fs/pack.c
> ===================================================================
> --- subversion/libsvn_fs_fs/pack.c  (revision 1572250)
> +++ subversion/libsvn_fs_fs/pack.c  (working copy)
> @@ -865,6 +865,7 @@ sort_reps_range(pack_context_t *context,
>     */
>    dest = first;
>    path = path_order[first]->path;
> +  rep_id = path_order[first]->rep_id;
>    best = first;
>
>    /* (1) For each path, pick the "roundest" representation and put it in
> ]]]
>

I committed a fix in r1572512. It removes the unnecessary
write to REP_ID during (1) and makes sure that (PATH, REP_ID)
get initialized to the first *relevant* entry at the being of (2).


> Just to mention, even if this change looks good, I won't be able to commit
> it,
> because I am still waiting for my Apache account to be created.  The same
> is
> true for those patches from
> http://svn.haxx.se/dev/archive-2014-02/0280.shtml
> and http://svn.haxx.se/dev/archive-2014-02/0279.shtml
> Hopefully, this process won't take too long.
>

No worries. I'm currently on a working holiday myself
enjoying the African sun.

-- Stefan^2.

Re: svn commit: r1555109 - /subversion/trunk/subversion/libsvn_fs_fs/pack.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Hi Stefan,

> Author: stefan2
> Date: Fri Jan  3 14:34:41 2014
> New Revision: 1555109
>
> URL: http://svn.apache.org/r1555109
> Log:
> Refine the FSFS f7 pack heuristics away from the container-centric
> order used by FSX.  The changes only affect the order of text reps
> and noderevs; changed paths lists and properties remain as are.

It looks like there is a problem is the implementation of the reps sorting
algorithm.  The sort_reps_range() function (which essentially is the core of
the new heuristics) can access uninitialized memory under certain circumstances.
There is a bunch of valgrind warnings for trunk@1572250 ...
[[[
  (valgrind fs-fs-pack-test)

  Conditional jump or move depends on uninitialised value(s)
     at 0x42C9BF: svn_fs_fs__id_part_eq (id.c:145)
     by 0x43505B: sort_reps_range (pack.c:932)
     by 0x4340CD: sort_reps (pack.c:1007)
     by 0x432B38: pack_range (pack.c:1391)
     by 0x4319BF: pack_log_addressed (pack.c:1570)
     by 0x43151B: pack_rev_shard (pack.c:1758)
     by 0x431023: pack_shard (pack.c:1817)
     by 0x430E33: pack_body (pack.c:1958)
     by 0x425DAF: with_some_lock_file (fs_fs.c:185)
     by 0x425C21: svn_fs_fs__with_write_lock (fs_fs.c:203)
     by 0x430B15: svn_fs_fs__pack (pack.c:1986)
     by 0x425491: fs_pack (fs.c:364)
]]]

...and this behavior can also be witnessed in some real-world scenarios.
I patched this function to make it crash upon the uninitialized variable usage
and received 11/14 coredumps in the fs-fs-pack-test and a 100%-reproducible
coredump when packing the svnrdump'ed http://googletest.googlecode.com/svn/
repository with "layout sharded 5":
[[[
Index: subversion/libsvn_fs_fs/pack.c
===================================================================
--- subversion/libsvn_fs_fs/pack.c  (revision 1572250)
+++ subversion/libsvn_fs_fs/pack.c  (working copy)
@@ -849,6 +849,7 @@ sort_reps_range(pack_context_t *context,
   const svn_prefix_string__t *path;
   int i, dest, best;
   svn_fs_fs__id_part_t rep_id;
+  svn_boolean_t rep_id_initialized = FALSE;
   fs_fs_data_t *ffd = context->fs->fsap_data;

   /* The logic below would fail for empty ranges. */
@@ -885,6 +886,7 @@ sort_reps_range(pack_context_t *context,
           /* next path */
           path = path_order[i]->path;
           rep_id = path_order[i]->rep_id;
+          rep_id_initialized = TRUE;

           /* Pick roundest non-linear deltified node. */
           if (roundness(path_order[best]->predecessor_count)
@@ -926,8 +928,11 @@ sort_reps_range(pack_context_t *context,
           {
             path = path_order[i]->path;
             rep_id = path_order[i]->rep_id;
+            rep_id_initialized = TRUE;
           }

+        SVN_ERR_ASSERT_NO_RETURN(rep_id_initialized);
+
         /* Pick nodes along the deltification chain.  Skip side-branches. */
         if (svn_fs_fs__id_part_eq(&path_order[i]->rep_id, &rep_id))
           {
]]]

The algorithm reorders the noderevs in three scans (please correct me if I
misunderstood some details).  The first scan (1) bubbles up a single noderev
from every cluster, which is likely to be referenced from the following packs.
Cluster is a term that refers to a sequential noderev range, where every
element shares the same path.  After that, the second scan (2) bubbles up
reconstructible delta chains per every cluster.  The last scan (3) is the
least interesting one, because it simply grabs everything what was left after
(1) and (2).

Now, in order to distinguish different clusters, parts (1) and (2) remember
(like, "lock on") the (PATH, REP_ID) for the first noderev from the current
cluster.   Iff the PATH changes (this means we have just entered the next
cluster), we update the stored (PATH, REP_ID) values.  We need to remember the
REP_ID in order to reconstruct those delta chains in (2), because they start
from this REP_ID.

As far as I can tell, the problem lies in the way we handle the first path.
For any non-empty path set, the first path always marks the beginning of the
corresponding cluster.  However, we only remember the PATH part of the (PATH,
REP_ID) pair for the first path and omit the REP_ID.  In case when the path set
consists of *a single cluster* (i.e. all path_order_t objects share the same
PATH), phase (2) will end up in an attempt to reconstruct the delta chain
starting from the uninitialized REP_ID.  There are scenarios that lead to this
behavior in fs-fs-pack-test (see create_packed_filesystem() with multiple /iota
content tweaks falling into the same cluster) and this can easily happen with
real-world repositories.

So, what do you think about fixing it this way?
[[[
Index: subversion/libsvn_fs_fs/pack.c
===================================================================
--- subversion/libsvn_fs_fs/pack.c  (revision 1572250)
+++ subversion/libsvn_fs_fs/pack.c  (working copy)
@@ -865,6 +865,7 @@ sort_reps_range(pack_context_t *context,
    */
   dest = first;
   path = path_order[first]->path;
+  rep_id = path_order[first]->rep_id;
   best = first;

   /* (1) For each path, pick the "roundest" representation and put it in
]]]

Just to mention, even if this change looks good, I won't be able to commit it,
because I am still waiting for my Apache account to be created.  The same is
true for those patches from http://svn.haxx.se/dev/archive-2014-02/0280.shtml
and http://svn.haxx.se/dev/archive-2014-02/0279.shtml
Hopefully, this process won't take too long.


Regards,
Evgeny Kotkov