You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Daniel Shahaf <d....@daniel.shahaf.name> on 2010/09/28 14:11:03 UTC

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

> Index: subversion/include/svn_diff.h
> ===================================================================
> --- subversion/include/svn_diff.h	(revision 1001548)
> +++ subversion/include/svn_diff.h	(working copy)
> @@ -112,6 +112,11 @@
(personally I prefer 'svn diff -x-p' to show the function name here)
>    svn_error_t *(*datasource_open)(void *diff_baton,
>                                    svn_diff_datasource_e datasource);
>  
> +  /** Open the datasources of type @a datasources. */
> +  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
> +                                   svn_diff_datasource_e datasource0,
> +                                   svn_diff_datasource_e datasource1);
> +

So, you're extending the svn_diff_fns_t struct, which is defined in
a public header.

It's a public struct with no constructor function, so I believe you have
to revv it (into svn_diff_fns2_t) in order to extend it (for binary
compatibility: people allocating this struct and then using a newer
Subversion library at runtime).

If it did have a constructor function, you'd have to extend it only at
the end, and even then make sure that if the added elements are NULL (eg
because an old caller didn't know to fill them) then everything still
worked.

Probably more important to get the logic right than to revv it right
away, though; the latter is a SMOP.

>    /** Close the datasource of type @a datasource. */
>    svn_error_t *(*datasource_close)(void *diff_baton,
>                                     svn_diff_datasource_e datasource);
> @@ -124,6 +129,9 @@
>                                              void *diff_baton,
>                                              svn_diff_datasource_e datasource);
>  
> +  /** Get the number of identical prefix lines from the @a diff_baton. */
> +  apr_off_t (*get_prefix_lines)(void *diff_baton);
> +
>    /** A function for ordering the tokens, resembling 'strcmp' in functionality.
>     * @a compare should contain the return value of the comparison:
>     * If @a ltoken and @a rtoken are "equal", return 0.  If @a ltoken is
> Index: subversion/libsvn_diff/diff_memory.c
> ===================================================================
> --- subversion/libsvn_diff/diff_memory.c	(revision 1001548)
> +++ subversion/libsvn_diff/diff_memory.c	(working copy)
> @@ -95,7 +95,23 @@
>    return SVN_NO_ERROR;
>  }
>  
> +/* Implements svn_diff_fns_t::datasources_open */
> +static svn_error_t *
> +datasources_open(void *baton, apr_off_t *prefix_lines,
> +                 svn_diff_datasource_e datasource0, 
> +                 svn_diff_datasource_e datasource1)
> +{
> +  /* Do nothing: everything is already there and initialized to 0 */
> +  return SVN_NO_ERROR;
> +}
>  
> +/* Implements svn_diff_fns_t::datasource_get_prefix_lines */
> +static apr_off_t
> +get_prefix_lines(void *baton)
> +{
> +  return 0;
> +}
> +
>  /* Implements svn_diff_fns_t::datasource_close */
>  static svn_error_t *
>  datasource_close(void *baton, svn_diff_datasource_e datasource)
> @@ -189,8 +205,10 @@
>  static const svn_diff_fns_t svn_diff__mem_vtable =
>  {
>    datasource_open,
> +  datasources_open,
>    datasource_close,
>    datasource_get_next_token,
> +  get_prefix_lines,
>    token_compare,
>    token_discard,
>    token_discard_all
> Index: subversion/libsvn_diff/diff_file.c
> ===================================================================
> --- subversion/libsvn_diff/diff_file.c	(revision 1001548)
> +++ subversion/libsvn_diff/diff_file.c	(working copy)
> @@ -77,6 +77,10 @@
>    char *curp[4];
>    char *endp[4];
>  
> +  apr_off_t prefix_lines;
> +  int suffix_start_chunk[4];
> +  apr_off_t suffix_offset_in_chunk[4];
> +
>    /* List of free tokens that may be reused. */
>    svn_diff__file_token_t *tokens;
>  
> @@ -233,7 +237,330 @@
>                      curp, length, 0, file_baton->pool);
>  }
>  
> +static svn_error_t *
> +increment_pointer_or_chunk(svn_diff__file_baton_t *file_baton,
> +                           char **curp, char **endp, int *chunk_number,
> +                           char *buffer, apr_off_t last_chunk_number, int idx)
> +{
> +  apr_off_t length;
>  
> +  if ((*curp) == (*endp) - 1)
> +    {
> +      if (*chunk_number == last_chunk_number)
> +        (*curp)++; /* *curp == *endp with last chunk signals end of file */
> +      else
> +        {
> +          (*chunk_number)++;
> +          length = *chunk_number == last_chunk_number ?
> +            offset_in_chunk(file_baton->size[idx]) : CHUNK_SIZE;
> +          SVN_ERR(read_chunk(file_baton->file[idx],
> +                             file_baton->path[idx],
> +                             buffer, length,
> +                             chunk_to_offset(*chunk_number),
> +                             file_baton->pool));
> +          *endp = buffer + length;
> +          *curp = buffer;
> +        }
> +    }
> +  else
> +    {
> +      (*curp)++;
> +    }
> +
> +  return SVN_NO_ERROR;
> +}
> +
> +static svn_error_t *
> +decrement_pointer_or_chunk(svn_diff__file_baton_t *file_baton,
> +                           char **curp, char **endp, int *chunk_number,
> +                           char *buffer, int idx)
> +{
> +  if (*curp == buffer)
> +    {
> +      if (*chunk_number == 0)
> +        (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */
> +      else
> +        {
> +          (*chunk_number)--;
> +          SVN_ERR(read_chunk(file_baton->file[idx],
> +                             file_baton->path[idx],
> +                             buffer, CHUNK_SIZE,
> +                             chunk_to_offset(*chunk_number),
> +                             file_baton->pool));
> +          *endp = buffer + CHUNK_SIZE;
> +          *curp = *endp - 1;
> +        }
> +    }
> +  else
> +    {
> +      (*curp)--;
> +    }
> +
> +  return SVN_NO_ERROR;
> +}
> +
> +/* Find the identical prefix for idx0 and idx1, counting number of lines.
> + * After this function is finished, the buffers, chunks, curp's and endp's 
> + * of the file_baton are set to point at the first byte after the prefix. */
> +static svn_error_t *
> +find_identical_prefix(svn_diff__file_baton_t *file_baton,
> +                      svn_boolean_t *at_least_one_end_reached,
> +                      int idx0, int idx1)
> +{
> +  apr_off_t last_chunk0, last_chunk1;
> +
> +  last_chunk0 = offset_to_chunk(file_baton->size[idx0]);
> +  last_chunk1 = offset_to_chunk(file_baton->size[idx1]);
> +  *at_least_one_end_reached = FALSE;
> +
> +  file_baton->prefix_lines = 0;
> +  while (*file_baton->curp[idx0] == *file_baton->curp[idx1] 
> +         && !*at_least_one_end_reached)
> +    {
> +      /* ### This will only work for \n and \r\n, not for \r */
> +      if (*file_baton->curp[idx0] == '\n')
> +        file_baton->prefix_lines++;
> +      SVN_ERR(increment_pointer_or_chunk(file_baton,
> +                                         &file_baton->curp[idx0],
> +                                         &file_baton->endp[idx0], 
> +                                         &file_baton->chunk[idx0],
> +                                         file_baton->buffer[idx0],
> +                                         last_chunk0, idx0));
> +      SVN_ERR(increment_pointer_or_chunk(file_baton,
> +                                         &file_baton->curp[idx1],
> +                                         &file_baton->endp[idx1],
> +                                         &file_baton->chunk[idx1],
> +                                         file_baton->buffer[idx1],
> +                                         last_chunk1, idx1));
> +      *at_least_one_end_reached = 
> +        file_baton->curp[idx0] == file_baton->endp[idx0] 
> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
> +    }
> +
> +  /* If both files reached their end (i.e. are fully identical), we're done */
> +  if (file_baton->curp[idx0] == file_baton->endp[idx0] 
> +        && file_baton->curp[idx1] == file_baton->endp[idx1])
> +    return SVN_NO_ERROR;
> +
> +  /* Back up to the last newline */
> +  do
> +    {
> +      SVN_ERR(decrement_pointer_or_chunk(file_baton,
> +                                         &file_baton->curp[idx0],
> +                                         &file_baton->endp[idx0],
> +                                         &file_baton->chunk[idx0], 
> +                                         file_baton->buffer[idx0],
> +                                         idx0));
> +      SVN_ERR(decrement_pointer_or_chunk(file_baton,
> +                                         &file_baton->curp[idx1],
> +                                         &file_baton->endp[idx1],
> +                                         &file_baton->chunk[idx1], 
> +                                         file_baton->buffer[idx1],
> +                                         idx1));
> +    } while (*file_baton->curp[idx0] != '\n' 
> +             && file_baton->chunk[idx0] != -1 
> +             && file_baton->chunk[idx1] != -1);
> +
> +  /* slide one byte forward, to point past the \n */
> +  if (file_baton->chunk[idx0] == -1)
> +    file_baton->chunk[idx0] = 0; /* point to beginning of file again */
> +  else
> +    SVN_ERR(increment_pointer_or_chunk(file_baton,
> +                                       &file_baton->curp[idx0],
> +                                       &file_baton->endp[idx0],
> +                                       &file_baton->chunk[idx0],
> +                                       file_baton->buffer[idx0],
> +                                       last_chunk0, idx0));
> +  if (file_baton->chunk[idx1] == -1)
> +    file_baton->chunk[idx1] = 0; /* point to beginning of file again */
> +  else
> +    SVN_ERR(increment_pointer_or_chunk(file_baton,
> +                                       &file_baton->curp[idx1],
> +                                       &file_baton->endp[idx1],
> +                                       &file_baton->chunk[idx1],
> +                                       file_baton->buffer[idx1],
> +                                       last_chunk1, idx1));
> +
> +  return SVN_NO_ERROR;
> +}
> +
> +/* Find the identical suffix for idx0 and idx1. Before this function is called
> + * the file_baton's curp's and chunks should be positioned right after the 
> + * identical prefix (which is the case after find_identical_prefix),
> + * so we can determine where suffix scanning should ultimately stop. */
> +static svn_error_t *
> +find_identical_suffix(svn_diff__file_baton_t *file_baton,
> +                      int idx0, int idx1)
> +{
> +  char *suffix_buffer0, *suffix_buffer1;
> +  int suffix_chunk0, suffix_chunk1;
> +  apr_off_t length0, length1;
> +  apr_off_t last_chunk0, last_chunk1;
> +  apr_off_t suffix_min_offset0;
> +  apr_off_t suffix_min_chunk0;
> +  char *curp0, *curp1;
> +  char *endp0, *endp1;
> +
> +  last_chunk0 = offset_to_chunk(file_baton->size[idx0]);
> +  last_chunk1 = offset_to_chunk(file_baton->size[idx1]);
> +
> +  /* position everything at last chunk, pointer to last byte */
> +  suffix_buffer0 = apr_palloc(file_baton->pool, 
> +    (apr_size_t) (file_baton->size[idx0] > CHUNK_SIZE ? 
> +                   CHUNK_SIZE : file_baton->size[idx0]));
> +  suffix_chunk0 = last_chunk0;
> +  length0 = file_baton->size[idx0] % CHUNK_SIZE;
> +  SVN_ERR(read_chunk(file_baton->file[idx0], file_baton->path[idx0],
> +                     suffix_buffer0, length0,
> +                     chunk_to_offset(suffix_chunk0),
> +                     file_baton->pool));
> +  endp0 = suffix_buffer0 + length0;
> +  curp0 = endp0 - 1;
> +
> +  suffix_buffer1 = apr_palloc(file_baton->pool, 
> +    (apr_size_t) (file_baton->size[idx1] > CHUNK_SIZE ?
> +                   CHUNK_SIZE : file_baton->size[idx1]));
> +  suffix_chunk1 = last_chunk1;
> +  length1 = file_baton->size[idx1] % CHUNK_SIZE;
> +  SVN_ERR(read_chunk(file_baton->file[idx1], file_baton->path[idx1],
> +                     suffix_buffer1, length1,
> +                     chunk_to_offset(suffix_chunk1),
> +                     file_baton->pool));
> +  endp1 = suffix_buffer1 + length1;
> +  curp1 = endp1 - 1;
> +
> +  /* Get the chunk and pointer offset at which we should stop scanning 
> +   * backward for the identical suffix. This is just past the prefix. */
> +  suffix_min_chunk0 = file_baton->chunk[idx0];
> +  suffix_min_offset0 = file_baton->curp[idx0] - file_baton->buffer[idx0];
> +  if (file_baton->size[idx0] > file_baton->size[idx1])
> +    {
> +      suffix_min_chunk0 += 
> +        (file_baton->size[idx0] - file_baton->size[idx1]) / CHUNK_SIZE;
> +      suffix_min_offset0 += 
> +        (file_baton->size[idx0] - file_baton->size[idx1]) % CHUNK_SIZE;
> +    }
> +
> +  /* Scan backwards until mismatch or until we are where the prefix ended */
> +  while (*curp0 == *curp1 && suffix_chunk0 != -1 && suffix_chunk1 != -1
> +         && !(suffix_chunk0 == suffix_min_chunk0 
> +              && (curp0 - suffix_buffer0) == suffix_min_offset0))
> +    {
> +      SVN_ERR(decrement_pointer_or_chunk(file_baton, &curp0, &endp0,
> +                                         &suffix_chunk0, suffix_buffer0,
> +                                         idx0));
> +      SVN_ERR(decrement_pointer_or_chunk(file_baton, &curp1, &endp1,
> +                                         &suffix_chunk1, suffix_buffer1,
> +                                         idx1));
> +    }
> +
> +  /* slide one byte forward, to point at the first byte of common suffix */
> +  if (suffix_chunk0 == -1)
> +    suffix_chunk0 = 0; /* point to beginning of file again */
> +  else
> +    SVN_ERR(increment_pointer_or_chunk(file_baton, &curp0, &endp0, 
> +                                       &suffix_chunk0, suffix_buffer0,
> +                                       last_chunk0, idx0));
> +  if (suffix_chunk1 == -1)
> +    suffix_chunk1 = 0; /* point to beginning of file again */
> +  else
> +    SVN_ERR(increment_pointer_or_chunk(file_baton, &curp1, &endp1, 
> +                                       &suffix_chunk1, suffix_buffer1,
> +                                       last_chunk1, idx1));
> +
> +  /* skip to just after a newline */
> +  while (*curp0 != '\n' && !(curp0 == endp0 || curp1 == endp1))
> +    {
> +      SVN_ERR(increment_pointer_or_chunk(file_baton, &curp0, &endp0, 
> +                                         &suffix_chunk0, suffix_buffer0,
> +                                         last_chunk0, idx0));
> +      SVN_ERR(increment_pointer_or_chunk(file_baton, &curp1, &endp1, 
> +                                         &suffix_chunk1, suffix_buffer1,
> +                                         last_chunk1, idx1));
> +    }
> +
> +  /* slide one more byte, to point past the \n */
> +  SVN_ERR(increment_pointer_or_chunk(file_baton, &curp0, &endp0,
> +                                     &suffix_chunk0, suffix_buffer0,
> +                                     last_chunk0, idx0));
> +  SVN_ERR(increment_pointer_or_chunk(file_baton, &curp1, &endp1,
> +                                     &suffix_chunk1, suffix_buffer1,
> +                                     last_chunk1, idx1));
> +
> +  file_baton->suffix_start_chunk[idx0] = suffix_chunk0;
> +  file_baton->suffix_start_chunk[idx1] = suffix_chunk1;
> +  file_baton->suffix_offset_in_chunk[idx0] = curp0 - suffix_buffer0;
> +  file_baton->suffix_offset_in_chunk[idx1] = curp1 - suffix_buffer1;
> +
> +  return SVN_NO_ERROR;
> +}
> +
> +/* Implements svn_diff_fns_t::datasource_open */
> +static svn_error_t *
> +datasources_open(void *baton, apr_off_t *prefix_lines,
> +                 svn_diff_datasource_e datasource0, 
> +                 svn_diff_datasource_e datasource1)
> +{
> +  svn_diff__file_baton_t *file_baton = baton;
> +  int idx0, idx1;
> +  apr_finfo_t finfo0, finfo1;
> +  apr_off_t length0, length1;
> +  svn_boolean_t at_least_one_end_reached;
> +
> +  /* Open datasource0 and read first chunk */
> +  idx0 = datasource_to_index(datasource0);
> +  SVN_ERR(svn_io_file_open(&file_baton->file[idx0], file_baton->path[idx0],
> +                           APR_READ, APR_OS_DEFAULT, file_baton->pool));
> +  SVN_ERR(svn_io_file_info_get(&finfo0, APR_FINFO_SIZE,
> +                               file_baton->file[idx0], file_baton->pool));
> +  file_baton->size[idx0] = finfo0.size;
> +  length0 = (apr_off_t) (finfo0.size > CHUNK_SIZE ? CHUNK_SIZE : finfo0.size);
> +  file_baton->buffer[idx0] = apr_palloc(file_baton->pool, (apr_size_t) length0);
> +  SVN_ERR(read_chunk(file_baton->file[idx0], file_baton->path[idx0],
> +                     file_baton->buffer[idx0], length0, 0, file_baton->pool));
> +  file_baton->endp[idx0] = file_baton->buffer[idx0] + length0;
> +  file_baton->curp[idx0] = file_baton->buffer[idx0];
> +
> +  /* Open datasource1 and read first chunk */
> +  idx1 = datasource_to_index(datasource1);
> +  SVN_ERR(svn_io_file_open(&file_baton->file[idx1], file_baton->path[idx1],
> +                           APR_READ, APR_OS_DEFAULT, file_baton->pool));
> +  SVN_ERR(svn_io_file_info_get(&finfo1, APR_FINFO_SIZE,
> +                               file_baton->file[idx1], file_baton->pool));
> +  file_baton->size[idx1] = finfo1.size;
> +  length1 = (apr_off_t) (finfo1.size > CHUNK_SIZE ? CHUNK_SIZE : finfo1.size);
> +  file_baton->buffer[idx1] = apr_palloc(file_baton->pool, (apr_size_t) length1);
> +  SVN_ERR(read_chunk(file_baton->file[idx1], file_baton->path[idx1],
> +                     file_baton->buffer[idx1], length1, 0, file_baton->pool));
> +  file_baton->endp[idx1] = file_baton->buffer[idx1] + length1;
> +  file_baton->curp[idx1] = file_baton->buffer[idx1];
> +
> +  if (length0 == 0 || length1 == 0)
> +    /* there will not be any identical prefix/suffix, so we're done */
> +    return SVN_NO_ERROR;
> +
> +  SVN_ERR(find_identical_prefix(file_baton, &at_least_one_end_reached,
> +                                idx0, idx1));
> +  *prefix_lines = file_baton->prefix_lines;
> +
> +  if (at_least_one_end_reached)
> +    /* at least one file consisted totally of identical prefix, 
> +     * so there will be no identical suffix. We're done. */
> +    return SVN_NO_ERROR;
> +
> +  SVN_ERR(find_identical_suffix(file_baton, idx0, idx1));
> +
> +  return SVN_NO_ERROR;
> +}
> +
> +static apr_off_t
> +get_prefix_lines(void *baton)
> +{
> +  svn_diff__file_baton_t *file_baton = baton;
> +
> +  return file_baton->prefix_lines;
> +}
> +
>  /* Implements svn_diff_fns_t::datasource_close */
>  static svn_error_t *
>  datasource_close(void *baton, svn_diff_datasource_e datasource)
> @@ -277,6 +604,11 @@
>        return SVN_NO_ERROR;
>      }
>  
> +  if (file_baton->suffix_start_chunk[idx] || file_baton->suffix_offset_in_chunk[idx])
> +    if (file_baton->chunk[idx] == file_baton->suffix_start_chunk[idx]
> +        && (curp - file_baton->buffer[idx]) == file_baton->suffix_offset_in_chunk[idx])
> +      return SVN_NO_ERROR;
> +
>    /* Get a new token */
>    file_token = file_baton->tokens;
>    if (file_token)
> @@ -526,8 +858,10 @@
>  static const svn_diff_fns_t svn_diff__file_vtable =
>  {
>    datasource_open,
> +  datasources_open,
>    datasource_close,
>    datasource_get_next_token,
> +  get_prefix_lines,
>    token_compare,
>    token_discard,
>    token_discard_all
> Index: subversion/libsvn_diff/diff.h
> ===================================================================
> --- subversion/libsvn_diff/diff.h	(revision 1001548)
> +++ subversion/libsvn_diff/diff.h	(working copy)
> @@ -111,6 +111,7 @@
>                       void *diff_baton,
>                       const svn_diff_fns_t *vtable,
>                       svn_diff_datasource_e datasource,
> +                     svn_boolean_t datasource_opened,
>                       apr_pool_t *pool);
>  
>  
> Index: subversion/libsvn_diff/token.c
> ===================================================================
> --- subversion/libsvn_diff/token.c	(revision 1001548)
> +++ subversion/libsvn_diff/token.c	(working copy)
> @@ -139,6 +139,7 @@
>                       void *diff_baton,
>                       const svn_diff_fns_t *vtable,
>                       svn_diff_datasource_e datasource,
> +                     svn_boolean_t datasource_opened,
>                       apr_pool_t *pool)
>  {
>    svn_diff__position_t *start_position;
> @@ -152,10 +153,11 @@
>    *position_list = NULL;
>  
>  
> -  SVN_ERR(vtable->datasource_open(diff_baton, datasource));
> +  if (!datasource_opened)
> +    SVN_ERR(vtable->datasource_open(diff_baton, datasource));
>  
>    position_ref = &start_position;
> -  offset = 0;
> +  offset = vtable->get_prefix_lines(diff_baton);
>    hash = 0; /* The callback fn doesn't need to touch it per se */
>    while (1)
>      {
> Index: subversion/libsvn_diff/diff.c
> ===================================================================
> --- subversion/libsvn_diff/diff.c	(revision 1001548)
> +++ subversion/libsvn_diff/diff.c	(working copy)
> @@ -43,6 +43,22 @@
>    svn_diff_t *diff;
>    svn_diff_t **diff_ref = &diff;
>  
> +  if (want_common && (original_start > 1))
> +    {
> +      /* we have a prefix to skip */
> +      (*diff_ref) = apr_palloc(pool, sizeof(**diff_ref));
> +
> +      (*diff_ref)->type = svn_diff__type_common;
> +      (*diff_ref)->original_start = 0;
> +      (*diff_ref)->original_length = original_start - 1;
> +      (*diff_ref)->modified_start = 0;
> +      (*diff_ref)->modified_length = modified_start - 1;
> +      (*diff_ref)->latest_start = 0;
> +      (*diff_ref)->latest_length = 0;
> +
> +      diff_ref = &(*diff_ref)->next;
> +    }
> +
>    while (1)
>      {
>        if (original_start < lcs->position[0]->offset
> @@ -108,6 +124,7 @@
>    svn_diff__lcs_t *lcs;
>    apr_pool_t *subpool;
>    apr_pool_t *treepool;
> +  apr_off_t prefix_lines = 0;
>  
>    *diff = NULL;
>  
> @@ -116,17 +133,22 @@
>  
>    svn_diff__tree_create(&tree, treepool);
>  
> +  SVN_ERR(vtable->datasources_open(diff_baton, &prefix_lines,
> +    svn_diff_datasource_original, svn_diff_datasource_modified));
> +
>    /* Insert the data into the tree */
>    SVN_ERR(svn_diff__get_tokens(&position_list[0],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_original,
> +                               TRUE,
>                                 subpool));
>  
>    SVN_ERR(svn_diff__get_tokens(&position_list[1],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_modified,
> +                               TRUE,
>                                 subpool));
>  
>    /* The cool part is that we don't need the tokens anymore.
> @@ -142,7 +164,7 @@
>    lcs = svn_diff__lcs(position_list[0], position_list[1], subpool);
>  
>    /* Produce the diff */
> -  *diff = svn_diff__diff(lcs, 1, 1, TRUE, pool);
> +  *diff = svn_diff__diff(lcs, prefix_lines + 1, prefix_lines + 1, TRUE, pool);
>  
>    /* Get rid of all the data we don't have a use for anymore */
>    svn_pool_destroy(subpool);
> Index: subversion/libsvn_diff/diff3.c
> ===================================================================
> --- subversion/libsvn_diff/diff3.c	(revision 1001548)
> +++ subversion/libsvn_diff/diff3.c	(working copy)
> @@ -267,18 +267,21 @@
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_original,
> +                               FALSE,
>                                 subpool));
>  
>    SVN_ERR(svn_diff__get_tokens(&position_list[1],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_modified,
> +                               FALSE,
>                                 subpool));
>  
>    SVN_ERR(svn_diff__get_tokens(&position_list[2],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_latest,
> +                               FALSE,
>                                 subpool));
>  
>    /* Get rid of the tokens, we don't need them to calc the diff */
> Index: subversion/libsvn_diff/diff4.c
> ===================================================================
> --- subversion/libsvn_diff/diff4.c	(revision 1001548)
> +++ subversion/libsvn_diff/diff4.c	(working copy)
> @@ -194,24 +194,28 @@
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_original,
> +                               FALSE,
>                                 subpool2));
>  
>    SVN_ERR(svn_diff__get_tokens(&position_list[1],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_modified,
> +                               FALSE,
>                                 subpool));
>  
>    SVN_ERR(svn_diff__get_tokens(&position_list[2],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_latest,
> +                               FALSE,
>                                 subpool));
>  
>    SVN_ERR(svn_diff__get_tokens(&position_list[3],
>                                 tree,
>                                 diff_baton, vtable,
>                                 svn_diff_datasource_ancestor,
> +                               FALSE,
>                                 subpool2));
>  
>    /* Get rid of the tokens, we don't need them to calc the diff */

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sat, Oct 9, 2010 at 5:19 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
>> (side-note: I considered first doing suffix scanning, then prefix
>> scanning, so I could reuse the buffers/pointers from diff_baton all
>> the time, and still have everything pointing correctly after
>> eliminating prefix/suffix. But that could give vastly different
>> results in some cases, for instance when original file is entirely
>> identical to both the prefix and the suffix of the modified file. So I
>> decided it's best to stick with first prefix, then suffix).
>
> What Hyrum said.  How common /is/ this case?  And, anyway, in that case
> both "everything was appended" and "everything was prepended" are
> equally legitimate diffs.

Hm, I'm not sure about this one. I just wanted to try the maximum
reasonably possible to keep the results identical to what they were.
Using another buffer for suffix scanning didn't seem that big of a
deal (only slight increase in memory use (2 chunks of 128K in current
implementation)). I made that decision pretty early, before I knew of
the other problem of suffix scanning, and the keep-50-suffix-lines
compromise we decided upon.

There may be more subtle cases than the one I described, I don't know.
OTOH, now that we have the keep-50-suffix-lines, that may help also in
this case. I'll have to think about that. Maybe I can give it a go,
first suffix then prefix, and see if I can find real-life problems ...

-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
> (side-note: I considered first doing suffix scanning, then prefix
> scanning, so I could reuse the buffers/pointers from diff_baton all
> the time, and still have everything pointing correctly after
> eliminating prefix/suffix. But that could give vastly different
> results in some cases, for instance when original file is entirely
> identical to both the prefix and the suffix of the modified file. So I
> decided it's best to stick with first prefix, then suffix).

What Hyrum said.  How common /is/ this case?  And, anyway, in that case
both "everything was appended" and "everything was prepended" are
equally legitimate diffs.

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it -> that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.

Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened, and argument "prefix_lines", the number
   of identical prefix lines.and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments "datasource_opened" and
   "prefix_lines". Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__get_tokens, svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it -> that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.

Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened, and argument "prefix_lines", the number
   of identical prefix lines.and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments "datasource_opened" and
   "prefix_lines". Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__get_tokens, svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Tue, 2010-10-12, Johan Corveleyn wrote:
> On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> >> > sub-structures, and simplifies all the code that uses them.
> >> > [...]
> >> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> >> So please go ahead and commit the refactoring.
> >> >
> >> > OK, committed in r1021282.
> >>
> >> Thanks, looks much more manageable now.
> >
> > I'd like to see a simplified version of your last patch, taking
> > advantage of that, before you go exploring other options.
> 
> Yes, I'll certainly do that in the coming days.
> 
> >> >> Also, maybe the last_chunk number could be included in the file_info
> >> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> >> every time as an extra argument. Seems like the sort of info that
> >> >> could be part of file_info.
> >> >
> >> > It looks to me like you only need to calculate it in exactly one place:
> >> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> >> > for efficiency, as it's a trivial calculation.
> >>
> >> Hm, I don't know. That's recalculating it for every byte in the
> >> prefix. In the files I'm testing there could be 1 Mb or more of
> >
> > No, no, not every byte - only 1 in 131072 of the calls need to calculate
> > it - only when it reaches the end of a block.
> 
> Oh right. I guess it's ok then.
> 
> > May I ask whether you've tested the code that crosses from one chunk to
> > another?  I noticed the following, just now:
> >
> >> +      *at_least_one_end_reached =
> >> +        file_baton->curp[idx0] == file_baton->endp[idx0]
> >> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
> >> +    }
> >> +
> >> +  /* If both files reached their end (i.e. are fully identical),
> >> we're done */
> >> +  if (file_baton->curp[idx0] == file_baton->endp[idx0]
> >> +        && file_baton->curp[idx1] == file_baton->endp[idx1])
> >
> > Those tests seem to be saying "if we're at the end of the current chunk
> > then we're at the end of the file".  That looks wrong.  What do you
> > think?
> 
> Well, it does work :-). The code is correct, but it's kind of
> difficult to see, so that's my fault.
> 
> The reason is that increment_pointer_or_chunk takes care that curp
> never equals endp (if curp==endp-1 it reads the next chunk), unless it
> reaches end of file. See this snippet of increment_pointer_or_chunk:
> +      if (*chunk_number == last_chunk_number)
> +        (*curp)++; /* *curp == *endp with last chunk signals end of file */
> 
> If you think this is too obscure to handle this, I agree, but I
> couldn't think of a better/easier way at that time. If you have any
> suggestions, feel free :-). But maybe I should refactor the patch
> first, so it's easier to see how this would work out best.
> 
> The same "obscurity" also applies to reaching "beginning of file"
> (which can happen in decrement_pointer_or_chunk, either during prefix
> scanning (rolling back the first line) or during suffix scanning (if
> there was no prefix, but still a difference in the first line)). There
> it's done by setting the chunk number to -1, which is checked for in
> find_identical_prefix and find_identical_suffix:
> +      if (*chunk_number == 0)
> +        (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */
> 
> Not ideal, but it works ...

Oh OK, thanks for explaining.  No further comment, m'lord :-)

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Yes, I'll certainly do that in the coming days.

>> >> Also, maybe the last_chunk number could be included in the file_info
>> >> struct? Now it's calculated in several places: "last_chunk0 =
>> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
>> >> every time as an extra argument. Seems like the sort of info that
>> >> could be part of file_info.
>> >
>> > It looks to me like you only need to calculate it in exactly one place:
>> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
>> > for efficiency, as it's a trivial calculation.
>>
>> Hm, I don't know. That's recalculating it for every byte in the
>> prefix. In the files I'm testing there could be 1 Mb or more of
>
> No, no, not every byte - only 1 in 131072 of the calls need to calculate
> it - only when it reaches the end of a block.

Oh right. I guess it's ok then.

> May I ask whether you've tested the code that crosses from one chunk to
> another?  I noticed the following, just now:
>
>> +      *at_least_one_end_reached =
>> +        file_baton->curp[idx0] == file_baton->endp[idx0]
>> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
>> +    }
>> +
>> +  /* If both files reached their end (i.e. are fully identical),
>> we're done */
>> +  if (file_baton->curp[idx0] == file_baton->endp[idx0]
>> +        && file_baton->curp[idx1] == file_baton->endp[idx1])
>
> Those tests seem to be saying "if we're at the end of the current chunk
> then we're at the end of the file".  That looks wrong.  What do you
> think?

Well, it does work :-). The code is correct, but it's kind of
difficult to see, so that's my fault.

The reason is that increment_pointer_or_chunk takes care that curp
never equals endp (if curp==endp-1 it reads the next chunk), unless it
reaches end of file. See this snippet of increment_pointer_or_chunk:
+      if (*chunk_number == last_chunk_number)
+        (*curp)++; /* *curp == *endp with last chunk signals end of file */

If you think this is too obscure to handle this, I agree, but I
couldn't think of a better/easier way at that time. If you have any
suggestions, feel free :-). But maybe I should refactor the patch
first, so it's easier to see how this would work out best.

The same "obscurity" also applies to reaching "beginning of file"
(which can happen in decrement_pointer_or_chunk, either during prefix
scanning (rolling back the first line) or during suffix scanning (if
there was no prefix, but still a difference in the first line)). There
it's done by setting the chunk number to -1, which is checked for in
find_identical_prefix and find_identical_suffix:
+      if (*chunk_number == 0)
+        (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */

Not ideal, but it works ...

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> > sub-structures, and simplifies all the code that uses them.
> > [...]
> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> So please go ahead and commit the refactoring.
> >
> > OK, committed in r1021282.
> 
> Thanks, looks much more manageable now.

I'd like to see a simplified version of your last patch, taking
advantage of that, before you go exploring other options.

> >> Also, maybe the last_chunk number could be included in the file_info
> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> every time as an extra argument. Seems like the sort of info that
> >> could be part of file_info.
> >
> > It looks to me like you only need to calculate it in exactly one place:
> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> > for efficiency, as it's a trivial calculation.
> 
> Hm, I don't know. That's recalculating it for every byte in the
> prefix. In the files I'm testing there could be 1 Mb or more of

No, no, not every byte - only 1 in 131072 of the calls need to calculate
it - only when it reaches the end of a block.


May I ask whether you've tested the code that crosses from one chunk to
another?  I noticed the following, just now:

> +      *at_least_one_end_reached = 
> +        file_baton->curp[idx0] == file_baton->endp[idx0] 
> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
> +    }
> +
> +  /* If both files reached their end (i.e. are fully identical),
> we're done */
> +  if (file_baton->curp[idx0] == file_baton->endp[idx0] 
> +        && file_baton->curp[idx1] == file_baton->endp[idx1])

Those tests seem to be saying "if we're at the end of the current chunk
then we're at the end of the file".  That looks wrong.  What do you
think?

In order to test this more easily than creating huge files, I wonder if
you can change CHUNK_SIZE and CHUNK_SHIFT to something much smaller such
as 8 bytes.  I'm trying this now - not with your patch, first, but with
just a plain trunk build, to see if it works.

- Julian



> > And in a later email you wrote:
> >> [...] Maybe I can give it a go,
> >> first suffix then prefix, and see if I can find real-life problems ...
> >
> > There's no need to re-visit that.  I think it's fine as is it, using a
> > separate pair of buffers.
> 
> I've been thinking some more about that. For relatively small files,
> smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127
> Kb, it all fits in one chunk, yet I still read it twice, once for
> suffix and once for prefix (and further token processing). Ok, caching
> by the OS might make this insignificant, but still.
> 
> Since most common source files in repositories are probably smaller
> than 128 Kb I might be hurting the common case, in cases where it's
> not compensated by performance gain from identical prefix/suffix.
> 
> Then again, maybe this can be solved with a specific optimization,
> without changing the current code too much: if file size is smaller
> than chunk size, I point the suffix buffer to the prefix buffer and
> I'm done.

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Sat, 2010-10-09, Johan Corveleyn wrote:
>> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > So I wrote a patch - attached - that refactors this into an array of 4
>> > sub-structures, and simplifies all the code that uses them.
> [...]
>> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> So please go ahead and commit the refactoring.
>
> OK, committed in r1021282.

Thanks, looks much more manageable now.

>> Also, maybe the last_chunk number could be included in the file_info
>> struct? Now it's calculated in several places: "last_chunk0 =
>> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
>> every time as an extra argument. Seems like the sort of info that
>> could be part of file_info.
>
> It looks to me like you only need to calculate it in exactly one place:
> in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> for efficiency, as it's a trivial calculation.

Hm, I don't know. That's recalculating it for every byte in the
prefix. In the files I'm testing there could be 1 Mb or more of
prefix, so 1 million times this calculation (and that for every diff
in say 2000 diffs for a blame operation). Unless the compiler can
optimize that to a single calculation, because it never changes (I
don't know enough about compilers to know if this is possible), it
might have an impact, even if it's a trivial calculation.

OTOH, I might be in premature-optimization-land here, the impact might
be negligible. If I find some time I'll test it.

> And in a later email you wrote:
>> [...] Maybe I can give it a go,
>> first suffix then prefix, and see if I can find real-life problems ...
>
> There's no need to re-visit that.  I think it's fine as is it, using a
> separate pair of buffers.

I've been thinking some more about that. For relatively small files,
smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127
Kb, it all fits in one chunk, yet I still read it twice, once for
suffix and once for prefix (and further token processing). Ok, caching
by the OS might make this insignificant, but still.

Since most common source files in repositories are probably smaller
than 128 Kb I might be hurting the common case, in cases where it's
not compensated by performance gain from identical prefix/suffix.

Then again, maybe this can be solved with a specific optimization,
without changing the current code too much: if file size is smaller
than chunk size, I point the suffix buffer to the prefix buffer and
I'm done.

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Sat, 2010-10-09, Johan Corveleyn wrote:
> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> Ok, third iteration of the patch in attachment. It passes make check.
> >>
> >> As discussed in [1], this version keeps 50 lines of the identical
> >> suffix around, to give the algorithm a good chance to generate a diff
> >> output of good quality (in all but the most extreme cases, this will
> >> be the same as with the original svn_diff algorithm).
> >>
> >> That's about the only difference with the previous iteration. So for
> >> now, I'm submitting this for review. Any feedback is very welcome :-).
> >
> > Hi Johan.
> 
> Hi Julian,
> 
> Thanks for taking a look.
> 
> > I haven't reviewed it, but after seeing today's discussion I had just
> > scrolled quickly through the previous version of this patch.  I noticed
> > that the two main functions - find_identical_suffix and
> > find_identical_suffix - are both quite similar (but not quite similar
> > enough to make them easily share implementation) and both quite long,
> > and I noticed you wrote in an earlier email that you were finding it
> > hard to make the code readable.  I have a suggestion that may help.
[...]
> > So I wrote a patch - attached - that refactors this into an array of 4
> > sub-structures, and simplifies all the code that uses them.
[...]
> Yes, great idea! That would indeed vastly simplify a lot of the code.
> So please go ahead and commit the refactoring.

OK, committed in r1021282.


> Also, maybe the last_chunk number could be included in the file_info
> struct? Now it's calculated in several places: "last_chunk0 =
> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> every time as an extra argument. Seems like the sort of info that
> could be part of file_info.

It looks to me like you only need to calculate it in exactly one place:
in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
for efficiency, as it's a trivial calculation.


> One more thing: you might have noticed that for find_identical_suffix
> I use other buffers, chunks, curp's, endp's, ... than for the prefix.
> For prefix scanning I can just use the stuff from the diff_baton,
> because after prefix scanning has finished, everything is buffered and
> pointing correctly for the normal algorithm to continue (i.e.
> everything points at the first byte of the first non-identical line).
> For suffix scanning I need to use other structures (newly alloc'd
> buffer etc), so as to not mess with those pointers/buffers from the
> diff_baton.

> Maybe I can give it a go,
> first suffix then prefix, and see if I can find real-life problems ...

> So: I think I'll need the file_info struct to be available out of the
> diff_baton_t struct as well, so I can use this in suffix scanning
> also.

Yes; I gave the struct type a name - "struct file_info" - so you can use
it in other places.

> (side-note: I considered first doing suffix scanning, then prefix
> scanning, so I could reuse the buffers/pointers from diff_baton all
> the time, and still have everything pointing correctly after
> eliminating prefix/suffix. But that could give vastly different
> results in some cases, for instance when original file is entirely
> identical to both the prefix and the suffix of the modified file. So I
> decided it's best to stick with first prefix, then suffix).

And in a later email you wrote:
> [...] Maybe I can give it a go,
> first suffix then prefix, and see if I can find real-life problems ...

There's no need to re-visit that.  I think it's fine as is it, using a
separate pair of buffers.


> > Responding to some of the other points you mentioned in a much earlier
> > mail:
> >
> >> 3) It's only implemented for 2 files. I'd like to generalize this for
> >> an array of datasources, so it can also be used for diff3 and diff4.
> >>
> >> 4) As a small hack, I had to add a flag "datasource_opened" to
> >> token.c#svn_diff__get_tokens, so I could have different behavior for
> >> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> >> no longer needed.
> >
> > Yes, I'd like to see 3), and so hack 4) will go away.
> 
> I'm wondering though how I should represent the datasources to pass
> into datasources_open. An array combined with a length parameter?
> Something like:
> 
>     static svn_error_t *
>     datasources_open(void *baton, apr_off_t *prefix_lines,
>                      svn_diff_datasource_e[] datasources, int datasources_len)
> 
> ? And then use for loops everywhere I now do things twice for the two
> datasources?

I haven't looked at how datasources_open() API is used or what form of
API would be best.

For the internal functions, you can now pass around an array of
file_info pointers rather than indexes.


> >> 5) I've struggled with making the code readable/understandable. It's
> >> likely that there is still a lot of room for improvement. I also
> >> probably need to document it some more.
> >
> > You need to write a full doc string for datasources_open(), at least.
> > It needs especially to say how it relates to datasource_open() - why
> > should the caller call this function rather than that function, and are
> > they both necessary - and how the caller is supposed to use the
> > 'prefix_lines' parameter.  And doc strings for
> > increment/decrement_...().
> >
> > But this makes me think, it looks to me like this whole
> > prefix-suffix-skipping functionality would fit better inside the
> > lower-level diff algorithm rather than inside the "datasources_open"
> > function.  Do you agree?
> >
> > It works as it is, but you were talking about wanting it to obey the
> > standard token-parsing rules such as "ignore white space", and so on.
> > It seems that things like this would be much better one level down.
> 
> Yes, I've been struggling with this. But I can't easily see it fit in
> the lower levels right now. Problem is that everything in those lower
> levels always acts on 1 datasource at a time (getting all the tokens,
> inserting them into the token tree, ... then the same for the next
> datasource). The datasource_open seemed to me to be the easiest place
> to combine datasources to do things for both of them "concurrently"
> (with least impact on the rest of the code).

I expect you're right.  You've studied this; I haven't.

> Maybe those lower-level functions could also be made
> "multi-datasource", but I have to think a little bit more about that.

My gut feeling is that's probably not the way to go.

I'll respond separately to your follow-up email about this.


> >> 6) I've not paid too much attention to low level optimizations, so
> >> here also there's probably room for improvement, which may be
> >> significant for the critical loops.
> >
> > Maybe. Not important at this stage.
> >
> >> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> >> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> >> should probably make all "chunks" of type apr_off_t instead of int
> >> (but I have not done that yet, because the original implementation
> >> also used int for the chunk in the svn_diff__file_baton_t struct).
> >> Should I do this (also in the baton struct)? Or should I use an
> >> explicit cast somewhere?
> >
> > I recommend using an explicit cast where needed, in this patch.
> > Changing the 'chunk' data type everywhere could be done in a separate
> > patch, either before or after this patch.
> 
> Ok, will do.
> 
> One more thing I was wondering about: maybe increment_... and
> decrement_... should (eventually) be implemented as macro's? In fact,
> they are mostly just replacements for curp++ and curp-- but taking
> into account that we have chunks. Just a thought ...

If profiling shows that the function-call overghead is too high, then I
suggest that you make just the simple part of it a macro, like this: "if
((f)->curp < (f)->endp - 1) (f)->curp++; else function_call(...);".

> Thanks again for your input.

My pleasure.

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote:
> On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> But this makes me think, it looks to me like this whole
> >> prefix-suffix-skipping functionality would fit better inside the
> >> lower-level diff algorithm rather than inside the "datasources_open"
> >> function.  Do you agree?
> >>
> >> It works as it is, but you were talking about wanting it to obey the
> >> standard token-parsing rules such as "ignore white space", and so on.
> >> It seems that things like this would be much better one level down.
> >
> > Yes, I've been struggling with this. But I can't easily see it fit in
> > the lower levels right now. Problem is that everything in those lower
> > levels always acts on 1 datasource at a time (getting all the tokens,
> > inserting them into the token tree, ... then the same for the next
> > datasource). The datasource_open seemed to me to be the easiest place
> > to combine datasources to do things for both of them "concurrently"
> > (with least impact on the rest of the code).
> >
> > Maybe those lower-level functions could also be made
> > "multi-datasource", but I have to think a little bit more about that.
> 
> I've thought a little bit more about this, going through the code, and
> yeah, it might be a good idea to push the prefix/suffix scanning into
> the lower-level parts of the diff-algorithm (the token parsing /
> building the token tree).
> 
> Something like:
> - Make token.c#svn_diff__get_tokens take multiple datasources.
> - In diff.c, diff3.c and diff4.c replace the multiple calls to this
> function to one call passing multiple datasources.
> - token.c#svn_diff__get_tokens (with multiple datasources) will take
> care of identical prefix and suffix scanning based on "tokens" (so can
> take advantage of the standard token-parsing rules with ignore-*
> options, by simply calling svn_diff_fns_t.datasource_get_next_token).

One of the improvements we're looking for is to make use of the already
existing diff options - ignore-white-space etc.

We can get that improvement with a much smaller change: simply by
calling the 'get_next_token' routine, or rather a part of it, from
within your current 'find_identical_prefix' function, without touching
any of the generic diffN.c/token.c code and APIs.

> Some things needed to support this:
> - svn_diff_fns_t.datasource_get_next_token: calculation of the hash
> should be made conditional (I don't want to waste time for the adler32
> hash for lines that are not going to be put in the token tree).

Yes.  If you take this "smaller change" approach, then the way to do
this would be to factor out the non-adler part of
'datasource_get_next_token' into a separate private function, and call
that.

> - For suffix scanning, I'll need another variation of
> datasource_get_next_token to get tokens from the end going backwards
> (say "datasource_get_previous_token"). No hash needed.

Yes.

> Does that make sense? Or am I missing it completely?

I can't really comment on the question of moving this right down into
the diffN.c/token.c code, as I don't have time to study that right now.
The possible benefits I can see are:

* Sharing this feature across all kinds of diff implementations
(currently: file and in-memory-string).  Good from a practical POV if
and only if really long strings are being compared in memory.  Good from
a code-structural POV.

* I'm curious about whether it would be possible to integrate prefix
skipping into the main algorithm in such a way that the algorithm would
see the skipped prefix as a possible match, just like any other chunk
(including its adler32), but in a much quicker way than the algorithm
currently finds such a prefix.  If so, it would be able to handle better
the cases where one file has added a prefix that is a duplicate of
existing text.  Same for the suffix.

I'm really not sure whether it's worth pursuing that line of
investigation.  If you do, and it turns out well, that would be great.
If you want to just stick to the implementation in diff_file.c, that
seems like a more practical place to stop, now that you have basically
solved the problem.

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> But this makes me think, it looks to me like this whole
>> prefix-suffix-skipping functionality would fit better inside the
>> lower-level diff algorithm rather than inside the "datasources_open"
>> function.  Do you agree?
>>
>> It works as it is, but you were talking about wanting it to obey the
>> standard token-parsing rules such as "ignore white space", and so on.
>> It seems that things like this would be much better one level down.
>
> Yes, I've been struggling with this. But I can't easily see it fit in
> the lower levels right now. Problem is that everything in those lower
> levels always acts on 1 datasource at a time (getting all the tokens,
> inserting them into the token tree, ... then the same for the next
> datasource). The datasource_open seemed to me to be the easiest place
> to combine datasources to do things for both of them "concurrently"
> (with least impact on the rest of the code).
>
> Maybe those lower-level functions could also be made
> "multi-datasource", but I have to think a little bit more about that.

I've thought a little bit more about this, going through the code, and
yeah, it might be a good idea to push the prefix/suffix scanning into
the lower-level parts of the diff-algorithm (the token parsing /
building the token tree).

Something like:
- Make token.c#svn_diff__get_tokens take multiple datasources.
- In diff.c, diff3.c and diff4.c replace the multiple calls to this
function to one call passing multiple datasources.
- token.c#svn_diff__get_tokens (with multiple datasources) will take
care of identical prefix and suffix scanning based on "tokens" (so can
take advantage of the standard token-parsing rules with ignore-*
options, by simply calling svn_diff_fns_t.datasource_get_next_token).

Some things needed to support this:
- svn_diff_fns_t.datasource_get_next_token: calculation of the hash
should be made conditional (I don't want to waste time for the adler32
hash for lines that are not going to be put in the token tree).
- For suffix scanning, I'll need another variation of
datasource_get_next_token to get tokens from the end going backwards
(say "datasource_get_previous_token"). No hash needed.

Does that make sense? Or am I missing it completely?

It would mean I'd have to throw away the current patch and rewrite it
in the token layer (but I don't really mind, as long as a good
solution takes its place :-)). And your file_info refactoring would
also not be used (although it's certainly a worthwhile refactoring by
itself, making the current code clearer).

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Sat, 2010-10-09, Johan Corveleyn wrote:
>> Ok, third iteration of the patch in attachment. It passes make check.
>>
>> As discussed in [1], this version keeps 50 lines of the identical
>> suffix around, to give the algorithm a good chance to generate a diff
>> output of good quality (in all but the most extreme cases, this will
>> be the same as with the original svn_diff algorithm).
>>
>> That's about the only difference with the previous iteration. So for
>> now, I'm submitting this for review. Any feedback is very welcome :-).
>
> Hi Johan.

Hi Julian,

Thanks for taking a look.

> I haven't reviewed it, but after seeing today's discussion I had just
> scrolled quickly through the previous version of this patch.  I noticed
> that the two main functions - find_identical_suffix and
> find_identical_suffix - are both quite similar (but not quite similar
> enough to make them easily share implementation) and both quite long,
> and I noticed you wrote in an earlier email that you were finding it
> hard to make the code readable.  I have a suggestion that may help.
>
> I think the existing structure of the svn_diff__file_baton_t is
> unhelpful:
> {
>  const svn_diff_file_options_t *options;
>  const char *path[4];
>
>  apr_file_t *file[4];
>  apr_off_t size[4];
>
>  int chunk[4];
>  char *buffer[4];
>  char *curp[4];
>  char *endp[4];
>
>  /* List of free tokens that may be reused. */
>  svn_diff__file_token_t *tokens;
>
>  svn_diff__normalize_state_t normalize_state[4];
>
>  apr_pool_t *pool;
> } svn_diff__file_baton_t;
>
> All those array[4] fields are logically related, but this layout forces
> the programmer to address them individually.
>
> So I wrote a patch - attached - that refactors this into an array of 4
> sub-structures, and simplifies all the code that uses them.
>
> I think this will help you to get better code clarity because then your
> increment_pointer_or_chunk() for example will be able to take a single
> pointer to a file_info structure instead of a lot of pointers to
> individual members of the same.
>
> Would you take a look and let me know if you agree.  If so, I can commit
> the refactoring straight away.

Yes, great idea! That would indeed vastly simplify a lot of the code.
So please go ahead and commit the refactoring.

Also, maybe the last_chunk number could be included in the file_info
struct? Now it's calculated in several places: "last_chunk0 =
offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
every time as an extra argument. Seems like the sort of info that
could be part of file_info.

One more thing: you might have noticed that for find_identical_suffix
I use other buffers, chunks, curp's, endp's, ... than for the prefix.
For prefix scanning I can just use the stuff from the diff_baton,
because after prefix scanning has finished, everything is buffered and
pointing correctly for the normal algorithm to continue (i.e.
everything points at the first byte of the first non-identical line).
For suffix scanning I need to use other structures (newly alloc'd
buffer etc), so as to not mess with those pointers/buffers from the
diff_baton.

So: I think I'll need the file_info struct to be available out of the
diff_baton_t struct as well, so I can use this in suffix scanning
also.

(side-note: I considered first doing suffix scanning, then prefix
scanning, so I could reuse the buffers/pointers from diff_baton all
the time, and still have everything pointing correctly after
eliminating prefix/suffix. But that could give vastly different
results in some cases, for instance when original file is entirely
identical to both the prefix and the suffix of the modified file. So I
decided it's best to stick with first prefix, then suffix).

> Responding to some of the other points you mentioned in a much earlier
> mail:
>
>> 3) It's only implemented for 2 files. I'd like to generalize this for
>> an array of datasources, so it can also be used for diff3 and diff4.
>>
>> 4) As a small hack, I had to add a flag "datasource_opened" to
>> token.c#svn_diff__get_tokens, so I could have different behavior for
>> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
>> no longer needed.
>
> Yes, I'd like to see 3), and so hack 4) will go away.

I'm wondering though how I should represent the datasources to pass
into datasources_open. An array combined with a length parameter?
Something like:

    static svn_error_t *
    datasources_open(void *baton, apr_off_t *prefix_lines,
                     svn_diff_datasource_e[] datasources, int datasources_len)

? And then use for loops everywhere I now do things twice for the two
datasources?

>> 5) I've struggled with making the code readable/understandable. It's
>> likely that there is still a lot of room for improvement. I also
>> probably need to document it some more.
>
> You need to write a full doc string for datasources_open(), at least.
> It needs especially to say how it relates to datasource_open() - why
> should the caller call this function rather than that function, and are
> they both necessary - and how the caller is supposed to use the
> 'prefix_lines' parameter.  And doc strings for
> increment/decrement_...().
>
> But this makes me think, it looks to me like this whole
> prefix-suffix-skipping functionality would fit better inside the
> lower-level diff algorithm rather than inside the "datasources_open"
> function.  Do you agree?
>
> It works as it is, but you were talking about wanting it to obey the
> standard token-parsing rules such as "ignore white space", and so on.
> It seems that things like this would be much better one level down.

Yes, I've been struggling with this. But I can't easily see it fit in
the lower levels right now. Problem is that everything in those lower
levels always acts on 1 datasource at a time (getting all the tokens,
inserting them into the token tree, ... then the same for the next
datasource). The datasource_open seemed to me to be the easiest place
to combine datasources to do things for both of them "concurrently"
(with least impact on the rest of the code).

Maybe those lower-level functions could also be made
"multi-datasource", but I have to think a little bit more about that.

>> 6) I've not paid too much attention to low level optimizations, so
>> here also there's probably room for improvement, which may be
>> significant for the critical loops.
>
> Maybe. Not important at this stage.
>
>> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
>> in diff_file.c#find_identical_suffix. To completely eliminate this, I
>> should probably make all "chunks" of type apr_off_t instead of int
>> (but I have not done that yet, because the original implementation
>> also used int for the chunk in the svn_diff__file_baton_t struct).
>> Should I do this (also in the baton struct)? Or should I use an
>> explicit cast somewhere?
>
> I recommend using an explicit cast where needed, in this patch.
> Changing the 'chunk' data type everywhere could be done in a separate
> patch, either before or after this patch.

Ok, will do.

One more thing I was wondering about: maybe increment_... and
decrement_... should (eventually) be implemented as macro's? In fact,
they are mostly just replacements for curp++ and curp-- but taking
into account that we have chunks. Just a thought ...

Thanks again for your input.

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Sat, 2010-10-09, Johan Corveleyn wrote:
> Ok, third iteration of the patch in attachment. It passes make check.
> 
> As discussed in [1], this version keeps 50 lines of the identical
> suffix around, to give the algorithm a good chance to generate a diff
> output of good quality (in all but the most extreme cases, this will
> be the same as with the original svn_diff algorithm).
> 
> That's about the only difference with the previous iteration. So for
> now, I'm submitting this for review. Any feedback is very welcome :-).

Hi Johan.

I haven't reviewed it, but after seeing today's discussion I had just
scrolled quickly through the previous version of this patch.  I noticed
that the two main functions - find_identical_suffix and
find_identical_suffix - are both quite similar (but not quite similar
enough to make them easily share implementation) and both quite long,
and I noticed you wrote in an earlier email that you were finding it
hard to make the code readable.  I have a suggestion that may help.

I think the existing structure of the svn_diff__file_baton_t is
unhelpful:
{
  const svn_diff_file_options_t *options;
  const char *path[4];

  apr_file_t *file[4];
  apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

  /* List of free tokens that may be reused. */
  svn_diff__file_token_t *tokens;

  svn_diff__normalize_state_t normalize_state[4];

  apr_pool_t *pool;
} svn_diff__file_baton_t;

All those array[4] fields are logically related, but this layout forces
the programmer to address them individually.

So I wrote a patch - attached - that refactors this into an array of 4
sub-structures, and simplifies all the code that uses them.

I think this will help you to get better code clarity because then your
increment_pointer_or_chunk() for example will be able to take a single
pointer to a file_info structure instead of a lot of pointers to
individual members of the same.

Would you take a look and let me know if you agree.  If so, I can commit
the refactoring straight away.

Responding to some of the other points you mentioned in a much earlier
mail:

> 3) It's only implemented for 2 files. I'd like to generalize this for
> an array of datasources, so it can also be used for diff3 and diff4.
> 
> 4) As a small hack, I had to add a flag "datasource_opened" to
> token.c#svn_diff__get_tokens, so I could have different behavior for
> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> no longer needed.

Yes, I'd like to see 3), and so hack 4) will go away.

> 5) I've struggled with making the code readable/understandable. It's
> likely that there is still a lot of room for improvement. I also
> probably need to document it some more.

You need to write a full doc string for datasources_open(), at least.
It needs especially to say how it relates to datasource_open() - why
should the caller call this function rather than that function, and are
they both necessary - and how the caller is supposed to use the
'prefix_lines' parameter.  And doc strings for
increment/decrement_...().

But this makes me think, it looks to me like this whole
prefix-suffix-skipping functionality would fit better inside the
lower-level diff algorithm rather than inside the "datasources_open"
function.  Do you agree?

It works as it is, but you were talking about wanting it to obey the
standard token-parsing rules such as "ignore white space", and so on.
It seems that things like this would be much better one level down.

> 6) I've not paid too much attention to low level optimizations, so
> here also there's probably room for improvement, which may be
> significant for the critical loops.

Maybe. Not important at this stage.

> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> should probably make all "chunks" of type apr_off_t instead of int
> (but I have not done that yet, because the original implementation
> also used int for the chunk in the svn_diff__file_baton_t struct).
> Should I do this (also in the baton struct)? Or should I use an
> explicit cast somewhere?

I recommend using an explicit cast where needed, in this patch.
Changing the 'chunk' data type everywhere could be done in a separate
patch, either before or after this patch.

- Julian

> I still consider this a WIP, because of the following remaining todo's
> (which may have a lot of impact on the current implementation):
> - Generalize for more than 2 datasources (for diff3 and diff4).
> - revv svn_diff_fns_t and maybe other stuff I've changed in public API.
> - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
> switch the implementation to read out entire lines before comparing
> (like datasources_get_next_token does).
> 
> Log message:
> [[[
> Make svn_diff_diff skip identical prefix and suffix to make diff and blame
> faster.
> 
> * subversion/include/svn_diff.h
>   (svn_diff_fns_t): Added new function types datasources_open and
>    get_prefix_lines to the vtable.
> 
> * subversion/libsvn_diff/diff_memory.c
>   (datasources_open): New function (does nothing).
>   (get_prefix_lines): New function (does nothing).
>   (svn_diff__mem_vtable): Added new functions datasources_open and
>    get_prefix_lines.
> 
> * subversion/libsvn_diff/diff_file.c
>   (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
>    and suffix_offset_in_chunk.
>   (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
>   (find_identical_prefix, find_identical_suffix): New functions.
>   (datasources_open): New function, to open both datasources and find their
>    identical prefix and suffix. From the identical suffix, 50 lines are kept to
>    help the diff algorithm find the nicest possible diff representation
>    in case of ambiguity.
>   (get_prefix_lines): New function.
>   (datasource_get_next_token): Stop at start of identical suffix.
>   (svn_diff__file_vtable): Added new functions datasources_open and
>    get_prefix_lines.
> 
> * subversion/libsvn_diff/diff.h
>   (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
>    the datasource was already opened.
> 
> * subversion/libsvn_diff/token.c
>   (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
>    datasource if datasource_opened is FALSE. Set the starting offset of the
>    position list to the number of prefix lines.
> 
> * subversion/libsvn_diff/lcs.c
>   (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
>    the offset of the sentinel position for EOF, even if one of the files
>    became empty after eliminating the identical prefix.
> 
> * subversion/libsvn_diff/diff.c
>   (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
>   (svn_diff_diff): Use new function datasources_open, to open original and
>    modified at once, and find their identical prefix and suffix. Pass
>    prefix_lines to svn_diff__lcs and to svn_diff__diff.
> 
> * subversion/libsvn_diff/diff3.c
>   (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
>    Pass prefix_lines = 0 to svn_diff__lcs.
> 
> * subversion/libsvn_diff/diff4.c
>   (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
>    Pass prefix_lines = 0 to svn_diff__lcs.
> ]]]
> 
> 
> Cheers,

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Ok, third iteration of the patch in attachment. It passes make check.

As discussed in [1], this version keeps 50 lines of the identical
suffix around, to give the algorithm a good chance to generate a diff
output of good quality (in all but the most extreme cases, this will
be the same as with the original svn_diff algorithm).

That's about the only difference with the previous iteration. So for
now, I'm submitting this for review. Any feedback is very welcome :-).

I still consider this a WIP, because of the following remaining todo's
(which may have a lot of impact on the current implementation):
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
switch the implementation to read out entire lines before comparing
(like datasources_get_next_token does).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix. From the identical suffix, 50 lines are kept to
   help the diff algorithm find the nicest possible diff representation
   in case of ambiguity.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Ok, third iteration of the patch in attachment. It passes make check.

As discussed in [1], this version keeps 50 lines of the identical
suffix around, to give the algorithm a good chance to generate a diff
output of good quality (in all but the most extreme cases, this will
be the same as with the original svn_diff algorithm).

That's about the only difference with the previous iteration. So for
now, I'm submitting this for review. Any feedback is very welcome :-).

I still consider this a WIP, because of the following remaining todo's
(which may have a lot of impact on the current implementation):
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
switch the implementation to read out entire lines before comparing
(like datasources_get_next_token does).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix. From the identical suffix, 50 lines are kept to
   help the diff algorithm find the nicest possible diff representation
   in case of ambiguity.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sun, Oct 3, 2010 at 1:46 AM, Johan Corveleyn <jc...@gmail.com> wrote:
> Hi,
>
> Here is a second iteration of the patch. It now passes make check.
>
> Differences from the previous version are:
> - Support for \r eol-style (\n and \r\n was already ok).
> - The number of prefix_lines is now passed to svn_diff__lcs, so it can
> use that value to set the position offset of the "EOF" marker
> correctly, in case one of both files has become empty after skipping
> the prefix. This fixes the crashes in blame_tests.py 2 and 7.
>
> The patch is pretty big, so please let me know if I should split it up
> to make it more reviewable (I could easily split it up between the
> prefix-finding and the suffix-finding, at the cost of having overview
> over the entire algorithm).
>
> Still to do:
> - Think about why results are sometimes different (because of
> eliminated suffix, the LCS can sometimes be slightly different), and
> what can be done about it.

Hm, this problem has made me reconsider whether my patch is a good
solution (more details below). I'm thinking of other ways of
implementing this kind of optimization, so for now I'm pulling this
patch. Please don't review it yet, as I might go for a radically
different approach (sorry for any time spent that may be lost).

The details:
The problem happens if there are two (or more) non-matching parts with
an identical part in between (so the prefix scanning stops early,
doesn't meet the suffix in one of the files), and the suffix scanning
goes too far because the end of the last change is identical to the
original at that point.

For example (best viewed with fixed-width font):
[[[
original                     | modified
-----------------------------------------------------------
This is line 1               | This is line 1
This is line 2               | This line is *changed*
This is line 3               | This is line 3
... (more identical lines)   | ...
existing_function()          | existing_function()
{                            | {
  // do stuff                |   // do stuff
  return SVN_NO_ERROR;       |   return SVN_NO_ERROR;
}                            | }
                             |
This is the end of the file  | new_function()
-----------------------------| {
                             |   // do other stuff
                             |   return SVN_NO_ERROR;
                             | }
                             |
                             | This is the end of the file
                             |-----------------------------
]]]

The following identical suffix will be eliminated:
[[[
  return SVN_NO_ERROR;
}

This is the end of the file
]]]

which means the LCS algorithm cannot do better than to say that the
following lines are new:
[[[
+  return SVN_NO_ERROR;
+}
+
+new_function()
+{
+  // do other stuff
]]]

Not quite what we want/expect. If the suffix is not eliminated, the
LCS algorithm gives the correct/expected result. Also if the change in
the beginning of the file isn't there, the result is correct (because
the prefix scanning goes first, and eliminates the identical stuff
from the start (I'll leave it as an exercise to the reader to see that
the result is better)).

Interestingly, GNU diff also has this problem. It mitigates it a bit
by keeping a number of identical prefix/suffix lines (at least the
number of context lines, or a higher number if supplied by
--horizon-lines=NUM). This cannot completely solve the problem though:
for any given number of "horizon-lines" one can always come up with an
example that doesn't give the correct result.

So I need to find a way to not "eliminate" the identical suffix, but
to mark those lines as identical and then include them in the
position_list that's going to be used by the LCS algorithm. I'd like
to do the same for the prefix. This means I need to approach this
problem on a different level.

To be continued ...

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Hi,

Here is a second iteration of the patch. It now passes make check.

Differences from the previous version are:
- Support for \r eol-style (\n and \r\n was already ok).
- The number of prefix_lines is now passed to svn_diff__lcs, so it can
use that value to set the position offset of the "EOF" marker
correctly, in case one of both files has become empty after skipping
the prefix. This fixes the crashes in blame_tests.py 2 and 7.

The patch is pretty big, so please let me know if I should split it up
to make it more reviewable (I could easily split it up between the
prefix-finding and the suffix-finding, at the cost of having overview
over the entire algorithm).

Still to do:
- Think about why results are sometimes different (because of
eliminated suffix, the LCS can sometimes be slightly different), and
what can be done about it.
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options.

But I'd like to do those things in follow-up patches, after this one
has been reviewed and digested a little bit. So at this point: review,
feedback, ... very welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
 (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Hi,

Here is a second iteration of the patch. It now passes make check.

Differences from the previous version are:
- Support for \r eol-style (\n and \r\n was already ok).
- The number of prefix_lines is now passed to svn_diff__lcs, so it can
use that value to set the position offset of the "EOF" marker
correctly, in case one of both files has become empty after skipping
the prefix. This fixes the crashes in blame_tests.py 2 and 7.

The patch is pretty big, so please let me know if I should split it up
to make it more reviewable (I could easily split it up between the
prefix-finding and the suffix-finding, at the cost of having overview
over the entire algorithm).

Still to do:
- Think about why results are sometimes different (because of
eliminated suffix, the LCS can sometimes be slightly different), and
what can be done about it.
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options.

But I'd like to do those things in follow-up patches, after this one
has been reviewed and digested a little bit. So at this point: review,
feedback, ... very welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
 (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Johan Corveleyn wrote on Tue, Sep 28, 2010 at 23:37:23 +0200:
> On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >> Index: subversion/include/svn_diff.h
> >> ===================================================================
> >> --- subversion/include/svn_diff.h     (revision 1001548)
> >> +++ subversion/include/svn_diff.h     (working copy)
> >> @@ -112,6 +112,11 @@
> > (personally I prefer 'svn diff -x-p' to show the function name here)
> 
> Ok, will do next time.
> 

Thanks.

> I've had some progress:
> - The blame_tests.py now all pass (tests 2 and 7 provoked a core
> dump). That makes all tests pass. However, although I fixed the
> coredump, I don't fully understand the root cause (why they ended up
> in the situation where they ended up). So I'm going to study that
> first a bit more.
> - I have now included support for files with \r eol-style.
> 
> I'll send a new version of the patch shortly.
> 

Sounds good.

> I'm also thinking that I could try to take advantage of -x-b/-x-w or
> -x--ignore-eol-style options to make it even faster (right now the
> prefix/suffix scanning will stop at the first difference, regardless
> if it's a whitespace or eol difference that could/should be ignored).
> 
...
> 
> Thoughts?

None, actually; I'm not (yet) sufficiently familiar with diff internals. 
But let's please have this work in small, digestible (read: reviewable) 
pieces --- support for -x--foo is exactly the sort of additional 
optimization that can be done in a separate patch (on top of this 
starting patch).

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Johan Corveleyn wrote on Tue, Sep 28, 2010 at 23:37:23 +0200:
> On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >> Index: subversion/include/svn_diff.h
> >> ===================================================================
> >> --- subversion/include/svn_diff.h     (revision 1001548)
> >> +++ subversion/include/svn_diff.h     (working copy)
> >> @@ -112,6 +112,11 @@
> > (personally I prefer 'svn diff -x-p' to show the function name here)
> 
> Ok, will do next time.
> 

Thanks.

> I've had some progress:
> - The blame_tests.py now all pass (tests 2 and 7 provoked a core
> dump). That makes all tests pass. However, although I fixed the
> coredump, I don't fully understand the root cause (why they ended up
> in the situation where they ended up). So I'm going to study that
> first a bit more.
> - I have now included support for files with \r eol-style.
> 
> I'll send a new version of the patch shortly.
> 

Sounds good.

> I'm also thinking that I could try to take advantage of -x-b/-x-w or
> -x--ignore-eol-style options to make it even faster (right now the
> prefix/suffix scanning will stop at the first difference, regardless
> if it's a whitespace or eol difference that could/should be ignored).
> 
...
> 
> Thoughts?

None, actually; I'm not (yet) sufficiently familiar with diff internals. 
But let's please have this work in small, digestible (read: reviewable) 
pieces --- support for -x--foo is exactly the sort of additional 
optimization that can be done in a separate patch (on top of this 
starting patch).

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Hi Daniel,

Thanks for the feedback.

On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>> Index: subversion/include/svn_diff.h
>> ===================================================================
>> --- subversion/include/svn_diff.h     (revision 1001548)
>> +++ subversion/include/svn_diff.h     (working copy)
>> @@ -112,6 +112,11 @@
> (personally I prefer 'svn diff -x-p' to show the function name here)

Ok, will do next time.

>>    svn_error_t *(*datasource_open)(void *diff_baton,
>>                                    svn_diff_datasource_e datasource);
>>
>> +  /** Open the datasources of type @a datasources. */
>> +  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
>> +                                   svn_diff_datasource_e datasource0,
>> +                                   svn_diff_datasource_e datasource1);
>> +
>
> So, you're extending the svn_diff_fns_t struct, which is defined in
> a public header.
>
> It's a public struct with no constructor function, so I believe you have
> to revv it (into svn_diff_fns2_t) in order to extend it (for binary
> compatibility: people allocating this struct and then using a newer
> Subversion library at runtime).
>
> If it did have a constructor function, you'd have to extend it only at
> the end, and even then make sure that if the added elements are NULL (eg
> because an old caller didn't know to fill them) then everything still
> worked.
>
> Probably more important to get the logic right than to revv it right
> away, though; the latter is a SMOP.

Doh! I'm sure that observation was in the back of my head somewhere,
but I forgot about it while working on the solution. Anyway, you're
right: there is first some more work to get the algorithm right.

I've had some progress:
- The blame_tests.py now all pass (tests 2 and 7 provoked a core
dump). That makes all tests pass. However, although I fixed the
coredump, I don't fully understand the root cause (why they ended up
in the situation where they ended up). So I'm going to study that
first a bit more.
- I have now included support for files with \r eol-style.

I'll send a new version of the patch shortly.

I'm also thinking that I could try to take advantage of -x-b/-x-w or
-x--ignore-eol-style options to make it even faster (right now the
prefix/suffix scanning will stop at the first difference, regardless
if it's a whitespace or eol difference that could/should be ignored).

However, I'm not sure if I should implement this myself, as part of
the find_identical_prefix and find_identical_suffix functions, or
switch to the usage of datasource_get_next_token (which is the
function that is used by the "real" diff algorithm to extract the
lines, and which uses util.c#svn_diff__normalize_buffer to normalize
whitespace and eol's if needed).

Right now, I don't read entire lines (tokens) but compare each byte as
I go. But I could do it line-based as well (extract line from file1,
extract line from file2, memcmp lines). I would have to make the
calculation of the adler32 hash in datasource_get_next_token
conditional on some boolean, or factor out the part of the function
that's useful to me into a new static function.

There is one caveat to this approach: I'm not sure if it would work
backwards (for suffix scanning). Well, the normalization function
wouldn't have to be changed, but the extraction of lines would have to
go backward. Surely it's possible, but I have no idea how much I'd
have to change the code in get_next_token to get lines backwards...

I'm also not sure if one would be (significantly) faster than the
other: comparing byte-by-byte while going through both files, or
extracting entire lines and then comparing the lines.

Thoughts?

Cheers,
-- 
Johan