You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@subversion.apache.org by st...@apache.org on 2012/10/13 07:49:50 UTC

svn commit: r1397773 - /subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

Author: stefan2
Date: Sat Oct 13 05:49:50 2012
New Revision: 1397773

URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
Log:
Due to public request: apply rep-sharing to equal data reps within
the same transaction. 

The idea is simple. When writing a noderev to the txn folder,
write another file named by the rep's SHA1 and store the rep
struct in there. Lookup is then straight-forward.

* subversion/libsvn_fs_fs/fs_fs.c
  (svn_fs_fs__put_node_revision): also look for SHA1-named files
  (get_shared_rep): write SHA1-named files

Modified:
    subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

Modified: subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c
URL: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c?rev=1397773&r1=1397772&r2=1397773&view=diff
==============================================================================
--- subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c (original)
+++ subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c Sat Oct 13 05:49:50 2012
@@ -2585,6 +2585,10 @@ svn_fs_fs__put_node_revision(svn_fs_t *f
   fs_fs_data_t *ffd = fs->fsap_data;
   apr_file_t *noderev_file;
   const char *txn_id = svn_fs_fs__id_txn_id(id);
+  const char *sha1 = ffd->rep_sharing_allowed && noderev->data_rep
+                   ? svn_checksum_to_cstring(noderev->data_rep->sha1_checksum,
+                                             pool)
+                   : NULL;
 
   noderev->is_fresh_txn_root = fresh_txn_root;
 
@@ -2603,7 +2607,32 @@ svn_fs_fs__put_node_revision(svn_fs_t *f
                                    svn_fs_fs__fs_supports_mergeinfo(fs),
                                    pool));
 
-  return svn_io_file_close(noderev_file, pool);
+  SVN_ERR(svn_io_file_close(noderev_file, pool));
+
+  /* if rep sharing has been enabled and the noderev has a data rep and
+   * its SHA-1 is known, store the rep struct under its SHA1. */
+  if (sha1)
+    {
+      apr_file_t *rep_file;
+      const char *file_name = svn_dirent_join(path_txn_dir(fs, txn_id, pool),
+                                              sha1, pool);
+      const char *rep_string = representation_string(noderev->data_rep,
+                                                     ffd->format,
+                                                     (noderev->kind
+                                                      == svn_node_dir),
+                                                     FALSE,
+                                                     pool);
+      SVN_ERR(svn_io_file_open(&rep_file, file_name,
+                               APR_WRITE | APR_CREATE | APR_TRUNCATE
+                               | APR_BUFFERED, APR_OS_DEFAULT, pool));
+
+      SVN_ERR(svn_io_file_write_full(rep_file, rep_string,
+                                     strlen(rep_string), NULL, pool));
+
+      SVN_ERR(svn_io_file_close(rep_file, pool));
+    }
+
+  return SVN_NO_ERROR;
 }
 
 
@@ -7083,6 +7112,30 @@ get_shared_rep(representation_t **old_re
         }
     }
 
+  /* look for intra-revision matches (usually data reps but not limited
+     to them in case props happen to look like some data rep)
+   */
+  if (*old_rep == NULL && rep->txn_id)
+    {
+      svn_node_kind_t kind;
+      const char *file_name
+        = svn_dirent_join(path_txn_dir(fs, rep->txn_id, pool),
+                          svn_checksum_to_cstring(rep->sha1_checksum, pool),
+                          pool);
+
+      /* in our txn, is there a rep file named with the wanted SHA1?
+         If so, read it and use that rep.
+       */
+      SVN_ERR(svn_io_check_path(file_name, &kind, pool));
+      if (kind == svn_node_file)
+        {
+          svn_stringbuf_t *rep_string;
+          SVN_ERR(svn_stringbuf_from_file2(&rep_string, file_name, pool));
+          SVN_ERR(read_rep_offsets_body(old_rep, rep_string->data,
+                                        rep->txn_id, FALSE, pool));
+        }
+    }
+
   /* Add information that is missing in the cached data. */
   if (*old_rep)
     {



Re: r1397773 - rep sharing in a txn - /subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

Posted by Branko Čibej <br...@wandisco.com>.
On 13.10.2012 10:17, Julian Foad wrote:
>> Author: stefan2
>> Date: Sat Oct 13 05:49:50 2012
>> New Revision: 1397773
>>
>> URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
>> Log:
>> Due to public request: apply rep-sharing to equal data reps within
>> the same transaction. 
>>
>> The idea is simple. When writing a noderev to the txn folder,
>> write another file named by the rep's SHA1 and store the rep
>> struct in there. Lookup is then straight-forward.
> Hi Stefan.  What's the scalability?  I'm wondering about the big-O performance of storing 10000 or 100000 files in the dir.

Yes, I was wondering about the same thing. The short answer is that it
depends on the underlying filesystem, and that hardly seems good enough.
At the very least, these files stored in shards if the filesystem itself
is sharded.

-- Brane


-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download


Re: r1397773 - rep sharing in a txn - /subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

Posted by Julian Foad <ju...@btopenworld.com>.
Stefan Fuhrmann wrote:

> Julian Foad wrote:
>>> The idea is simple. When writing a noderev to the txn folder,
>>> write another file named by the rep's SHA1 and store the rep
>>> struct in there. Lookup is then straight-forward.
>>
>> Hi Stefan.  What's the scalability?  I'm wondering about the
>> big-O performance of storing 10000 or 100000 files in the dir.
> 
> Yes, I don't like putting that many files into a single folder either.
> However, we do that already today by creating a file for each
> noderev. I.e. my change at worst doubled the number of files in
> the txn folder.

OK; in that case I'm satisfied that if there is a scalability problem then it was already a problem before this change.

- Julian


Re: r1397773 - rep sharing in a txn - /subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Sat, Oct 13, 2012 at 4:17 PM, Julian Foad <ju...@btopenworld.com>wrote:

> > Author: stefan2
>
> > Date: Sat Oct 13 05:49:50 2012
> > New Revision: 1397773
> >
> > URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
> > Log:
> > Due to public request: apply rep-sharing to equal data reps within
> > the same transaction.
> >
> > The idea is simple. When writing a noderev to the txn folder,
> > write another file named by the rep's SHA1 and store the rep
> > struct in there. Lookup is then straight-forward.
>
> Hi Stefan.  What's the scalability?  I'm wondering about the big-O
> performance of storing 10000 or 100000 files in the dir.
>

Yes, I don't like putting that many files into a single folder either.
However, we do that already today by creating a file for each
noderev. I.e. my change at worst doubled the number of files in
the txn folder.

-- Stefan^2.

-- 
*

Join us this October at Subversion Live
2012<http://www.wandisco.com/svn-live-2012>
 for two days of best practice SVN training, networking, live demos,
committer meet and greet, and more! Space is limited, so get signed up
today<http://www.wandisco.com/svn-live-2012>
!
*

Re: r1397773 - rep sharing in a txn - /subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

Posted by Julian Foad <ju...@btopenworld.com>.
> Author: stefan2

> Date: Sat Oct 13 05:49:50 2012
> New Revision: 1397773
> 
> URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
> Log:
> Due to public request: apply rep-sharing to equal data reps within
> the same transaction. 
> 
> The idea is simple. When writing a noderev to the txn folder,
> write another file named by the rep's SHA1 and store the rep
> struct in there. Lookup is then straight-forward.

Hi Stefan.  What's the scalability?  I'm wondering about the big-O performance of storing 10000 or 100000 files in the dir.

- Julian


> * subversion/libsvn_fs_fs/fs_fs.c
>   (svn_fs_fs__put_node_revision): also look for SHA1-named files
>   (get_shared_rep): write SHA1-named files
> 
> Modified:
>     subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c
> 
> Modified: subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c
> URL: 
> http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c?rev=1397773&r1=1397772&r2=1397773&view=diff
> ==============================================================================
> --- subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c (original)
> +++ subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c Sat Oct 13 05:49:50 2012
> @@ -2585,6 +2585,10 @@ svn_fs_fs__put_node_revision(svn_fs_t *f
>    fs_fs_data_t *ffd = fs->fsap_data;
>    apr_file_t *noderev_file;
>    const char *txn_id = svn_fs_fs__id_txn_id(id);
> +  const char *sha1 = ffd->rep_sharing_allowed && 
> noderev->data_rep
> +                   ? 
> svn_checksum_to_cstring(noderev->data_rep->sha1_checksum,
> +                                             pool)
> +                   : NULL;
> 
>    noderev->is_fresh_txn_root = fresh_txn_root;
> 
> @@ -2603,7 +2607,32 @@ svn_fs_fs__put_node_revision(svn_fs_t *f
>                                     svn_fs_fs__fs_supports_mergeinfo(fs),
>                                     pool));
> 
> -  return svn_io_file_close(noderev_file, pool);
> +  SVN_ERR(svn_io_file_close(noderev_file, pool));
> +
> +  /* if rep sharing has been enabled and the noderev has a data rep and
> +   * its SHA-1 is known, store the rep struct under its SHA1. */
> +  if (sha1)
> +    {
> +      apr_file_t *rep_file;
> +      const char *file_name = svn_dirent_join(path_txn_dir(fs, txn_id, pool),
> +                                              sha1, pool);
> +      const char *rep_string = representation_string(noderev->data_rep,
> +                                                     ffd->format,
> +                                                     (noderev->kind
> +                                                      == svn_node_dir),
> +                                                     FALSE,
> +                                                     pool);
> +      SVN_ERR(svn_io_file_open(&rep_file, file_name,
> +                               APR_WRITE | APR_CREATE | APR_TRUNCATE
> +                               | APR_BUFFERED, APR_OS_DEFAULT, pool));
> +
> +      SVN_ERR(svn_io_file_write_full(rep_file, rep_string,
> +                                     strlen(rep_string), NULL, pool));
> +
> +      SVN_ERR(svn_io_file_close(rep_file, pool));
> +    }
> +
> +  return SVN_NO_ERROR;
> }
> 
> 
> @@ -7083,6 +7112,30 @@ get_shared_rep(representation_t **old_re
>          }
>      }
> 
> +  /* look for intra-revision matches (usually data reps but not limited
> +     to them in case props happen to look like some data rep)
> +   */
> +  if (*old_rep == NULL && rep->txn_id)
> +    {
> +      svn_node_kind_t kind;
> +      const char *file_name
> +        = svn_dirent_join(path_txn_dir(fs, rep->txn_id, pool),
> +                          svn_checksum_to_cstring(rep->sha1_checksum, pool),
> +                          pool);
> +
> +      /* in our txn, is there a rep file named with the wanted SHA1?
> +         If so, read it and use that rep.
> +       */
> +      SVN_ERR(svn_io_check_path(file_name, &kind, pool));
> +      if (kind == svn_node_file)
> +        {
> +          svn_stringbuf_t *rep_string;
> +          SVN_ERR(svn_stringbuf_from_file2(&rep_string, file_name, pool));
> +          SVN_ERR(read_rep_offsets_body(old_rep, rep_string->data,
> +                                        rep->txn_id, FALSE, pool));
> +        }
> +    }
> +
>    /* Add information that is missing in the cached data. */
>    if (*old_rep)
>      {
> 

Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Nov 19, 2012 at 7:30 PM, Ben Reser <be...@reser.org> wrote:

> On Mon, Nov 19, 2012 at 6:43 AM, Stefan Fuhrmann
> <st...@wandisco.com> wrote:
> > A crashed writer process may leave a corrupt protorev and / or
> > other incomplete files. There is no atomic incremental change
> > here. The caller (client) using the crashed process is supposed
> > to detect the crash and abandon the transaction.
>
> I don't think that "supposed to" is documented anywhere (unless you
> want to consider our clients behavior as documenation).  After dealing
> with the stuck being_written flag that produced rep_write_cleanup, I
> don't think we should assume that going forward.
>

Seems that failure scenarios are not very will documented
in general. We often rely on "DtRT" without actually working
out what that thing might actually be.


> It may not be worth dealing with in fsfs but within fsfs2 I think we
> should make representation operations even to the protorev files (or
> whatever their equivalent is) be atomic.  Especially since these
> operations are usually driven across separate HTTP requests when we're
> using DAV.
>

Noted.


> Then again, beyond improving repo consistency in some edge cases it'd
> also allow us to support parallelizing the PUTs.  There may be other
> issues I'm not aware of off the top of my head in doing that right
> now, but at least in fsfs it's absolutely not supported since for the
> entire duration of the PUT you have the protorev file locked.
>

Since I'd like to attempt some support for concurrent commit
(e.g. for svnadmin load), I / we may stumble across some
approach that tackles the consistency issue for FSFS already.

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Ben Reser <be...@reser.org>.
On Mon, Nov 19, 2012 at 6:43 AM, Stefan Fuhrmann
<st...@wandisco.com> wrote:
> A crashed writer process may leave a corrupt protorev and / or
> other incomplete files. There is no atomic incremental change
> here. The caller (client) using the crashed process is supposed
> to detect the crash and abandon the transaction.

I don't think that "supposed to" is documented anywhere (unless you
want to consider our clients behavior as documenation).  After dealing
with the stuck being_written flag that produced rep_write_cleanup, I
don't think we should assume that going forward.

It may not be worth dealing with in fsfs but within fsfs2 I think we
should make representation operations even to the protorev files (or
whatever their equivalent is) be atomic.  Especially since these
operations are usually driven across separate HTTP requests when we're
using DAV.

Then again, beyond improving repo consistency in some edge cases it'd
also allow us to support parallelizing the PUTs.  There may be other
issues I'm not aware of off the top of my head in doing that right
now, but at least in fsfs it's absolutely not supported since for the
entire duration of the PUT you have the protorev file locked.

Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Mon, Nov 19, 2012 at 15:16:44 +0000:
> Stefan Fuhrmann wrote:
> > relatively rare and constructing an error object can be
> > expensive (NLS). svn_stringbuf_from_file2 is also a very
> > convenient function to use.
> I wonder if we could change our generic error handling scheme to defer
> NLS look-up until the time when we print the error.  That would
> eliminate that overhead in cases where we handle and clear an error
> without printing it.  Something to think about for the future.

I thought we considered error object creation to be expensive for other
reasons, too, not just for NLS reasons?  (eg, using a global pool)

Daniel

Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Nov 19, 2012 at 4:16 PM, Julian Foad <ju...@btopenworld.com>wrote:

>
> Stefan Fuhrmann wrote:
>
> > Thanks for the review! The code should be multi-thread safe.
>
>
hm. That should read "should be multi-thread safe *now*".

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Julian Foad <ju...@btopenworld.com>.
Stefan Fuhrmann wrote:

On Thu, Nov 15, 2012 at 6:16 PM, Julian Foad <ju...@btopenworld.com> wrote:
>>stefan2@apache.org wrote:
>>> URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
>>> Log:
>>> Due to public request: apply rep-sharing to equal data reps within
>>> the same transaction.
>>>
>>> The idea is simple. When writing a noderev to the txn folder,
>>> write another file named by the rep's SHA1 and store the rep
>>> struct in there. Lookup is then straight-forward.
>>
>> Please could you update the FSFS structure document.  I assume it
>> says what files are stored in the txn dir; if it doesn't it should.
> 
> Oops - hadn't thought about that! Done in r1411202.
> All other changes in r1411209.


Thanks, Stefan.  Looks good.
>>> +  /* in our txn, is there a rep file named with the wanted SHA1?

>>> +     If so, read it and use that rep. */
>>> +  SVN_ERR(svn_io_check_path(file_name, &kind, pool));
>> 
>> Instead of stat'ing the file to decide whether to open it, it's
>> more efficient and more 'robust' to just try to open it and then
>> handle the potential failure, isn't it?
> 
> Access is serialized (see below). Checking for existence
> should be more efficient because actual matches are
> relatively rare and constructing an error object can be
> expensive (NLS). svn_stringbuf_from_file2 is also a very
> convenient function to use.
I wonder if we could change our generic error handling scheme to defer NLS look-up until the time when we print the error.  That would eliminate that overhead in cases where we handle and clear an error without printing it.  Something to think about for the future.


[...]

> Thanks for the review! The code should be multi-thread safe.


Thanks.

- Julian


Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Thu, Nov 15, 2012 at 6:16 PM, Julian Foad <ju...@btopenworld.com>wrote:

> Hi Stefan.  Some review questions and comments...
>
> stefan2@apache.org wrote:
> > URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
> > Log:
> > Due to public request: apply rep-sharing to equal data reps within
> > the same transaction.
> >
> > The idea is simple. When writing a noderev to the txn folder,
> > write another file named by the rep's SHA1 and store the rep
> > struct in there. Lookup is then straight-forward.
>
> Please
>  could you update the FSFS structure document.  I assume it says what
> files are stored in the txn dir; if it doesn't it should.
>

Oops - hadn't thought about that! Done in r1411202.
All other changes in r1411209.

> * subversion/libsvn_fs_fs/fs_fs.c
> >   (svn_fs_fs__put_node_revision): also look for SHA1-named files
> >   (get_shared_rep): write SHA1-named files
> [...]
> > @@ -2603,7 +2607,32 @@ svn_fs_fs__put_node_revision(svn_fs_t *f
> >
> > -  return svn_io_file_close(noderev_file, pool);
> > +  SVN_ERR(svn_io_file_close(noderev_file, pool));
> > +
> > +  /* if rep sharing has been enabled and the noderev has a data rep
> > +   * and its SHA-1 is known, store the rep struct under its SHA1. */
>
> It
>  looks like we re-enter this 'if' block, and re-write the rep-reference
> file, every time we store a node-rev with the same representation --
> even if the rep we're storing was already found from this file.  This
> might not be much of an overhead, but it seems ugly.
>
> Could we
> avoid re-writing it in those cases?  Maybe by refactoring such that we
> write the rep-reference file immediately after writing the rep, instead
> of here in put_node_revision.
>

Yep. It turns out that rep_write_contents_close is the
only relevant user of that functionality. There, we can
also decide whether to store the sha1->rep mapping
or not. It also addresses the locking problem below.


> > +  if (sha1)
> > +   {
> > +    apr_file_t *rep_file;
> > +    const char *file_name = svn_dirent_join(path_txn_dir(fs, txn_id,\
> > +                                            pool), sha1, pool);
>
> Creating
>  the file name ('checksum_to_cstring' + 'path_txn_dir' +
> 'svn_dirent_join') is common to both this function and the function
> below where we read the file.  Factor it out, like the existing
> path_txn_node_rev() and similar functions, to make the commonality
> clear.
>

Done.


> > +    const char *rep_string = representation_string(noderev->data_rep,
> > +                                                   ffd->format,
> > +                                                   (noderev->kind
> > +                                                    == svn_node_dir),
> > +                                                   FALSE,
> > +                                                   pool);
> > +    SVN_ERR(svn_io_file_open(&rep_file, file_name,
> > +                             APR_WRITE | APR_CREATE | APR_TRUNCATE
> > +                             | APR_BUFFERED, APR_OS_DEFAULT, pool));
> > +
> > +    SVN_ERR(svn_io_file_write_full(rep_file, rep_string,
> > +                                   strlen(rep_string), NULL, pool));
> > +
> > +    SVN_ERR(svn_io_file_close(rep_file, pool));
> > +   }
> > +
> > +  return SVN_NO_ERROR;
> >  }
> >
> > @@ -7083,6 +7112,30 @@ get_shared_rep(representation_t **old_re
> >
> > +  /* look for intra-revision matches (usually data reps but not limited
> > +     to them in case props happen to look like some data rep)
> > +   */
> > +  if (*old_rep == NULL && rep->txn_id)
> > +    {
> > +      svn_node_kind_t kind;
> > +      const char *file_name
> > +        = svn_dirent_join(path_txn_dir(fs, rep->txn_id, pool),
> > +                          svn_checksum_to_cstring(rep->sha1_checksum,\
> > +                                                  pool), pool);
> > +
> > +      /* in our txn, is there a rep file named with the wanted SHA1?
> > +         If so, read it and use that rep. */
> > +      SVN_ERR(svn_io_check_path(file_name, &kind, pool));
>
> Instead
>  of stat'ing the file to decide whether to open it, it's more efficient
> and more 'robust' to just try to open it and then handle the potential
> failure, isn't it?
>

Access is serialized (see below). Checking for existence
should be more efficient because actual matches are
relatively rare and constructing an error object can be
expensive (NLS). svn_stringbuf_from_file2 is also a very
convenient function to use.

> +      if (kind == svn_node_file)
> > +        {
> > +          svn_stringbuf_t *rep_string;
> > +          SVN_ERR(svn_stringbuf_from_file2(&rep_string, file_name,\
> > +                                           pool));
>
> I'm
>  sure the answer is obvious to those familiar with FSFS, but just to be
> clear please could you tell me which locking
> mechanism guarantees that the opening and reading of this file cannot be
>  happening at the same time as this file is being created and written
> (by another thread or process)?
>

Writing representations is serialized using the protorev lock.
All access to the sha1->rep mapping is done from
rep_write_contents_close() now, which in turn, is protected
by that lock.


> Is it possible that when this
> file was being written, the writer crashed in such a way that the file
> is now present but incomplete, and do we want to be able to recover --
> that is, be able to re-open this txn and resume writing into it -- after
>  such a failure?  If so, we should catch errors here that result from
> reading incomplete data, and in such cases proceed as if the file was
> not found; or else write the file atomically so that that can't happen.
>

A crashed writer process may leave a corrupt protorev and / or
other incomplete files. There is no atomic incremental change
here. The caller (client) using the crashed process is supposed
to detect the crash and abandon the transaction.


> > +          SVN_ERR(read_rep_offsets_body(old_rep, rep_string->data,
> > +                                        rep->txn_id, FALSE, pool));
> > +        }
> > +    }
> > +
> >    /* Add information that is missing in the cached data. */
> >    if (*old_rep)
> >      {
>
> - Julian
>
>
Thanks for the review! The code should be multi-thread safe.

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: r1397773 - rep-sharing in a txn - subversion/libsvn_fs_fs/fs_fs.c

Posted by Julian Foad <ju...@btopenworld.com>.
Hi Stefan.  Some review questions and comments...

stefan2@apache.org wrote:
> URL: http://svn.apache.org/viewvc?rev=1397773&view=rev
> Log:
> Due to public request: apply rep-sharing to equal data reps within
> the same transaction. 
> 
> The idea is simple. When writing a noderev to the txn folder,
> write another file named by the rep's SHA1 and store the rep
> struct in there. Lookup is then straight-forward.

Please
 could you update the FSFS structure document.  I assume it says what 
files are stored in the txn dir; if it doesn't it should.

> * subversion/libsvn_fs_fs/fs_fs.c
>   (svn_fs_fs__put_node_revision): also look for SHA1-named files
>   (get_shared_rep): write SHA1-named files
[...]
> @@ -2603,7 +2607,32 @@ svn_fs_fs__put_node_revision(svn_fs_t *f
> 
> -  return svn_io_file_close(noderev_file, pool);
> +  SVN_ERR(svn_io_file_close(noderev_file, pool));
> +
> +  /* if rep sharing has been enabled and the noderev has a data rep
> +   * and its SHA-1 is known, store the rep struct under its SHA1. */

It
 looks like we re-enter this 'if' block, and re-write the rep-reference 
file, every time we store a node-rev with the same representation -- 
even if the rep we're storing was already found from this file.  This 
might not be much of an overhead, but it seems ugly.

Could we 
avoid re-writing it in those cases?  Maybe by refactoring such that we 
write the rep-reference file immediately after writing the rep, instead 
of here in put_node_revision.

> +  if (sha1)
> +   {
> +    apr_file_t *rep_file;
> +    const char *file_name = svn_dirent_join(path_txn_dir(fs, txn_id,\
> +                                            pool), sha1, pool);

Creating
 the file name ('checksum_to_cstring' + 'path_txn_dir' + 
'svn_dirent_join') is common to both this function and the function 
below where we read the file.  Factor it out, like the existing 
path_txn_node_rev() and similar functions, to make the commonality 
clear.

> +    const char *rep_string = representation_string(noderev->data_rep,
> +                                                   ffd->format,
> +                                                   (noderev->kind
> +                                                    == svn_node_dir),
> +                                                   FALSE,
> +                                                   pool);
> +    SVN_ERR(svn_io_file_open(&rep_file, file_name,
> +                             APR_WRITE | APR_CREATE | APR_TRUNCATE
> +                             | APR_BUFFERED, APR_OS_DEFAULT, pool));
> +
> +    SVN_ERR(svn_io_file_write_full(rep_file, rep_string,
> +                                   strlen(rep_string), NULL, pool));
> +
> +    SVN_ERR(svn_io_file_close(rep_file, pool));
> +   }
> +
> +  return SVN_NO_ERROR;
>  }
> 
> @@ -7083,6 +7112,30 @@ get_shared_rep(representation_t **old_re
> 
> +  /* look for intra-revision matches (usually data reps but not limited
> +     to them in case props happen to look like some data rep)
> +   */
> +  if (*old_rep == NULL && rep->txn_id)
> +    {
> +      svn_node_kind_t kind;
> +      const char *file_name
> +        = svn_dirent_join(path_txn_dir(fs, rep->txn_id, pool),
> +                          svn_checksum_to_cstring(rep->sha1_checksum,\
> +                                                  pool), pool);
> +
> +      /* in our txn, is there a rep file named with the wanted SHA1?
> +         If so, read it and use that rep. */
> +      SVN_ERR(svn_io_check_path(file_name, &kind, pool));

Instead
 of stat'ing the file to decide whether to open it, it's more efficient 
and more 'robust' to just try to open it and then handle the potential 
failure, isn't it?

> +      if (kind == svn_node_file)
> +        {
> +          svn_stringbuf_t *rep_string;
> +          SVN_ERR(svn_stringbuf_from_file2(&rep_string, file_name,\
> +                                           pool));

I'm
 sure the answer is obvious to those familiar with FSFS, but just to be clear please could you tell me which locking 
mechanism guarantees that the opening and reading of this file cannot be
 happening at the same time as this file is being created and written 
(by another thread or process)?

Is it possible that when this 
file was being written, the writer crashed in such a way that the file 
is now present but incomplete, and do we want to be able to recover -- 
that is, be able to re-open this txn and resume writing into it -- after
 such a failure?  If so, we should catch errors here that result from 
reading incomplete data, and in such cases proceed as if the file was 
not found; or else write the file atomically so that that can't happen.

> +          SVN_ERR(read_rep_offsets_body(old_rep, rep_string->data,
> +                                        rep->txn_id, FALSE, pool));
> +        }
> +    }
> +
>    /* Add information that is missing in the cached data. */
>    if (*old_rep)
>      {

- Julian