You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Julian Foad <ju...@wandisco.com> on 2010/04/20 17:37:30 UTC

WC-NG: Commit with new pristine store and SHA-1 checksums

On adapting the "commit" data flow to work with the new pristine text
store and SHA-1 checksums.


OBSERVATIONS
============

The call graph during a commit is (adapted from
notes/wc-ng/use-of-tmp-text-base-path):

                          svn_client_commit4()
                            |^[T]  |       |
  wc_to_repos_copy()        |^[M]  |       |
            |               |^     |       |
        svn_client__do_commit()    |       |
                    [N] |^         |       |
                        |^         |       |
                        |^         |       |
  LIBSVN_CLIENT         |^         |       |
  ......................................................................
  LIBSVN_WC             |^         |
                        |^   svn_wc_queue_committed3()
                        |^
                        |^                 |
  svn_wc_transmit_text_deltas3()           |
            [N] |^                         |
  svn_wc__internal_transmit_text_deltas()  |
    [N] |^                                 |
        |^                                 |
        |^              svn_wc_process_committed_queue2()
        |^                      |
        |^              svn_wc__process_committed_internal()
        |^                      |
        |^              process_committed_leaf()
        |^                |^          |v
        |^                |^          |v
  svn_wc__text_base_path(tmp=TRUE)    |v
                                      |v
         (Here the new text base is installed from the given path.)


The calling sequence is:

  svn_client__do_commit() calls
      svn_wc_transmit_text_deltas3(), once per modified file; then

  svn_client_commit4() calls
      svn_wc_queue_committed3(), once per significant node, to build a
queue; then svn_wc_process_committed_queue2(), just once, passing that
queue.


svn_wc_transmit_text_deltas3() does several things:

  - determine the new text base content by translating the
    working file to repository-normal form;
  - transmit deltas of that against the old text base;
  - verify the recorded checksum of the old text base;
  - optionally, store the new text base in a temporary file.

These purposes could be separated out a bit, although I don't think it's
particularly important to do so right now:

  - get a stream of the working text translated to RNF:
      svn_wc_translated_stream2()
  - get a stream of the old text base:
      svn_wc_get_pristine_contents2()
  - transmit deltas, given two readable streams:
      editor->apply_textdelta() f/b svn_txdelta_run()
  - write a stream to a new text base file:
      (We have old and new private APIs for this; I don't think we have
public ones.)

The tricky bit here is: after writing a new text-base file, that file's
path (old way) or SHA1 checksum (new way) needs to be communicated to
svn_wc_process_committed_queue().  The path isn't currently being
communicated, it's being re-derived.

The obvious way (1):  Pass the list of new-text-base checksums on to
svn_wc_process_committed_queue().  That is relatively straightforward.
I need to check whether the Queue is already having a separate entry for
each and every modified file, and make sure it does.

Another possible way (2):  If, in svn_wc_transmit_text_deltas3() or just
afterwards, we were to store the checksum in the ACTUAL_NODE table, in a
checksum field that represents the "Repo-Normal-Form of the ACTUAL text
which is currently being committed", then at commit post-processing time
we could get this checksum from the DB, knowing only the working file's
path, and write it into the BASE_NODE table.  (We wouldn't rely on the
working file remaining untouched on disk, because we've stored a copy of
this checksummed text into the pristine store at the same time.)

What are the pros and cons?  See backward compatibility, below.


THE NEW WAY
===========

This is a straightforward way to modify the new API.

Note: svn_wc_transmit_text_deltas3(), svn_wc_queue_committed3() and
svn_wc_process_committed_queue2() are already new in 1.7; their
predecessors must be kept working for backward compatibility.  

svn_wc_transmit_text_deltas3() shall:
  - write the new text base into the pristine store rather than a
particular path;
  - return the SHA-1 checksum of the new text base;
  - no longer return the old "tempfile" and "md5_digest" outputs.

svn_wc_queue_committed3() shall:
  - take the SHA-1 checksum of every modified file *and every new file*
(instead of an MD-5 checksum).

svn_wc_process_committed_queue2() shall:
  - use the SHA-1 checksums found in the queue.


COMPATIBILITY
=============

We need to keep the old WC interface working:

  svn_wc_transmit_text_deltas2(&tempfile, &md5_digest, ...)
  svn_wc_queue_committed2(queue, path, ..., md5_checksum)
  svn_wc_process_committed_queue(queue, ...)

How?  I can't see a way to communicate the SHA-1 checksum to
svn_wc_process_committed_queue() via the queue, but I can think of the
following ways.


(1)

An advantage of the method that stores the new checksum in the
ACTUAL_NODE table is that the backward-compatible old API can look there
to find the new text base:

svn_wc_transmit_text_deltas2() shall, if TEMPFILE is non-null:
    - store the new text base (with its checksums) in the pristine
store;
    - store the new text base's SHA-1 checksum in ACTUAL_NODE;
    - return the new text base's MD5 digest;
    - set *TEMPFILE to some path that it's safe for the caller to
attempt to delete, but that is not otherwise meaningful;

svn_wc_process_committed_queue() shall:
    - look in ACTUAL_NODE to find each file's new text base SHA-1;
    - "install" the new text base by simply writing that SHA-1 to
BASE_NODE.


(2)

If we use the simpler method (where the new API puts the new text in the
pristine store and passes only its SHA-1 checksum along), the only
solution I can find for the compatibility API is to keep working the old
way: put the temporary text base file at the old specially derivable
path, and then find it there within svn_wc_queue_committed2() or
svn_wc_process_committed_queue():

svn_wc_transmit_text_deltas2() shall, if TEMPFILE is non-null:
    - store the new text base at the special derived path;
    - set *TEMPFILE to that path;
    - return the new text base's MD5 digest.

svn_wc_process_committed_queue() shall:
    - find the new text base at the special derived path;
    - calculate its SHA-1 checksum;
    - store it (with its checksums) in the pristine store;
    - put that SHA-1 checksum in ACTUAL_NODE.


Comments please.  Is either of those ways to be preferred?

One last thought: I haven't described here where an added file gets its
text base when it is committed.  Of course its SHA-1 checksum needs to
be calculated and passed on too, similar to a modified file but not
using svn_wc_transmit_text_deltas3().

- Julian


Re: WC-NG: Commit with new pristine store and SHA-1 checksums

Posted by Julian Foad <ju...@wandisco.com>.
On Wed, 2010-04-21, Julian Foad wrote:
> Greg Stein wrote:
> > On Wed, Apr 21, 2010 at 05:09, Philip Martin <ph...@wandisco.com> wrote:
> > > Julian Foad <ju...@wandisco.com> writes:
> > >
> > >> COMPATIBILITY
> > >> =============
> > >>
> > >> We need to keep the old WC interface working:
> > >>
> > >>   svn_wc_transmit_text_deltas2(&tempfile, &md5_digest, ...)
> > >>   svn_wc_queue_committed2(queue, path, ..., md5_checksum)
> > >>   svn_wc_process_committed_queue(queue, ...)
> > >>
> > >> How?  I can't see a way to communicate the SHA-1 checksum to
> > >> svn_wc_process_committed_queue() via the queue, but I can think of the
> > >> following ways.
> > >
> > > There is an access baton in the old interface, it's opaque and can be
> > > made to store anything.  It could contain a hash of filename=>SHA-1.
> 
> Thanks, Philip - I hadn't thought of that.  I'll bear it in mind.  It
> could be useful for other things too.
> 
> > Presumably, transmit_text_deltas2 is a wrapper around deltas3. Thus,
> > deltas2 has the SHA1 value from that inner call.
> > 
> > Putting something into *TEMPFILE is optional, so we can skip that. The
> > MD5 result from deltas2 can be fetched from the PRISTINE table, given
> > the SHA1 key.
> 
> Yup, transmit_text_deltas2() can easily know and return the MD-5.
> 
> > queue_committed2 can use the MD5 value and key into PRISTINE (we
> > should have an index on PRISTINE.md5_checksum) to find the SHA1.
> 
> I wondered about looking up the pristine text from its MD-5.  Certainly
> possible (preferably via an index for speed).
> 
> At first I had a slight concern about the remote possibility of MD-5
> collisions.  I now think we can alleviate the concern by checking if a
> new pristine text ever has an MD-5 that's already recorded in the
> pristine store against a different SHA-1.  If that ever happens, we can
> issue a warning or error, the resolution of which is "upgrade to 1.7+,
> which no longer relies on MD-5 uniqueness".
> 
> The bit I'm not sure about is whether the MD-5 of *every* new text base
> in the commit is actually passed through the queue.  I'll go and test
> whether it is - it doesn't look like it, from the way I read the code.

BTW I confirmed that, yesterday: it's true with the current code.
However, the APIs have been through several revisions, and the oldest
ones - svn_wc_process_committed() and svn_wc_process_committed2() -
don't even communicate the MD5 checksum on to the post-commit step.

In IRC discussion with Greg we decided a probable strategy for dealing
with the old APIs is to make them rely on a fixed, deterministic, tmp
path, like they always have done.  Instead of the WC-1 scheme which gave
a path like

  <dir/somedir/.svn/tmp/text-base/FOO>

the WC-NG code will provide a path that's safe to be in a single .svn
dir at the root of a WC, such as

  <.svn/tmp/text-base/dir/somedir/FOO>

or maybe encoding <dir/somedir/foo> into a single path component if
that's better than dealing with arbitrary levels of subdirs.

- Julian


> > This seems pretty straight-forward, unless I've missed something.
> > 
> > (note that the pristine store would have this "extra" pristine,
> > unreferenced by any other table; that could get garbage-cleaned by a
> > separate process... BUT: there is an admin lock present during a
> > commit, so we'd simply avoid GC'ing the PRISTINE table/on-disk)
> 
> Yup, I'm happy that we can manage the GC properly, in one way or
> another.
> 
> Thanks for the feedback.
> 
> 
> - Julian
> 
> 


Re: WC-NG: Commit with new pristine store and SHA-1 checksums

Posted by Julian Foad <ju...@wandisco.com>.
Greg Stein wrote:
> On Wed, Apr 21, 2010 at 05:09, Philip Martin <ph...@wandisco.com> wrote:
> > Julian Foad <ju...@wandisco.com> writes:
> >
> >> COMPATIBILITY
> >> =============
> >>
> >> We need to keep the old WC interface working:
> >>
> >>   svn_wc_transmit_text_deltas2(&tempfile, &md5_digest, ...)
> >>   svn_wc_queue_committed2(queue, path, ..., md5_checksum)
> >>   svn_wc_process_committed_queue(queue, ...)
> >>
> >> How?  I can't see a way to communicate the SHA-1 checksum to
> >> svn_wc_process_committed_queue() via the queue, but I can think of the
> >> following ways.
> >
> > There is an access baton in the old interface, it's opaque and can be
> > made to store anything.  It could contain a hash of filename=>SHA-1.

Thanks, Philip - I hadn't thought of that.  I'll bear it in mind.  It
could be useful for other things too.

> Presumably, transmit_text_deltas2 is a wrapper around deltas3. Thus,
> deltas2 has the SHA1 value from that inner call.
> 
> Putting something into *TEMPFILE is optional, so we can skip that. The
> MD5 result from deltas2 can be fetched from the PRISTINE table, given
> the SHA1 key.

Yup, transmit_text_deltas2() can easily know and return the MD-5.

> queue_committed2 can use the MD5 value and key into PRISTINE (we
> should have an index on PRISTINE.md5_checksum) to find the SHA1.

I wondered about looking up the pristine text from its MD-5.  Certainly
possible (preferably via an index for speed).

At first I had a slight concern about the remote possibility of MD-5
collisions.  I now think we can alleviate the concern by checking if a
new pristine text ever has an MD-5 that's already recorded in the
pristine store against a different SHA-1.  If that ever happens, we can
issue a warning or error, the resolution of which is "upgrade to 1.7+,
which no longer relies on MD-5 uniqueness".

The bit I'm not sure about is whether the MD-5 of *every* new text base
in the commit is actually passed through the queue.  I'll go and test
whether it is - it doesn't look like it, from the way I read the code.

> This seems pretty straight-forward, unless I've missed something.
> 
> (note that the pristine store would have this "extra" pristine,
> unreferenced by any other table; that could get garbage-cleaned by a
> separate process... BUT: there is an admin lock present during a
> commit, so we'd simply avoid GC'ing the PRISTINE table/on-disk)

Yup, I'm happy that we can manage the GC properly, in one way or
another.

Thanks for the feedback.


- Julian


Re: WC-NG: Commit with new pristine store and SHA-1 checksums

Posted by Greg Stein <gs...@gmail.com>.
On Wed, Apr 21, 2010 at 05:43, Greg Stein <gs...@gmail.com> wrote:
> On Wed, Apr 21, 2010 at 05:09, Philip Martin <ph...@wandisco.com> wrote:
>> Julian Foad <ju...@wandisco.com> writes:
>>
>>> COMPATIBILITY
>>> =============
>>>
>>> We need to keep the old WC interface working:
>>>
>>>   svn_wc_transmit_text_deltas2(&tempfile, &md5_digest, ...)
>>>   svn_wc_queue_committed2(queue, path, ..., md5_checksum)
>>>   svn_wc_process_committed_queue(queue, ...)
>>>
>>> How?  I can't see a way to communicate the SHA-1 checksum to
>>> svn_wc_process_committed_queue() via the queue, but I can think of the
>>> following ways.
>>
>> There is an access baton in the old interface, it's opaque and can be
>> made to store anything.  It could contain a hash of filename=>SHA-1.
>
> Presumably, transmit_text_deltas2 is a wrapper around deltas3. Thus,
> deltas2 has the SHA1 value from that inner call.
>
> Putting something into *TEMPFILE is optional, so we can skip that. The

Oop. Not optional. But we can arrange for a TEMPFILE. Or maybe refer
to the in-pristine-store file.

>...

Re: WC-NG: Commit with new pristine store and SHA-1 checksums

Posted by Greg Stein <gs...@gmail.com>.
On Wed, Apr 21, 2010 at 05:09, Philip Martin <ph...@wandisco.com> wrote:
> Julian Foad <ju...@wandisco.com> writes:
>
>> COMPATIBILITY
>> =============
>>
>> We need to keep the old WC interface working:
>>
>>   svn_wc_transmit_text_deltas2(&tempfile, &md5_digest, ...)
>>   svn_wc_queue_committed2(queue, path, ..., md5_checksum)
>>   svn_wc_process_committed_queue(queue, ...)
>>
>> How?  I can't see a way to communicate the SHA-1 checksum to
>> svn_wc_process_committed_queue() via the queue, but I can think of the
>> following ways.
>
> There is an access baton in the old interface, it's opaque and can be
> made to store anything.  It could contain a hash of filename=>SHA-1.

Presumably, transmit_text_deltas2 is a wrapper around deltas3. Thus,
deltas2 has the SHA1 value from that inner call.

Putting something into *TEMPFILE is optional, so we can skip that. The
MD5 result from deltas2 can be fetched from the PRISTINE table, given
the SHA1 key.

queue_committed2 can use the MD5 value and key into PRISTINE (we
should have an index on PRISTINE.md5_checksum) to find the SHA1.

This seems pretty straight-forward, unless I've missed something.

(note that the pristine store would have this "extra" pristine,
unreferenced by any other table; that could get garbage-cleaned by a
separate process... BUT: there is an admin lock present during a
commit, so we'd simply avoid GC'ing the PRISTINE table/on-disk)

Cheers,
-g

Re: WC-NG: Commit with new pristine store and SHA-1 checksums

Posted by Philip Martin <ph...@wandisco.com>.
Julian Foad <ju...@wandisco.com> writes:

> COMPATIBILITY
> =============
>
> We need to keep the old WC interface working:
>
>   svn_wc_transmit_text_deltas2(&tempfile, &md5_digest, ...)
>   svn_wc_queue_committed2(queue, path, ..., md5_checksum)
>   svn_wc_process_committed_queue(queue, ...)
>
> How?  I can't see a way to communicate the SHA-1 checksum to
> svn_wc_process_committed_queue() via the queue, but I can think of the
> following ways.

There is an access baton in the old interface, it's opaque and can be
made to store anything.  It could contain a hash of filename=>SHA-1.

-- 
Philip