Posted to dev@subversion.apache.org by Greg Stein <gs...@gmail.com> on 2010/02/16 19:15:40 UTC

Re: pristine store database -- was: pristine store design

Meta-comment: all of these issues that you're bringing up are
*exactly* why I wanted to punt the issue of external-to-WC stores out
of 1.7. Keeping a 1:1 correspondence of pristine stores to working
copies keeps the problem tractable, especially given all the other
work that is needed.

On Tue, Feb 16, 2010 at 09:58, Bert Huijben <be...@qqmail.nl> wrote:
>...
> The idea was not to move just the pristine store, but also wc.db to a
> central location.

Yup. There were also some comments about separable locations for the
wc.db: one where you keep your metadata (e.g. home dir), and one for
the pristines (e.g. /var/svn). I never liked that idea, but mention
it for completeness' sake.

As Bert noted in his response, the schema is designed to manage
multiple working copies.

>...
>> Instead, we could not store size and mtime at all! :)
>
> Or we could store both to perform simple consistency checks...

Dunno about that, but the storage of SIZE is part of the (intended)
algorithm for pristine storage. It is allowed to have a row in
PRISTINE with SIZE==0 in order to say "I know about this pristine, and
this row is present to satisfy integrity constraints with other
tables, but the pristine has NOT been written into the store." Once
the file *is* written, then the resulting size is stored into the
table.

> A query over checksum from BASE_NODE and WORKING_NODE + older_checksum,
> left_checksum and right_checksum from ACTUAL_NODE would give a list of
> all PRISTINE records that are in-use.
> The rest can safely be deleted if disk space is required.

Yes. That was the design goal.

It gets more complicated when you have a centralized wc.db and one or
more of the working copies are offline (e.g. removable storage,
network unavailable, etc). Again, these questions are why a
centralized concept has been punted for this generation.
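
To make the "in-use" idea concrete, here is a minimal sketch in C. It
assumes the checksums referenced by BASE_NODE, WORKING_NODE and
ACTUAL_NODE have already been collected into one array (the real design
would do this with a query against wc.db). The function name and the
in-memory arrays are hypothetical and only stand in for that query;
NULL entries model checksum columns that are unset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Subversion: find the pristines that
   no node references.  PRISTINES holds the checksum of every PRISTINE
   row; IN_USE holds every checksum referenced by a node and may contain
   NULL entries for unset columns.  The indexes of unreferenced
   pristines are written to OUT (room for N_PRISTINES entries), and
   their count is returned. */
static size_t
find_unreferenced(const char *const *pristines, size_t n_pristines,
                  const char *const *in_use, size_t n_in_use,
                  size_t *out)
{
  size_t i, j, count = 0;

  assert(pristines != NULL && out != NULL);

  for (i = 0; i < n_pristines; i++)
    {
      int referenced = 0;

      /* Linear scan; fine for a sketch, a real store would index. */
      for (j = 0; j < n_in_use && !referenced; j++)
        {
          if (in_use[j] != NULL && strcmp(pristines[i], in_use[j]) == 0)
            referenced = 1;
        }

      if (!referenced)
        out[count++] = i;   /* candidate for deletion */
    }

  return count;
}
```

This only works when every working copy that shares the store can be
consulted, which is exactly the difficulty with offline working copies
described above.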

>...
>> So, while checking size and mtime gives a sense of basic sanity, it is
>> really just a puny excuse for not checking full checksum validity. If we
>> really care about correctness of pristines, *every* read of a pristine
>> should verify the checksum along the way. (That would include to always

Since the pristine design returns a *stream* on the pristine, we
can always insert a checksumming stream in order to verify the
contents. I believe the bottleneck will be I/O, so performing a
checksum should be Just Fine. If the stream is read to completion,
then we can validate the checksum (and don't worry about partial
reads; that isn't all that common, I believe).

Note that we'd only have to insert a single checksum stream, not SHA1 *and* MD5.

Also, since I/O is the (hypothetical) bottleneck, this is also why
compression is handy.
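
As a rough illustration of the checksumming-stream idea, the sketch
below wraps an in-memory buffer in a reader that updates a running
checksum on every read and compares it with the expected value only once
the data has been read to completion. Everything here is an assumption
made for the example: the type and function names are hypothetical, the
source is a memory buffer rather than a file, and 32-bit FNV-1a is used
as a small stand-in for the SHA1 or MD5 checksum the thread talks about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in checksum: 32-bit FNV-1a, NOT SHA1/MD5. */
#define FNV_OFFSET 2166136261u
#define FNV_PRIME  16777619u

enum { READ_OK = 0, READ_EOF_VERIFIED = 1, READ_CORRUPT = -1 };

/* Hypothetical verifying reader over an in-memory "pristine". */
typedef struct {
  const unsigned char *data;
  size_t len;
  size_t pos;
  uint32_t running;    /* checksum of the bytes handed out so far */
  uint32_t expected;   /* checksum recorded for the pristine */
} verifying_reader;

static uint32_t
fnv1a_update(uint32_t h, const unsigned char *p, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      h ^= p[i];
      h *= FNV_PRIME;
    }
  return h;
}

static void
reader_init(verifying_reader *r, const void *data, size_t len,
            uint32_t expected)
{
  r->data = data;
  r->len = len;
  r->pos = 0;
  r->running = FNV_OFFSET;
  r->expected = expected;
}

/* Copy up to *N bytes into BUF and set *N to the number copied.
   Returns READ_OK while data remains.  Once the last byte has been
   handed out, returns READ_EOF_VERIFIED or READ_CORRUPT depending on
   whether the running checksum matches.  A caller that stops early
   never triggers the comparison. */
static int
reader_read(verifying_reader *r, void *buf, size_t *n)
{
  size_t avail, take;

  assert(r != NULL && buf != NULL && n != NULL);

  avail = r->len - r->pos;
  take = *n < avail ? *n : avail;
  memcpy(buf, r->data + r->pos, take);
  r->running = fnv1a_update(r->running, r->data + r->pos, take);
  r->pos += take;
  *n = take;

  if (r->pos < r->len)
    return READ_OK;
  return r->running == r->expected ? READ_EOF_VERIFIED : READ_CORRUPT;
}
```

The checksum work piggybacks on reads the caller was doing anyway, so
the only extra cost is CPU time per byte.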

>...
> This is cheap to detect. If we have to read the file anyway, we can just
> recalculate the hash via the stream hash apis. The disk IO for reading the
> file is the real pain with current CPUs. (At least for SHA1/MD5. Sha256 is
> still painful in my tests)
>...
> You would probably see our binary diff fail before you get here and if the
> RA layer provides a hash for the result or input file you can use that for
> verification on the streams. (See the update editor for some examples where
> we do that in one pass since 1.6)

Right, and right.

>...

Cheers,
-g

Re: pristine store database -- was: pristine store design

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Feb 16, 2010, at 10:52 PM, Neels J Hofmeyr wrote:

> Greg Stein wrote:
>> On Tue, Feb 16, 2010 at 16:22, Johan Corveleyn <jc...@gmail.com> wrote:
>>> OTOH, I guess a 0 bytes pristine has to be handled specially anyway
>>> (what's the checksum of 0 bytes), no?
> 
> SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
> MD5: d41d8cd98f00b204e9800998ecf8427e
> 
> ;)

And we even special case these values in libsvn_subr:

static const unsigned char svn_sha1__empty_string_digest_array[] = {
  0xda, 0x39, 0xa3, 0xee, 0x5e, 0x6b, 0x4b, 0x0d, 0x32, 0x55,
  0xbf, 0xef, 0x95, 0x60, 0x18, 0x90, 0xaf, 0xd8, 0x07, 0x09
};

static const unsigned char svn_md5__empty_string_digest_array[] = {
  212, 29, 140, 217, 143, 0, 178, 4, 233, 128, 9, 152, 236, 248, 66, 126
};

-Hyrum
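
For anyone wanting to check that the two arrays above really are the
digests Neels quoted, here is a small sketch that renders them as hex
and compares a digest against the empty-input SHA1. The array contents
are copied from the message; the array names and both helper functions
are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Digest bytes as quoted in the message above. */
static const unsigned char empty_sha1[20] = {
  0xda, 0x39, 0xa3, 0xee, 0x5e, 0x6b, 0x4b, 0x0d, 0x32, 0x55,
  0xbf, 0xef, 0x95, 0x60, 0x18, 0x90, 0xaf, 0xd8, 0x07, 0x09
};

static const unsigned char empty_md5[16] = {
  212, 29, 140, 217, 143, 0, 178, 4, 233, 128, 9, 152, 236, 248, 66, 126
};

/* Hypothetical helper: write LEN digest bytes as lowercase hex into
   OUT, which must have room for 2 * LEN + 1 characters. */
static void
digest_to_hex(const unsigned char *digest, size_t len, char *out)
{
  static const char hexdigits[] = "0123456789abcdef";
  size_t i;

  assert(digest != NULL && out != NULL);

  for (i = 0; i < len; i++)
    {
      out[2 * i] = hexdigits[digest[i] >> 4];
      out[2 * i + 1] = hexdigits[digest[i] & 0x0f];
    }
  out[2 * len] = '\0';
}

/* Hypothetical helper: nonzero if the 20-byte DIGEST is the SHA1 of
   empty input, so the caller can skip reading anything. */
static int
is_empty_sha1(const unsigned char *digest)
{
  return memcmp(digest, empty_sha1, sizeof(empty_sha1)) == 0;
}
```

Passing each array to digest_to_hex() yields the same hex strings that
appear in the quoted text.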

Re: pristine store database -- was: pristine store design

Posted by Neels J Hofmeyr <ne...@elego.de>.
Greg Stein wrote:
> On Tue, Feb 16, 2010 at 16:22, Johan Corveleyn <jc...@gmail.com> wrote:
>> OTOH, I guess a 0 bytes pristine has to be handled specially anyway
>> (what's the checksum of 0 bytes), no?

SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
MD5: d41d8cd98f00b204e9800998ecf8427e

;)

> 
> Oops. You're right. I meant a null column value, as I documented in
> the schema file.

It *is* kind of useless to store an empty pristine :)
But given that empty files don't happen often in version control it could be
just as silly to special-case it at all.

~Neels


Re: pristine store database -- was: pristine store design

Posted by Greg Stein <gs...@gmail.com>.
On Tue, Feb 16, 2010 at 16:22, Johan Corveleyn <jc...@gmail.com> wrote:
> On Tue, Feb 16, 2010 at 8:15 PM, Greg Stein <gs...@gmail.com> wrote:
>>>> Instead, we could not store size and mtime at all! :)
>>>
>>> Or we could store both to perform simple consistency checks...
>>
>> Dunno about that, but the storage of SIZE is part of the (intended)
>> algorithm for pristine storage. It is allowed to have a row in
>> PRISTINE with SIZE==0 in order to say "I know about this pristine, and
>> this row is present to satisfy integrity constraints with other
>> tables, but the pristine has NOT been written into the store." Once
>> the file *is* written, then the resulting size is stored into the
>> table.
>
> Shouldn't the marker value to indicate "pristine has NOT been written
> into the store" be something like -1 instead of 0? Just taking into
> account that there might be files that really have a size of 0 bytes.
> These should be supported, shouldn't they?
>
> OTOH, I guess a 0 bytes pristine has to be handled specially anyway
> (what's the checksum of 0 bytes), no?

Oops. You're right. I meant a null column value, as I documented in
the schema file.

Cheers,
-g
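
The difference between a null SIZE and a SIZE of 0 can be sketched in C
with an explicit "is null" flag next to the value. The struct and
function names below are hypothetical; they only model the idea of a
PRISTINE row and are not the actual schema or any Subversion API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory model of a PRISTINE row. */
typedef struct {
  const char *checksum;
  int size_is_null;     /* models a NULL column: file not yet written */
  long long size;       /* meaningful only when size_is_null == 0 */
} pristine_row;

/* A row whose size is null is known, but its file is not in the store. */
static int
pristine_is_written(const pristine_row *row)
{
  assert(row != NULL);
  return !row->size_is_null;
}

/* Record the size once the file has actually been written.  A size of
   0 is a legitimate value here: it describes an empty file. */
static void
pristine_mark_written(pristine_row *row, long long size)
{
  assert(row != NULL && size >= 0);
  row->size = size;
  row->size_is_null = 0;
}
```

With this shape an empty pristine (size 0) and a not-yet-written
pristine (null size) can no longer be confused, which was Johan's
concern.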

Re: pristine store database -- was: pristine store design

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Feb 16, 2010 at 8:15 PM, Greg Stein <gs...@gmail.com> wrote:
>>> Instead, we could not store size and mtime at all! :)
>>
>> Or we could store both to perform simple consistency checks...
>
> Dunno about that, but the storage of SIZE is part of the (intended)
> algorithm for pristine storage. It is allowed to have a row in
> PRISTINE with SIZE==0 in order to say "I know about this pristine, and
> this row is present to satisfy integrity constraints with other
> tables, but the pristine has NOT been written into the store." Once
> the file *is* written, then the resulting size is stored into the
> table.

Shouldn't the marker value to indicate "pristine has NOT been written
into the store" be something like -1 instead of 0? Just taking into
account that there might be files that really have a size of 0 bytes.
These should be supported, shouldn't they?

OTOH, I guess a 0 bytes pristine has to be handled specially anyway
(what's the checksum of 0 bytes), no?

Johan