Posted to dev@subversion.apache.org by Neels J Hofmeyr <ne...@elego.de> on 2010/02/16 13:54:56 UTC

pristine store database -- was: pristine store design

Philip Martin wrote:
> Neels J Hofmeyr <ne...@elego.de> writes:
> 
>> THE PRISTINE STORE
>> ==================
>>
>> The pristine store is a local cache of complete content of files that are
>> known to be in the repository. It is hashed by a checksum of that content
>> (SHA1).
> 
> I'm not sure whether you are planning one table per pristine store or
> one table per working copy, but I think it's one per pristine store.
> Obviously it makes no difference until pristine stores can be
> shared (and it might be one per directory in the short term, depending
> on when we stop being one database per directory).

Thanks for that. This is the tip of an iceberg called 'a pristine store does
not equal a working copy [root]'.

The question is how to store the PRISTINE table (also see below) once it
serves various working copies. Will we have a separate SQLite db store, and
create a new file system entity called 'pristine store' that the user can
place anywhere, like a working copy?

We could also keep pristine store and working copy welded together, so that
one working copy can use the pristine store of another working copy, and
that a 'pristine store' that isn't used as a working copy is just a
--depth=empty checkout of any folder URL of that repository. It practically
has the same effect as completely separating pristine stores from working
copies (there is another SQLite store somewhere else), but we can just
re-use the WC API, no need to have a separate pristine *store* API (create
new store, contact local store database, indicate a store location, checking
presence given a location, etc.).

> 
>> SOME IMPLEMENTATION INSIGHTS
>> ============================
>>
>> There is a PRISTINE table in the SQLite database with columns
>>  (checksum, md5_checksum, size, refcount)
>>
>> The pristine contents are stored in the local filesystem in a pristine file,
>> which may or may not be compressed (opaquely hidden behind the pristines API).
>> The goal is to be able to have a pristine store per working copy, per user as
>> well as system-wide, and to configure each working copy as to which pristine
>> store(s) it should use for reading/writing.
>>   
>> There is a canonical way of getting a given CHECKSUM's pristine file name for
>> a given working copy without contacting the WC database (static function
>> get_pristine_fname()).
>>
>> When interacting with the pristine store, we want to, as appropriate, check
>> for (combos of):
>>   db-presence    - presence in the PRISTINE table with noted file size > 0
>>   file-presence  - pristine file presence
>>   stat-match     - PRISTINE table's size and mtime match file system
>>   checksum-match - validity of data in the file against the checksum
>>
>> file-presence is gotten for free from a successful stat-match (fstat),
>> checksum-match (fopen) and unchecked read of the file (fopen).
>>
>> How fast we consider things:
>>   db-presence    - very fast to moderately fast (in case of "empty db cache")
>>   file-presence  - slow (fstat or fopen)
>>   stat-match     - slow (fstat plus SQLite query)
>>   checksum-match - super slow (reading, checksumming)
> 
> I'm prepared to believe a database query can be faster than stat when
> the inode cache is cold, but what about when the inode cache is hot?

Also thanks for this!

I don't know that much about database/file system benchmarks, let alone on
different platforms. My initial classifications are mostly guessing, mixed
with provocative prodding to wake up more experienced devs ;)

I'm also not really aware how expensive it is to calculate a checksum while
reading a stream for other purposes. How much cpu time does it add if the
file I/O would happen anyway? Is it negligible?

I guess we'll ultimately have to just try out what performs best.
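To get a feel for the cost, here is a minimal sketch (plain Python, nothing Subversion-specific) of piggybacking a SHA-1 on a read that would happen anyway; the chunk size is an arbitrary assumption:

```python
import hashlib

def read_with_checksum(path, chunk_size=64 * 1024):
    """Read a file while updating a SHA-1 incrementally.

    The hashing rides along with I/O that would happen anyway, so the
    only extra cost is CPU time per chunk.
    """
    sha1 = hashlib.sha1()
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
            chunks.append(chunk)
    return b"".join(chunks), sha1.hexdigest()
```

Timing something like this against a plain read on a large file would answer the CPU-overhead question empirically.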

> If the database query requires even one system call then it could well
> be slower.  Multiple processes accessing a working copy, or writing to
> the pristine store, might bias this further towards stat being faster.
> If we decide to share the pristine store between several working
> copies then a shared database could become a bottleneck.
> 
> [...]
> 
>> Use case "need": "I want to use this pristine's content, definitely."
>> ---------------
>> pseudocode:
>>  pristine_check(&present, checksum, _usable)         (3)
>>  if !present:
>>    get_pristine_from_repos(checksum, ra)             (9)
>>  pristine_read(&stream, checksum)                    (6)
>>
>> (3) check for _usable:
>>      - db-presence
>>      - if the checksum is not present in the table, return that it is not
>>        present (don't check for file existence as well).
>>      - stat-match (includes file-presence)
>>      - if the checksum is present in the table but file is bad/not there,
>>        bail, asking user to 'svn cleanup --pristines' (or sth.)
>>
>> (9) See use case "fetch". After this, either the pristine file is ready for
>>     reading, or "fetch" has bailed already.
>>
>> (6) fopen()
> 
> 
> I think this is the most important case from a performance point of
> view.  This is what 'svn status' et al. use, and it's important for
> GUIs as a lot of the "feel" depends on how fast a process can query
> the metadata.

Agreed.

> If we were to do away with the PRISTINE table, then we would not have
> to worry about it becoming a bottleneck.  We don't need the existence
> check if we are just about to open the file, since opening the file
> proves that it exists.

<rant>Yes, I meant that, semantically, there has to be an existence check.
You're right that it is gotten for free from opening the file. It's still
important to note where the antenna sits that detects non-existence.</rant>

> We obviously have the checksum already, from
> the BASE/WORKING table, so we only need the PRISTINE table for the
> size/mtime.  Perhaps we could store those in the BASE/WORKING table
> and eliminate the PRISTINE table, or is this too much of a layering
> violation?  The pristine store is then just a sharded directory, into
> which we move files and from which we read files.

-1

While we could store size&mtime in the BASE/WORKING tables, this causes size
and mtime to be stored multiple times (wherever a pristine is referenced)
and involves editing multiple entries when a pristine is removed/added due
to high-water-mark cleanup or repair. That would be nothing less than horrible.
Taking one step away from that, each working copy should have a dedicated
table that stores size and mtime only once. Then we still face the situation
that size and mtime are stored multiple times (once per working copy), and
where, if a central pristine store is restructured, every working copy has
to be updated. Bad idea.

Instead, we could not store size and mtime at all! :)

They are merely half-checks for validity. During normal operation, size and
mtime should never change, because we don't open write streams to pristines.
If anyone messes with the pristine store accidentally, we would pick it up
with the size, or if that stayed the same, with the mtime. But we can pick
up all cases of bitswaps/disk failure *only* by verifying *full checksum
validity*!

So, while checking size and mtime gives a sense of basic sanity, it is
really just a puny excuse for not checking full checksum validity. If we
really care about correctness of pristines, *every* read of a pristine
should verify the checksum along the way. (That would mean always reading
the complete pristine, even if just a few lines in the middle are needed.)

* neels dreams of disks that hardware-checksum on-the-fly

If I further follow my dream of us emulating such hardware, we would store
checksums for sub-chunks of each pristine, so that we can read small
sections of pristines, being sure that the given section is correct without
having to read the whole pristine.
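A sketch of that idea, with per-chunk digests stored next to the full one (the chunk size and names are made up for illustration):

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical sub-chunk size

def chunk_digests(data, chunk_size=CHUNK_SIZE):
    """Return the full SHA-1 plus one SHA-1 per fixed-size chunk."""
    full = hashlib.sha1(data).hexdigest()
    parts = [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
             for i in range(0, len(data), chunk_size)]
    return full, parts

def verify_section(section_bytes, expected_digest):
    """Validate one section without reading the whole pristine."""
    return hashlib.sha1(section_bytes).hexdigest() == expected_digest
```

The pristine would still be indexed by the full checksum; the chunk digests only serve partial reads.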

Whoa, look where you got me now! ;)

I think it's a very valid question. Chuck the mtime and size, thus get rid
of the PRISTINE table, thus do away with checking for any inconsistency
between table and file system, also do away with possible database
bottlenecks, and reduce the location of the pristine store to a mere local
abspath. We have the checksum, we have the filename. Checking mtime and
length protects against accidental editing of the pristine files. But any
malicious or hw-failure corruption is in fact *concealed* by mtime and
length staying intact! ("hey, we checked it, it must be correct.")
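"We have the checksum, we have the filename" could be as simple as a sharded content-addressed layout (the two-hex-digit sharding is an assumption, mirroring common content-addressed designs, not the actual wc-ng layout):

```python
import os

def pristine_path(store_root, sha1_hex):
    """Map a checksum to its file path in a sharded pristine store.

    Sharding on the first two hex digits keeps any one directory
    from accumulating millions of entries.
    """
    return os.path.join(store_root, sha1_hex[:2], sha1_hex)
```

No database lookup is needed to locate a pristine this way.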

Let's play through a corrupted pristine (with unchanged mtime/length). This
is just theoretical...

Commit modification:

- User makes a checkout / revert / update that uses a locally
  corrupted pristine. The corrupted pristine thus sits in the WC.

- User makes a text mod

- User commits

- Client/network layer communicate the *delta* between the local pristine
  and the local mod to the repository, and the checksum of the modified
  text.

- Repos applies the delta to the intact pristine it has in *its* store.

- Repos finds the resulting checksum to be *different* from the client's
  checksum, because the underlying pristine was corrupt.

--> Yay! No need to do *ANY* local verification at all!!

Of course, in case the client/network layer decide to send the full text
instead of a delta, the corruption is no longer detected. :(


Merge and commit:

- User makes a merge that uses a locally corrupted pristine.

- The merge *delta* applied to the working copy is incorrect.

- User does not note the corruption (e.g. via --accept=mine-full)

- User commits

- Repos accepts the changes based on the corrupted pristine that was
  used to get the merge delta, because it can't tell the difference
  from a normal modification.

--> My goodness, merge needs to check pristine validity on each read,
    as if it wasn't slow enough. But as discussed above, even if merge
    checked mtime and length, it would not necessarily detect disk failure
    and crafted malicious corruption.


Thanks, Philip.

I'm now challenging the need to store mtime and length, suggesting a need to do
more checksumming instead. The checksumming overhead could be smaller than
the slowdown caused by a database bottleneck.

For future optimisation, I'm also suggesting pristines should have
additionally stored checksums for small chunks of each pristine, while still
being indexed by the full checksum.
(Which may imply a db again :/ , but that db would only be hit if we're
trying to save time by reading just a small bit of the pristine)

Everyone, please prove me wrong!

Thanks,
~Neels


Re: pristine store database -- was: pristine store design

Posted by Neels J Hofmeyr <ne...@elego.de>.
Commenting on myself...

Neels J Hofmeyr wrote:
> Philip Martin wrote:
>> Neels J Hofmeyr <ne...@elego.de> writes:
>>
>>> THE PRISTINE STORE
>>> ==================
>>>
>>> The pristine store is a local cache of complete content of files that are
>>> known to be in the repository. It is hashed by a checksum of that content
>>> (SHA1).
>> I'm not sure whether you are planning one table per pristine store or
>> one table per working copy, but I think it's one per pristine store.
>> Obviously it makes no difference until pristine stores can be
>> shared (and it might be one per directory in the short term, depending
>> on when we stop being one database per directory).
> 
> Thanks for that. This is the tip of an iceberg called 'a pristine store does
> not equal a working copy [root]'.
> 
> The question is how to store the PRISTINE table (also see below) once it
> serves various working copies. Will we have a separate SQLite db store, and
> create a new file system entity called 'pristine store' that the user can
> place anywhere, like a working copy?
> 
> We could also keep pristine store and working copy welded together, so that
> one working copy can use the pristine store of another working copy, and
> that a 'pristine store' that isn't used as a working copy is just a
> --depth=empty checkout of any folder URL of that repository. It practically
> has the same effect as completely separating pristine stores from working
> copies (there is another SQLite store somewhere else), but we can just
> re-use the WC API, no need to have a separate pristine *store* API (create
> new store, contact local store database, indicate a store location, checking
> presence given a location, etc.).

If we have a single wc.db per user, we can also easily have a single
pristine store per user. Until then, we'd probably better use a separate
pristine store per WC...?

Tackling a system-wide pristine store also means coping with write
permissions, so that may be a different thing entirely (like a local service
daemon instead of a publicly writable file system location...)

> 
>>> SOME IMPLEMENTATION INSIGHTS
>>> ============================
>>>
>>> There is a PRISTINE table in the SQLite database with columns
>>>  (checksum, md5_checksum, size, refcount)
>>>
>>> The pristine contents are stored in the local filesystem in a pristine file,
>>> which may or may not be compressed (opaquely hidden behind the pristines API).
>>> The goal is to be able to have a pristine store per working copy, per user as
>>> well as system-wide, and to configure each working copy as to which pristine
>>> store(s) it should use for reading/writing.
>>>   
>>> There is a canonical way of getting a given CHECKSUM's pristine file name for
>>> a given working copy without contacting the WC database (static function
>>> get_pristine_fname()).
>>>
>>> When interacting with the pristine store, we want to, as appropriate, check
>>> for (combos of):
>>>   db-presence    - presence in the PRISTINE table with noted file size > 0
>>>   file-presence  - pristine file presence
>>>   stat-match     - PRISTINE table's size and mtime match file system
>>>   checksum-match - validity of data in the file against the checksum
>>>
>>> file-presence is gotten for free from a successful stat-match (fstat),
>>> checksum-match (fopen) and unchecked read of the file (fopen).
>>>
>>> How fast we consider things:
>>>   db-presence    - very fast to moderately fast (in case of "empty db cache")
>>>   file-presence  - slow (fstat or fopen)
>>>   stat-match     - slow (fstat plus SQLite query)
>>>   checksum-match - super slow (reading, checksumming)
>> I'm prepared to believe a database query can be faster than stat when
>> the inode cache is cold, but what about when the inode cache is hot?
> 
> Also thanks for this!
> 
> I don't know that much about database/file system benchmarks, let alone on
> different platforms. My initial classifications are mostly guessing, mixed
> with provocative prodding to wake up more experienced devs ;)
> 
> I'm also not really aware how expensive it is to calculate a checksum while
> reading a stream for other purposes. How much cpu time does it add if the
> file I/O would happen anyway? Is it negligible?
> 
> I guess we'll ultimately have to just try out what performs best.
> 
>> If the database query requires even one system call then it could well
>> be slower.  Multiple processes accessing a working copy, or writing to
>> the pristine store, might bias this further towards stat being faster.
>> If we decide to share the pristine store between several working
>> copies then a shared database could become a bottleneck.
>>
>> [...]
>>
>>> Use case "need": "I want to use this pristine's content, definitely."
>>> ---------------
>>> pseudocode:
>>>  pristine_check(&present, checksum, _usable)         (3)
>>>  if !present:
>>>    get_pristine_from_repos(checksum, ra)             (9)
>>>  pristine_read(&stream, checksum)                    (6)
>>>
>>> (3) check for _usable:
>>>      - db-presence
>>>      - if the checksum is not present in the table, return that it is not
>>>        present (don't check for file existence as well).
>>>      - stat-match (includes file-presence)
>>>      - if the checksum is present in the table but file is bad/not there,
>>>        bail, asking user to 'svn cleanup --pristines' (or sth.)
>>>
>>> (9) See use case "fetch". After this, either the pristine file is ready for
>>>     reading, or "fetch" has bailed already.
>>>
>>> (6) fopen()
>>
>> I think this is the most important case from a performance point of
>> view.  This is what 'svn status' et al. use, and it's important for
>> GUIs as a lot of the "feel" depends on how fast a process can query
>> the metadata.
> 
> Agreed.
> 
>> If we were to do away with the PRISTINE table, then we would not have
>> to worry about it becoming a bottleneck.  We don't need the existence
>> check if we are just about to open the file, since opening the file
>> proves that it exists.
> 
> <rant>Yes, I meant that, semantically, there has to be an existence check.
> You're right that it is gotten for free from opening the file. It's still
> important to note where the antenna sits that detects non-existence.</rant>
> 
>> We obviously have the checksum already, from
>> the BASE/WORKING table, so we only need the PRISTINE table for the
>> size/mtime.  Perhaps we could store those in the BASE/WORKING table
>> and eliminate the PRISTINE table, or is this too much of a layering
>> violation?  The pristine store is then just a sharded directory, into
>> which we move files and from which we read files.
> 
> -1
> 
> While we could store size&mtime in the BASE/WORKING tables, this causes size
> and mtime to be stored multiple times (wherever a pristine is referenced)
> and involves editing multiple entries when a pristine is removed/added due
> to high-water-mark cleanup or repair. That would be nothing less than horrible.
> Taking one step away from that, each working copy should have a dedicated
> table that stores size and mtime only once. Then we still face the situation
> that size and mtime are stored multiple times (once per working copy), and
> where, if a central pristine store is restructured, every working copy has
> to be updated. Bad idea.
> 
> Instead, we could not store size and mtime at all! :)

A big BUT is that we also need to store and send the MD5 checksum for
backwards compatibility with older servers/clients. So we'll definitely need
a database until 2.0, because of the MD5 compat alone.

We also currently have a 'compressed' flag stored, which allows optionally
compressing pristines. I think it's debatable whether that is really useful.
The pristine store should be *fast* and, ideally, random-access-able. Opening
a decompression stream works against that; it optimises for disk space, and
that's inherently not what the pristine store is for. I'd lose it.

~Neels

> 
> They are merely half-checks for validity. During normal operation, size and
> mtime should never change, because we don't open write streams to pristines.
> If anyone messes with the pristine store accidentally, we would pick it up
> with the size, or if that stayed the same, with the mtime. But we can pick
> up all cases of bitswaps/disk failure *only* by verifying *full checksum
> validity*!
> 
> So, while checking size and mtime gives a sense of basic sanity, it is
> really just a puny excuse for not checking full checksum validity. If we
> really care about correctness of pristines, *every* read of a pristine
> should verify the checksum along the way. (That would mean always reading
> the complete pristine, even if just a few lines in the middle are needed.)
> 
> * neels dreams of disks that hardware-checksum on-the-fly
> 
> If I further follow my dream of us emulating such hardware, we would store
> checksums for sub-chunks of each pristine, so that we can read small
> sections of pristines, being sure that the given section is correct without
> having to read the whole pristine.
> 
> Whoa, look where you got me now! ;)
> 
> I think it's a very valid question. Chuck the mtime and size, thus get rid
> of the PRISTINE table, thus do away with checking for any inconsistency
> between table and file system, also do away with possible database
> bottlenecks, and reduce the location of the pristine store to a mere local
> abspath. We have the checksum, we have the filename. Checking mtime and
> length protects against accidental editing of the pristine files. But any
> malicious or hw-failure corruption is in fact *concealed* by mtime and
> length staying intact! ("hey, we checked it, it must be correct.")
> 
> Let's play through a corrupted pristine (with unchanged mtime/length). This
> is just theoretical...
> 
> Commit modification:
> 
> - User makes a checkout / revert / update that uses a locally
>   corrupted pristine. The corrupted pristine thus sits in the WC.
> 
> - User makes a text mod
> 
> - User commits
> 
> - Client/network layer communicate the *delta* between the local pristine
>   and the local mod to the repository, and the checksum of the modified
>   text.
> 
> - Repos applies the delta to the intact pristine it has in *its* store.
> 
> - Repos finds the resulting checksum to be *different* from the client's
>   checksum, because the underlying pristine was corrupt.
> 
> --> Yay! No need to do *ANY* local verification at all!!
> 
> Of course, in case the client/network layer decide to send the full text
> instead of a delta, the corruption is no longer detected. :(
> 
> 
> Merge and commit:
> 
> - User makes a merge that uses a locally corrupted pristine.
> 
> - The merge *delta* applied to the working copy is incorrect.
> 
> - User does not note the corruption (e.g. via --accept=mine-full)
> 
> - User commits
> 
> - Repos accepts the changes based on the corrupted pristine that was
>   used to get the merge delta, because it can't tell the difference
>   from a normal modification.
> 
> --> My goodness, merge needs to check pristine validity on each read,
>     as if it wasn't slow enough. But as discussed above, even if merge
>     checked mtime and length, it would not necessarily detect disk failure
>     and crafted malicious corruption.
> 
> 
> Thanks, Philip.
> 
> I'm now challenging the need to store mtime and length, suggesting a need to do
> more checksumming instead. The checksumming overhead could be smaller than
> the slowdown caused by a database bottleneck.
> 
> For future optimisation, I'm also suggesting pristines should have
> additionally stored checksums for small chunks of each pristine, while still
> being indexed by the full checksum.
> (Which may imply a db again :/ , but that db would only be hit if we're
> trying to save time by reading just a small bit of the pristine)
> 
> Everyone, please prove me wrong!
> 
> Thanks,
> ~Neels
> 


Re: pristine store database -- was: pristine store design

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Feb 16, 2010, at 10:52 PM, Neels J Hofmeyr wrote:

> Greg Stein wrote:
>> On Tue, Feb 16, 2010 at 16:22, Johan Corveleyn <jc...@gmail.com> wrote:
>>> OTOH, I guess a 0 bytes pristine has to be handled specially anyway
>>> (what's the checksum of 0 bytes), no?
> 
> SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
> MD5: d41d8cd98f00b204e9800998ecf8427e
> 
> ;)

And we even special case these values in libsvn_subr:

static const unsigned char svn_sha1__empty_string_digest_array[] = {
  0xda, 0x39, 0xa3, 0xee, 0x5e, 0x6b, 0x4b, 0x0d, 0x32, 0x55,
  0xbf, 0xef, 0x95, 0x60, 0x18, 0x90, 0xaf, 0xd8, 0x07, 0x09
};

static const unsigned char svn_md5__empty_string_digest_array[] = {
  212, 29, 140, 217, 143, 0, 178, 4, 233, 128, 9, 152, 236, 248, 66, 126
};
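For the record, those constants can be cross-checked against any stock hash implementation, e.g. Python's hashlib:

```python
import hashlib

# Empty-input digests, matching the arrays quoted above.
assert hashlib.sha1(b"").hexdigest() == "da39a3ee5e6b4b0d3255bfef95601890afd80709"
assert hashlib.md5(b"").hexdigest() == "d41d8cd98f00b204e9800998ecf8427e"
```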

-Hyrum

Re: pristine store database -- was: pristine store design

Posted by Neels J Hofmeyr <ne...@elego.de>.
Greg Stein wrote:
> On Tue, Feb 16, 2010 at 16:22, Johan Corveleyn <jc...@gmail.com> wrote:
>> OTOH, I guess a 0 bytes pristine has to be handled specially anyway
>> (what's the checksum of 0 bytes), no?

SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
MD5: d41d8cd98f00b204e9800998ecf8427e

;)

> 
> Oops. You're right. I meant a null column value, as I documented in
> the schema file.

It *is* kind of useless to store an empty pristine :)
But given that empty files don't happen often in version control, it could be
just as silly to special-case it at all.

~Neels


Re: pristine store database -- was: pristine store design

Posted by Greg Stein <gs...@gmail.com>.
On Tue, Feb 16, 2010 at 16:22, Johan Corveleyn <jc...@gmail.com> wrote:
> On Tue, Feb 16, 2010 at 8:15 PM, Greg Stein <gs...@gmail.com> wrote:
>>>> Instead, we could not store size and mtime at all! :)
>>>
>>> Or we could store both to perform simple consistency checks...
>>
>> Dunno about that, but the storage of SIZE is part of the (intended)
>> algorithm for pristine storage. It is allowed to have a row in
>> PRISTINE with SIZE==0 in order to say "I know about this pristine, and
>> this row is present to satisfy integrity constraints with other
>> tables, but the pristine has NOT been written into the store." Once
>> the file *is* written, then the resulting size is stored into the
>> table.
>
> Shouldn't the marker value to indicate "pristine has NOT been written
> into the store" be something like -1 instead of 0? Just taking into
> account that there might be files that really have a size of 0 bytes.
> These should be supported, shouldn't they?
>
> OTOH, I guess a 0 bytes pristine has to be handled specially anyway
> (what's the checksum of 0 bytes), no?

Oops. You're right. I meant a null column value, as I documented in
the schema file.

Cheers,
-g

Re: pristine store database -- was: pristine store design

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Feb 16, 2010 at 8:15 PM, Greg Stein <gs...@gmail.com> wrote:
>>> Instead, we could not store size and mtime at all! :)
>>
>> Or we could store both to perform simple consistency checks...
>
> Dunno about that, but the storage of SIZE is part of the (intended)
> algorithm for pristine storage. It is allowed to have a row in
> PRISTINE with SIZE==0 in order to say "I know about this pristine, and
> this row is present to satisfy integrity constraints with other
> tables, but the pristine has NOT been written into the store." Once
> the file *is* written, then the resulting size is stored into the
> table.

Shouldn't the marker value to indicate "pristine has NOT been written
into the store" be something like -1 instead of 0? Just taking into
account that there might be files that really have a size of 0 bytes.
These should be supported, shouldn't they?

OTOH, I guess a 0 bytes pristine has to be handled specially anyway
(what's the checksum of 0 bytes), no?

Johan

Re: pristine store database -- was: pristine store design

Posted by Greg Stein <gs...@gmail.com>.
Meta-comment: all of these issues that you're bringing up are
*exactly* why I wanted to punt the issue of external-to-WC stores out
of 1.7. Keeping a 1:1 correspondence of pristine stores to working
copies keeps the problem tractable, especially given all the other
work that is needed.

On Tue, Feb 16, 2010 at 09:58, Bert Huijben <be...@qqmail.nl> wrote:
>...
> The idea was not to move just the pristine store, but also wc.db to a
> central location.

Yup. There were also some comments about separable locations for the
wc.db: one where you keep your metadata (e.g. home dir), and one for
the pristines (e.g. /var/svn). I never liked that idea, but reference
it for completeness' sake.

As Bert noted in his response, the schema is designed to manage
multiple working copies.

>...
>> Instead, we could not store size and mtime at all! :)
>
> Or we could store both to perform simple consistency checks...

Dunno about that, but the storage of SIZE is part of the (intended)
algorithm for pristine storage. It is allowed to have a row in
PRISTINE with SIZE==0 in order to say "I know about this pristine, and
this row is present to satisfy integrity constraints with other
tables, but the pristine has NOT been written into the store." Once
the file *is* written, then the resulting size is stored into the
table.
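A minimal sketch of that not-yet-written marker, using a NULL size per the correction elsewhere in this thread (the column set here is illustrative, not the real wc.db schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pristine ("
    "  checksum TEXT PRIMARY KEY,"
    "  size INTEGER,"  # NULL: row exists, but file not yet in the store
    "  refcount INTEGER NOT NULL DEFAULT 0)")

sha1 = "da39a3ee5e6b4b0d3255bfef95601890afd80709"
# Row present to satisfy integrity constraints; content not written yet:
conn.execute("INSERT INTO pristine (checksum) VALUES (?)", (sha1,))
pending = conn.execute(
    "SELECT checksum FROM pristine WHERE size IS NULL").fetchall()
# Once the file *is* written, record its actual size:
conn.execute("UPDATE pristine SET size = ? WHERE checksum = ?", (1024, sha1))
```

A NULL marker avoids colliding with genuinely zero-byte files.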

> A query over checksum from BASE_NODE and WORKING_NODE + older_checksum,
> left_checksum and right_checksum from ACTUAL_NODE would give a list of all
> PRISTINE records that are in-use.
> The rest can safely be deleted if diskspace is required.

Yes. That was the design goal.

It gets more complicated when you have a centralized wc.db and one or
more of the working copies are offline (e.g. removable storage,
network unavailable, etc). Again, these questions are why a
centralized concept has been punted for this generation.

>...
>> So, while checking size and mtime gives a sense of basic sanity, it is
>> really just a puny excuse for not checking full checksum validity. If we
>> really care about correctness of pristines, *every* read of a pristine
>> should verify the checksum along the way. (That would include to always

Since the pristine design returns a *stream* on the pristine, then we
can always insert a checksumming stream in order to verify the
contents. I believe the bottleneck will be I/O, so performing a
checksum should be Just Fine. If the stream is read to completion,
then we can validate the checksum (and don't worry about partial
reads; that isn't all that common, I believe).

Note that we'd only have to insert a single checksum stream, not SHA1 *and* MD5.
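A toy sketch of such a checksumming stream wrapper (class and attribute names are invented for illustration): partial reads skip verification, while a read to completion validates against the expected digest.

```python
import hashlib

class ChecksumStream:
    """Wrap a readable binary stream; verify SHA-1 on full read."""

    def __init__(self, raw, expected_sha1):
        self._raw = raw
        self._expected = expected_sha1
        self._hash = hashlib.sha1()
        self.verified = None  # None until the stream is exhausted

    def read(self, n=-1):
        data = self._raw.read(n)
        self._hash.update(data)
        if n == -1 or data == b"":  # stream exhausted: validate
            self.verified = self._hash.hexdigest() == self._expected
        return data
```

The caller pays only the hashing CPU cost on top of I/O it was doing anyway.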

Also, since I/O is the (hypothetical) bottleneck, this is also why
compression is handy.

>...
> This is cheap to detect. If we have to read the file anyway, we can just
> recalculate the hash via the stream hash apis. The disk IO for reading the
> file is the real pain with current CPUs. (At least for SHA1/MD5. Sha256 is
> still painful in my tests)
>...
> You would probably see our binary diff fail before you get here and if the
> RA layer provides a hash for the result or input file you can use that for
> verification on the streams. (See the update editor for some examples where
> we do that in one pass since 1.6)

Right, and right.

>...

Cheers,
-g

RE: pristine store database -- was: pristine store design

Posted by Bert Huijben <be...@qqmail.nl>.
> -----Original Message-----
> From: Neels J Hofmeyr [mailto:neels@elego.de]
> Sent: Tuesday, 16 February 2010 14:55
> To: dev@subversion.apache.org
> Cc: Philip Martin; Greg Stein; Hyrum Wright
> Subject: pristine store database -- was: pristine store design
> 
> Philip Martin wrote:
> > Neels J Hofmeyr <ne...@elego.de> writes:
> >
> >> THE PRISTINE STORE
> >> ==================
> >>
> >> The pristine store is a local cache of complete content of files that are
> >> known to be in the repository. It is hashed by a checksum of that content
> >> (SHA1).
> >
> > I'm not sure whether you are planning one table per pristine store or
> > one table per working copy, but I think it's one per pristine store.
> > Obviously it makes no difference until pristine stores can be
> > shared (and it might be one per directory in the short term, depending
> > on when we stop being one database per directory).
> 
> Thanks for that. This is the tip of an iceberg called 'a pristine store does
> not equal a working copy [root]'.
> 
> The question is how to store the PRISTINE table (also see below) once it
> serves various working copies. Will we have a separate SQLite db store, and
> create a new file system entity called 'pristine store' that the user can
> place anywhere, like a working copy?

The idea was not to move just the pristine store, but also wc.db to a
central location.

In this case the .svn directory will just have a 'look there' marker,
indicating where the wc database is. (Another option would be to just write
the location in the subversion configuration, but this would make it very
hard to detect if a directory is really a working copy or just a normal
directory)

The wc.db database schema (as designed by Greg) was designed to handle
multiple working copies. (All wc related tables have a wc_id column,
indicating which working copy the record applies to).


This week some other mail talked about the performance characteristics of a
central database, but I don't think this is really relevant here.

The normal situation is that there is only one (single-threaded) operation
performing changes on a working copy at a time (and possibly multiple
readers). In this situation the performance of one open database, with the
SQLite and filesystem caching support for an exclusively opened file, should
be much better than deleting and recreating our own database files all over
the filesystem. The old working copy used an exclusive per-directory write
lock to the same effect.

<snip>

> While we could store size&mtime in the BASE/WORKING tables, this causes
> size and mtime to be stored multiple times (wherever a pristine is
> referenced) and involves editing multiple entries when a pristine is
> removed/added due to high-water-mark or repair. That would be nothing
> less than horrible.
> Taking one step away from that, each working copy should have a dedicated
> table that stores size and mtime only once. Then we still face the
> situation that size and mtime are stored multiple times (once per working
> copy), and where, if a central pristine store is restructured, every
> working copy has to be updated. Bad idea.

The size&mtime in BASE_NODE and WORKING_NODE don't relate to pristine data,
but to the in-WC files. If a file's date and size haven't changed, 'svn
status' sees the file as unmodified.

Note that the PRISTINE table currently doesn't have a mtime column. It does
have SIZE, MD5 and COMPRESSION columns, allowing us to store the MD5 hash
for communicating over editor-v1.0 (needed for Subversion 1.0-1.7
compatibility).

> Instead, we could not store size and mtime at all! :)

Or we could store both to perform simple consistency checks...

A query over checksum from BASE_NODE and WORKING_NODE, plus older_checksum,
left_checksum and right_checksum from ACTUAL_NODE, would give a list of all
PRISTINE records that are in use.
The rest can safely be deleted if disk space is required.
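The query sketched above could look roughly like this. The tables are reduced to stand-ins with just the checksum columns named in this mail; the real WC-NG tables have many more columns, so treat this as an assumption-laden illustration:

```python
# Sketch of the "which pristines are still referenced?" query described
# above, using simplified stand-in tables. Column names follow the mail
# (BASE_NODE/WORKING_NODE checksum; ACTUAL_NODE older/left/right
# checksums; PRISTINE).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BASE_NODE    (checksum TEXT);
    CREATE TABLE WORKING_NODE (checksum TEXT);
    CREATE TABLE ACTUAL_NODE  (older_checksum TEXT,
                               left_checksum TEXT,
                               right_checksum TEXT);
    CREATE TABLE PRISTINE     (checksum TEXT PRIMARY KEY, size INTEGER);
""")
conn.execute("INSERT INTO BASE_NODE VALUES ('sha1$a')")
conn.execute("INSERT INTO ACTUAL_NODE VALUES ('sha1$a', 'sha1$b', NULL)")
conn.execute("INSERT INTO PRISTINE VALUES ('sha1$a', 10)")
conn.execute("INSERT INTO PRISTINE VALUES ('sha1$b', 20)")
conn.execute("INSERT INTO PRISTINE VALUES ('sha1$c', 30)")  # unreferenced

# Anything not referenced from any checksum column may be deleted when
# disk space is needed.  NULLs are filtered so NOT IN behaves sanely.
deletable = conn.execute("""
    SELECT checksum FROM PRISTINE
    WHERE checksum NOT IN (
        SELECT checksum FROM BASE_NODE WHERE checksum IS NOT NULL
        UNION SELECT checksum FROM WORKING_NODE WHERE checksum IS NOT NULL
        UNION SELECT older_checksum FROM ACTUAL_NODE
              WHERE older_checksum IS NOT NULL
        UNION SELECT left_checksum FROM ACTUAL_NODE
              WHERE left_checksum IS NOT NULL
        UNION SELECT right_checksum FROM ACTUAL_NODE
              WHERE right_checksum IS NOT NULL)
""").fetchall()
print(deletable)  # [('sha1$c',)]
```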

> They are merely half-checks for validity. During normal operation, size
> and mtime should never change, because we don't open write streams to
> pristines.
> If anyone messes with the pristine store accidentally, we would pick it
> up with the size, or if that stayed the same, with the mtime. But we can
> pick up all cases of bitswaps/disk failure *only* by verifying *full
> checksum validity*!

Good luck verifying 20 files with a total of 32 GB of data over a LAN :)

Well, working over a LAN is not a design requirement for WC-NG, but a lot of
our users use NFS or CIFS... And checking via fstat is a lot cheaper than
reading all the data.

> So, while checking size and mtime gives a sense of basic sanity, it is
> really just a puny excuse for not checking full checksum validity. If we
> really care about correctness of pristines, *every* read of a pristine
> should verify the checksum along the way. (That would include always
> reading the complete pristine, even if just a few lines in the middle
> are needed)
> 
> * neels dreams of disks that hardware-checksum on-the-fly
> 
> If I further follow my dream of us emulating such hardware, we would store
> checksums for sub-chunks of each pristine, so that we can read small
> sections of pristines, being sure that the given section is correct
> without having to read the whole pristine.
> 
> Whoa, look where you got me now! ;)
> 
> I think it's a very valid question. Chuck the mtime and size, thus get rid
> of the PRISTINE table, thus do away with checking for any inconsistency
> between table and file system, also do away with possible database
> bottlenecks, and reduce the location of the pristine store to a mere local
> abspath. We have the checksum, we have the filename. Checking mtime and
> length protects against accidental editing of the pristine files. But any
> malicious or hw-failure corruption can in fact be *protected* by keeping
> mtime and length intact! ("hey, we checked it, it must be correct.")

That would still leave MD5. Having a checksum before processing the file is
a common use-case in editor-1.0, and calculating a checksum over a really
large file and then reading it again for the operation is not cheap.

Another reason to keep this record around is supporting compression. In that
case we can no longer get the size from a cheap fstat.
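The sub-chunk checksum idea from the quoted text above could be sketched roughly as follows. This is purely hypothetical (nothing like it exists in WC-NG); the chunk size and function names are arbitrary choices:

```python
# Hypothetical sketch of per-chunk checksums: hash each fixed-size chunk
# so a small range read can be verified without rehashing the whole
# pristine.
import hashlib

CHUNK = 64 * 1024  # 64 KiB chunks (an arbitrary choice)

def chunk_checksums(data):
    """Return a list of SHA-1 digests, one per CHUNK-sized slice."""
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def verify_range(data, sums, offset, length):
    """Verify only the chunks overlapping [offset, offset+length)."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    for i in range(first, last + 1):
        piece = data[i * CHUNK:(i + 1) * CHUNK]
        if hashlib.sha1(piece).hexdigest() != sums[i]:
            return False
    return True

pristine = b'x' * (3 * CHUNK)
sums = chunk_checksums(pristine)
# Reading 50 bytes from the middle only verifies the one chunk it touches.
print(verify_range(pristine, sums, CHUNK + 100, 50))  # True
```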

> Let's play through a corrupted pristine (with unchanged mtime/length).
> This is just theoretical...
> 
> Commit modification:
> 
> - User makes a checkout / revert / update that uses a locally
>   corrupted pristine. The corrupted pristine thus sits in the WC.
> 
> - User makes a text mod
> 
> - User commits
> 
> - Client/network layer communicate the *delta* between the local pristine
>   and the local mod to the repository, and the checksum of the modified
>   text.
> 
> - Repos applies the delta to the intact pristine it has in *its* store.
> 
> - Repos finds the resulting checksum to be *different* from the client's
>   checksum, because the underlying pristine was corrupt.
> 
> --> Yay! No need to do *ANY* local verification at all!!
> 
> Of course, in case the client/network layer decide to send the full text
> instead of a delta, the corruption is no longer detected. :(
> 
> 
> Merge and commit:
> 
> - User makes a merge that uses a locally corrupted pristine.

This is cheap to detect. If we have to read the file anyway, we can just
recalculate the hash via the stream hash APIs. The disk I/O for reading the
file is the real pain with current CPUs. (At least for SHA-1/MD5; SHA-256 is
still painful in my tests.)
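The recalculate-while-reading approach can be sketched as a reader wrapper that hashes everything passing through it. This is a Python stand-in for the idea behind the 1.6 stream checksum APIs, not the actual svn_stream_* interface:

```python
# Recompute the hash as a side effect of a read we do anyway: wrap the
# stream so the digest accumulates while the caller consumes the data.
import hashlib
import io

class ChecksummedReader:
    """Wrap a readable stream; accumulate a SHA-1 over everything read."""
    def __init__(self, stream):
        self._stream = stream
        self._hash = hashlib.sha1()

    def read(self, n=-1):
        data = self._stream.read(n)
        self._hash.update(data)
        return data

    def hexdigest(self):
        return self._hash.hexdigest()

content = b'pristine file content'
reader = ChecksummedReader(io.BytesIO(content))
while reader.read(5):        # consume the stream in small reads
    pass
# The digest comes "for free" with the read; compare it to the stored one.
print(reader.hexdigest() == hashlib.sha1(content).hexdigest())  # True
```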

> - The merge *delta* applied to the working copy is incorrect.
> 
> - User does not note the corruption (e.g. via --accept=mine-full)

You would probably see our binary diff fail before you get here, and if the
RA layer provides a hash for the result or input file, you can use that for
verification on the streams. (See the update editor for some examples where
we have done that in one pass since 1.6.)

> - User commits
> 
> - Repos accepts the changes based on the corrupted pristine that was
>   used to get the merge delta, because it can't tell the difference
>   from a normal modification.
> 
> --> My goodness, merge needs to check pristine validity on each read,
>     as if it wasn't slow enough. But as discussed above, even if merge
>     checked mtime and length, it would not necessarily detect disk failure
>     and crafted malicious corruption.

I don't know if we check .svn-base files for merges now. I know we do that
for 'svn update'. Before 1.6 introduced the stream checksum APIs, this was
an additional scan, but that is not necessary anymore.

> Thanks, Philip.
> 
> I'm now challenging the need to store mtime and length, and a need to do
> more checksumming instead. The checksumming overhead could be smaller
> than
> the database bottleneck slew.

I assume a normal wc.db is read via the filesystem and the hard disk's
read-ahead buffer for most small working copies. (If a hard drive spins, it
usually reads the next few sectors after a read operation into its internal
buffer if it has nothing better to do.)

As we usually look at WC file statuses at the same time, this file should be
pretty hot in the cache.
(And the strict locking of the db file by SQLite even allows caching
portions of this file over network connections.)

	Bert

> For future optimisation, I'm also suggesting pristines should have
> additionally stored checksums for small chunks of each pristine, while
> still being indexed by the full checksum.
> (Which may imply a db again :/ , but that db would only be hit if we're
> trying to save time by reading just a small bit of the pristine)
> 
> Everyone, please prove me wrong!
> 
> Thanks,
> ~Neels


Re: pristine store database -- was: pristine store design

Posted by Mark Mielke <ma...@mark.mielke.cc>.
On 02/16/2010 03:24 PM, Mark Mielke wrote:
> On 02/16/2010 08:54 AM, Neels J Hofmeyr wrote:
>> They are merely half-checks for validity. During normal operation, 
>> size and
>> mtime should never change, because we don't open write streams to 
>> pristines.
>> If anyone messes with the pristine store accidentally, we would pick 
>> it up
>> with the size, or if that stayed the same, with the mtime. But we can 
>> pick
>> up all cases of bitswaps/disk failure *only* by verifying *full checksum
>> validity*!
>>
>> So, while checking size and mtime gives a sense of basic sanity, it is
>> really just a puny excuse for not checking full checksum validity. If we
>> really care about correctness of pristines, *every* read of a pristine
>> should verify the checksum along the way. (That would include to 
>> always read
>> the complete pristine, even if just a few lines along the middle are 
>> needed)
>
> Checking size and mtime gives huge benefits over checking contents. 
> Size and mtime can be picked up with a single stat(), whereas a 
> checksum requires open()/read()/.../close(). The data for stat() is 
> usually stored in the inode which is read in either situation, and 
> often small enough to be easily cached. For large work spaces, 
> especially those with multi-Kbyte files, doing checksum tests on most 
> operations would result in unacceptable performance.
>
> I think it's fine to compare checksum on any files that are noticed to 
> have changed (size/mtime), but if the file looks unchanged, assuming 
> that it *is* unchanged, is a fine compromise for the performance gains.
>
> If you want a "--compare-checksum" option which does the full check 
> optionally - it might be of use to some people. I suspect most people 
> would avoid using it once they see how much more expensive it is...
>

I just realized perhaps you are talking about size/mtime of the pristine 
and not of the working copy. If so, ignore the above. I see no reason to 
check the sanity of the pristine during normal operation, presuming 
there is some sort of transactional model that guarantees that 
Subversion itself will not corrupt the pristine during normal operation 
in the case of an expected failure (control-C by user, network failure, 
...). :-)

Cheers,
mark

Re: pristine store database -- was: pristine store design

Posted by Mark Mielke <ma...@mark.mielke.cc>.
On 02/16/2010 08:54 AM, Neels J Hofmeyr wrote:
> They are merely half-checks for validity. During normal operation, size and
> mtime should never change, because we don't open write streams to pristines.
> If anyone messes with the pristine store accidentally, we would pick it up
> with the size, or if that stayed the same, with the mtime. But we can pick
> up all cases of bitswaps/disk failure *only* by verifying *full checksum
> validity*!
>
> So, while checking size and mtime gives a sense of basic sanity, it is
> really just a puny excuse for not checking full checksum validity. If we
> really care about correctness of pristines, *every* read of a pristine
> should verify the checksum along the way. (That would include to always read
> the complete pristine, even if just a few lines along the middle are needed)
>    

Checking size and mtime gives huge benefits over checking contents. Size 
and mtime can be picked up with a single stat(), whereas a checksum 
requires open()/read()/.../close(). The data for stat() is usually 
stored in the inode which is read in either situation, and often small 
enough to be easily cached. For large work spaces, especially those with 
multi-Kbyte files, doing checksum tests on most operations would result 
in unacceptable performance.

I think it's fine to compare checksum on any files that are noticed to 
have changed (size/mtime), but if the file looks unchanged, assuming 
that it *is* unchanged, is a fine compromise for the performance gains.
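The compromise described above can be sketched as follows. The recorded values and helper names are illustrative only, not Subversion's actual detection code:

```python
# Sketch: trust (size, mtime) as an "unchanged" signal and fall back to
# a full checksum only when they differ -- one stat() in the common case.
import hashlib
import os
import tempfile

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            h.update(block)
    return h.hexdigest()

def looks_unchanged(path, recorded_size, recorded_mtime):
    """A single stat() call instead of reading the whole file."""
    st = os.stat(path)
    return st.st_size == recorded_size and st.st_mtime == recorded_mtime

def is_modified(path, recorded_size, recorded_mtime, recorded_sha1):
    if looks_unchanged(path, recorded_size, recorded_mtime):
        return False                       # cheap path: assume unmodified
    return sha1_of(path) != recorded_sha1  # expensive path: re-hash

# Quick demonstration on a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, b'hello')
os.close(fd)
st = os.stat(path)
digest = sha1_of(path)
unchanged = is_modified(path, st.st_size, st.st_mtime, digest)
with open(path, 'ab') as f:
    f.write(b'!')                          # size changes -> full check runs
changed = is_modified(path, st.st_size, st.st_mtime, digest)
os.remove(path)
print(unchanged, changed)  # False True
```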

If you want a "--compare-checksum" option which does the full check 
optionally - it might be of use to some people. I suspect most people would 
avoid using it once they see how much more expensive it is...

Cheers,
mark