You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Greg Hudson <gh...@MIT.EDU> on 2004/05/30 22:12:25 UTC

RFC: Hash dumps in sorted order

I'd like to modify subversion/libsvn_subr/hash.c:hash_write() to dump
hashes in sorted key order.  There are two advantages here:

  * It makes the dumped representation of a hash deterministic.  Not
    exceptionally important for anything we care about right now, but
    it allows the dumped representation of a hash to be signed, for
    instance, or tested for validity.

  * It opens the door to, some day, writing code to search a seekable
    hash representation in O(lg(n)) time.  Obviously, we'd have to be
    very careful about what contexts we do that in, since we have a
    bunch of 1.0.x code running around writing unsorted hashes right
    now.

    (A possible application would be directory lookups in FSFS,
    although I have no plans to implement such a thing any time soon.)

The down side is that sorting takes O(n lg(n)) time, as opposed to the
O(n) time it takes to naively write out a hash.  I doubt there would
be any kind of visible performance difference, though; lg(n) is
effectively a constant, and a much lesser one than the I/O overhead of
writing out hashes.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by Greg Hudson <gh...@MIT.EDU>.
On Mon, 2004-05-31 at 22:56, Branko Čibej wrote:
> But not streamy...

We're essentially never streamy when it comes to directories, and
delta_dirs() is no exception (either in the old implementation or the
new one).  We read all the source and target directory entries into
hashes, iterate over the target hash, and look up corresponding entries
in the source hash.

So, you can chalk up streamy directory comparisons as a potential
benefit of the path lookup table (although simply writing out the
directories in sorted order might accomplish the same thing), but it's
irrelevant to producing sorted output.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: RFC: Hash dumps in sorted order

Posted by Branko Čibej <br...@xbc.nu>.
Greg Hudson wrote:

>On Mon, 2004-05-31 at 22:40, Branko Čibej wrote:
>  
>
>>Well, the easiest way to make the repository-side operations sorted is
>>by sorting the directories in the repo, and the easiest way to do that
>>is by putting them into a path-l^H^H^H^H^H^H er, sorted index... :-)
>>    
>>
>
>Sorry, not even a little bit.  Sorting the result of
>svn_fs_dir_entries() is the easiest part.
>  
>
But not streamy...

Aargh. We'll obviously go back and forth on this issue until somebody
(me?) actually implements the lookup table and comes up with numbers.
Let's drop it until then.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by Greg Hudson <gh...@MIT.EDU>.
On Mon, 2004-05-31 at 22:40, Branko Čibej wrote:
> Well, the easiest way to make the repository-side operations sorted is
> by sorting the directories in the repo, and the easiest way to do that
> is by putting them into a path-l^H^H^H^H^H^H er, sorted index... :-)

Sorry, not even a little bit.  Sorting the result of
svn_fs_dir_entries() is the easiest part.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: RFC: Hash dumps in sorted order

Posted by Branko Čibej <br...@xbc.nu>.
Julian Foad wrote:

> C. Michael Pilato wrote:
>
>> Greg Hudson <gh...@MIT.EDU> writes:
>>
>>> Seriously, I'm just trying to keep the door open for the future.
>>> Branko's "path look up table" thread is what got me thinking about it,
>>> but (as I mentioned in that thread) I don't see any immediate big
>>> performance wins coming from speeding up single-path directory
>>> lookups.
>>
>>
>> Actually, it should improve the diff-ability of dumpfiles (and working
>> copy admin areas, for debuggability purposes), at the very least.
>> +1 on predictable behavior. :-)
>
>
> Fantastic!  Methinks now is the time to chip in with my patch to make
> Subversion's user-visible operations proceed in sorted order.  In
> particular, I want "svn status" output to be sorted (without piping it
> through a "sort" filter, because that removes streaminess) so that I
> can read it and find things in it (by eye) more easily, and "svn diff"
> output to be sorted so I can meaningfully compare one patch file with
> another similar one.  A separate thread will blink into existence.

Well, the easiest way to make the repository-side operations sorted is
by sorting the directories in the repo, and the easiest way to do that
is by putting them into a path-l^H^H^H^H^H^H er, sorted index... :-)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by Julian Foad <ju...@btopenworld.com>.
C. Michael Pilato wrote:
> Greg Hudson <gh...@MIT.EDU> writes:
> 
>>Seriously, I'm just trying to keep the door open for the future. 
>>Branko's "path look up table" thread is what got me thinking about it,
>>but (as I mentioned in that thread) I don't see any immediate big
>>performance wins coming from speeding up single-path directory
>>lookups.
> 
> Actually, it should improve the diff-ability of dumpfiles (and working
> copy admin areas, for debuggability purposes), at the very least.
> +1 on predictable behavior. :-)

Fantastic!  Methinks now is the time to chip in with my patch to make Subversion's user-visible operations proceed in sorted order.  In particular, I want "svn status" output to be sorted (without piping it through a "sort" filter, because that removes streaminess) so that I can read it and find things in it (by eye) more easily, and "svn diff" output to be sorted so I can meaningfully compare one patch file with another similar one.  A separate thread will blink into existence.

- Julian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by "C. Michael Pilato" <cm...@collab.net>.
Greg Hudson <gh...@MIT.EDU> writes:

> Seriously, I'm just trying to keep the door open for the future. 
> Branko's "path look up table" thread is what got me thinking about it,
> but (as I mentioned in that thread) I don't see any immediate big
> performance wins coming from speeding up single-path directory
> lookups.

Actually, it should improve the diff-ability of dumpfiles (and working
copy admin areas, for debuggability purposes), at the very least.
+1 on predictable behavior. :-)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sun, 2004-05-30 at 22:38, Ben Collins-Sussman wrote:
> >     (A possible application would be directory lookups in FSFS,
> >     although I have no plans to implement such a thing any time soon.)
> 
> Aww, 'fess up Greg.  What itch are you *really* trying to scratch here? 
> This task is pretty obscure, not the kind of thing you do just because
> "it might be useful someday".  :-)

Seriously, I'm just trying to keep the door open for the future. 
Branko's "path look up table" thread is what got me thinking about it,
but (as I mentioned in that thread) I don't see any immediate big
performance wins coming from speeding up single-path directory lookups.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by Ben Collins-Sussman <su...@collab.net>.
On Sun, 2004-05-30 at 17:12, Greg Hudson wrote:
> I'd like to modify subversion/libsvn_subr/hash.c:hash_write() to dump
> hashes in sorted key order.  There are two advantages here:
> 
>   * It makes the dumped representation of a hash deterministic.  Not
>     exceptionally important for anything we care about right now, but
>     it allows the dumped representation of a hash to be signed, for
>     instance, or tested for validity.
> 
>   * It opens the door to, some day, writing code to search a seekable
>     hash representation in O(lg(n)) time.  Obviously, we'd have to be
>     very careful about what contexts we do that in, since we have a
>     bunch of 1.0.x code running around writing unsorted hashes right
>     now.
> 
>     (A possible application would be directory lookups in FSFS,
>     although I have no plans to implement such a thing any time soon.)

Aww, 'fess up Greg.  What itch are you *really* trying to scratch here? 
This task is pretty obscure, not the kind of thing you do just because
"it might be useful someday".  :-)





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RFC: Hash dumps in sorted order

Posted by kf...@collab.net.
+1 on this, and please note how diligently I resist commenting on any
of the spinoff threads this proposal produced :-).

-Karl

Greg Hudson <gh...@MIT.EDU> writes:
> I'd like to modify subversion/libsvn_subr/hash.c:hash_write() to dump
> hashes in sorted key order.  There are two advantages here:
> 
>   * It makes the dumped representation of a hash deterministic.  Not
>     exceptionally important for anything we care about right now, but
>     it allows the dumped representation of a hash to be signed, for
>     instance, or tested for validity.
> 
>   * It opens the door to, some day, writing code to search a seekable
>     hash representation in O(lg(n)) time.  Obviously, we'd have to be
>     very careful about what contexts we do that in, since we have a
>     bunch of 1.0.x code running around writing unsorted hashes right
>     now.
> 
>     (A possible application would be directory lookups in FSFS,
>     although I have no plans to implement such a thing any time soon.)
> 
> The down side is that sorting takes O(n lg(n)) time, as opposed to the
> O(n) time it takes to naively write out a hash.  I doubt there would
> be any kind of visible performance difference, though; lg(n) is
> effectively a constant, and a much lesser one than the I/O overhead of
> writing out hashes.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org