Posted to users@subversion.apache.org by David Anderson <da...@calixo.net> on 2006/05/06 18:58:14 UTC

[Reminder] Subversion a mentor for Google Summer of Code

Just a quick reminder, as it has been all over the internet for some
time.

Following last year's success, Google is hosting a 2006 edition of the
Summer of Code.  Quickly put, if you're a student and get selected,
you get paid over the summer to work on a specific task within one of
dozens of mentoring open source projects.  More information about the
specifics of the program is available at http://code.google.com/soc/ .

Like last year, Subversion is a mentoring organization within the SoC.
If you'd like to help further the development of Subversion, get paid,
and have fun doing so, then head over to the SoC webpage and apply!

We have compiled a list of tasks that we feel are suitable for the
timeframe of the SoC and interesting to us.  The list is on the
Subversion website, at
<http://subversion.tigris.org/project_tasks.html>.  This list is of
course not exhaustive, so if you have a really great idea that might
interest us, don't let our list stop you from applying.

The deadline for applying is 8th May.  Yes, that is soon, but it should
be enough time to decide what you want to apply for and write a short
proposal for it.  Remember, the application is meant to interest us in
the task you're offering to complete, and convince us that you are
able to complete the task you propose.

No need to have finished the task beforehand, or have the deepest
possible knowledge of the Subversion internals before starting.  Some
coding skills, along with a keen will to learn, should do the trick!

- Dave.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by ListMan <li...@burble.net>.

On May 8, 2006, at 1:33 AM, Sachin Garg wrote:

> I looked at bug ID 908, which asks that the local copy in text-base
> be stored compressed. I did a little digging around in the code and
> felt it shouldn't be very hard to implement, and it will at least
> make my life easier.
>
> I am not going through the Google Summer of Code thing (am no longer a
> student either :-) but would like to implement this feature (assuming
> someone hasn't already started working on this).
>
> I am a long time subversion user (on Windows, TortoiseSVN) but new to
> subversion code, so will need some guidance if you guys want me to
> work on this.
>
> Some quick questions:
>
> # Is libsvn_wc/ the only place where I will need to edit code, or do I
> need to look in other directories too? Which ones?
>
> # Do we already have a compression library (zlib?) linked into
> Subversion?
>
> # How much additional delay is this expected to cause during
> checkouts and commits? Should I use something lightweight like zlib, or
> will it be fine to use bzip2, which can give better compression but
> will be slower?
>



Could we leave the compression algorithm up to the user? Some people
may want smaller repositories; others (like me) may be more focused on
commit speed.



> # Do we want files in text-base to be always compressed, or do we want
> text-base compression to be optional?
>
> Bug no 525 (optional text-base storage) is slightly related; maybe I
> can come up with a design which will make it easier to implement 525
> too, like implementing text-base access as a layer which can have
> multiple implementations:
>
> 1. Direct file read
> 2. Read compressed file
> 3. Fetch from server
>
>
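
A rough sketch of the layering idea quoted above, written in Python rather
than Subversion's C. Every name here (TextBase, PlainTextBase,
CompressedTextBase, RemoteTextBase, the fetch callable) is hypothetical and
not part of libsvn_wc; zlib is used only because it is the lightweight
option the question mentions.

```python
import zlib
from abc import ABC, abstractmethod
from pathlib import Path


class TextBase(ABC):
    """Hypothetical interface: return the pristine contents of one file."""

    @abstractmethod
    def read(self) -> bytes:
        ...


class PlainTextBase(TextBase):
    """1. Direct file read."""

    def __init__(self, path: Path):
        self.path = path

    def read(self) -> bytes:
        return self.path.read_bytes()


class CompressedTextBase(TextBase):
    """2. Read compressed file (zlib chosen for illustration only)."""

    def __init__(self, path: Path):
        self.path = path

    @staticmethod
    def store(path: Path, contents: bytes) -> None:
        path.write_bytes(zlib.compress(contents))

    def read(self) -> bytes:
        return zlib.decompress(self.path.read_bytes())


class RemoteTextBase(TextBase):
    """3. Fetch from server; `fetch` is a stand-in for a real request."""

    def __init__(self, fetch):
        self.fetch = fetch

    def read(self) -> bytes:
        return self.fetch()
```

Callers such as diff or commit would only ever see read(), so the storage
choice could in principle be made per file without touching them.
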
> Another possible todo item (which runs in opposite direction from the
> above items :-)
>
> Just as SVN stores the text-base for local diffs, how about generalizing
> it to store N previous revisions and change log entries? Storing
> additional revisions shouldn't result in too much bloat, as we can
> probably store just the diffs and can make more operations local.
>
> Sachin Garg [India]
> www.sachingarg.com | www.c10n.info
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

Posted by "Ph. Marek" <ph...@bmlv.gv.at>.

On Monday 08 May 2006 19:18, Jonathan Gilbert wrote:
> I vaguely recall reading that rsync has in fact had one or two such
> collisions in its history (resulting in a corrupt copy of the file being
> synchronized)
AFAIK that happened because, for network bandwidth reasons, only 16- or
32-bit checksums were transmitted per 800-byte block, and for BIG files
(over 100 MB) there were so many blocks that they got collisions (in the
32-bit checksums!)

> , but they are extremely rare and don't stop most people from 
> using it. Still, back when I suggested an rsync-like algorithm for
> Subversion (for a completely different reason), one of the things I was
> told is that Subversion tries to take nothing for granted when it comes to
> data integrity, and that for that reason, my algorithm would be an unlikely
> addition even if I did finish it.
In FSVS (fsvs.tigris.org) I use such an algorithm. I use a rolling checksum,
and whenever I hit a "special" value (with a predefined number of zero bits)
I declare the block to be finished and compute an MD5 of it.
So I can stop checking the local text for modifications *without* checksumming
the whole (possibly big) file.
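
To make the idea concrete, here is a minimal Python sketch of
content-defined blocks. It is not the FSVS code: the toy rolling value,
the 12-bit mask and the minimum block size are all assumptions chosen for
illustration, and the function names are made up.

```python
import hashlib

MASK = (1 << 12) - 1   # assumed: cut when the low 12 bits are all zero
MIN_BLOCK = 64         # assumed: avoid degenerate tiny blocks


def split_blocks(data: bytes):
    """Yield blocks whose boundaries depend on content, not on offsets."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy rolling value: the low 12 bits depend only on the most
        # recent 12 bytes, because older bytes have been shifted past them.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        if i + 1 - start >= MIN_BLOCK and (rolling & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]


def block_digests(data: bytes):
    """MD5 of every block, in order."""
    return [hashlib.md5(block).hexdigest() for block in split_blocks(data)]


def first_changed_block(stored, data: bytes):
    """Index of the first block that differs from `stored`, else None.

    Blocks are produced lazily, so hashing stops at the first mismatch.
    """
    count = 0
    for block in split_blocks(data):
        if count >= len(stored) or hashlib.md5(block).hexdigest() != stored[count]:
            return count
        count += 1
    return None if count == len(stored) else count
```

Because a boundary is decided by the bytes just before it, inserting data
early in a file shifts later blocks without changing their contents, so
most stored digests can still be found again.
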

I believe that if MD5 is successfully used to check the integrity of whole
files (small, big, ...), then taking the MD5 of blocks of approximately
100 kB is no problem, either.

> If we replace the text-base with a bunch of block hashes, we will be
> opening the door (albeit only by the tiniest crack) for working copies to
> get undetectably (in the automated sense) corrupted. The only way to be 
> *absolutely* certain, assuming you trust TCP to move the data reliably
> (which we usually do), is to move one of the two versions to be compared to
> the other system so that they're both in the same place and can be directly
> compared, byte for byte.
>
> I should also point out that if you use a large block size like 32 KB for
> the text base, source code files will almost never find matching blocks,
> which will basically destroy the commit efficiency in that area. This is
> probably only an issue for dialup users, where transferring 32 KB instead
> of 300 bytes translates to real pain. :-)
But it's not a problem for a LAN.
And dialup users would trade hard disk space for network bandwidth, I
think :-)



Regards,

Phil


Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

Posted by Jonathan Gilbert <o2...@sneakemail.com>.

At 03:42 PM 08/05/2006 +0200, Peter N. Lundblad wrote:
[snip]
>Qi Fred writes:
[snip]
> >   cycle is feasible. The third one concerns the collision of message
> >   digest algorithms. There is a report that different contents give
> >   same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
> >   collisions have not been found in SHA-1 algorithm. Some
> >   investigations should be done to avoid collisions. I prefer to
> >   implement the third working model.
> > 
>I'm no expert in this area, but I'm pretty sure the collisions concern
>the cryptographic uses of MD5, so I don't think we need to worry about
>that.  Others may want to comment here.

At 09:00 AM 08/05/2006 -0700, Ron wrote:
>I wish I could remember the link, but I read about using the MD5 of the 
>file forward and then the MD5 of the file backwards, producing 2 MD5 
>values and that along with the size of the data produced a chance of 
>collision so small to be (almost) impossible.  Subversion could also add 
>modify time to increase this even more.
>
>Maybe (almost) impossible isn't good enough, but it was 1 in the 
>billions of trillions if I remember correctly.  I am far from an expert 
>in this area, so this may be common knowledge/debunked.
>
>I would love to see subversion store some kind of hash rather than the 
>full file.  I work on projects with many many gigabytes of binary data 
>and hate to have my entire project stored twice.

The goal of a good hashing algorithm is similar to that of a good random
number generator: to evenly cover the digest space. Two similar messages
which differ even only by a single bit should have as wildly divergent hash
values as possible.

The problem here is that the digest space is *much* smaller than the
message space. If you have a 32 KB message -- that's 262144 bits -- and use
128-bit digests, then on average, you'll have 2 ^ (262144 - 128) (roughly
4.73526e+78874 -- thousands of orders of magnitude greater than the number
of quark particles in the entire universe) unique messages with a given
digest!
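
The figure can be checked with a few lines of Python. This is only the
arithmetic from the paragraph above; the variable names are arbitrary, and
the count is handled through its base-10 logarithm because the number
itself is far too large for a float.

```python
import math

MESSAGE_BITS = 32 * 1024 * 8   # a 32 KB message, as in the text
DIGEST_BITS = 128              # size of an MD5 digest

# Average number of messages per digest is 2**MESSAGE_BITS / 2**DIGEST_BITS.
excess_bits = MESSAGE_BITS - DIGEST_BITS

# Split log10(2**excess_bits) into an exponent and a mantissa.
log10_count = excess_bits * math.log10(2)
exponent = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent)

print(f"about {mantissa:.5f}e+{exponent} messages per digest")
```
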

The key thing, though, the reason tools like rsync work, is that while that
is certainly an immense number of messages with the same digest, very few
pairs of them (ideally *none*) are *similar*. In order to find a collision,
you have to basically start over from scratch with a message of complete
gobbledygook. This works in our favour, because typically a changed
block is only slightly different. This virtually guarantees that it will
have a unique hash value.

This does not, however, eliminate the *possibility* of a collision. The
more structured the data (e.g. source code, uncompressed image data,
machine code, etc.), the harder it is to collide, but when people add
highly-compressed data (which, by definition, is extremely entropic) to a
repository, the risk increases dramatically. As the similarity & structure
of the blocks is removed, the chance of finding another block with the same
hash increases. The only thing in your favour is the number of possible
hashes, which, while far less than the number of possible colliding blocks,
is still itself a very large number. With MD5, which produces 128 bits of
hash data, there are 2 ^ 128 possible different hashes. So, even with no
structure on a given block of data at all (data that is characteristically
similar to the output of a random number generator), if you compare it with
another block that is also highly entropic, you have only a 1 in 2 ^ 128
chance (1 in 3.40282367e+38) of it having a matching hash code.

I vaguely recall reading that rsync has in fact had one or two such
collisions in its history (resulting in a corrupt copy of the file being
synchronized), but they are extremely rare and don't stop most people from
using it. Still, back when I suggested an rsync-like algorithm for
Subversion (for a completely different reason), one of the things I was
told is that Subversion tries to take nothing for granted when it comes to
data integrity, and that for that reason, my algorithm would be an unlikely
addition even if I did finish it.

If we replace the text-base with a bunch of block hashes, we will be
opening the door (albeit only by the tiniest crack) for working copies to
get undetectably (in the automated sense) corrupted. The only way to be
*absolutely* certain, assuming you trust TCP to move the data reliably
(which we usually do), is to move one of the two versions to be compared to
the other system so that they're both in the same place and can be directly
compared, byte for byte.

I should also point out that if you use a large block size like 32 KB for
the text base, source code files will almost never find matching blocks,
which will basically destroy the commit efficiency in that area. This is
probably only an issue for dialup users, where transferring 32 KB instead
of 300 bytes translates to real pain. :-)
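
A small Python experiment illustrates the cost of fixed 32 KB blocks. The
input is synthetic (made-up numbered lines standing in for a source file),
and the helper names are hypothetical; the sketch only compares digests of
blocks at the same index, which is the simple aligned scheme under
discussion.

```python
import hashlib

BLOCK = 32 * 1024  # block size from the proposal


def fixed_block_digests(data: bytes, block_size: int):
    """MD5 digest of each fixed-size block, in order."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def matching_blocks(old: bytes, new: bytes, block_size: int) -> int:
    """Count blocks of `new` that equal the block at the same index in `old`."""
    pairs = zip(fixed_block_digests(old, block_size),
                fixed_block_digests(new, block_size))
    return sum(1 for a, b in pairs if a == b)


# Synthetic file: 20000 distinct 37-byte lines (740000 bytes, 23 blocks).
old = b"".join(b"line %06d of a made-up source file\n" % n
               for n in range(20000))
overwritten = old[:100] + b"X" + old[101:]   # change one byte in place
inserted = old[:100] + b"X" + old[100:]      # insert one byte

total = len(fixed_block_digests(old, BLOCK))
print("blocks:", total)
print("still matching after overwrite:", matching_blocks(old, overwritten, BLOCK))
print("still matching after insertion:", matching_blocks(old, inserted, BLOCK))
```

The in-place change costs one whole 32 KB block for a single byte, and the
insertion shifts every later block so that none of them match any more.
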

Jonathan Gilbert


Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by Sachin Garg <sc...@gmail.com>.

I am available to work on this if for some reason your proposal
doesn't get selected. If it does get selected, I will be happy to help
you with it.

Best of luck.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info

On 5/8/06, Qi Fred <fr...@gmail.com> wrote:
>
> I have submitted a proposal to Summer of Code 2006 on this task.
> The following is my proposal,
> -------------------------------------
>
>
>
> Name: Qi, Fei
> Email: fred.qi@gmail.com
> IM: fred.qi@gmail.com (gtalk)
> Language: Chinese, native;
>           English, fluent in reading, writing and speaking.
>
>
> * PROJECT TITLE
> ----------------------------------------------------------------------
> Compressed or optional text base storage in Subversion
> ----------------------------------------------------------------------
>
>
> * SUMMARY
>
>   In Subversion, difference comparison and delta generation are
>   performed off-line, based on the locally cached text bases. The text
>   bases of a working copy are the unmodified files of the base
>   revision. But such a design approximately doubles the storage space
>   needed on the client side. Two feasible ways of reducing the
>   storage are: (a) compress the text bases, and (b) disable caching
>   of text bases for some or all of the files in the working copy. My
>   proposal is to add a mechanism that combines the two solutions to
>   manage text bases.
>
>   The following features are planned to be implemented:
>
>   - By setting options in the runtime configuration files, users can
>     (a) switch between using original and compressed text bases, and
>     (b) enable or disable caching large binary files.
>
>   - By specifying a special property on a certain file, one of the
>     three caching mechanisms can be chosen: original, compressed, and
>     excluded (caching disabled). Note that the text bases can be
>     excluded on client side only if the file is a binary one.
>
>
> * DETAILS of PROJECT
>
>   Compressed or optional text base storage in Subversion has been
>   discussed for a long time in Subversion's development community:
>   - SoC description: http://subversion.tigris.org/project_tasks.html
>   - issue 525: http://subversion.tigris.org/issues/show_bug.cgi?id=525
>   - issue 908: http://subversion.tigris.org/issues/show_bug.cgi?id=908
>   These discussions provide the starting point for implementing this
>   proposal.
>
> ** Implementations of the Two Solutions
>
>   In my opinion, the two solutions have similar consequences but are
>   different in essence. Using compressed text bases does NOT
>   affect the working model of Subversion. It only adds the
>   runtime cost of compressing and/or decompressing
>   the text bases. Thus its implementation is fairly straightforward.
>   But disabling the caching of text bases changes the working model of
>   Subversion, because comparison (diff) and generation of deltas depend
>   directly on text bases.
>
>   If a file without a cached text base has been modified and is to
>   be committed, there are three (or more) potential working cycles:
>
>   1) abort and warn the user
>      - abort the commit process
>      - prompt the user to enable caching of the corresponding file
>      - enable caching by the user
>      - restart the commit process
>
>   2) temporarily download the base revision
>      - send a request for the base revision to the server
>      - temporarily download the base revision
>      - generate the deltas and commit the changes
>      - remove the base file since caching is disabled
>
>   3) make Subversion work without cached text bases
>      - split large binary files into small blocks, for example, 32KB
>      - store locally the very short message digests of all blocks
>      - detect changes by comparing digests of corresponding blocks
>      - send only the changed blocks to the server, or request and
>        download only the changed blocks to the client.
>      - generate deltas and commit changes (on server or client side).
>
>   All the above working cycles solve the problem introduced by
>   disabling the caching of text bases. The first one can be easily
>   implemented, but introduces inconvenient manual operations. The
>   latter two cycles require modifications on both the client and
>   server sides. The problem with the second one is the heavy
>   transmission load during a commit. Since the contents of large files
>   seldom change, the second cycle is feasible. The third one raises
>   the issue of collisions in message digest algorithms. There is a
>   report that different contents can give the same MD5 digest
>   (http://eprint.iacr.org/2004/199.pdf), but collisions have not been
>   found in the SHA-1 algorithm. Some investigation should be done to
>   avoid collisions. I prefer to implement the third working model.
>
>   According to these discussions, I suggest adding a section of
>   runtime configuration options and a special property to manage text
>   bases.
>
> ** Runtime Configurations for text-base Management
>
>   I suggest adding a new section, 'text-base', to the set of runtime
>   configuration options. This section provides options for text
>   base management on the client side:
>
>   - compressed: This is a binary option (yes/no). This instructs
>     Subversion client to cache compressed or original text bases. Set
>     this to 'yes' to enable caching text bases in compressed format.
>
>   - exclude-large-bins: This is a binary switch (yes/no). Set this
>     variable to 'yes' if the user wants Subversion to disable caching
>     of large binary files automatically. Whether a file is large or not
>     is determined by comparing its size with the threshold
>     specified by the variable 'exclusion-threshold'.
>
>   - exclusion-threshold: This option should be a positive number. Its
>     value determines whether a binary file is large enough to turn off
>     the caching of its corresponding text-base. The suggested default
>     value is 512KB.
>
>   - digest-block-size: This variable specifies the size of blocks the
>     binary files will be split into. This option should be a positive
>     number and its default value is suggested to be 32KB.
>
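
As a sketch of how a client might read the options proposed in this
section, the Python below parses an INI-style fragment. The section and
option names come from the proposal; the sample values, the helper names
(parse_size, load_text_base_options, keeps_text_base) and the choice of
"strictly larger than the threshold" are assumptions, and none of this
exists in Subversion.

```python
import configparser

# Hypothetical fragment of a runtime configuration file.
SAMPLE = """
[text-base]
compressed = yes
exclude-large-bins = yes
exclusion-threshold = 512KB
digest-block-size = 32KB
"""

UNITS = {"KB": 1024, "MB": 1024 ** 2}


def parse_size(text: str) -> int:
    """Turn '512KB' or '32768' into a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


def load_text_base_options(source: str) -> dict:
    """Read the proposed options, using the suggested defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(source)
    section = parser["text-base"]
    return {
        "compressed": section.getboolean("compressed", fallback=False),
        "exclude_large_bins": section.getboolean("exclude-large-bins",
                                                 fallback=False),
        "exclusion_threshold": parse_size(
            section.get("exclusion-threshold", fallback="512KB")),
        "digest_block_size": parse_size(
            section.get("digest-block-size", fallback="32KB")),
    }


def keeps_text_base(options: dict, size: int, is_binary: bool) -> bool:
    """Assumed rule: drop the text base only for large binary files."""
    too_large = size > options["exclusion_threshold"]
    return not (options["exclude_large_bins"] and is_binary and too_large)
```
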
> ** Special Property for text-base Management
>
>   A special property, 'svn:text-base', is suggested. This
>   property indicates the way Subversion stores the text base of the
>   corresponding file. Its value can be one of the following:
>
>   - original: This causes Subversion to store the corresponding text
>     base in its original format.
>
>   - compressed: This causes Subversion to store the text base in
>     compressed format.
>
>   - excluded: This causes Subversion to work without a cached text base.
>     This value is applicable only to binary files.
>
>
> * SCHEDULE
>
>   This summer, my main work is to finish my Ph.D dissertation.
>   According to my plan, I can work on this project (3~4 hours) * (4~5
>   days) per week. The following is my detailed schedule ('+' indicates
>   a milestone):
>
>   May 22:
>     - commence with project.
>   W01 (May 22 ~ May 28):
>     - communicate with mentors to confirm the proposal and goals
>     - read related codes and documents in Subversion
>   W02 (May 29 ~ Jun. 4):
>     - sketch the framework of text-base management
>     - prepare test cases
>     - implement the user interface
>   W03 (Jun. 5 ~ Jun. 11):
>     - implement the compressed IO based on svn_stream_compressed()
>     - add logging support
>   W04 (Jun. 12 ~ Jun. 18):
>     - implement compressed text bases support in checkout/update
>       commands
>   W05 (Jun. 19 ~ Jun. 25):
>     - implement compressed text bases support in commit/diff command
>  +W06 (Jun. 26 ~ Jul. 2): (Mid-program evaluations, Jun. 30)
>     - finish the compressed text bases management
>     - commence the working model without cached text bases
>   W07 (Jul. 3 ~ Jul. 9):
>     - function(s) for splitting files into blocks
>     - function(s) for generating message digests of blocks of files
>       (apr-util provides the MD4 and MD5 algorithm)
>   W08 (Jul. 10 ~ Jul. 16):
>     - comparison based on message digests of blocks
>     - support in checkout/update commands
>   W09 (Jul. 17 ~ Jul. 23):
>     - request blocks on client side
>     - receive blocks on client side
>   W10 (Jul. 24 ~ Jul. 30):
>     - send blocks on server side
>   W11 (Jul. 31 ~ Aug. 6):
>     - generation of deltas from blocks
>     - finish the commit command on client side
>  +WW (Aug. 7 ~ Aug. 21):
>     - finish the optional caching support
>     - write a final report
>     - pencil down
>
>
> * Experiences with Subversion and Programming
>
> ** Experiences with Subversion
>
>   I have been a user of Subversion for more than one and a half years.
>   Subversion is a great version control system which outperforms all
>   the ones I used before I entered the world of Subversion. I am very
>   familiar with the commands and configuration of Subversion.
>
>   I subscribed to the development mailing list and downloaded the
>   source code of Subversion when I heard of SoC 2006. I have read the
>   'Hacker's Guide to Subversion' and the documentation in some header
>   files.
>
> ** Experiences with Programming
>
>   I have been using C/C++ as my major development language for more
>   than eight years. Though most of my development work is done under
>   Windows, I have experience developing communication programs
>   under Unix/Linux.
>
>   I am a good team player. I have participated in several projects,
>   and three main projects are listed below (more details are available
>   on my resume web page):
>
>   - SportsPartner project: This project aims to track the players and
>     analyze their actions in sports (soccer) games. I am the team
>     leader and key algorithm developer.
>
>   - NightView project: This project aims to design and implement a
>     vision-based pedestrian detector to improve the safety of night
>     driving. I am a consultant on this research and development project.
>
>   - Microarray Image Analysis: This project aims to detect and
>     quantify the intensities of spots on scanned microarray images. My
>     task is to design and implement an algorithm to detect and
>     recognize the regular structures of grids on such images.
>
>
> * BIOGRAPHY
>
>   I received a B. Eng. from Northwestern Polytechnical University, Xi'an,
>   China, in July 2000. I am now a Ph.D candidate majoring in control
>   science and engineering at the Department of Automation, Tsinghua
>   University, Beijing, China. I expect to receive my Ph.D degree in
>   Jan. 2007.
>
>   My resume can be found at the following link addresses:
>   - HTML format: http://fred.qi.googlepages.com/resume.html
>   - PDF format: http://fred.qi.googlepages.com/cv-qf.pdf
>
>
> * OTHER PROJECTS in SoC 2006
>
>   I plan to apply for another one or two projects mentored by the Boost
>   organization. But I prefer to work on this project.
> -----
> Best regards,
> Fei Qi
>


Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by Qi Fred <fr...@gmail.com>.

What I described was the idea of my first proposal.
In the revised version, I have adopted your comments and suggestions,
including:
(1) treating binary and text files equally
(2) sending full text as deltas
(3) implementing rsync, if time permits.

On 5/15/06, Peter N. Lundblad <pe...@famlundblad.se> wrote:
>
> Qi Fred writes:
> > The reason for enabling only binary files to work without a
> > text-base is NOT that text files are small. The basic idea is that
> > text files have many special properties, such as svn:eol-style,
> > svn:keywords, etc., which need special consideration. Processing text
> > files would be more complex than processing binary ones to achieve
> > the same performance.
>
> I think you are mistaken here.  What do properties have to do with
> this?  We can't drop the propsbases, because they are needed for a lot
> of things.  And you can have keyword expansion enabled in binary files
> as well.
>
> > Furthermore, small files may waste a lot of disk space on certain
> > systems. Working without a text base is a charming feature.
> > In my mind, an rsync-like algorithm is a way to achieve this goal.
> > This is suggested in my revised proposal submitted to SoC.
>
> Yes, but as I've pointed out before, you don't *need* an rsync
> algorithm.
> It will just require more network bandwidth.
> Note that I'm opposed to a simplified "rsync-like" algorithm as you've
> proposed, that only works if blocks are exactly aligned, because we
> will be stuck with supporting that even when we have the real thing.
>
> Best Regards,
> //Peter
>


-- 
Best Regards,
Fred Qi

Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

Qi Fred writes:
 > The reason for enabling only binary files to work without a
 > text-base is NOT that text files are small. The basic idea is that
 > text files have many special properties, such as svn:eol-style,
 > svn:keywords, etc., which need special consideration. Processing text
 > files would be more complex than processing binary ones to achieve
 > the same performance.

I think you are mistaken here.  What do properties have to do with
this?  We can't drop the propsbases, because they are needed for a lot
of things.  And you can have keyword expansion enabled in binary files
as well.

 > Furthermore, small files may waste a lot of disk space on certain
 > systems. Working without a text base is a charming feature.
 > In my mind, an rsync-like algorithm is a way to achieve this goal.
 > This is suggested in my revised proposal submitted to SoC.

Yes, but as I've pointed out before, you don't *need* an rsync
algorithm.
It will just require more network bandwidth.
Note that I'm opposed to a simplified "rsync-like" algorithm as you've
proposed, that only works if blocks are exactly aligned, because we
will be stuck with supporting that even when we have the real thing.

Best Regards,
//Peter


Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by Qi Fred <fr...@gmail.com>.

The reason for enabling only binary files to work without a text-base
is NOT that text files are small. The basic idea is that text files have
many special properties, such as svn:eol-style, svn:keywords, etc.,
which need special consideration. Processing text files would be
more complex than processing binary ones to achieve the same performance.
Given enough time, this is NOT a problem. But SoC is only 3 months.

Furthermore, small files may waste a lot of disk space on certain
systems. Working without a text base is a charming feature.
In my mind, an rsync-like algorithm is a way to achieve this goal.
This is suggested in my revised proposal submitted to SoC.

On 5/14/06, Wesley J. Landaker <wj...@icecavern.net> wrote:
>
> On Monday 08 May 2006 03:09, Qi Fred wrote:
> >   - By specifying a special property on a certain file, one of the
> >     three caching mechanisms can be chosen: original, compressed, and
> >     excluded (caching disabled). Note that the text bases can be
> >     excluded on client side only if the file is a binary one.
>
> I routinely deal with large (> 100 MiB) "text" files (EDIF, XML, etc). I
> wouldn't limit this to binary files on the assumption that "binary" files
> are big and "text" files are small.
>
> --
> Wesley J. Landaker <wj...@icecavern.net> <xm...@icecavern.net>
> OpenPGP FP: 4135 2A3B 4726 ACC5 9094  0097 F0A9 8A4C 4CD6 E3D2
>
>
>

-- 
Best Regards,
Fred Qi

Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by "Wesley J. Landaker" <wj...@icecavern.net>.

On Monday 08 May 2006 03:09, Qi Fred wrote:
>   - By specifying a special property on a certain file, one of the
>     three caching mechanisms can be chosen: original, compressed, and
>     excluded (caching disabled). Note that the text bases can be
>     excluded on client side only if the file is a binary one.

I routinely deal with large (> 100 MiB) "text" files (EDIF, XML, etc). I 
wouldn't limit this to binary files on the assumption that "binary" files 
are big and "text" files are small.

-- 
Wesley J. Landaker <wj...@icecavern.net> <xm...@icecavern.net>
OpenPGP FP: 4135 2A3B 4726 ACC5 9094  0097 F0A9 8A4C 4CD6 E3D2

Re: Optional/compressed text bases

Posted by Peter Samuelson <pe...@p12n.org>.

[Ron]
> I wish I could remember the link, but I read about using the MD5 of
> the file forward and then the MD5 of the file backwards, producing 2
> MD5 values and that along with the size of the data produced a chance
> of collision so small to be (almost) impossible.  Subversion could
> also add modify time to increase this even more.

The chance of accidental MD5 collisions is already so small as to be
(almost) impossible.  The hash is 128 bits long, so you only have to
start to worry when the number of files starts to approach 2^64, or
roughly 18000000000000000000.  So 128-bit MD5 is already overkill, never
mind your proposed 256-bit variant.
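
To put rough numbers on that, here is a small sketch of the usual
birthday-bound approximation.  It only covers accidental collisions
between unrelated files and assumes the hash behaves like a uniformly
random function; the function name is made up for illustration.

```python
import math


def collision_probability(n_files: int, hash_bits: int = 128) -> float:
    """Birthday-bound estimate: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_files * (n_files - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


for n in (10**6, 10**9, 2**64):
    print(f"{n:.3e} files -> p = {collision_probability(n):.3e}")
```

Under this estimate the probability only becomes noticeable once the
number of files is on the order of 2^64, which matches the rule of
thumb above.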

Collisions can be produced _on purpose_, with a lot of CPU power.  But
that requires access to the repository or working copy, which means the
adversary is already trusted and you have already lost.

Re: Optional/compressed text bases

Posted by Ron <li...@rzweb.com>.

 >  >   All the above working cycles solve the problem introduced by disable
 >  >   caching text bases. The first one can be easily implemented, but
 >  >   introduces inconvenient manual operations. The latter two cycles
 >  >   require modifications in both the client and server sides. The
 >  >   problem of the second one is the heavy load of transmission during a
 >  >   commit. Since the contents of large files change seldom, the second
 >  >   cycle is feasible. The third one concerns the collision of message
 >  >   digest algorithms. There is a report that different contents give
 >  >   same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
 >  >   collisions have not been found in SHA-1 algorithm. Some
 >  >   investigations should be down to avoid collisions. I prefer to
 >  >   implement the third working model.
 >  >
 > I'm no expert in this area, but I'm pretty sure the collisions concern
 > the cryptographic uses of MD5, so I don't think we need to worry about
 > that.  Others may want to comment here.

I wish I could remember the link, but I read about using the MD5 of the
file forward and then the MD5 of the file backwards, producing 2 MD5
values, and that along with the size of the data produced a chance of
collision so small as to be (almost) impossible.  Subversion could also
add modify time to increase this even more.

Maybe (almost) impossible isn't good enough, but it was 1 in the
billions of trillions if I remember correctly.  I am far from an expert
in this area, so this may be common knowledge or already debunked.

I would love to see subversion store some kind of hash rather than the 
full file.  I work on projects with many many gigabytes of binary data 
and hate to have my entire project stored twice.

Ron



Peter N. Lundblad wrote:
> Hi,
> 
> As this was posted here, I reply on the list.  I know there are other
> applications for this as well.  I hope all applicants will be able to
> benefit from this information.  (Also, note that what I say may not be
> the consensus of the project - I'm only one member.)
> 
> In short, I think the proposal is a good starting point, but there are
> things needing more thought or reconsideration.
> 
> Qi Fred writes:
>  >   The following features are planned to be implemented:
>  > 
>  >   - By setting options in the runtime configuration files, users can
>  >     (a) switch between using original and compressed text bases, and
> 
> I assume these options will determine which method gets used when
> checking out?  Do you imagine the user being able to switch existing
> working copies?
> 
>  >     (b) enable or disable caching large binary files.
>  > 
>  >   - By specifying a special property on a certain file, one of the
>  >     three caching mechanisms can be chosen: original, compressed, and
>  >     excluded (caching disabled). Note that the text bases can be
>  >     excluded on client side only if the file is a binary one.
>  > 
> Do you propose to use versioned properties for this?  I'd say this
> should only be a client-side option.
> 
> Why limit optional text bases to binary files?  Many small files also
> take up much disk space on many filesystems.
> 
>  >   But disabling the caching of text bases changes the work model of
>  >   Subversion because comparison (diff) and generation of deltas depend
>  >   directly on text bases.
> 
> Note that you don't strictly need the text base to generate a text
> delta, it would just be a delta containing only new data, making
> effectively a compressed fulltext.  There is nothing saying that a
> delta sent to the server must be minimal.
> 
>  >   If a file without cached text base has been modified and intend to
>  >   be committed, there are three (or more) potential working cycles:
>  > 
>  >   1) abort and warn the user
> 
> That's not good.  This makes the feature pretty useless except for
> read-only working copies...
>  >   2) temporarily download the base revision
>  > 
> Could as well send a fulltext delta to the server.
> 
>  >   3) make Subversion work without cached text bases
>  >      - split large binary files into small blocks, for example, 32KB
>  >      - stores locally the very short message digests of all blocks
>  >      - detect changes by comparing digests of corresponding blocks
>  >      - send only the changed blocks to the server or request and
>  >        download only the changed blocks to the client.
>  >      - generate deltas and commit changes (on server or client side).
> 
> What happens when someone inserts one byte near the beginning of the
> file?  We need an rsync-like algorithm if we want to do this.  I think
> this is an optional optimization.  People will need to trade disk
> usage (storing text bases) versus network usage.
> 
>  >   All the above working cycles solve the problem introduced by disable
>  >   caching text bases. The first one can be easily implemented, but
>  >   introduces inconvenient manual operations. The latter two cycles
>  >   require modifications in both the client and server sides. The
>  >   problem of the second one is the heavy load of transmission during a
>  >   commit. Since the contents of large files change seldom, the second
>  >   cycle is feasible. The third one concerns the collision of message
>  >   digest algorithms. There is a report that different contents give
>  >   same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
>  >   collisions have not been found in SHA-1 algorithm. Some
>  >   investigations should be down to avoid collisions. I prefer to
>  >   implement the third working model.
>  > 
> I'm no expert in this area, but I'm pretty sure the collisions concern
> the cryptographic uses of MD5, so I don't think we need to worry about
> that.  Others may want to comment here.
> 
>  >   According to these discussions, I suggest to add a section of
>  >   runtime configuration options and a special property to manage text
>  >   bases.
>  > 
>  > ** Runtime Configurations for text-base Management
>  > 
>  >   I suggest to add a new section, 'text-base', to the set of options
>  >   of runtime configuration. This section provides options of text
>  >   bases management on the client side:
>  > 
>  >   - compressed: This is a binary option (yes/no). This instructs
>  >     Subversion client to cache compressed or original text bases. Set
>  >     this to 'yes' to enable caching text bases in compressed format.
>  > 
>  >   - exclude-large-bins: This is a binary switch (yes/no). Set this
>  >     variable to 'yes' if the user want Subversion to disable caching
>  >     large binary files automatically. Whether the file is large or not
>  >     is determined by comparing its size with a threshold that
>  >     specified by the variable 'exclusion-threshold'.
>  > 
>  >   - exclusion-threshold: This option should be a positive number. Its
>  >     value describes whether a binary file is large enough to turn off
>  >     the caching of its corresponding text-base. The suggested default
>  >     value is 512KB.
> 
> The two options above could be combined into one.  Please keep the
> number of user options low.
> 
>  >   - digest-block-size: This variable specifies the size of blocks the
>  >     binary files will be split into. This option should be a positive
>  >     number and its default value is suggested to be 32KB.
> 
> Drop this.  Who will know how to tweak this (uh, and the method
> doesn't work anyway:-)
> 
>  > ** Special Property for text-base Management
>  > 
>  >   A special property, 'svn:text-base', is suggested to be added. This
>  >   property indicates the way Subversion stores the text base of
>  >   corresponding file. Its value of can be one of the follows:
> 
> As I said above, this shouldn't be versioned.  You may need to extend
> the .svn/entries file, though.
> 
> 
> A problem with the user interface sketched is that there is no way to
> specify the textbase handling per working copy, but only per user.
> Say one repository is on your LAN and another is in China (I live in
> Sweden:-).
> 
> Regards,
> //Peter
> 


Re: Optional/compressed text bases

Posted by Marc Sherman <ms...@projectile.ca>.

Qi Fred wrote:
> 
>> A problem with the user interface sketched is that there is no way to
>> specify the textbase handling per working copy, but only per user.
>> Say one repository is on your LAN and another is in China (I live in
>> Sweden:-).
> 
> This is NOT the fact since there is an option --config-dir.

That seems incredibly awkward to me.  I'd much prefer to see this as a
switch to checkout, rather than a config option.

- Marc


Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

Qi Fred writes:
 > On 5/8/06, Peter N. Lundblad <pe...@famlundblad.se> wrote:
 > 
 > > I assume these options will determine which method gets used when
 > > checking out?  Do you imagine the user being able to switch existing
 > > working copies?
 > 
 > 
 > Sure, users may use an svn client that supports compressed text bases when
 > they create their initial checkouts, but others may not.  Switching means
 > the user can upgrade the client smoothly without redoing the checkout.
 > Another problem is that decompression can be very time consuming, and some
 > users would prefer to work with original text bases.  So a mechanism that
 > supports switching is necessary.

We don't need to worry about the UI details right now, but it seems clear
that you intend some extra command or something to manage this state.  Correct?

 > Thanks. This is very useful. I am not clear on how deltas are generated and
 > whether minimal deltas are used in commits. Do you mean that we need not
 > modify any server code if the client sends the full text as a delta in a
 > commit?

Exactly.  The delta is a series of instructions: copy from the delta
source, copy from the target generated thus far, and insert new data.
The server will take this, construct the fulltext and generate its own
delta, often based on another revision than the WC one.  I think
having to send the whole new text to the server if you choose to
eliminate text bases is an acceptable tradeoff, at least initially.
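
As an illustration only, the three kinds of instruction can be sketched
like this.  The tuple encoding and the names (apply_delta,
fulltext_delta, the op tags) are invented for the example and are not
the actual svndiff wire format.

```python
COPY_SOURCE, COPY_TARGET, NEW_DATA = "src", "tgt", "new"


def apply_delta(source: bytes, ops) -> bytes:
    """Rebuild the target text from a list of delta instructions."""
    target = bytearray()
    for op in ops:
        kind = op[0]
        if kind == COPY_SOURCE:
            _, offset, length = op
            target += source[offset:offset + length]
        elif kind == COPY_TARGET:
            _, offset, length = op
            # Copy byte by byte so a copy may overlap the bytes it produces.
            for i in range(length):
                target.append(target[offset + i])
        elif kind == NEW_DATA:
            target += op[1]
        else:
            raise ValueError(f"unknown instruction: {kind!r}")
    return bytes(target)


def fulltext_delta(new_text: bytes):
    """A delta that needs no source at all: one 'insert new data' op."""
    return [(NEW_DATA, new_text)]


base = b"hello world"
ops = [(COPY_SOURCE, 0, 6), (NEW_DATA, b"svn "), (COPY_TARGET, 0, 5)]
print(apply_delta(base, ops))
print(apply_delta(b"", fulltext_delta(b"whole new text")))
```

The second call is the case discussed here: without a text base the
client has no source to copy from, so the delta degenerates into the
new text itself.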

 > > >   If a file without cached text base has been modified and intend to
 > > >   be committed, there are three (or more) potential working cycles:
 > > >
 > > >   1) abort and warn the user
 > >
 > > That's not good.  This makes the feature pretty useless except for
 > > read-only working copies...
 > > >   2) temporarily download the base revision
 > > >
 > > Could as well send a fulltext delta to the server.
 > 
 > 
 > It would be better if the server accepted compressed deltas.

So, we have a 100 MB file.  Are you suggesting that downloading that
file, just to be able to upload a delta, is better than just uploading
the whole new text?  Or are you just suggesting that the new text
should be compressed?  If the latter, then that's already the case, so
you don't need to worry about that.

 > > >   3) make Subversion work without cached text bases
 > > >      - split large binary files into small blocks, for example, 32KB
 > > >      - stores locally the very short message digests of all blocks
 > > >      - detect changes by comparing digests of corresponding blocks
 > > >      - send only the changed blocks to the server or request and
 > > >        download only the changed blocks to the client.
 > > >      - generate deltas and commit changes (on server or client side).
 > >
 > > What happens when someone inserts one byte near the beginning of the
 > > file?  We need an rsync-like algorithm if we want to do this.  I think
 > > this is an optional optimization.  People will need to trade disk
 > > usage (storing text bases) versus network usage.
 > 
 > 
 > The average performance is better than that of the two previous suggestions.
 > Optimizing the worst case would be time consuming, and I am not sure whether
 > there is enough time within the Summer of Code limits.

Sorry.  I don't follow the above.  What I'm saying is that your
proposed algorithm won't work if (part of) the file is shifted away
from its original location, for example by inserting or removing some
bytes.  I think that's very common, so I don't think that's only the
worst case.

When I said "optional optimization", I meant that it is optional for
you to implement (or whoever gets to do it).  The feature works fine
without it.
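
To make the alignment problem concrete, here is a rough sketch of the
fixed-block scheme.  The helper names are made up, and the demo uses
1 KB blocks instead of the proposed 32KB only to keep it small.

```python
import hashlib
import random


def block_digests(data: bytes, block_size: int):
    """MD5 digest of every fixed-size, aligned block of data."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old_digests, new_data: bytes, block_size: int):
    """Indices of blocks whose digest differs from the stored one."""
    new_digests = block_digests(new_data, block_size)
    count = max(len(old_digests), len(new_digests))
    return [i for i in range(count)
            if i >= len(old_digests) or i >= len(new_digests)
            or old_digests[i] != new_digests[i]]


BLOCK = 1024
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(8 * BLOCK))
stored = block_digests(original, BLOCK)

# Overwriting one byte in place touches exactly one block.
edited = bytearray(original)
edited[5000] ^= 0xFF
in_place = changed_blocks(stored, bytes(edited), BLOCK)

# Inserting one byte at the front shifts every later block boundary.
shifted = changed_blocks(stored, b"\x00" + original, BLOCK)

print("in-place edit:", in_place)
print("one-byte insert:", shifted)
```

With the insert, every block is reported as changed, so the scheme
would send the whole file anyway while still paying for the digests.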

 > > >   All the above working cycles solve the problem introduced by disable
 > > >   caching text bases. The first one can be easily implemented, but
 > > >   introduces inconvenient manual operations. The latter two cycles
 > > >   require modifications in both the client and server sides. The
 > > >   problem of the second one is the heavy load of transmission during a
 > > >   commit. Since the contents of large files change seldom, the second
 > > >   cycle is feasible. The third one concerns the collision of message
 > > >   digest algorithms. There is a report that different contents give
 > > >   same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
 > > >   collisions have not been found in SHA-1 algorithm. Some
 > > >   investigations should be down to avoid collisions. I prefer to
 > > >   implement the third working model.
 > > >
 > > I'm no expert in this area, but I'm pretty sure the collisions concern
 > > the cryptographic uses of MD5, so I don't think we need to worry about
 > > that.  Others may want to comment here.
 > 
 > 
 > I would like to use the MD5 algorithm, but there is a risk that some files
 > are not correctly committed to the server.

There is another problem here, which is that the client may not detect
a modification if the original and the modified file happen to have
the same checksum.  I think this risk is so small that we shouldn't
worry about it; our current timestamp-based modification detection
heuristic actually does fail in practice...  And if this really happens,
the work-around would be to check out a working copy *with* text bases
and do the commit from there.  Note that this is *not* about data
corruption.
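
A minimal sketch of checksum-based modification detection without a
text base, assuming the client records one MD5 per file at checkout or
update time.  The function names and the demo file are hypothetical;
this is not how the working copy library is actually structured.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large files need little memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def is_modified(path: str, recorded_digest: str) -> bool:
    """True if the working file no longer matches the recorded checksum."""
    return file_digest(path) != recorded_digest


with tempfile.TemporaryDirectory() as wc:
    path = os.path.join(wc, "image.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * 4096)
    recorded = file_digest(path)   # what would be stored instead of a text base
    before_edit = is_modified(path, recorded)
    with open(path, "ab") as f:
        f.write(b"\x01")
    after_edit = is_modified(path, recorded)

print(before_edit, after_edit)
```

The missed-modification case described above is the one where
is_modified returns False for a file that did change, which requires
two different contents with the same digest.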

 > > >   A special property, 'svn:text-base', is suggested to be added. This
 > > >   property indicates the way Subversion stores the text base of
 > > >   corresponding file. Its value of can be one of the follows:
 > >
 > > As I said above, this shouldn't be versioned.  You may need to extend
 > > the .svn/entries file, though.
 > 
 > 
 > This suggestion is good.  Is there a user interface to access the
 > .svn/entries file in the current Subversion client? I think we need a new
 > command for users to access this file.

There's no interface to manipulate entries directly, and there
shouldn't be:-)  What we need is an interface to manipulate this
particular state, including fetching/removing/(un)compressing the text base.

Regards,
//Peter


Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

Posted by Qi Fred <fr...@gmail.com>.

On 5/8/06, Peter N. Lundblad <pe...@famlundblad.se> wrote:

> Hi,
>
> As this was posted here, I reply on the list.  I know there are other
> applications for this as well.  I hope all applicants will be able to
> benefit from this information.  (Also, note that what I say may not be
> the consensus of the project - I'm only one member.)
>
> In short, I think the proposal is a good starting point, but there are
> things needing more thought or reconsideration.
>
> Qi Fred writes:
> >   The following features are planned to be implemented:
> >
> >   - By setting options in the runtime configuration files, users can
> >     (a) switch between using original and compressed text bases, and
>
> I assume these options will determine which method gets used when
> checking out?  Do you imagine the user being able to switch existing
> working copies?


Sure, users may use an svn client that supports compressed text bases when
they create their initial checkouts, but others may not.  Switching means
the user can upgrade the client smoothly without redoing the checkout.
Another problem is that decompression can be very time consuming, and some
users would prefer to work with original text bases.  So a mechanism that
supports switching is necessary.


> >     (b) enable or disable caching large binary files.
> >
> >   - By specifying a special property on a certain file, one of the
> >     three caching mechanisms can be chosen: original, compressed, and
> >     excluded (caching disabled). Note that the text bases can be
> >     excluded on client side only if the file is a binary one.
> >
> Do you propose to use versioned properties for this?  I'd say this
> should only be a client-side option.
>
> Why limit optional text bases to binary files?  Many small files also
> take up much disk space on many filesystems.
>
> >   But disabling the caching of text bases changes the work model of
> >   Subversion because comparison (diff) and generation of deltas depend
> >   directly on text bases.
>
> Note that you don't strictly need the text base to generate a text
> delta, it would just be a delta containing only new data, making
> effectively a compressed fulltext.  There is nothing saying that a
> delta sent to the server must be minimal.


Thanks. This is very useful. I am not clear on how deltas are generated and
whether minimal deltas are used in commits. Do you mean that we need not
modify any server code if the client sends the full text as a delta in a
commit?



> >   If a file without cached text base has been modified and intend to
> >   be committed, there are three (or more) potential working cycles:
> >
> >   1) abort and warn the user
>
> That's not good.  This makes the feature pretty useless except for
> read-only working copies...
> >   2) temporarily download the base revision
> >
> Could as well send a fulltext delta to the server.


It would be better if the server accepted compressed deltas.


> >   3) make Subversion work without cached text bases
> >      - split large binary files into small blocks, for example, 32KB
> >      - stores locally the very short message digests of all blocks
> >      - detect changes by comparing digests of corresponding blocks
> >      - send only the changed blocks to the server or request and
> >        download only the changed blocks to the client.
> >      - generate deltas and commit changes (on server or client side).
>
> What happens when someone inserts one byte near the beginning of the
> file?  We need an rsync-like algorithm if we want to do this.  I think
> this is an optional optimization.  People will need to trade disk
> usage (storing text bases) versus network usage.


The average performance is better than that of the two previous suggestions.
Optimizing the worst case would be time consuming, and I am not sure whether
there is enough time within the Summer of Code limits.


> >   All the above working cycles solve the problem introduced by disable
> >   caching text bases. The first one can be easily implemented, but
> >   introduces inconvenient manual operations. The latter two cycles
> >   require modifications in both the client and server sides. The
> >   problem of the second one is the heavy load of transmission during a
> >   commit. Since the contents of large files change seldom, the second
> >   cycle is feasible. The third one concerns the collision of message
> >   digest algorithms. There is a report that different contents give
> >   same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
> >   collisions have not been found in SHA-1 algorithm. Some
> >   investigations should be down to avoid collisions. I prefer to
> >   implement the third working model.
> >
> I'm no expert in this area, but I'm pretty sure the collisions concern
> the cryptographic uses of MD5, so I don't think we need to worry about
> that.  Others may want to comment here.


I would like to use the MD5 algorithm, but there is a risk that some files
are not correctly committed to the server.

> >   According to these discussions, I suggest to add a section of
> >   runtime configuration options and a special property to manage text
> >   bases.
> >
> > ** Runtime Configurations for text-base Management
> >
> >   I suggest to add a new section, 'text-base', to the set of options
> >   of runtime configuration. This section provides options of text
> >   bases management on the client side:
> >
> >   - compressed: This is a binary option (yes/no). This instructs
> >     Subversion client to cache compressed or original text bases. Set
> >     this to 'yes' to enable caching text bases in compressed format.
> >
> >   - exclude-large-bins: This is a binary switch (yes/no). Set this
> >     variable to 'yes' if the user want Subversion to disable caching
> >     large binary files automatically. Whether the file is large or not
> >     is determined by comparing its size with a threshold that
> >     specified by the variable 'exclusion-threshold'.
> >
> >   - exclusion-threshold: This option should be a positive number. Its
> >     value describes whether a binary file is large enough to turn off
> >     the caching of its corresponding text-base. The suggested default
> >     value is 512KB.
>
> The two options above could be combined into one.  Please keep the
> number of user options low.
>
> >   - digest-block-size: This variable specifies the size of blocks the
> >     binary files will be split into. This option should be a positive
> >     number and its default value is suggested to be 32KB.
>
> Drop this.  Who will know how to tweak this (uh, and the method
> doesn't work anyway:-)


You are right.

> > ** Special Property for text-base Management
> >
> >   A special property, 'svn:text-base', is suggested to be added. This
> >   property indicates the way Subversion stores the text base of
> >   corresponding file. Its value of can be one of the follows:
>
> As I said above, this shouldn't be versioned.  You may need to extend
> the .svn/entries file, though.


This suggestion is good.  Is there a user interface to access the
.svn/entries file in the current Subversion client? I think we need a new
command for users to access this file.


> A problem with the user interface sketched is that there is no way to
> specify the textbase handling per working copy, but only per user.
> Say one repository is on your LAN and another is in China (I live in
> Sweden:-).

That is NOT the case, since there is a --config-dir option.



> Regards,
> //Peter
>



--
Best Regards,
Fred Qi

Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

Hi,

As this was posted here, I reply on the list.  I know there are other
applications for this as well.  I hope all applicants will be able to
benefit from this information.  (Also, note that what I say may not be
the consensus of the project - I'm only one member.)

In short, I think the proposal is a good starting point, but there are
things needing more thought or reconsideration.

Qi Fred writes:
 >   The following features are planned to be implemented:
 > 
 >   - By setting options in the runtime configuration files, users can
 >     (a) switch between using original and compressed text bases, and

I assume these options will determine which method gets used when
checking out?  Do you imagine the user being able to switch existing
working copies?

 >     (b) enable or disable caching large binary files.
 > 
 >   - By specifying a special property on a certain file, one of the
 >     three caching mechanisms can be chosen: original, compressed, and
 >     excluded (caching disabled). Note that the text bases can be
 >     excluded on client side only if the file is a binary one.
 > 
Do you propose to use versioned properties for this?  I'd say this
should only be a client-side option.

Why limit optional text bases to binary files?  Many small files also
take up much disk space on many filesystems.

 >   But disabling the caching of text bases changes the work model of
 >   Subversion because comparison (diff) and generation of deltas depend
 >   directly on text bases.

Note that you don't strictly need the text base to generate a text
delta, it would just be a delta containing only new data, making
effectively a compressed fulltext.  There is nothing saying that a
delta sent to the server must be minimal.

 >   If a file without cached text base has been modified and intend to
 >   be committed, there are three (or more) potential working cycles:
 > 
 >   1) abort and warn the user

That's not good.  This makes the feature pretty useless except for
read-only working copies...
 >   2) temporarily download the base revision
 > 
Could as well send a fulltext delta to the server.

 >   3) make Subversion work without cached text bases
 >      - split large binary files into small blocks, for example, 32KB
 >      - stores locally the very short message digests of all blocks
 >      - detect changes by comparing digests of corresponding blocks
 >      - send only the changed blocks to the server or request and
 >        download only the changed blocks to the client.
 >      - generate deltas and commit changes (on server or client side).

What happens when someone inserts one byte near the beginning of the
file?  We need an rsync-like algorithm if we want to do this.  I think
this is an optional optimization.  People will need to trade disk
usage (storing text bases) versus network usage.
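
For reference, the core of an rsync-like approach is a weak checksum
that can be rolled forward one byte at a time, so old blocks can be
found at any offset in the new file.  The following is only a
simplified sketch of that idea (it slides one byte at a time even after
a match, and all names are made up); it is not rsync's actual code.

```python
import hashlib
import random

MOD = 1 << 16


def weak_sum(block: bytes):
    """Adler-style pair (a, b) that can be updated incrementally."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


def find_old_blocks(old: bytes, new: bytes, block_size: int):
    """Map old block index -> offset in new where that block reappears."""
    table = {}
    for start in range(0, len(old) - block_size + 1, block_size):
        block = old[start:start + block_size]
        entry = (hashlib.md5(block).digest(), start // block_size)
        table.setdefault(weak_sum(block), []).append(entry)
    matches = {}
    if len(new) < block_size:
        return matches
    a, b = weak_sum(new[:block_size])
    pos = 0
    while True:
        for strong, index in table.get((a, b), ()):
            # Confirm a weak match with the strong digest before trusting it.
            window = new[pos:pos + block_size]
            if index not in matches and hashlib.md5(window).digest() == strong:
                matches[index] = pos
        if pos + block_size >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + block_size], block_size)
        pos += 1
    return matches


BLOCK = 1024
rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(8 * BLOCK))
found = find_old_blocks(old, b"\x00" + old, BLOCK)
print(found)
```

After a one-byte insert at the front, every old block is still found,
just one byte later, so only the inserted byte would need to be sent.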

 >   All the above working cycles solve the problem introduced by disable
 >   caching text bases. The first one can be easily implemented, but
 >   introduces inconvenient manual operations. The latter two cycles
 >   require modifications in both the client and server sides. The
 >   problem of the second one is the heavy load of transmission during a
 >   commit. Since the contents of large files change seldom, the second
 >   cycle is feasible. The third one concerns the collision of message
 >   digest algorithms. There is a report that different contents give
 >   same MD5 digests (http://eprint.iacr.org/2004/199.pdf). But
 >   collisions have not been found in SHA-1 algorithm. Some
 >   investigations should be down to avoid collisions. I prefer to
 >   implement the third working model.
 > 
I'm no expert in this area, but I'm pretty sure the collisions concern
the cryptographic uses of MD5, so I don't think we need to worry about
that.  Others may want to comment here.

 >   According to these discussions, I suggest to add a section of
 >   runtime configuration options and a special property to manage text
 >   bases.
 > 
 > ** Runtime Configurations for text-base Management
 > 
 >   I suggest to add a new section, 'text-base', to the set of options
 >   of runtime configuration. This section provides options of text
 >   bases management on the client side:
 > 
 >   - compressed: This is a binary option (yes/no). This instructs
 >     Subversion client to cache compressed or original text bases. Set
 >     this to 'yes' to enable caching text bases in compressed format.
 > 
 >   - exclude-large-bins: This is a binary switch (yes/no). Set this
 >     variable to 'yes' if the user want Subversion to disable caching
 >     large binary files automatically. Whether the file is large or not
 >     is determined by comparing its size with a threshold that
 >     specified by the variable 'exclusion-threshold'.
 > 
 >   - exclusion-threshold: This option should be a positive number. Its
 >     value describes whether a binary file is large enough to turn off
 >     the caching of its corresponding text-base. The suggested default
 >     value is 512KB.

The two options above could be combined into one.  Please keep the
number of user options low.
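
One way the combination might look, purely as a sketch: keep only
'exclusion-threshold' and treat a missing or zero value as "never
exclude".  The 'text-base' section and option names come from the
proposal and do not exist in any released client; parse_size and
should_exclude are made-up helpers.

```python
import configparser

SAMPLE = """
[text-base]
compressed = yes
exclusion-threshold = 512KB
"""

UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text: str) -> int:
    """Turn '512KB', '2MB' or a plain byte count into a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


def should_exclude(config: configparser.ConfigParser, file_size: int) -> bool:
    """A single option: a threshold of 0 (or no option) disables exclusion."""
    raw = config.get("text-base", "exclusion-threshold", fallback="0")
    threshold = parse_size(raw)
    return threshold > 0 and file_size >= threshold


config = configparser.ConfigParser()
config.read_string(SAMPLE)
print(should_exclude(config, 600 * 1024))
print(should_exclude(config, 100 * 1024))
```

This drops the separate yes/no switch: setting a threshold is what
turns the behaviour on.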

 >   - digest-block-size: This variable specifies the size of blocks the
 >     binary files will be split into. This option should be a positive
 >     number and its default value is suggested to be 32KB.

Drop this.  Who will know how to tweak this (uh, and the method
doesn't work anyway:-)

 > ** Special Property for text-base Management
 > 
 >   A special property, 'svn:text-base', is suggested to be added. This
 >   property indicates the way Subversion stores the text base of
 >   corresponding file. Its value of can be one of the follows:

As I said above, this shouldn't be versioned.  You may need to extend
the .svn/entries file, though.


A problem with the user interface sketched is that there is no way to
specify the textbase handling per working copy, but only per user.
Say one repository is on your LAN and another is in China (I live in
Sweden:-).

Regards,
//Peter


Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by Qi Fred <fr...@gmail.com>.

I have submitted a proposal to Summer of Code 2006 on this task.
The following is my proposal,
-------------------------------------

Name: Qi, Fei
Email: fred.qi@gmail.com
IM: fred.qi@gmail.com (gtalk)
Languages: Chinese (native);
           English (fluent in reading, writing and speaking).

* PROJECT TITLE
----------------------------------------------------------------------
Compressed or optional text base storage in Subversion
----------------------------------------------------------------------

* SUMMARY

  In Subversion, diffs and deltas are computed off-line from the
  locally cached text bases. The text bases of a working copy are
  unmodified copies of its files at the base revision. This design,
  however, approximately doubles the storage space needed on the
  client side. Two feasible ways of reducing that storage are:
  (a) compress the text bases, and (b) disable caching of the text
  bases of some or all of the files in the working copy. My proposal
  is to add a mechanism that combines the two solutions to manage
  text bases.

  The following features are planned to be implemented:

  - By setting options in the runtime configuration files, users can
    (a) switch between using original and compressed text bases, and
    (b) enable or disable caching large binary files.

  - By setting a special property on a file, one of three caching
    mechanisms can be chosen: original, compressed, or excluded
    (caching disabled). Note that a text base can be excluded on the
    client side only if the file is a binary one.

* DETAILS of PROJECT

  Compressed or optional text base storage has been discussed for a
  long time in Subversion's development community:
  - SoC description: http://subversion.tigris.org/project_tasks.html
  - issue 525: http://subversion.tigris.org/issues/show_bug.cgi?id=525
  - issue 908: http://subversion.tigris.org/issues/show_bug.cgi?id=908
  These discussions are the starting point for implementing this
  proposal.

** Implementations of the Two Solutions

  In my opinion, the two solutions have similar consequences but
  differ in essence. Using compressed text bases does NOT affect the
  working model of Subversion. It only adds the runtime cost of
  compressing and decompressing the text bases, so its implementation
  is fairly straightforward. Disabling the caching of text bases, on
  the other hand, changes the working model of Subversion, because
  comparison (diff) and generation of deltas depend directly on text
  bases.
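
  To illustrate why compression leaves the working model untouched,
  here is a minimal sketch in Python (chosen only for brevity; the
  real change would be C code in the working copy library). The
  function names are made up for this example and are not Subversion
  APIs. The text base is compressed when stored and decompressed on
  the fly for an off-line diff:

```python
import difflib
import zlib


def store_text_base(contents: bytes) -> bytes:
    """Return the compressed form kept in place of the plain text base."""
    return zlib.compress(contents, 6)


def load_text_base(stored: bytes) -> bytes:
    """Recover the pristine contents from the compressed text base."""
    return zlib.decompress(stored)


def local_diff(stored_base: bytes, working: bytes) -> list:
    """Diff the working file against its compressed text base, off-line."""
    base_lines = load_text_base(stored_base).decode("utf-8").splitlines(keepends=True)
    work_lines = working.decode("utf-8").splitlines(keepends=True)
    return list(difflib.unified_diff(base_lines, work_lines, "base", "working"))


# A repetitive pristine file, its compressed text base, and a local edit.
pristine = b"line one\nline two\n" * 200
stored = store_text_base(pristine)
modified = pristine.replace(b"line two\n", b"line 2\n", 1)
diff = local_diff(stored, modified)
```

  The only extra cost is the decompression step inside local_diff();
  no network access is needed, just as with uncompressed text bases.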

  If a file without a cached text base has been modified and is about
  to be committed, there are three (or more) potential working cycles:

  1) abort and warn the user
     - abort the commit process
     - prompt the user to enable caching of the corresponding file
     - the user enables caching
     - restart the commit process

  2) temporarily download the base revision
     - request the base revision from the server
     - temporarily download the base revision
     - generate the deltas and commit the changes
     - remove the base file, since caching is disabled

  3) make Subversion work without cached text bases
     - split large binary files into small blocks of, for example, 32KB
     - store locally the very short message digests of all blocks
     - detect changes by comparing digests of corresponding blocks
     - send only the changed blocks to the server, or request and
       download only the changed blocks to the client
     - generate deltas and commit changes (on server or client side)

  All of the above working cycles solve the problem introduced by
  disabling the caching of text bases. The first one can be easily
  implemented, but introduces inconvenient manual operations. The
  latter two cycles require modifications on both the client and
  server sides. The problem with the second one is the heavy
  transmission load during a commit; but since the contents of large
  files seldom change, the second cycle is feasible. The concern with
  the third one is collisions in message digest algorithms. There is
  a report that different contents can give the same MD5 digest
  (http://eprint.iacr.org/2004/199.pdf), whereas no collisions have
  been found so far for the SHA-1 algorithm. Some investigation should
  be done to avoid collisions. I prefer to implement the third working
  model.
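
  The block-digest comparison of the third working cycle could look
  roughly like the sketch below. It is in Python for brevity and
  assumes SHA-1 digests and fixed block offsets; the function names
  are hypothetical and nothing here is existing Subversion code:

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # the example block size from the list above


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Digest each fixed-size block; only these digests are kept locally."""
    return [
        hashlib.sha1(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def changed_blocks(old_digests: list, data: bytes,
                   block_size: int = BLOCK_SIZE) -> list:
    """Return the indices of blocks whose digest differs from the stored one."""
    new_digests = block_digests(data, block_size)
    longest = max(len(old_digests), len(new_digests))
    return [
        i for i in range(longest)
        if i >= len(old_digests) or i >= len(new_digests)
        or old_digests[i] != new_digests[i]
    ]


# 100KB of non-repeating-per-block data -> 4 blocks (the last one is short).
base = bytes(i % 251 for i in range(100 * 1024))
stored = block_digests(base)          # kept instead of the full text base

edited = bytearray(base)
edited[40 * 1024] ^= 0xFF             # overwrite one byte inside block 1
in_place = changed_blocks(stored, bytes(edited))

# Inserting a single byte at the front shifts every later block.
shifted = changed_blocks(stored, b"\x01" + base)
```

  An in-place change marks only the affected block, but with fixed
  offsets a one-byte insertion makes every following block compare as
  changed. A rolling checksum, as used by rsync, is the usual remedy
  and would need to be investigated as well.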

  Based on these discussions, I suggest adding a section of runtime
  configuration options and a special property to manage text bases.

** Runtime Configurations for text-base Management

  I suggest adding a new section, 'text-base', to the runtime
  configuration options. This section provides options for text base
  management on the client side:

  - compressed: This is a binary option (yes/no). It instructs the
    Subversion client to cache compressed or original text bases. Set
    this to 'yes' to enable caching text bases in compressed format.

  - exclude-large-bins: This is a binary switch (yes/no). Set this
    variable to 'yes' if the user wants Subversion to disable caching
    of large binary files automatically. Whether a file is large or
    not is determined by comparing its size with the threshold
    specified by the variable 'exclusion-threshold'.

  - exclusion-threshold: This option should be a positive number. Its
    value is the size at which a binary file is considered large
    enough to turn off the caching of its text base. The suggested
    default value is 512KB.

  - digest-block-size: This variable specifies the size of the blocks
    that binary files will be split into. This option should be a
    positive number; the suggested default value is 32KB.
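
  As an illustration, the proposed section might be written and read
  as in the following Python sketch. The section and option names are
  the ones proposed above and do not exist in Subversion today; the
  'KB'/'MB' suffix syntax and the 'no' defaults for the two switches
  are assumptions of this example:

```python
import configparser

# Hypothetical contents of the runtime 'config' file.
SAMPLE = """
[text-base]
compressed = yes
exclude-large-bins = yes
exclusion-threshold = 512KB
digest-block-size = 32KB
"""

UNITS = {"KB": 1024, "MB": 1024 * 1024}


def parse_size(text: str) -> int:
    """Turn '512KB' or '4096' into a positive byte count."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            value = int(text[:-len(suffix)]) * factor
            break
    else:
        value = int(text)  # a plain number is taken as bytes
    if value <= 0:
        raise ValueError("size must be a positive number")
    return value


def load_text_base_options(source: str) -> dict:
    """Read the proposed [text-base] section, applying the suggested defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(source)
    section = parser["text-base"]
    return {
        "compressed": section.getboolean("compressed", fallback=False),
        "exclude_large_bins": section.getboolean("exclude-large-bins",
                                                 fallback=False),
        "exclusion_threshold": parse_size(
            section.get("exclusion-threshold", fallback="512KB")),
        "digest_block_size": parse_size(
            section.get("digest-block-size", fallback="32KB")),
    }


options = load_text_base_options(SAMPLE)
```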

** Special Property for text-base Management

  I suggest adding a special property, 'svn:text-base'. This property
  indicates how Subversion stores the text base of the corresponding
  file. Its value can be one of the following:

  - original: This causes Subversion to store the corresponding text
    base in its original format.

  - compressed: This causes Subversion to store the text base in
    compressed format.

  - excluded: This causes Subversion to work without a cached text
    base. This value is applicable only to binary files.
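
  How the property and the runtime options might interact is sketched
  below in Python. The proposal does not fix the precedence, so this
  example assumes that an explicit property value overrides the
  runtime configuration and that a file counts as large when its size
  reaches the threshold. The function name is hypothetical:

```python
VALID_MODES = ("original", "compressed", "excluded")


def text_base_mode(prop_value, is_binary, size, options):
    """Pick 'original', 'compressed' or 'excluded' for one file.

    prop_value is the value of the proposed 'svn:text-base' property,
    or None when it is unset; options holds the proposed runtime
    settings as a dict.
    """
    if prop_value is not None:
        if prop_value not in VALID_MODES:
            raise ValueError("unknown svn:text-base value: %r" % prop_value)
        if prop_value == "excluded" and not is_binary:
            raise ValueError("'excluded' applies only to binary files")
        return prop_value  # assumption: the property wins over the config
    # No property set: fall back to the runtime configuration.
    if (is_binary and options["exclude_large_bins"]
            and size >= options["exclusion_threshold"]):
        return "excluded"
    return "compressed" if options["compressed"] else "original"


opts = {"compressed": True, "exclude_large_bins": True,
        "exclusion_threshold": 512 * 1024}

small_text = text_base_mode(None, False, 10 * 1024, opts)
large_binary = text_base_mode(None, True, 5 * 1024 * 1024, opts)
forced_original = text_base_mode("original", True, 5 * 1024 * 1024, opts)
```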

* SCHEDULE

  This summer, my main work is to finish my Ph.D. dissertation.
  According to my plan, I can work on this project for 3~4 hours a
  day, 4~5 days per week. The following is my detailed schedule ('+'
  indicates a milestone):

  May 22:
    - commence with project.
  W01 (May 22 ~ May 28):
    - communicate with mentors to confirm the proposal and goals
    - read the related code and documentation in Subversion
  W02 (May 29 ~ Jun. 4):
    - sketch the framework of text-base management
    - prepare test cases
    - implement the user interface
  W03 (Jun. 5 ~ Jun. 11):
    - implement the compressed IO based on svn_stream_compressed()
    - add logging support
  W04 (Jun. 12 ~ Jun. 18):
    - implement compressed text bases support in checkout/update
      commands
  W05 (Jun. 19 ~ Jun. 25):
    - implement compressed text bases support in commit/diff commands
 +W06 (Jun. 26 ~ Jul. 2): (Mid-program evaluations, Jun. 30)
    - finish the compressed text bases management
    - commence the working model without cached text bases
  W07 (Jul. 3 ~ Jul. 9):
    - function(s) for splitting files into blocks
    - function(s) for generating message digests of blocks of files
      (apr-util provides the MD4 and MD5 algorithms)
  W08 (Jul. 10 ~ Jul. 16):
    - comparison based on message digests of blocks
    - support in checkout/update commands
  W09 (Jul. 17 ~ Jul. 23):
    - request blocks on client side
    - receive blocks on client side
  W10 (Jul. 24 ~ Jul. 30):
    - send blocks on server side
  W11 (Jul. 31 ~ Aug. 6):
    - generation of deltas from blocks
    - finish the commit command on client side
 +WW (Aug. 7 ~ Aug. 21):
    - finish the optional caching support
    - write a final report
    - pencils down

* Experience with Subversion and Programming

** Experience with Subversion

  I have been a Subversion user for more than one and a half years.
  Subversion is a great version control system which outperforms all
  the ones I used before I entered the world of Subversion. I am very
  familiar with the commands and configuration of Subversion.

  I subscribed to the development mailing list and downloaded the
  source code of Subversion when I heard of SoC 2006. I have read the
  'Hacker's Guide to Subversion' and the documentation in some header
  files.

** Experience with Programming

  I have been using C/C++ as my main development language for more
  than eight years. Though most of my development work is done under
  Windows, I have experience developing communication programs under
  Unix/Linux.

  I am a good team player. I have participated in several projects;
  the three main ones are listed below (more details are available on
  my resume web page):

  - SportsPartner project: This project aims to track the players and
    analyze their actions in sports (soccer) games. I am the team
    leader and key algorithm developer.

  - NightView project: This project aims to design and implement a
    vision-based pedestrian detector to improve the safety of night
    driving. I am a consultant on this research and development
    project.

  - Microarray Image Analysis: This project aims to detect and
    quantify the intensities of spots on scanned microarray images. My
    task is to design and implement the algorithm that detects and
    recognizes the regular grid structures on such images.

* BIOGRAPHY

  I received a B.Eng. from Northwestern Polytechnical University,
  Xi'an, China, in July 2000. I am now a Ph.D. candidate majoring in
  control science and engineering at the Department of Automation,
  Tsinghua University, Beijing, China. I expect to receive my Ph.D.
  degree in Jan. 2007.

  My resume can be found at the following addresses:
  - HTML format: http://fred.qi.googlepages.com/resume.html
  - PDF format: http://fred.qi.googlepages.com/cv-qf.pdf

* OTHER PROJECTS in SoC 2006

  I plan to apply for one or two other projects mentored by the Boost
  organization, but I would prefer to work on this project.
-----
Best regards,
Fei Qi


Re: [Reminder] Subversion a mentor for Google Summer of Code

Posted by Sachin Garg <sc...@gmail.com>.

I looked at bug ID 908, which asks for the local copy in text-base to
be stored compressed. I did a little digging around in the code and
felt it shouldn't be very hard to implement, and it will at least
make my life easier.

I am not going through the Google Summer of Code thing (am no longer a
student either :-) but would like to implement this feature (assuming
someone hasn't already started working on this).

I am a long time subversion user (on Windows, TortoiseSVN) but new to
subversion code, so will need some guidance if you guys want me to
work on this.

Some quick questions:

# Is libsvn_wc/ the only place where I will need to edit code, or do I
need to look in other directories too? Which ones?

# Do we already have a compression library (zlib?) linked in subversion?

# How much additional delay is this expected to cause during
checkouts and commits? Should I use something lightweight like zlib, or
will it be fine to use bzip2, which can give better compression but
will be slower?

# Do we want files in text-base to be always compressed, or do we want
text-base compression to be optional?

Bug no. 525 (optional text-base storage) is slightly related; maybe I
can come up with a design which will make it easier to implement 525
too, like implementing text-base access as a layer which can have
multiple implementations:

1. Direct file read
2. Read compressed file
3. Fetch from server
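
A rough sketch of such a layer, in Python just to keep it short (the
real thing would be C inside libsvn_wc). All class names, the '.z'
file suffix and the fetch callback are made up for illustration and
are not existing Subversion code:

```python
import os
import tempfile
import zlib
from abc import ABC, abstractmethod


class TextBaseSource(ABC):
    """One way of obtaining the pristine contents of a file."""

    @abstractmethod
    def read(self) -> bytes:
        ...


class PlainFile(TextBaseSource):
    """1. Direct file read."""

    def __init__(self, path):
        self.path = path

    def read(self) -> bytes:
        with open(self.path, "rb") as f:
            return f.read()


class CompressedFile(TextBaseSource):
    """2. Read compressed file."""

    def __init__(self, path):
        self.path = path

    def read(self) -> bytes:
        with open(self.path, "rb") as f:
            return zlib.decompress(f.read())


class ServerFetch(TextBaseSource):
    """3. Fetch from server."""

    def __init__(self, fetch):
        # 'fetch' is a hypothetical callable standing in for a request
        # to the repository.
        self.fetch = fetch

    def read(self) -> bytes:
        return self.fetch()


# Callers only see read(); where the bytes come from is hidden.
pristine = b"hello, text base\n"
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "foo.svn-base")
    packed_path = os.path.join(tmp, "foo.svn-base.z")
    with open(plain_path, "wb") as f:
        f.write(pristine)
    with open(packed_path, "wb") as f:
        f.write(zlib.compress(pristine))
    sources = [PlainFile(plain_path), CompressedFile(packed_path),
               ServerFetch(lambda: pristine)]
    results = [source.read() for source in sources]
```

With something like this, diff and commit code would not need to know
which of the three storage schemes a given file uses.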

Another possible todo item (which runs in opposite direction from the
above items :-)

Just as SVN stores the text-base for local diffs, how about
generalizing it to store the N previous revisions and change log
entries? Storing additional revisions shouldn't result in too much
bloat, as we can probably store just the diffs, and it would make
more operations local.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info

