Posted to dev@subversion.apache.org by Thomas Åkesson <th...@akesson.cc> on 2012/11/09 12:28:58 UTC

Re: [RFC] Non-normalizing Unicode Composition Awareness

Revisiting this thread after a few months. Last spring, I did some work in the Wiki on a proposal for resolving the Mac Unicode issues in a non-normalizing manner. I ran out of time, but the thought process has been ongoing.

A couple of weeks ago at Subversion Live in London, I had the opportunity to discuss this with a number of people. Although opinions differed on some points, I think we concluded that we are actually relatively well aligned on the core idea.

The proposal I drafted this spring (in the Wiki) suggested adding a couple of columns to the WC in order to store normalized paths. Over the past couple of months, the idea of using an SQLite collation has come to seem more appealing. Last week, I ran a test with the SQLite ICU extension (available in the SQLite source repository), which turned out to be quite encouraging. With such a collation, equality comparisons in SQL statements can match paths in a Unicode composition-aware manner and therefore return rows regardless of which composition the paths use.

This would be very useful, for instance, when attempting to locate the node in wc.db that corresponds to a given filesystem path. That is basically half the issue with Mac working copies.
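
A minimal sketch of the collation idea, assuming Python's sqlite3 and unicodedata modules as stand-ins for the ICU extension or utf8proc. The collation name UNICODE_NFC, the nodes table, and the find_node helper are hypothetical and only illustrate the lookup; the real wc.db schema is different.

```python
import sqlite3
import unicodedata


def nfc_collation(a: str, b: str) -> int:
    """Compare two strings by their NFC forms; return -1, 0 or 1."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


def open_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical collation name; registered per connection.
    conn.create_collation("UNICODE_NFC", nfc_collation)
    # Hypothetical one-column table, not the real wc.db layout.
    conn.execute(
        "CREATE TABLE nodes ("
        " local_relpath TEXT COLLATE UNICODE_NFC PRIMARY KEY)"
    )
    return conn


def find_node(conn: sqlite3.Connection, path: str):
    """Return the stored path equal to `path` under the collation, or None."""
    row = conn.execute(
        "SELECT local_relpath FROM nodes WHERE local_relpath = ?", (path,)
    ).fetchone()
    return row[0] if row else None


conn = open_demo_db()
stored = "\u00c5kesson.txt"     # precomposed form, as stored in the database
on_disk = "A\u030akesson.txt"   # decomposed form, as a Mac filesystem may report it
conn.execute("INSERT INTO nodes VALUES (?)", (stored,))

# The lookup succeeds with either composition, and the stored
# path comes back byte-for-byte unchanged (nothing is normalized on disk
# or in the table).
found = find_node(conn, on_disk)
```

The stored path is never rewritten; only the comparison is normalized, which is the "non-normalizing" part of the proposal.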

Today, I noticed that Branko started some implementation in a branch. Looks like a collation based on utf8proc is in the making? I think that would make a lot of sense because the ICU extension poses some challenges in the build process and we might not need all that functionality that it provides.

I started a wiki page about Unicode collation. I will add more information there:
http://wiki.apache.org/subversion/UnicodeCollation

Also note the tiny test repo attached to:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

Cheers,
Thomas Å.
 

Re: [RFC] Non-normalizing Unicode Composition Awareness

Posted by Branko Čibej <br...@wandisco.com>.
On 09.11.2012 14:28, C. Michael Pilato wrote:
> On 11/09/2012 07:49 AM, Branko Čibej wrote:
>> On 09.11.2012 12:28, Thomas Åkesson wrote:
>> I'm currently doing the grunt work of implementing the collation (done)
>> and the LIKE and GLOB operators that we'll need (in progress). The next,
>> and biggest, step will be to review the client and WC libraries to make
>> sure that paths sent to the server always come from the wc.db, not from
>> disk.
> I'm not closely following this problem or solution, but how does the above
> play out for "svn import", "svn mkdir IRI", "svn delete IRI", etc?  (If this
> is documented somewhere, a reference by way of response would suffice.)

Since these are server-side operations with no working copy involvement,
and I'm doing this strictly client-side for now, nothing would change.
This is a problem that we'll eventually have to solve on the server as
well. I don't believe it would be correct for the client to verify that
such operations do not create normalization conflicts on the server.

As a matter of interest, a server-side solution is one of the features
we identified for FSv2; although there's no reason to wait for that. In
FSv2, I envision all names being stored twice, once in their original
form, and once NFC-normalized, for indexing. The reason for that is that
I expect server CPU cycles to be more expensive than server storage, and
it therefore makes sense to avoid using a relatively expensive
normalizing collation in the server metadata index.
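
A rough sketch of the "store every name twice" idea, assuming Python's sqlite3 purely for illustration. The names table, its columns, and the add_name and lookup helpers are all hypothetical; nothing here reflects an actual FSv2 design.

```python
import sqlite3
import unicodedata


def nfc(name: str) -> str:
    return unicodedata.normalize("NFC", name)


conn = sqlite3.connect(":memory:")
# Hypothetical schema: `name` keeps the original bytes, `name_nfc` is the
# normalized copy used for indexing. The index uses plain binary comparison,
# so no normalizing collation runs at query time.
conn.execute(
    "CREATE TABLE names ("
    " parent_id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " name_nfc TEXT NOT NULL,"
    " UNIQUE (parent_id, name_nfc))"
)


def add_name(conn: sqlite3.Connection, parent_id: int, name: str) -> bool:
    """Insert a name; return False if it collides after normalization."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO names VALUES (?, ?, ?)",
                (parent_id, name, nfc(name)),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def lookup(conn: sqlite3.Connection, parent_id: int, name: str):
    """Return the name as originally stored, whatever composition is asked for."""
    row = conn.execute(
        "SELECT name FROM names WHERE parent_id = ? AND name_nfc = ?",
        (parent_id, nfc(name)),
    ).fetchone()
    return row[0] if row else None


decomposed = "A\u030akesson.txt"
precomposed = "\u00c5kesson.txt"

first = add_name(conn, 1, decomposed)     # accepted, stored as given
second = add_name(conn, 1, precomposed)   # rejected: same name after NFC
resolved = lookup(conn, 1, precomposed)   # finds the decomposed original
```

Normalization is paid once per write rather than once per comparison, which is the storage-for-CPU trade described above.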

This /may/ turn out to be an issue for client working copy performance,
too; but for now I've elected to assume that collation won't have a
noticeable effect. If it does, we'll look at other solutions.

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com


Re: [RFC] Non-normalizing Unicode Composition Awareness

Posted by Thomas Åkesson <th...@bafast.se>.
On 9 nov 2012, at 14:28, "C. Michael Pilato" <cm...@collab.net> wrote:

> On 11/09/2012 07:49 AM, Branko Čibej wrote:
>> On 09.11.2012 12:28, Thomas Åkesson wrote:
>> I'm currently doing the grunt work of implementing the collation (done)
>> and the LIKE and GLOB operators that we'll need (in progress). The next,
>> and biggest, step will be to review the client and WC libraries to make
>> sure that paths sent to the server always come from the wc.db, not from
>> disk.
> 
> I'm not closely following this problem or solution, but how does the above
> play out for "svn import", "svn mkdir IRI", "svn delete IRI", etc?  (If this
> is documented somewhere, a reference by way of response would suffice.)

http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

The draft proposes that the server not discriminate between compositions, apart from ensuring that new name collisions cannot be created.

Ensuring that paths come from wc.db applies to existing objects. We can discuss whether the Mac client should normalize to NFC, but that would be an option in my opinion.

/Thomas Å.

Re: [RFC] Non-normalizing Unicode Composition Awareness

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 11/09/2012 07:49 AM, Branko Čibej wrote:
> On 09.11.2012 12:28, Thomas Åkesson wrote:
> I'm currently doing the grunt work of implementing the collation (done)
> and the LIKE and GLOB operators that we'll need (in progress). The next,
> and biggest, step will be to review the client and WC libraries to make
> sure that paths sent to the server always come from the wc.db, not from
> disk.

I'm not closely following this problem or solution, but how does the above
play out for "svn import", "svn mkdir IRI", "svn delete IRI", etc?  (If this
is documented somewhere, a reference by way of response would suffice.)


-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development


Re: [RFC] Non-normalizing Unicode Composition Awareness

Posted by Branko Čibej <br...@wandisco.com>.
On 09.11.2012 12:28, Thomas Åkesson wrote:
> Today, I noticed that Branko started some implementation in a branch. Looks like a collation based on utf8proc is in the making? I think that would make a lot of sense because the ICU extension poses some challenges in the build process and we might not need all that functionality that it provides.

Hi Thomas,

Yes, I started a branch that's intended to fix the normalization
problem. I selected utf8proc because we really don't need ICU (I can't
see a serious need for language-specific case folding, for example, nor
for Unicode regular expressions). Furthermore, utf8proc can be easily
embedded into Subversion so it doesn't present another dependency that
users would have to worry about.

I'm currently doing the grunt work of implementing the collation (done)
and the LIKE and GLOB operators that we'll need (in progress). The next,
and biggest, step will be to review the client and WC libraries to make
sure that paths sent to the server always come from the wc.db, not from
disk.
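
For illustration only, a composition-aware LIKE can be sketched with Python's sqlite3 by overriding SQLite's two-argument like() function, which SQLite calls with the pattern first and the value second. This is not the utf8proc-based code in the branch. The like_nfc and children_of helpers and the nodes table are hypothetical, and unlike the built-in LIKE this version is case-sensitive and does not handle ESCAPE.

```python
import re
import sqlite3
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def like_nfc(pattern, value):
    """LIKE replacement: NFC-normalize both sides, then match % and _."""
    if pattern is None or value is None:
        return None  # keep SQL NULL semantics
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in nfc(pattern)
    )
    return int(re.fullmatch(regex, nfc(value), re.DOTALL) is not None)


conn = sqlite3.connect(":memory:")
# Overriding like() changes what the LIKE operator does on this connection.
conn.create_function("like", 2, like_nfc)
conn.execute("CREATE TABLE nodes (local_relpath TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO nodes VALUES (?)",
    [("dir/\u00c5kesson/a.txt",), ("dir/other/b.txt",)],
)


def children_of(conn: sqlite3.Connection, prefix: str):
    """List stored paths below `prefix` (prefix must not contain % or _)."""
    rows = conn.execute(
        "SELECT local_relpath FROM nodes"
        " WHERE local_relpath LIKE ? ORDER BY local_relpath",
        (prefix + "/%",),
    )
    return [r[0] for r in rows]


# A decomposed prefix still finds the precomposed stored path.
matches = children_of(conn, "dir/A\u030akesson")
```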

One open question is what to do about (historical) collisions in
existing repositories, but I don't think that issue is important enough
to resolve now.

It'll take a while, but I hope to be able to finish the work in time for
1.8. If not ... well then, it'll be in 1.9.

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com