Posted to dev@subversion.apache.org by Julian Foad <ju...@wandisco.com> on 2011/02/22 18:17:25 UTC

Comments on 'notes/unicode-composition-for-filenames'

> Proposed Support Library
> ========================
> 
>    Assumptions
>    -----------
> 
>    The main assumption is that we'll keep using APR for character set

s/character set/character encoding/.

>    conversion, meaning that the recoding solution to choose would not
>    need to provide any other functionality than recoding.

s/recoding/converting between NFD and NFC UTF8 encodings/.
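
To make that concrete: with utf8proc (the library the notes propose),
converting between the two forms is a single call.  A minimal sketch,
assuming a current utf8proc, where utf8proc_NFC() returns a malloc()ed,
NUL-terminated result or NULL on invalid input:

    #include <stdio.h>
    #include <stdlib.h>
    #include "utf8proc.h"

    int main(void)
    {
        /* 'e' U+0065 followed by combining acute U+0301: NFD for e-acute. */
        const utf8proc_uint8_t nfd[] = { 0x65, 0xCC, 0x81, 0x00 };

        utf8proc_uint8_t *nfc = utf8proc_NFC(nfd);
        if (nfc == NULL)
            return 1;  /* invalid UTF-8 or allocation failure */

        /* NFC is the precomposed U+00E9, i.e. bytes 0xC3 0xA9. */
        printf("NFC bytes: %02X %02X\n", (unsigned)nfc[0], (unsigned)nfc[1]);
        free(nfc);
        return 0;
    }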


> Proposed Normal Form
> ====================
> 
> The proposed internal 'normal form' should be NFC, if only because
> it's the more compact form of the two [...]
> would give the maximum performance from utf8proc [...]

I'm not very familiar with all the issues here, but although choosing
NFC may make individual conversions more efficient, I wonder if a
solution that involves normalizing to NFD could have benefits that are
more significant than this.  (Reading through this doc sequentially, we
get to this section on choosing NFC before we get to the list of
possible solutions, and it looks like premature optimization.)

For example, a solution that involves normalizing all input to NFD would
have the advantages that on MacOSX it would need to do *no* conversions
and would continue to work with old repositories in Mac-only workshops.

Further down the road, once we have all this normalization in place and
can guarantee that all new repositories and certain parts of old
repositories (revision >= N, perhaps) are already normalized, then we
won't need to do normalization of repository paths in the majority of
cases.

We will still need to run conversions on client input (from the OS and
from the user), at least on non-MacOSX clients.  In these cases, we are
already running native-to-UTF8 conversions (although these are bypassed
when the native encoding is UTF8), so I wonder if the overhead of
normalizing to NFD is really that much greater than normalizing to NFC.
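
To put the cost question in context, the client-side pipeline I'm
imagining is roughly the following sketch.  svn_utf_cstring_to_utf8()
is the wrapper we already call today; normalize_to_nf() is a made-up
name for whichever normalizer we end up adopting:

    #include "svn_error.h"
    #include "svn_utf.h"

    /* Hypothetical; stands in for the chosen NFD (or NFC) normalizer. */
    static const char *normalize_to_nf(const char *utf8, apr_pool_t *pool);

    static svn_error_t *
    canonical_path_from_native(const char **result,
                               const char *native_path,
                               apr_pool_t *pool)
    {
      const char *utf8_path;

      /* The conversion we already perform (bypassed when the native
         encoding is UTF-8). */
      SVN_ERR(svn_utf_cstring_to_utf8(&utf8_path, native_path, pool));

      /* The proposed extra step.  On the bypass platforms this is the
         only added cost, whichever normal form we pick. */
      *result = normalize_to_nf(utf8_path, pool);
      return SVN_NO_ERROR;
    }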

I'm just not clear if these ideas have already been considered and
dismissed.

- Julian



Re: Comments on 'notes/unicode-composition-for-filenames'

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Feb 22, 2011 at 07:41:12PM +0100, Branko Čibej wrote:
> On 22.02.2011 18:17, Julian Foad wrote:
> >> Proposed Support Library
> >> ========================
> >>
> >>    Assumptions
> >>    -----------
> >>
> >>    The main assumption is that we'll keep using APR for character set
> > s/character set/character encoding/.
> >
> >>    conversion, meaning that the recoding solution to choose would not
> >>    need to provide any other functionality than recoding.
> > s/recoding/converting between NFD and NFC UTF8 encodings/.
> 
> Actually -- you have to go all the way and support complete
> normalization, even if your normalization targets are only NFC and NFD.
> That's because there isn't a sane way to detect whether a string is
> normalized or not -- "sane" in the sense that it should take about as
> long to discover that as to just normalize it.

To put it differently, the only way to figure out whether a given
UTF-8 sequence is valid (or, by extension, uses NFC and/or NFD)
is to parse the entire sequence.
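
A sketch of what "parse the entire sequence" means in practice, using
utf8proc's iterator (which returns the number of bytes consumed, or a
negative error code on malformed input):

    #include "utf8proc.h"

    static int
    is_valid_utf8(const utf8proc_uint8_t *buf, utf8proc_ssize_t len)
    {
      utf8proc_int32_t cp;
      utf8proc_ssize_t n;

      while (len > 0)
        {
          n = utf8proc_iterate(buf, len, &cp);
          if (n < 0)
            return 0;       /* malformed sequence */
          buf += n;
          len -= n;
        }
      return 1;             /* we had to walk the whole buffer to know */
    }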

Re: Comments on 'notes/unicode-composition-for-filenames'

Posted by Branko Čibej <br...@e-reka.si>.
On 22.02.2011 20:11, Daniel Shahaf wrote:
> Branko Čibej wrote on Tue, Feb 22, 2011 at 19:41:12 +0100:
>> On 22.02.2011 18:17, Julian Foad wrote:
>>> For example, a solution that involves normalizing all input to NFD would
>>> have the advantages that on MacOSX it would need to do *no* conversions
>>> and would continue to work with old repositories in Mac-only workshops.
>> You'd make this configurable? But how? How do you prove that paths in
>> old repositories are normalized in a certain way? You can only assume
>> that for paths that you know were normalized before being written to the
>> repository. And even then, you can't assume too much -- an older tool,
>> without normalization, can still write denormalized strings to the
>> repository via file://. Unless you want to have an explicit flag for
> Really?  So the FS layer wouldn't be aware of NFC v. NFD?

It should certainly normalize them, but there's no guarantee that an
older tool wasn't linked to an older libsvn_repos/fs (whilst still being
compatible with the FS layout). I admit that's not a nice way to do
things, but it can happen. We'd either have to allow for this case,
which implies normalizing paths as we read them from the repository; or,
make the normalization mandatory with an "incompatible" FS version bump,
but then you'd have to do a complete dump/reload cycle in order to
upgrade your FS to that version.

I'm guessing not everyone will want to do the dump/reload thing ... but
the normalize-on-read could probably be made dependent on the FS version.
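
Roughly what I mean, with every name in the sketch invented (none of
this is existing API):

    #include "svn_error.h"
    #include "svn_fs.h"

    /* All hypothetical: an FS format that guarantees normalized paths,
       plus the raw read and normalization helpers. */
    #define FS_FORMAT_NORMALIZED_PATHS 99
    static int fs_format(svn_fs_t *fs);
    static svn_error_t *fs_read_raw_path(const char **path, svn_fs_t *fs,
                                         apr_pool_t *pool);
    static svn_error_t *normalize_path(const char **result, const char *path,
                                       apr_pool_t *pool);

    static svn_error_t *
    read_path(const char **path, svn_fs_t *fs, apr_pool_t *pool)
    {
      SVN_ERR(fs_read_raw_path(path, fs, pool));

      /* Older formats may hold denormalized paths (e.g. written via
         file:// by an old tool), so fix them up on the way out; newer
         formats guarantee normalization, so skip the work. */
      if (fs_format(fs) < FS_FORMAT_NORMALIZED_PATHS)
        SVN_ERR(normalize_path(path, *path, pool));

      return SVN_NO_ERROR;
    }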

-- Brane


Re: Comments on 'notes/unicode-composition-for-filenames'

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Branko Čibej wrote on Tue, Feb 22, 2011 at 19:41:12 +0100:
> On 22.02.2011 18:17, Julian Foad wrote:
> > For example, a solution that involves normalizing all input to NFD would
> > have the advantages that on MacOSX it would need to do *no* conversions
> > and would continue to work with old repositories in Mac-only workshops.
> 
> You'd make this configurable? But how? How do you prove that paths in
> old repositories are normalized in a certain way? You can only assume
> that for paths that you know were normalized before being written to the
> repository. And even then, you can't assume too much -- an older tool,
> without normalization, can still write denormalized strings to the
> repository via file://. Unless you want to have an explicit flag for

Really?  So the FS layer wouldn't be aware of NFC v. NFD?

Right now the FS layer already requires paths to be in UTF-8; we might
as well declare that the FS layer considers two paths equivalent if they
only differ in canonicalization, and then have the FS canonicalize paths
before storing them.
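
In sketch form (all helper names invented, normalize_nfc() standing in
for whatever normalizer we pick):

    #include <string.h>
    #include "svn_error.h"
    #include "svn_types.h"
    #include "svn_fs.h"

    static const char *normalize_nfc(const char *path, apr_pool_t *pool);
    static svn_error_t *fs_store_raw_path(svn_fs_t *fs, const char *path,
                                          apr_pool_t *pool);

    /* Two paths name the same node iff their canonical forms match. */
    static svn_boolean_t
    fs_paths_equivalent(const char *a, const char *b, apr_pool_t *pool)
    {
      return strcmp(normalize_nfc(a, pool), normalize_nfc(b, pool)) == 0;
    }

    /* Store only the canonical form, so nothing denormalized can get
       into the repository in the first place. */
    static svn_error_t *
    fs_store_path(svn_fs_t *fs, const char *path, apr_pool_t *pool)
    {
      return fs_store_raw_path(fs, normalize_nfc(path, pool), pool);
    }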

> every path to see if it's normalized or not -- which implies changing
> the repository format -- then you can only really make assumptions about
> normalization of paths in the repository post-2.0.
> 
> -- Brane

Re: Comments on 'notes/unicode-composition-for-filenames'

Posted by Branko Čibej <br...@e-reka.si>.
On 22.02.2011 18:17, Julian Foad wrote:
>> Proposed Support Library
>> ========================
>>
>>    Assumptions
>>    -----------
>>
>>    The main assumption is that we'll keep using APR for character set
> s/character set/character encoding/.
>
>>    conversion, meaning that the recoding solution to choose would not
>>    need to provide any other functionality than recoding.
> s/recoding/converting between NFD and NFC UTF8 encodings/.

Actually -- you have to go all the way and support complete
normalization, even if your normalization targets are only NFC and NFD.
That's because there isn't a sane way to detect whether a string is
normalized or not -- "sane" in the sense that it should take about as
long to discover that as to just normalize it.
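
In other words, the honest "is this normalized?" test is
normalize-and-compare, at which point you have already paid for the
normalization.  A sketch with utf8proc:

    #include <stdlib.h>
    #include <string.h>
    #include "utf8proc.h"

    /* Returns 1 if str is already NFC, 0 if not, -1 on invalid UTF-8.
       Note that the work done is the same as just normalizing. */
    static int
    is_nfc(const utf8proc_uint8_t *str)
    {
      utf8proc_uint8_t *norm = utf8proc_NFC(str);
      int same;

      if (norm == NULL)
        return -1;
      same = strcmp((const char *)norm, (const char *)str) == 0;
      free(norm);
      return same;
    }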

>> Proposed Normal Form
>> ====================
>>
>> The proposed internal 'normal form' should be NFC, if only because
>> it's the more compact form of the two [...]
>> would give the maximum performance from utf8proc [...]
> I'm not very familiar with all the issues here, but although choosing
> NFC may make individual conversions more efficient, I wonder if a
> solution that involves normalizing to NFD could have benefits that are
> more significant than this.  (Reading through this doc sequentially, we
> get to this section on choosing NFC before we get to the list of
> possible solutions, and it looks like premature optimization.)

It's like this: Once we impose a normalization form for our internal
representation, we /always/ have to normalize, regardless of which
system we're on, because we can't (or rather, don't want to) trust the
host system to do it right.

For example, on Windows, file names are NFC/UTF-16; so if APR preserves
the normalization when converting to UTF-8, then our internal
normalization is essentially a no-op -- but we still have to do it, if
only to make sure that our internal representation is correct (see above
about detection not being significantly faster than normalization).
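
So the boundary routine runs unconditionally, on every platform.  A
sketch using utf8proc_map(), which returns the length of the newly
allocated result or a negative error code:

    #include "utf8proc.h"

    /* Normalize input to our internal form (NFC here).  With
       UTF8PROC_NULLTERM the length argument is ignored and the input
       is taken to be NUL-terminated.  Returns NULL on invalid input. */
    static utf8proc_uint8_t *
    internal_form(const utf8proc_uint8_t *input)
    {
      utf8proc_uint8_t *out = NULL;
      utf8proc_ssize_t len =
        utf8proc_map(input, 0, &out,
                     UTF8PROC_NULLTERM | UTF8PROC_STABLE | UTF8PROC_COMPOSE);

      return (len < 0) ? NULL : out;
    }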

> For example, a solution that involves normalizing all input to NFD would
> have the advantages that on MacOSX it would need to do *no* conversions
> and would continue to work with old repositories in Mac-only workshops.

You'd make this configurable? But how? How do you prove that paths in
old repositories are normalized in a certain way? You can only assume
that for paths that you know were normalized before being written to the
repository. And even then, you can't assume too much -- an older tool,
without normalization, can still write denormalized strings to the
repository via file://. Unless you want to have an explicit flag for
every path to see if it's normalized or not -- which implies changing
the repository format -- then you can only really make assumptions about
normalization of paths in the repository post-2.0.

-- Brane