
Posted to dev@subversion.apache.org by Martin Hauner <ma...@gmx.net> on 2006/04/22 17:40:44 UTC

MacOSX filename encoding issue

Hi,

while fixing "svn: Can't convert string from native encoding to 'UTF-8':"
errors in subcommander when using filenames with extended characters on
MacOSX I noticed some strange behaviour that is reproducible with the
svn command line tool (1.3.0).

The first thing I have to do is set LANG so that svn works at all.
Without it, svn complains with the above error.

setlocale(LC_ALL, "") doesn't seem to work on MacOSX if LANG isn't set.

First I'm using DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
(utf16: 278A, utf8: E2 9E 8A)


$ svn mkdir ➊
A         ➊

$ svn st
A      ➊

This is as expected. Now another character, the German umlaut ö.

ö  (utf16: 00F6, utf8: C3 B6)


$ svn mkdir ö
A         ö
$ svn st
?      ö
!      ö

This is unexpected. It looks like that status gets a different filename
when it reads the dir and thinks that the new dir is missing and that
there is an unversioned item of the same name.
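
A minimal sketch (Python, not svn code) of how such a double entry can
arise, assuming status compares the name from the entries file with the
name from the directory listing code point by code point. The `status`
helper and its two sets are hypothetical:

```python
import unicodedata

# Name as recorded in the entries file: precomposed U+00F6.
versioned = {"\u00f6"}
# Name as handed back by the directory listing: "o" + U+0308.
on_disk = {"o\u0308"}

def status(versioned, on_disk, normalize=False):
    """Return sorted (code, name) pairs; names are compared by code points."""
    if normalize:
        versioned = {unicodedata.normalize("NFC", n) for n in versioned}
        on_disk = {unicodedata.normalize("NFC", n) for n in on_disk}
    report = [("?", n) for n in on_disk - versioned]   # unversioned
    report += [("!", n) for n in versioned - on_disk]  # missing
    return sorted(report)

raw = status(versioned, on_disk)
fixed = status(versioned, on_disk, normalize=True)
print(raw)    # one "?" and one "!" entry that render identically
print(fixed)  # empty once both sides are normalized to the same form
```

Both names render as ö, but they only compare equal once they are
brought to the same normalization form.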

The entries file in .svn looks good.

Looking at the output of ll -B (works only with LANG unset) shows that
svn is really getting something different:

drwxr-xr-x   3 hauner  hauner  102 Apr 22 15:30 o\314\210
drwxr-xr-x   3 hauner  hauner  102 Apr 22 18:11 \342\236\212

The second line is the digit one, and converting the numbers to hex
delivers its utf8 code. What should be the ö is something different
(o + cc 88, where cc 88 is a character with two dots: COMBINING DIAERESIS).
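
The conversion can be checked with a short script. This is only a
sketch: `decode_octal_escapes` is a hypothetical helper written for this
mail, and it assumes the listing uses three-digit octal escapes as shown
above.

```python
import re
import unicodedata

def decode_octal_escapes(shown: str) -> bytes:
    """Turn listing output such as r'o\314\210' back into raw bytes."""
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-3][0-7]{2})", shown):
        out += shown[pos:match.start()].encode("ascii")
        out.append(int(match.group(1), 8))  # octal escape -> one byte
        pos = match.end()
    out += shown[pos:].encode("ascii")
    return bytes(out)

for shown in (r"o\314\210", r"\342\236\212"):
    raw = decode_octal_escapes(shown)
    text = raw.decode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in raw)
    print(hex_bytes, [unicodedata.name(c) for c in text])
```

The first name decodes to LATIN SMALL LETTER O followed by COMBINING
DIAERESIS, the second to the single dingbat character.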

I'm no unicode expert but i guess a 100% unicode compatible program
(for example a text editor) would combine the o with COMBINING DIAERESIS
to display it as a single ö character?

Now the question is (assuming my analysis is correct) whether it is
possible to work around this strange behaviour of the Mac filesystem?

It would be nice if there were a combining aware utf8strcmp that could
be used by svn. I don't know how hard it would be to write such a
function.


-- 
Martin

Subcommander, http://subcommander.tigris.org
a cross platform Win32/Unix/MacOSX subversion GUI client & diff/merge tool.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: MacOSX filename encoding issue

Posted by Martin Hauner <ma...@gmx.net>.

Jesper Steen Møller wrote:
> Martin Hauner wrote:
>> Hi,
>>
>> while fixing "svn: Can't convert string from native encoding to 'UTF-8':"
>> errors in subcommander when using filenames with extended characters on
>> MacOSX I noticed some strange behaviour that is reproducible with the
>> svn command line tool (1.3.0).
 >>[..]
>> First I'm using DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
>> (utf16: 278A, utf8: E2 9E 8A)
>>
>>
>> $ svn mkdir ➊
>> A         ➊
>>
>> $ svn st
>> A      ➊
>>
>> This is as expected, now another character, the german umlaut ö.
>>
>> ö  (utf16: 00F6, utf8: C3 B6)
> In Unicode lingo, this is "precomposed".
 >
>> $ svn mkdir ö
>> A         ö
>> $ svn st
>> ?      ö
>> !      ö
>>
>> This is unexpected. It looks like that status gets a different filename
>> when it reads the dir and thinks that the new dir is missing and that
>> there is an unversioned item of the same name.
> A normal user would head straight for the cupboard with the heavy-duty 
> aspirin.

:)

>> Then entries file in .svn looks good.
> By good, you mean precomposed?

Yes, a single precomposed ö entry.

>> Looking at the output of ll -B (works only with LANG unset) shows that
>> svn is really getting something different:
>>
>> drwxr-xr-x   3 hauner  hauner  102 Apr 22 15:30 o\314\210
>> drwxr-xr-x   3 hauner  hauner  102 Apr 22 18:11 \342\236\212
>>
>> the second line is digit one and converting the numbers to hex delivers
>> its utf8 code. What should be the ö is something different (o + cc 88,
>> where cc 88 is a character with two dots: COMBINING DIAERESIS).
> This is your umlaut ö "decomposed". File systems on OSX are expected to 
> do this (I know very little OSX stuff, but stumbled upon this: 
> <http://developer.apple.com/qa/qa2001/qa1173.html>) This is NFD 
> (normalization form "decomposed", as opposed to NFC, C for "composed").
> There is also NFKD and NFKC which adds "kompatibility" into the mix, for 
> things like ligatures (whether fi and ff are single glyphs or not).

Oh my... this sounds complicated.

And the page also says "Converting between precomposed and decomposed
Unicode text is a complicated process...". ;)

 >[..]
>> I'm no unicode expert but i guess a 100% unicode compatible program
>> (for example a text editor) would combine the o with COMBINING DIAERESIS
>> to display it as a single ö character?
 >
> True. There is a three level system of compliance, dealing with how 
> combining characters are used. In a way, Subversion supports it all (by 
> storing full UTF-8), but it doesn't deal with normalization as you've 
> discovered.
>> Now the question is (assuming my analysis is correct) if it is possible
>> to workaround this strange behaviour of the Mac filesystem?
 >
> As you correctly suggest, yes: By normalizing before comparing.
 >
>> It would be nice if there were a combining aware utf8strcmp that could
>> be used by svn. I don't know how hard it would be to write such a
>> function.
> It is probably easiest to convert to the same normalization form, and 
> then compare codepoints (binary). I would go composing rather than 
> decomposing, since you can optimize the operation by scanning the 
> codepoints for combining characters and only do the composition if any 
> are found. I'd probably avoid using the compatibility normalization 
> forms since they lose information (e.g. superscript 2 -> 2)...

The link above points to another link that mentions a system function
CFStringNormalize that can convert decomposed to composed.

> Libraries to do normalization already exist:
> 
> There's IBM's ICU for C: <http://icu.sourceforge.net/> (X license)
> There's UCData <http://crl.nmsu.edu/~mleisher/ucdata.html> ("freeware")
> 
> See also the Unicode Howto, 
> <http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html> and Markus Kuhn's 
> excellent UTF-8 and Unicode FAQ 
> <http://www.cl.cam.ac.uk/~mgk25/unicode.html>.

Thanks for your info. It removed some confusion and added new confusion
at the same time ;-)

Anyway, i think it would be nice if this could be handled by subversion
(or is it an apr issue?) because the 'ö' is a normal german character on
the keyboard, not some magic special character that's never used in
filenames. And there are probably a lot of other characters around
the non-english world which cause the same problem.



-- 
Martin

Subcommander, http://subcommander.tigris.org
a cross platform Win32/Unix/MacOSX subversion GUI client & diff/merge tool.



Re: MacOSX filename encoding issue

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.

Peter Samuelson writes:
 > 
 > [Jesper Steen Møller]
 > Anyway, the problem here is that subversion normalises filenames (and
 > other input) to UTF-8 but not to a specific normalisation form.
 > Assuming that every user of a given repository will use the same
 > normalisation form is conceptually not much better than assuming
 > they'll all use the same character set.

Agreed:-(  Maybe it is "better" in the sense that it doesn't cause
problems as often as character encoding ones, but it is still a time bomb...

 > You guys should pick one and start enforcing it.  NFD is more elegant,
 > but NFC is more efficient and probably more widely used today, so
 > that's what I'd suggest using.

Just FWIW, we already required NFD in the svn_fs.h API documentation,
but we don't check or enforce it anywhere, so it is just of
theoretical interest.  It might be a reason to choose that form, though.

 > Would the following be too small for a Summer of Code project?
 > 
 > - Autoconfage to look for and use libicu if available

I don't know if libicu is the best, since I know nothing about it, so
we might want to leave this choice open for further suggestions.

 > - When converting user input to utf-8, also normalise it to NFC
 > 
 > - Arrange for compatibility with existing repositories full of
 >   non-normalised filenames.  Probably by storing new data as NFC but
 >   normalising old filenames read in by libsvn_ra.  As part of this,
 >   investigate whether any common tools will produce spurious noise
 >   when looking at a repo whose NF suddenly changed one day.

Given this last one with all the compatibility stuff to verify, I
think this is a reasonable SoC project.  And I really want this to
happen, because it is the wrong decade to struggle with encoding
problems like we do today... I mean, in the less common cases.

Much of the groundwork for this has been laid earlier, because we
make sure to canonicalize paths everywhere on input, and we have
special routines to convert paths to/from the system's encoding.

Regards,
//Peter


Re: MacOSX filename encoding issue

Posted by Peter Samuelson <pe...@p12n.org>.

[Jesper Steen Møller]
> This is your umlaut ö "decomposed". File systems on OSX are expected
> to do this (I know very little OSX stuff, but stumbled upon this:
> <http://developer.apple.com/qa/qa2001/qa1173.html>) This is NFD
> (normalization form "decomposed", as opposed to NFC, C for
> "composed").  There is also NFKD and NFKC which adds "kompatibility"
> into the mix, for things like ligatures (whether fi and ff are single
> glyphs or not).

Right, we can ignore NFKC and NFKD.

Anyway, the problem here is that subversion normalises filenames (and
other input) to UTF-8 but not to a specific normalisation form.
Assuming that every user of a given repository will use the same
normalisation form is conceptually not much better than assuming
they'll all use the same character set.

You guys should pick one and start enforcing it.  NFD is more elegant,
but NFC is more efficient and probably more widely used today, so
that's what I'd suggest using.

Would the following be too small for a Summer of Code project?

- Autoconfage to look for and use libicu if available

- When converting user input to utf-8, also normalise it to NFC

- Arrange for compatibility with existing repositories full of
  non-normalised filenames.  Probably by storing new data as NFC but
  normalising old filenames read in by libsvn_ra.  As part of this,
  investigate whether any common tools will produce spurious noise
  when looking at a repo whose NF suddenly changed one day.
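
A rough sketch of the last two items, in Python rather than C and with
the standard `unicodedata` module standing in for libicu. The helper
names `to_internal` and `find_stored` are made up for illustration and
are not Subversion APIs:

```python
import unicodedata

def to_internal(native: bytes, encoding: str) -> str:
    """Decode user input from the native encoding and normalise to NFC."""
    return unicodedata.normalize("NFC", native.decode(encoding))

def find_stored(name: str, stored_names):
    """Return the stored spelling of name, tolerating old entries
    that were written before any normalisation form was enforced."""
    wanted = unicodedata.normalize("NFC", name)
    for stored in stored_names:
        if unicodedata.normalize("NFC", stored) == wanted:
            return stored
    return None

# An "old" repository that contains a decomposed name.
old_repo = ["o\u0308", "trunk"]
# New input arrives as Latin-1 and becomes precomposed U+00F6.
new_name = to_internal(b"\xf6", "latin-1")
match = find_stored(new_name, old_repo)
print(repr(new_name), repr(match))
```

New data is normalised once on input, while old names are only
normalised when they are read for comparison, so nothing already stored
has to be rewritten.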

Re: MacOSX filename encoding issue

Posted by Jesper Steen Møller <je...@selskabet.org>.

Martin Hauner wrote:
> Hi,
>
> while fixing "svn: Can't convert string from native encoding to 'UTF-8':"
> errors in subcommander when using filenames with extended characters on
> MacOSX I noticed some strange behaviour that is reproducible with the
> svn command line tool (1.3.0).
>
> First thing that i have to do is set LANG so svn works at all. Without
> it svn complains with the above error.
>
> setlocale(LC_ALL, "") doesn't seem to work on MacOSX if LANG isn't set.
>
> First I'm using DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
> (utf16: 278A, utf8: E2 9E 8A)
>
>
> $ svn mkdir ➊
> A         ➊
>
> $ svn st
> A      ➊
>
> This is as expected, now another character, the german umlaut ö.
>
> ö  (utf16: 00F6, utf8: C3 B6)
In Unicode lingo, this is "precomposed".
> $ svn mkdir ö
> A         ö
> $ svn st
> ?      ö
> !      ö
>
> This is unexpected. It looks like that status gets a different filename
> when it reads the dir and thinks that the new dir is missing and that
> there is an unversioned item of the same name.
A normal user would head straight for the cupboard with the heavy-duty 
aspirin.
> Then entries file in .svn looks good.
By good, you mean precomposed?
> Looking at the output of ll -B (works only with LANG unset) shows that
> svn is really getting something different:
>
> drwxr-xr-x   3 hauner  hauner  102 Apr 22 15:30 o\314\210
> drwxr-xr-x   3 hauner  hauner  102 Apr 22 18:11 \342\236\212
>
> the second line is digit one and converting the numbers to hex delivers
> its utf8 code. What should be the ö is something different (o + cc 88,
> where cc 88 is a character with two dots: COMBINING DIAERESIS).
This is your umlaut ö "decomposed". File systems on OSX are expected to 
do this (I know very little OSX stuff, but stumbled upon this: 
<http://developer.apple.com/qa/qa2001/qa1173.html>) This is NFD 
(normalization form "decomposed", as opposed to NFC, C for "composed").
There is also NFKD and NFKC which adds "kompatibility" into the mix, for 
things like ligatures (whether fi and ff are single glyphs or not).
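
To make the four forms concrete, here is a small illustration using
Python's standard `unicodedata` module (chosen only for brevity; it is
not something Subversion uses). The sample characters are my own picks:

```python
import unicodedata

samples = {
    "o-umlaut": "\u00f6",         # precomposed ö
    "fi ligature": "\ufb01",      # single-glyph "fi"
    "superscript two": "\u00b2",
}

for label, text in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        result = unicodedata.normalize(form, text)
        codepoints = " ".join(f"U+{ord(c):04X}" for c in result)
        print(f"{label:16} {form:5} {codepoints}")
```

NFC and NFD only differ in whether the ö is one code point or two, while
the K forms additionally replace the ligature and the superscript with
plain letters and digits.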

The Linux way is to go for NFC, from the unicode man page:

 >Under Linux, in general only the BMP at implementation level 1 should 
be used at the moment. Up to two combining characters per base character 
for certain scripts (in particular Thai) are also supported by some 
UTF-8 terminal emulators and ISO 10646 fonts (level 2), but in general 
precomposed characters should be preferred where available (Unicode 
calls this "Normalization Form C" ).
> I'm no unicode expert but i guess a 100% unicode compatible program
> (for example a text editor) would combine the o with COMBINING DIAERESIS
> to display it as a single ö character?
True. There is a three level system of compliance, dealing with how 
combining characters are used. In a way, Subversion supports it all (by 
storing full UTF-8), but it doesn't deal with normalization as you've 
discovered.
> Now the question is (assuming my analysis is correct) if it is possible
> to workaround this strange behaviour of the Mac filesystem?
As you correctly suggest, yes: By normalizing before comparing.
> It would be nice if there were a combining aware utf8strcmp that could
> be used by svn. I don't know how hard it would be to write such a
> function.
It is probably easiest to convert to the same normalization form, and 
then compare codepoints (binary). I would go composing rather than 
decomposing, since you can optimize the operation by scanning the 
codepoints for combining characters and only do the composition if any 
are found. I'd probably avoid using the compatibility normalization 
forms since they lose information (e.g. superscript 2 -> 2)...
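
A minimal sketch of that comparison, again in Python for brevity. As the
fast path it uses a plain ASCII check instead of the combining-character
scan, which is an assumption on my part: ASCII-only names are always in
NFC already, whereas a scan for combining characters alone would miss
characters such as U+212B ANGSTROM SIGN that NFC replaces without any
combining mark being present. The function names are hypothetical:

```python
import unicodedata

def to_nfc(name: str) -> str:
    """Return name in NFC, skipping the work for pure-ASCII names."""
    if name.isascii():
        return name  # ASCII text is unchanged by NFC
    return unicodedata.normalize("NFC", name)

def same_name(a: str, b: str) -> bool:
    """Compare two names code point by code point after composing."""
    return to_nfc(a) == to_nfc(b)

print(same_name("\u00f6", "o\u0308"))  # precomposed vs decomposed ö
print(same_name("\u00b2", "2"))        # NFC keeps superscript two distinct
```

Because only the canonical form is used, names that differ merely in
compatibility characters stay distinct.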

Libraries to do normalization already exist:

There's IBM's ICU for C: <http://icu.sourceforge.net/> (X license)
There's UCData <http://crl.nmsu.edu/~mleisher/ucdata.html> ("freeware")

See also the Unicode Howto, 
<http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html> and Markus Kuhn's 
excellent UTF-8 and Unicode FAQ 
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>.

I'd imagine a spectrum of

-Jesper

