Posted to users@subversion.apache.org by Sascha Kratky <kr...@unisoftwareplus.com> on 2006/04/21 15:51:17 UTC

UTF-8 conversion error

Hi,

I am using svn 1.3.1 under OS X 10.4.6. Upon running "svn update" I  
get the following error message:

subversion/libsvn_subr/utf.c:466: (apr_err=22)
svn: Can't convert string from 'UTF-8' to native encoding:
subversion/libsvn_subr/utf.c:464: (apr_err=22)
svn: Protokolle/Wien 05_12_19 - ?\195?\150VG.doc

The Word document had been added by a user running Windows XP.

How can I fix the document's name so that "svn update" succeeds?

Thanks,
Sascha
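
A note on the "?\195?\150" in the error message above: it appears to be an escaped rendering of two bytes given as decimal values. The following minimal Python sketch (Python is used purely for illustration and is not part of the original report) decodes those two bytes as UTF-8 to show which character they stand for.

```python
# Decimal byte values copied from the escaped file name in the error message.
escaped_bytes = bytes([195, 150])

# 195 150 is 0xC3 0x96 in hex; decoded as UTF-8 this is a single character.
decoded = escaped_bytes.decode("utf-8")

print(escaped_bytes.hex())          # hex form of the two bytes
print(decoded, hex(ord(decoded)))   # the character and its code point
```

The bytes decode to U+00D6, the composed "Ö", so the name stored in the repository is valid UTF-8.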


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: UTF-8 conversion error

Posted by Balázs Szabó <dl...@dlux.hu>.
Hi,

The defect I opened for this issue is here:
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Regards,

Balázs Szabó (dLux)
http://www.dlux.hu
--   --  - - - -- -




Re: UTF-8 conversion error

Posted by Sascha Kratky <kr...@unisoftwareplus.com>.
On 22.04.2006, at 14:14, Ryan Schmidt wrote:

> On Apr 21, 2006, at 21:03, Aaron Montgomery wrote:
>
>>> I can now check out the file from the subversion repository, but  
>>> when I run "svn status" in the directory where the file resides
>>> I get:
>>>
>>> ?      Wien 05_12_19 - ÖVG.doc
>>> !      Wien 05_12_19 - ÖVG.doc
>>>
>>> The file is reported as both "not under version control" and  
>>> "missing". How can that be?
>>
>> I work on a text editor for Mac OS X and I know that we've had  
>> problems because of the way the system handles decomposed vs. non- 
>> decomposed Unicode characters. It is possible that SVN is  
>> expecting to find a decomposed Ö and you've got a non-decomposed
>> Ö in the name of the file sitting in the directory. I'm not sure  
>> the best way to handle this. Mac OS is not very well behaved since  
>> its decision on how to represent UTF-8 is not the standard one (I  
>> think the standard says that you should use the shortest encoding  
>> and Mac OS X prefers to always use decomposed characters, but I'm  
>> not really sure). Possibly setting everything to ISO-8859 might  
>> solve this problem.
>
> Yes, we had an extensive thread on this problem in December:
>
> http://svn.haxx.se/users/archive-2005-12/0191.shtml
>
> To summarize:
>
> * Accented and umlauted characters have multiple valid  
> representations in UTF-8: "composed" (for example LATIN CAPITAL  
> LETTER O WITH DIAERESIS (U+00D6)) and "decomposed" (LATIN CAPITAL  
> LETTER O (U+004F) followed by COMBINING DIAERESIS (U+0308)).
>
> * The Mac's usual HFS+ filesystem canonicalizes UTF-8 strings to  
> the decomposed form.
>
> * The usual Windows and Linux filesystems, and the Subversion
> filesystem, do not canonicalize, meaning, infuriatingly, you can
> have two distinct files in these filesystems whose names both
> display as "Wien 05_12_19 - ÖVG.doc".
>
> * It seems that if you create such a filename on Windows or Linux,  
> you end up with the composed form.
>
> The upshot of all this is that if you create a filename with such  
> characters on Linux or Windows and commit it to a Subversion  
> repository, you cannot use that file if you check out the working  
> copy on Mac OS X. And that bites.

It's a pity that Subversion's support for Unicode file names works in
theory but not in practice. The only way to avoid this problem is to
abstain from using non-ASCII characters in file names.





Re: UTF-8 conversion error

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Apr 21, 2006, at 21:03, Aaron Montgomery wrote:

>> I can now check out the file from the subversion repository, but  
>> when I run "svn status" in the directory where the file resides I
>> get:
>>
>> ?      Wien 05_12_19 - ÖVG.doc
>> !      Wien 05_12_19 - ÖVG.doc
>>
>> The file is reported as both "not under version control" and  
>> "missing". How can that be?
>
> I work on a text editor for Mac OS X and I know that we've had  
> problems because of the way the system handles decomposed vs. non- 
> decomposed Unicode characters. It is possible that SVN is expecting  
> to find a decomposed Ö and you've got a non-decomposed Ö in the
> name of the file sitting in the directory. I'm not sure the best  
> way to handle this. Mac OS is not very well behaved since its  
> decision on how to represent UTF-8 is not the standard one (I think  
> the standard says that you should use the shortest encoding and Mac  
> OS X prefers to always use decomposed characters, but I'm not  
> really sure). Possibly setting everything to ISO-8859 might solve  
> this problem.

Yes, we had an extensive thread on this problem in December:

http://svn.haxx.se/users/archive-2005-12/0191.shtml

To summarize:

* Accented and umlauted characters have multiple valid  
representations in UTF-8: "composed" (for example LATIN CAPITAL  
LETTER O WITH DIAERESIS (U+00D6)) and "decomposed" (LATIN CAPITAL  
LETTER O (U+004F) followed by COMBINING DIAERESIS (U+0308)).

* The Mac's usual HFS+ filesystem canonicalizes UTF-8 strings to the  
decomposed form.

* The usual Windows and Linux filesystems, and the Subversion
filesystem, do not canonicalize, meaning, infuriatingly, you can have
two distinct files in these filesystems whose names both display as
"Wien 05_12_19 - ÖVG.doc".

* It seems that if you create such a filename on Windows or Linux,  
you end up with the composed form.

The upshot of all this is that if you create a filename with such  
characters on Linux or Windows and commit it to a Subversion  
repository, you cannot use that file if you check out the working  
copy on Mac OS X. And that bites.
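
The composed and decomposed forms summarized above can be reproduced in a few lines. This is a minimal Python sketch using the standard unicodedata module; NFC and NFD are the Unicode normalization forms, and NFD is used here only as an approximation of what HFS+ does (an assumption, since the filesystem is said to use its own variant of decomposition).

```python
import unicodedata

composed = "\u00d6"       # LATIN CAPITAL LETTER O WITH DIAERESIS
decomposed = "O\u0308"    # LATIN CAPITAL LETTER O + COMBINING DIAERESIS

# The two strings render alike but are different code point sequences,
# so they also differ byte for byte once encoded as UTF-8.
print(composed == decomposed)              # False
print(composed.encode("utf-8").hex())      # c396
print(decomposed.encode("utf-8").hex())    # 4fcc88

# NFD decomposes and NFC composes; each maps one form onto the other.
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Any tool that compares file names byte for byte will therefore treat the two forms as different names.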


The proof is in the following pudding:

On the Linux machine (Subversion 1.2.3 client and server):

	linux$ mkdir blöd
	linux$ svn import blöd https://server/repo/blöd -m ""
	Committed revision 1.

On the Mac[1] (Subversion 1.3.1 client connecting to Linux 1.2.3  
server):

	mac$ svn co https://server/repo
	A    repo/blöd
	Checked out revision 1.
	mac$ svn st repo
	?      blo¨d
	!      blöd

Note that in my terminal it's even shown that way: the file with  
decomposed characters (the way HFS+ canonicalized it) is unversioned,  
and the file with composed characters (the one Subversion was  
expecting) is missing.
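
The "?" plus "!" pair can be imitated with a toy comparison. The function naive_status below is hypothetical and is not Subversion's code; it only assumes that the expected names and the on-disk names are compared without any normalization, with NFD standing in for the form HFS+ stores.

```python
import unicodedata

def naive_status(expected_names, names_on_disk):
    """Compare names code point by code point, with no normalization."""
    expected, on_disk = set(expected_names), set(names_on_disk)
    lines = [("?", name) for name in sorted(on_disk - expected)]    # unversioned
    lines += [("!", name) for name in sorted(expected - on_disk)]   # missing
    return lines

committed = "bl\u00f6d"                            # composed, as created on Linux
stored = unicodedata.normalize("NFD", committed)   # stand-in for the name on disk

for flag, name in naive_status([committed], [stored]):
    print(flag, "     ", name)
```

Under these assumptions a single directory yields two status lines: the decomposed name is reported as unversioned and the composed name as missing.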


My suggestion would be that Subversion should

* permit only a single form of a filename in the repository, possibly  
canonicalized using stringprep, and

* for operations like "svn status", use stringprep to canonicalize  
filenames provided by the client filesystem before comparing them to  
the (already-stringprepped?) filenames in the files within the .svn  
directory.
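
A rough sketch of the two suggestions follows. It swaps in plain Unicode NFC normalization (Python's unicodedata module) for stringprep, and the helper names canonical, find_collisions and normalized_status are hypothetical; none of this reflects how Subversion is actually implemented.

```python
import unicodedata
from collections import defaultdict

def canonical(name):
    # NFC is picked here as the single permitted form (an assumption).
    return unicodedata.normalize("NFC", name)

def find_collisions(names):
    """Group names that become identical once canonicalized."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

def normalized_status(expected_names, names_on_disk):
    """Return (unversioned, missing) after canonicalizing both sides."""
    expected = {canonical(name) for name in expected_names}
    on_disk = {canonical(name) for name in names_on_disk}
    return sorted(on_disk - expected), sorted(expected - on_disk)

# Repository side: two spellings of the same visible name collide.
print(find_collisions(["bl\u00f6d", "blo\u0308d", "README"]))

# Working copy side: the decomposed on-disk name now matches the composed one.
print(normalized_status(["bl\u00f6d"], ["blo\u0308d"]))
```

The first check is the kind of test a repository could run to reject a second form of an existing name; the second shows the status comparison coming back empty once both sides are canonicalized.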


Balázs Szabó asked if this could be opened as a bug:

http://svn.haxx.se/users/archive-2005-12/0386.shtml

...but nobody answered this question and I cannot see such a bug  
filed. I'll ask it again: anybody have any objection to this being  
finally filed as a bug?


[1] That was with $LANG set to en_US.UTF-8 on the Mac. With $LANG set  
to en_US.ISO8859-1, which is what I usually use, I can't check it out  
at all:

	mac$ svn co https://server/repo
	svn: Can't check path 'repo/blöd': Invalid argument

Separate bug?
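
One possible explanation for the footnote's failure, offered as an assumption rather than something confirmed in this thread: with an ISO-8859-1 locale the path is converted to Latin-1 bytes, and a filesystem that expects UTF-8 file names may reject them. The sketch below only demonstrates that the Latin-1 bytes for "blöd" are not valid UTF-8; is_valid_utf8 is a hypothetical helper.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

name = "bl\u00f6d"
latin1_bytes = name.encode("latin-1")   # what an ISO-8859-1 locale would produce
utf8_bytes = name.encode("utf-8")       # what a UTF-8 locale would produce

print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

If that is indeed the cause, it would be a different failure from the composed versus decomposed mismatch, which fits the "separate bug" guess.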





Re: UTF-8 conversion error

Posted by Aaron Montgomery <ee...@monsterworks.com>.
On Apr 21, 2006, at 11:05 AM, Sascha Kratky wrote:

> Ryan,
>
> I didn't have the environment variable $LANG set. I followed your  
> suggestion and added the LANG=en_US.UTF-8 setting.
> I can now check out the file from the Subversion repository, but
> when I run "svn status" in the directory where the file resides I
> get:
>
> ?      Wien 05_12_19 - ÖVG.doc
> !      Wien 05_12_19 - ÖVG.doc
>
> The file is reported as both "not under version control" and  
> "missing". How can that be?
>
> Thanks,
> Sascha
>

I work on a text editor for Mac OS X and I know that we've had  
problems because of the way the system handles decomposed vs. non- 
decomposed Unicode characters. It is possible that SVN is expecting  
to find a decomposed Ö and you've got a non-decomposed Ö in the name
of the file sitting in the directory. I'm not sure the best way to  
handle this. Mac OS is not very well behaved since its decision on  
how to represent UTF-8 is not the standard one (I think the standard  
says that you should use the shortest encoding and Mac OS X prefers  
to always use decomposed characters, but I'm not really sure).  
Possibly setting everything to ISO-8859 might solve this problem.

Good luck, and please post to the list if you find the root of the
problem and a solution. Although this hasn't bitten us yet, it might
at some point.
Aaron



Re: UTF-8 conversion error

Posted by Sascha Kratky <kr...@unisoftwareplus.com>.
Ryan,

I didn't have the environment variable $LANG set. I followed your
suggestion and added the LANG=en_US.UTF-8 setting.
I can now check out the file from the Subversion repository, but when
I run "svn status" in the directory where the file resides I get:

?      Wien 05_12_19 - ÖVG.doc
!      Wien 05_12_19 - ÖVG.doc

The file is reported as both "not under version control" and  
"missing". How can that be?

Thanks,
Sascha

On 21.04.2006, at 19:18, Ryan Schmidt wrote:

> On Apr 21, 2006, at 17:51, Sascha Kratky wrote:
>
>> I am using svn 1.3.1 under OS X 10.4.6. Upon running "svn update"  
>> I get the following error message:
>>
>> subversion/libsvn_subr/utf.c:466: (apr_err=22)
>> svn: Can't convert string from 'UTF-8' to native encoding:
>> subversion/libsvn_subr/utf.c:464: (apr_err=22)
>> svn: Protokolle/Wien 05_12_19 - ?\195?\150VG.doc
>>
>> The Word document had been added by a user running Windows XP.
>>
>> How can I fix the document's name so that "svn update" succeeds?
>
> I don't think the document's name is broken; I think Subversion
> just doesn't know what your local encoding is. Have you set the
> $LANG variable? For example, I have this in my ~/.bashrc on my Mac  
> OS X 10.4.6 machine:
>
> 	export LANG=en_US.ISO8859-1
>
> And I also have my Terminal set to ISO-8859-1. If you use the  
> default UTF-8 Terminal encoding, you may prefer:
>
> 	export LANG=en_US.UTF-8
>
> Listing the directory /usr/share/locale will show you what locale  
> choices are available to you on your system.
>
>
>




Re: UTF-8 conversion error

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Apr 21, 2006, at 17:51, Sascha Kratky wrote:

> I am using svn 1.3.1 under OS X 10.4.6. Upon running "svn update" I  
> get the following error message:
>
> subversion/libsvn_subr/utf.c:466: (apr_err=22)
> svn: Can't convert string from 'UTF-8' to native encoding:
> subversion/libsvn_subr/utf.c:464: (apr_err=22)
> svn: Protokolle/Wien 05_12_19 - ?\195?\150VG.doc
>
> The Word document had been added by a user running Windows XP.
>
> How can I fix the document's name so that "svn update" succeeds?

I don't think the document's name is broken; I think Subversion just
doesn't know what your local encoding is. Have you set the $LANG
variable? For example, I have this in my ~/.bashrc on my Mac OS X  
10.4.6 machine:

	export LANG=en_US.ISO8859-1

And I also have my Terminal set to ISO-8859-1. If you use the default  
UTF-8 Terminal encoding, you may prefer:

	export LANG=en_US.UTF-8

Listing the directory /usr/share/locale will show you what locale  
choices are available to you on your system.
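
As a rough illustration of why the locale matters, the Python sketch below checks whether the file name from the error message can be represented in a few candidate "native" encodings. It assumes that with no $LANG set the native encoding is effectively plain ASCII; convertible is a hypothetical helper, not a Subversion function.

```python
name = "Wien 05_12_19 - \u00d6VG.doc"

def convertible(text, native_encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(native_encoding)
        return True
    except UnicodeEncodeError:
        return False

# Try the encodings implied by no locale, by ISO8859-1 and by UTF-8.
for encoding in ("ascii", "iso-8859-1", "utf-8"):
    print(encoding, convertible(name, encoding))
```

Only the ASCII case fails, which is consistent with the conversion error going away once $LANG names an encoding that contains "Ö".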



