Posted to users@subversion.apache.org by Thomas Singer <su...@smartcvs.com> on 2006/07/06 06:12:46 UTC

Mac OS X: problems adding files with umlauts

Hi,

I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
problems adding files with umlauts in the name.

- I've created a file "Überbau.txt" in the working copy
- first problem: when listing the directory content on the console, the
   file name appears as "U??berbau.txt"
- when I invoke 'svn status' in this directory, I get the following error
   message:
     ~/test tom$ svn status
     subversion/libsvn_subr/utf.c:466: (apr_err=22)
     svn: Can't convert string from native encoding to 'UTF-8':
     subversion/libsvn_subr/utf.c:464: (apr_err=22)
     svn: U?\204?\136berbau.txt
   Why is that? Can't Subversion read every file name?
- ok, after setting LC_ALL, it works (even with the right umlaut!):
     ~/test tom$ export LC_ALL=en_US
     ~/test tom$ svn status
     ?       Überbau.txt
- now I add the file
     ~/test tom$ svn add \303berbau.txt
     A          Überbau.txt
- but when I now invoke 'svn status' again, it shows the same file name
   as missing and unversioned:
     ~/test tom$ svn status
     ?       Überbau.txt
     !       Überbau.txt
   Shouldn't it occur as added? Is this a bug or a user-error?
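
The escaped name in the error message above can be read as raw bytes. The
following is a minimal sketch, assuming the numbers after each "?\" are
decimal byte values (the thread does not confirm this); the class and the
helper codePoints are hypothetical names used only for illustration. Decoded
as UTF-8, "U" followed by the bytes 204 and 136 gives U+0055 plus the
combining diaeresis U+0308, which is the decomposed form of the umlaut.

```java
import java.nio.charset.StandardCharsets;

public class ErrorBytes {
    // Decode a byte sequence as UTF-8 and list its code points as U+XXXX.
    static String codePoints(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'U' followed by the two escaped values from the error message,
        // read as decimal (assumption): 204 = 0xCC, 136 = 0x88.
        byte[] prefix = { 'U', (byte) 204, (byte) 136 };
        System.out.println(codePoints(prefix)); // U+0055 U+0308
    }
}
```

If that reading is right, the file system handed back a perfectly valid UTF-8
name, and the conversion failed only because no UTF-8 locale was in effect.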


--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: Mac OS X: problems adding files with umlauts

Posted by Thomas Singer <su...@smartcvs.com>.
Hi Ryan,

Thanks for answering.

> I think you answered your own question... you need to set LC_ALL (or LANG) first so Subversion knows what character encoding it's working with.

I've read that I need to set the LC_ALL variable, but I don't understand
why this is necessary. Shouldn't it be possible to read any file name?
What happens when the "wrong" value is assigned to LC_ALL? Please note that
we are talking about file *names*, not file *content*. Specifying the
encoding for reading/writing text files in a non-default encoding is
necessary, but why for file *names*?

> I think you're experiencing symptoms of this:
> 
> http://subversion.tigris.org/issues/show_bug.cgi?id=2464 

I think there are two problems here:
1) one needs to assign the "right" value to LC_ALL to make Subversion
    read/write file names with umlauts
2) how to handle the decomposed form of umlauts.

The second problem can easily be resolved by composing file names which
are reported decomposed by the Mac's file system. At least, we use this
strategy to make our CVS client work correctly with umlauts on the Mac.

But solving the second problem seems to me independent of the first
problem. BTW, the JavaSVN library does not exhibit this problem, because
Java seems to always list file names correctly (but decomposed on the
Mac). Since Java has a C core, I assume (despite my limited knowledge of
such low-level C stuff) that reading the file names correctly can be
done in Subversion, too.
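
The composing strategy mentioned above can be sketched in Java as follows.
This is only an illustration under assumptions: it uses java.text.Normalizer
(available from Java 6 on) and Unicode NFC as the "composed" form, and the
class and the helper compose are hypothetical names, not the code of any CVS
or Subversion client.

```java
import java.text.Normalizer;

public class ComposeName {
    // Convert a file name to its composed form (Unicode NFC).
    static String compose(String fileName) {
        return Normalizer.normalize(fileName, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String fromFileSystem = "U\u0308berbau.txt"; // decomposed, as reported on the Mac
        String typedByUser = "\u00DCberbau.txt";     // composed
        System.out.println(fromFileSystem.equals(typedByUser));          // false
        System.out.println(compose(fromFileSystem).equals(typedByUser)); // true
    }
}
```

Comparing names only after composing both sides avoids treating the same
visible name as two different entries.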

--
Best regards
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany


Ryan Schmidt wrote:
> On Jul 6, 2006, at 08:12, Thomas Singer wrote:
> 
>> I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
>> problems adding files with umlauts in the name.
>>
>> - I've created a file "Überbau.txt" in the working copy
>> - first problem: when listing the directory content on the console, the
>>   file name appears as "U??berbau.txt"
>> - when I invoke 'svn status' in this directory, I get following error
>>   message:
>>     ~/test tom$ svn status
>>     subversion/libsvn_subr/utf.c:466: (apr_err=22)
>>     svn: Can't convert string from native encoding to 'UTF-8':
>>     subversion/libsvn_subr/utf.c:464: (apr_err=22)
>>     svn: U?\204?\136berbau.txt
>>   Why that? Can't Subversion read every file name?
>> - ok, after setting LC_ALL, it works (even with the right umlaut!):
>>     ~/test tom$ export LC_ALL=en_US
>>     ~/test tom$ svn status
>>     ?       Überbau.txt
> 
> I think you answered your own question... you need to set LC_ALL (or 
> LANG) first so Subversion knows what character encoding it's working with.
> 
> 
>> - now I add the file
>>     ~/test tom$ svn add \303berbau.txt
>>     A          Überbau.txt
>> - but when I now invoke 'svn status' again, it shows the same file name
>>   as missing and unversioned:
>>     ~/test tom$ svn status
>>     ?       Überbau.txt
>>     !       Überbau.txt
>>   Shouldn't it occur as added? Is this a bug or a user-error?
> 
> I think you're experiencing symptoms of this:
> 
> http://subversion.tigris.org/issues/show_bug.cgi?id=2464
> 
> I'm not sure what this "\303" is, but I think you're trying to add 
> "Überbau.txt" with a composed "Ü" (U+00DC) while you need to add it 
> decomposed, as a "U" (U+0055) followed by a combining diaeresis "¨" 
> (U+0308), like HFS+ stores it.
> 
> See the two mailing list threads linked in that bug report, and also:
> 
> http://listserv.dartmouth.edu/scripts/wa.exe?A2=ind0503&L=macscrpt&D=1&T=0&P=20432 
> 
> 
> I should note that I have never figured out how to enter non-ASCII 
> characters into the Terminal, so I don't actually know how to do the above.


Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)

Posted by Ulrich Eckhardt <ec...@satorlaser.com>.
On Friday 07 July 2006 11:00, Thomas Singer wrote:
> You are making it too simple: you assume that the file name already _is_
> plain UTF-8.

Indeed, because filenames are supposed to be UTF-8.

> My Java example works as expected: 
>
>    final File dir = new File("file-test");
>    dir.mkdirs();
>    final File file = new File(dir, "invalid\u00FF\u00FE");
>    file.createNewFile();
>    for (String fileName : dir.list()) {
>      System.out.println(fileName);
>    }
>    file.delete();

AFAIK, Java uses UCS-2 or UTF-16 internally. It then has to convert that to the
system's format which, in the case of OSX, is UTF-8. Now, FF and FE are both
valid code points in Unicode (y with diaeresis and thorn), so Java just
encodes them in UTF-8 and everything's fine. C++ is much more direct: it just
passes to the system as the filename whatever it got from the programmer.
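
To make the difference concrete, here is a small sketch of what the Java
string from the example turns into when encoded as UTF-8. The class and the
helper hex are hypothetical names for illustration; whether the JVM uses
exactly this path when it creates the file is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class EncodeName {
    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The two trailing characters of the Java file name from the example.
        byte[] utf8 = "\u00FF\u00FE".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // C3 BF C3 BE
    }
}
```

The raw bytes FF and FE never reach the file system; each character becomes a
valid two-byte UTF-8 sequence instead.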

> > The thing is that, as Wilfredo said and whose attribution you snipped,
> > filenames are UTF-8 _by_ _convention_ and nothing enforces this.
>
> As I understand it, file names are stored *in the repository* as UTF-8 (by
> convention) 

Yes, although this is not a convention but a definition/requirement of 
Subversion. Also, this is validated, i.e. it rejects invalid UTF-8 sequences.


> and the Subversion client needs to enforce the correct encoding 
> from the OS' native file name encoding.

Right. In the case of OSX, Subversion probably assumes the encoding is UTF-8 
(because that is what it should be). If this is already wrong, because some 
program broke with the convention, it can't do much. In said case it only 
sees that the UTF-8 sequence is invalid and bails out with an error message.
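
The kind of check described here can be sketched with a strict decoder. This
is not Subversion's actual validation code, only an illustration in Java;
the class and the helper isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Return true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The raw name from the C++ example and a decomposed U-umlaut prefix.
        byte[] raw = { 'i', 'n', 'v', 'a', 'l', 'i', 'd', (byte) 0xFF, (byte) 0xFE };
        byte[] decomposed = { 'U', (byte) 0xCC, (byte) 0x88 };
        System.out.println(isValidUtf8(raw));        // false
        System.out.println(isValidUtf8(decomposed)); // true
    }
}
```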


> With Java this is no problem, since 
> it does not simply treat characters as bytes and lists the directory
> content correctly (on Mac with decomposed umlauts, but thats another
> problem) and hence can (without setting the LC_ALL variable) convert the
> file name to UTF-8 or what ever encoding you want. If Java can do that
> without setting LC_ALL, it also should be technically possible from C(++).

It is technically and practically possible, but it doesn't happen behind the
scenes like in Java; it requires an active effort. Since C++ mostly doesn't
interpret characters and just passes them on, you need a function that
converts the local encoding of the program (which one that is depends on the
programmer and/or the locale) to the externally specified format before
opening the file.
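
Such a conversion step might look like the sketch below, written in Java to
match the earlier example. ISO-8859-1 stands in for "the local encoding" purely
as an assumption, and the class and the helper toUtf8 are hypothetical names.
The second half shows what happens when the assumed encoding is wrong.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AssumedEncoding {
    // Interpret raw name bytes under an assumed encoding, then re-encode as UTF-8.
    static byte[] toUtf8(byte[] nameBytes, Charset assumed) {
        return new String(nameBytes, assumed).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One byte, 0xDC, is a composed U-umlaut in ISO-8859-1.
        byte[] latin1Name = { (byte) 0xDC, 'b', 'e', 'r' };
        byte[] converted = toUtf8(latin1Name, StandardCharsets.ISO_8859_1);
        System.out.println(converted.length); // 5: the umlaut became two bytes

        // Assuming the wrong encoding silently yields a different name.
        byte[] utf8Name = "\u00DCber".getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8Name, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length()); // 5 characters instead of 4
    }
}
```

A single-byte encoding accepts every byte value, so a wrong assumption
produces no error at all, only a different name.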

In other words, the difference between C++ and Java in this respect is that in
C++ you provide the bytewise representation of the filename and that name is
used without conversion, while in Java you provide a string that is converted
to the filename's bytewise representation according to system requirements.
That said, I wonder how Java would deal with "invalid\uFFFF\uFFFE", as those
two are not allowed for interchange (i.e. filenames or content) according to
Unicode.

Uli

****************************************************
Visit our website at <http://www.domino-printing.com/>
****************************************************
This Email and any files transmitted with it are intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any reading, redistribution, disclosure or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient please contact the sender immediately and delete the material from your computer.

E-mail may be susceptible to data corruption, interception, viruses and unauthorised amendment and Domino UK Limited does not accept liability for any such corruption, interception, viruses or amendment or their consequences.
****************************************************


Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)

Posted by Thomas Singer <su...@smartcvs.com>.
Hi Ulrich,

You are making it too simple: you assume that the file name already _is_ 
plain UTF-8. My Java example works as expected:

   final File dir = new File("file-test");
   dir.mkdirs();
   final File file = new File(dir, "invalid\u00FF\u00FE");
   file.createNewFile();
   for (String fileName : dir.list()) {
     System.out.println(fileName);
   }
   file.delete();

> The thing is that, as Wilfredo said and whose attribution you snipped, 
> filenames are UTF-8 _by_ _convention_ and nothing enforces this.

As I understand it, file names are stored *in the repository* as UTF-8 (by
convention) and the Subversion client needs to enforce the correct encoding
from the OS's native file name encoding. With Java this is no problem, since
it does not simply treat characters as bytes and lists the directory content
correctly (on Mac with decomposed umlauts, but that's another problem) and
hence can (without setting the LC_ALL variable) convert the file name to
UTF-8 or whatever encoding you want. If Java can do that without setting
LC_ALL, it should also be technically possible from C(++).

--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
www.syntevo.com


Ulrich Eckhardt schrieb:
> On Friday 07 July 2006 09:02, Thomas Singer wrote:
>>> That said, it is possible to write file names containing bytes that can't
>>> decode as UTF-8.
>> I can't believe that. Could you please give an reproducible example?
> 
> C++ code:
> 
> #include <fstream>
> int main() {
>   // the value 0xff and 0xfe must not occur in UTF-8 text
>   char const filename[] = { 'i','n','v','a','l','i','d',0xff,0xfe,'\0' };
>   std::ofstream out(filename);
>   out << "aha!\n";
> }
> 
> The thing is that, as Wilfredo said and whose attribution you snipped, 
> filenames are UTF-8 _by_ _convention_ and nothing enforces this.
> 
> Uli


Re: Mac OS X: problems adding files with umlauts

Posted by Ulrich Eckhardt <ec...@satorlaser.com>.
On Friday 07 July 2006 09:02, Thomas Singer wrote:
> > That said, it is possible to write file names containing bytes that can't
> > decode as UTF-8.
>
> I can't believe that. Could you please give an reproducible example?

C++ code:

#include <fstream>
int main() {
  // the bytes 0xff and 0xfe must not occur in UTF-8 text; they are written
  // as char escapes because 0xff does not fit a signed char in a braced list
  char const filename[] = { 'i','n','v','a','l','i','d','\xff','\xfe','\0' };
  std::ofstream out(filename);
  out << "aha!\n";
}

The thing is that, as Wilfredo said and whose attribution you snipped, 
filenames are UTF-8 _by_ _convention_ and nothing enforces this.

Uli


Re: Mac OS X: problems adding files with umlauts

Posted by Wilfredo Sánchez Vega <ws...@wsanchez.net>.
  My point was that LC_ALL is *not* relevant to decoding filenames,  
because environment variables have nothing to do with how filenames  
were encoded.

   And I was saying that on Mac OS X, one can reasonably expect files  
to be encoded in UTF-8, because that is what Apple tells developers  
to do, and most comply.  On most Unix systems, there is no way to  
know what encoding was used for a filename, and most developers  
assume they can only (safely) use 7-bit ASCII for decoding; all other  
characters are typically considered "unprintable".  But on Mac OS X,
UTF-8 is the recommended encoding.

   However, there exist byte sequences which are not valid UTF-8  
strings, and yet it is possible to name a file with such a byte  
sequence.  In that case, an attempt to decode the filename assuming a  
UTF-8 encoding will fail.  I would not expect that to happen with any  
file that a user gives a name to, but such a situation may happen if  
software is generating filenames (eg. using some internal  
identifier), since using UTF-8 in filenames isn't an enforced  
requirement on most filesystems.

	-wsv


On Jul 7, 2006, at 12:02 AM, Thomas Singer wrote:

>> That said, it is possible to write file names containing bytes  
>> that can't decode as UTF-8.
>
> I can't believe that. Could you please give an reproducible example?
>
>> I think LC_ALL is relevant to what the encoding of svn's output  
>> should be.
>
> I'm sure, you mixed here two things: the file names and the output.  
> File names should be always convertible to a general character  
> representation like UTF-8. Displaying the file names with the right  
> sign in the output is a different issue and might depend on the  
> used font.
>
> If you think, LC_ALL should be relevant for the file name detection  
> in Subversion, could you give answers for the following questions:
> - What LC_ALL-value the user should set?
> - What should happen when the wrong value was set?
> - What value to set for file names in different languages?


Re: Mac OS X: problems adding files with umlauts

Posted by Thomas Singer <su...@smartcvs.com>.
> That said, it is possible to write file names containing bytes that can't decode as UTF-8.

I can't believe that. Could you please give a reproducible example?

> I think LC_ALL is relevant to what the encoding of svn's output should be.

I'm sure you are mixing two things here: the file names and the output. File
names should always be convertible to a general character representation
like UTF-8. Displaying the file names with the right glyph in the output is a
different issue and might depend on the font used.

If you think LC_ALL should be relevant for the file name detection in
Subversion, could you answer the following questions:
- What LC_ALL value should the user set?
- What should happen when the wrong value is set?
- What value should be set for file names in different languages?

--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
www.syntevo.com


Wilfredo Sánchez Vega schrieb:
> On Jul 6, 2006, at 1:58 AM, Ryan Schmidt wrote:
> 
>> On Jul 6, 2006, at 08:12, Thomas Singer wrote:
>>
>>> I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
>>> problems adding files with umlauts in the name.
>>>
>>> - I've created a file "Überbau.txt" in the working copy
>>> - first problem: when listing the directory content on the console, the
>>>   file name appears as "U??berbau.txt"
>>> - when I invoke 'svn status' in this directory, I get following error
>>>   message:
>>>     ~/test tom$ svn status
>>>     subversion/libsvn_subr/utf.c:466: (apr_err=22)
>>>     svn: Can't convert string from native encoding to 'UTF-8':
>>>     subversion/libsvn_subr/utf.c:464: (apr_err=22)
>>>     svn: U?\204?\136berbau.txt
>>>   Why that? Can't Subversion read every file name?
>>> - ok, after setting LC_ALL, it works (even with the right umlaut!):
>>>     ~/test tom$ export LC_ALL=en_US
>>>     ~/test tom$ svn status
>>>     ?       Überbau.txt
>>
>> I think you answered your own question... you need to set LC_ALL (or 
>> LANG) first so Subversion knows what character encoding it's working 
>> with.
> 
>   Actually, on Mac OS X all file names are, by convention, encoded as 
> UTF-8, so the svn client should be able to decode file names without 
> LC_ALL, which really has nothing to do with file names.  That said, it 
> is possible to write file names containing bytes that can't decode as 
> UTF-8.  In that situation, you're somewhat SOL.
> 
>   I don't know if other OS's specify an assumed encoding for file names.
> 
>   That said, I think LC_ALL is relevant to what the encoding of svn's 
> output should be.
> 
>     -wsv
> 


Re: Mac OS X: problems adding files with umlauts

Posted by Wilfredo Sánchez Vega <ws...@wsanchez.net>.
On Jul 6, 2006, at 1:58 AM, Ryan Schmidt wrote:

> On Jul 6, 2006, at 08:12, Thomas Singer wrote:
>
>> I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
>> problems adding files with umlauts in the name.
>>
>> - I've created a file "Überbau.txt" in the working copy
>> - first problem: when listing the directory content on the  
>> console, the
>>   file name appears as "U??berbau.txt"
>> - when I invoke 'svn status' in this directory, I get following error
>>   message:
>>     ~/test tom$ svn status
>>     subversion/libsvn_subr/utf.c:466: (apr_err=22)
>>     svn: Can't convert string from native encoding to 'UTF-8':
>>     subversion/libsvn_subr/utf.c:464: (apr_err=22)
>>     svn: U?\204?\136berbau.txt
>>   Why that? Can't Subversion read every file name?
>> - ok, after setting LC_ALL, it works (even with the right umlaut!):
>>     ~/test tom$ export LC_ALL=en_US
>>     ~/test tom$ svn status
>>     ?       Überbau.txt
>
> I think you answered your own question... you need to set LC_ALL  
> (or LANG) first so Subversion knows what character encoding it's  
> working with.

   Actually, on Mac OS X all file names are, by convention, encoded  
as UTF-8, so the svn client should be able to decode file names  
without LC_ALL, which really has nothing to do with file names.  That  
said, it is possible to write file names containing bytes that can't  
decode as UTF-8.  In that situation, you're somewhat SOL.

   I don't know if other OS's specify an assumed encoding for file  
names.

   That said, I think LC_ALL is relevant to what the encoding of  
svn's output should be.

	-wsv



Re: Mac OS X: problems adding files with umlauts

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Jul 6, 2006, at 08:12, Thomas Singer wrote:

> I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
> problems adding files with umlauts in the name.
>
> - I've created a file "Überbau.txt" in the working copy
> - first problem: when listing the directory content on the console,  
> the
>   file name appears as "U??berbau.txt"
> - when I invoke 'svn status' in this directory, I get following error
>   message:
>     ~/test tom$ svn status
>     subversion/libsvn_subr/utf.c:466: (apr_err=22)
>     svn: Can't convert string from native encoding to 'UTF-8':
>     subversion/libsvn_subr/utf.c:464: (apr_err=22)
>     svn: U?\204?\136berbau.txt
>   Why that? Can't Subversion read every file name?
> - ok, after setting LC_ALL, it works (even with the right umlaut!):
>     ~/test tom$ export LC_ALL=en_US
>     ~/test tom$ svn status
>     ?       Überbau.txt

I think you answered your own question... you need to set LC_ALL (or  
LANG) first so Subversion knows what character encoding it's working  
with.


> - now I add the file
>     ~/test tom$ svn add \303berbau.txt
>     A          Überbau.txt
> - but when I now invoke 'svn status' again, it shows the same file  
> name
>   as missing and unversioned:
>     ~/test tom$ svn status
>     ?       Überbau.txt
>     !       Überbau.txt
>   Shouldn't it occur as added? Is this a bug or a user-error?

I think you're experiencing symptoms of this:

http://subversion.tigris.org/issues/show_bug.cgi?id=2464

I'm not sure what this "\303" is, but I think you're trying to add
"Überbau.txt" with a composed "Ü" (U+00DC) while you need to add it
decomposed, as a "U" (U+0055) followed by a combining diaeresis "¨"
(U+0308), like HFS+ stores it.
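
One plausible reading of the "\303" (an assumption, the thread does not
settle it) is that the shell displayed a non-ASCII byte in octal. The sketch
below prints the UTF-8 bytes of both forms as octal escapes; the class and
the helper octal are hypothetical names for illustration.

```java
import java.nio.charset.StandardCharsets;

public class OctalBytes {
    // Render the UTF-8 encoding of a string as backslash-octal escapes.
    static String octal(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\%03o", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(octal("\u00DC"));  // \303\234 (composed)
        System.out.println(octal("U\u0308")); // \125\314\210 (decomposed)
    }
}
```

Octal 303 is the first byte of the composed form, which would fit the idea
that the name was typed composed while the file system stores it decomposed.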

See the two mailing list threads linked in that bug report, and also:

http://listserv.dartmouth.edu/scripts/wa.exe?A2=ind0503&L=macscrpt&D=1&T=0&P=20432

I should note that I have never figured out how to enter non-ASCII  
characters into the Terminal, so I don't actually know how to do the  
above.



