Posted to users@subversion.apache.org by Thomas Singer <su...@smartcvs.com> on 2006/07/06 06:12:46 UTC
Mac OS X: problems adding files with umlauts
Hi,
I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
problems adding files with umlauts in the name.
- I've created a file "Überbau.txt" in the working copy
- first problem: when listing the directory content on the console, the
file name appears as "U??berbau.txt"
- when I invoke 'svn status' in this directory, I get the following error
message:
~/test tom$ svn status
subversion/libsvn_subr/utf.c:466: (apr_err=22)
svn: Can't convert string from native encoding to 'UTF-8':
subversion/libsvn_subr/utf.c:464: (apr_err=22)
svn: U?\204?\136berbau.txt
Why is that? Can't Subversion read every file name?
- ok, after setting LC_ALL, it works (even with the right umlaut!):
~/test tom$ export LC_ALL=en_US
~/test tom$ svn status
? Überbau.txt
- now I add the file
~/test tom$ svn add \303berbau.txt
A Überbau.txt
- but when I now invoke 'svn status' again, it shows the same file name
as missing and unversioned:
~/test tom$ svn status
? Überbau.txt
! Überbau.txt
Shouldn't it show up as added? Is this a bug or a user error?
--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org
Re: Mac OS X: problems adding files with umlauts
Posted by Thomas Singer <su...@smartcvs.com>.
Hi Ryan,
Thanks for answering.
> I think you answered your own question... you need to set LC_ALL (or LANG) first so Subversion knows what character encoding it's working with.
I've read that I need to set the LC_ALL variable, but I don't understand
why this is necessary. Shouldn't it be possible to read any file name?
What happens when LC_ALL is set to the "wrong" value? Please note that
we are talking about file *names*, not file *content*. Specifying the
encoding for reading/writing text files in a non-default encoding is
necessary, but why for file *names*?
> I think you're experiencing symptoms of this:
>
> http://subversion.tigris.org/issues/show_bug.cgi?id=2464
I think there are two problems here:
1) one needs to specify the "right" value for LC_ALL to make Subversion
read/write file names with umlauts
2) how to handle the decomposed form of umlauts.
The second problem can easily be resolved by composing file names that
the Mac's file system reports in decomposed form. At least, we use this
strategy to make our CVS client work correctly with umlauts on the Mac.
But solving the second problem seems to me independent of the first
problem. BTW, the JavaSVN library does not exhibit this problem, because
Java seems to always list file names correctly (but decomposed on the
Mac). Since Java has a C core, I assume (despite my limited knowledge of
such low-level C stuff) that reading the file names correctly can be
done in Subversion, too.
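The composing strategy described above can be sketched in Java roughly as follows. This is only an illustration, not the code of the CVS client or of JavaSVN: the class and method names are made up, and it assumes that standard Unicode NFC composition (java.text.Normalizer) is close enough to what the file system's decomposition requires.

```java
import java.text.Normalizer;

final class ComposedNames {
    // Returns the canonically composed (NFC) form of a file name.
    static String compose(String fileName) {
        return Normalizer.normalize(fileName, Normalizer.Form.NFC);
    }

    // True when two names differ at most in composed vs. decomposed form.
    static boolean sameName(String a, String b) {
        return compose(a).equals(compose(b));
    }

    public static void main(String[] args) {
        String fromDisk = "U\u0308berbau.txt"; // decomposed: U + combining diaeresis
        String fromUser = "\u00DCberbau.txt";  // composed: single code point U+00DC
        System.out.println(fromDisk.equals(fromUser));    // false
        System.out.println(sameName(fromDisk, fromUser)); // true
    }
}
```

Comparing names only after composing both sides is what keeps a single file from being seen as two different entries.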
--
Best regards
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)
Posted by Ulrich Eckhardt <ec...@satorlaser.com>.
On Friday 07 July 2006 11:00, Thomas Singer wrote:
> You are making it too simple: you assume that the file name already _is_
> plain UTF-8.
Indeed, because filenames are supposed to be UTF-8.
> My Java example works as expected:
>
> final File dir = new File("file-test");
> dir.mkdirs();
> final File file = new File(dir, "invalid\u00FF\u00FE");
> file.createNewFile();
> for (String fileName : dir.list()) {
> System.out.println(fileName);
> }
> file.delete();
AFAIK, Java uses UCS-2 or UTF-16 internally. It then has to convert that to the
system's format which, in the case of OS X, is UTF-8. Now, FF and FE are both
valid code points in Unicode (y with diaeresis and thorn), so Java just
encodes them in UTF-8 and everything's fine. C++ is much more direct: it just
passes to the system as the filename whatever bytes it got from the programmer.
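To make the conversion step concrete, here is a small sketch that prints the UTF-8 bytes Java produces for the two characters in question. The class and helper names are invented for illustration. It shows only the string-to-UTF-8 encoding, not what any particular file system call does with the result.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    // Formats the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+00FF and U+00FE each become a two-byte sequence.
        System.out.println(utf8Hex("\u00FF\u00FE")); // c3 bf c3 be
    }
}
```

The raw bytes 0xff and 0xfe never appear in the encoded name, which is why the Java test and the C++ test are not creating the same file.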
> > The thing is that, as Wilfredo said and whose attribution you snipped,
> > filenames are UTF-8 _by_ _convention_ and nothing enforces this.
>
> As I understand it, file names are stored *in the repository* as UTF-8 (by
> convention)
Yes, although this is not a convention but a definition/requirement of
Subversion. Also, this is validated, i.e. it rejects invalid UTF-8 sequences.
> and the Subversion client needs to enforce the correct encoding
> from the OS' native file name encoding.
Right. In the case of OSX, Subversion probably assumes the encoding is UTF-8
(because that is what it should be). If this is already wrong, because some
program broke with the convention, it can't do much. In said case it only
sees that the UTF-8 sequence is invalid and bails out with an error message.
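The kind of check described here can be sketched with a strict decoder. This is not Subversion's code (which is C); it is a hypothetical Java equivalent, with invented names, that reports malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // True when the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Raw bytes, as the C++ example writes them.
        byte[] raw = {'i', 'n', 'v', 'a', 'l', 'i', 'd', (byte) 0xff, (byte) 0xfe};
        // The same characters after Java has encoded them.
        byte[] encoded = "invalid\u00FF\u00FE".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(raw));     // false
        System.out.println(isValidUtf8(encoded)); // true
    }
}
```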
> With Java this is no problem, since
> it does not simply treat characters as bytes and lists the directory
> content correctly (on Mac with decomposed umlauts, but that's another
> problem) and hence can (without setting the LC_ALL variable) convert the
> file name to UTF-8 or whatever encoding you want. If Java can do that
> without setting LC_ALL, it also should be technically possible from C(++).
It is technically and practically possible, but it doesn't happen behind the
scenes like in Java; it requires an active effort. Since C++ mostly doesn't
interpret characters and just passes them on, you need a function that
converts the local encoding of the program (which one that is depends on the
programmer and/or the locale) to the externally specified format before
opening the file.
In other words, the difference between C++ and Java in this aspect is that in
C++ you provide the bytewise representation of the filename and that name is
used without conversion, while in Java you provide a string that is converted
to the filename's bytewise representation according to system requirements.
That said, I wonder how Java would deal with "invalid\uFFFF\uFFFE" as those
two are not allowed for interchange (i.e. filenames or content) according to
Unicode.
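At least the encoding half of that question can be probed with a short sketch. The names below are invented, and the check says nothing about what a given file system does with such a name. It only shows that Java's own UTF-8 codec treats these two noncharacters like any other code point.

```java
import java.nio.charset.StandardCharsets;

final class Noncharacters {
    // True for the 66 code points that Unicode designates as noncharacters.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Encodes to UTF-8 and decodes again; true when the text survives unchanged.
    static boolean roundTrips(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String name = "invalid\uFFFF\uFFFE";
        System.out.println(isNoncharacter(name.codePointAt(7))); // true
        System.out.println(roundTrips(name));                    // true
    }
}
```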
Uli
****************************************************
Visit our website at <http://www.domino-printing.com/>
****************************************************
This Email and any files transmitted with it are intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any reading, redistribution, disclosure or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient please contact the sender immediately and delete the material from your computer.
E-mail may be susceptible to data corruption, interception, viruses and unauthorised amendment and Domino UK Limited does not accept liability for any such corruption, interception, viruses or amendment or their consequences.
****************************************************
Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)
Posted by Thomas Singer <su...@smartcvs.com>.
Hi Ulrich,
You are making it too simple: you assume that the file name already _is_
plain UTF-8. My Java example works as expected:
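import java.io.File;
import java.io.IOException;
public class FileNameTest {
    public static void main(String[] args) throws IOException {
        final File dir = new File("file-test");
        dir.mkdirs();
        // U+00FF and U+00FE are ordinary characters here, not raw bytes
        final File file = new File(dir, "invalid\u00FF\u00FE");
        file.createNewFile();
        for (String fileName : dir.list()) {
            System.out.println(fileName);
        }
        file.delete();
    }
}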
> The thing is that, as Wilfredo said and whose attribution you snipped,
> filenames are UTF-8 _by_ _convention_ and nothing enforces this.
As I understand it, file names are stored *in the repository* as UTF-8 (by
convention) and the Subversion client needs to enforce the correct encoding
from the OS's native file name encoding. With Java this is no problem, since
it does not simply treat characters as bytes and lists the directory content
correctly (on Mac with decomposed umlauts, but that's another problem) and
hence can (without setting the LC_ALL variable) convert the file name to
UTF-8 or whatever encoding you want. If Java can do that without setting
LC_ALL, it also should be technically possible from C(++).
--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
www.syntevo.com
Re: Mac OS X: problems adding files with umlauts
Posted by Ulrich Eckhardt <ec...@satorlaser.com>.
On Friday 07 July 2006 09:02, Thomas Singer wrote:
> > That said, it is possible to write file names containing bytes that can't
> > decode as UTF-8.
>
> I can't believe that. Could you please give a reproducible example?
C++ code:
#include <fstream>

int main() {
    // the bytes 0xff and 0xfe never occur in valid UTF-8 text
    char const filename[] = { 'i','n','v','a','l','i','d','\xff','\xfe','\0' };
    std::ofstream out(filename);
    out << "aha!\n";
}
The thing is that, as Wilfredo said and whose attribution you snipped,
filenames are UTF-8 _by_ _convention_ and nothing enforces this.
Uli
Re: Mac OS X: problems adding files with umlauts
Posted by Wilfredo Sánchez Vega <ws...@wsanchez.net>.
My point was that LC_ALL is *not* relevant to decoding filenames,
because environment variables have nothing to do with how filenames
were encoded.
And I was saying that on Mac OS X, one can reasonably expect files
to be encoded in UTF-8, because that is what Apple tells developers
to do, and most comply. On most Unix systems, there is no way to
know what encoding was used for a filename, and most developers
assume they can only (safely) use 7-bit ASCII for decoding; all other
characters are typically considered "unprintable". But on Mac OS X,
UTF-8 is the recommended encoding.
However, there exist byte sequences which are not valid UTF-8
strings, and yet it is possible to name a file with such a byte
sequence. In that case, an attempt to decode the filename assuming a
UTF-8 encoding will fail. I would not expect that to happen with any
file that a user gives a name to, but such a situation may happen if
software is generating filenames (e.g. using some internal
identifier), since using UTF-8 in filenames isn't an enforced
requirement on most filesystems.
-wsv
Re: Mac OS X: problems adding files with umlauts
Posted by Thomas Singer <su...@smartcvs.com>.
> That said, it is possible to write file names containing bytes that can't decode as UTF-8.
I can't believe that. Could you please give a reproducible example?
> I think LC_ALL is relevant to what the encoding of svn's output should be.
I'm sure you mixed up two things here: the file names and the output. File
names should always be convertible to a general character representation
like UTF-8. Displaying the file names with the right character in the output is a
different issue and might depend on the font used.
If you think LC_ALL should be relevant for file name detection in
Subversion, could you answer the following questions:
- What LC_ALL value should the user set?
- What should happen when the wrong value is set?
- What value should be set for file names in different languages?
--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
www.syntevo.com
Re: Mac OS X: problems adding files with umlauts
Posted by Wilfredo Sánchez Vega <ws...@wsanchez.net>.
On Jul 6, 2006, at 1:58 AM, Ryan Schmidt wrote:
> On Jul 6, 2006, at 08:12, Thomas Singer wrote:
>
>> I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
>> problems adding files with umlauts in the name.
>>
>> - I've created a file "Überbau.txt" in the working copy
>> - first problem: when listing the directory content on the console, the
>> file name appears as "U??berbau.txt"
>> - when I invoke 'svn status' in this directory, I get the following error
>> message:
>> ~/test tom$ svn status
>> subversion/libsvn_subr/utf.c:466: (apr_err=22)
>> svn: Can't convert string from native encoding to 'UTF-8':
>> subversion/libsvn_subr/utf.c:464: (apr_err=22)
>> svn: U?\204?\136berbau.txt
>> Why that? Can't Subversion read every file name?
>> - ok, after setting LC_ALL, it works (even with the right umlaut!):
>> ~/test tom$ export LC_ALL=en_US
>> ~/test tom$ svn status
>> ? Überbau.txt
>
> I think you answered your own question... you need to set LC_ALL
> (or LANG) first so Subversion knows what character encoding it's
> working with.
Actually, on Mac OS X all file names are, by convention, encoded
as UTF-8, so the svn client should be able to decode file names
without LC_ALL, which really has nothing to do with file names. That
said, it is possible to write file names containing bytes that can't
decode as UTF-8. In that situation, you're somewhat SOL.
I don't know if other OS's specify an assumed encoding for file
names.
That said, I think LC_ALL is relevant to what the encoding of
svn's output should be.
-wsv
Re: Mac OS X: problems adding files with umlauts
Posted by Ryan Schmidt <su...@ryandesign.com>.
On Jul 6, 2006, at 08:12, Thomas Singer wrote:
> I'm using Mac OS X 10.4.7 with Subversion 1.3.1 (r19032) and have
> problems adding files with umlauts in the name.
>
> - I've created a file "Überbau.txt" in the working copy
> - first problem: when listing the directory content on the console, the
> file name appears as "U??berbau.txt"
> - when I invoke 'svn status' in this directory, I get the following error
> message:
> ~/test tom$ svn status
> subversion/libsvn_subr/utf.c:466: (apr_err=22)
> svn: Can't convert string from native encoding to 'UTF-8':
> subversion/libsvn_subr/utf.c:464: (apr_err=22)
> svn: U?\204?\136berbau.txt
> Why that? Can't Subversion read every file name?
> - ok, after setting LC_ALL, it works (even with the right umlaut!):
> ~/test tom$ export LC_ALL=en_US
> ~/test tom$ svn status
> ? Überbau.txt
I think you answered your own question... you need to set LC_ALL (or
LANG) first so Subversion knows what character encoding it's working
with.
> - now I add the file
> ~/test tom$ svn add \303berbau.txt
> A Überbau.txt
> - but when I now invoke 'svn status' again, it shows the same file name
> as missing and unversioned:
> ~/test tom$ svn status
> ? Überbau.txt
> ! Überbau.txt
> Shouldn't it occur as added? Is this a bug or a user-error?
I think you're experiencing symptoms of this:
http://subversion.tigris.org/issues/show_bug.cgi?id=2464
I'm not sure what this "\303" is, but I think you're trying to add
"Überbau.txt" with a composed "Ü" (U+00DC) while you need to add it
decomposed, as a "U" (U+0055) followed by a combining diaeresis "¨"
(U+0308), like HFS+ stores it.
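The two forms can be inspected with a short Java sketch. The helper names are invented, and it uses standard NFD from java.text.Normalizer as a stand-in for the decomposition HFS+ applies, which is an assumption that happens to hold for this character. It also prints the UTF-8 bytes as octal escapes: the composed "Ü" begins with \303, which may be where the "\303" in the 'svn add' command comes from.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class UmlautForms {
    // Lists the code points of a string as "U+XXXX" tokens.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    // Lists the UTF-8 bytes of a string as backslash-octal escapes.
    static String utf8Octal(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.append(String.format("\\%03o", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00DC";
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(codePoints(composed));   // U+00DC
        System.out.println(codePoints(decomposed)); // U+0055 U+0308
        System.out.println(utf8Octal(composed));    // \303\234
        System.out.println(utf8Octal(decomposed));  // \125\314\210
    }
}
```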
See the two mailing list threads linked in that bug report, and also:
http://listserv.dartmouth.edu/scripts/wa.exe?A2=ind0503&L=macscrpt&D=1&T=0&P=20432
I should note that I have never figured out how to enter non-ASCII
characters into the Terminal, so I don't actually know how to do the
above.