Posted to users@subversion.apache.org by Thomas Singer <su...@smartcvs.com> on 2006/07/07 09:00:32 UTC

Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)

Hi Ulrich,

You are making it too simple: you assume that the file name already _is_ 
plain UTF-8. My Java example works as expected:

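   import java.io.File;
   import java.io.IOException;

   public class FileTest {
     public static void main(String[] args) throws IOException {
       final File dir = new File("file-test");
       dir.mkdirs();
       // U+00FF and U+00FE are ordinary characters in a Java string
       final File file = new File(dir, "invalid\u00FF\u00FE");
       file.createNewFile();
       for (String fileName : dir.list()) {
         System.out.println(fileName);
       }
       file.delete();
     }
   }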

> The thing is that, as Wilfredo said and whose attribution you snipped, 
> filenames are UTF-8 _by_ _convention_ and nothing enforces this.

As I understand it, file names are stored *in the repository* as UTF-8 (by 
convention), and the Subversion client needs to enforce the correct encoding 
when converting from the OS's native file name encoding. With Java this is no 
problem: it does not simply treat characters as bytes, so it lists the 
directory content correctly (on the Mac with decomposed umlauts, but that's 
another problem) and hence can convert the file name to UTF-8, or whatever 
encoding you want, without setting the LC_ALL variable. If Java can do that 
without setting LC_ALL, it should also be technically possible from C(++).
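
To illustrate the decomposed umlauts mentioned above, here is a minimal 
sketch. It assumes Java 6 or later (for java.text.Normalizer); the class name 
UmlautForms and the method toComposed are made up for this example, and the 
decomposed input is written by hand rather than read from a real directory 
listing.

```java
import java.text.Normalizer;

final class UmlautForms {
    // Composes a name given in decomposed form (base letter followed by a
    // combining mark) into the precomposed form (Unicode NFC).
    static String toComposed(String fileName) {
        return Normalizer.normalize(fileName, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'u' followed by COMBINING DIAERESIS (U+0308): a decomposed u-umlaut
        String decomposed = "u\u0308ber.txt";
        String composed = toComposed(decomposed);
        // The decomposed form is one char longer than the composed one.
        System.out.println(decomposed.length() + " -> " + composed.length()); // prints "9 -> 8"
    }
}
```

Both strings look the same when displayed, but they differ as char sequences 
and therefore also as UTF-8 byte sequences.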

--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Schillerallee 2
83457 Bayerisch Gmain
Germany
www.syntevo.com


Ulrich Eckhardt schrieb:
> On Friday 07 July 2006 09:02, Thomas Singer wrote:
>>> That said, it is possible to write file names containing bytes that can't
>>> decode as UTF-8.
>> I can't believe that. Could you please give a reproducible example?
> 
> C++ code:
> 
> #include <fstream>
> int main() {
>   // the byte values 0xff and 0xfe never occur in well-formed UTF-8 text
>   char const filename[] = { 'i','n','v','a','l','i','d','\xff','\xfe','\0' };
>   std::ofstream out(filename);
>   out << "aha!\n";
> }
> 
> The thing is that, as Wilfredo said and whose attribution you snipped, 
> filenames are UTF-8 _by_ _convention_ and nothing enforces this.
> 
> Uli

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)

Posted by Ulrich Eckhardt <ec...@satorlaser.com>.
On Friday 07 July 2006 11:00, Thomas Singer wrote:
> You are making it too simple: you assume that the file name already _is_
> plain UTF-8.

Indeed, because filenames are supposed to be UTF-8.

> My Java example works as expected: 
>
>    final File dir = new File("file-test");
>    dir.mkdirs();
>    final File file = new File(dir, "invalid\u00FF\u00FE");
>    file.createNewFile();
>    for (String fileName : dir.list()) {
>      System.out.println(fileName);
>    }
>    file.delete();

AFAIK, Java uses UCS-2 or UTF-16 internally. It then has to convert that to the 
system's format which, in the case of OS X, is UTF-8. Now, U+00FF and U+00FE 
are both valid code points in Unicode (y with diaeresis and thorn, 
respectively), so Java just encodes them in UTF-8 and everything's fine. C++ is 
much more direct: it just passes to the system as the filename whatever it got 
from the programmer.
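
As a small sketch of that encoding step (assuming Java 7 or later for 
StandardCharsets; the class name Utf8Bytes and the helper toHex are made up 
for illustration), this prints the UTF-8 bytes Java produces for the two 
characters in question:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    // Returns the UTF-8 encoding of a string as space-separated hex bytes.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two trailing characters of the example file name
        System.out.println(toHex("\u00FF\u00FE")); // prints "C3 BF C3 BE"
    }
}
```

Each of the two characters becomes a two-byte sequence, so the raw bytes 0xFF 
and 0xFE never reach the file system.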

> > The thing is that, as Wilfredo said and whose attribution you snipped,
> > filenames are UTF-8 _by_ _convention_ and nothing enforces this.
>
> As I understand it, file names are stored *in the repository* as UTF-8 (by
> convention) 

Yes, although this is not a convention but a definition/requirement of 
Subversion. Also, this is validated, i.e. it rejects invalid UTF-8 sequences.


> and the Subversion client needs to enforce the correct encoding 
> from the OS' native file name encoding.

Right. In the case of OS X, Subversion probably assumes the encoding is UTF-8 
(because that is what it should be). If this is already wrong, because some 
program broke with the convention, it can't do much. In that case it only 
sees that the UTF-8 sequence is invalid and bails out with an error message.
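
To show what such a check looks like, here is a minimal sketch of strict 
UTF-8 validation. This is not Subversion's code, only an illustration written 
in Java with the standard CharsetDecoder; the class name Utf8Check and the 
method isValidUtf8 are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder reports the first malformed sequence it meets.
            return false;
        }
    }

    public static void main(String[] args) {
        // Raw bytes as written by the C++ example
        byte[] raw = { 'i', 'n', 'v', 'a', 'l', 'i', 'd', (byte) 0xFF, (byte) 0xFE };
        // The same name as encoded by Java
        byte[] encoded = "invalid\u00FF\u00FE".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(raw));     // prints "false"
        System.out.println(isValidUtf8(encoded)); // prints "true"
    }
}
```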


> With Java this is no problem, since 
> it does not simply treat characters as bytes and lists the directory
> content correctly (on Mac with decomposed umlauts, but that's another
> problem) and hence can (without setting the LC_ALL variable) convert the
> file name to UTF-8 or whatever encoding you want. If Java can do that
> without setting LC_ALL, it also should be technically possible from C(++).

It is technically and practically possible, but it doesn't happen behind the 
scenes as in Java; it requires an active effort. Since C++ mostly doesn't 
interpret characters and just passes them on, you need a function that 
converts from the program's local encoding (which one that is depends on the 
programmer and/or the locale) to the externally specified format before 
opening the file.
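
The conversion step itself is small. Here is a sketch of it, written in Java 
to match the example earlier in the thread and not as a C++ recipe. It assumes 
ISO-8859-1 as the local encoding purely for illustration; the class name 
NameRecoder and the method recode are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NameRecoder {
    // Interprets the bytes according to `from` and re-encodes them in `to`.
    static byte[] recode(byte[] name, Charset from, Charset to) {
        return new String(name, from).getBytes(to);
    }

    public static void main(String[] args) {
        // 0xFC is the u-umlaut in ISO-8859-1 (the assumed local encoding)
        byte[] local = { 'f', (byte) 0xFC, 'r' };
        byte[] utf8 = recode(local, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // The single byte 0xFC becomes the two bytes C3 BC.
        System.out.println(utf8.length); // prints "4"
    }
}
```

The important part is that the source encoding is stated explicitly; nothing 
about the bytes themselves says which encoding they are in.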

In other words, the difference between C++ and Java in this respect is that in 
C++ you provide the bytewise representation of the filename and that name is 
used without conversion, while in Java you provide a string that is converted 
to the filename's bytewise representation according to system requirements. 
That said, I wonder how Java would deal with "invalid\uFFFF\uFFFE" as those 
two are not allowed for interchange (i.e. filenames or content) according to 
Unicode.
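
A partial answer can be sketched without touching the file system. The 
following only looks at the encoding step, so whether a given file system 
then accepts such a name is not covered. It assumes Java 7 or later; the class 
name Noncharacters and the method isNoncharacter are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Noncharacters {
    // True for the code points Unicode designates as noncharacters:
    // U+FDD0..U+FDEF and the last two code points of every plane.
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            return false;
        }
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        String name = "invalid\uFFFF\uFFFE";
        // Java's UTF-8 encoder emits three bytes for each of the two
        // noncharacters; it does not reject or replace them.
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints "13"
        System.out.println(isNoncharacter(name.charAt(7))); // prints "true"
    }
}
```

So a program that wants to keep such names out has to check for them itself, 
for example with a test like isNoncharacter above.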

Uli
