Posted to users@subversion.apache.org by Karl Berry <ka...@freefriends.org> on 2022/12/22 22:40:27 UTC

filename encodings and conversion failure

A file with a name that has some "eight-bit" UTF-8 bytes (fn...-utf8.tex)
was committed to one of my repositories. When I try to check it out in
the C locale, svn complains:

$ echo $LC_ALL
C
$ svn update
svn: E000022: Can't convert string from 'UTF-8' to native encoding:
svn: E000022: fn{U+00B1}{U+00D7}{U+00F7}{U+00A7}{U+00B6}-utf8.tex

Or, in ls terms:
$ ls --quoting-style escape fn??*-utf8.tex
fn\302\261\303\227\303\267\302\247\302\266-utf8.tex

Clearly those UTF-8 code points cannot be "converted" by svn to the
7-bit ASCII locale that is "C". Fine; I don't expect it to.  Is there a
way to force svn to complete the checkout anyway? That is, just check
out the file and let the name be whatever the bytes are. I don't
understand why any "conversion" by svn is necessary merely to operate on
files.
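
For illustration, here is a minimal Python sketch of why the conversion
fails. It assumes the bytes printed by ls above are the complete file
name. The helper name "representable" is hypothetical, and this is not
how svn itself is implemented.

```python
# Bytes of the file name exactly as printed by `ls --quoting-style escape`.
raw_name = b"fn\302\261\303\227\303\267\302\247\302\266-utf8.tex"

# Decoding as UTF-8 succeeds and yields the code points that svn reports.
decoded = raw_name.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in decoded if ord(ch) > 0x7F]


def representable(name, encoding):
    """Return True if every character of name exists in the given encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(code_points)
print(representable(decoded, "ascii"))  # the C locale is 7-bit ASCII
print(representable(decoded, "utf-8"))
```

The name is valid UTF-8, but none of the five non-ASCII characters has
an ASCII equivalent, so a conversion to the C locale's encoding has
nothing to map them to.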

Sure, the name may show up as garbage when I do things in my terminal,
but that's my problem, not svn's. I didn't ask (and don't want) svn to
convert anything.

Incidentally, this is not about UTF-8 specifically. The same commit
included names in SJIS and EUC encodings (they are test files for a new
feature in Japanese TeX). The question is, in general, why svn needs to
"convert" filenames at all.

I did some searching both in the mailing list archives and on the web,
to no avail. People had related problems, but I didn't see this (more
basic) question being asked.

This is with a somewhat old svn that I compiled myself:
svn, version 1.13.0 (r1867053)
   compiled Nov 10 2019, 18:06:58 on x86_64-unknown-linux-gnu

I'm guessing svn behavior in this regard has not changed since 1.13.0,
but if I'm wrong about that, sorry for the noise, and I'll happily
recompile the latest.

Thanks for any info,
Karl


Re: filename encodings and conversion failure

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Sahlberg wrote on Fri, 23 Dec 2022 08:58 +00:00:
> Example: Commit a file with ? (question mark) in the filename on Linux and
> check out the file on Windows.

Or case-colliding files:

url=`svn info --show-item=url`
svn mkdir -- $url/foo $url/FOO
svn up

> This is a case where a conversion might /be/ necessary (although I don't
> have a concrete idea of what the conversion should be). Or else these files
> should just be ignored on checkout.
>
> I'm just mentioning this in case someone looks at the code and decides to
> make changes to the conversions.

Ditto.

Cheers,

Daniel

Re: filename encodings and conversion failure

Posted by Nico Kadel-Garcia <nk...@gmail.com>.
On Fri, Dec 23, 2022 at 3:58 AM Daniel Sahlberg
<da...@gmail.com> wrote:
>
> Den tors 22 dec. 2022 kl 23:40 skrev Karl Berry <ka...@freefriends.org>:
>>
>> Clearly those UTF-8 code points cannot be "converted" by svn to the
>> 7-bit ASCII locale that is "C". Fine; I don't expect it to.  Is there a
>> way to force svn to complete the checkout anyway? That is, just check
>> out the file and let the name be whatever the bytes are. I don't
>> understand why any "conversion" by svn is necessary merely to operate on
>> files.
>
>
> Not at all related to this issue except it also concerns filenames: It is possible to commit files with a filename that works on only one platform, making a checkout/update fail on other platforms.
>
> Example: Commit a file with ? (question mark) in the filename on Linux and check out the file on Windows.

Yes. The source code for HylaFAX had this exact problem, since it had
MixedCaseFileNames.c and Mixedcasefilenames.c. They can be checked
out in the same working copy on UNIX, Linux, and macOS easily; on
Windows it's not so easy due to the "case-insensitive" file systems.

Nico Kadel-Garcia



> [[[
> D:\temp>svn co https://svn.apache.org/repos/private/pmc/subversion/pr/XXXX private_wc
> [...]
> svn: E155009: Failed to run the WC DB work queue associated with 'D:\temp\private_wc\YYY_folder', work item 54 (file-install XY?Z.html 1 0 1 1)
> svn: E720123: Can't move 'D:\temp\private_wc\.svn\tmp\svn-C3A15B21' to 'D:\temp\private_wc\XY?Z.html': The filename, directory name, or volume label syntax is incorrect.
> ]]]
>
> (The above example is from the Subversion private repository, I've masked the actual folders/filenames but it should be reproducible for anyone with access to the repository).
>
> This is a case where a conversion might /be/ necessary (although I don't have a concrete idea of what the conversion should be). Or else these files should just be ignored on checkout.
>
> I'm just mentioning this in case someone looks at the code and decides to make changes to the conversions.
>
> Kind regards,
> Daniel

Re: filename encodings and conversion failure

Posted by Daniel Sahlberg <da...@gmail.com>.
Den tors 22 dec. 2022 kl 23:40 skrev Karl Berry <ka...@freefriends.org>:

> Clearly those UTF-8 code points cannot be "converted" by svn to the
> 7-bit ASCII locale that is "C". Fine; I don't expect it to.  Is there a
> way to force svn to complete the checkout anyway? That is, just check
> out the file and let the name be whatever the bytes are. I don't
> understand why any "conversion" by svn is necessary merely to operate on
> files.
>

Not at all related to this issue except it also concerns filenames: It is
possible to commit files with a filename that works on only one platform,
making a checkout/update fail on other platforms.

Example: Commit a file with ? (question mark) in the filename on Linux and
check out the file on Windows.

[[[
D:\temp>svn co https://svn.apache.org/repos/private/pmc/subversion/pr/XXXX
private_wc
[...]
svn: E155009: Failed to run the WC DB work queue associated with
'D:\temp\private_wc\YYY_folder', work item 54 (file-install XY?Z.html 1 0 1
1)
svn: E720123: Can't move 'D:\temp\private_wc\.svn\tmp\svn-C3A15B21' to
'D:\temp\private_wc\XY?Z.html': The filename, directory name, or volume
label syntax is incorrect.
]]]

(The above example is from the Subversion private repository, I've masked
the actual folders/filenames but it should be reproducible for anyone with
access to the repository).
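
A rough Python sketch of a check that could flag such names before they
are committed. The function name is hypothetical and nothing like it is
part of Subversion. The character set is the commonly documented list
of characters Windows rejects in file names; reserved device names such
as CON or NUL are not covered.

```python
# Characters Windows does not allow in file names, in addition to
# control characters (code points below 32).
WINDOWS_RESERVED = set('<>:"/\\|?*')


def windows_unsafe_chars(filename):
    """Return the sorted, de-duplicated characters Windows would reject."""
    bad = {ch for ch in filename if ch in WINDOWS_RESERVED or ord(ch) < 32}
    return sorted(bad)


print(windows_unsafe_chars("XY?Z.html"))
print(windows_unsafe_chars("README"))
```

A check of this kind would fit in a pre-commit hook, although which
names to reject is a policy decision for each repository.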

This is a case where a conversion might /be/ necessary (although I don't
have a concrete idea of what the conversion should be). Or else these files
should just be ignored on checkout.

I'm just mentioning this in case someone looks at the code and decides to
make changes to the conversions.

Kind regards,
Daniel

Re: filename encodings and conversion failure

Posted by Mark Phippard <ma...@gmail.com>.
On Fri, Dec 30, 2022 at 5:41 PM Karl Berry <ka...@freefriends.org> wrote:
>
>     Well my point is that this would not work everywhere.
>
> How can "store as bytes" not work (be implementable?) everywhere?  I'm
> missing something.

18 years ago when Paul Burba and I were working on porting SVN to run
on OS/400 this feature was a godsend. We would not have wanted EBCDIC
bytes stored in the repository and it would have made it impossible to
interoperate with any other platform.

Not that this port was ever significant to the community, but we were
glad it worked this way. To each his own I guess.

Mark

Re: filename encodings and conversion failure

Posted by David Huang <kh...@azeotrope.org>.
On 12/30/2022 4:40 PM, Karl Berry wrote:
>      Well my point is that this would not work everywhere.
>
> How can "store as bytes" not work (be implementable?) everywhere?  I'm
> missing something.

I seem to remember an earlier message in the thread mentioning
Windows, where filenames are natively UTF-16LE. The various
file/path API functions have a "wide character" version that takes
UTF-16LE-encoded filenames, and an "ANSI" version that converts
single-byte/multi-byte charset filenames to/from UTF-16LE. The
source encoding is generally determined by a system-wide setting:
in the US, most systems would be using Windows-1252, which is similar to
ISO 8859-1, although not identical, and Japanese systems would probably
be set to use Windows-932, which is basically Shift-JIS.

So if SVN on Windows used the UTF-16 APIs and "stored as bytes", it'd be 
incompatible with *nix: a *nix README file, 0x52 0x45 0x41 0x44 0x4d 
0x45 would turn into U+4552 U+4441 U+454D, or 䕒䑁䕍 in Windows. And what 
happens with filenames that are an odd number of bytes long? Or in the 
other direction, a README file committed from Windows couldn't be 
checked out on *nix because 0x52 0x00 0x45 0x00 etc would appear to 
contain NUL characters from the *nix POV.
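
The README example can be reproduced with a short Python sketch. It
only demonstrates the byte-level reinterpretation described above and
assumes nothing about how svn would actually call the Windows APIs. The
helper name is hypothetical.

```python
readme = b"README"  # 0x52 0x45 0x41 0x44 0x4d 0x45

# Reinterpret the six bytes as three little-endian UTF-16 code units.
as_utf16 = readme.decode("utf-16-le")
print([f"U+{ord(ch):04X}" for ch in as_utf16])


def decodes_as_utf16le(data):
    """Return True if data is a whole number of valid UTF-16LE code units."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# A name with an odd number of bytes cannot be reinterpreted at all.
print(decodes_as_utf16le(b"READM"))

# In the other direction, the UTF-16LE form of README contains NUL bytes,
# which cannot appear in a *nix file name.
wide = "README".encode("utf-16-le")
print(wide, b"\x00" in wide)
```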

And if SVN on Windows used the single-byte charset APIs, the README 
example would work, but any filenames with non-ASCII characters would 
either change depending on the system-wide locale setting, or perhaps 
not be able to be checked out at all.
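
A small Python sketch of the locale dependence, using Python's cp1252
and cp932 codecs as stand-ins for the Windows-1252 and Windows-932 code
pages. The file name is a made-up example, and the codecs only
approximate what the Windows "ANSI" APIs do.

```python
# One stored byte string, decoded under two system code pages.
name_bytes = b"fn\xb1.tex"

decoded_by_codec = {
    codec: name_bytes.decode(codec) for codec in ("cp1252", "cp932")
}

for codec, name in decoded_by_codec.items():
    # ascii() shows the non-ASCII character as an escape sequence.
    print(codec, ascii(name))
```

The same byte 0xB1 comes out as U+00B1 (plus-minus sign) under
Windows-1252 and as U+FF71 (a halfwidth katakana letter) under
Windows-932, so the visible file name would depend on the machine.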

So as a Windows user, I think it's good that SVN converts filenames by 
default. That said, perhaps it would be useful to have an svn: property 
or something that says not to do any conversion on this filename.

-- 
Name: Dave Huang         |  Mammal, mammal / their names are called /
INet: khym@azeotrope.org |  they raise a paw / the bat, the cat /
                          |  dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 47 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++


Re: filename encodings and conversion failure

Posted by Karl Berry <ka...@freefriends.org>.
    Well my point is that this would not work everywhere. 

How can "store as bytes" not work (be implementable?) everywhere?  I'm
missing something.

When stored/returned as bytes, certainly a filename might look like
garbage when presented to the user, depending on their locale, what the
filename is, what the command does, etc. Or it might fail completely if
the user's filesystem has its own requirements, like zfs or Windows,
just like plenty of filenames cannot be portably used now (case clashes
and so on). People have brought up all those things, but they seem like
red herrings to me. They are all for the user to decide and handle.

At least with bytes svn would not have a (seemingly gratuitous) UTF-8
(or any) requirement on filenames in its own repository. In my naivete,
it seems like being able to manipulate any filename, independent of
locale and encoding, would be a pretty desirable characteristic of any VC.

    Blame it on Subversion's insisting on cross-platform
    compatibility.

Requiring UTF-8 filenames is not
cross-platform/cross-locale/cross-whatever compatible either.
Or my question would never have arisen.

In any case, as I said, I certainly do not expect any change here. I'm
sure it's a fundamental decision, made for good reason, and will not
be altered. I guess I just don't understand it. So it goes :).
--thanks again, karl.

Re: filename encodings and conversion failure

Posted by Branko Čibej <br...@apache.org>.
On 26.12.2022 22:26, Karl Berry wrote:
> I certainly don't expect such fundamental behavior to change, but I
> can't help but respond a little. Just ignore me :).
>
>      All the world is not Unix
>      ...
>      the problem of cross-platform compatibility.
>      
> Of course.  Precisely the reason why storing filenames as bytes would be
> more portable than forcing any particular encoding, in principle, seems
> to me.


Well my point is that this would not work everywhere. Blame it on 
Subversion's insisting on cross-platform compatibility. The thing that 
stores filenames as bytes is called something else. :)

-- Brane

Re: filename encodings and conversion failure

Posted by Karl Berry <ka...@freefriends.org>.
    It's also documented [1].
    https://svnbook.red-bean.com/en/1.7/svn.advanced.l10n.html

Thanks. I failed to find that. Now I understand.

I certainly don't expect such fundamental behavior to change, but I
can't help but respond a little. Just ignore me :).

    All the world is not Unix 
    ...
    the problem of cross-platform compatibility. 
    
Of course.  Precisely the reason why storing filenames as bytes would be
more portable than forcing any particular encoding, in principle, seems
to me.

    So assuming you can do some magic with file name encoding and expect
    Subversion to deal with it is, let's say, using the wrong tool 

The situation is precisely the opposite: I'm not doing any magic
whatsoever. Subversion is forcing its own preferred encoding "magic"
(i.e., UTF-8) on me. Ah well. Life goes on.

Thanks much for answering my question, and all the related info.
--best, karl.

Re: filename encodings and conversion failure

Posted by Branko Čibej <br...@apache.org>.
On 24.12.2022 21:49, Karl Berry wrote:
>      The most classic example is "FILE.TXT" versus "file.txt". According to
>
> To repeat: I'm not talking about case clashes and similar underlying
> filesystem problems; other people brought those up.
>
> I'm asking specifically about the failure of the unasked-for
> "UTF-8 conversion", which, so far as I can tell, is spontaneously
> induced by svn. There is no filesystem problem of any kind in this case
> (standard xfs). That is, I can touch/copy/remove/whatever the file that
> svn refuses to work with. -k

Yes there is. Windows file systems (well, at least NTFS) require file 
names in some sort of Unicode encoding. All the world is not Unix and 
even some Unix file systems store file names in UTF-8 (zfs comes to 
mind). Subversion does its best to conform to whatever the local rules 
are, but as you noticed, there are edge cases where that will cause 
problems.

The case you're describing, with some file names in SJIS and some in 
UTF-8, seems simply impossible to me, because we store file names in 
UTF-8 in the repository. So unless someone hacked the Subversion server, 
committing a SJIS file name should be impossible.
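
To make the UTF-8 requirement concrete, here is a minimal Python
sketch. The sample name is hypothetical and not one of the files from
the TeX Live commit, and the check is plain UTF-8 validation, not
Subversion's own code.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "\u30c6\u30b9\u30c8"  # katakana for "test", a made-up file name

for codec in ("utf-8", "shift_jis", "euc_jp"):
    encoded = sample.encode(codec)
    print(codec, encoded, is_valid_utf8(encoded))
```

Only the UTF-8 encoding passes. The Shift-JIS and EUC-JP byte sequences
for this name are not well-formed UTF-8, so they could not be stored
unchanged in a repository that requires UTF-8 paths.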

Regarding "spontaneously induced", well, this is how Subversion behaves;
the reasons for that stem from the problem of cross-platform
compatibility. It's also documented [1]. So assuming you can do some
magic with file name encoding and expect Subversion to deal with it is,
let's say, using the wrong tool for the job.

-- Brane

[1] https://svnbook.red-bean.com/en/1.7/svn.advanced.l10n.html

Re: filename encodings and conversion failure

Posted by Karl Berry <ka...@freefriends.org>.
    The most classic example is "FILE.TXT" versus "file.txt". According to

To repeat: I'm not talking about case clashes and similar underlying
filesystem problems; other people brought those up.

I'm asking specifically about the failure of the unasked-for
"UTF-8 conversion", which, so far as I can tell, is spontaneously
induced by svn. There is no filesystem problem of any kind in this case
(standard xfs). That is, I can touch/copy/remove/whatever the file that
svn refuses to work with. -k

Re: filename encodings and conversion failure

Posted by Nico Kadel-Garcia <nk...@gmail.com>.
On Fri, Dec 23, 2022 at 4:35 PM Karl Berry <ka...@freefriends.org> wrote:
>
>     Perhaps «export LC_ALL=C.UTF-8», if your platform has that encoding?
>
> Yes, thanks, that is one of the workarounds. But that's not my
> question.
>
> My question is, why can't svn just treat the filenames as bytes? I
> remain baffled by the need to unconditionally convert to/from UTF-8 (or
> any other encoding). Nothing in my environment ("C" in all respects)
> says to do this, as far as I know.

Because filesystems don't, and Subversion is dealing with the
underlying filesystems. It's not a Subversion-specific problem; git
and even rsync have some similar problems.

The most classic example is "FILE.TXT" versus "file.txt". According to
NTFS and CIFS, those are the same filename, since those are case-aware but
case-insensitive filesystems. In NFS, ext4, and xfs, they're distinct
files, and committing them distinctly on Linux versions of Subversion
causes real issues on Windows based working copies.
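
A minimal Python sketch for spotting such clashes in a list of
repository paths. The function name is hypothetical, and str.casefold()
is only an approximation of how NTFS compares names.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    # Keep only the groups where two or more distinct paths collide.
    return [names for names in groups.values() if len(names) > 1]


print(case_collisions(["FILE.TXT", "file.txt", "README"]))
```

Run over the output of something like "svn list -R", a check along
these lines could warn before a working copy on a case-insensitive
filesystem breaks.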

There is no good solution for this set of problems except not to use
case-insensitive filesystems. The issues you're describing are very
similar, and are compelling reasons to avoid mixed case and, well,
anything but plain old alphanumeric ASCII in filenames. The popular
"_", "-", and "." are inevitable, though "." tends to cause problems
with regular expressions.

Re: filename encodings and conversion failure

Posted by Karl Berry <ka...@freefriends.org>.
    Perhaps «export LC_ALL=C.UTF-8», if your platform has that encoding?

Yes, thanks, that is one of the workarounds. But that's not my
question.

My question is, why can't svn just treat the filenames as bytes? I
remain baffled by the need to unconditionally convert to/from UTF-8 (or
any other encoding). Nothing in my environment ("C" in all respects)
says to do this, as far as I know.

We (this is the TeX Live svn repository, by the way) also have the other
problems mentioned so far, case clashes and Windows special characters
also causing trouble. But those seem different in kind to me. Those
problems are induced by the operating system and/or filesystem, and I
don't expect svn to solve them for me.

In contrast, the 
> svn: E000022: Can't convert string from 'UTF-8' to native encoding:
error seems to be induced purely by svn on its own.

I expect there is a good reason for the behavior, since svn behavior is
usually sensible. I just can't imagine what that reason is. And I wish
there was a way to override it. Just give me the bytes, dear svn!

Thanks,
Karl

Re: filename encodings and conversion failure

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Karl Berry wrote on Thu, 22 Dec 2022 22:40 +00:00:
> A file with a name that has some "eight-bit" UTF-8 bytes (fn...-utf8.tex)
> was committed to one of my repositories. When I try to check it out in
> the C locale, svn complains:
>
> $ echo $LC_ALL
> C
> $ svn update
> svn: E000022: Can't convert string from 'UTF-8' to native encoding:
> svn: E000022: fn{U+00B1}{U+00D7}{U+00F7}{U+00A7}{U+00B6}-utf8.tex
>
> Or, in ls terms:
> $ ls --quoting-style escape fn??*-utf8.tex
> fn\302\261\303\227\303\267\302\247\302\266-utf8.tex
>
> Clearly those UTF-8 code points cannot be "converted" by svn to the
> 7-bit ASCII locale that is "C". Fine; I don't expect it to.  Is there a
> way to force svn to complete the checkout anyway?

Perhaps «export LC_ALL=C.UTF-8», if your platform has that encoding?

Good questions in the rest of the email but I'm ENOTIME to deal with them at the moment.

Cheers,

Daniel

> That is, just check
> out the file and let the name be whatever the bytes are. I don't
> understand why any "conversion" by svn is necessary merely to operate on
> files.
>
> Sure, the name may show up as garbage when I do things in my terminal,
> but that's my problem, not svn's. I didn't ask (and don't want) svn to
> convert anything.
>
> Incidentally, this is not about UTF-8 specifically. The same commit
> included names in SJIS and EUC encodings (they are test files for a new
> feature in Japanese TeX). The question is, in general, why svn needs to
> "convert" filenames at all.
>
> I did some searching both in the mailing list archives and on the web,
> to no avail. People had related problems, but I didn't see this (more
> basic) question being asked.
>
> This is with a somewhat old svn that I compiled myself:
> svn, version 1.13.0 (r1867053)
>    compiled Nov 10 2019, 18:06:58 on x86_64-unknown-linux-gnu
>
> I'm guessing svn behavior in this regard has not changed since 1.13.0,
> but if I'm wrong about that, sorry for the noise, and I'll happily
> recompile the latest.
>
> Thanks for any info,
> Karl