Posted to dev@subversion.apache.org by Heikki Orsila <sh...@modeemi.cs.tut.fi> on 2005/05/06 08:35:30 UTC

[svn] Recode problem

The default behaviour of SVN is very annoying: importing files often fails
with the error 'svn: Can't recode string'. The default behaviour should be
to accept _any_ filename (a zero-terminated byte string).

This problem happened yesterday when I tried to import my home
repository, and now again when I'm trying to switch our CVS repo at work
to svn. IMO, this is a barrier that discourages people from switching to
svn from the alternatives.

Heikki Orsila			Barbie's law:
heikki.orsila@iki.fi		"Math is hard, let's go shopping!"
http://www.iki.fi/shd

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [svn] Recode problem

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-05-09 11:46:33 +0200, Peter N. Lundblad wrote:
> The POSIX way of making filename encoding locale-dependent is
> fundamentally broken IMO.

AFAIK, POSIX just says that a filename is a sequence of bytes, which
is just as bad, since humans work with characters, not bytes.

> But I don't think each tool can solve a system problem. On POSIX
> systems, I think the best solution is to rely on the locale like we
> currently do. People should set up their locale correctly and ensure
> that filenames are in the encoding of the locale. Even if we make
> the filename encoding configurable, it is easy to wind up with
> filenames with different encodings in the same WC.

Yes, but the user can probably fix that (for the new filenames).

Storing the encoding in the .svn directory allows the user to change
his locale from one application to another. Without that, one gets ugly
side effects, as shown in my previous mail. This is really specific to
Subversion.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: [svn] Recode problem

Posted by Michael Sweet <mi...@easysw.com>.
Peter N. Lundblad wrote:
> ...
> If I understand correctly, OS X isn't POSIX conformant (but it

OSX apps only use UTF-8 (NFKD) for filenames.  It is certainly
possible for non-OSX clients to create a file with a non-UTF-8
filename, however that case is simply not handled...
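
To illustrate why a decomposed form matters here: the same visible name
can have two different UTF-8 byte sequences. The sketch below is only an
illustration in Python, not code from OS X or Subversion; it uses the
standard unicodedata module and the NFKD form named above.

```python
import unicodedata

# "e with acute accent" as a single precomposed code point (U+00E9).
composed = "\u00e9"
# The same character decomposed into "e" + combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFKD", composed)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)    # b'\xc3\xa9'
print(decomposed_bytes)  # b'e\xcc\x81'
# Both are valid UTF-8 for the same visible character, yet they differ.
print(composed_bytes == decomposed_bytes)  # False
```

A byte-wise comparison of filenames therefore treats the two forms as
different names unless both sides are normalized first.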

-- 
______________________________________________________________________
Michael Sweet, Easy Software Products           mike at easysw dot com
Internet Printing and Publishing Software        http://www.easysw.com

Re: [svn] Recode problem

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Mon, 9 May 2005, Vincent Lefevre wrote:

> On 2005-05-08 21:17:31 -0500, Ben Collins-Sussman wrote:
> > I think you misunderstood Greg's question.  GNOME may have a
> > convention that applications always *save* files in UTF8 encoding.
> > But it has no control over files that already exist.  What happens
> > when I ask a GNOME application to open a file encoded in something
> > other than UTF-8?  Does it "guess" at the encoding?
>
> I don't know. For instance, ROX-Filer (a GTK+ application, i.e. using
> the same GNOME convention) guesses the encoding (but I don't know
> how) and fixes it on the fly if possible (and enabled?). I don't know
> how the guess is done, but using the user's locale isn't necessarily
> the best solution as the filename may come from an external source.
> I think that the best solution is to give the choice to the user,
> just like what Emacs does for the encoding of text file contents.


The POSIX way of making filename encoding locale-dependent is
fundamentally broken IMO. But I don't think each tool can solve a system
problem. On POSIX systems, I think the best solution is to rely on the
locale like we currently do. People should set up their locale correctly
and ensure that filenames are in the encoding of the locale. Even if we
make the filename encoding configurable, it is easy to wind up with
filenames with different encodings in the same WC.

If I understand correctly, OS X isn't POSIX conformant (but it could
easily be by setting up the locales correctly to use UTF-8). Subversion
already handles systems where paths have a different encoding than the
locale (i.e. Windows NT and later). It seems to me that APR should learn
the fact that OS X always has UTF-8 filenames. I don't know OS X, but this
seems to be a simple fix in apr_filepath_encoding. I don't know if APR has a
specific reason not to do it this way.
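
A rough sketch of the rule proposed here, written in Python rather than
APR's C: pick the path encoding per platform and fall back to the locale
elsewhere. The function name and the platform strings are assumptions made
for illustration; this is not how apr_filepath_encoding is implemented.

```python
def path_encoding(platform: str, locale_charset: str) -> str:
    """Return the encoding to assume for on-disk filenames (hypothetical helper)."""
    # Assumption for this sketch: on these platforms filenames do not follow
    # the locale and are handled as UTF-8, per the discussion above.
    fixed_utf8_platforms = {"win32", "darwin"}
    if platform in fixed_utf8_platforms:
        return "utf-8"
    # Everywhere else, trust the user's locale.
    return locale_charset


print(path_encoding("darwin", "iso8859-1"))  # utf-8
print(path_encoding("linux", "iso8859-15"))  # iso8859-15
```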

Re: [svn] Recode problem

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-05-08 21:17:31 -0500, Ben Collins-Sussman wrote:
> I think you misunderstood Greg's question.  GNOME may have a  
> convention that applications always *save* files in UTF8 encoding.   
> But it has no control over files that already exist.  What happens  
> when I ask a GNOME application to open a file encoded in something  
> other than UTF-8?  Does it "guess" at the encoding?

I don't know. For instance, ROX-Filer (a GTK+ application, i.e. using
the same GNOME convention) guesses the encoding (but I don't know
how) and fixes it on the fly if possible (and enabled?). I don't know
how the guess is done, but using the user's locale isn't necessarily
the best solution as the filename may come from an external source.
I think that the best solution is to give the choice to the user,
just like what Emacs does for the encoding of text file contents.

http://mail.gnome.org/archives/gnome-vfs-list/2003-March/msg00013.html
summarizes the situation:

  In general the filename encoding issue this is a hard problem, and a
  full real solution will not be availible for a long time, until the
  whole world has switched over to a common encoding. Many sources of
  filenames just don't have a corresponding filename encoding specified,
  so until everyone use the same we have to guess. Take an ftp site for
  instance. How are you supposed to know the encoding it uses for
  filenames?

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: [svn] Recode problem

Posted by Ben Collins-Sussman <su...@collab.net>.
On May 8, 2005, at 9:00 PM, Vincent Lefevre wrote:

> On 2005-05-08 12:22:48 -0400, Greg Hudson wrote:
>
>> Uh, okay. I read a filename from a directory. Where can I find out
>> its character set?
>>
>
> The GNOME convention says that all filenames are encoded with UTF-8.
> So one always knows the character set. :)
>

I think you misunderstood Greg's question.  GNOME may have a  
convention that applications always *save* files in UTF8 encoding.   
But it has no control over files that already exist.  What happens  
when I ask a GNOME application to open a file encoded in something  
other than UTF-8?  Does it "guess" at the encoding?

That's the same problem that Subversion faces:  somebody tells the
svn client to import a path into the repository, and it needs to
guess at the encoding so that it can convert it to UTF-8.  (The
repository stores all paths as UTF-8.)  At the moment, the only thing
the svn client does is look at the locale environment variables (LANG
and LC_*).  I suspect GNOME applications do the same thing.


Re: [svn] Recode problem

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-05-09 11:52:20 +0200, Peter N. Lundblad wrote:
> On Mon, 9 May 2005, Vincent Lefevre wrote:
> > Now, for ls, this isn't much of a problem: non-ASCII characters are
> > just displayed in an incorrect way. But with Subversion, this is
> > much more of a problem since it thinks that the filenames have changed
> > (unless this has been improved recently) if the user uses different
> > locales; this is not just a display problem.
> 
> I don't understand how making the encoding configurable solves this
> problem.

I recall the problem:

Re: [svn] Recode problem

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Mon, 9 May 2005, Vincent Lefevre wrote:

> On 2005-05-08 23:09:40 -0400, Greg Hudson wrote:
> > That means if I save a file with gedit, it won't appear correctly in
> > "ls" if I'm not using a UTF-8 locale.
>
> Well, the problem with ls already exists when one uses different
> locales (in particular in a multi-user environment). Using a
> wrapper for ls (and other such utilities) may be an interesting
> solution.
>
> Now, for ls, this isn't much of a problem: non-ASCII characters are
> just displayed in an incorrect way. But with Subversion, this is
> much more of a problem since it thinks that the filenames have changed
> (unless this has been improved recently) if the user uses different
> locales; this is not just a display problem.

I don't understand how making the encoding configurable solves this
problem.

>
> > That may be acceptable to GNOME, which doesn't care so much about
> > the command line, but it's not so good for Subversion.
>
> Some users may think that it is better, in particular those who use
> both GNOME and Subversion. So, it could be configurable.
>
Why don't GNOME users use a UTF-8 locale? Then standard system tools will
work as expected (assuming they're locale-aware). If the users get files
with some other encoding (say in a tar file), they have to use some tool
to fix that. I don't think svn should be that tool.

Best,
//Peter

Re: [svn] Recode problem

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-05-08 23:09:40 -0400, Greg Hudson wrote:
> That means if I save a file with gedit, it won't appear correctly in
> "ls" if I'm not using a UTF-8 locale.

Well, the problem with ls already exists when one uses different
locales (in particular in a multi-user environment). Using a
wrapper for ls (and other such utilities) may be an interesting
solution.

Now, for ls, this isn't much of a problem: non-ASCII characters are
just displayed in an incorrect way. But with Subversion, this is
much more of a problem since it thinks that the filenames have changed
(unless this has been improved recently) if the user uses different
locales; this is not just a display problem.
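
A small Python sketch of why this is more than a display problem: if the
on-disk bytes are interpreted through whichever locale is active, the same
file maps to two different UTF-8 names. The helper name and the sample
filename are made up for illustration, and the conversion is a simplified
model of the locale-to-UTF-8 step discussed in this thread.

```python
def repository_name(raw_name: bytes, locale_charset: str) -> bytes:
    """Model: interpret on-disk bytes in the locale, then store as UTF-8."""
    return raw_name.decode(locale_charset).encode("utf-8")


# One file on disk; byte 0xA4 is the currency sign in ISO-8859-1
# but the euro sign in ISO-8859-15.
on_disk = b"price\xa4.txt"

under_latin1 = repository_name(on_disk, "iso8859-1")
under_latin9 = repository_name(on_disk, "iso8859-15")

print(under_latin1)  # b'price\xc2\xa4.txt'
print(under_latin9)  # b'price\xe2\x82\xac.txt'
# Same file, two different repository-side names.
print(under_latin1 == under_latin9)  # False
```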

> That may be acceptable to GNOME, which doesn't care so much about
> the command line, but it's not so good for Subversion.

Some users may think that it is better, in particular those who use
both GNOME and Subversion. So, it could be configurable.

> > Subversion could store some information in the .svn directories
> > about the charset used for the filenames,
> 
> Doesn't help "svn import".

I agree concerning "svn import". IMHO, the best solution would be to
provide a very configurable way to guess the encoding. Remember that
the files may come from an external source and their encoding doesn't
necessarily match UTF-8 or the locales. For GNOME users, they could
probably use GNOME to convert the filenames to UTF-8. If there exists
some utility that does this job, Subversion could use it.
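
One way such a configurable guess could look, sketched in Python under the
assumption that the user supplies an ordered list of candidate encodings.
The function name and the default list are hypothetical; nothing like this
is claimed to exist in Subversion.

```python
def guess_decode(raw_name: bytes, candidates=("utf-8", "iso8859-15")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw_name.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"no candidate encoding fits {raw_name!r}")


# Valid UTF-8 is recognised by the first candidate.
_, enc_a = guess_decode(b"p\xc3\xa4iv\xc3\xa4")
# A lone 0xE4 byte is not valid UTF-8, so the fallback is used.
_, enc_b = guess_decode(b"p\xe4iv\xe4")

print(enc_a)  # utf-8
print(enc_b)  # iso8859-15
```

The order matters: Python's ISO-8859-15 decoder accepts every byte value,
so a single-byte encoding like that has to come last, and the result is
still only a guess.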

However, for the files that have already been added, recording the
encoding in the .svn directories would be useful.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: [svn] Recode problem

Posted by Greg Hudson <gh...@mit.edu>.
On Mon, 2005-05-09 at 04:00 +0200, Vincent Lefevre wrote:
> On 2005-05-08 12:22:48 -0400, Greg Hudson wrote:
> > Uh, okay. I read a filename from a directory. Where can I find out
> > its character set?
> 
> The GNOME convention says that all filenames are encoded with UTF-8.
> So one always knows the character set. :)

That means if I save a file with gedit, it won't appear correctly in
"ls" if I'm not using a UTF-8 locale.  That may be acceptable to GNOME,
which doesn't care so much about the command line, but it's not so good
for Subversion.

> Subversion could store some information in the .svn directories about
> the charset used for the filenames,

Doesn't help "svn import".


Re: [svn] Recode problem

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-05-08 12:22:48 -0400, Greg Hudson wrote:
> Uh, okay. I read a filename from a directory. Where can I find out
> its character set?

The GNOME convention says that all filenames are encoded with UTF-8.
So one always knows the character set. :)

If you disagree with this convention, Subversion could store some
information in the .svn directories about the charset used for the
filenames, so that there would be no problems when two users with
different locales (or one user using different locales or applications
with different conventions) would access the same working copy.

Also, I wonder how Subversion currently behaves when the repository
contains filenames with characters that cannot be represented with
ISO-8859-1 and the user has a locale based on ISO-8859-1. Choosing
an encoding different from the locales can solve this problem. The
user may not want or may not even be able to change his locales.
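
The concern can be made concrete with a short Python check of whether a
repository name (stored as UTF-8) can be expressed in a given locale
encoding. The helper name and the sample filename are illustrative
assumptions.

```python
def representable(utf8_name: bytes, locale_charset: str) -> bool:
    """True if the UTF-8 repository name can be encoded in locale_charset."""
    text = utf8_name.decode("utf-8")
    try:
        text.encode(locale_charset)
    except UnicodeEncodeError:
        return False
    return True


# "Lodz.txt" written with its Polish diacritics (L with stroke, o acute, z acute).
repo_name = "\u0141\u00f3d\u017a.txt".encode("utf-8")

print(representable(repo_name, "iso8859-1"))  # False
print(representable(repo_name, "iso8859-2"))  # True
print(representable(repo_name, "utf-8"))      # True
```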

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: [svn] Recode problem

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sun, 2005-05-08 at 17:46 +0200, Vincent Lefevre wrote:
> > My best understanding is that SVN is doing exactly what a
> > locale-aware program ought to be doing, and MacOS X is not
> > conforming to the standards. Locale-aware programs are not expected
> > to know how to read global configuration files on every Unix-like
> > operating system; they are supposed to use the environment.
> 
> However the environment can't tell everything. Locales aren't even
> a system-level configuration, but just a user choice. Two users on
> a system can use different locales. Even one user on a system can
> use different locales (this is useful as not all programs support
> UTF-8). So, every time an encoding is used, it is better to have
> some place or some convention to find the encoding, whatever the
> data (XML files, text files, text parts in a binary format, mail
> messages, filenames, etc.).

Uh, okay.  I read a filename from a directory.  Where can I find out its
character set?


Re: [svn] Recode problem

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-05-06 11:20:32 -0400, Greg Hudson wrote:
> On Fri, 2005-05-06 at 15:18 +0300, Heikki Orsila wrote:
> > Why can't SVN read systems default settings like the rest of the
> > tools?
> 
> My best understanding is that SVN is doing exactly what a
> locale-aware program ought to be doing, and MacOS X is not
> conforming to the standards. Locale-aware programs are not expected
> to know how to read global configuration files on every Unix-like
> operating system; they are supposed to use the environment.

However the environment can't tell everything. Locales aren't even
a system-level configuration, but just a user choice. Two users on
a system can use different locales. Even one user on a system can
use different locales (this is useful as not all programs support
UTF-8). So, every time an encoding is used, it is better to have
some place or some convention to find the encoding, whatever the
data (XML files, text files, text parts in a binary format, mail
messages, filenames, etc.).

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: [svn] Recode problem

Posted by Greg Hudson <gh...@MIT.EDU>.
On Fri, 2005-05-06 at 15:18 +0300, Heikki Orsila wrote:
> Why can't SVN read systems default settings like the rest of the
> tools?

My best understanding is that SVN is doing exactly what a locale-aware
program ought to be doing, and MacOS X is not conforming to the
standards.  Locale-aware programs are not expected to know how to read
global configuration files on every Unix-like operating system; they are
supposed to use the environment.

But presumably other tools also deal with this problem.  It would be
educational to know how they do so; we should not invent our own
solution.


Re: [svn] Recode problem

Posted by Branko Čibej <br...@xbc.nu>.
Heikki Orsila wrote:

> On Fri, 6 May 2005, Erik Huelsmann wrote:
> > It's not a bug since svn is not a normal shell tool: you expect to be
> > able to checkout on other supported systems, this is the price you
> > pay...
>
> It's a high price to pay if you want people from the CVS world to use your
> tool. Why not read the locale settings from global configuration files?
> For example, my Debian GNU/Linux system is configured with:
>
>	en_US.ISO-8859-15 ISO-8859-15
>	fi_FI@euro ISO-8859-15
>
> But the svn tool doesn't work with ISO-8859-15 names! Only a small amount of
> code per supported operating system is required to handle this.

If your locale is set correctly, then SVN will work correctly, too. What
you showed above is a list of locales that are /available/ on the
system, not the locale that's actually active. That's set via the LANG
or LC_* environment variables. I bet that if you type "locale" at the
shell prompt, you'll get neither of the ones you list above.

Here's the output on one of the Unix boxes I have here (also a Debian 
box, BTW):

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

And this is in /etc/environment:

LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
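
For reference, the precedence among these variables can be sketched in
Python. For the character-type category, POSIX specifies that LC_ALL
overrides LC_CTYPE, which overrides LANG. The helper names below are
hypothetical, and the codeset parsing is a simplification: a name without
an explicit codeset (such as fi_FI@euro) gets its encoding from the
system's locale definitions, which this sketch does not consult.

```python
from typing import Mapping, Optional


def effective_ctype_locale(env: Mapping[str, str]) -> str:
    """Return the locale name that governs character handling."""
    # Unset or empty variables are skipped; "C" is the final default.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def explicit_codeset(locale_name: str) -> Optional[str]:
    """Extract the codeset from language[_territory][.codeset][@modifier]."""
    base = locale_name.split("@", 1)[0]
    if "." in base:
        return base.split(".", 1)[1]
    return None  # not stated in the name itself


env = {"LANG": "fi_FI@euro", "LC_ALL": "en_US.UTF-8"}
active = effective_ctype_locale(env)
print(active)                          # en_US.UTF-8
print(explicit_codeset(active))        # UTF-8
print(explicit_codeset("fi_FI@euro"))  # None
```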


-- Brane


Re: [svn] Recode problem

Posted by Heikki Orsila <sh...@modeemi.cs.tut.fi>.
On Fri, 6 May 2005, Erik Huelsmann wrote:
> It's not a bug since svn is not a normal shell tool: you expect to be
> able to checkout on other supported systems, this is the price you
> pay...

It's a high price to pay if you want people from the CVS world to use your
tool. Why not read the locale settings from global configuration files?
For example, my Debian GNU/Linux system is configured with:

	en_US.ISO-8859-15 ISO-8859-15
	fi_FI@euro ISO-8859-15

But the svn tool doesn't work with ISO-8859-15 names! Only a small amount of
code per supported operating system is required to handle this.

Heikki Orsila			Barbie's law:
heikki.orsila@iki.fi		"Math is hard, let's go shopping!"
http://www.iki.fi/shd


Re: [svn] Recode problem

Posted by Erik Huelsmann <eh...@gmail.com>.
On 5/6/05, Heikki Orsila <sh...@modeemi.cs.tut.fi> wrote:
> On Fri, 6 May 2005, Ben Collins-Sussman wrote:
> > Subversion *does* accept any filename.  However, each filename must be
> > converted to UTF8 when sent into the repository.  The error you're
> > seeing is that your current locale is unable to interpret the filename,
> > and thus cannot 'recode' the filename into UTF8.  You must set your
> > locale correctly.
> >
> > The only bug here is that the error message isn't friendly to users.
> > In svn 1.2.0, the message is friendlier.
> 
> No, it's a bug since normal shell tools don't reject such filenames. 

It's not a bug since svn is not a normal shell tool: you expect to be
able to checkout on other supported systems, this is the price you
pay...

bye,


Erik.

Re: [svn] Recode problem

Posted by Heikki Orsila <sh...@modeemi.cs.tut.fi>.
On Fri, 6 May 2005, Ben Collins-Sussman wrote:
> Subversion *does* accept any filename.  However, each filename must be
> converted to UTF8 when sent into the repository.  The error you're
> seeing is that your current locale is unable to interpret the filename,
> and thus cannot 'recode' the filename into UTF8.  You must set your
> locale correctly.
>
> The only bug here is that the error message isn't friendly to users.
> In svn 1.2.0, the message is friendlier.

No, it's a bug since normal shell tools don't reject such filenames. For
example, ls dir/ doesn't throw an error when it encounters an ISO-8859-15
name. Setting the locale just to make the svn tool work is somewhat odd. Why
can't SVN read the system's default settings like the rest of the tools? From
web searches, it seems that this behaviour causes problems for many people.

Heikki Orsila			Barbie's law:
heikki.orsila@iki.fi		"Math is hard, let's go shopping!"
http://www.iki.fi/shd


Re: [svn] Recode problem

Posted by Ben Collins-Sussman <su...@collab.net>.
On May 6, 2005, at 3:35 AM, Heikki Orsila wrote:

> The default behaviour of SVN is very annoying: importing files often fails
> for an error 'svn: Can't recode string'. The default behaviour should be
> to accept _any_ filenames (zero terminated byte strings).


Subversion *does* accept any filename.  However, each filename must be 
converted to UTF8 when sent into the repository.  The error you're 
seeing is that your current locale is unable to interpret the filename, 
and thus cannot 'recode' the filename into UTF8.  You must set your 
locale correctly.

The only bug here is that the error message isn't friendly to users.  
In svn 1.2.0, the message is friendlier.
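
The conversion described here can be sketched in Python. This is only a
model of the behaviour, assuming the locale's character set is known as a
string; the function name is hypothetical and the error text merely mimics
the message quoted above.

```python
def recode_to_utf8(raw_name: bytes, locale_charset: str) -> bytes:
    """Interpret raw_name in locale_charset and re-encode it as UTF-8."""
    try:
        text = raw_name.decode(locale_charset)
    except UnicodeDecodeError as exc:
        # The bytes are not valid in the assumed locale encoding.
        raise ValueError(
            f"Can't recode {raw_name!r} from {locale_charset}"
        ) from exc
    return text.encode("utf-8")


# A Finnish filename ("paiva.txt" with a-umlauts) stored as ISO-8859-15 bytes.
latin9_name = "p\u00e4iv\u00e4.txt".encode("iso8859-15")

# With the matching locale charset the conversion succeeds.
print(recode_to_utf8(latin9_name, "iso8859-15"))  # b'p\xc3\xa4iv\xc3\xa4.txt'

# With a non-matching charset (plain ASCII or UTF-8) it fails.
for charset in ("ascii", "utf-8"):
    try:
        recode_to_utf8(latin9_name, charset)
    except ValueError as err:
        print(err)
```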

