Posted to dev@subversion.apache.org by Ben Collins-Sussman <su...@collab.net> on 2002/07/10 21:31:11 UTC

utf-8 sanity check.

I'm getting ready to write some python tests that verify that we can
deal with paths that have international characters in them.  

But before I do that, I want to make sure I understand what's going
on in our code:

  * our application's main() calls setlocale(LC_ALL, locale) if
    --locale is passed by the user.  This officially sets the locale for
    our process.

  * our utf.c routines set up a translation table by calling
    apr_xlate_open() with two arguments: "UTF-8" and APR_LOCALE_CHARSET.

  * The latter argument causes apr_xlate_open to call nl_langinfo(CODESET).  

  * nl_langinfo(), part of libc, then returns the charset defined by
    the program's locale.  (according to my man page, at least.)

So by this trace, it seems to me that we're all ready to go, then.
There's no need to cache the --locale argument and somehow pass it
down into our svn_utf_* routines.

Am I correct?
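
The trace above can be sketched with Python's locale module, which
wraps the same libc calls.  This is only an illustration under
assumptions: the real path is C code going through apr_xlate_open(),
the helper name locale_charset() is hypothetical, and the "C" locale is
used because it is the only one guaranteed to be installed.

```python
import locale


def locale_charset(locale_name):
    """Set the process locale, then ask libc which charset it implies."""
    # Counterpart of main() calling setlocale(LC_ALL, locale).
    locale.setlocale(locale.LC_ALL, locale_name)
    # nl_langinfo() does not exist on every platform (e.g. Windows).
    if not hasattr(locale, "nl_langinfo"):
        return None
    # Counterpart of nl_langinfo(CODESET) in libc.
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    # The exact charset name printed depends on the libc in use.
    print(locale_charset("C"))
```

Nothing is passed between the two steps except the process-wide
locale, which is the point of the question: the charset lookup needs no
cached copy of the --locale argument.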

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: utf-8 sanity check.

Posted by Nuutti Kotivuori <na...@iki.fi>.
Ben Collins-Sussman wrote:
>   * our application's main() calls setlocale(LC_ALL, locale) if
>     --locale is passed by the user.  This officially sets the locale
>     for our process.

Tried this out with a small local modification. This also has the
local modification to the log output, that has not been committed
yet. I just allowed the --locale option on the subcommand log.

See this:

naked@oro:~/src/subversion/svn$ svn log -rHEAD
------------------------------------------------------------------------
rev 2463:  jerenkrantz | 2002-07-10 22:40:21 +0300 (Wed, 10 Jul 2002) | 3 lines

naked@oro:~/src/subversion/svn$ svn log --locale fi_FI -rHEAD
------------------------------------------------------------------------
rev 2463:  jerenkrantz | 2002-07-10 22:40:21 +0300 (ke, 10 heinä  2002) | 3 lines

Woowoo!

Then:

naked@oro:~/src/subversion/svn$ LC_TIME=fi_FI svn log -rHEAD
------------------------------------------------------------------------
rev 2463:  jerenkrantz | 2002-07-10 22:40:21 +0300 (Wed, 10 Jul 2002) | 3 lines

naked@oro:~/src/subversion/svn$ LC_TIME=fi_FI svn log --locale "" -rHEAD
------------------------------------------------------------------------
rev 2463:  jerenkrantz | 2002-07-10 22:40:21 +0300 (ke, 10 heinä  2002) | 3 lines

If you don't understand why this is, take a peek at the source; it
will become clear. This is not to say that the rationale there couldn't
be questioned.
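
A likely reading, assuming svn only calls setlocale() when --locale is
given: a C program starts in the "C" locale and ignores LC_TIME until
setlocale(..., "") adopts the environment.  Python behaves the same way
for LC_TIME, so the effect can be sketched there.  The helper name
lc_time_in_child() is hypothetical, and fi_FI may not be installed on
the machine running this, which the sketch tolerates.

```python
import os
import subprocess
import sys

# Child program: report LC_TIME before and after adopting the environment.
CHILD = r'''
import locale
print(locale.setlocale(locale.LC_TIME))          # query only, no change
try:
    print(locale.setlocale(locale.LC_TIME, ""))  # now read the environment
except locale.Error:
    print("unavailable")                         # locale not installed here
'''


def lc_time_in_child(lc_time):
    """Run a child with LC_TIME set; return (before, after) locale names."""
    env = dict(os.environ, LC_TIME=lc_time)
    env.pop("LC_ALL", None)  # LC_ALL would take precedence over LC_TIME
    out = subprocess.run([sys.executable, "-c", CHILD], env=env,
                         capture_output=True, text=True, check=True).stdout
    before, after = out.splitlines()
    return before, after


if __name__ == "__main__":
    print(lc_time_in_child("fi_FI"))
```

The first value stays "C" whatever LC_TIME says, which matches the
English dates in the run without --locale.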

But anyway, just a brief sample of things to come.

-- Naked



Re: utf-8 sanity check.

Posted by Branko Čibej <br...@xbc.nu>.
Ben Collins-Sussman wrote:

>I'm getting ready to write some python tests that verify that we can
>deal with paths that have international characters in them.  
>
>But before I do that, I want to make sure I understand what's going
>on in our code:
>
>  * our application's main() calls setlocale(LC_ALL, locale) if
>    --locale is passed by the user.  This officially sets the locale for
>    our process.
>

Maybe this would be an auspicious time to heed the comment just before 
the setlocale() call in clients/cmdline/main.c?

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: utf-8 sanity check.

Posted by Ben Collins-Sussman <su...@collab.net>.
Branko Čibej <br...@xbc.nu> writes:

> Ben Collins-Sussman wrote:
> 
> >Ben Collins-Sussman <su...@collab.net> writes:
> >
> >
> >>  * nl_langinfo(), part of libc, then returns the charset defined by
> >>    the program's locale.  (according to my man page, at least.)
> >>
> >
> >Oof, I just realized that there's no win32 fork under apr/i18n/.
> >
> >According to Herr Tutt, there is no nl_langinfo() on win32.  This
> >means svn currently has no i18n support on win32.
> >
> >Maybe we can persuade Bill or Branko to write apr/i18n/win32/xlate.c.
> >It's a teeny-weeny API.  :-)
> >
> 
> You, sir, are asking for a bash on the ear, and no mistake.
> 
>     Brane, taking half a day off my holidays to wade through 300+ mails.

Well, it's hard to claim i18n support in Alpha if it doesn't work on
win32.  Even if you don't want to do it, it seems to me like this is
another Alpha task that needs to be scheduled.


Re: utf-8 sanity check.

Posted by Branko Čibej <br...@xbc.nu>.
Ben Collins-Sussman wrote:

>Ben Collins-Sussman <su...@collab.net> writes:
>
>  
>
>>  * nl_langinfo(), part of libc, then returns the charset defined by
>>    the program's locale.  (according to my man page, at least.)
>>    
>>
>
>Oof, I just realized that there's no win32 fork under apr/i18n/.
>
>According to Herr Tutt, there is no nl_langinfo() on win32.  This
>means svn currently has no i18n support on win32.
>
>Maybe we can persuade Bill or Branko to write apr/i18n/win32/xlate.c.
>It's a teeny-weeny API.  :-)
>  
>

You, sir, are asking for a bash on the ear, and no mistake.

    Brane, taking half a day off my holidays to wade through 300+ mails.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: utf-8 sanity check.

Posted by Ben Collins-Sussman <su...@collab.net>.
Ben Collins-Sussman <su...@collab.net> writes:

>   * nl_langinfo(), part of libc, then returns the charset defined by
>     the program's locale.  (according to my man page, at least.)

Oof, I just realized that there's no win32 fork under apr/i18n/.

According to Herr Tutt, there is no nl_langinfo() on win32.  This
means svn currently has no i18n support on win32.

Maybe we can persuade Bill or Branko to write apr/i18n/win32/xlate.c.
It's a teeny-weeny API.  :-)



Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ulrich Drepper <dr...@redhat.com> writes:

> On Thu, 2002-07-11 at 02:36, Marcus Comstedt wrote:
> 
> >   env LC_CTYPE=en_US.IBM273 svn foo bar
> 
> I wasn't trying to imply that this is any better. It might be a problem
> with the specification of env, though, that it isn't working.
> 
> But fact is that if you change the locale in a shell with
> 
>   export LC_ALL=en_US.IBM273
> 
> the shell from that point on uses the new locales data to parse the
> command line.  And you cannot interpret the parameters with anything
> different.
> 
> This is the theory.  In practice this will hardly ever make a difference
> but...


Yup.  To this I agree 100%.


  // Marcus




Re: utf-8 sanity check.

Posted by Ulrich Drepper <dr...@redhat.com>.
On Thu, 2002-07-11 at 02:36, Marcus Comstedt wrote:

>   env LC_CTYPE=en_US.IBM273 svn foo bar

I wasn't trying to imply that this is any better. It might be a problem
with the specification of env, though, that it isn't working.

But fact is that if you change the locale in a shell with

  export LC_ALL=en_US.IBM273

the shell from that point on uses the new locales data to parse the
command line.  And you cannot interpret the parameters with anything
different.

This is the theory.  In practice this will hardly ever make a difference
but...

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ulrich Drepper <dr...@redhat.com> writes:

> All parameters must be interpreted according to the locale of the
> shell.  This gets overwritten by the use of env.  So all parameters must
> use Latin1.  If you'd want some option which overwrites the locale for
> subsequent characters you might get into trouble.  E.g., in
> 
>    svn --locale=en_US.IBM273 foo bar
> 
> the <SPACE> you see in the mail is actually a <U0080> in IBM273 which
> might be no separating character in the locale and therefore svn might
> be called with just one parameter (assuming that neither "foo" nor "bar"
> are byte representations for a white-space character in IBM273, I
> haven't checked it).  You should get the idea.

I'm not sure ASCII-incompatible locales (such as EBCDIC in this case)
would work anyway, since there is bound to be code that assumes that
plain ASCII strings can be used without conversion.

Besides, if you do

  env LC_CTYPE=en_US.IBM273 svn foo bar

the shell will parse foo and bar into the argument list of env
_before_ the locale is changed.  env will not reparse the list, only
shift away the LC_CTYPE argument.  Thus svn will be called with two
parameters anyway.
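
The ordering can be simulated in a few lines.  This is a sketch, not
the behaviour of any particular shell: shlex stands in for POSIX-style
word splitting, and simulate_env() is a hypothetical helper that mimics
how env(1) peels off leading NAME=VALUE words.

```python
import shlex


def simulate_env(command_line):
    """Split like a POSIX shell, then peel NAME=VALUE words like env(1)."""
    words = shlex.split(command_line)  # the shell splits first
    if not words or words[0] != "env":
        raise ValueError("expected a command line starting with env")
    overrides = {}
    rest = words[1:]
    # env consumes leading NAME=VALUE words; it never re-splits the rest.
    while rest and "=" in rest[0]:
        name, value = rest.pop(0).split("=", 1)
        overrides[name] = value
    return overrides, rest


overrides, argv = simulate_env("env LC_CTYPE=en_US.IBM273 svn foo bar")
print(overrides)  # {'LC_CTYPE': 'en_US.IBM273'}
print(argv)       # ['svn', 'foo', 'bar']
```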

So I don't think this line of reasoning is valid.  Of course, the
policy you're suggesting might be as good as any.  The thing to
remember with it is this:

Let's say I have a file called `räksmörgås' in the repository.  If I
give the option --locale=sv_SE.ISO646-SE to svn, it means that
filenames in the wc should be encoded using this character set, so the
local filename would be `r{ksm|rg}s' if interpreted as US-ASCII.
Let's say that the shell locale is sv_SE.ISO8859-1.  Now the correct
way to update the file would then be

  svn --locale=sv_SE.ISO646-SE update räksmörgås

as you would expect.  (Without the --locale argument, the command
would not work since the file in the wc would not have the expected
filename.)  The part that might be unexpected is that tab-completion
will not work, since that would give

  svn --locale=sv_SE.ISO646-SE update r\{ksm\|rg}s

and that will fail since there is no file named `r{ksm|rg}s' (that
name can't even be represented in ISO646-SE, so you'll get a recoding
error).
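
To make the recoding concrete, here is a small sketch.  Python ships no
ISO646-SE codec, so the table below is hand-written and partial (an
assumption of this example: only the six letter positions are mapped),
and encode_iso646_se() is a hypothetical helper.

```python
# Partial, hand-written table: only the six letter positions of ISO 646-SE.
SE_LETTERS = {"Ä": 0x5B, "Ö": 0x5C, "Å": 0x5D,
              "ä": 0x7B, "ö": 0x7C, "å": 0x7D}
# ASCII punctuation displaced by those letters: [ \ ] { | }
DISPLACED = {chr(byte) for byte in SE_LETTERS.values()}


def encode_iso646_se(text):
    """Encode text to ISO 646-SE bytes, failing on unrepresentable input."""
    out = bytearray()
    for ch in text:
        if ch in SE_LETTERS:
            out.append(SE_LETTERS[ch])
        elif ch in DISPLACED or ord(ch) > 0x7F:
            raise ValueError(f"{ch!r} has no ISO 646-SE encoding")
        else:
            out.append(ord(ch))
    return bytes(out)


print(encode_iso646_se("räksmörgås"))  # b'r{ksm|rg}s'
try:
    encode_iso646_se("r{ksm|rg}s")     # what tab-completion would hand over
except ValueError as err:
    print("recoding error:", err)
```

The first call gives the on-disk name; the second fails because the
brace and bar bytes already mean Swedish letters in this charset.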

Naturally, there's bound to be some strangeness if the shell locale
and --locale do not match.  And not being able to use tab-completion
is probably a minimal breakage.  So I'm leaning towards this being a
rather good approach.  But then again, I never understood the use-case
in which you want to use --locale instead of having the shell locale
set properly...

(Implementation-wise, doing it like this requires some changes to the
 code in addition to the cache invalidation function, due to the fact
 that translation of the arguments which are not -options is currently
 deferred to the call of parse_{num/all}_args or args_to_target_array,
 and we'd need to translate all args before changing locale.  Nothing
 major though.)


  // Marcus




Re: utf-8 sanity check.

Posted by Ulrich Drepper <dr...@redhat.com>.
On Wed, 2002-07-10 at 15:01, Marcus Comstedt wrote:


>   env LC_CTYPE=en_GB.ISO8859-1 svn --option1=¤ \
>          --locale=en_GB.ISO8859-15 --option2=¤ blah blah
> 
> should the value of --option1 be interpreted according to latin-1 or
> latin-9?  What about --option2?

All parameters must be interpreted according to the locale of the
shell.  This gets overwritten by the use of env.  So all parameters must
use Latin1.  If you'd want some option which overwrites the locale for
subsequent characters you might get into trouble.  E.g., in

   svn --locale=en_US.IBM273 foo bar

the <SPACE> you see in the mail is actually a <U0080> in IBM273 which
might be no separating character in the locale and therefore svn might
be called with just one parameter (assuming that neither "foo" nor "bar"
are byte representations for a white-space character in IBM273, I
haven't checked it).  You should get the idea.

If locale-encoding-specific options are needed you need to pass them
separately from the command line, e.g., in a config file.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Re: utf-8 sanity check.

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Ben Collins-Sussman <su...@collab.net> writes:
> I have a working copy full of directories and filenames that contain
> high-ascii characters.
> 
> When you run status, or update, or commit, we not only walk over
> entries in the entries file (as UTF-8 strings), but sometimes we stat
> the files directly as well.  Then the UTF-8 paths and filenames need
> to be converted to native charset so we can call apr_file_*()
> routines, no?
> 
> Keep in mind, I'm assuming that the charset of your working copy is
> somehow different than your system locale.  But that's no more
> far-fetched than possibly writing a log message in a locale different
> than your system locale.

Oh, I see.

+shudder+

(Actually, I suspect it is a *bit* more far-fetched than out-of-locale
log messages, but yes, it could happen :-). )


Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ben Collins-Sussman <su...@collab.net> writes:

> Keep in mind, I'm assuming that the charset of your working copy is
> somehow different than your system locale.  But that's no more
> far-fetched than possibly writing a log message in a locale different
> than your system locale.

Actually, I don't quite see the utility of having your system locale
charset set differently from the charset of your wc.  When would you
want to do that, and why?  It sounds like creating trouble for
yourself just because you can.  :)


  // Marcus




Re: utf-8 sanity check.

Posted by Ben Collins-Sussman <su...@collab.net>.
Karl Fogel <kf...@newton.ch.collab.net> writes:

> Ben Collins-Sussman <su...@collab.net> writes:
> > Use --locale:
> > 
> >    * to specify the charset of the log message being read by -F
> > 
> >    * to interpret working copy paths that will be read later on
> >      (i.e. 'svn up', with no args)
> 
> I'm not sure I understand the latter scenario... ?

I have a working copy full of directories and filenames that contain
high-ascii characters.

When you run status, or update, or commit, we not only walk over
entries in the entries file (as UTF-8 strings), but sometimes we stat
the files directly as well.  Then the UTF-8 paths and filenames need
to be converted to native charset so we can call apr_file_*()
routines, no?

Keep in mind, I'm assuming that the charset of your working copy is
somehow different than your system locale.  But that's no more
far-fetched than possibly writing a log message in a locale different
than your system locale.
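
The conversion step can be sketched like this.  It is an illustration
only: utf8_to_native() is a hypothetical name, ISO-8859-1 is an assumed
native charset, and the real code would go through the svn_utf_*
routines before calling apr_file_*().

```python
def utf8_to_native(path_utf8, native_charset):
    """Recode a UTF-8 path into the charset the filesystem calls expect."""
    return path_utf8.decode("utf-8").encode(native_charset)


stored = "räksmörgås".encode("utf-8")   # as kept in the entries file
native = utf8_to_native(stored, "iso-8859-1")
print(len(stored), len(native))         # 13 10
```

The byte strings differ, so stat'ing the UTF-8 form directly would look
for a file that does not exist under that name.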




Re: utf-8 sanity check.

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Martin Pool <mb...@samba.org> writes:
> If all you need to do is handle commit messages in other charsets, why
> not just require the user to run iconv(1) on their commit message file
> to convert it to UTF-8 before running Subversion?

They'd still have to tell Subversion that it's in UTF-8, not native
(locale) encoding, so that doesn't seem like much gain.


Re: utf-8 sanity check.

Posted by Martin Pool <mb...@samba.org>.
On 11 Jul 2002, Karl Fogel <kf...@newton.ch.collab.net> wrote:
> Marcus Comstedt <ma...@mc.pp.se> writes:
> > > <bikeshed> --message-locale, it just sounds more natural. </bikeshed>
> > 
> > Or --message-charset even.  It might make more sense to supply a
> > charset name to pass to apr_xlate_open rather than the name of a
> > locale, if only the character encoding should be changed.

If all you need to do is handle commit messages in other charsets, why
not just require the user to run iconv(1) on their commit message file
to convert it to UTF-8 before running Subversion?
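
For illustration, the same pre-conversion in a few lines of Python,
roughly what "iconv -f BIG5 -t UTF-8" would do, except that the file is
rewritten in place.  The function name recode_file_to_utf8() and the
sample text are assumptions of this sketch.

```python
import os
import tempfile


def recode_file_to_utf8(path, source_charset):
    """Rewrite a log message file in place as UTF-8."""
    with open(path, "rb") as f:
        text = f.read().decode(source_charset)
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


if __name__ == "__main__":
    # Write a two-character Big5 message, then convert it.
    fd, path = tempfile.mkstemp()
    os.write(fd, "\u4e2d\u6587".encode("big5"))
    os.close(fd)
    recode_file_to_utf8(path, "big5")
    with open(path, "rb") as f:
        print(f.read().decode("utf-8"))
    os.remove(path)
```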

-- 
Martin 


Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ulrich Drepper <dr...@redhat.com> writes:

> But "charset" is also not correct.  Use --message-encoding.

It's not strictly correct, but it's more descriptive than encoding
(which can mean any old encoding, not just encoding of characters) and
also recognizable since it's used in the same way in MIME for example.
--message-character-encoding would be the most correct, but it's a bit
too long.


  // Marcus




Re: utf-8 sanity check.

Posted by Ulrich Drepper <dr...@redhat.com>.
On Thu, 2002-07-11 at 20:46, Karl Fogel wrote:

> > Or --message-charset even.  It might make more sense to supply a
> > charset name to pass to apr_xlate_open rather than the name of a
> > locale, if only the character encoding should be changed.
> 
> Yes, let's keep the word `locale' out of it.

But "charset" is also not correct.  Use --message-encoding.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Re: utf-8 sanity check.

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Marcus Comstedt <ma...@mc.pp.se> writes:
> > <bikeshed> --message-locale, it just sounds more natural. </bikeshed>
> 
> Or --message-charset even.  It might make more sense to supply a
> charset name to pass to apr_xlate_open rather than the name of a
> locale, if only the character encoding should be changed.

Yes, let's keep the word `locale' out of it.


Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Garrett Rooney <ro...@electricjellyfish.net> writes:

> <bikeshed> --message-locale, it just sounds more natural. </bikeshed>

Or --message-charset even.  It might make more sense to supply a
charset name to pass to apr_xlate_open rather than the name of a
locale, if only the character encoding should be changed.


  // Marcus




Re: utf-8 sanity check.

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On Thu, Jul 11, 2002 at 05:48:26PM -0500, Karl Fogel wrote:
> > And if we're only talking about the log message, then shouldn't the switch
> > be --locale-message ?
> 
> Yes, I've said that from the beginning :-).

<bikeshed> --message-locale, it just sounds more natural. </bikeshed>

-garrett 

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Karl Fogel <kf...@newton.ch.collab.net> writes:

> Greg Stein <gs...@lyra.org> writes:
> 
> > And if we're only talking about the log message, then shouldn't the switch
> > be --locale-message ?
> 
> Yes, I've said that from the beginning :-).

Agreed.  If you use the switch with the sole purpose of changing the
interpretation of the log message file, then having the switch change
the global locale, and thus things like interpretation of filenames,
is not the right thing to do.  A switch to set the locale of the log
message should do just that, and the name should reflect this fact.


  // Marcus




Re: utf-8 sanity check.

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Greg Stein <gs...@lyra.org> writes:
> The first option (the log message) is about the only thing that I (barely)
> can see as a true use case. So now your requirement can be rewritten as:
> 
>   "As long as we have a way of supporting a log message in any charset, on
>    all supported platforms, then ..."
> 
> Now how many people will actually be doing that? Karl, you've used yourself
> as an example of writing in Big-5, yet your system locale is (presumably)
> ISO-8859-1. But are you truly a typical user? Will your log messages really
> be different?

Brane has already written us saying he *has* done this before, so I
think we can treat it as a real use case.

> And if we're only talking about the log message, then shouldn't the switch
> be --locale-message ?

Yes, I've said that from the beginning :-).


Re: utf-8 sanity check.

Posted by Kevin Pilch-Bisson <ke...@pilch-bisson.net>.
I don't know if anyone remembers this, but the reason for the --locale switch
was to make the OUTPUT of svn use the given locale.  Branko wanted to add that
feature to make programs that parsed svn output have a reliable way to ensure
that svn would be generating output in the format that they expect.



On Thu, Jul 11, 2002 at 03:51:22PM -0700, Greg Stein wrote:
> On Thu, Jul 11, 2002 at 05:16:49PM -0500, Karl Fogel wrote:
> > Greg Stein <gs...@lyra.org> writes:
> > > Not at all. The Python test can easily set the appropriate locale environ
> > > variable before executing the test process (the exec*e functions).
> > > 
> > > $ svn --locale=FOO status
> > > $ LC_CTYPE=FOO svn status
> > > 
> > > They're the same number of characters, so don't even try to say "well, it is
> > > easier to type". Why are we writing code for this? The system environment
> > > variable is the defined way to handle this stuff. Why invent something new?
> > > 
> > > IMO, it was a mistake to put in initially, and a mistake to retain/fix.
> > 
> > As long as we have a way of doing this in environment, at run time, on
> > all supported platforms, then I'm +1 on removing the option as well.
> 
> I think you're overstating the level of the requirement. I've seen two
> stated use cases for the --locale switch:
> 
>   1) your log message is not in the system locale
> 
>   2) your working copy is not in the system locale
> 
> I find the second case "bogus" (I can't think of a less connotative word).
> If your WC isn't using the system locale, then *what* other programs are
> going to work well with it?
> 
> Exercise for the reader: name one other program with a --locale switch,
> which somebody might be using with that WC. (heck, name one period)
> 
> 
> The first option (the log message) is about the only thing that I (barely)
> can see as a true use case. So now your requirement can be rewritten as:
> 
>   "As long as we have a way of supporting a log message in any charset, on
>    all supported platforms, then ..."
> 
> Now how many people will actually be doing that? Karl, you've used yourself
> as an example of writing in Big-5, yet your system locale is (presumably)
> ISO-8859-1. But are you truly a typical user? Will your log messages really
> be different?
> 
> And if we're only talking about the log message, then shouldn't the switch
> be --locale-message ?
> 
> > But
> > 
> >    $ LC_CTYPE=FOO svn status
> > 
> > will not work on non-Unix, and won't necessarily even work in all Unix
> > shells, right?  Not saying that's a showstopper, just want to know
> > what the right solution *is* for those other situations...
> 
> $ env LC_CTYPE=FOO svn status
> 
> will work on all shells. For the Windows case, I still question the utility.
> I *really* feel that we're catering to a fringe case that actually won't be
> used. I think the --locale switch was added before we knew what it should be
> used for, and now we're just compensating. It seems that we are taking its
> presence as a statement of need, rather than considering the option of
> simply removing the thing.
> 
> Cheers,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/
> 
> 

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kevin Pilch-Bisson                    http://www.pilch-bisson.net
     "Historically speaking, the presences of wheels in Unix
     has never precluded their reinvention." - Larry Wall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re: utf-8 sanity check.

Posted by Greg Stein <gs...@lyra.org>.
On Thu, Jul 11, 2002 at 05:16:49PM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > Not at all. The Python test can easily set the appropriate locale environ
> > variable before executing the test process (the exec*e functions).
> > 
> > $ svn --locale=FOO status
> > $ LC_CTYPE=FOO svn status
> > 
> > They're the same number of characters, so don't even try to say "well, it is
> > easier to type". Why are we writing code for this? The system environment
> > variable is the defined way to handle this stuff. Why invent something new?
> > 
> > IMO, it was a mistake to put in initially, and a mistake to retain/fix.
> 
> As long as we have a way of doing this in environment, at run time, on
> all supported platforms, then I'm +1 on removing the option as well.

I think you're overstating the level of the requirement. I've seen two
stated use cases for the --locale switch:

  1) your log message is not in the system locale

  2) your working copy is not in the system locale

I find the second case "bogus" (I can't think of a less connotative word).
If your WC isn't using the system locale, then *what* other programs are
going to work well with it?

Exercise for the reader: name one other program with a --locale switch,
which somebody might be using with that WC. (heck, name one period)


The first option (the log message) is about the only thing that I (barely)
can see as a true use case. So now your requirement can be rewritten as:

  "As long as we have a way of supporting a log message in any charset, on
   all supported platforms, then ..."

Now how many people will actually be doing that? Karl, you've used yourself
as an example of writing in Big-5, yet your system locale is (presumably)
ISO-8859-1. But are you truly a typical user? Will your log messages really
be different?

And if we're only talking about the log message, then shouldn't the switch
be --locale-message ?

> But
> 
>    $ LC_CTYPE=FOO svn status
> 
> will not work on non-Unix, and won't necessarily even work in all Unix
> shells, right?  Not saying that's a showstopper, just want to know
> what the right solution *is* for those other situations...

$ env LC_CTYPE=FOO svn status

will work on all shells. For the Windows case, I still question the utility.
I *really* feel that we're catering to a fringe case that actually won't be
used. I think the --locale switch was added before we knew what it should be
used for, and now we're just compensating. It seems that we are taking its
presence as a statement of need, rather than considering the option of
simply removing the thing.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: utf-8 sanity check.

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Greg Stein <gs...@lyra.org> writes:
> Not at all. The Python test can easily set the appropriate locale environ
> variable before executing the test process (the exec*e functions).
> 
> $ svn --locale=FOO status
> $ LC_CTYPE=FOO svn status
> 
> They're the same number of characters, so don't even try to say "well, it is
> easier to type". Why are we writing code for this? The system environment
> variable is the defined way to handle this stuff. Why invent something new?
> 
> IMO, it was a mistake to put in initially, and a mistake to retain/fix.

As long as we have a way of doing this in environment, at run time, on
all supported platforms, then I'm +1 on removing the option as well.

But

   $ LC_CTYPE=FOO svn status

will not work on non-Unix, and won't necessarily even work in all Unix
shells, right?  Not saying that's a showstopper, just want to know
what the right solution *is* for those other situations...

-K


Re: utf-8 sanity check.

Posted by Greg Stein <gs...@lyra.org>.
On Thu, Jul 11, 2002 at 04:44:52PM -0500, Ben Collins-Sussman wrote:
>...
> > > Ben Collins-Sussman <su...@collab.net> writes:
> > > > Use --locale:
> > > >
> > > >    * to specify the charset of the log message being read by -F
> > > >
> > > >    * to interpret working copy paths that will be read later on
> > > >      (i.e. 'svn up', with no args)

Bah. The environ variables should be used.

> > > I'm not sure I understand the latter scenario... ?
> > 
> > I'm not sure that makes much sense either. The only possible reason to
> > do this that I can think of is mounting a hard drive that was created a
> > completely different codepage compared to the default codepage. Even
> > then, I'm still not sure that the intervening layers could handle that.
> > 
> > No harm gained in supporting it of course. 
> 
> The point is, '--locale' is simply a way of manually overriding the
> system locale.  At a minimum, it's useful for testing our utf-8
> abilities in a python test, no?  :-)

Not at all. The Python test can easily set the appropriate locale environ
variable before executing the test process (the exec*e functions).

$ svn --locale=FOO status
$ LC_CTYPE=FOO svn status

They're the same number of characters, so don't even try to say "well, it is
easier to type". Why are we writing code for this? The system environment
variable is the defined way to handle this stuff. Why invent something new?

IMO, it was a mistake to put in initially, and a mistake to retain/fix.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: utf-8 sanity check.

Posted by Ben Collins-Sussman <su...@collab.net>.
"Bill Tutt" <ra...@lyra.org> writes:

> > From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
> > 
> > Ben Collins-Sussman <su...@collab.net> writes:
> > > Use --locale:
> > >
> > >    * to specify the charset of the log message being read by -F
> > >
> > >    * to interpret working copy paths that will be read later on
> > >      (i.e. 'svn up', with no args)
> > 
> > I'm not sure I understand the latter scenario... ?
> > 
> 
> I'm not sure that makes much sense either. The only possible reason to
> do this that I can think of is mounting a hard drive that was created with
> a completely different codepage compared to the default codepage. Even
> then, I'm still not sure that the intervening layers could handle that.
> 
> No harm in supporting it, of course.

The point is, '--locale' is simply a way of manually overriding the
system locale.  At a minimum, it's useful for testing our utf-8
abilities in a python test, no?  :-)



RE: Re: utf-8 sanity check.

Posted by Bill Tutt <ra...@lyra.org>.
> From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
> 
> Ben Collins-Sussman <su...@collab.net> writes:
> > Use --locale:
> >
> >    * to specify the charset of the log message being read by -F
> >
> >    * to interpret working copy paths that will be read later on
> >      (i.e. 'svn up', with no args)
> 
> I'm not sure I understand the latter scenario... ?
> 

I'm not sure that makes much sense either. The only possible reason to
do this that I can think of is mounting a hard drive that was created with
a completely different codepage compared to the default codepage. Even
then, I'm still not sure that the intervening layers could handle that.

No harm in supporting it, of course.

This is aside from the problem that I don't think setlocale() gives you
enough information about the codepages used for mounted filesystems. For
example, an old Shift-JIS FAT drive that you need to mount for some
bizarre reason (or indeed some other complicated scenario).

Bill



Re: utf-8 sanity check.

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Ben Collins-Sussman <su...@collab.net> writes:
> Use --locale:
> 
>    * to specify the charset of the log message being read by -F
> 
>    * to interpret working copy paths that will be read later on
>      (i.e. 'svn up', with no args)

I'm not sure I understand the latter scenario... ?


Re: utf-8 sanity check.

Posted by Ben Collins-Sussman <su...@collab.net>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> Ben Collins-Sussman <su...@collab.net> writes:
> 
> > Conceptually, I much prefer the second option.  It seems cleaner to
> > me.  
> > 
> > Up till now, svn commandline switches have always been global;  order
> > has never mattered.  I would hate to have the placement of --locale
> > change behaviors;  I'd like it to be "global" to all target paths on
> > the commandline.
> 
> The approach suggested by Ulrich, which is to defer setting the new
> locale until all args have been parsed, also has the feature that
> placement of --locale does not matter.  The downside being that you
> can't use --locale to change the interpretation of the command line
> args at all.  Exactly what is the intended use of --locale?  Can
> anyone produce a reasonable use-case?  Discussing the semantics
> without knowing when the option is intended to be used seems rather
> pointless...

Use --locale:

   * to specify the charset of the log message being read by -F

   * to interpret working copy paths that will be read later on
     (i.e. 'svn up', with no args)

        


Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ben Collins-Sussman <su...@collab.net> writes:

> Conceptually, I much prefer the second option.  It seems cleaner to
> me.  
> 
> Up till now, svn commandline switches have always been global;  order
> has never mattered.  I would hate to have the placement of --locale
> change behaviors;  I'd like it to be "global" to all target paths on
> the commandline.

The approach suggested by Ulrich, which is to defer setting the new
locale until all args have been parsed, also has the feature that
placement of --locale does not matter.  The downside being that you
can't use --locale to change the interpretation of the command line
args at all.  Exactly what is the intended use of --locale?  Can
anyone produce a reasonable use-case?  Discussing the semantics
without knowing when the option is intended to be used seems rather
pointless...


  // Marcus




Re: utf-8 sanity check.

Posted by Ben Collins-Sussman <su...@collab.net>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> [...] However, there is a slight problem with --locale.  The handle
> returned by apr_xlate_open is cached globally, without any way to
> expire it.  This means that if any UTF conversion takes place
> _before_ the --locale argument is passed, the new locale will not be
> used since a convertor has already been created using the locale set
> in the environment variables.  As long as you make sure to always
> put the --locale argument _first_ on the command line, everything
> should be hunky dory though.
> 
> For a proper fix of the situation, two possibilities exist.  Either
> provide a mechanism for invalidating the cached convertor (simple),
> or make sure that --locale is parsed first (more work).

Conceptually, I much prefer the second option.  It seems cleaner to
me.  

Up till now, svn commandline switches have always been global;  order
has never mattered.  I would hate to have the placement of --locale
change behaviors;  I'd like it to be "global" to all target paths on
the commandline.
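
One way to get that order-independence could be a pre-scan of the
argument list that pulls out --locale before anything else is looked at.
This is only a sketch in Python for illustration; extract_locale is a
hypothetical name, and the real option parsing in the svn client is C
code that may be structured quite differently:

```python
def extract_locale(argv):
    """Return (locale name or None, remaining args), wherever --locale sits.

    Hypothetical pre-scan: accepts both "--locale NAME" and "--locale=NAME".
    """
    locale_name = None
    rest = []
    args = iter(argv)
    for arg in args:
        if arg == "--locale":
            # The value is the next argument, if there is one.
            locale_name = next(args, None)
        elif arg.startswith("--locale="):
            locale_name = arg.split("=", 1)[1]
        else:
            rest.append(arg)
    return locale_name, rest


if __name__ == "__main__":
    print(extract_locale(["log", "--locale", "fi_FI", "-rHEAD"]))
    print(extract_locale(["--locale=FOO", "status"]))
```

The caller would set the locale from the first element before converting
any of the remaining arguments, so the position of --locale on the
command line stops mattering.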



Re: utf-8 sanity check.

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ben Collins-Sussman <su...@collab.net> writes:

> I'm getting ready to write some python tests that verify that we can
> deal with paths that have international characters in them.  
> 
> But before I do that, I want to make sure I understand what's going
> on in our code:
> 
>   * our application's main() calls setlocale(LC_ALL, locale) if
>     --locale is passed by the user.  This officially sets the locale for
>     our process.
> 
>   * our utf.c routines set up a xlation table by calling
>     apr_xlate_open() with two arguments: "UTF-8" and APR_LOCALE_CHARSET.
> 
>   * The latter argument causes apr_xlate_open to call nl_langinfo(CODESET).  
> 
>   * nl_langinfo(), part of libc, then returns the charset defined by
>     the program's locale.  (according to my man page, at least.)
> 
> So by this trace, it seems to me that we're all ready to go, then.
> There's no need to cache the --locale argument and somehow pass it
> down into our svn_utf_* routines.
> 
> Am I correct?

Yup.  (Unless Karl has done anything strange, I haven't reviewed the
actual checkins yet.)  However, there is a slight problem with
--locale.  The handle returned by apr_xlate_open is cached globally,
without any way to expire it.  This means that if any UTF conversion
takes place _before_ the --locale argument is passed, the new locale
will not be used since a convertor has already been created using the
locale set in the environment variables.  As long as you make sure to
always put the --locale argument _first_ on the command line,
everything should be hunky dory though.

For a proper fix of the situation, two possibilities exist.  Either
provide a mechanism for invalidating the cached convertor (simple), or
make sure that --locale is parsed first (more work).  Which solution
is "correct" depends on what semantics we want for --locale.  In

  env LC_CTYPE=en_GB.ISO8859-1 svn --option1=¤ \
         --locale=en_GB.ISO8859-15 --option2=¤ blah blah

should the value of --option1 be interpreted according to latin-1 or
latin-9?  What about --option2?
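
To make the caching problem concrete, here is a small Python model of it.
It is a sketch only: the module-level names are hypothetical stand-ins
(_process_charset for what nl_langinfo(CODESET) would report,
_cached_decoder for the globally cached apr_xlate_open handle), and it
does not call APR or change the real process locale. Byte 0xA4 is the
currency sign in latin-1 and the euro sign in latin-9, so it shows which
charset was actually used:

```python
import codecs

# Stand-in for the charset of the process locale.
_process_charset = "iso8859-1"

# Created on first use and then reused, like the cached xlate handle.
_cached_decoder = None


def set_process_charset(name):
    """Hypothetical analogue of the setlocale() call made for --locale."""
    global _process_charset
    _process_charset = name


def invalidate_cache():
    """The 'simple' fix: forget the cached converter."""
    global _cached_decoder
    _cached_decoder = None


def to_utf8(native_bytes):
    """Convert bytes in the locale charset to UTF-8 via the cached decoder."""
    global _cached_decoder
    if _cached_decoder is None:
        _cached_decoder = codecs.getdecoder(_process_charset)
    text, _consumed = _cached_decoder(native_bytes)
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"\xa4"))            # converter created under latin-1
    set_process_charset("iso8859-15")
    print(to_utf8(b"\xa4"))            # stale converter is still used
    invalidate_cache()
    print(to_utf8(b"\xa4"))            # converter rebuilt under latin-9
```

In this model, any conversion done before the charset change pins the
old converter until the cache is invalidated, which is the behaviour
described above for options that precede --locale.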


  // Marcus


