Posted to dev@subversion.apache.org by Marcus Comstedt <ma...@mc.pp.se> on 2002/05/30 15:57:20 UTC

[PATCH] UTF-8 third round...


[ Hmm.  The Mad MIMEr @ subversion.tigris.org ate my first mail,
  removing all parts and giving up because nothing was left.
  Trying again... ]


Ok, as promised, here is a new update of the UTF-8 patch.  The
following things are known still not to be fixed:

* Config file reading.  config_file.c uses stdio rather than APR, so I
  skipped over this file for the moment.  UTF conversion of the paths
  can probably be omitted (with appropriate comments to that effect)
  as they don't seem to be shared with the rest of the code, but
  maybe the contents of the config file itself need recoding?
  What needs to be done in config_win.c will have to be up to somebody
  in the Windows camp to decide.

* svnadmin/svnlook

* Server side?  I don't know if mod_dav_svn needs to do anything
  special here or whether Apache will take care of everything.
  Repository URLs are passed as UTF-8 to Neon.  I have not audited
  what happens to them beyond that.

Things that need further consideration (although that can probably be
deferred for now):

* Property values.  Currently, the command line client will recode all
  property values to UTF-8.  This is needed for properties like
  svn:ignore, but not desirable for real binary properties.  Some
  mechanism for distinguishing between text properties and binary
  properties will be needed.

* Keyword expansion.  What should happen if a keyword expands to some
  non-ASCII string?  If the charset of the file is known, then
  probably the string should be recoded to that charset.  If not,
  should the locale-specific charset be used?  Or maybe even refuse
  expansion with a warning?
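
A minimal sketch of the text/binary distinction raised under "Property
values" above, assuming a hypothetical allow-list of properties known to
hold text.  TEXT_PROPS and recode_propval are illustrative names, not
Subversion API; svn:ignore and svn:externals are the only properties
this thread names as needing UTF-8.

```python
# Hypothetical allow-list: properties whose values are known to be text.
TEXT_PROPS = {"svn:ignore", "svn:externals"}


def recode_propval(name: str, value: bytes, locale_charset: str) -> bytes:
    """Recode text properties to UTF-8; pass everything else through untouched."""
    if name in TEXT_PROPS:
        return value.decode(locale_charset).encode("utf-8")
    # Unknown or binary property: keep the exact bytes.
    return value
```

With this shape, a binary property never reaches the converter, so it
cannot be damaged by a wrong charset guess.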


Implementation note:  I'm now using apr_xlate instead of calling iconv
directly.  This means that it's now up to APR to worry about
portability.  Unfortunately, apr_xlate seems to make several of the
mistakes that Ulrich cautioned about.  In particular, it ignores
return codes greater than 0 from iconv.  It should probably return
some other apr_status_t than APR_SUCCESS in this case, to indicate an
inexact translation.  Of course, we'll also need to decide in which
cases an inexact translation is acceptable.  For a pathname,
probably not.  For an error message, probably.  But what about a
property diff, for example?
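
The idea of reporting an inexact translation separately from success can
be sketched as follows.  This is Python rather than APR, and the names
from_utf8, recode_for_output and INEXACT_OK are assumptions made up for
the illustration; Python's "replace" error handler stands in for an
iconv that substitutes characters it cannot represent.

```python
from typing import Tuple

# Kinds of data for which an inexact result is tolerated.  Assumption
# drawn from the paragraph above: error messages probably yes, pathnames
# probably no.  "property diff" is left out because the text leaves it open.
INEXACT_OK = {"error_message"}


def from_utf8(utf8_data: bytes, charset: str) -> Tuple[bytes, bool]:
    """Return (recoded bytes, exact); exact is False if anything was substituted."""
    text = utf8_data.decode("utf-8")
    try:
        return text.encode(charset), True
    except UnicodeEncodeError:
        return text.encode(charset, errors="replace"), False


def recode_for_output(utf8_data: bytes, charset: str, kind: str) -> bytes:
    """Recode for output, refusing inexact results where they are not acceptable."""
    out, exact = from_utf8(utf8_data, charset)
    if not exact and kind not in INEXACT_OK:
        raise ValueError(f"inexact translation of {kind} to {charset}")
    return out
```

The caller, not the converter, decides whether an inexact result is
good enough, which is why the flag is returned rather than swallowed.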

I divided the patch into two parts this time, one for the libraries
and one for the command line client.  The client patch depends on the
library patch, but not the other way around.  There is also a small
patch to APR that should probably be sent somewhere else.  It fixes
the case where a converter is requested that converts from a charset
to that very same charset.  Previously apr_xlate_open would just pass
the request on to iconv_open, which at least on Solaris may refuse
such a conversion.  Now it will instead simply make an internal
identity translator if topage and frompage are the same.
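
The identity shortcut can be pictured like this.  It is a Python sketch
of the idea only, not the APR patch: open_translator is a hypothetical
name, and normalizing charset aliases through codecs.lookup is an
assumption that goes beyond a plain string comparison.

```python
import codecs
from typing import Callable


def open_translator(topage: str, frompage: str) -> Callable[[bytes], bytes]:
    """Return a bytes-to-bytes converter from frompage to topage."""
    # codecs.lookup() maps aliases such as "utf8" and "UTF-8" to one name.
    if codecs.lookup(topage).name == codecs.lookup(frompage).name:
        # Same charset on both sides: hand the data back unchanged and
        # never ask the underlying converter for a same-to-same conversion.
        return lambda data: data
    return lambda data: data.decode(frompage).encode(topage)
```

Note that the identity path does not validate its input, so malformed
data passes through as-is.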


  // Marcus

Re: Call For Votes: converting log messages to UTF-8

Posted by Ben Collins-Sussman <su...@collab.net>.

Branko Čibej <br...@xbc.nu> writes:

> Well for gods' sakes, every post of mine has
> 
> Content-Type: text/plain; charset=UTF-8
> 
> in it; what sort of useless MUA do you use, anyway? :-)

We both use Emacs 21.1, with Gnus 5.9.  I must have some nice X fonts
installed or something, because I can see Japanese and Chinese spam in
full detail, and Branko's family name definitely starts with a
character that looks like a "C" with a little "v" thing above it.  :-)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Call For Votes: converting log messages to UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Karl Fogel wrote:

> Greg Stein <gs...@lyra.org> writes:
> > But I've always been talking about more than just the log message. Things
> > like the author, date strings, property names, etc. While author will
> > generally be simple US-ASCII, I'd prefer to state that it is UTF-8.
>
> +1 on that.  I *still* don't know how the first letter of Branko's
> family name really looks, because my editor hasn't yet rendered it
> right.

Well for gods' sakes, every post of mine has

Content-Type: text/plain; charset=UTF-8

in it; what sort of useless MUA do you use, anyway? :-)


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> But I've always been talking about more than just the log message. Things
> like the author, date strings, property names, etc. While author will
> generally be simple US-ASCII, I'd prefer to state that it is UTF-8.

+1 on that.  I *still* don't know how the first letter of Branko's
family name really looks, because my editor hasn't yet rendered it
right.


Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 01:55:33PM -0500, Karl Fogel wrote:
> Colin Putney <cp...@whistler.net> writes:
>...
> > 2) Decree that log messages must be text, and store the metadata
> > specifying the character set. Have the clients pass the character set
> > to the core libraries and have the libraries return the character set
> > along with the log messages at retrieval time.
>...
> It means passing another parameter along with log_msg itself, but
> that's no big deal.

It is a lot more work than just "one parameter". Each time we find another
"text" item, we're going to have to pass the character set. Every interface
the item passes through will also need to pass the charset. The server will
now have two properties on the revision (svn:log and svn:log-charset).

In the public interface, I count two functions that take a log message
directly, and six functions that take svn_client_get_commit_log_t which has
a log message in it. Within the .c code, I found 53 instances of 'log_msg'.
All of those will need double-params.

Note that some other properties' values will also need to be in UTF-8:
'ignore' and 'externals' (since they store paths, which we already define as
required to be in UTF-8).

But I've always been talking about more than just the log message. Things
like the author, date strings, property names, etc. While author will
generally be simple US-ASCII, I'd prefer to state that it is UTF-8.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Call For Votes: converting log messages to UTF-8

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Fri, May 31, 2002 at 01:55:33PM -0500, Karl Fogel wrote:
> Colin Putney <cp...@whistler.net> writes:
> > 1) Do as Marcus and gstein propose and decree that log messages will be
> > stored as UTF-8 in the repository and do the necessary conversion on
> > input and output as a crutch for those without Unicode-capable tools
> > 
> > 2) Decree that log messages must be text, and store the metadata
> > specifying the character set. Have the clients pass the character set
> > to the core libraries and have the libraries return the character set
> > along with the log messages at retrieval time.
> 
> I like (2), but it doesn't even have to decree that they be "text".
> If we're storing another property saying what the charset is (or what
> Subversion's best guess is, anyway), then we just store the exact
> sequence of bits the user specified for the log message, along with
> metadata saying how to interpret that sequence of bits.
> 
> This wins all around, because:
> 
>    - Clients receiving log msgs from the repository don't have to
>      guess what encoding to use.  The repository tells them.
> 
>    - The original data is still there, in case svn guessed wrong at
>      input time.
> 
> It means passing another parameter along with log_msg itself, but
> that's no big deal.
> 
> +1 on this, for what it's worth.

+1.

as this discussion goes on, i'm becoming convinced that is the only
correct solution.

-garrett 

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: Call For Votes: converting log messages to UTF-8

Posted by Colin Putney <cp...@whistler.net>.

On Friday, May 31, 2002, at 11:55  AM, Karl Fogel wrote:

> Colin Putney <cp...@whistler.net> writes:
>> 2) Decree that log messages must be text, and store the metadata
>> specifying the character set. Have the clients pass the character set
>> to the core libraries and have the libraries return the character set
>> along with the log messages at retrieval time.
>
> I like (2), but it doesn't even have to decree that they be "text".
> If we're storing another property saying what the charset is (or what
> Subversion's best guess is, anyway), then we just store the exact
> sequence of bits the user specified for the log message, along with
> metadata saying how to interpret that sequence of bits.

I'd call this option 3. The difference is small in terms of 
implementation details, but large semantically. If you just store bits 
you'd have to store a mime-type as well as an encoding, and (correct) 
clients have to be prepared to deal with non-text log messages. So the 
trade-off between (2) and (3) is greater flexibility for log messages
(Word or RTF documents might be useful for example) vs higher barrier to 
entry for client implementors.

Colin Putney
Whistler.com


Re: Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Colin Putney <cp...@whistler.net> writes:
> 1) Do as Marcus and gstein propose and decree that log messages will be
> stored as UTF-8 in the repository and do the necessary conversion on
> input and output as a crutch for those without Unicode-capable tools
> 
> 2) Decree that log messages must be text, and store the metadata
> specifying the character set. Have the clients pass the character set
> to the core libraries and have the libraries return the character set
> along with the log messages at retrieval time.

I like (2), but it doesn't even have to decree that they be "text".
If we're storing another property saying what the charset is (or what
Subversion's best guess is, anyway), then we just store the exact
sequence of bits the user specified for the log message, along with
metadata saying how to interpret that sequence of bits.

This wins all around, because:

   - Clients receiving log msgs from the repository don't have to
     guess what encoding to use.  The repository tells them.

   - The original data is still there, in case svn guessed wrong at
     input time.

It means passing another parameter along with log_msg itself, but
that's no big deal.

+1 on this, for what it's worth.
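
A rough sketch of what "the exact sequence of bits plus metadata" could
look like to a client.  LogMessage, display and reinterpret are
hypothetical names invented for the illustration; on the server this
would presumably map to a pair of revision properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogMessage:
    raw: bytes      # exact bytes the user supplied, never rewritten
    charset: str    # best guess made at input time, stored as metadata

    def display(self) -> str:
        """Decode for display using the recorded charset."""
        return self.raw.decode(self.charset, errors="replace")

    def reinterpret(self, charset: str) -> "LogMessage":
        """Correct a wrong guess later; the original bytes are untouched."""
        return LogMessage(self.raw, charset)
```

Because only the metadata changes when a guess is corrected, a wrong
guess at input time costs a garbled display, not the data.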

> I think we're still thrashing through the issue, so a vote is premature.

Agreed.


Re: Call For Votes: converting log messages to UTF-8

Posted by Colin Putney <cp...@whistler.net>.

> if we're going to have this be changeable with a client side config option,
> then i say we might as well not do any re-encoding at all.  having
> utf-8 as the charset of the log messages is only helpful (in my
> opinion) if you can ALWAYS count on it being in that charset.

Agreed, up to a point. If converting log messages to UTF-8 is optional, 
then you still need out-of-band information to display the log message 
correctly. This might be supplied by the user via a --charset= switch, a 
default in the config file, or whatever. So the user must specify a 
character set, either by experimenting until she finds the right one or 
by knowing the convention.

However, providing the capability to convert encodings does make it 
easier for users to establish a policy of universal UTF-8. This is 
important.

It's important because it's a good way to support multilingual 
development. In the past this wasn't much of a requirement, simply 
because it just didn't happen very much. But the Internet is becoming 
pervasive enough that there are projects in which development is going on in
several different languages at once. Ruby is a good example, with 
development going on in Japanese and English, with bilingual developers 
coordinating between the two groups. Even if development happens in one 
language, there are likely to be multilingual documentation and i18n 
efforts in large projects. I think multilingual development will become 
more and more common as time goes on.

The project goals on the Subversion home page are fairly narrow and
technical, but I think there's a broader philosophical goal implied by 
the project itself: to promote collaboration and cooperation between far 
flung individuals. Not requiring those individuals to cooperate using a 
uniform language furthers that goal.

So, given robust support for multilingual development as a goal (what do 
others think about this?) I can see two strategies:

1) Do as Marcus and gstein propose and decree that log messages will be
stored as UTF-8 in the repository and do the necessary conversion on
input and output as a crutch for those without Unicode-capable tools

2) Decree that log messages must be text, and store the metadata
specifying the character set. Have the clients pass the character set
to the core libraries and have the libraries return the character set 
along with the log messages at retrieval time.

I think we're still thrashing through the issue, so a vote is premature.


Colin Putney
Whistler.com



Re: Call For Votes: converting log messages to UTF-8

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Fri, May 31, 2002 at 10:06:52AM -0500, Karl Fogel wrote:
> It seems to me that everyone's pretty much stated their reasons for
> and against now.  We're no longer adding new material to the
> discussion, we're just reiterating points already made.
> 
> So, I'd like to propose a vote.
> 
> I hope we all agree that we're just choosing a default behavior for
> the client here -- users can get the alternate behavior by setting or
> unsetting a config option in ~/.subversion/options.  I.e., we should
> offer conversion to UTF-8 for those who want it, and should not
> unconditionally *force* conversion to UTF-8 for those who know they
> don't want it.  The only question is how we behave out-of-the-box.
> 
> (If this is controversial, I guess we're not ready to vote yet.)
> 
> The two choices are
> 
>    [ ] By default, recode log messages from user input to UTF-8, using
>        the locale to get a best guess for the original encoding of the
>        user input.
> 
>    [ ] By default, do no re-encoding of log messages.  Store exactly
>        the byte sequence the user enters.  When printing log messages,
>        the svn client would simply assume that the byte '\n' is a line
>        end (it prints out the number of lines in each message as part
>        of the msg header).  When printing out the log message as xml,
>        we'd do our best to escape bytes that are incompatible with
>        being xml content; this probably implies treating the message
>        as Latin-1 or something, but I haven't thought carefully about
>        that.

if we're going to have this be changeable with a client side config option, 
then i say we might as well not do any re-encoding at all.  having
utf-8 as the charset of the log messages is only helpful (in my
opinion) if you can ALWAYS count on it being in that charset.

so i think i'm leaning towards not re-encoding, with the provision
that we include some means of indicating the character set (and
possibly mime type if we're talking about allowing binary data as a
log entry).  without this information, i can't see how you can robustly
display log entries from the client.  the fact that cvs allows
whatever the hell you want in your log and doesn't have a means to
figure out what it is just proves that cvs sucks in this particular
area, which makes it tough to write a good client for it.

-garrett

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: Call For Votes: converting log messages to UTF-8

Posted by Ben Collins-Sussman <su...@collab.net>.

Jon Trowbridge <tr...@ximian.com> writes:

> On Fri, 2002-05-31 at 10:06, Karl Fogel wrote:
> > I hope we all agree that we're just choosing a default behavior for
> > the client here -- users can get the alternate behavior by setting or
> > unsetting a config option in ~/.subversion/options.  I.e., we should
> > offer conversion to UTF-8 for those who want it, and should not
> > unconditionally *force* conversion to UTF-8 for those who know they
> > don't want it.  The only question is how we behave out-of-the-box.
> 
> Just so that I understand: does this proposal imply a policy that all
> log messages are stored in UTF-8?  Or does a +1 here imply support for
> storing log messages in an unknown and unknowable charset?

We're going to treat the log message as binary data (counted
svn_string_t) no matter what; as I understand it, the issue is how the
'svn' commandline client (and other clients) behave out-of-the-box:
do they UTF-8 encode your log messages by default, or not?  This
behavior should be toggle-able.



Re: Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Jon Trowbridge <tr...@ximian.com> writes:
> > I hope we all agree that we're just choosing a default behavior for
> > the client here -- users can get the alternate behavior by setting or
> > unsetting a config option in ~/.subversion/options.  I.e., we should
> > offer conversion to UTF-8 for those who want it, and should not
> > unconditionally *force* conversion to UTF-8 for those who know they
> > don't want it.  The only question is how we behave out-of-the-box.
> 
> Just so that I understand: does this proposal imply a policy that all
> log messages are stored in UTF-8?  Or does a +1 here imply support for
> storing log messages in an unknown and unknowable charset?

There are two proposals here.  One of them converts the log message to
UTF-8 at input time, the other doesn't (by default).

The repository is always storing a string of bytes.  That string of
bytes might happen to be a UTF-8 encoding of something, though.

I don't know which proposal you're thinking of +1'ing :-).

-K


Re: Call For Votes: converting log messages to UTF-8

Posted by Jon Trowbridge <tr...@ximian.com>.

On Fri, 2002-05-31 at 10:06, Karl Fogel wrote:
> I hope we all agree that we're just choosing a default behavior for
> the client here -- users can get the alternate behavior by setting or
> unsetting a config option in ~/.subversion/options.  I.e., we should
> offer conversion to UTF-8 for those who want it, and should not
> unconditionally *force* conversion to UTF-8 for those who know they
> don't want it.  The only question is how we behave out-of-the-box.

Just so that I understand: does this proposal imply a policy that all
log messages are stored in UTF-8?  Or does a +1 here imply support for
storing log messages in an unknown and unknowable charset?

-JT


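Re: Call For Votes: converting log messages to UTF-8

Posted by Ben Collins-Sussman <su...@collab.net>.

[...]
>        being xml content; this probably implies treating the message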
>        as Latin-1 or something, but I haven't thought carefully about
>        that.

The two behaviors, in my mind, boil down to a matter of choosing a
risk:

  1.  do we risk munging userdata at *input* time, by attempting to
      guess at a charset to convert to UTF-8?

      OR

  2.  do we risk munging userdata at *output* time, i.e. not knowing
      how to display the logmsg properly, because we don't know its
      charset?

In my mind, risk #1 is much more dangerous.  If the logmsg is
accidentally corrupted at input-time, it's gone forever.  This is much
worse than possibly seeing a garbled display in some GUI textbox --
that problem is fixable by heuristics (or project policy).

We already have this scenario going on in our code -- the
svn:eol-style property.  By default, we've chosen *not* to start
munging userdata until the user activates this property.  That's a
sensible default.

Therefore, I agree with Karl (the 2nd checkbox), because it's less
risky.  Given that we support both behaviors via some kind of
~/.subversion/ config option, I think the sensible default is not to
munge data at input time.  If users want to flip a switch and force
all log messages into UTF-8, that's totally fine.  But I think that's a
decision that the *user* must make, not one for our client-app to make
right out of the box.


Re: Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> 2) if you convert FOO characters, thinking they were BAR, then it will
>    certainly be "funky", but you still won't have data loss -- convert back
>    as if you had BAR.
> 
> So. Option 1 is riskless in terms of data loss.

Huh?  I don't think this is true.

The transformation can be lossy.  For example, suppose you write your
log message in stateless encoding FOO (it may be fixed-width or not,
but it's not stateful).  But Subversion mistakenly deduces from your
locale that it's in *stateful* encoding BAR.  When it converts to
UTF-8, the (alleged) escape sequences of what svn took to be BAR will
be lost.  You cannot get the original string back now.
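
A small illustration of that failure, assuming ISO-2022-JP plays the
role of the stateful encoding BAR and the log message really was plain
single-byte text.  round_trip_via_utf8 is a hypothetical helper, and the
byte string is contrived so that it happens to contain a valid escape
sequence.

```python
def round_trip_via_utf8(data: bytes, guessed_charset: str) -> bytes:
    """Convert to UTF-8 under a guessed charset, then convert back again."""
    stored = data.decode(guessed_charset).encode("utf-8")
    return stored.decode("utf-8").encode(guessed_charset)


# Bytes written in a stateless single-byte encoding.  They happen to
# contain ESC ( B, which ISO-2022-JP reads as "switch to ASCII".
original = b"abc\x1b(Bdef"

# The decoder consumes the escape sequence as state, not as data, so it
# never reaches the UTF-8 text and cannot be regenerated on the way back.
recovered = round_trip_via_utf8(original, "iso2022_jp")
print(recovered == original)
```

The three escape bytes are gone after the round trip, and nothing in the
stored UTF-8 records that they were ever there.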

Greg Hudson was saying something similar in his mail about JIS, I
believe.


Re: Call For Votes: converting log messages to UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Greg Hudson <gh...@MIT.EDU> writes:

> On Fri, 2002-05-31 at 13:38, Greg Stein wrote:
> > Converting from charset FOO to UTF-8 is a specific translation. No data
> > loss. Converting from UTF-8 back to FOO is a perfect restoration.
> 
> Hm, is this always true?
> 
> For instance, a Shift-JIS document could have redundant shift octets. 

You're thinking about ISO-2022.  Shift-JIS is stateless; the name
"shift" comes from the fact that the character codes are shifted round
a bit compared to their codepoints in the original JIS standards, not
from it using shift modes.

For ISO-2022 though, you might end up with another octet sequence than
what you originally had.  Unless what you originally had was
canonicalized in some manner, it is in fact quite likely that you
will.  ISO-2022 is rather messy in that the same sequence of
characters can be represented in numerous ways.
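
To make the difference concrete, here is a short sketch using Python's
codecs as a stand-in for iconv.  The sample character and the redundant
leading escape are choices made for the illustration.

```python
text = "\u3042"  # HIRAGANA LETTER A

# Shift-JIS is stateless: decoding and re-encoding gives the same octets.
sjis = text.encode("shift_jis")
sjis_again = sjis.decode("shift_jis").encode("shift_jis")

# ISO-2022-JP: the encoder emits one particular form, but prefixing a
# redundant "switch to ASCII" escape (ESC ( B) yields a different octet
# sequence for the very same characters.
canonical = text.encode("iso2022_jp")
variant = b"\x1b(B" + canonical

same_text = variant.decode("iso2022_jp") == canonical.decode("iso2022_jp")
variant_again = variant.decode("iso2022_jp").encode("iso2022_jp")
print(sjis_again == sjis, same_text, variant_again == variant)
```

Both ISO-2022-JP sequences decode to the same text, but only the
encoder's own form survives a round trip; the variant comes back as the
canonical octets.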

  // Marcus


Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 01:46:41PM -0400, Greg Hudson wrote:
> On Fri, 2002-05-31 at 13:38, Greg Stein wrote:
> > Converting from charset FOO to UTF-8 is a specific translation. No data
> > loss. Converting from UTF-8 back to FOO is a perfect restoration.
> 
> Hm, is this always true?
> 
> For instance, a Shift-JIS document could have redundant shift octets. 
> (Is that invalid?  If so, does an iconv() from Shift-JIS to UTF-8
> actually enforce that?)  Converting such a document to UTF-8 and back is
> presumably not an identity transformation on the octets.  If the source
> document was not actually in Shift-JIS but was in some other character
> set, you could lose data.

Heh. Evil :-)  Yup. Sure sounds like if the user had the wrong locale,
they could lose something. They'd need to go back and correct the log
message, then.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2002-05-31 at 13:38, Greg Stein wrote:
> Converting from charset FOO to UTF-8 is a specific translation. No data
> loss. Converting from UTF-8 back to FOO is a perfect restoration.

Hm, is this always true?

For instance, a Shift-JIS document could have redundant shift octets. 
(Is that invalid?  If so, does an iconv() from Shift-JIS to UTF-8
actually enforce that?)  Converting such a document to UTF-8 and back is
presumably not an identity transformation on the octets.  If the source
document was not actually in Shift-JIS but was in some other character
set, you could lose data.


Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 10:55:24AM -0500, Ben Collins-Sussman wrote:
>...
> The two behaviors, in my mind, boil down to a matter of choosing a
> risk:
> 
>   1.  do we risk munging userdata at *input* time, by attempting to
>       guess at a charset to convert to UTF-8?

You're wrong Ben. And this shows part of the problem of this whole
conversation. "oh no... wah wah... it is going to corrupt my data."

Bunk.

Converting from charset FOO to UTF-8 is a specific translation. No data
loss. Converting from UTF-8 back to FOO is a perfect restoration. Two other
situations:

1) if you convert back to BAR, then yes: it won't appear properly.

2) if you convert FOO characters, thinking they were BAR, then it will
   certainly be "funky", but you still won't have data loss -- convert back
   as if you had BAR.

So. Option 1 is riskless in terms of data loss.

[ per (2) you could end up with incorrect unicode characters, but you can
  get the original back and reencode properly ]
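
A sketch of case (2) in a situation where the recovery does work: KOI8-R
bytes (FOO) mistaken for Latin-1 (BAR).  The choice of charsets is an
assumption made for the example.  It relies on Latin-1 assigning a
character to every possible byte, which is not true of every encoding.

```python
message = "\u043f\u0440\u0438\u0432\u0435\u0442"   # Russian "privet"
original = message.encode("koi8-r")                # what the user typed (FOO)

# Wrong guess at input time: treat the bytes as Latin-1 (BAR).
stored = original.decode("latin-1").encode("utf-8")
funky = stored.decode("utf-8")                     # wrong characters, but no loss

# Convert back as if it had been BAR, then decode with the right charset.
recovered = stored.decode("utf-8").encode("latin-1")
print(recovered == original, recovered.decode("koi8-r") == message)
```

The stored text is wrong Unicode, yet every original byte is still
represented, so the round trip through BAR restores the input exactly.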

>       OR
> 
>   2.  do we risk munging userdata at *output* time, i.e. not knowing
>       how to display the logmsg properly, because we don't know its
>       charset?

And here is your risk.

Jon Trowbridge, who is doing GNOME work, and is familiar with the situation
has said it several times: you'll have an unknown and unknowable charset.
Not a great situation.

>...
> risky.  Given that we support both behaviors via some kind of
> ~/.subversion/ config option, I think the sensible default is not to
> munge data at input time.  If users want to flip a switch and force
> all log messages into UTF-8, that's totally fine.  But I think that's a
> decision that the *user* must make, not one for our client-app to make
> right out of the box.

As Garrett points out, if tools cannot *know* the log message is UTF-8, then
the whole option and the encoding and everything is bogus. Either you
enforce UTF-8 or you give up the whole ball of wax.

And if you give it up, you fall into the #2 risk category.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Call For Votes: converting log messages to UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> I hesitate to even mention it again, especially as it doesn't support
> the side I'm favoring here :-), but UTF-8 is not always reversible.
> 
> As I wrote earlier:
> 
> > For example, suppose you write your log message in stateless
> > encoding FOO (it may be fixed-width or not, but it's not stateful).
> > But Subversion mistakenly deduces from your locale that it's in
> > *stateful* encoding BAR.  When it converts to UTF-8, the (alleged)
> > escape sequences of what svn took to be BAR will be lost.  You
> > cannot get the original string back now.
> 
> In practice, this probably wouldn't happen often, and more
> importantly, I think people will rarely if ever be in the position of
> actually having to recover an original bit-string from UTF-8 anyway.
> But we just can't promise that it's always recoverable.

It should be extremely uncommon.  I can't imagine how a system locale
using a stateful character encoding would actually work.  Just
making filenames work in the filesystem would be more work for the OS
implementor than it could reasonably be worth.


  // Marcus




Re: Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Hontvari Jozsef <ho...@solware.com> writes:
> If somebody worries about the possible data loss: you can recover the
> original data from all three systems. Email and binary stores the original
> byte stream, UTF-8 is reversible.

I hesitate to even mention it again, especially as it doesn't support
the side I'm favoring here :-), but UTF-8 is not always reversible.

As I wrote earlier:

> For example, suppose you write your log message in stateless
> encoding FOO (it may be fixed-width or not, but it's not stateful).
> But Subversion mistakenly deduces from your locale that it's in
> *stateful* encoding BAR.  When it converts to UTF-8, the (alleged)
> escape sequences of what svn took to be BAR will be lost.  You
> cannot get the original string back now.

In practice, this probably wouldn't happen often, and more
importantly, I think people will rarely if ever be in the position of
actually having to recover an original bit-string from UTF-8 anyway.
But we just can't promise that it's always recoverable.
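
A minimal Python sketch of that failure mode, assuming ISO-2022-JP stands
in for the stateful encoding BAR and plain ASCII for FOO (both choices are
illustrative, not taken from the thread).  The author means the bytes
ESC ( B literally, but the ISO-2022-JP decoder consumes them as a
"designate ASCII" escape sequence, so they never reach the UTF-8 form:

```python
# Bytes as the author wrote them in a stateless encoding (ASCII here):
# a literal ESC character followed by "(Bhello".
original = b"\x1b(Bhello"

# Hypothetically, the client guesses the stateful encoding ISO-2022-JP.
# ESC ( B is the "designate ASCII" escape, so the decoder swallows it.
text = original.decode("iso2022_jp")
stored = text.encode("utf-8")          # what would be kept as UTF-8

# Converting back cannot restore the swallowed escape sequence.
recovered = stored.decode("utf-8").encode("iso2022_jp")

print(repr(text))                      # the escape sequence is gone
print(recovered == original)           # the round trip is lossy
```

Nothing here is specific to Python's codecs; any converter that treats
escape sequences as state changes rather than as characters would likely
behave the same way.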


Re: Call For Votes: converting log messages to UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Branko Čibej <br...@xbc.nu> writes:

> >How (when) will the --locale option affect the locale setting?  If I
> >do
> >
> >  svn -m 'log message' --locale zh_TW.BIG5
> >
> >should this be interpreted differently from
> >
> >  svn --locale zh_TW.BIG5 -m 'log message'
> >
> >?  Environment variables are simpler because they are always set
> >before the program execution starts.
> >
> 
> The order of options never matters. --locale is interpreted in main(),
> along with the other options, and before any real work is done.

I know where --locale is interpreted.  But I also know that the -m
would get interpreted before it in the first example, and that it
might have made sense to do any conversion there.  But if that is
qualified as "real work" and such is not allowed to be done during
option parsing, then I guess I'll just have to rewrite my UTF-8 patch
a little.  (Not because of the log message (which is translated later
anyway), but things like username and password which I translate
directly during the option parsing pass.)

Btw, is creating an error message "real work"?  That can happen during
option parsing, requires an UTF-8 conversion, and would be very
awkward to have to defer.

  // Marcus


Re: Call For Votes: converting log messages to UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Marcus Comstedt wrote:

>Branko Čibej <br...@xbc.nu> writes:
>
>  
>
>>>Nope. We don't have to do anything. The user can always do:
>>>
>>>$ LC_CTYPE=some.locale svn commit -m "log message"
>>>
>>>That sets the LC_CTYPE environment variable for just the one execution of the
>>>'svn' program.
>>>
>>>      
>>>
>>Please, let's just for once hear some design discussion that doesn't
>>assume all the world is Unix. What you propose only works in
>>Bourne-like shells (even the csh incantation is different).
>>    
>>
>
>Not that it really addresses your complaint, but there is a neat
>command called `env' which I often use because it allows you to run
>commands with local environment variable settings regardless of
>shell.  You just run it like this
>
>  env LC_CTYPE=some.locale svn commit -m "log message"
>
>i.e. just like the Bourne shell syntax but with the command name "env"
>first.  Since the syntax is parsed by env, this works fine in (t)csh
>as well.  It probably even works on Windows (I'm assuming that the
>command env is included in cygwin).
>
>Just had to point this out, since this incredibly useful command seems
>to be little known.
>
Yes, I know about 'env'. My point is that this solution won't work on 
non-Unix systems.

>>We already have a --locale option. It was introduced so that
>>front-ends that wrap the command line client (and there will be such;
>>Emacs vc-mode comes to mind) will get predictable output in the
>>presence of localized messages. We might as well reuse that, or
>>introduce a similar option (--input-locale?) for log messages, file
>>names, prop names, etc. I'd support that, FWIW.
>>    
>>
>
>How (when) will the --locale option affect the locale setting?  If I
>do
>
>  svn -m 'log message' --locale zh_TW.BIG5
>
>should this be interpreted differently from
>
>  svn --locale zh_TW.BIG5 -m 'log message'
>
>?  Environment variables are simpler because they are always set
>before the program execution starts.
>  
>
The order of options never matters. --locale is interpreted in main(), 
along with the other options, and before any real work is done.
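
A small Python sketch of that two-phase approach, with argparse standing in
for svn's real option parser.  The names parse_options, codeset_of and run
are hypothetical, and taking the codeset from the part of the locale name
after the dot is an assumption made for illustration.  Phase one only
collects option values; anything that depends on the locale happens in
phase two, so the order on the command line cannot matter:

```python
import argparse

def parse_options(argv):
    """Phase 1: collect option values without acting on any of them."""
    parser = argparse.ArgumentParser(prog="svn-sketch")
    parser.add_argument("-m", dest="message")
    parser.add_argument("--locale", default=None)
    return parser.parse_args(argv)

def codeset_of(locale_name, default="UTF-8"):
    """Extract the codeset from a name such as 'zh_TW.BIG5'."""
    if locale_name and "." in locale_name:
        return locale_name.split(".", 1)[1]
    return default

def run(argv):
    opts = parse_options(argv)
    # Phase 2: the locale is known now, whatever position it had in argv.
    return opts.message, codeset_of(opts.locale)

first = run(["-m", "log message", "--locale", "zh_TW.BIG5"])
second = run(["--locale", "zh_TW.BIG5", "-m", "log message"])
print(first == second)
```

The cost of this design is that any conversion done while parsing, such as
for a username or password, has to move into the second phase.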

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: Call For Votes: converting log messages to UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Branko Čibej <br...@xbc.nu> writes:

> >Nope. We don't have to do anything. The user can always do:
> >
> >$ LC_CTYPE=some.locale svn commit -m "log message"
> >
> >That sets the LC_CTYPE environment variable for just the one execution of the
> >'svn' program.
> >
> 
> Please, let's just for once hear some design discussion that doesn't
> assume all the world is Unix. What you propose only works in
> Bourne-like shells (even the csh incantation is different).

Not that it really addresses your complaint, but there is a neat
command called `env' which I often use because it allows you to run
commands with local environment variable settings regardless of
shell.  You just run it like this

  env LC_CTYPE=some.locale svn commit -m "log message"

i.e. just like the Bourne shell syntax but with the command name "env"
first.  Since the syntax is parsed by env, this works fine in (t)csh
as well.  It probably even works on Windows (I'm assuming that the
command env is included in cygwin).

Just had to point this out, since this incredibly useful command seems
to be little known.
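
For a front-end that launches svn itself, rather than a user typing at a
shell, the same per-invocation override needs no shell syntax at all.  A
sketch in Python, where child_env and run_svn are hypothetical helper names
and the presence of an svn binary on the PATH is assumed:

```python
import os
import subprocess

def child_env(lc_ctype, base=None):
    """Return a copy of the environment with LC_CTYPE overridden."""
    env = dict(os.environ if base is None else base)
    env["LC_CTYPE"] = lc_ctype
    return env

def run_svn(args, lc_ctype):
    """Run svn with LC_CTYPE set for this one invocation only."""
    return subprocess.run(["svn", *args], env=child_env(lc_ctype))
```

This only places the variable in the child's environment; whether the
child's platform actually honors LC_CTYPE is a separate question.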

> We already have a --locale option. It was introduced so that
> front-ends that wrap the command line client (and there will be such;
> Emacs vc-mode comes to mind) will get predictable output in the
> presence of localized messages. We might as well reuse that, or
> introduce a similar option (--input-locale?) for log messages, file
> names, prop names, etc. I'd support that, FWIW.

How (when) will the --locale option affect the locale setting?  If I
do

  svn -m 'log message' --locale zh_TW.BIG5

should this be interpreted differently from

  svn --locale zh_TW.BIG5 -m 'log message'

?  Environment variables are simpler because they are always set
before the program execution starts.

  // Marcus


Re: Internationalizing applications is hard

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Thanks for the list, Bill.  We're going to have to prioritize, clearly
:-).


"Bill Tutt" <ra...@lyra.org> writes:
> Indeed. Just to show how complicated i18n can become. Win32
> applications have to pay attention to lots of i18n details. Without
> further ado, here's that list:
> 
> * System Locale: 
> Determines which bitmap fonts, and OEM, ANSI, and MAC code pages are
> defaults for the system. This only affects applications that are not
> fully Unicode.
> API Name: GetSystemDefaultLangID
> * User Locale: 
> Determines which settings are used for formatting dates, times,
> currency, and numbers as a default for each user. Also determines the
> sort order for sorting text.
> API Name: GetUserDefaultLangID
> * Thread Locale:
> Determines which settings are used for formatting dates, times,
> currency, and large numbers for a thread. Also determines the sort order
> for sorting text. Defaults to User Locale.
> API Name: GetThreadLocale
> This isn't as applicable for UI applications, but it's kind of required
> for service run apps.
> * Input Locale:
> A pair consisting of language and a method of input.
> API Name: GetKeyboardLayout
> * System UI Language:
> Determines the default language of menus and dialogs, messages, INF
> files, and help files.
> API Name: GetSystemDefaultUILanguage
> * User UI Language:
> Determines the language of menus and dialogs, messages, and help files.
> API Name: GetUserDefaultUILanguage
> 
> Of course just to be annoying, you can't infer anything from all of the
> above returned locales for formatting date/time/currency for UI
> applications, since the Control Panel can override the default value of
> all of the above.
> 
> Indeed, life is still annoying in UI i18n land for just a few of the
> following reasons, and believe me this is only a small subset of the
> issues that can come up:
> * Bi-directional text (lots of fun stuff here, UI issues with mirroring
> coordinate spaces, and other odd things)
> * Fonts:
> Do not hard code font face names 
> Do not assume a given font is installed 
> Do not assume selected font supports the desired script
> * Local Calendar Systems: Hebrew, Buddhist, Hijri, etc..
> * Win32 Console Applications:
> The 8-bit console I/O functions use the OEM code page whereas all other
> 8-bit functions use the ANSI code page by default. To avoid conflict in
> code page conversions and to allow multilingual computing, your console
> output should be encoded as Unicode whenever possible.
>  
> Tips and considerations:
> * use WriteConsole to output Unicode strings. Note that this API works
> only on console handles and can not be used for a redirection to a disk
> file. 
> * If the output is being redirected to a disk file, use WriteFile with
> the current console code page that can be retrieved by
> GetConsoleOutputCP (the console code page might be different from the
> currently selected OEM code page!). 
> * Complex scripts (Arabic, Hebrew, Thai, ...) are not supported in
> console. 
> * Always create your log files with UTF-8 encoding. 
> * When doing text alignment (e.g. %1!-14s! %2!-14s! %3!-16s!), either
> allocate the width of columns dynamically or truncate text wider than
> the columns' width. 
> * To make sure that your multilingual resources are displayed properly
> in the console window, always set your console thread locale according
> to console output code page by using SetThreadUILanguage.
> 
> * String Comparisons:
> If your string comparisons are not locale proof: e.g.: Using a locale
> dependent case insensitive string compare on "GIF". (Doesn't work for
> Turkish)
> 
> Etc.... You get the idea.
> 
> Bill
> ----
> Do you want a dangerous fugitive staying in your flat?
> No.
> Well, don't upset him and he'll be a nice fugitive staying in your flat.
>  
> 
> > -----Original Message-----
> > From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
> > 
> > Changing the locale is not even what we want to do here.
> > 
> > Just because my log message text is in Big5, doesn't mean I don't
> > still want Subversion's error and other messages printed in the
> > dominant system locale (non-Big5).
> > 
> > What I was thinking of was not a --locale option, but simply an option
> > (long opt only, no short equiv) that says "my log message is in
> > encoding FOO", like this:
> > 
> >   --log-message-encoding=FOO
> > 
> > The main purpose of this is for other programs to pass the option,
> > though a human is perfectly free to do so.
> > 
> > I don't have a problem "cluttering" the long-option space.  It's the
> > short opt space where we need to be ultra conservative.
> > 
> > -K
> > 


Internationalizing applications is hard

Posted by Bill Tutt <ra...@lyra.org>.

Indeed. Just to show how complicated i18n can become. Win32
applications have to pay attention to lots of i18n details. Without
further ado, here's that list:

* System Locale: 
Determines which bitmap fonts, and OEM, ANSI, and MAC code pages are
defaults for the system. This only affects applications that are not
fully Unicode.
API Name: GetSystemDefaultLangID
* User Locale: 
Determines which settings are used for formatting dates, times,
currency, and numbers as a default for each user. Also determines the
sort order for sorting text.
API Name: GetUserDefaultLangID
* Thread Locale:
Determines which settings are used for formatting dates, times,
currency, and large numbers for a thread. Also determines the sort order
for sorting text. Defaults to User Locale.
API Name: GetThreadLocale
This isn't as applicable for UI applications, but it's kind of required
for service run apps.
* Input Locale:
A pair consisting of language and a method of input.
API Name: GetKeyboardLayout
* System UI Language:
Determines the default language of menus and dialogs, messages, INF
files, and help files.
API Name: GetSystemDefaultUILanguage
* User UI Language:
Determines the language of menus and dialogs, messages, and help files.
API Name: GetUserDefaultUILanguage

Of course just to be annoying, you can't infer anything from all of the
above returned locales for formatting date/time/currency for UI
applications, since the Control Panel can override the default value of
all of the above.

Indeed, life is still annoying in UI i18n land for just a few of the
following reasons, and believe me this is only a small subset of the
issues that can come up:
* Bi-directional text (lots of fun stuff here, UI issues with mirroring
coordinate spaces, and other odd things)
* Fonts:
Do not hard code font face names 
Do not assume a given font is installed 
Do not assume selected font supports the desired script
* Local Calendar Systems: Hebrew, Buddhist, Hijri, etc..
* Win32 Console Applications:
The 8-bit console I/O functions use the OEM code page whereas all other
8-bit functions use the ANSI code page by default. To avoid conflict in
code page conversions and to allow multilingual computing, your console
output should be encoded as Unicode whenever possible.
 
Tips and considerations:
* use WriteConsole to output Unicode strings. Note that this API works
only on console handles and can not be used for a redirection to a disk
file. 
* If the output is being redirected to a disk file, use WriteFile with
the current console code page that can be retrieved by
GetConsoleOutputCP (the console code page might be different from the
currently selected OEM code page!). 
* Complex scripts (Arabic, Hebrew, Thai, ...) are not supported in
console. 
* Always create your log files with UTF-8 encoding. 
* When doing text alignment (e.g. %1!-14s! %2!-14s! %3!-16s!), either
allocate the width of columns dynamically or truncate text wider than
the columns' width. 
* To make sure that your multilingual resources are displayed properly
in the console window, always set your console thread locale according
to console output code page by using SetThreadUILanguage.

* String Comparisons:
If your string comparisons are not locale proof: e.g.: Using a locale
dependent case insensitive string compare on "GIF". (Doesn't work for
Turkish)
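
The "GIF" case can be sketched in Python.  Python's str.upper() is not
locale-sensitive, so the Turkish rule is simulated here by a hypothetical
helper upper_tr; the two is_gif_* functions are likewise illustrative
names.  In Turkish, the uppercase of "i" is the dotted capital İ (U+0130),
so a locale-dependent comparison of "gif" against "GIF" fails:

```python
def upper_tr(s):
    """Simulated Turkish uppercasing: i -> İ (U+0130), ı (U+0131) -> I."""
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

def is_gif_locale_dependent(ext):
    # Buggy pattern: uppercase with the user's (here Turkish) rules.
    return upper_tr(ext) == "GIF"

def is_gif_locale_proof(ext):
    # str.upper() applies the same Unicode mapping in every locale.
    return ext.upper() == "GIF"

print(is_gif_locale_dependent("gif"), is_gif_locale_proof("gif"))
```

For identifiers such as file extensions or protocol keywords, a fixed,
locale-independent mapping is usually what is wanted.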

Etc.... You get the idea.

Bill
----
Do you want a dangerous fugitive staying in your flat?
No.
Well, don't upset him and he'll be a nice fugitive staying in your flat.
 

> -----Original Message-----
> From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
> 
> Changing the locale is not even what we want to do here.
> 
> Just because my log message text is in Big5, doesn't mean I don't
> still want Subversion's error and other messages printed in the
> dominant system locale (non-Big5).
> 
> What I was thinking of was not a --locale option, but simply an option
> (long opt only, no short equiv) that says "my log message is in
> encoding FOO", like this:
> 
>   --log-message-encoding=FOO
> 
> The main purpose of this is for other programs to pass the option,
> though a human is perfectly free to do so.
> 
> I don't have a problem "cluttering" the long-option space.  It's the
> short opt space where we need to be ultra conservative.
> 
> -K
> 




Re: Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Changing the locale is not even what we want to do here.

Just because my log message text is in Big5, doesn't mean I don't
still want Subversion's error and other messages printed in the
dominant system locale (non-Big5).

What I was thinking of was not a --locale option, but simply an option
(long opt only, no short equiv) that says "my log message is in
encoding FOO", like this:

  --log-message-encoding=FOO

The main purpose of this is for other programs to pass the option,
though a human is perfectly free to do so.

I don't have a problem "cluttering" the long-option space.  It's the
short opt space where we need to be ultra conservative.
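
A rough Python sketch of what such an option would drive.  The function
name log_message_to_utf8 and its parameter are hypothetical, and the
fallback to the locale's encoding when the option is absent is an
assumption rather than something settled in this thread.  The point is
that only the log message is recoded; the locale used for error and other
messages is left alone:

```python
import locale

def log_message_to_utf8(raw, log_message_encoding=None):
    """Convert raw log message bytes to UTF-8.

    log_message_encoding models the proposed --log-message-encoding
    value; when it is None, use the encoding of the current locale.
    """
    encoding = log_message_encoding or locale.getpreferredencoding(False)
    return raw.decode(encoding).encode("utf-8")

big5_bytes = "中文".encode("big5")
utf8_bytes = log_message_to_utf8(big5_bytes, "big5")
print(utf8_bytes.decode("utf-8"))
```

An unknown encoding name raises LookupError and undecodable input raises
UnicodeDecodeError, so a real client would need to report both cases.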

-K


Re: Call For Votes: converting log messages to UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Greg Stein wrote:

>On Tue, Jun 04, 2002 at 10:56:08PM +0200, Branko Čibej wrote:
>  
>
>>Greg Stein wrote:
>>...
>>    
>>
>>>Nope. We don't have to do anything. The user can always do:
>>>
>>>$ LC_CTYPE=some.locale svn commit -m "log message"
>>>
>>>That sets the LC_CTYPE environment variable for just the one execution of the
>>>'svn' program.
>>>      
>>>
>>Please, let's just for once hear some design discussion that doesn't 
>>assume all the world is Unix. What you propose only works in Bourne-like 
>>shells (even the csh incantation is different).
>>    
>>
>
>The point is: it is simple to tweak your environment for an invocation. Our
>defined way to get the locale is the environment variables. Why have more
>than one way to do the same thing?
>
>Regarding non-Unix: in the part of my email (which you cut :-), I pointed
>out how other clients (e.g. a nifty Windows client) can adjust its character
>set as much as it wants.
>
But I want to use the command-line client on Windows (perhaps with 
Emacs), and I'm not aware of an environment variable that would reliably 
change the locale there. That's why I'd rather use a switch, which will 
work the same way on all platforms.

>>We already have a --locale option. It was introduced so that front-ends 
>>that wrap the command line client (and there will be such; Emacs vc-mode 
>>comes to mind) will get predictable output in the presence of localized 
>>messages.
>>    
>>
>
>And I might debate its proper existence. It was added *long* before we had a
>plan for i18n or l10n. I think it was totally premature, and it shouldn't be
>a justification for other behavior.
>
>  
>
>>We might as well reuse that, or introduce a similar option 
>>(--input-locale?) for log messages, file names, prop names, etc. I'd 
>>support that, FWIW.
>>    
>>
>
>I don't think that adding more switches is the answer. There is already a
>mechanism: it is called LC_CTYPE.
>
>Heck, I will suggest right now: let's eliminate the --locale option. I see
>no point in the switch. It has unknown effects, and it isn't even used
>reliably (i.e. it won't meet user expectations).
>
Yes it will, provided we document that the only reliable, portable use 
is "--locale=C" -- which usage, btw, is the whole point of having such a 
switch.

>When we have an i18n plan post-1.0, then we can reintroduce the switch *if
>it is needed*.
>  
>



-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Tue, Jun 04, 2002 at 10:56:08PM +0200, Branko Čibej wrote:
> Greg Stein wrote:
>...
> >Nope. We don't have to do anything. The user can always do:
> >
> >$ LC_CTYPE=some.locale svn commit -m "log message"
> >
> >That sets the LC_CTYPE environment variable for just the one execution of the
> >'svn' program.
>
> Please, let's just for once hear some design discussion that doesn't 
> assume all the world is Unix. What you propose only works in Bourne-like 
> shells (even the csh incantation is different).

The point is: it is simple to tweak your environment for an invocation. Our
defined way to get the locale is the environment variables. Why have more
than one way to do the same thing?

Regarding non-Unix: in the part of my email (which you cut :-), I pointed
out how other clients (e.g. a nifty Windows client) can adjust its character
set as much as it wants.

> We already have a --locale option. It was introduced so that front-ends 
> that wrap the command line client (and there will be such; Emacs vc-mode 
> comes to mind) will get predictable output in the presence of localized 
> messages.

And I might debate its proper existence. It was added *long* before we had a
plan for i18n or l10n. I think it was totally premature, and it shouldn't be
a justification for other behavior.

> We might as well reuse that, or introduce a similar option 
> (--input-locale?) for log messages, file names, prop names, etc. I'd 
> support that, FWIW.

I don't think that adding more switches is the answer. There is already a
mechanism: it is called LC_CTYPE.

Heck, I will suggest right now: let's eliminate the --locale option. I see
no point in the switch. It has unknown effects, and it isn't even used
reliably (i.e. it won't meet user expectations).

When we have an i18n plan post-1.0, then we can reintroduce the switch *if
it is needed*.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Call For Votes: converting log messages to UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Greg Stein wrote:

>And I have a meta-comment to make ...
>
>
>On Tue, Jun 04, 2002 at 10:56:08PM +0200, Branko Čibej wrote:
>  
>
>>...
>>Please, let's just for once hear some design discussion that doesn't 
>>assume all the world is Unix.
>>    
>>
>
>Great. *I* am not going to be representing that position. *You* are free to
>do so. I simply don't have the familiarity of day-to-day development and
>usage. My Windows usage is mostly as a dumb user.
>
>[ yes, I've developed on Windows, but it was not Windows-heavy ]
>
>
>Point is: if you want broad input, then you'll have to help. Each person is
>going to contribute their personal input -- you can't expect everybody to be
>able to understand all the points. (and this goes to the coding, too)
>  
>
I try, I do, but it's so lonely out here that sometimes my insecurity 
starts showing. :-)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Stein <gs...@lyra.org>.

And I have a meta-comment to make ...

On Tue, Jun 04, 2002 at 10:56:08PM +0200, Branko Čibej wrote:
>...
> Please, let's just for once hear some design discussion that doesn't 
> assume all the world is Unix.

Great. *I* am not going to be representing that position. *You* are free to
do so. I simply don't have the familiarity of day-to-day development and
usage. My Windows usage is mostly as a dumb user.

[ yes, I've developed on Windows, but it was not Windows-heavy ]

Point is: if you want broad input, then you'll have to help. Each person is
going to contribute their personal input -- you can't expect everybody to be
able to understand all the points. (and this goes to the coding, too)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Call For Votes: converting log messages to UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Greg Stein wrote:

>On Tue, Jun 04, 2002 at 10:07:22AM -0500, Ben Collins-Sussman wrote:
>  
>
>>...
>>Or, as kfogel proposed earlier, I think a good client application
>>should allow the user to specify a temporary locale for the log
>>message.  Something like
>>
>>   svn commit -m "log message" --locale=some.locale
>>    
>>
>
>Nope. We don't have to do anything. The user can always do:
>
>$ LC_CTYPE=some.locale svn commit -m "log message"
>
>That sets the LC_CTYPE environment variable for just the one execution of the
>'svn' program.
>  
>
Please, let's just for once hear some design discussion that doesn't 
assume all the world is Unix. What you propose only works in Bourne-like 
shells (even the csh incantation is different).

We already have a --locale option. It was introduced so that front-ends 
that wrap the command line client (and there will be such; Emacs vc-mode 
comes to mind) will get predictable output in the presence of localized 
messages. We might as well reuse that, or introduce a similar option 
(--input-locale?) for log messages, file names, prop names, etc. I'd 
support that, FWIW.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: Call For Votes: converting log messages to UTF-8

Posted by Ben Collins-Sussman <su...@collab.net>.

Greg Stein <gs...@lyra.org> writes:


> >    svn commit -m "log message" --locale=some.locale
> 
> Nope. We don't have to do anything. The user can always do:
> 
> $ LC_CTYPE=some.locale svn commit -m "log message"

Sweet!  Sign me up!



Re: Call For Votes: converting log messages to UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Tue, Jun 04, 2002 at 10:07:22AM -0500, Ben Collins-Sussman wrote:
>...
> Or, as kfogel proposed earlier, I think a good client application
> should allow the user to specify a temporary locale for the log
> message.  Something like
> 
>    svn commit -m "log message" --locale=some.locale

Nope. We don't have to do anything. The user can always do:

$ LC_CTYPE=some.locale svn commit -m "log message"

That sets the LC_CTYPE environment variable for just the one execution of the
'svn' program.

> This way the 95% case is covered by assuming the system locale, and
> when weirdos like kfogel occasionally decide to write a BIG5 log
> message in emacs, he can "nudge" the svn client as a one time thing.

I agree, but we don't need More Parameters(tm). Setting the locale for a
single program invocation is easy to do.

[ that said: recall that something like the --locale switch is part of the
  client code. something like a GUI client could have various switches and
  configs and whatnot. the client only needs to worry about constructing
  UTF-8 by the time it hits the SVN libraries ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: Call For Votes: converting log messages to UTF-8

Posted by Ben Collins-Sussman <su...@collab.net>.

Hontvari Jozsef <ho...@solware.com> writes:

> Brane Čibej
> > But you /can/ do exactly that on Windows. Even worse, you can change the
> > input locale used by any program on the fly, without touching the system
> > locale, and there is *no* way for the svn client (that only sees the
> > system locale, and the contents of a file) to figure out what happened.
> 
> You can fool all three of the proposed systems in this way (UTF-8, email,
> binary). But if you consider this then you are optimizing for a user who
> works hard to fool the system - instead of the user who simply wants to use
> it in an international environment. That is a question of specification, it
> must be stated that svn command line client expects any log message in the
> system locale (if not specified otherwise, especially "UTF-8 already" can be
> a useful option theoretically).

Or, as kfogel proposed earlier, I think a good client application
should allow the user to specify a temporary locale for the log
message.  Something like

   svn commit -m "log message" --locale=some.locale

This way the 95% case is covered by assuming the system locale, and
when weirdos like kfogel occasionally decide to write a BIG5 log
message in emacs, he can "nudge" the svn client as a one time thing.



Re: Call For Votes: converting log messages to UTF-8

Posted by Hontvari Jozsef <ho...@solware.com>.

Brane Čibej
> But you /can/ do exactly that on Windows. Even worse, you can change the
> input locale used by any program on the fly, without touching the system
> locale, and there is *no* way for the svn client (that only sees the
> system locale, and the contents of a file) to figure out what happened.

You can fool all three of the proposed systems in this way (UTF-8, email,
binary). But if you consider this then you are optimizing for a user who
works hard to fool the system - instead of the user who simply wants to use
it in an international environment. That is a question of specification, it
must be stated that svn command line client expects any log message in the
system locale (if not specified otherwise, especially "UTF-8 already" can be
a useful option theoretically).

If somebody worries about the possible data loss: you can recover the
original data from all three systems. Email and binary store the original
byte stream, UTF-8 is reversible.





Re: Call For Votes: converting log messages to UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Hontvari Jozsef wrote:

>I guess everybody who regularly uses Latin-2 or any other non-Latin-1 charset
>will vote for #1. I am pretty sure from past experience that if you do not
>declare explicitly that any text in subversion is UTF-8 encoded, then in
>practice subversion (and its clients) will be typically used as an ASCII
>only application. (That also means that this should not be a client option,
>it must be enforced.)
>
>I do not know what the situation is with Unix, but in Windows using a locale
>has been straightforward for years, and I cannot really imagine how a client
>could miss a conversion. (The only additional feature that would be
>useful, theoretically, is the ability to temporarily override the assumed
>character encoding when supplying input to a client in a file. I mean, if my
>locale is Latin-2 but I saved the log message in UTF-8, for example, then I
>could tell the client that this file is in UTF-8 and not
>in Latin-2.)
>  
>
But you /can/ do exactly that on Windows. Even worse, you can change the 
input locale used by any program on the fly, without touching the system 
locale, and there is *no* way for the svn client (that only sees the 
system locale, and the contents of a file) to figure out what happened.



-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: Call For Votes: converting log messages to UTF-8

Posted by Hontvari Jozsef <ho...@solware.com>.

I guess everybody who regularly uses Latin-2 or any other non-Latin-1 charset
will vote for #1. I am pretty sure from past experience that if you do not
declare explicitly that any text in Subversion is UTF-8 encoded, then in
practice Subversion (and its clients) will typically be used as an
ASCII-only application. (That also means that this should not be a client
option; it must be enforced.)

I do not know what the situation is on Unix, but on Windows using a locale
has been straightforward for years, and I cannot really imagine how a client
could miss a conversion. (The only additional feature that would be useful,
theoretically, is a way to temporarily override the assumed character
encoding when supplying input to a client in a file. I mean, if my locale is
Latin-2 but I saved the log message in UTF-8, for example, then I would be
able to tell the client that this file is in UTF-8 and not in Latin-2.)
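
A rough sketch of such an override, in Python rather than in the client's own
code: the function name decode_log_message and its encoding_override parameter
are hypothetical, and locale.getpreferredencoding stands in for however the
client would really look up the locale's charset.

```python
import locale


def decode_log_message(raw: bytes, encoding_override=None) -> str:
    """Decode log message bytes read from a file.

    Uses the caller-supplied encoding if one is given; otherwise falls
    back to the charset of the current locale.
    """
    charset = encoding_override or locale.getpreferredencoding(False)
    return raw.decode(charset)


# The same text saved two different ways by the user.
text = "Příliš"
saved_as_utf8 = text.encode("utf-8")
saved_as_latin2 = text.encode("iso8859-2")

# Naming the file's real encoding gives the same text in both cases,
# regardless of what the locale says.
from_utf8_file = decode_log_message(saved_as_utf8, "utf-8")
from_latin2_file = decode_log_message(saved_as_latin2, "iso8859-2")
```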

--- Original Message -----
From: "Karl Fogel" <kf...@newton.ch.collab.net>
To: <su...@collab.net>
Cc: <de...@subversion.tigris.org>
Sent: Friday, May 31, 2002 6:09 PM
Subject: Re: Call For Votes: converting log messages to UTF-8

> Ben Collins-Sussman <su...@collab.net> writes:
> > In my mind, risk #1 is much more dangerous.  If the logmsg is
> > accidentally corrupted at input-time, it's gone forever.  This is much
> > worse than possibly seeing a garbled display in some GUI textbox --
> > that problem is fixable by heuristics (or project policy).
>
> I'd like to add that I have used such heuristics in real life.
>
> More than once, I've had data in some unknown charset (I knew it was
> Chinese, I just didn't know which encoding).  I've put it in a display
> editor and basically flipped through various encodings until suddenly
> it "clicked" and the text made sense.
>
> This heuristic depends on user feedback, but it's 100% reliable (data
> rarely makes sense in two different encodings :-) )... And most
> importantly, it was only possible because I had the *original* data,
> not some tool's mis-reencoding of the original data.
>
> This is why I don't buy the argument that the data is "useless" if you
> don't know the charset.  It's simply not true.  You may have to do
> some work, even some non-automatable work, but you can almost always
> figure it out with some basic educated guessing.  (And automated
> algorithms based on word-frequency tables are easy to imagine, though
> I don't know if anyone's implemented that yet.)  But nothing can be
> done if the original data was lost.
>
>
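
The trial-and-error approach Karl describes in the quoted text can be
sketched in a few lines of Python. This is only an illustration: the list of
candidate encodings and the function name candidate_decodings are
assumptions, and the final "does it make sense?" step is still left to a
human reader.

```python
# Candidate encodings to flip through; an arbitrary illustrative list.
CANDIDATES = ["utf-8", "big5", "gbk", "euc_jp"]


def candidate_decodings(raw: bytes, encodings=CANDIDATES) -> dict:
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes cannot be text in this encoding at all.
            results[name] = None
    return results


# Original bytes of unknown encoding (here secretly Big5).
original_text = "中文"
raw = original_text.encode("big5")
decodings = candidate_decodings(raw)
```

Showing each non-None entry of decodings to the user lets them pick the one
that reads as real text. This only works because raw holds the original
bytes; a lossy mis-conversion at input time would leave nothing to try.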


Call For Votes: converting log messages to UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

It seems to me that everyone's pretty much stated their reasons for
and against now.  We're no longer adding new material to the
discussion, we're just reiterating points already made.

So, I'd like to propose a vote.

I hope we all agree that we're just choosing a default behavior for
the client here -- users can get the alternate behavior by setting or
unsetting a config option in ~/.subversion/options.  I.e., we should
offer conversion to UTF-8 for those who want it, and should not
unconditionally *force* conversion to UTF-8 for those who know they
don't want it.  The only question is how we behave out-of-the-box.

(If this is controversial, I guess we're not ready to vote yet.)

The two choices are

   [ ] By default, recode log messages from user input to UTF-8, using
       the locale to get a best guess for the original encoding of the
       user input.

   [ ] By default, do no re-encoding of log messages.  Store exactly
       the byte sequence the user enters.  When printing log messages,
       the svn client would simply assume that the byte '\n' is a line
       end (it prints out the number of lines in each message as part
       of the msg header).  When printing out the log message as xml,
       we'd do our best to escape bytes that are incompatible with
       being xml content; this probably implies treating the message
       as Latin-1 or something, but I haven't thought carefully about
       that.
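
As a rough illustration of the two choices (a Python sketch, not the client's
implementation), the functions below show recoding from a locale charset,
storing bytes verbatim, counting lines by the '\n' byte, and one possible way
of escaping for XML by treating the bytes as Latin-1. All function names are
made up, and replacing control characters with '?' is an assumption, not
something proposed in this thread.

```python
from xml.sax.saxutils import escape


def store_recoded(raw: bytes, locale_charset: str) -> bytes:
    """Choice 1: recode user input from the locale's charset to UTF-8."""
    return raw.decode(locale_charset).encode("utf-8")


def store_verbatim(raw: bytes) -> bytes:
    """Choice 2: store exactly the byte sequence the user entered."""
    return raw


def count_lines(stored: bytes) -> int:
    """Count lines by treating the byte '\\n' as the only line end."""
    lines = stored.split(b"\n")
    if lines[-1] == b"":
        lines.pop()  # a trailing newline does not start another line
    return len(lines)


def xml_escape_as_latin1(stored: bytes) -> str:
    """Escape raw bytes for XML output by treating them as Latin-1."""
    text = stored.decode("latin-1")  # every byte maps to one character
    # XML 1.0 forbids most control characters; replace them (assumption).
    cleaned = "".join(
        ch if ch in "\t\n\r" or ord(ch) >= 0x20 else "?" for ch in text
    )
    # Escape markup characters, then emit non-ASCII as numeric references.
    return escape(cleaned).encode("ascii", "xmlcharrefreplace").decode("ascii")


raw_input_bytes = "žluťoučký\n".encode("iso8859-2")
recoded = store_recoded(raw_input_bytes, "iso8859-2")
verbatim = store_verbatim(raw_input_bytes)
```

Both stored forms count as one line, but only the recoded one is known to be
UTF-8; the verbatim one keeps the user's bytes and defers any interpretation
to display time.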

Don't worry about implementation difficulty.  Both of these choices
are easy (and indeed, if we want to support both, we need to implement
both, which implies changing the log_msg params to counted-length
strings internally no matter what).

-Karl
