Posted to dev@subversion.apache.org by Greg Stein <gs...@lyra.org> on 2002/05/30 22:45:19 UTC

Re: use of UTF-8

On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
>...
> Right, right.  But the `log_msg' parameter to functions was not 
> `char *' until very recently, and for reasons having nothing to do
> with some prior decision about them being UTF-8.

Yup.

> I'm sorry to keep repeating myself.  It seems (maybe I'm
> misunderstanding?) that you brought up the type of those params as
> indicating that some decision had already been made about their
> charset.

Misunderstanding :-). The logic went like this: the param is a char*, thus
it will (typically) be representing characters, which means it needs an
associated charset, which we have previously stated would be UTF-8.

>...
> > To be concrete: either those char* params are UTF-8, or you add a second
> > parameter to state their charset. (or you just go charset neutral which
> > isn't really a good option)
> 
> Those aren't the only options here (and you're dismissing charset
> neutral as an obviously bad third option, mentioned only to be
> rejected, when in fact it's what this whole thread is really about).

I dismissed it because there has already been quite a bit of material
(from Garrett, Jon, etc etc) stating how clients need to know the charset to
be able to do anything with those log messages.

  --> They contain characters. You need to know their charset.

I'm not sure how it is possible to really consider otherwise. To display
those characters to the user, you need the charset. To edit them, to set
them, to email them, to do whatever.

Basically, I can't accept the notion that "leaving it up to arbitrary
interpretation" is in any way a valid approach.

> I see three options on the table:

Four.

     - add a second parameter to the relevant data structures and routines
       to hold the character set of the string in question (while we're
       talking about log messages here, I think there are others; the rule
       for log msgs will apply everywhere)

>    - Keep them as char *, declare them UTF-8, and convert user input
>      as best we can.
> 
>    - Keep them as char *, declare no particular charset, but don't
>      allow zero bytes.
> 
>    - Convert them back to counted-length strings and treat them as
>      binary data again (I guess this is the most militantly charset
>      neutral option).

Of the above four approaches:

1) a second param is very heavyweight from a conceptual and coding
   standpoint. and, in the end, we'll probably have to do conversions
   anyways, so allowing an arbitrary charset rather than fixed doesn't
   seem to buy a lot.

2) my favorite. note that the *client* does the conversions. the libraries
   simply assume all text strings are in UTF-8.

3) untenable for the clients.

4) this is similar to (3), but we just allow more flexibility.
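
A minimal sketch of the division of labour described in (2), assuming hypothetical helper names (`client_to_utf8`, `library_set_log`, `library_get_log`) and a plain dict standing in for revision properties. None of this is Subversion's actual API; it only illustrates that the conversion happens once, at the client boundary, and that the library layer never looks at the locale.

```python
def client_to_utf8(typed: bytes, user_charset: str) -> bytes:
    """Client side: re-encode what the user typed from their charset to UTF-8."""
    return typed.decode(user_charset).encode("utf-8")


def library_set_log(revprops: dict, log_msg: bytes) -> None:
    """Library side: store the bytes as given; they are assumed to be UTF-8."""
    revprops["svn:log"] = log_msg


def library_get_log(revprops: dict) -> str:
    """Library side: every reader can decode with one fixed, known charset."""
    return revprops["svn:log"].decode("utf-8")


# Hypothetical usage: the user works in a Latin-1 locale.
revprops = {}
typed = "Änderung".encode("iso-8859-1")
library_set_log(revprops, client_to_utf8(typed, "iso-8859-1"))
```

Under this scheme a client in any other locale only has to convert from UTF-8 to its own charset for display, without asking which charset the committer used.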

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Colin Putney <co...@whistler.com> writes:
> I'm wondering if this boils down to a question of what the 1.0
> behaviour will be. I'm pretty convinced that the email-like approach is
> the way to go, but it does require some changes to the existing codebase.
> 
> Is this something that should be part of the I18N work that will be
> done after 1.0? How much of the desire for UTF-8 is really a desire to
> get 1.0 out the door?

I don't think this relates to 1.0 much at all.  (We might make the
discovery that UTF-8 conversion is a good idea, or that it's a bad
idea, at any point between now and 1.0, or some time after 1.0.)




Re: use of UTF-8

Posted by Colin Putney <co...@whistler.com>.

On Monday, June 3, 2002, at 12:18  PM, Karl Fogel wrote:

> Branko Čibej <br...@xbc.nu> writes:
>> Um. I'd rather say it opens up a huge can of very hungry carnivorous
>> worms. While it might be true that you can trust the locale settings
>> on most machines today (something I'm not at all sure about), you
>> can't trust programs. On Windows, for instance, I can set notepad as
>> my $EDITOR, then go and save the log message as UTF-8 or two different
>> kinds of UTF-16 (big- and little-endian). My locale info says I'm
>> using codepage 1250. Converting that text would produce
>> ... interesting? ... results.
>
> I'm still worried about this scenario too, but the reason I'm willing
> to risk it is that we can change Subversion if we discover we were
> wrong.  So let's see how often problems happen in practice.  After
> all, if conversion to UTF-8 *does* corrupt log messages in real life,
> then we can simply say "Well, that was a mistake", and
> backwards-compatibly change the client libraries' behavior.
>
> It would be simple enough to switch to email/mime-like behavior.  Just
> stop converting to UTF-8, and start storing the literal bits of the
> log message, along with a best guess at the encoding for which they
> were written -- i.e., a new revision prop, `svn:log-message-encoding'
> or whatever.  Revisions that don't have that property are assumed to
> be in UTF-8.

I'm wondering if this boils down to a question of what the 1.0 behaviour 
will be. I'm pretty convinced that the email-like approach is the way to go, 
but it does require some changes to the existing codebase.

Is this something that should be part of the I18N work that will be done 
after 1.0? How much of the desire for UTF-8 is really a desire to get 
1.0 out the door?


Colin Putney
Whistler.com



Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Branko Čibej <br...@xbc.nu> writes:
> Um. I'd rather say it opens up a huge can of very hungry carnivorous
> worms. While it might be true that you can trust the locale settings
> on most machines today (something I'm not at all sure about), you
> can't trust programs. On Windows, for instance, I can set notepad as
> my $EDITOR, then go and save the log message as UTF-8 or two different
> kinds of UTF-16 (big- and little-endian). My locale info says I'm
> using codepage 1250. Converting that text would produce
> ... interesting? ... results.

I'm still worried about this scenario too, but the reason I'm willing
to risk it is that we can change Subversion if we discover we were
wrong.  So let's see how often problems happen in practice.  After
all, if conversion to UTF-8 *does* corrupt log messages in real life,
then we can simply say "Well, that was a mistake", and
backwards-compatibly change the client libraries' behavior.

It would be simple enough to switch to email/mime-like behavior.  Just
stop converting to UTF-8, and start storing the literal bits of the
log message, along with a best guess at the encoding for which they
were written -- i.e., a new revision prop, `svn:log-message-encoding'
or whatever.  Revisions that don't have that property are assumed to
be in UTF-8.
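
A rough sketch of that fallback scheme, assuming a plain dict for revision properties and hypothetical helpers `store_log` and `read_log`. The property name `svn:log-message-encoding` is the one floated above ("or whatever"), not an existing Subversion property.

```python
ENCODING_PROP = "svn:log-message-encoding"  # name proposed above, hypothetical


def store_log(revprops: dict, raw: bytes, guessed_encoding=None) -> None:
    """Store the literal bytes, plus a best guess at their encoding if known."""
    revprops["svn:log"] = raw
    if guessed_encoding is not None:
        revprops[ENCODING_PROP] = guessed_encoding.encode("ascii")


def read_log(revprops: dict) -> str:
    """Decode using the recorded encoding; revisions without it are UTF-8."""
    encoding = revprops.get(ENCODING_PROP, b"UTF-8").decode("ascii")
    # errors="replace" because the recorded encoding is only a best guess.
    return revprops["svn:log"].decode(encoding, errors="replace")


# An old revision, written while the client still converted to UTF-8.
old_rev = {"svn:log": "Čibej: fix typo".encode("utf-8")}

# A new revision, stored as literal cp1250 bytes with the guess recorded.
new_rev = {}
store_log(new_rev, "Čibej: fix typo".encode("cp1250"), "cp1250")
```

Both revisions read back as the same text, which is what makes the switch backwards compatible: the absence of the property carries the meaning "UTF-8".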


Re: use of UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Greg Stein wrote:

>Sheesh. Of course I can see it. And it is a very wrong position, when we can
>so *easily* just say "it is UTF-8" and be done with it. That opens up a
>whole world of simplicity and determinism for the applications that will be
>built on top of Subversion.

Um. I'd rather say it opens up a huge can of very hungry carnivorous
worms. While it might be true that you can trust the locale settings on 
most machines today (something I'm not at all sure about), you can't 
trust programs. On Windows, for instance, I can set notepad as my 
$EDITOR, then go and save the log message as UTF-8 or two different 
kinds of UTF-16 (big- and little-endian). My locale info says I'm using 
codepage 1250. Converting that text would produce ... interesting? ... 
results.
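
The scenario can be reproduced in a few lines. This is only an illustration, assuming a hypothetical `convert_trusting_locale` helper that does what a locale-trusting client would do: decode the editor's bytes with the locale's charset and re-encode as UTF-8.

```python
def convert_trusting_locale(raw: bytes, locale_charset: str) -> bytes:
    """Decode with the charset the locale claims, then re-encode as UTF-8."""
    text = raw.decode(locale_charset, errors="replace")
    return text.encode("utf-8")


message = "Čibej fixed the parser"
correct = message.encode("utf-8")

results = {}
for editor_encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    saved = message.encode(editor_encoding)  # what the editor really wrote
    results[editor_encoding] = convert_trusting_locale(saved, "cp1250")
    # True would mean the stored log message survived the conversion.
    print(editor_encoding, results[editor_encoding] == correct)
```

The conversion only yields the right bytes when the editor really did save in the locale's charset. In every other case the damage is silent, because almost any byte sequence decodes "successfully" as codepage 1250.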


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: use of UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 09:29:47AM -0500, Karl Fogel wrote:
>...
> Again, CVS doesn't know the charset, and I never encountered a
> complaint about that, over years of doing more CVS support than most.

Granted. But CVS was never bound as tightly into client apps as SVN will be.
It is a different programming model for clients, and those apps will need
the appropriate information to be able to operate properly.

>...
> There is no clear win here.  If you are unable to see how it is even
> *possible* to consider being charset neutral, all I can say is, your

Sheesh. Of course I can see it. And it is a very wrong position, when we can
so *easily* just say "it is UTF-8" and be done with it. That opens up a
whole world of simplicity and determinism for the applications that will be
built on top of Subversion.

>...
> The software we are replacing _is_ effectively charset neutral for log
> messages, and prior to this, we had never listed that as one of the
> "bugs" we were aiming to fix in Subversion.

Bah. That is a non-starter. We've got a ton of things in our code that were
never listed as a "bug" in CVS.

-g

-- 
Greg Stein, http://www.lyra.org/


Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> Misunderstanding :-). The logic went like this: the param is a char*, thus
> it will (typically) be representing characters, which means it needs an
> associated charset, which we have previously stated would be UTF-8.

Uh.  Okay.  Then I wasn't misunderstanding, and my response (that the
param being `char *' has no bearing on this discussion) was
appropriate.

> I dismissed it because there has already been quite a bit of material
> (from Garrett, Jon, etc etc) stating how clients need to know the charset to
> be able to do anything with those log messages.
> 
>   --> They contain characters. You need to know their charset.
> 
> I'm not sure how it is possible to really consider otherwise. To display
> those characters to the user, you need the charset. To edit them, to set
> them, to email them, to do whatever.

Again, CVS doesn't know the charset, and I never encountered a
complaint about that, over years of doing more CVS support than most.
I'm sure CVS has disappointed people by this occasionally, and I just
haven't heard about it, but I'm equally certain that re-encoding log
messages will disappoint other people in other circumstances,
sometimes.

There is no clear win here.  If you are unable to see how it is even
*possible* to consider being charset neutral, all I can say is, your
inability to see it does not add any weight to your technical
arguments against it.  (I guess the other thing I can say is, I wonder
how you were able to use CVS all those years. :-) )

The software we are replacing _is_ effectively charset neutral for log
messages, and prior to this, we had never listed that as one of the
"bugs" we were aiming to fix in Subversion.

-K


Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Hudson <gh...@MIT.EDU> writes:
> On the other hand, there seems to be a fairly broad consensus for doing
> UTF-8/$LC_CTYPE character set conversion for filenames.  I am...
> confused as to why anyone advocates converting filenames and not log
> messages, since they are both text.  File contents are binary data. 
> Property values... might be binary data; that seems to be the consensus
> for now, anyway, although that leads to questions about how svn:ignore
> should be interpreted and such.  But log messages are definitely text.

Part of the justification is ease of implementation.  We have to sling
filenames around all over the place internally, and write/read them in
xml files appx seventy times a second.  It's just massively easier to
use `char *' UTF-8 for all that.


Re: use of UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.

On Thu, 2002-05-30 at 18:45, Greg Stein wrote:
> 3) untenable for the clients.

I'd like to keep a little perspective here.

If we don't solve the log message character set problem, then projects
are happy as long as:

  * They are willing to stick with ASCII log messages, or
  * All their developers use the same character set, or
  * All their developers use a UTF-8 native locale

(That third statement is a little forward-looking, but there has been
some progress in that direction.)

I believe this covers quite a lot of users--everyone who is happy with
CVS, for instance.  Subversion is not going to fail on account of not
doing character set conversion.

This is why I would be happy being charset-neutral and 8-bit clean (not
necessarily binary-clean) for all text fields.  Possibly happier, since
we would never be responsible for misconverting text when LC_CTYPE isn't
set properly, or anything like that.  Plus our code would be simpler.
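
As a sketch of what "8-bit clean but not binary-clean" could mean in practice, assuming the zero-byte rule from the earlier list of options and a hypothetical `accept_text_field` helper:

```python
def accept_text_field(data: bytes) -> bytes:
    """Charset-neutral check: any byte 0x01-0xFF passes, a zero byte does not."""
    if b"\x00" in data:
        raise ValueError("text fields may not contain zero bytes")
    return data  # kept byte-for-byte; no decoding, no conversion


latin1_msg = "Änderung".encode("iso-8859-1")
stored = accept_text_field(latin1_msg)

try:
    accept_text_field("Änderung".encode("utf-16-le"))
    utf16_rejected = False
except ValueError:
    utf16_rejected = True
```

Nothing here can misconvert text, since nothing is converted. The cost is the one raised elsewhere in this thread: a reader of `stored` has no way to learn which charset the bytes are in.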

On the other hand, there seems to be a fairly broad consensus for doing
UTF-8/$LC_CTYPE character set conversion for filenames.  I am...
confused as to why anyone advocates converting filenames and not log
messages, since they are both text.  File contents are binary data. 
Property values... might be binary data; that seems to be the consensus
for now, anyway, although that leads to questions about how svn:ignore
should be interpreted and such.  But log messages are definitely text.



Re: use of UTF-8

Posted by William Uther <wi...@cs.cmu.edu>.

On 30/5/02 6:45 PM, "Greg Stein" <gs...@lyra.org> wrote:

> On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:

>> I see three options on the table:
> 
> Four.

Five?  Or Six?  Seven anyone?

     - Add a repository wide property that gives the charset for log
messages.  It could vary from 'local' to 'UTF-8' to ...

     - Add a local configuration option that switches the translation on or
off.

     - Interpret not having a locale set as 'use no translation'.  If you
set a locale, then svn will use it.  If you want to use random charsets,
then don't lie in your locale settings.
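
A rough sketch of the third idea, assuming the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and hypothetical helper names. Treating "C", "POSIX", and locale names without a codeset as "no locale set" is an assumption made for this illustration, not something settled in the thread.

```python
def effective_ctype(env: dict):
    """Return the locale name governing character handling, or None."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty both fall through to the next variable
            return value
    return None


def translation_charset(env: dict):
    """Return the charset to translate from, or None for 'no translation'."""
    loc = effective_ctype(env)
    if loc is None or loc in ("C", "POSIX"):
        return None
    if "." not in loc:
        return None  # no codeset given; assumed to mean 'do not translate'
    # Locale names look like language_TERRITORY.codeset@modifier
    return loc.split(".", 1)[1].split("@", 1)[0]
```

For example, `translation_charset({"LANG": "de_DE.ISO-8859-1"})` names a charset to convert from, while an empty environment yields `None` and the log message would be passed through untouched.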

>    - add a second parameter to the relevant data structures and routines
>      to hold the character set of the string in question (while we're
>      talking about log messages here, I think there are others; the rule
>      for log msgs will apply everywhere)
> 
>>    - Keep them as char *, declare them UTF-8, and convert user input
>>      as best we can.
>> 
>>    - Keep them as char *, declare no particular charset, but don't
>>      allow zero bytes.
>> 
>>    - Convert them back to counted-length strings and treat them as
>>      binary data again (I guess this is the most militantly charset
>>      neutral option).
> 
> Of the above [seven] approaches:

 1) Would require the implementation of global properties in the repository.
If 'no property' were interpreted correctly then this could be implemented
post-1.0 and still be backwards compatible.

 2) Config options are not always a good idea.  Having a local config option
is worse as it removes any guarantees about log messages in the repos.

 3) This is mostly the same as option 2.

> [4]) a second param is very heavyweight from a conceptual and coding
>  standpoint. and, in the end, we'll probably have to do conversions
>  anyways, so allowing an arbitrary charset rather than fixed doesn't
>  seem to buy a lot.
> 
> [5]) my favorite. note that the *client* does the conversions. the libraries
>  simply assume all text strings are in UTF-8.
> 
> [6]) untenable for the clients.
> 
> [7]) this is similar to ([6]), but we just allow more flexibility.

There seems to be a consensus forming for translation.  Might I suggest that
people keep option 1 in mind so that if repos properties are implemented at
some later stage they could be used.

Later,

\x/ill           :-}

