Posted to dev@subversion.apache.org by Greg Stein <gs...@lyra.org> on 2002/05/29 23:00:12 UTC

use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

On Wed, May 29, 2002 at 03:30:35PM -0500, Jon Trowbridge wrote:
> On Wed, 2002-05-29 at 14:30, cmpilato@collab.net wrote:
> > Marcus Comstedt <ma...@mc.pp.se> writes:
> > > Since there are no properties on log messages, how do you propose that
> > > the actual character encoding for a log message be recorded?
> > 
> > That information, as you may have inferred from my previous paragraph,
> > is stored "out of band", in a HACKING file or something, and is
> > regulated by the repos admins.

Untenable.

> ...but if you do this, anyone who wants to write a GUI client that
> allows for log message browsing is out of luck.

Exactly.

> If a log message is in some unknown and unknowable charset, I can't
> stick the text into a text widget and have any confidence that something
> legible will be displayed.

Yup.

Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
particular comment from svn_fs.h:

/* Here are the rules for directory entry names, and directory paths:

   A directory entry name is a Unicode string encoded in UTF-8, and
   may not contain the null character (U+0000).  The name should be in
   Unicode canonical decomposition and ordering.  No directory entry
...	 
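
A minimal sketch of the rules quoted so far, assuming only what the excerpt states (UTF-8, no U+0000, canonical decomposition). The function name `is_valid_entry_name` is hypothetical and is not part of svn_fs.h; the excerpt is truncated, so any further rules are not covered. The excerpt says the name "should" be decomposed, so treating that as a hard check is an assumption.

```python
import unicodedata


def is_valid_entry_name(raw: bytes) -> bool:
    """Check a directory entry name against the three quoted rules."""
    try:
        name = raw.decode("utf-8")  # rule 1: must be UTF-8
    except UnicodeDecodeError:
        return False
    if "\x00" in name:  # rule 2: no null character (U+0000)
        return False
    # rule 3 (assumed strict here): canonical decomposition, i.e. NFD
    return unicodedata.normalize("NFD", name) == name
```

For example, a precomposed "é" (U+00E9) fails the third check, while "e" followed by U+0301 passes.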

We've always considered all properties to be binary. Thus, ra_dav will need
to encode it in some fashion to keep it safe within an XML body. While the
log message *happens* to be a property, the interface calls it a char*,
which means UTF-8. And we informally decided (meaning: it isn't written down
like what is in svn_fs.h) on using UTF-8 as our library's character set a
long time ago also. Maybe I could find a reference, but I'm not going to
bother. We *did* choose it, so people can attempt to prove otherwise or
provide some technical reason why choosing one charset is Badness(tm).

> Requiring utf-8 here might seem onerous, but it is pretty much the only
> way to avoid a whole class of annoying charset problems down the road.

Right. If the API has a text string, then SVN says that text string is in
UTF-8. If we have standard properties that are to be interpreted as text,
then those will be stored as UTF-8 strings (within the binary property).

While APR doesn't talk about character sets for its API (wrongly so, IMO),
the Subversion libraries *do*. Anything that is text will be UTF-8. Since
paths and URLs hold "characters" (but are hard to call "text"), they also
use UTF-8 for their character set.

[ and extend as applicable to other concepts in the API... ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

the following algorithm will get us a
successor for NR, which is likely (though not guaranteed) to be
relatively "near" NR.

   Set up a cursor on NR's node revision id in the `nodes' table;
   Advance cursor to next row;
   THIS_NR = Current cursor location;
   If THIS_NR.NodeId != NR.NodeId:
      /* unrelated node, no more node revisions of NR */
      return FAILURE;
   If THIS_NR.CopyId == NR.CopyId
      && THIS_NR.TxnId is not a pending transaction:
      /* same node_id, same copy_id, must be different (older!) txn_id */
      return SUCCESS, THIS_NR;
   ELSE:
      DO:
         IF THIS_NR.TxnId > NR.TxnId
         && THIS_NR.TxnId is not a pending transaction:
            /* same node_id, older copy_id, older txn_id */
            return SUCCESS, THIS_NR;
         Advance cursor to next row;
         THIS_NR = Current cursor location;
      WHILE (THIS_NR.NodeId == NR.NodeId)
   return FAILURE;

However, I realize that adding ordering to those IDs is probably not a
popular thought.  :-)
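
The cursor walk above can be sketched in Python, under loud assumptions: the `nodes' table is modeled as a sorted list of (node_id, copy_id, txn_id) tuples, the ids are plain integers with a meaningful order (exactly the ordering the message admits may be unpopular), and `pending` is a set of transaction ids. The names `NodeRev` and `find_successor` are hypothetical. The comparisons are copied from the pseudocode as written, not reinterpreted.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence, Set


class NodeRev(NamedTuple):
    node_id: int
    copy_id: int
    txn_id: int


def find_successor(rows: Sequence[NodeRev], nr: NodeRev,
                   pending: Set[int]) -> Optional[NodeRev]:
    """Return a nearby successor of nr, or None (FAILURE in the pseudocode)."""
    i = bisect_right(rows, nr)  # cursor on NR, advanced to the next row
    if i >= len(rows) or rows[i].node_id != nr.node_id:
        return None  # unrelated node: no more node revisions of NR
    this = rows[i]
    if this.copy_id == nr.copy_id and this.txn_id not in pending:
        return this
    # DO ... WHILE over the remaining rows that share NR's node id
    while i < len(rows) and rows[i].node_id == nr.node_id:
        this = rows[i]
        if this.txn_id > nr.txn_id and this.txn_id not in pending:
            return this
        i += 1
    return None
```

Usage: with rows (1,0,1), (1,0,3), (1,2,5), (2,0,2) and no pending transactions, the successor found for (1,0,1) is (1,0,3); if transaction 3 is pending, the walk continues to (1,2,5).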

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Colin Putney <co...@whistler.com> writes:
> I'm wondering if this boils down to a question of what the 1.0
> behaviour will be. I'm pretty convinced that the email-like approach
> is the way to go, but it does require some changes to the existing
> codebase.
> 
> Is this something that should be part of the I18N work that will be
> done after 1.0? How much of the desire for UTF-8 is really a desire to
> get 1.0 out the door?

I don't think this relates to 1.0 much at all.  (We might make the
discovery that UTF-8 conversion is a good idea, or that it's a bad
idea, at any point between now and 1.0, or some time after 1.0.)



Re: use of UTF-8

Posted by Colin Putney <co...@whistler.com>.

On Monday, June 3, 2002, at 12:18  PM, Karl Fogel wrote:

> Branko Čibej <br...@xbc.nu> writes:
>> Um. I'd rather say it opens up a huge can of very hungry carnivorous
>> worms. While it might be true that you can trust the locale settings
>> on most machines today (something I'm not at all sure about), you
>> can't trust programs. On Windows, for instance, I can set notepad as
>> my $EDITOR, then go and save the log message as UTF-8 or two different
>> kinds of UTF-16 (big- and little-endian). My locale info says I'm
>> using codepage 1250. Converting that text would produce
>> ... interesting? ... results.
>
> I'm still worried about this scenario too, but the reason I'm willing
> to risk it is that we can change Subversion if we discover we were
> wrong.  So let's see how often problems happen in practice.  After
> all, if conversion to UTF-8 *does* corrupt log messages in real life,
> then we can simply say "Well, that was a mistake", and
> backwards-compatibly change the client libraries' behavior.
>
> It would be simple enough to switch to email/mime-like behavior.  Just
> stop converting to UTF-8, and start storing the literal bits of the
> log message, along with a best guess at the encoding for which they
> were written -- i.e., a new revision prop, `svn:log-message-encoding'
> or whatever.  Revisions that don't have that property are assumed to
> be in UTF-8.

I'm wondering if this boils down to a question of what the 1.0 behaviour 
will be. I'm pretty convinced that the email-like approach is the way to 
go, but it does require some changes to the existing codebase.

Is this something that should be part of the I18N work that will be done 
after 1.0? How much of the desire for UTF-8 is really a desire to get 
1.0 out the door?


Colin Putney
Whistler.com


Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Branko Čibej <br...@xbc.nu> writes:
> Um. I'd rather say it opens up a huge can of very hungry carnivorous
> worms. While it might be true that you can trust the locale settings
> on most machines today (something I'm not at all sure about), you
> can't trust programs. On Windows, for instance, I can set notepad as
> my $EDITOR, then go and save the log message as UTF-8 or two different
> kinds of UTF-16 (big- and little-endian). My locale info says I'm
> using codepage 1250. Converting that text would produce
> ... interesting? ... results.

I'm still worried about this scenario too, but the reason I'm willing
to risk it is that we can change Subversion if we discover we were
wrong.  So let's see how often problems happen in practice.  After
all, if conversion to UTF-8 *does* corrupt log messages in real life,
then we can simply say "Well, that was a mistake", and
backwards-compatibly change the client libraries' behavior.

It would be simple enough to switch to email/mime-like behavior.  Just
stop converting to UTF-8, and start storing the literal bits of the
log message, along with a best guess at the encoding for which they
were written -- i.e., a new revision prop, `svn:log-message-encoding'
or whatever.  Revisions that don't have that property are assumed to
be in UTF-8.
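
A rough sketch of how a reader of such revisions might behave, assuming revision props are modeled as a dict of byte strings. The property name `svn:log-message-encoding` is the one floated above ("or whatever"), not an existing property; `decode_log_message` is a hypothetical helper; falling back to lossy UTF-8 decoding when the recorded guess is wrong is an assumption of this sketch, not something proposed in the thread.

```python
from typing import Dict


def decode_log_message(revprops: Dict[str, bytes]) -> str:
    """Decode a stored log message using its recorded encoding, if any."""
    raw = revprops.get("svn:log", b"")
    # Revisions without the property are assumed to be in UTF-8.
    encoding = revprops.get("svn:log-message-encoding", b"UTF-8").decode("ascii")
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # The encoding is only a best guess; degrade instead of failing.
        return raw.decode("utf-8", errors="replace")
```

The stored bytes are never rewritten in this model, so a wrong guess can be corrected later by changing only the encoding property.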

Re: use of UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Greg Stein wrote:

>Sheesh. Of course I can see it. And it is a very wrong position, when we can
>so *easily* just say "it is UTF-8" and be done with it. That opens up a
>whole world of simplicity and determinism for the applications that will be
>built on top of Subversion.
Um. I'd rather say it opens up a huge can of very hungry carnivorous 
worms. While it might be true that you can trust the locale settings on 
most machines today (something I'm not at all sure about), you can't 
trust programs. On Windows, for instance, I can set notepad as my 
$EDITOR, then go and save the log message as UTF-8 or two different 
kinds of UTF-16 (big- and little-endian). My locale info says I'm using 
codepage 1250. Converting that text would produce ... interesting? ... 
results.
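
To make the failure concrete, here is a small sketch, assuming a client that trusts the locale (codepage 1250) while the editor actually saved UTF-8. The helper name `misconvert` is made up for illustration and is not part of any Subversion client.

```python
def misconvert(text: str, saved_as: str = "utf-8",
               locale_charset: str = "cp1250") -> str:
    """Show what a locale-trusting client reads back from an editor's file."""
    raw = text.encode(saved_as)        # bytes the editor really wrote
    return raw.decode(locale_charset)  # bytes as the client interprets them
```

Plain ASCII survives because both encodings agree on it; anything outside ASCII comes back as different characters, and that garbled text is what would then be converted to UTF-8 and stored.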


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: use of UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 09:29:47AM -0500, Karl Fogel wrote:
>...
> Again, CVS doesn't know the charset, and I never encountered a
> complaint about that, over years of doing more CVS support than most.

Granted. But CVS was never bound as tightly into client apps as SVN will be.
It is a different programming model for clients, and those apps will need
the appropriate information to be able to operate properly.

>...
> There is no clear win here.  If you are unable to see how it is even
> *possible* to consider being charset neutral, all I can say is, your

Sheesh. Of course I can see it. And it is a very wrong position, when we can
so *easily* just say "it is UTF-8" and be done with it. That opens up a
whole world of simplicity and determinism for the applications that will be
built on top of Subversion.

>...
> The software we are replacing _is_ effectively charset neutral for log
> messages, and prior to this, we had never listed that as one of the
> "bugs" we were aiming to fix in Subversion.

Bah. That is a non-starter. We've got a ton of things in our code that were
never listed as a "bug" in CVS.

-g

-- 
Greg Stein, http://www.lyra.org/

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> Misunderstanding :-). The logic went like this: the param is a char*, thus
> it will (typically) be representing characters, which means it needs an
> associated charset, which we have previously stated would be UTF-8.

Uh.  Okay.  Then I wasn't misunderstanding, and my response (that the
param being `char *' has no bearing on this discussion) was
appropriate.

> I dismissed it because there has already been quite a bit of material
> (from Garrett, Jon, etc etc) stating how clients need to know the charset to
> be able to do anything with those log messages.
> 
>   --> They contain characters. You need to know their charset.
> 
> I'm not sure how it is possible to really consider otherwise. To display
> those characters to the user, you need the charset. To edit them, to set
> them, to email them, to do whatever.

Again, CVS doesn't know the charset, and I never encountered a
complaint about that, over years of doing more CVS support than most.
I'm sure CVS has disappointed people by this occasionally, and I just
haven't heard about it, but I'm equally certain that re-encoding log
messages will disappoint other people in other circumstances,
sometimes.

There is no clear win here.  If you are unable to see how it is even
*possible* to consider being charset neutral, all I can say is, your
inability to see it does not add any weight to your technical
arguments against it.  (I guess the other thing I can say is, I wonder
how you were able to use CVS all those years. :-) )

The software we are replacing _is_ effectively charset neutral for log
messages, and prior to this, we had never listed that as one of the
"bugs" we were aiming to fix in Subversion.

-K

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Hudson <gh...@MIT.EDU> writes:
> On the other hand, there seems to be a fairly broad consensus for doing
> UTF-8/$LC_CTYPE character set conversion for filenames.  I am...
> confused as to why anyone advocates converting filenames and not log
> messages, since they are both text.  File contents are binary data. 
> Property values... might be binary data; that seems to be the consensus
> for now, anyway, although that leads to questions about how svn:ignore
> should be interpreted and such.  But log messages are definitely text.

Part of the justification is ease of implementation.  We have to sling
filenames around all over the place internally, and write/read them in
xml files appx seventy times a second.  It's just massively easier to
use `char *' UTF-8 for all that.

Re: use of UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.

On Thu, 2002-05-30 at 18:45, Greg Stein wrote:
> 3) untenable for the clients.

I'd like to keep a little perspective here.

If we don't solve the log message character set problem, then projects
are happy as long as:

  * They are willing to stick with ASCII log messages, or
  * All their developers use the same character set, or
  * All their developers use a UTF-8 native locale

(That third statement is a little forward-looking, but there has been
some progress in that direction.)

I believe this covers quite a lot of users--everyone who is happy with
CVS, for instance.  Subversion is not going to fail on account of not
doing character set conversion.

This is why I would be happy being charset-neutral and 8-bit clean (not
necessarily binary-clean) for all text fields.  Possibly happier, since
we would never be responsible for misconverting text when LC_CTYPE isn't
set properly, or anything like that.  Plus our code would be simpler.

On the other hand, there seems to be a fairly broad consensus for doing
UTF-8/$LC_CTYPE character set conversion for filenames.  I am...
confused as to why anyone advocates converting filenames and not log
messages, since they are both text.  File contents are binary data. 
Property values... might be binary data; that seems to be the consensus
for now, anyway, although that leads to questions about how svn:ignore
should be interpreted and such.  But log messages are definitely text.


Re: use of UTF-8

Posted by William Uther <wi...@cs.cmu.edu>.

On 30/5/02 6:45 PM, "Greg Stein" <gs...@lyra.org> wrote:

> On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:

>> I see three options on the table:
> 
> Four.

Five?  Or Six?  Seven anyone?

     - Add a repository wide property that gives the charset for log
messages.  It could vary from 'local' to 'UTF-8' to ...

     - Add a local configuration option that switches the translation on or
off.

     - Interpret not having a locale set as 'use no translation'.  If you
set a locale, then svn will use it.  If you want to use random charsets,
then don't lie in your locale settings.

>    - add a second parameter to the relevant data structures and routines
>      to hold the character set of the string in question (while we're
>      talking about log message here, I think there are others; the rule
>      for log msgs will apply everywhere)
> 
>>    - Keep them as char *, declare them UTF-8, and convert user input
>>      as best we can.
>> 
>>    - Keep them as char *, declare no particular charset, but don't
>>      allow zero bytes.
>> 
>>    - Convert them back to counted-length strings and treat them as
>>      binary data again (I guess this is the most militantly charset
>>      neutral option).
> 
> Of the above [seven] approaches:

 1) Would require the implementation of global properties in the repository.
If 'no property' were interpreted correctly then this could be implemented
post-1.0 and still be backwards compatible.

 2) Config options are not always a good idea.  Having a local config option
is worse as it removes any guarantees about log messages in the repos.

 3) This is mostly the same as option 2.

> [4]) a second param is very heavyweight from a conceptual and coding
>  standpoint. and, in the end, we'll probably have to do conversions
>  anyways, so allowing an arbitrary charset rather than fixed doesn't
>  seem to buy a lot.
> 
> [5]) my favorite. note that the *client* does the conversions. the libraries
>  simply assume all text strings are in UTF-8.
> 
> [6]) untenable for the clients.
> 
> [7]) this is similar to ([6]), but we just allow more flexibility.

There seems to be a consensus forming for translation.  Might I suggest that
people keep option 1 in mind so that if repos properties are implemented at
some later stage they could be used.

Later,

\x/ill           :-}


Re: use of UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
>...
> Right, right.  But the `log_msg' parameter to functions was not 
> `char *' until very recently, and for reasons having nothing to do
> with some prior decision about them being UTF-8.

Yup.

> I'm sorry to keep repeating myself.  It seems (maybe I'm
> misunderstanding?) that you brought up the type of those params as
> indicating that some decision had already been made about their
> charset.

Misunderstanding :-). The logic went like this: the param is a char*, thus
it will (typically) be representing characters, which means it needs an
associated charset, which we have previously stated would be UTF-8.

>...
> > To be concrete: either those char* params are UTF-8, or you add a second
> > parameter to state their charset. (or you just go charset neutral which
> > isn't really a good option)
> 
> Those aren't the only options here (and you're dismissing charset
> neutral as an obviously bad third option, mentioned only to be
> rejected, when in fact it's what this whole thread is really about).

I dismissed it because there has already been quite a bit of material
(from Garrett, Jon, etc etc) stating how clients need to know the charset to
be able to do anything with those log messages.

  --> They contain characters. You need to know their charset.

I'm not sure how it is possible to really consider otherwise. To display
those characters to the user, you need the charset. To edit them, to set
them, to email them, to do whatever.

Basically, I can't accept the notion that "leaving it up to arbitrary
interpretation" is in any way a valid approach.

> I see three options on the table:

Four.

     - add a second parameter to the relevant data structures and routines
       to hold the character set of the string in question (while we're
       talking about log message here, I think there are others; the rule
       for log msgs will apply everywhere)

>    - Keep them as char *, declare them UTF-8, and convert user input
>      as best we can.
> 
>    - Keep them as char *, declare no particular charset, but don't
>      allow zero bytes.
> 
>    - Convert them back to counted-length strings and treat them as
>      binary data again (I guess this is the most militantly charset
>      neutral option).

Of the above four approaches:

1) a second param is very heavyweight from a conceptual and coding
   standpoint. and, in the end, we'll probably have to do conversions
   anyways, so allowing an arbitrary charset rather than fixed doesn't
   seem to buy a lot.

2) my favorite. note that the *client* does the conversions. the libraries
   simply assume all text strings are in UTF-8.

3) untenable for the clients.

4) this is similar to (3), but we just allow more flexibility.
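
Approach (2) might look roughly like this. It is a sketch only: both function names are invented here and do not correspond to real Subversion APIs, and the user's charset is assumed to be known to the client (for instance from its locale settings).

```python
def client_to_library(raw: bytes, user_charset: str) -> bytes:
    """Client side: re-encode user input from the user's charset to UTF-8."""
    return raw.decode(user_charset).encode("utf-8")


def library_set_log(message: bytes) -> str:
    """Library side: assume UTF-8 and do no conversion of its own.

    Raises UnicodeDecodeError if the client skipped the conversion step
    and the bytes are not valid UTF-8.
    """
    return message.decode("utf-8")
```

Keeping the conversion in the client means the library never needs to know the user's charset; the cost is that every client has to do the conversion correctly.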

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> > The interface calls log messages `char *' as of one day ago :-), and
> 
> And if this conversation was two days ago, I would have said stringbuf.
> 
> The point is: where we have char* in our interfaces, they are almost always
> representing some characters. I'm saying that we decided on saying they were
> UTF-8 and avoiding carrying around charset metadata with those.

Right, right.  But the `log_msg' parameter to functions was not 
`char *' until very recently, and for reasons having nothing to do
with some prior decision about them being UTF-8.

I'm sorry to keep repeating myself.  It seems (maybe I'm
misunderstanding?) that you brought up the type of those params as
indicating that some decision had already been made about their
charset.  But they were counted-length strings (and thus could support
binary data!) until rev 2024, and were just caught up in the general
sweep of the conversion.  Their new type indicates nothing about what
charset we should use for log messages.  We have to make that decision
independently of their current type, and then make sure the type
*supports* whatever decision we make.

> To be concrete: either those char* params are UTF-8, or you add a second
> parameter to state their charset. (or you just go charset neutral which
> isn't really a good option)

Those aren't the only options here (and you're dismissing charset
neutral as an obviously bad third option, mentioned only to be
rejected, when in fact it's what this whole thread is really about).

I see three options on the table:

   - Keep them as char *, declare them UTF-8, and convert user input
     as best we can.

   - Keep them as char *, declare no particular charset, but don't
     allow zero bytes.

   - Convert them back to counted-length strings and treat them as
     binary data again (I guess this is the most militantly charset
     neutral option).
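
One way to compare the three options is by what input each would accept. The sketch below models a log message as a Python `bytes` value; the function names are hypothetical. Treating a zero byte as invalid under the first option is an assumption that follows from the `char *' representation rather than from anything stated above.

```python
def accept_utf8(data: bytes) -> bool:
    """Option 1: char *, declared UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return b"\x00" not in data  # assumed: a C string cannot carry a zero byte


def accept_eight_bit_clean(data: bytes) -> bool:
    """Option 2: char *, no particular charset, but no zero bytes."""
    return b"\x00" not in data


def accept_binary(data: bytes) -> bool:
    """Option 3: counted-length string, arbitrary binary data."""
    return True
```

A Latin-1 message such as b"caf\xe9" is rejected by the first check and accepted by the other two; a message containing a zero byte is accepted only by the third.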

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 10:30:44AM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > > If a log message is in some unknown and unknowable charset, I can't
> > > stick the text into a text widget and have any confidence that something
> > > legible will be displayed.
> > 
> > Yup.
> > 
> > Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
> > particular comment from svn_fs.h:
>...
> Hmm, but that's just talking about paths.

Of course. I was showing one data point, and my email was moving on to the rest.

>...
> The issue here is log
> messages (the fact that log messages are stored as property values is
> an implementation detail -- I don't think the ideal that property
> values support binary data has any influence one way or the other on
> whether binary log messages should be allowed).

Yes.

> > We've always considered all properties to be binary. Thus, ra_dav will need
> > to encode it in some fashion to keep it safe within an XML body. While the
> > log message *happens* to be a property, the interface calls it a char*,
> > which means UTF-8. And we informally decided (meaning: it isn't written down
> > like what is in svn_fs.h) on using UTF-8 as our library's character set a
> > long time ago also. Maybe I could find a reference, but I'm not going to
> > bother. We *did* choose it, so people can attempt to prove otherwise or
> > provide some technical reason why choosing one charset is Badness(tm).
> 
> I don't understand the connection here.
> 
> We didn't decide that all data coming into fs is UTF-8.  We decided

I was talking about interfaces -- parameters. Not file contents.

> that pathnames were UTF-8, and that file contents and property values
> would be binary data (as far as the fs is concerned).

Of course.

>...
> > While the
> > log message *happens* to be a property, the interface calls it a char*,
> > which means UTF-8. 
> 
> The interface calls log messages `char *' as of one day ago :-), and

And if this conversation was two days ago, I would have said stringbuf.

The point is: where we have char* in our interfaces, they are almost always
representing some characters. I'm saying that we decided on saying they were
UTF-8 and avoiding carrying around charset metadata with those.

To be concrete: either those char* params are UTF-8, or you add a second
parameter to state their charset. (or you just go charset neutral which
isn't really a good option)

Think back. Like two years ago. We said UTF-8 was the SVN charset. Not just
paths. But all the content [outside of file content and prop values].

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> > If a log message is in some unknown and unknowable charset, I can't
> > stick the text into a text widget and have any confidence that something
> > legible will be displayed.
> 
> Yup.
> 
> Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
> particular comment from svn_fs.h:
> 
> /* Here are the rules for directory entry names, and directory paths:
> 
>    A directory entry name is a Unicode string encoded in UTF-8, and
>    may not contain the null character (U+0000).  The name should be in
>    Unicode canonical decomposition and ordering.  No directory entry
> ...	 

Hmm, but that's just talking about paths.  No one disagrees that paths
should be enforced to one canonical format.  The issue here is log
messages (the fact that log messages are stored as property values is
an implementation detail -- I don't think the ideal that property
values support binary data has any influence one way or the other on
whether binary log messages should be allowed).

> We've always considered all properties to be binary. Thus, ra_dav will need
> to encode it in some fashion to keep it safe within an XML body. While the
> log message *happens* to be a property, the interface calls it a char*,
> which means UTF-8. And we informally decided (meaning: it isn't written down
> like what is in svn_fs.h) on using UTF-8 as our library's character set a
> long time ago also. Maybe I could find a reference, but I'm not going to
> bother. We *did* choose it, so people can attempt to prove otherwise or
> provide some technical reason why choosing one charset is Badness(tm).

I don't understand the connection here.

We didn't decide that all data coming into fs is UTF-8.  We decided
that pathnames were UTF-8, and that file contents and property values
would be binary data (as far as the fs is concerned).

This doesn't mean we can't enforce some convention for log messages in
particular, but such a decision is certainly not *implied* by anything
in the design of the fs right now.

> While the
> log message *happens* to be a property, the interface calls it a char*,
> which means UTF-8. 

The interface calls log messages `char *' as of one day ago :-), and
that's just fallout from 2024.  There are comments in the code,
indicating that maybe it should go back to supporting binary data, as
it did up until 2024.

-K

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Wed, May 29, 2002 at 04:00:12PM -0700, Greg Stein wrote:

> Right. If the API has a text string, then SVN says that text string is in
> UTF-8. If we have standard properties that are to be interpreted as text,
> then those will be stored as UTF-8 strings (within the binary property).
> 
> While APR doesn't talk about character sets for its API (wrongly so, IMO),
> the Subversion libraries *do*. Anything that is text will be UTF-8. Since
> paths and URLs hold "characters" (but are hard to call "text"), they also
> use UTF-8 for their character set.

+1 on all of this.

making an arbitrary decision to use UTF-8, while it might feel like
we're 'imposing policy on users', solves a ton of problems at a fairly
reasonable cost, and seems like the only sane way to go.

-garrett 

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski
