Posted to dev@subversion.apache.org by Ulf Tigerstedt <ti...@infa.abo.fi> on 2002/05/29 09:44:42 UTC

[RFC/PATCH] commit messages not 8-bit compatible

Not a correct patch, but something that needs to be fixed one
way or the other:

Problem: åäö (and other non-ASCII chars) in commit messages make the ra_dav
server barf on the commit.

The client says:
svn_error: #21093 : <RA layer request failed>
  applying log message to /lin24/$svn/wbl/d2c7f5da-d11d-b211-a802-896bfc595c95/17: 400 Bad Request
Apache says:
[Wed May 29 12:06:02 2002] [error] [client 130.232.84.208] XML parser error code: not well-formed (4)

... and so on.

A quick fix for messages written in $EDITOR is below
(against r2027):
Index: ./subversion/clients/cmdline/cl.h
===================================================================
--- ./subversion/clients/cmdline/cl.h
+++ ./subversion/clients/cmdline/cl.h   Wed May 29 11:55:51 2002
@@ -361,6 +361,7 @@
 void *svn_cl__make_log_msg_baton (svn_cl__opt_state_t *opt_state,
                                   const char *base_dir,
                                   apr_pool_t *pool);
+void svn_strip_log_highbits(svn_stringbuf_t *buffer);
 
 /* A function of type svn_client_get_commit_log_t. */
 svn_error_t *svn_cl__get_log_message (const char **log_msg,
Index: ./subversion/clients/cmdline/util.c
===================================================================
--- ./subversion/clients/cmdline/util.c
+++ ./subversion/clients/cmdline/util.c Wed May 29 12:34:50 2002
@@ -492,7 +492,13 @@
   return buffer;
 }
 
-
+void 
+svn_strip_log_highbits(svn_stringbuf_t *buffer) { 
+       apr_size_t i;
+       for (i=0; i<buffer->len; i++) {
+               buffer->data[i]&=0x7F;
+       }
+}
 #define EDITOR_PREFIX_TXT  "SVN:"
 
 /* This function is of type svn_client_get_commit_log_t. */
@@ -585,6 +591,8 @@
       /* Strip the prefix from the buffer. */
       if (message)
         message = strip_prefix_from_buffer (message, EDITOR_PREFIX_TXT, pool);
+      if (message)
+        svn_strip_log_highbits(message);
 
       if (message)
         {

Don't apply this, but please comment.
Should the messages be allowed to be 8-bit, and is it the client or the
server that should correct them if needed?
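
To illustrate what stripping the high bit does to the bytes of a message, here is a minimal stand-alone sketch. It assumes a plain char buffer rather than svn_stringbuf_t, and the name strip_high_bits is made up for the example:

```c
#include <stddef.h>

/* Hypothetical stand-alone helper: clear the high bit of every byte in
   buf[0..len-1], leaving only 7-bit values behind. */
void
strip_high_bits (char *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    buf[i] &= 0x7F;
}
```

With Latin-1 input, the bytes for "åäö" (0xE5 0xE4 0xF6) come out as "edv". The result gets past the XML parser, but it no longer says what the author typed.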

-- 
****  Ulf 'Tiggi' Tigerstedt *** KTF / Datateknik  *********
*Being flogged with a rubber chicken can be quite enjoyable*
** - Matt McLeod, In the Scary devil monastery 29.11.1999 **


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Branko Čibej <br...@xbc.nu> writes:

> >Hm.  Any particular reason?  Apart from breaking the "all strings
> > passed to libsvn_* shall be UTF-8"-paradigm,
> 
> I never heard of such a paradigm. Yes, all _paths_ passed to the svn
> libraries should be UTF-8 and canonicalized, but not all strings. I'm
> sorry I didn't notice that before in your patches. AFAIK we never
> discussed canonicalizing anything but paths.

<URL:http://subversion.tigris.org/servlets/ReadMsg?msgId=70776&listName=dev>
says "all arguments", not just paths.

If there are supposed to be different character encodings for paths and
non-paths, then things will get very messy.  For starters, there are
plenty of transfers inside the libs between paths and non-paths.  Just
think of all the error messages including paths.  Then we get to more
tricky problems like the svn:ignore property.  Two cases:

A) svn:ignore is not stored at the server as UTF-8.

  Now the interpretation of svn:ignore is not fixed.  But the
  interpretation of pathnames is fixed.  Thus different files will be
  ignored for different users.  Bad.

B) svn:ignore is stored at the server as UTF-8, but the string passed
   to svn_client_propset is not UTF-8 (since it's a property value,
   not a path).  Then somewhere there must be a recoding heuristic
   based on the property name.  Ugly.

Having all strings use the same encoding makes things much cleaner
and simpler.

> >it would mean that two
> >persons, one using a Latin-1 charset and one using a UTF-8 charset,
> >wouldn't be able to properly read each other's log messages even if
> >they are restricting themselves to the common subset of characters.
> >
> >Since there are no properties on log messages, how do you propose that
> >the actual character encoding for a log message be recorded?
> >
> 
> I'd say that problem should be solved by project policy, not by
> Subversion. Just like we don't require a particular repository layout.

Project policy doesn't affect the way my shell renders characters, so
it doesn't really solve the problem at all.  And the idea here is not
to enforce more, but less.  Since the strings are recoded, the end
user can use any character encoding he likes, rather than being stuck
with a "project policy" dictating it for him.

  // Marcus

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Wed, May 29, 2002 at 03:44:43PM -0400, Greg Hudson wrote:
>...
> Karl seems to side
> with Marcus on translating because UTF-8 and the local charset for all
> interesting values;

I'm assuming that is "... translating *between* UTF-8 ..."

And I'm in this camp, FWIW.

> Mike and I would rather be charset-neutral; Branko
> wants to use UTF-8 for paths and be charset-neutral for everything
> else.  (I'm not sure why pathnames deserve special treatment.)

The FS does, and always has, specified that paths are UTF-8.

>...
> (I'd just like to point out at this juncture that, if we didn't use XML
> so damned much, we wouldn't have any problems being charset-neutral. 
> That is all.)

I tend to believe it is entirely unrelated to the use of XML. Simply storing
a log message into the repository in arbitrary character sets leads to
eventual Torment(tm). Applying rationality, today, is much better.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Marcus Comstedt <ma...@mc.pp.se> writes:
> Yup.  The thing here though is that for files it's possible (and
> perfectly reasonable) to have a property declaring the used charset.
> Thus the information is not lost.  You could even hack Emacs to
> actually look at the charset property and select the correct MULE mode
> when you open the file.

Ooooooooooooh.

(I think I see a find-file-hook in my future...).

---------------------------------------------------------------------

charset property (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 05:44:22PM +0200, Marcus Comstedt wrote:
> 
> Karl Fogel <kf...@newton.ch.collab.net> writes:
> 
> > For example, file contents can be text without being UTF-8!  (And
> > imagine how people would react if Subversion enforced UTF-8 for all
> > text files.)
> 
> Yup.  The thing here though is that for files it's possible (and
> perfectly reasonable) to have a property declaring the used charset.
> Thus the information is not lost.  You could even hack Emacs to
> actually look at the charset property and select the correct MULE mode
> when you open the file.

Yup. This has already come up. Much like we have svn:mime-type to declare
the MIME type of a file (served by Apache via the Content-Type header), we
can also have an svn:charset property, which also gets served as part of
the Content-Type header.

For example:

  Content-Type: text/plain; charset=iso-8859-12

or

  Content-Type: text/plain; charset=euc-kr
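
A minimal sketch of how such a header line could be assembled. The function name format_content_type is hypothetical, and since svn:charset is only a proposed property here, the charset is simply a string supplied by the caller:

```c
#include <stdio.h>

/* Format a Content-Type header line from a MIME type and an optional
   charset (NULL or "" means no charset parameter is emitted).
   Returns what snprintf returns: the number of characters that the full
   line needs, or a negative value on error. */
int
format_content_type (char *out, size_t out_size,
                     const char *mime_type, const char *charset)
{
  if (charset != NULL && charset[0] != '\0')
    return snprintf (out, out_size, "Content-Type: %s; charset=%s",
                     mime_type, charset);
  return snprintf (out, out_size, "Content-Type: %s", mime_type);
}
```

Called with "text/plain" and "euc-kr" it produces the second header shown above; called with a NULL charset it emits the bare MIME type.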


Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> For example, file contents can be text without being UTF-8!  (And
> imagine how people would react if Subversion enforced UTF-8 for all
> text files.)

Yup.  The thing here though is that for files it's possible (and
perfectly reasonable) to have a property declaring the used charset.
Thus the information is not lost.  You could even hack Emacs to
actually look at the charset property and select the correct MULE mode
when you open the file.

  // Marcus

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 04:53:47PM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > Nope. We said that text strings passed around within the libraries (log
> > message is a good one, paths, property names, etc) would be considered to be
> > in UTF-8. We chose that following the same reasoning as using UTF-8 for the
> > pathnames: consistency and that it can represent any other character set.
> 
> Oh, okay -- what we have here is different memory about what was
> agreed on in the past.  So, let's never mind what we *thought* was
> agreed on, since it's clear what various people think right now :-).

Yah, seems that way, so definitely fair enough to just revisit.

>...
> Right now I mildly prefer this solution:
> 
>    - Don't munge (or convert, to use a less pejorative term) the log
>      message at all, but simply reject log messages that contain any
>      zero bytes.  Log message charsets would be determined by each
>      individual repository's policy, with a recommendation (but not an
>      enforcement) from us to use UTF-8.
> 
> If a lot of people feel strongly that enforcing conversion to UTF-8 is
> the Right Thing, I certainly won't veto.  I mean, I could be wrong :-).

I think a better thread for responding is the other one. I'll defer to that
thread.

> How reliable is it to use the locale to determine the source format of the
> conversion (or whatever method we're going to use), though?  For

You *must* use the locale. Looking at the characters is insufficient.

> example, my locale indicates nothing about Chinese editing, but
> sometimes I write text in one of the various char encodings that
> supports Chinese characters.  If I were to do that in a log message on
> some project, my log message would get all messed up.  In such a case,
> leaving it alone would be better, because some tools can
> heuristically determine the charset -- *if* they have the original
> data to work with.

The best they could do would be to determine whether you've got a
double-byte charset or some variety of single-byte charset. Within those
groups, you might be able to refine a bit. But not much more.

For example, if you have a string of bytes that validates as UTF-8, is that
*really* what it was? Or was it from the latin-1 charset? You just can't
tell from inspection. Thus, the requirement for needing the locale.
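
The ambiguity can be seen with a small structural UTF-8 check. This is a simplified sketch: the name looks_like_utf8 is made up, and it does not reject every overlong form or surrogate. The point is that a passing check says nothing about where the bytes came from:

```c
#include <stddef.h>

/* Return 1 if buf[0..len-1] has the byte structure of UTF-8, else 0.
   Simplified: checks lead and continuation bytes and rejects the lead
   bytes 0xC0, 0xC1 and 0xF5..0xFF, but nothing finer than that. */
int
looks_like_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = buf[i];
      size_t extra;   /* number of continuation bytes expected */

      if (c < 0x80)
        extra = 0;
      else if (c >= 0xC2 && c <= 0xDF)
        extra = 1;
      else if (c >= 0xE0 && c <= 0xEF)
        extra = 2;
      else if (c >= 0xF0 && c <= 0xF4)
        extra = 3;
      else
        return 0;

      if (extra > len - i - 1)
        return 0;     /* sequence is cut off by the end of the buffer */
      i++;
      for (; extra > 0; extra--, i++)
        if ((buf[i] & 0xC0) != 0x80)
          return 0;
    }
  return 1;
}
```

The two bytes 0xC3 0xA5 pass the check and read as "å" in UTF-8, but they are equally legitimate Latin-1 text reading "Ã¥". A lone 0xE5 (Latin-1 "å") fails, so inspection can sometimes rule UTF-8 out but can never confirm it.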

> If the data is there, one can guess at the charset
> if necessary.  If the data is destroyed by a misconversion, then it's
> gone.  That's why I feel it's better to leave it alone.

Sorry... guessing isn't possible. Something has to state the charset
(whether that "something" is another attribute, a requirement of a specific
charset, or just never guess). More in the other thread.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by David Mankin <ma...@ants.com>.

On 30 May 2002, Karl Fogel wrote:

> How reliable is it to use the locale to determine the source format of the
> conversion (or whatever method we're going to use), though?  For
> example, my locale indicates nothing about Chinese editing, but
> sometimes I write text in one of the various char encodings that
> supports Chinese characters.  If I were to do that in a log message on
> some project, my log message would get all messed up.  In such a case,
> leaving it alone would be better, because some tools can
> heuristically determine the charset -- *if* they have the original
> data to work with.  If the data is there, one can guess at the charset
> if necessary.  If the data is destroyed by a misconversion, then it's
> gone.  That's why I feel it's better to leave it alone.

</lurk>
I think that we should keep the log messages in UTF-8 for the reasons
mentioned by Marcus and others.  (Mainly, out-of-band policy information
makes writing client software harder.)

In order to solve the "but which charset am I using" problem, I think we
should offer --charset= and config-file defaults which take precedence over
LC_CTYPE.  The config-file defaults should control separately the charset
used for recoding local filenames, and the charset used for log
messages.
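
The proposed precedence could be sketched as below. The function and parameter names are hypothetical; each argument is whatever string the corresponding source provided, with NULL or "" meaning "not set":

```c
#include <stddef.h>

/* Pick the charset to recode with.  Precedence, highest first:
   command-line value, config-file value, locale-derived value.
   Returns NULL if none of the three is set. */
const char *
pick_charset (const char *cmdline, const char *config, const char *locale)
{
  if (cmdline != NULL && cmdline[0] != '\0')
    return cmdline;
  if (config != NULL && config[0] != '\0')
    return config;
  if (locale != NULL && locale[0] != '\0')
    return locale;
  return NULL;
}
```

On POSIX systems the locale-derived value could come from nl_langinfo(CODESET) after a call to setlocale(LC_CTYPE, ""). Separate config-file defaults for filenames and log messages would mean calling this once per kind of string with a different config argument.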

(Actually, the idea of recoding UTF-8 filenames seems pretty weird to
me, but I've never used non-ascii filenames so I can't speak from
experience.  But what happens if your filename gets recoded when a shell
script which refers to it doesn't?  At least with log-message recoding
filenames in the message will match the checked out filenames.)

 -David Mankin
<lurk>

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> Nope. We said that text strings passed around within the libraries (log
> message is a good one, paths, property names, etc) would be considered to be
> in UTF-8. We chose that following the same reasoning as using UTF-8 for the
> pathnames: consistency and that it can represent any other character set.

Oh, okay -- what we have here is different memory about what was
agreed on in the past.  So, let's never mind what we *thought* was
agreed on, since it's clear what various people think right now :-).

I remember (& agree with) paths and property names.  I never thought
the decision covered anything more than that.  If I had realized, I
would have said something sooner.

Right now I mildly prefer this solution:

   - Don't munge (or convert, to use a less pejorative term) the log
     message at all, but simply reject log messages that contain any
     zero bytes.  Log message charsets would be determined by each
     individual repository's policy, with a recommendation (but not an
     enforcement) from us to use UTF-8.
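
A sketch of what that check amounts to, assuming the message arrives as a buffer with an explicit length (the function name is made up for the example):

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the log message contains no zero bytes, 0 otherwise.
   The length is passed explicitly because a zero byte would otherwise
   be mistaken for the end of a C string. */
int
log_message_acceptable (const char *msg, size_t len)
{
  return memchr (msg, '\0', len) == NULL;
}
```

Any other byte value, including bytes with the high bit set, passes untouched.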

If a lot of people feel strongly that enforcing conversion to UTF-8 is
the Right Thing, I certainly won't veto.  I mean, I could be wrong :-).

How reliable is it to use the locale to determine the source format of the
conversion (or whatever method we're going to use), though?  For
example, my locale indicates nothing about Chinese editing, but
sometimes I write text in one of the various char encodings that
supports Chinese characters.  If I were to do that in a log message on
some project, my log message would get all messed up.  In such a case,
leaving it alone would be better, because some tools can
heuristically determine the charset -- *if* they have the original
data to work with.  If the data is there, one can guess at the charset
if necessary.  If the data is destroyed by a misconversion, then it's
gone.  That's why I feel it's better to leave it alone.

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 10:38:52AM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > As I said elsewhere, we decided on UTF-8 for text for everything a long long
> > time ago. We wanted to absolutely avoid all this character set nonsense. So
> > picking *one* character set (which is theoretically a superset of all
> > others) is nice. It helps all users of the libraries.
> 
> No, I really don't think this is true.
> 
> For example, file contents can be text without being UTF-8!  (And
> imagine how people would react if Subversion enforced UTF-8 for all
> text files.)

In what universe would you think I was talking about file contents?

Or to rephrase: DUH. Of course file contents could use other charsets.

> We decided it for paths, and that's it.

Nope. We said that text strings passed around within the libraries (log
message is a good one, paths, property names, etc) would be considered to be
in UTF-8. We chose that following the same reasoning as using UTF-8 for the
pathnames: consistency and that it can represent any other character set.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> As I said elsewhere, we decided on UTF-8 for text for everything a long long
> time ago. We wanted to absolutely avoid all this character set nonsense. So
> picking *one* character set (which is theoretically a superset of all
> others) is nice. It helps all users of the libraries.

No, I really don't think this is true.

For example, file contents can be text without being UTF-8!  (And
imagine how people would react if Subversion enforced UTF-8 for all
text files.)

We decided it for paths, and that's it.

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Greg Hudson <gh...@MIT.EDU> writes:

> I can't tell whether you're advocating the approach taken by Marcus
> (where we translate from UTF-8 to the local character set whenever we
> interact with the system) or not.  We don't "avoid all this character
> set nonsense" if we do the translation, but not doing it means users'
> tools must all use UTF-8 (including all tools which interact with
> pathnames in the working directory).

The main point is (IMO, maybe Greg S has a different point) that the
character set nonsense is hidden from the user.  Requiring all users
to run Plan 9 (the only OS I know of dealing exclusively with UTF-8)
would be nonsensical if anything.  :-)

  // Marcus

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Greg Hudson <gh...@MIT.EDU> writes:

> Every time we open a file, we have to convert the pathname from UTF-8 to
> the local character encoding.  In an ideal world APR might take care of
> this for us, but it doesn't.  (Fortunately, we can just wrap
> apr_file_open() with our own function.)

Yeah, almost all of the conversions inside the libraries are shuffled
into io.c, which is basically an extension to APR anyway.  In fact,
changing the code to use the new wrappers instead of APR directly
actually makes it neater in many cases, since it doesn't have to
fiddle around with converting apr_status_t to svn_error_t* anymore.

> Every time we display a message, we have to convert it.  Again, APR
> might conceivably take care of this for us, but it doesn't.

I made a wrapper for apr_file_printf (it doesn't actually call
apr_file_printf, but it calls apr_pvsprintf just like apr_file_printf
would have) that handles this, so again, this is fairly transparent.
There is still some fudge in svn_handle_{error,warning} though.

> When we prompt the user for a log message via $EDITOR, what we get back
> is in the local character encoding.  Hard to imagine APR taking care of
> this.

This is done by the client, not by the libraries.
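
As an illustration of that client-side step, here is a sketch of recoding an editor-produced message to UTF-8 from one specific local charset, ISO-8859-1. A real client would have to handle whatever charset the locale names; the function name and the fixed source charset are assumptions for the example:

```c
#include <stddef.h>

/* Recode ISO-8859-1 bytes in[0..in_len-1] to UTF-8 in out.
   Every Latin-1 byte maps to the Unicode code point of the same value,
   so bytes >= 0x80 become two-byte sequences.  out must have room for
   2 * in_len bytes.  Returns the number of bytes written. */
size_t
latin1_to_utf8 (const unsigned char *in, size_t in_len, unsigned char *out)
{
  size_t i, n = 0;

  for (i = 0; i < in_len; i++)
    {
      if (in[i] < 0x80)
        out[n++] = in[i];
      else
        {
          out[n++] = (unsigned char) (0xC0 | (in[i] >> 6));
          out[n++] = (unsigned char) (0x80 | (in[i] & 0x3F));
        }
    }
  return n;
}
```

The three Latin-1 bytes for "åäö" (0xE5 0xE4 0xF6) become the six UTF-8 bytes 0xC3 0xA5 0xC3 0xA4 0xC3 0xB6, which an XML parser expecting UTF-8 accepts.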

> There are more interactions as well.  The libraries interact not just
> with the client, but with the operating system.

And with BerkeleyDB.  Apart from the file/dir access stuff, the impact
is pretty limited though.  The situation could have been much worse if
an attempt had been made to have some strings in UTF-8 and some not.

> There are certainly advantages to the UTF-8 approach, but "avoiding
> character set nonsense" in the libraries is not one of them.

To some extent.  If we agree that things like path names and log
messages are sequences of characters rather than sequences of bits
(done any commits on the file 0110110001101110011001110010111001100011
lately? :) to the user, and that Subversion should support this view,
then it does avoid a lot of nonsense to use a uniform character
encoding internally.  The alternative would be to keep track of the
actual encoding of strings either by affixing some meta information to
the string itself, or by policy (_this_ particular string will always
be US-ASCII, and _that_ string will be ISO-2022, kind of thing).
Otherwise the interpretation is lost.

(And that's _if_ we agree of course.  We could just say "screw the
 user" in this regard like CVS does (no sarcasm intended), and there
 will be even less nonsense in the libs (the nonsense will instead
 have to be dealt with by the user).  But this is an important
 opportunity to do better than CVS, and that's what Subversion is all
 about, isn't it?)

  // Marcus

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

On Thu, 2002-05-30 at 05:45, Greg Stein wrote:
> > We don't "avoid all this character
> > set nonsense" if we do the translation, but not doing it means users'
> > tools must all use UTF-8 (including all tools which interact with
> > pathnames in the working directory).
> 
> We avoid it within the library and its APIs.

No, not really.

Every time we open a file, we have to convert the pathname from UTF-8 to
the local character encoding.  In an ideal world APR might take care of
this for us, but it doesn't.  (Fortunately, we can just wrap
apr_file_open() with our own function.)

Every time we display a message, we have to convert it.  Again, APR
might conceivably take care of this for us, but it doesn't.

When we prompt the user for a log message via $EDITOR, what we get back
is in the local character encoding.  Hard to imagine APR taking care of
this.

There are more interactions as well.  The libraries interact not just
with the client, but with the operating system.

> To simplify the data storage and flow, we just say "it's all UTF-8". At the
> boundaries between the SVN libraries and the client programs, the program
> can (as appropriate) recode from UTF-8 to another charset.

There are certainly advantages to the UTF-8 approach, but "avoiding
character set nonsense" in the libraries is not one of them.

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Wed, May 29, 2002 at 07:45:57PM -0400, Greg Hudson wrote:
> On Wed, 2002-05-29 at 19:29, Greg Stein wrote:
> > Of course it applies to log messages. You want to be able to extract exactly
> > what you put in. That implies that a consistent and uniform character set is
> > chosen.
> 
> > As I said elsewhere, we decided on UTF-8 for text for everything a long long
> > time ago. We wanted to absolutely avoid all this character set nonsense. So
> > picking *one* character set (which is theoretically a superset of all
> > others) is nice. It helps all users of the libraries.
> 
> I can't tell whether you're advocating the approach taken by Marcus
> (where we translate from UTF-8 to the local character set whenever we
> interact with the system) or not.

Hrm. To clarify: I support Marcus' approach.

> We don't "avoid all this character
> set nonsense" if we do the translation, but not doing it means users'
> tools must all use UTF-8 (including all tools which interact with
> pathnames in the working directory).

We avoid it within the library and its APIs. To make the data within the
repository useful, it must record or imply a character set for each datum.
As that data moves out through the libraries, it would need to carry the
character set since one user's charset might not match whatever was stored
into the repos (and having the recorded charset provides the capability to
recode from the stored chars to the user's chars).

To simplify the data storage and flow, we just say "it's all UTF-8". At the
boundaries between the SVN libraries and the client programs, the program
can (as appropriate) recode from UTF-8 to another charset.
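
A sketch of the outbound direction at such a boundary, assuming ISO-8859-1 as the user's charset and a made-up function name. Characters that the target charset cannot represent have to be reported somehow; this version simply fails:

```c
#include <stddef.h>

/* Recode UTF-8 in[0..in_len-1] to ISO-8859-1 in out (room for in_len
   bytes).  Returns the number of bytes written, or (size_t) -1 if the
   input is malformed or contains a character above U+00FF. */
size_t
utf8_to_latin1 (const unsigned char *in, size_t in_len, unsigned char *out)
{
  size_t i = 0, n = 0;

  while (i < in_len)
    {
      unsigned char c = in[i];

      if (c < 0x80)
        {
          out[n++] = c;
          i++;
        }
      else if ((c == 0xC2 || c == 0xC3)
               && i + 1 < in_len
               && (in[i + 1] & 0xC0) == 0x80)
        {
          /* U+0080..U+00FF: two-byte sequence with lead byte C2 or C3. */
          out[n++] = (unsigned char) (((c & 0x03) << 6) | (in[i + 1] & 0x3F));
          i += 2;
        }
      else
        return (size_t) -1;
    }
  return n;
}
```

The UTF-8 bytes 0xC3 0xA5 come back as the single Latin-1 byte 0xE5, while the euro sign (0xE2 0x82 0xAC) is rejected because Latin-1 has no such character.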

Tools are not going to be required to use UTF-8. Yes, it would be nice to
live in a UTF-8 world, but we've got more than enough problems to solve :-)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

On Wed, 2002-05-29 at 19:29, Greg Stein wrote:
> Of course it applies to log messages. You want to be able to extract exactly
> what you put in. That implies that a consistent and uniform character set is
> chosen.

> As I said elsewhere, we decided on UTF-8 for text for everything a long long
> time ago. We wanted to absolutely avoid all this character set nonsense. So
> picking *one* character set (which is theoretically a superset of all
> others) is nice. It helps all users of the libraries.

I can't tell whether you're advocating the approach taken by Marcus
(where we translate from UTF-8 to the local character set whenever we
interact with the system) or not.  We don't "avoid all this character
set nonsense" if we do the translation, but not doing it means users'
tools must all use UTF-8 (including all tools which interact with
pathnames in the working directory).

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Wed, May 29, 2002 at 09:55:54PM +0200, Branko Čibej wrote:
>...
> And: path names are a bit special because subversion simply won't work 
> unless we can consistently reproduce the file and path names that were 
> fed in. No such argument obtains for log messages.

Of course it applies to log messages. You want to be able to extract exactly
what you put in. That implies that a consistent and uniform character set is
chosen.

As I said elsewhere, we decided on UTF-8 for text for everything a long long
time ago. We wanted to absolutely avoid all this character set nonsense. So
picking *one* character set (which is theoretically a superset of all
others) is nice. It helps all users of the libraries.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Branko Čibej <br...@xbc.nu> writes:

> The only thing we said about properties is that they can store
> arbitrary binary data.

Basically, properties should probably be divided into text properties
and binary properties somehow (I believe there are comments in the
code to that effect as well).  If you want to have a truly binary
property (like an icon), you don't want it printed to stdout by `svn
proplist -v' for example.

If there is such a distinction, it is possible to do UTF-8 recoding
for text properties (fixing the svn:ignore dilemma) and leave binary
properties alone.

Just a side remark.  I was pondering a bit on this very aspect just
yesterday.

  // Marcus

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Branko Čibej <br...@xbc.nu>.

Greg Hudson wrote:

>On Wed, 2002-05-29 at 15:35, Branko Čibej wrote:
>  
>
>>I never heard of such a paradigm. Yes, all _paths_ passed to the svn 
>>libraries should be UTF-8 and canonicalized, but not all strings. I'm 
>>sorry I didn't notice that before in your patches. AFAIK we never 
>>discussed canonicalizing anything but paths.
>>    
>>
>
>It sounds like we don't have a lot of consensus here (which is
>unfortunate for Marcus, who I think signed onto this task with the
>understanding that there was a pre-existing choice).  Karl seems to side
>with Marcus on translating because UTF-8 and the local charset for all
>interesting values; Mike and I would rather be charset-neutral; Branko
>wants to use UTF-8 for paths and be charset-neutral for everything
>else.  (I'm not sure why pathnames deserve special treatment.)
>
>It may be time for a vote, at least after Branko clarifies his position.
>
Like I said, I only remember discussions about a) how filenames should 
be stored in the repository, and b) what the SVN libraries should 
require wrt filename parameters. The decision was that filenames within 
the libs and in the repo should be encoded in UTF-8, with forward 
slashes for path separators.

My position here is that I agree with both decisions. :-)

The only thing we said about properties is that they can store arbitrary 
binary data.


And: path names are a bit special because subversion simply won't work 
unless we can consistently reproduce the file and path names that were 
fed in. No such argument obtains for log messages.

>(I'd just like to point out at this juncture that, if we didn't use XML
>so damned much, we wouldn't have any problems being charset-neutral. 
>That is all.)
>
He he. I will now wisely refrain from getting sucked into /that/ 
discussion again.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



---------------------------------------------------------------------

clients need the charset (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 12:37:21AM +0200, Marcus Comstedt wrote:
> Karl Fogel <kf...@newton.ch.collab.net> writes:
> > Log messages are not "textual information used by the Subversion API".
> > They are textual information passed around opaquely by the Subversion
> > API.  Subversion never *uses* it, not the way it uses paths (comparing
> > and finding separators and so forth).
> 
> For the libs, this is correct.
>...
> However, there is the question
> of client implementations too.  Both because Subversion contains such
> an implementation, to be used for both production use and reference,
> and because other implementations must agree on an interpretation to
> be interoperable.  So there is both an implementation and policy
> question for the client that can't be ignored.

Right. The libs are useless without a client.  USE. LESS.

Whether that client is 'svn' or ViewSVN or a GUI or cvs2svn or ...  All of
these need to *know* that charset to make use of it.

Heck. Today, we have four clients: svn, svnlook, svnadmin, and cvs2svn. All
four use that log message. We also have the tweak-cgi and the commit-email
scripts, leveraging svnlook, so you could say we have six clients all
dealing with the log message.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Fri, May 31, 2002 at 09:21:04AM -0500, Karl Fogel wrote:
> Garrett Rooney <ro...@electricjellyfish.net> writes:
> > well, if we're going to go down this route, then i think we should go
> > all the way and just let log messages be any kind of arbitrary data.
> > requiring no null characters in the 'string' is kind of half assed,
> > since who's to say there isn't a character set somewhere that has
> > nulls in its characters.  plus, who knows, maybe someone,
> > somewhere has a good reason to use a jpeg as their log message ;-)
> 
> I think treating them as binary data would be great.

I have no problem with treating them as binary data, provided that we
have some means of indicating to the client program what exactly that
binary data is, otherwise it is pretty much useless.

If we don't want to do that, then I think we should simply convert to
UTF-8 and let the client figure out how to do the conversion.

-garrett

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Garrett Rooney <ro...@electricjellyfish.net> writes:
> well, if we're going to go down this route, then i think we should go
> all the way and just let log messages be any kind of arbitrary data.
> requiring no null characters in the 'string' is kind of half assed,
> since who's to say there isn't a character set somewhere that has
> nulls in its characters.  plus, who knows, maybe someone,
> somewhere has a good reason to use a jpeg as their log message ;-)

I think treating them as binary data would be great.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Thu, May 30, 2002 at 05:19:06PM -0500, Karl Fogel wrote:
> Garrett Rooney <ro...@electricjellyfish.net> writes:
> > perhaps not, but if we just go with UTF-8 for log messages we solve
> > the problem and make life easier for client writers everywhere.
> > they're already going to have to convert a bunch of stuff to UTF-8
> > anyway, so why not be more consistent and say for all textual
> > information used by the subversion API we use UTF-8?
> 
> One reason, and one reason only :-)...
> 
> I'm worried about failed conversions, where the failure is not
> detected until later, when someone's trying to read the log message.
> Locale does not always reliably indicate the charset a given edit
> session is using.  (At least, I know this is true in my life, so I'm
> assuming I can't be the only one).
> 
> We can only convert something to UTF-8 if we know what the something
> is.  If we can't know that with close to 100% reliability, then it's
> better not to transform the data at all.

well, if we're going to go down this route, then i think we should go
all the way and just let log messages be any kind of arbitrary data.
requiring no null characters in the 'string' is kind of half assed,
since who's to say there isn't a character set somewhere that has
nulls in its characters.  plus, who knows, maybe someone,
somewhere has a good reason to use a jpeg as their log message ;-)

-garrett

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> Log messages are not "textual information used by the Subversion API".
> They are textual information passed around opaquely by the Subversion
> API.  Subversion never *uses* it, not the way it uses paths (comparing
> and finding separators and so forth).

For the libs, this is correct.  And they indeed do treat log messages
as binary data, even with my patches (although expat will complain if
the string happens to not be well-formed UTF-8, as noted in the
original post in this thread).  However, there is the question
of client implementations too.  Both because Subversion contains such
an implementation, to be used for both production use and reference,
and because other implementations must agree on an interpretation to
be interoperable.  So there is both an implementation and policy
question for the client that can't be ignored.

  // Marcus


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Garrett Rooney <ro...@electricjellyfish.net> writes:
> perhaps not, but if we just go with UTF-8 for log messages we solve
> the problem and make life easier for client writers everywhere.
> they're already going to have to convert a bunch of stuff to UTF-8
> anyway, so why not be more consistent and say for all textual
> information used by the subversion API we use UTF-8?

One reason, and one reason only :-)...

I'm worried about failed conversions, where the failure is not
detected until later, when someone's trying to read the log message.
Locale does not always reliably indicate the charset a given edit
session is using.  (At least, I know this is true in my life, so I'm
assuming I can't be the only one).

We can only convert something to UTF-8 if we know what the something
is.  If we can't know that with close to 100% reliability, then it's
better not to transform the data at all.

Log messages are not "textual information used by the Subversion API".
They are textual information passed around opaquely by the Subversion
API.  Subversion never *uses* it, not the way it uses paths (comparing
and finding separators and so forth).


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Thu, May 30, 2002 at 04:41:44PM -0500, Karl Fogel wrote:
> Garrett Rooney <ro...@electricjellyfish.net> writes:
> > how does this solve the "i'm writing a gui svn client and i need to
> > shove the log message in a text box to display it" problem.  they
> > still need to know the charset to be able to display it correctly, and
> > if we don't provide either a standard char set for log messages or a
> > way to figure out the char set, there is no way to make a client that
> > will work with all subversion repositories automatically.
> 
> It doesn't -- it's not an attempt to solve that problem.

perhaps not, but if we just go with UTF-8 for log messages we solve
the problem and make life easier for client writers everywhere.
they're already going to have to convert a bunch of stuff to UTF-8
anyway, so why not be more consistent and say for all textual
information used by the subversion API we use UTF-8?

-garrett

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 09:19:56AM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > But you *have* to solve that problem. The log message is useless if the
> > clients cannot know its charset.
> 
> CVS has been getting along just fine for years without knowing the
> charset of its log message.  I'd hardly call that "useless".  I think
> you're exaggerating just a tad here :-).

And does CVS have libraries that other apps use? No. Thus, an app can never
really bind very tightly with CVS in the first place, and things such as
charset will just never come into play.

But if you're writing a GUI client, then charset becomes *very* important.

Consider the "Pango" library, used by GNOME for all of its i18n text
rendering. Its input strings are UTF-8 (http://www.pango.org/design.shtml).
If your Pango-based GUI client doesn't know the charset of the log message,
then it can't do anything useful with it.

-g

-- 
Greg Stein, http://www.lyra.org/


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 04:41:44PM -0500, Karl Fogel wrote:
> Garrett Rooney <ro...@electricjellyfish.net> writes:

> > Karl wrote:
> > > "Subversion log messages may use any charset, so long as no byte in 
> > >  the message is zero."

> > how does this solve the "i'm writing a gui svn client and i need to
> > shove the log message in a text box to display it" problem.  they
> > still need to know the charset to be able to display it correctly, and
> > if we don't provide either a standard char set for log messages or a
> > way to figure out the char set, there is no way to make a client that
> > will work with all subversion repositories automatically.
> 
> It doesn't -- it's not an attempt to solve that problem.

But you *have* to solve that problem. The log message is useless if the
clients cannot know its charset.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Garrett Rooney <ro...@electricjellyfish.net> writes:
> how does this solve the "i'm writing a gui svn client and i need to
> shove the log message in a text box to display it" problem.  they
> still need to know the charset to be able to display it correctly, and
> if we don't provide either a standard char set for log messages or a
> way to figure out the char set, there is no way to make a client that
> will work with all subversion repositories automatically.

It doesn't -- it's not an attempt to solve that problem.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Daniel Berlin <db...@dberlin.org> writes:

> On 30 May 2002 cmpilato@collab.net wrote:
> 
> > 
> > <not-serious>
> > How about just requiring that all log messages be valid HTML 4.0
> > strict, complete with charset and everything.  The clients can launch
> > the browser of choice to view log message. :-)
> 
> Yeah, yeah, and we can have subversion validate it against the
> DTD as well, to make sure everything is valid.
> 
> What i really want, is to use the blink tag in my commit messages.
> 
> "<H1><blink> This is an important commit! </blink></H1>"

YES!!  +1 +1!

> 
> > </not-serious>


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Daniel Berlin <db...@dberlin.org>.

On 30 May 2002 cmpilato@collab.net wrote:

> 
> <not-serious>
> How about just requiring that all log messages be valid HTML 4.0
> strict, complete with charset and everything.  The clients can launch
> the browser of choice to view log message. :-)

Yeah, yeah, and we can have subversion validate it against the
DTD as well, to make sure everything is valid.

What i really want, is to use the blink tag in my commit messages.

"<H1><blink> This is an important commit! </blink></H1>"

> </not-serious>

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Nuutti Kotivuori <na...@iki.fi>.

cmpilato@collab.net wrote:
> <not-serious> How about just requiring that all log messages be
> valid HTML 4.0 strict, complete with charset and everything.  The
> clients can launch the browser of choice to view log message. :-)
> </not-serious>

Though the comment is humorous, could it be that some people would not
like their commit messages to be plain text? XML maybe? Or like said,
HTML. Or maybe even TeX or Word documents.

Though of course those formats can contain zero-bytes and whatnot so
then the log message would have to be binary all the way.

-- Naked


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Garrett Rooney <ro...@electricjellyfish.net> writes:

> On Thu, May 30, 2002 at 10:36:27AM -0500, Karl Fogel wrote:
> > Greg Hudson <gh...@MIT.EDU> writes:
> > > It sounds like we don't have a lot of consensus here (which is
> > > unfortunate for Marcus, who I think signed onto this task with the
> > > understanding that there was a pre-existing choice).  Karl seems to side
> > > with Marcus on translating between UTF-8 and the local charset for all
> > > interesting values; Mike and I would rather be charset-neutral; Branko
> > > wants to use UTF-8 for paths and be charset-neutral for everything
> > > else.  (I'm not sure why pathnames deserve special treatment.)
> > 
> > No, I'd also prefer not to enforce a charset, but I think a good
> > compromise solution is available.  What if we simply say:
> > 
> >    "Subversion log messages may use any charset, so long as no byte in
> >    the message is zero."
> 
> how does this solve the "i'm writing a gui svn client and i need to
> shove the log message in a text box to display it" problem.  they
> still need to know the charset to be able to display it correctly, and
> if we don't provide either a standard char set for log messages or a
> way to figure out the char set, there is no way to make a client that
> will work with all subversion repositories automatically.

<not-serious>
How about just requiring that all log messages be valid HTML 4.0
strict, complete with charset and everything.  The clients can launch
the browser of choice to view log message. :-)
</not-serious>


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Thu, May 30, 2002 at 10:36:27AM -0500, Karl Fogel wrote:
> Greg Hudson <gh...@MIT.EDU> writes:
> > It sounds like we don't have a lot of consensus here (which is
> > unfortunate for Marcus, who I think signed onto this task with the
> > understanding that there was a pre-existing choice).  Karl seems to side
> > with Marcus on translating between UTF-8 and the local charset for all
> > interesting values; Mike and I would rather be charset-neutral; Branko
> > wants to use UTF-8 for paths and be charset-neutral for everything
> > else.  (I'm not sure why pathnames deserve special treatment.)
> 
> No, I'd also prefer not to enforce a charset, but I think a good
> compromise solution is available.  What if we simply say:
> 
>    "Subversion log messages may use any charset, so long as no byte in
>    the message is zero."

how does this solve the "i'm writing a gui svn client and i need to
shove the log message in a text box to display it" problem.  they
still need to know the charset to be able to display it correctly, and
if we don't provide either a standard char set for log messages or a
way to figure out the char set, there is no way to make a client that
will work with all subversion repositories automatically.

-garrett 

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Hudson <gh...@MIT.EDU> writes:
> It sounds like we don't have a lot of consensus here (which is
> unfortunate for Marcus, who I think signed onto this task with the
> understanding that there was a pre-existing choice).  Karl seems to side
> with Marcus on translating between UTF-8 and the local charset for all
> interesting values; Mike and I would rather be charset-neutral; Branko
> wants to use UTF-8 for paths and be charset-neutral for everything
> else.  (I'm not sure why pathnames deserve special treatment.)

No, I'd also prefer not to enforce a charset, but I think a good
compromise solution is available.  What if we simply say:

   "Subversion log messages may use any charset, so long as no byte in
   the message is zero."

This works for most charsets, and it's easy for us to enforce (without
looking, I know just where to put the check in the code :-) ), and
it's easy to explain.

In other words, we never *munge* anyone's log message data, but we
might *reject* it under certain (rare) circumstances.
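
A sketch of what such a check could look like in C, assuming the message
arrives as a pointer plus a length (so the check does not depend on C
string termination). The name log_message_acceptable is hypothetical; it
is not the actual place in the code referred to above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical check for the proposed rule: a log message may be in any
   charset as long as it contains no zero byte.  Returns 1 if the message
   may be stored as-is, 0 if it must be rejected.  The bytes are never
   modified. */
int log_message_acceptable(const char *data, size_t len)
{
    return memchr(data, '\0', len) == NULL;
}
```

A Latin-1 message passes untouched. A UTF-16 message, which contains zero
bytes for every ASCII character, would be rejected rather than altered.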

> (I'd just like to point out at this juncture that, if we didn't use XML
> so damned much, we wouldn't have any problems being charset-neutral. 
> That is all.)

Nah, XML isn't really causing this problem -- it's *technically* easy
to go either way on this question.  We're just trying to find the best
policy from a usability standpoint.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

cmpilato@collab.net writes:

> Perhaps the only relevant question pre-1.0 is the usual one:  what
> does CVS do?

CVS is totally charset ignorant.  It treats all data as binary¹.

The "pre-1.0" comment rings a warning bell though.  If a 1.0 is
released which is charset independent, it will be very difficult to
move over to an all-UTF-8 policy after that.  And vice versa.  There
would have to be repository converters, syncing with independent
client developers and all sorts of hair.

It was the importance of having a proper implementation of the
intended policy by 1.0 that drove me to start implementing the policy
I thought you had agreed on, as it didn't look like the regular guys
would have the time.

  // Marcus

¹ For files which do not have binary mode set, it assumes that it can
  recognize line breaks and keyword expansions.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

On Wed, 2002-05-29 at 16:18, cmpilato@collab.net wrote:
> Of course, now I don't know exactly where I want to be with respect to
> this issue.  I've worked in the past with software that was in the
> business of doing charset conversions (to and from some 50 different
> charsets), and that chunk of code is NOT fun.

I think Marcus's patch mostly delegates to iconv(), so it's not so bad.
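
A rough idea of what delegating to iconv() involves, as a sketch in C.
This is not Marcus's patch; convert_to_utf8 and its parameters are
hypothetical, and the caller is assumed to already know the source
charset, which is the hard part being argued about in this thread.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert in[0..in_len) from `from_charset` to
   UTF-8, writing into out[0..out_cap).  Returns the number of bytes
   written, or (size_t)-1 on any failure (unknown charset, invalid or
   truncated input, or output buffer too small). */
size_t convert_to_utf8(const char *from_charset,
                       const char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char ** for the input but only reads through it. */
    char *src = (char *)in;
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

With from_charset set to "ISO-8859-1", the single byte E5 (å) comes out
as the two bytes C3 A5. The charset names accepted by iconv_open() vary
between platforms, so the name used here is itself an assumption.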

> Perhaps the only relevant question pre-1.0 is the usual one:  what
> does CVS do?

CVS is charset-neutral, by which I mean it can store 8-bit data, but has
no provisions for converting between character sets.

We'd be like that right now too, except that any data we stuff into an
XML file has to be valid UTF-8 data or expat complains.
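
For illustration, this is roughly the kind of well-formedness test a
parser expecting UTF-8 has to apply to the bytes it is given. It is a
hand-written sketch in C, not expat's code, and is_valid_utf8 is a
made-up name. It shows why a Latin-1 log message such as the bytes
E5 E4 F6 (åäö) is refused while the UTF-8 encoding of the same text is
accepted.

```c
#include <stddef.h>

/* Hypothetical validator: return 1 if buf[0..len) is well-formed UTF-8,
   0 otherwise.  Rejects stray continuation bytes, truncated sequences,
   overlong forms, surrogates and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */
        size_t k;

        if (c < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                   /* invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence runs off the end */

        for (k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += need + 1;
    }
    return 1;
}
```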


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Greg Hudson <gh...@MIT.EDU> writes:

> On Wed, 2002-05-29 at 16:03, cmpilato@collab.net wrote:
> >    - APR does any necessary conversions between UTF-8 and The
> >      Filesystem.  
> 
> I don't think APR has any particular commitment to accepting UTF-8
> encoded filenames, and it would be quite a change in contract to decide
> that now.

Well, that kills that.

> And while the client binary can handle conversion to UTF-8, that would
> seem to conflict with your assertion that Subversion (which includes
> clients) shouldn't be in the business of charset conversion.

Oh, no, I don't typically consider `svn', `svnadmin' or `svnlook' to
be part of Subversion.  They are all clients of Subversion, which I
tend to think of as a collection of libraries that do that version
control thang.  Sorry for the misunderstanding.

Of course, now I don't know exactly where I want to be with respect to
this issue.  I've worked in the past with software that was in the
business of doing charset conversions (to and from some 50 different
charsets), and that chunk of code is NOT fun.  But, I don't want
Subversion to be crippled in light of being used, on a single project,
by folks with multivarious locales.

Perhaps the only relevant question pre-1.0 is the usual one:  what
does CVS do?


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

On Wed, 2002-05-29 at 16:03, cmpilato@collab.net wrote:
>    - APR does any necessary conversions between UTF-8 and The
>      Filesystem.  

I don't think APR has any particular commitment to accepting UTF-8
encoded filenames, and it would be quite a change in contract to decide
that now.

And while the client binary can handle conversion to UTF-8, that would
seem to conflict with your assertion that Subversion (which includes
clients) shouldn't be in the business of charset conversion.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Greg Hudson <gh...@MIT.EDU> writes:

> On Wed, 2002-05-29 at 15:35, Branko Čibej wrote:
> > I never heard of such a paradigm. Yes, all _paths_ passed to the svn 
> > libraries should be UTF-8 and canonicalized, but not all strings. I'm 
> > sorry I didn't notice that before in your patches. AFAIK we never 
> > discussed canonicalizing anything but paths.
> 
> It sounds like we don't have a lot of consensus here (which is
> unfortunate for Marcus, who I think signed onto this task with the
> understanding that there was a pre-existing choice).  Karl seems to side
> with Marcus on translating between UTF-8 and the local charset for all
> interesting values; Mike and I would rather be charset-neutral; Branko
> wants to use UTF-8 for paths and be charset-neutral for everything
> else.  (I'm not sure why pathnames deserve special treatment.)

Hm.  What makes Branko's desires different than ours, Greg?  It seems
to me that Subversion (from the client layer on down) can satisfy both
the desires to treat all paths as UTF-8 and not perform charset
conversions so long as:

   - the individual client binary does any necessary conversions
     between The User and UTF-8, and 

   - APR does any necessary conversions between UTF-8 and The
     Filesystem.  

Am I missing something?




Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

On Wed, 2002-05-29 at 15:35, Branko Čibej wrote:
> I never heard of such a paradigm. Yes, all _paths_ passed to the svn 
> libraries should be UTF-8 and canonicalized, but not all strings. I'm 
> sorry I didn't notice that before in your patches. AFAIK we never 
> discussed canonicalizing anything but paths.

It sounds like we don't have a lot of consensus here (which is
unfortunate for Marcus, who I think signed onto this task with the
understanding that there was a pre-existing choice).  Karl seems to side
with Marcus on translating between UTF-8 and the local charset for all
interesting values; Mike and I would rather be charset-neutral; Branko
wants to use UTF-8 for paths and be charset-neutral for everything
else.  (I'm not sure why pathnames deserve special treatment.)

It may be time for a vote, at least after Branko clarifies his position.

(I'd just like to point out at this juncture that, if we didn't use XML
so damned much, we wouldn't have any problems being charset-neutral. 
That is all.)


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Branko Čibej <br...@xbc.nu>.

Marcus Comstedt wrote:

>Branko Čibej <br...@xbc.nu> writes:
>
>>I must have missed this in your earlier patches. IMHO, only path names
>>should be in (transformed to) UTF-8. Property contents, including log
>>messages, shouldn't be touched.
>
>Hm.  Any particular reason?  Apart from breaking the "all strings
>passed to libsvn_* shall be UTF-8"-paradigm, 
>
I never heard of such a paradigm. Yes, all _paths_ passed to the svn 
libraries should be UTF-8 and canonicalized, but not all strings. I'm 
sorry I didn't notice that before in your patches. AFAIK we never 
discussed canonicalizing anything but paths.

>it would mean that two
>persons, one using a Latin-1 charset and one using an UTF-8 charset,
>wouldn't be able to properly read each others log messages even if
>they are restricting themselves to the common subset of characters.
>
>Since there are no properties on log messages, how do you propose that
>the actual character encoding for a log message be recorded?
>  
>
I'd say that problem should be solved by project policy, not by 
Subversion. Just like we don't require a particular repository layout.



-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Colin Putney <co...@whistler.com> writes:
> I'm wondering if this boils down to a question of what the 1.0
> behaviour will be. I'm pretty convinced that the email-like approach
> is the way to go, but it does require some changes to the existing
> codebase.
> 
> Is this something that should be part of the I18N work that will be
> done after 1.0? How much of the desire for UTF-8 is really a desire to
> get 1.0 out the door?

I don't think this relates to 1.0 much at all.  (We might make the
discovery that UTF-8 conversion is a good idea, or that it's a bad
idea, at any point between now and 1.0, or some time after 1.0.)




Re: use of UTF-8

Posted by Colin Putney <co...@whistler.com>.

On Monday, June 3, 2002, at 12:18  PM, Karl Fogel wrote:

> Branko Čibej <br...@xbc.nu> writes:
>> Um. I'd rather say it opens up a huge can of very hungry carnivorous
>> worms. While it might be true that you can trust the locale settings
>> on most machines today (something I'm not at all sure about), you
>> can't trust programs. On Windows, for instance, I can set notepad as
>> my $EDITOR, then go and save the log message as UTF-8 or two different
>> kinds of UTF-16 (big- and little-endian). My locale info says I'm
>> using codepage 1250. Converting that text would produce
>> ... interesting? ... results.
>
> I'm still worried about this scenario too, but the reason I'm willing
> to risk it is that we can change Subversion if we discover we were
> wrong.  So let's see how often problems happen in practice.  After
> all, if conversion to UTF-8 *does* corrupt log messages in real life,
> then we can simply say "Well, that was a mistake", and
> backwards-compatibly change the client libraries' behavior.
>
> It would be simple enough to switch to email/mime-like behavior.  Just
> stop converting to UTF-8, and start storing the literal bits of the
> log message, along with a best guess at the encoding for which they
> were written -- i.e., a new revision prop, `svn:log-message-encoding'
> or whatever.  Revisions that don't have that property are assumed to
> be in UTF-8.

I'm wondering if this boils down to a question of what the 1.0 behaviour 
will be. I'm pretty convinced that the email-like approach is the way to go, but 
it does require some changes to the existing codebase.

Is this something that should be part of the I18N work that will be done 
after 1.0? How much of the desire for UTF-8 is really a desire to get 
1.0 out the door?


Colin Putney
Whistler.com


---------------------------------------------------------------------

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Branko Čibej <br...@xbc.nu> writes:
> Um. I'd rather say it opens up a huge can of very hungry carnivorous
> worms. While it might be true that you can trust the locale settings
> on most machines today (something I'm not at all sure about), you
> can't trust programs. On Windows, for instance, I can set notepad as
> my $EDITOR, then go and save the log message as UTF-8 or two different
> kinds of UTF-16 (big- and little-endian). My locale info says I'm
> using codepage 1250. Converting that text would produce
> ... interesting? ... results.

I'm still worried about this scenario too, but the reason I'm willing
to risk it is that we can change Subversion if we discover we were
wrong.  So let's see how often problems happen in practice.  After
all, if conversion to UTF-8 *does* corrupt log messages in real life,
then we can simply say "Well, that was a mistake", and
backwards-compatibly change the client libraries' behavior.

It would be simple enough to switch to email/mime-like behavior.  Just
stop converting to UTF-8, and start storing the literal bits of the
log message, along with a best guess at the encoding for which they
were written -- i.e., a new revision prop, `svn:log-message-encoding'
or whatever.  Revisions that don't have that property are assumed to
be in UTF-8.
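
A minimal sketch of that fallback rule, assuming a revision prop named
`svn:log-message-encoding' as floated above.  The struct and function
names are hypothetical and are not part of any Subversion API; treating
an empty property value like a missing one is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one revision's log data; not a real Subversion type. */
typedef struct {
    const char *log_bytes;     /* literal bytes of the log message, unconverted */
    size_t      log_len;       /* number of bytes in log_bytes */
    const char *log_encoding;  /* value of the assumed svn:log-message-encoding
                                  revision prop, or NULL if it is absent */
} revision_log_t;

/* Return the charset a client should use to interpret the log bytes.
   Revisions without the property (or with an empty value, by assumption)
   are taken to be UTF-8. */
const char *effective_log_encoding(const revision_log_t *rev)
{
    assert(rev != NULL);
    if (rev->log_encoding == NULL || rev->log_encoding[0] == '\0')
        return "UTF-8";
    return rev->log_encoding;
}
```

Under this rule every revision committed while conversion was still in
force stays readable, since those revisions simply lack the property.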

---------------------------------------------------------------------

Re: use of UTF-8

Posted by Branko Čibej <br...@xbc.nu>.

Greg Stein wrote:

>Sheesh. Of course I can see it. And it is a very wrong position, when we can
>so *easily* just say "it is UTF-8" and be done with it. That opens up a
>whole world of simplicity and determinism for the applications that will be
>built on top of Subversion.
>  
>
Um. I'd rather say it opens up a huge can of very hungry carnivorous 
worms. While it might be true that you can trust the locale settings on 
most machines today (something I'm not at all sure about), you can't 
trust programs. On Windows, for instance, I can set notepad as my 
$EDITOR, then go and save the log message as UTF-8 or two different 
kinds of UTF-16 (big- and little-endian). My locale info says I'm using 
codepage 1250. Converting that text would produce ... interesting? ... 
results.
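
One cheap guard against this particular case is to look for a byte-order
mark before trusting the locale.  The sketch below assumes the editor
writes a BOM when saving as UTF-8 or UTF-16, which is a property of the
editor and not something the locale guarantees; `sniff_bom' and
`bom_kind_t' are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind_t;

/* Look for a Unicode byte-order mark at the start of buf.
   Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and would be
   reported as UTF-16LE here; this sketch does not tell them apart. */
bom_kind_t sniff_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

A file with no BOM proves nothing either way, so this only catches the
mismatch when the editor is kind enough to announce it.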


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



---------------------------------------------------------------------

Re: use of UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Fri, May 31, 2002 at 09:29:47AM -0500, Karl Fogel wrote:
>...
> Again, CVS doesn't know the charset, and I never encountered a
> complaint about that, over years of doing more CVS support than most.

Granted. But CVS was never bound as tightly into client apps as SVN will be.
It is a different programming model for clients, and those apps will need
the appropriate information to be able to operate properly.

>...
> There is no clear win here.  If you are unable to see how it is even
> *possible* to consider being charset neutral, all I can say is, your

Sheesh. Of course I can see it. And it is a very wrong position, when we can
so *easily* just say "it is UTF-8" and be done with it. That opens up a
whole world of simplicity and determinism for the applications that will be
built on top of Subversion.

>...
> The software we are replacing _is_ effectively charset neutral for log
> messages, and prior to this, we had never listed that as one of the
> "bugs" we were aiming to fix in Subversion.

Bah. That is a non-starter. We've got a ton of things in our code that were
never listed as a "bug" in CVS.

-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> Misunderstanding :-). The logic went like this: the param is a char*, thus
> it will (typically) be representing characters, which means it needs an
> associated charset, which we have previously stated would be UTF-8.

Uh.  Okay.  Then I wasn't misunderstanding, and my response (that the
param being `char *' has no bearing on this discussion) was
appropriate.

> I dismissed it because there has already been quite a bit of material
> (from Garrett, Jon, etc etc) stating how clients need to know the charset to
> be able to do anything with those log messages.
> 
>   --> They contain characters. You need to know their charset.
> 
> I'm not sure how it is possible to really consider otherwise. To display
> those characters to the user, you need the charset. To edit them, to set
> them, to email them, to do whatever.

Again, CVS doesn't know the charset, and I never encountered a
complaint about that, over years of doing more CVS support than most.
I'm sure CVS has disappointed people by this occasionally, and I just
haven't heard about it, but I'm equally certain that re-encoding log
messages will disappoint other people in other circumstances,
sometimes.

There is no clear win here.  If you are unable to see how it is even
*possible* to consider being charset neutral, all I can say is, your
inability to see it does not add any weight to your technical
arguments against it.  (I guess the other thing I can say is, I wonder
how you were able to use CVS all those years. :-) )

The software we are replacing _is_ effectively charset neutral for log
messages, and prior to this, we had never listed that as one of the
"bugs" we were aiming to fix in Subversion.

-K

---------------------------------------------------------------------

Re: use of UTF-8

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Hudson <gh...@MIT.EDU> writes:
> On the other hand, there seems to be a fairly broad consensus for doing
> UTF-8/$LC_CTYPE character set conversion for filenames.  I am...
> confused as to why anyone advocates converting filenames and not log
> messages, since they are both text.  File contents are binary data. 
> Property values... might be binary data; that seems to be the consensus
> for now, anyway, although that leads to questions about how svn:ignore
> should be interpreted and such.  But log messages are definitely text.

Part of the justification is ease of implementation.  We have to sling
filenames around all over the place internally, and write/read them in
xml files appx seventy times a second.  It's just massively easier to
use `char *' UTF-8 for all that.

---------------------------------------------------------------------

Re: use of UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.

On Thu, 2002-05-30 at 18:45, Greg Stein wrote:
> 3) untenable for the clients.

I'd like to keep a little perspective here.

If we don't solve the log message character set problem, then projects
are happy as long as:

  * They are willing to stick with ASCII log messages, or
  * All their developers use the same character set, or
  * All their developers use a UTF-8 native locale

(That third statement is a little forward-looking, but there has been
some progress in that direction.)

I believe this covers quite a lot of users--everyone who is happy with
CVS, for instance.  Subversion is not going to fail on account of not
doing character set conversion.

This is why I would be happy being charset-neutral and 8-bit clean (not
necessarily binary-clean) for all text fields.  Possibly happier, since
we would never be responsible for misconverting text when LC_CTYPE isn't
set properly, or anything like that.  Plus our code would be simpler.

On the other hand, there seems to be a fairly broad consensus for doing
UTF-8/$LC_CTYPE character set conversion for filenames.  I am...
confused as to why anyone advocates converting filenames and not log
messages, since they are both text.  File contents are binary data. 
Property values... might be binary data; that seems to be the consensus
for now, anyway, although that leads to questions about how svn:ignore
should be interpreted and such.  But log messages are definitely text.


---------------------------------------------------------------------

Re: use of UTF-8

Posted by William Uther <wi...@cs.cmu.edu>.

On 30/5/02 6:45 PM, "Greg Stein" <gs...@lyra.org> wrote:

> On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:

>> I see three options on the table:
> 
> Four.

Five?  Or Six?  Seven anyone?

     - Add a repository wide property that gives the charset for log
messages.  It could vary from 'local' to 'UTF-8' to ...

     - Add a local configuration option that switches the translation on or
off.

     - Interpret not having a locale set as 'use no translation'.  If you
set a locale, then svn will use it.  If you want to use random charsets,
then don't lie in your locale settings.

>    - add a second parameter to the relevant data structures and routines
>      to hold the character set of the string in question (while we're
>      talking about log message here, I think there are others; the rule
>      for log msgs will apply everywhere)
> 
>>    - Keep them as char *, declare them UTF-8, and convert user input
>>      as best we can.
>> 
>>    - Keep them as char *, declare no particular charset, but don't
>>      allow zero bytes.
>> 
>>    - Convert them back to counted-length strings and treat them as
>>      binary data again (I guess this is the most militantly charset
>>      neutral option).
> 
> Of the above [seven] approaches:

 1) Would require the implementation of global properties in the repository.
If 'no property' was interpreted correctly then this could be implemented
post-1.0 and still be backwards compatible.

 2) Config options are not always a good idea.  Having a local config option
is worse as it removes any guarantees about log messages in the repos.

 3) This is mostly the same as option 2.

> [4]) a second param is very heavyweight from a conceptual and coding
>  standpoint. and, in the end, we'll probably have to do conversions
>  anyways, so allowing an arbitrary charset rather than fixed doesn't
>  seem to buy a lot.
> 
> [5]) my favorite. note that the *client* does the conversions. the libraries
>  simply assume all text strings are in UTF-8.
> 
> [6]) untenable for the clients.
> 
> [7]) this is similar to ([6]), but we just allow more flexibility.

There seems to be a consensus forming for translation.  Might I suggest that
people keep option 1 in mind so that if repos properties are implemented at
some later stage they could be used.

Later,

\x/ill           :-}


---------------------------------------------------------------------

Re: use of UTF-8

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
>...
> Right, right.  But the `log_msg' parameter to functions was not 
> `char *' until very recently, and for reasons having nothing to do
> with some prior decision about them being UTF-8.

Yup.

> I'm sorry to keep repeating myself.  It seems (maybe I'm
> misunderstanding?) that you brought up the type of those params as
> indicating that some decision had already been made about their
> charset.

Misunderstanding :-). The logic went like this: the param is a char*, thus
it will (typically) be representing characters, which means it needs an
associated charset, which we have previously stated would be UTF-8.

>...
> > To be concrete: either those char* params are UTF-8, or you add a second
> > parameter to state their charset. (or you just go charset neutral which
> > isn't really a good option)
> 
> Those aren't the only options here (and you're dismissing charset
> neutral as an obviously bad third option, mentioned only to be
> rejected, when in fact it's what this whole thread is really about).

I dismissed it because there has already been quite a bit of material
(from Garrett, Jon, etc etc) stating how clients need to know the charset to
be able to do anything with those log messages.

  --> They contain characters. You need to know their charset.

I'm not sure how it is possible to really consider otherwise. To display
those characters to the user, you need the charset. To edit them, to set
them, to email them, to do whatever.

Basically, I don't find the notion of "leaving it up to arbitrary
interpretation" to be in any way a valid approach.

> I see three options on the table:

Four.

     - add a second parameter to the relevant data structures and routines
       to hold the character set of the string in question (while we're
       talking about log message here, I think there are others; the rule
       for log msgs will apply everywhere)

>    - Keep them as char *, declare them UTF-8, and convert user input
>      as best we can.
> 
>    - Keep them as char *, declare no particular charset, but don't
>      allow zero bytes.
> 
>    - Convert them back to counted-length strings and treat them as
>      binary data again (I guess this is the most militantly charset
>      neutral option).

Of the above four approaches:

1) a second param is very heavyweight from a conceptual and coding
   standpoint. and, in the end, we'll probably have to do conversions
   anyways, so allowing an arbitrary charset rather than fixed doesn't
   seem to buy a lot.

2) my favorite. note that the *client* does the conversions. the libraries
   simply assume all text strings are in UTF-8.

3) untenable for the clients.

4) this is similar to (3), but we just allow more flexibility.
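
If the libraries assume UTF-8 as in (2), a client-side check before
handing text over would catch input that was never converted.  This is a
sketch only, not something taken from the Subversion sources, and
`is_valid_utf8' is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
   Rejects overlong forms, surrogates and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char lead = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (lead < 0x80) {                    /* plain ASCII byte */
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {   /* 2-byte sequence */
            need = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {   /* 3-byte sequence */
            need = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {   /* 4-byte sequence */
            need = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return 0;         /* stray continuation byte or invalid lead */
        }

        if (len - i - 1 < need)
            return 0;         /* sequence is cut off at end of buffer */
        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;     /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

A lone Latin-1 byte such as 0xE5 fails this check, so unconverted 8-bit
input is rejected on the client instead of surfacing later as an XML
parse error on the server.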

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> > The interface calls log messages `char *' as of one day ago :-), and
> 
> And if this conversation was two days ago, I would have said stringbuf.
> 
> The point is: where we have char* in our interfaces, they are almost always
> representing some characters. I'm saying that we decided on saying they were
> UTF-8 and avoiding carrying around charset metadata with those.

Right, right.  But the `log_msg' parameter to functions was not 
`char *' until very recently, and for reasons having nothing to do
with some prior decision about them being UTF-8.

I'm sorry to keep repeating myself.  It seems (maybe I'm
misunderstanding?) that you brought up the type of those params as
indicating that some decision had already been made about their
charset.  But they were counted-length strings (and thus could support
binary data!) until rev 2024, and were just caught up in the general
sweep of the conversion.  Their new type indicates nothing about what
charset we should use for log messages.  We have to make that decision
independently of their current type, and then make sure the type
*supports* whatever decision we make.

> To be concrete: either those char* params are UTF-8, or you add a second
> parameter to state their charset. (or you just go charset neutral which
> isn't really a good option)

Those aren't the only options here (and you're dismissing charset
neutral as an obviously bad third option, mentioned only to be
rejected, when in fact it's what this whole thread is really about).

I see three options on the table:

   - Keep them as char *, declare them UTF-8, and convert user input
     as best we can.

   - Keep them as char *, declare no particular charset, but don't
     allow zero bytes.

   - Convert them back to counted-length strings and treat them as
     binary data again (I guess this is the most militantly charset
     neutral option).
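
The practical difference between the second and third options is whether
a zero byte can survive.  A sketch of the check the second option
implies, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the counted buffer can be passed around as a NUL-terminated
   `char *' without losing data, i.e. it contains no zero byte.  Bytes
   >= 0x80 are allowed, so text in any 8-bit charset passes. */
int representable_as_c_string(const char *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len == 0)
        return 1;
    return memchr(data, '\0', len) == NULL;
}
```

Anything that fails this check needs the counted-length representation
of the third option.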

---------------------------------------------------------------------

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Greg Stein <gs...@lyra.org>.

On Thu, May 30, 2002 at 10:30:44AM -0500, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > > If a log message is in some unknown and unknowable charset, I can't
> > > stick the text into a text widget and have any confidence that something
> > > legible will be displayed.
> > 
> > Yup.
> > 
> > Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
> > particular comment from svn_fs.h:
>...
> Hmm, but that's just talking about paths.

Of course. I was showing one data point, and my email was moving on to the rest.

>...
> The issue here is log
> messages (the fact that log messages are stored as property values is
> an implementation detail -- I don't think the ideal that property
> values support binary data has any influence one way or the other on
> whether binary log messages should be allowed).

Yes.

> > We've always considered all properties to be binary. Thus, ra_dav will need
> > to encode it in some fashion to keep it safe within an XML body. While the
> > log message *happens* to be a property, the interface calls it a char*,
> > which means UTF-8. And we informally decided (meaning: it isn't written down
> > like what is in svn_fs.h) on using UTF-8 as our library's character set a
> > long time ago also. Maybe I could find a reference, but I'm not going to
> > bother. We *did* choose it, so people can attempt to prove otherwise or
> > provide some technical reason why choosing one charset is Badness(tm).
> 
> I don't understand the connection here.
> 
> We didn't decide that all data coming into fs is UTF-8.  We decided

I was talking about interfaces -- parameters. Not file contents.

> that pathnames were UTF-8, and that file contents and property values
> would be binary data (as far as the fs is concerned).

Of course.

>...
> > While the
> > log message *happens* to be a property, the interface calls it a char*,
> > which means UTF-8. 
> 
> The interface calls log messages `char *' as of one day ago :-), and

And if this conversation was two days ago, I would have said stringbuf.

The point is: where we have char* in our interfaces, they are almost always
representing some characters. I'm saying that we decided on saying they were
UTF-8 and avoiding carrying around charset metadata with those.

To be concrete: either those char* params are UTF-8, or you add a second
parameter to state their charset. (or you just go charset neutral which
isn't really a good option)

Think back. Like two years ago. We said UTF-8 was the SVN charset. Not just
paths. But all the content [outside of file content and prop values].

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Stein <gs...@lyra.org> writes:
> > If a log message is in some unknown and unknowable charset, I can't
> > stick the text into a text widget and have any confidence that something
> > legible will be displayed.
> 
> Yup.
> 
> Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
> particular comment from svn_fs.h:
> 
> /* Here are the rules for directory entry names, and directory paths:
> 
>    A directory entry name is a Unicode string encoded in UTF-8, and
>    may not contain the null character (U+0000).  The name should be in
>    Unicode canonical decomposition and ordering.  No directory entry
> ...	 

Hmm, but that's just talking about paths.  No one disagrees that paths
should be enforced to one canonical format.  The issue here is log
messages (the fact that log messages are stored as property values is
an implementation detail -- I don't think the ideal that property
values support binary data has any influence one way or the other on
whether binary log messages should be allowed).

> We've always considered all properties to be binary. Thus, ra_dav will need
> to encode it in some fashion to keep it safe within an XML body. While the
> log message *happens* to be a property, the interface calls it a char*,
> which means UTF-8. And we informally decided (meaning: it isn't written down
> like what is in svn_fs.h) on using UTF-8 as our library's character set a
> long time ago also. Maybe I could find a reference, but I'm not going to
> bother. We *did* choose it, so people can attempt to prove otherwise or
> provide some technical reason why choosing one charset is Badness(tm).

I don't understand the connection here.

We didn't decide that all data coming into fs is UTF-8.  We decided
that pathnames were UTF-8, and that file contents and property values
would be binary data (as far as the fs is concerned).

This doesn't mean we can't enforce some convention for log messages in
particular, but such a decision is certainly not *implied* by anything
in the design of the fs right now.

> While the
> log message *happens* to be a property, the interface calls it a char*,
> which means UTF-8. 

The interface calls log messages `char *' as of one day ago :-), and
that's just fallout from 2024.  There are comments in the code,
indicating that maybe it should go back to supporting binary data, as
it did up until 2024.

-K

---------------------------------------------------------------------

Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Wed, May 29, 2002 at 04:00:12PM -0700, Greg Stein wrote:

> Right. If the API has a text string, then SVN says that text string is in
> UTF-8. If we have standard properties that are to be interpreted as text,
> then those will be stored as UTF-8 strings (within the binary property).
> 
> While APR doesn't talk about character sets for its API (wrongly so, IMO),
> the Subversion libraries *do*. Anything that is text will be UTF-8. Since
> paths and URLs hold "characters" (but are hard to call "text"), they also
> use UTF-8 for their character set.

+1 on all of this.

making an arbitrary decision to use UTF-8, while it might feel like
we're 'imposing policy on users', solves a ton of problems at a fairly
reasonable cost, and seems like the only sane way to go.

-garrett 

-- 
garrett rooney                    Remember, any design flaw you're 
rooneg@electricjellyfish.net      sufficiently snide about becomes  
http://electricjellyfish.net/     a feature.       -- Dan Sugalski

---------------------------------------------------------------------

use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

Posted by Greg Stein <gs...@lyra.org>.

On Wed, May 29, 2002 at 03:30:35PM -0500, Jon Trowbridge wrote:
> On Wed, 2002-05-29 at 14:30, cmpilato@collab.net wrote:
> > Marcus Comstedt <ma...@mc.pp.se> writes:
> > > Since there are no properties on log messages, how do you propose that
> > > the actual character encoding for a log message be recorded?
> > 
> > That information, as you may have inferred from my previous paragraph,
> > is stored "out of band", in a HACKING file or something, and is
> > regulated by the repos admins.

Untenable.

> ...but if you do this, anyone who wants to write a GUI client that
> allows for log message browsing is out of luck.

Exactly.

> If a log message is in some unknown and unknowable charset, I can't
> stick the text into a text widget and have any confidence that something
> legible will be displayed.

Yup.

Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
particular comment from svn_fs.h:

/* Here are the rules for directory entry names, and directory paths:

   A directory entry name is a Unicode string encoded in UTF-8, and
   may not contain the null character (U+0000).  The name should be in
   Unicode canonical decomposition and ordering.  No directory entry
...	 

We've always considered all properties to be binary. Thus, ra_dav will need
to encode it in some fashion to keep it safe within an XML body. While the
log message *happens* to be a property, the interface calls it a char*,
which means UTF-8. And we informally decided (meaning: it isn't written down
like what is in svn_fs.h) on using UTF-8 as our library's character set a
long time ago also. Maybe I could find a reference, but I'm not going to
bother. We *did* choose it, so people can attempt to prove otherwise or
provide some technical reason why choosing one charset is Badness(tm).

> Requiring utf-8 here might seem onerous, but it is pretty much the only
> way to avoid a whole class of annoying charset problems down the road.

Right. If the API has a text string, then SVN says that text string is in
UTF-8. If we have standard properties that are to be interpreted as text,
then those will be stored as UTF-8 strings (within the binary property).

While APR doesn't talk about character sets for its API (wrongly so, IMO),
the Subversion libraries *do*. Anything that is text will be UTF-8. Since
paths and URLs hold "characters" (but are hard to call "text"), they also
use UTF-8 for their character set.

[ and extend as applicable to other concepts in the API... ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

the following algorithm will get us a
successor for NR, which is likely (though not guaranteed) to be
relatively "near" NR.

   Setup a cursor on NR's node revision id in the `nodes' table;
   Advance cursor to next row;
   THIS_NR = Current cursor location;
   If THIS_NR.NodeId != NR.NodeId:
      /* unrelated node, no more node revisions of NR */
      return FAILURE;
   If THIS_NR.CopyId == NR.CopyId
      && THIS_NR.TxnId is not a pending transaction:
      /* same node_id, same copy_id, must be different (older!) txn_id */
      return SUCCESS, THIS_NR;
   ELSE:
      DO:
         IF THIS_NR.TxnId > NR.TxnId
         && THIS_NR.TxnId is not a pending transaction:
            /* same node_id, older copy_id, older txn_id */
            return SUCCESS, THIS_NR;
         Advance cursor to next row;
         THIS_NR = Current cursor location;
      WHILE (THIS_NR.NodeId == NR.NodeId)
   return FAILURE;

However, I realize that adding ordering to those IDs is probably not a
popular thought.  :-)

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Jon Trowbridge <tr...@ximian.com>.

On Wed, 2002-05-29 at 14:30, cmpilato@collab.net wrote:
> Marcus Comstedt <ma...@mc.pp.se> writes:
>
> > Since there are no properties on log messages, how do you propose that
> > the actual character encoding for a log message be recorded?
> 
> That information, as you may have inferred from my previous paragraph,
> is stored "out of band", in a HACKING file or something, and is
> regulated by the repos admins.

...but if you do this, anyone who wants to write a GUI client that
allows for log message browsing is out of luck.

If a log message is in some unknown and unknowable charset, I can't
stick the text into a text widget and have any confidence that something
legible will be displayed.

Requiring utf-8 here might seem onerous, but it is pretty much the only
way to avoid a whole class of annoying charset problems down the road.

-JT

---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

cmpilato@collab.net writes:

> > Since there are no properties on log messages, how do you propose that
> > the actual character encoding for a log message be recorded?
> 
> That information, as you may have inferred from my previous paragraph,
> is stored "out of band", in a HACKING file or something, and is
> regulated by the repos admins.

A client will have a difficult time in extracting that information from
the HACKING file in order to display the log messages properly to the
user, methinks.


  // Marcus



---------------------------------------------------------------------

Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Marcus Comstedt <ma...@mc.pp.se> writes:

> Branko Čibej <br...@xbc.nu> writes:
> 
> > I must have missed this in your earlier patches. IMHO, only path names
> > should be in (transformed to) UTF-8. Property contents, including log
> > messages, shouldn't be touched.
> 
> Hm.  Any particular reason?  Apart from breaking the "all strings
> passed to libsvn_* shall be UTF-8"-paradigm, it would mean that two
> persons, one using a Latin-1 charset and one using a UTF-8 charset,
> wouldn't be able to properly read each other's log messages even if
> they are restricting themselves to the common subset of characters.

Subversion should *not* be in the business of doing character set
conversions of any sort, in my opinion.  All subversion property
values should be binary, and the interpretation of those bits is left
to policy makers.  That is, we may say, "The Subversion repository
uses UTF-8 encoding for all human-readable property values"; somebody
else may say that their repository users should make sure that they
use Shift-JIS encodings for their repository.

> Since there are no properties on log messages, how do you propose that
> the actual character encoding for a log message be recorded?

That information, as you may have inferred from my previous paragraph,
is stored "out of band", in a HACKING file or something, and is
regulated by the repos admins.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Branko Čibej <br...@xbc.nu> writes:

> I must have missed this in your earlier patches. IMHO, only path names
> should be in (transformed to) UTF-8. Property contents, including log
> messages, shouldn't be touched.

Hm.  Any particular reason?  Apart from breaking the "all strings
passed to libsvn_* shall be UTF-8"-paradigm, it would mean that two
persons, one using a Latin-1 charset and one using a UTF-8 charset,
wouldn't be able to properly read each other's log messages even if
they are restricting themselves to the common subset of characters.

Since there are no properties on log messages, how do you propose that
the actual character encoding for a log message be recorded?

  // Marcus


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Branko Čibej <br...@xbc.nu>.

Marcus Comstedt wrote:

> Ulf Tigerstedt <ti...@infa.abo.fi> writes:
>
> > Not a correct patch, but something that needs to be fixed one
> > way or the other:
> >
> > Problem: åäö (and other non-ASCII chars) in commit messages make the ra_dav
> > server barf over the commit.
>
> The UTF-8 patch I'm working on recodes all commit messages to UTF-8
> which should fix this problem (they are of course coded back when you
> want to look at them).  So a more proper fix is underway, but it will
> take some more time.

I must have missed this in your earlier patches. IMHO, only path names 
should be in (transformed to) UTF-8. Property contents, including log 
messages, shouldn't be touched.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

On Wed, 2002-05-29 at 11:50, Karl Fogel wrote:
> He is sending UTF-8, and that's stimulating the bug.  So recoding all
> commit messages as UTF-8 isn't going to help, right?

No, he's sending ISO-8859-1 or something.  UTF-8 would use two-byte (or
longer) sequences for the funny characters.

> But anyway, how are we going to "recode" commit messages to UTF-8, if
> we don't know what encoding they're coming from?

Marcus's approach is to require that LC_CTYPE be set to the encoding
your tools use to write and display funny characters.  (If people used
tools which wrote and displayed UTF-8, that would be ideal, and we would
never have to do any conversion.  Unfortunately, there are a limited
number of tools which do so right now.)

> And what format are
> you storing them in in the revision property?

In the revision property, they would be UTF-8.

> (i.e., What
> circumstances are included in "when you want to look at them"?)

When you do "svn log", your client would get the log message back from
the library in UTF-8 and would convert to your local character set to
display it.  Similarly if you use "svn propget".
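
To illustrate the point about two-byte sequences, here is a minimal
sketch of the ISO-8859-1 to UTF-8 direction of such a conversion.  It
assumes the input really is ISO-8859-1; latin1_to_utf8 is a
hypothetical helper written for this note, not a Subversion function,
and a real client would have to handle whatever charset the locale
names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8.
   Returns a malloc'd string the caller must free, or NULL if the
   allocation fails.  Hypothetical helper, not a Subversion API. */
char *
latin1_to_utf8 (const char *src)
{
  const unsigned char *in = (const unsigned char *) src;
  size_t len = 0;
  char *out, *p;

  while (in[len] != '\0')
    len++;

  /* Worst case: every input byte becomes two output bytes. */
  out = malloc (2 * len + 1);
  if (out == NULL)
    return NULL;

  p = out;
  for (; *in != '\0'; in++)
    {
      if (*in < 0x80)
        *p++ = (char) *in;                     /* ASCII: copied as-is */
      else
        {
          *p++ = (char) (0xC0 | (*in >> 6));   /* lead byte  110000xx */
          *p++ = (char) (0x80 | (*in & 0x3F)); /* trail byte 10xxxxxx */
        }
    }
  *p = '\0';
  return out;
}
```

With this, the single ISO-8859-1 byte 0xE5 (å) comes out as the two
bytes 0xC3 0xA5, while plain ASCII passes through unchanged.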


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> Typo, sorry, I meant ISO-8859-1.  (ISO-8859-1 == Latin-1, right?)

Yup.  (It's not true for all values of X that ISO-8859-X == Latin-X
though, since there are some non-latin ISO-8859s.  :)

> Aaaaaah, this is what I didn't understand, thanks.
> 
> (My locale has always been English, and I don't change it when I'm
> editing other languages, so I'm not used to thinking of locale as a
> reliable indicator of what language a particular document is in.  But
> I guess it's the best we can do).

The $LC_CTYPE is usually a pretty good indication of what charset is
being used.  At least among non-English users.  :-)  It's not related
to language, only to character encoding.
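
One way a POSIX program can ask which charset the locale selects is
sketched below.  This only assumes the standard setlocale() and
nl_langinfo() calls; locale_charset is a hypothetical name, and the
thread does not say whether the UTF-8 patch queries the locale this
way.

```c
#include <langinfo.h>
#include <locale.h>

/* Return the name of the character encoding selected by the user's
   locale settings (LC_ALL, LC_CTYPE or LANG), for example
   "ISO-8859-1" or "UTF-8".  Hypothetical helper for illustration. */
const char *
locale_charset (void)
{
  /* Adopt the environment's LC_CTYPE instead of the default "C"
     locale; if that fails, the current locale stays in effect. */
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}
```

The returned name is what a converter would use as the source charset
when recoding a log message, and as the target charset when printing
one.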

  // Marcus


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Marcus Comstedt <ma...@mc.pp.se> writes:
> Que?  What makes you think he's sending UTF-8?  I'm pretty sure he is
> sending Latin-1.

Typo, sorry, I meant ISO-8859-1.  (ISO-8859-1 == Latin-1, right?)

> They are coming from the encoding specified by his locale.

Aaaaaah, this is what I didn't understand, thanks.

(My locale has always been English, and I don't change it when I'm
editing other languages, so I'm not used to thinking of locale as a
reliable indicator of what language a particular document is in.  But
I guess it's the best we can do).

> Messages are stored in UTF-8.  Reverse conversion is done by `svn log'
> when it prints the message to stdout.

Okay, gotcha.

Thanks,
-Karl


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> I'm not sure I understand how this helps.
> 
> He is sending UTF-8, and that's stimulating the bug.  So recoding all
> commit messages as UTF-8 isn't going to help, right?

Que?  What makes you think he's sending UTF-8?  I'm pretty sure he is
sending Latin-1.

> But anyway, how are we going to "recode" commit messages to UTF-8, if
> we don't know what encoding they're coming from?

They are coming from the encoding specified by his locale.

> And what format are
> you storing them in in the revision property?  (i.e., What
> circumstances are included in "when you want to look at them"?)

Messages are stored in UTF-8.  Reverse conversion is done by `svn log'
when it prints the message to stdout.

  // Marcus


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Marcus Comstedt <ma...@mc.pp.se> writes:
> > Problem: åäö (and other non-ASCII chars) in commit messages make the ra_dav
> > server barf over the commit.
> 
> The UTF-8 patch I'm working on recodes all commit messages to UTF-8
> which should fix this problem (they are of course coded back when you
> want to look at them).  So a more proper fix is underway, but it will
> take some more time.

I'm not sure I understand how this helps.

He is sending UTF-8, and that's stimulating the bug.  So recoding all
commit messages as UTF-8 isn't going to help, right?

But anyway, how are we going to "recode" commit messages to UTF-8, if
we don't know what encoding they're coming from?  And what format are
you storing them in in the revision property?  (i.e., What
circumstances are included in "when you want to look at them"?)


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Ulf Tigerstedt <ti...@infa.abo.fi> writes:

> Not a correct patch, but something that needs to be fixed one
> way or the other:
> 
> Problem: åäö (and other non-ASCII chars) in commit messages make the ra_dav
> server barf over the commit.

The UTF-8 patch I'm working on recodes all commit messages to UTF-8
which should fix this problem (they are of course coded back when you
want to look at them).  So a more proper fix is underway, but it will
take some more time.

  // Marcus


RE: Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Bill Tutt <ra...@lyra.org>.

The really weird cases for HTML/XML encoding are non-UTF8 transmissions
of XML/HTML data.

To turn your gut further: I recently had to port some code that
supports this horrendous edge case.

The example goes something like this:
I want (for whatever bizarre reason) to transmit my HTML/XML in a Korean
character set. (Windows character set 1361 to pick a specific one)

The data I want to send looks like this: AA'BC
(Pretend for a second that A' really is a capital letter A with an acute
marker over it.)

Now, unsurprisingly A' isn't representable in Korean. Therefore, I'd
like to be able to transform this into an HTML/XML entity. The helper
function that I had to call from C# produced this output for the above
string:
A&Aacute;BC

Yes, this is evil. Yes, this is an unbelievably edge case scenario. Yes,
I have no idea who the hell needs this, but I had to port access to the
code to our new .Net API, whee. 

Stomach turning i18n factoid of the day,
Bill
----
Do you want a dangerous fugitive staying in your flat?
No.
Well, don't upset him and he'll be a nice fugitive staying in your flat.




Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Marcus Comstedt <ma...@mc.pp.se> writes:
> HTTP is 8-bit safe.  No need for 7-bit oddities.  The special
> characters in XML are `<', `>', `&', and in the case of attribute
> values `'' and `"'.  No other octets need special treatment.

Thank you!  (I've long needed to see this spelled out so clearly :-)

> And that's precisely what I've been working on for a couple of days
> now.  See the "UTF-8" thread.  I intend to post a new update of the
> patch tomorrow.

Okay, great.  Thanks for the clarifying mails.


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> So the idea is:
> 
>    - First get the log message into UTF-8.
> 
>    - Then, our usual encoding step will convert `<', `>' and other
>      special characters in the UTF-8 to entity representations, so
>      that what goes across the wire is 7-bit and safe.  (By "special",
>      you meant "8-bit", right?)  And of course it gets decoded back
>      into UTF-8 on the other end.

HTTP is 8-bit safe.  No need for 7-bit oddities.  The special
characters in XML are `<', `>', `&', and in the case of attribute
values `'' and `"'.  No other octets need special treatment.


> Is that right?
> 
> It's step 1 that seems difficult to me.  If the person didn't write
> the log message in UTF-8 in the first place, how are we going to guess
> what charset they _did_ write it in?  It seems to me we have to add
> new run-time config code, or heuristics, to determine what encoding it
> uses, so that we can losslessly convert it to UTF-8 if it's not UTF-8
> already.

And that's precisely what I've been working on for a couple of days
now.  See the "UTF-8" thread.  I intend to post a new update of the
patch tomorrow.


  // Marcus




Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

cmpilato@collab.net writes:
> The message already is being XML-encoded to some extent, in that '<'
> and '>' and other such special chars are being converted to entity
> representations, IIRC.  I think all we need to do is to make sure that
> all this stuff is first converted to UTF-8, and then just add the
> "charset" XML attribute thingy that states that this particular XML
> document is in UTF-8.

So the idea is:

   - First get the log message into UTF-8.

   - Then, our usual encoding step will convert `<', `>' and other
     special characters in the UTF-8 to entity representations, so
     that what goes across the wire is 7-bit and safe.  (By "special",
     you meant "8-bit", right?)  And of course it gets decoded back
     into UTF-8 on the other end.

Is that right?

It's step 1 that seems difficult to me.  If the person didn't write
the log message in UTF-8 in the first place, how are we going to guess
what charset they _did_ write it in?  It seems to me we have to add
new run-time config code, or heuristics, to determine what encoding it
uses, so that we can losslessly convert it to UTF-8 if it's not UTF-8
already.

-Karl


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Stein <gs...@lyra.org>.

On Wed, May 29, 2002 at 10:55:22AM -0500, cmpilato@collab.net wrote:
>...
> > The next problem in the pipeline (based on what Ulf Tigerstedt
> > encountered) is that the message has to be properly XML-encoded before
> > being sent over the wire -- necessary whether UTF-8 or full binary.
> 
> The message already is being XML-encoded to some extent, in that '<'
> and '>' and other such special chars are being converted to entity
> representations, IIRC.  I think all we need to do is to make sure that
> all this stuff is first converted to UTF-8, and then just add the
> "charset" XML attribute thingy that states that this particular XML
> document is in UTF-8.
> 
> Am I remembering XML specs correctly?

Well, first, any XML "document" needs to choose a character set for its
body. Then you declare that in the <?xml?> thing, or in the Content-Type
header. Subversion normally uses the latter:

  Content-Type: text/xml; charset="utf-8"

Placing the character set in the HTTP header is a bit better than using the
<?xml?> processing instruction.

Anyways... after the charset is decided, then you need to escape certain
characters (as somebody already stated: <, >, &, and sometimes ' and ")
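
The escaping step can be sketched as follows.  This is a minimal
illustration, assuming the text is already UTF-8; xml_escape is a
hypothetical helper and not the routine Subversion actually uses.  It
rewrites only the five characters named in this thread and passes
every other octet through untouched.

```c
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of SRC with <, >, &, ' and " replaced by
   their predefined XML entities.  All other bytes, including those
   with the high bit set, are copied unchanged.  Returns NULL if the
   allocation fails.  Hypothetical helper for illustration. */
char *
xml_escape (const char *src)
{
  size_t len = strlen (src);
  /* Worst case: every byte becomes a 6-byte entity such as "&quot;". */
  char *out = malloc (6 * len + 1);
  char *p = out;

  if (out == NULL)
    return NULL;

  for (; *src != '\0'; src++)
    {
      const char *rep = NULL;

      switch (*src)
        {
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '&':  rep = "&amp;";  break;
        case '\'': rep = "&apos;"; break;
        case '"':  rep = "&quot;"; break;
        }

      if (rep != NULL)
        {
          size_t n = strlen (rep);
          memcpy (p, rep, n);
          p += n;
        }
      else
        *p++ = *src;
    }
  *p = '\0';
  return out;
}
```

Escaping all five everywhere is a simplification: ' and " only need it
inside attribute values, but escaping them in element content is
harmless.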

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Branko Čibej <br...@xbc.nu>.

Karl Fogel wrote:

> (Does even UTF-16 have any zero bytes?)

About 128 thousand of them. :-)
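
A small sketch of why that matters for `const char *' log messages:
in UTF-16 every ASCII letter carries a zero byte, so C string
functions stop early.  The byte array below is the UTF-16LE encoding
of "svn" written out by hand; the names are hypothetical and exist
only for this illustration.

```c
#include <stddef.h>

/* "svn" in UTF-16LE: each ASCII letter is followed by a zero byte,
   and the terminator is a zero 16-bit unit.  The payload is 6 bytes. */
const char utf16le_svn[] = { 's', 0, 'v', 0, 'n', 0, 0, 0 };

/* Count the zero bytes among the first N bytes of DATA. */
size_t
count_zero_bytes (const char *data, size_t n)
{
  size_t i, zeros = 0;

  for (i = 0; i < n; i++)
    if (data[i] == '\0')
      zeros++;
  return zeros;
}
```

Half of the six payload bytes are zero, and strlen() on the same
buffer reports 1, which is why arbitrary binary values need a
counted string rather than a C string.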

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/




Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Hudson <gh...@MIT.EDU> writes:
> > The first problem in the pipeline is that the `log_msg' variable to a
> > lot of internal functions is now `const char *' instead of stringbuf.
> > As long as people stick to UTF-8, this is fine.  If we want true
> > binary log message support, we'll need to go back to stringbufs for
> > that data (not a difficult change).
> 
> Some corrections:
> 
> >   1. svn_string_t, not svn_stringbuf_t, for arbitrary binary data

Good point.

>   2. Being safe for "8-bit data" is different from being safe for
> "binary data."  No 8-bit character set uses the octet 0 (I'm pretty
> certain of that), so you can still use C strings for international text
> unless you want to support UTF-16 or some such.

Now now, that's not a correction, that's what I'm saying above :-).

(Does even UTF-16 have any zero bytes?)


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Greg Hudson <gh...@MIT.EDU>.

Karl Fogel <kf...@newton.ch.collab.net> writes:
> The first problem in the pipeline is that the `log_msg' variable to a
> lot of internal functions is now `const char *' instead of stringbuf.
> As long as people stick to UTF-8, this is fine.  If we want true
> binary log message support, we'll need to go back to stringbufs for
> that data (not a difficult change).

Some corrections:

  1. svn_string_t, not svn_stringbuf_t, for arbitrary binary data
  2. Being safe for "8-bit data" is different from being safe for
"binary data."  No 8-bit character set uses the octet 0 (I'm pretty
certain of that), so you can still use C strings for international text
unless you want to support UTF-16 or some such.

Marcus's patch takes a reasonable approach, since it means people can
write log messages in different character encodings and get sane (if not
always perfect) results.  It would still be nicer if people's tools just
used UTF-8, of course, so that applications didn't have to know about
character sets.

The alternative is to base64-encode log messages when they're stuffed
into XML documents, which is charset-neutral like CVS is.
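
For comparison, the base64 alternative could look roughly like this.
It is a sketch of standard base64 with '=' padding; base64_encode is
a hypothetical helper, not an existing Subversion or APR call.  The
output uses only ASCII letters, digits, '+', '/' and '=', so it needs
no XML escaping and says nothing about the charset of the input.

```c
#include <stddef.h>
#include <stdlib.h>

/* Encode LEN bytes of DATA as base64 with '=' padding.  Returns a
   malloc'd NUL-terminated string the caller must free, or NULL if
   the allocation fails.  Hypothetical helper for illustration. */
char *
base64_encode (const unsigned char *data, size_t len)
{
  static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  char *out = malloc (4 * ((len + 2) / 3) + 1);
  char *p = out;
  size_t i;

  if (out == NULL)
    return NULL;

  /* Full groups: 3 input bytes become 4 output characters. */
  for (i = 0; i + 2 < len; i += 3)
    {
      unsigned long v = ((unsigned long) data[i] << 16)
                        | ((unsigned long) data[i + 1] << 8)
                        | data[i + 2];
      *p++ = alphabet[(v >> 18) & 0x3F];
      *p++ = alphabet[(v >> 12) & 0x3F];
      *p++ = alphabet[(v >> 6) & 0x3F];
      *p++ = alphabet[v & 0x3F];
    }

  /* Tail: 1 or 2 leftover bytes, padded out to 4 characters. */
  if (i < len)
    {
      unsigned long v = (unsigned long) data[i] << 16;

      if (i + 1 < len)
        v |= (unsigned long) data[i + 1] << 8;
      *p++ = alphabet[(v >> 18) & 0x3F];
      *p++ = alphabet[(v >> 12) & 0x3F];
      *p++ = (i + 1 < len) ? alphabet[(v >> 6) & 0x3F] : '=';
      *p++ = '=';
    }
  *p = '\0';
  return out;
}
```

The price is size (four output characters per three input bytes) and
a log message that is unreadable on the wire.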


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

cmpilato@collab.net writes:

> The message already is being XML-encoded to some extent, in that '<'
> and '>' and other such special chars are being converted to entity
> representations, IIRC.  I think all we need to do is to make sure that
> all this stuff is first converted to UTF-8, and then just add the
> "charset" XML attribute thingy that states that this particular XML
> document is in UTF-8.

UTF-8 is the default charset for XML.  So in this case the attribute
is not needed.


  // Marcus




Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Karl Fogel <kf...@newton.ch.collab.net> writes:

> cmpilato@collab.net writes:
> > Log messages are just revision properties, and Subversion claims
> > support for binary property values.  So, even if, client side, log
> > messages were limited to 7bit (which I think should *not* happen), we
> > would still see this bug when someone used the upper ASCII characters
> > on some other property, e.g., a node's "svn:ignore" property value.
> 
> A few words about that.
> 
> Yeah, we should definitely support 8-bit chars in log messages.  On
> the repository side, the revision property value is quite capable of
> storing it, because it can store any binary value, like Mike says.
> 
> A person's $EDITOR can presumably write any binary value too.

Sure.

> The first problem in the pipeline is that the `log_msg' variable to a
> lot of internal functions is now `const char *' instead of stringbuf.
> As long as people stick to UTF-8, this is fine.  If we want true
> binary log message support, we'll need to go back to stringbufs for
> that data (not a difficult change).

In general, I think we can special case the log messages in the client
side to be UTF-8 textual messages.  I've got no problem with that.

> The next problem in the pipeline (based on what Ulf Tigerstedt
> encountered) is that the message has to be properly XML-encoded before
> being sent over the wire -- necessary whether UTF-8 or full binary.

The message already is being XML-encoded to some extent, in that '<'
and '>' and other such special chars are being converted to entity
representations, IIRC.  I think all we need to do is to make sure that
all this stuff is first converted to UTF-8, and then just add the
"charset" XML attribute thingy that states that this particular XML
document is in UTF-8.

Am I remembering XML specs correctly?


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Marcus Comstedt <ma...@mc.pp.se>.

Ulf Tigerstedt <ti...@infa.abo.fi> writes:

> As a sidenote, I tried to put vim into UTF-8 fileencoding
> when writing the message. Not surprisingly, it worked flawlessly.
> The log then shows the raw coding.

If you for the moment manually put your log entries into UTF-8, then
you'll be future safe.  Once UTF-8 recoding is enabled in the client,
it will start printing the old entries correctly as well.

  // Marcus


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Ulf Tigerstedt <ti...@infa.abo.fi>.

On 29 May 2002, Karl Fogel wrote:

> Ulf Tigerstedt <ti...@infa.abo.fi> writes:
> > > Ulf, want to try a patch that does one or both of these things?
> > 
> > Yeah, I'm ready to test anything. 
> 
> Hee hee!
> 
> I meant do you want to *write* a patch?...

Will try, tomorrow. 

As a sidenote, I tried to put vim into UTF-8 fileencoding
when writing the message. Not surprisingly, it worked flawlessly.
The log then shows the raw coding.

This seems somewhat hairy. 
-- 
****  Ulf 'Tiggi' Tigerstedt *** KTF / Datateknik  *********
*Being flogged with a rubber chicken can be quite enjoyable*
** - Matt McLeod, In the Scary devil monastery 29.11.1999 **



Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Ulf Tigerstedt <ti...@infa.abo.fi> writes:
> > Ulf, want to try a patch that does one or both of these things?
> 
> Yeah, I'm ready to test anything. 

Hee hee!

I meant do you want to *write* a patch?...

:-)


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Ulf Tigerstedt <ti...@infa.abo.fi>.

On 29 May 2002, Karl Fogel wrote:

> cmpilato@collab.net writes:
> > Log messages are just revision properties, and Subversion claims
> > support for binary property values.  So, even if, client side, log
> > messages were limited to 7bit (which I think should *not* happen), we
> > would still see this bug when someone used the upper ASCII characters
> > on some other property, e.g., a node's "svn:ignore" property value.
> 
> A few words about that.
> 
> Yeah, we should definitely support 8-bit chars in log messages.  On
> the repository side, the revision property value is quite capable of
> storing it, because it can store any binary value, like Mike says.

Yeah, I just made the fix to not have svn blow up in my face every 
time I forget to not use Swedish in logs.
After a while it gets annoying.

> The next problem in the pipeline (based on what Ulf Tigerstedt
> encountered) is that the message has to be properly XML-encoded before
> being sent over the wire -- necessary whether UTF-8 or full binary.
> 
> Ulf, want to try a patch that does one or both of these things?

Yeah, I'm ready to test anything. 


BTW, infa.abo.fi has a newsserver with subversion-dev transferred
from maillist to news with snntp. 
Really the best way to read maillists. 
-- 
****  Ulf 'Tiggi' Tigerstedt *** KTF / Datateknik  *********
*Being flogged with a rubber chicken can be quite enjoyable*
** - Matt McLeod, In the Scary devil monastery 29.11.1999 **



Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

cmpilato@collab.net writes:
> Log messages are just revision properties, and Subversion claims
> support for binary property values.  So, even if, client side, log
> messages were limited to 7bit (which I think should *not* happen), we
> would still see this bug when someone used the upper ASCII characters
> on some other property, e.g., a node's "svn:ignore" property value.

A few words about that.

Yeah, we should definitely support 8-bit chars in log messages.  On
the repository side, the revision property value is quite capable of
storing it, because it can store any binary value, like Mike says.

A person's $EDITOR can presumably write any binary value too.

The first problem in the pipeline is that the `log_msg' variable to a
lot of internal functions is now `const char *' instead of stringbuf.
As long as people stick to UTF-8, this is fine.  If we want true
binary log message support, we'll need to go back to stringbufs for
that data (not a difficult change).

The next problem in the pipeline (based on what Ulf Tigerstedt
encountered) is that the message has to be properly XML-encoded before
being sent over the wire -- necessary whether UTF-8 or full binary.

Ulf, want to try a patch that does one or both of these things?


Re: [RFC/PATCH] commit messages not 8-bit compatible

Posted by cm...@collab.net.

Ulf Tigerstedt <ti...@infa.abo.fi> writes:

> Not a correct patch, but something that needs to be fixed one
> way or the other:
> 
> Problem: åäö (and other non-ASCII chars) in commit messages make the ra_dav
> server barf over the commit.

Yeah, this is a known issue.

> -
> +void 
> +svn_strip_log_highbits(svn_stringbuf_t *buffer) { 
> +       int i;
> +       for (i=buffer->len; i!=0; i--) {
> +               buffer->data[i]&=0x7F;
> +       }
> +}
>  #define EDITOR_PREFIX_TXT  "SVN:"
>  
>  /* This function is of type svn_client_get_commit_log_t. */
> @@ -585,6 +591,7 @@
>        /* Strip the prefix from the buffer. */
>        if (message)
>          message = strip_prefix_from_buffer (message, EDITOR_PREFIX_TXT,
> pool);
> +        svn_strip_log_highbits(message);
>  
>        if (message)
>          {
> 
> Don't apply this, but please comment. 
> Should the messages be allowed to be 8bit, and is it the client or the
> server that should correct it if needed?

Ouch, just stripping the high bits from the log message?  I'd hate to
try to read that one after the fact.  This is not the way to go about
things.
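
To make the readability problem concrete: clearing bit 7 maps each
ISO-8859-1 letter onto an unrelated ASCII character.  The sketch below
assumes the message is ISO-8859-1; strip_high_bits is a hypothetical
stand-in for the patch's svn_strip_log_highbits that works on a plain
buffer instead of an svn_stringbuf_t.  Unlike the quoted loop, which
runs from buffer->len down to 1 and so never touches index 0, it
covers bytes 0 through len-1.

```c
#include <stddef.h>

/* Clear bit 7 of every byte in BUF[0..LEN-1], in place.
   Hypothetical helper for illustration only. */
void
strip_high_bits (char *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    buf[i] &= 0x7F;
}
```

Applied to the three bytes 0xE5 0xE4 0xF6 ("åäö" in ISO-8859-1), this
yields "edv": the commit goes through, but the original text cannot be
recovered from the log.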

Log messages are just revision properties, and Subversion claims
support for binary property values.  So, even if, client side, log
messages were limited to 7bit (which I think should *not* happen), we
would still see this bug when someone used the upper ASCII characters
on some other property, e.g., a node's "svn:ignore" property value.
