You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Marcus Comstedt <ma...@mc.pp.se> on 2002/05/10 15:15:13 UTC

UTF-8

Hi.

I started looking at inserting the missing UTF-8 translations in the
cmdline client, but found some strangeness already.  All svn library
functions take strings in UTF-8 encoding, wasn't that the plan?  So
how come svn_io_check_path() in libsubr just takes its first argument
(which shoulde hence be an UTF-8 sting) and passes it to apr_stat(),
which does not operate on UTF-8 strings?  Seems to be conversion calls
missing inside the libs too...


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

apr xlate stuff (was: Re: UTF-8)

Posted by Greg Stein <gs...@lyra.org>.
On Thu, May 23, 2002 at 01:29:18AM +0200, Marcus Comstedt wrote:
> Greg Stein <gs...@lyra.org> writes:
> > APR has got a number of functions for transcoding strings, so SVN might not
> > even need any. Further, the apr-iconv project is available for platforms
> > that don't have iconv() builtin.
> 
> Hm, as far as I could see, the iconv stuff in apr was all empty.  You
> mean there is a separate project that has to be checked out (like
> apr-util) to get it?

Yup. The 'apr-iconv' project. But that is really about an iconv()
replacement. I think most of your focus would be in apr_xlate since APR will
use the glibc iconv for you.

[ honestly, though, I'm not sure of apr-iconv's status; it also does not
  look like it is really hooked into apr_xlate right now ]

Just make sure to check the error return values. apr_xlate_open() is going
to return an error if the translation cannot be performed (e.g. you're on a
platform where iconv isn't present).

> > Again, note that iconv() is unportable, so the apr_xlate functions should be
> > used instead.
> 
> Yeah, I know.  I just got a bit miffed when I found that all the
> iconv/ dirs in APR were empty.  Didn't know that there were
> xlate-functions to do the same thing.  Given the feedback from you and
> Ulrich I should be able to make something much nicer.  Thanks!

Cool!

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Stein <gs...@lyra.org> writes:

> APR has got a number of functions for transcoding strings, so SVN might not
> even need any. Further, the apr-iconv project is available for platforms
> that don't have iconv() builtin.

Hm, as far as I could see, the iconv stuff in apr was all empty.  You
mean there is a separate project that has to be checked out (like
apr-util) to get it?


> Again, note that iconv() is unportable, so the apr_xlate functions should be
> used instead.

Yeah, I know.  I just got a bit miffed when I found that all the
iconv/ dirs in APR were empty.  Didn't know that there were
xlate-functions to do the same thing.  Given the feedback from you and
Ulrich I should be able to make something much nicer.  Thanks!


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Stein <gs...@lyra.org>.
On Wed, May 22, 2002 at 11:59:51PM +0200, Marcus Comstedt wrote:
> Ulrich Drepper <dr...@redhat.com> writes:
>...
> > First, Unix requires a function named nl_langinfo() which returns just
> > the wanted information.  So you should have something like
> > 
> > static const char *
> > find_native_charset (void)
> > {
> > #ifdef HAVE_NL_LANGINFO
> >   return nl_langinfo (CODESET);
> > #else
> >   ...
> > #endif
> > }
> 
> Cool.  I sat for an whole hour trying to find a function that did
> that, without succeeding.  :-)

Well, you shouldn't have to worry about it at all, actually. Take a look at:

  apr/include/apr_xlate.h
  apr/i18n/unix/*

You'll note that xlate.c already has a call to nl_langinfo() in it.
Otherwise, it defaults to some other code to derive the current charset.

APR has got a number of functions for transcoding strings, so SVN might not
even need any. Further, the apr-iconv project is available for platforms
that don't have iconv() builtin.

> > To support systems without nl_langinfo() you cannot simply look at the
> > LC_CTYPE environment variable.  Its value need not have anything to do
> > with the selected locale.  The order in which the setlocale() function
> > looks at environment variables for the LC_CTYPE category is
> > 
> >   LC_ALL  ->  LC_CTYPE   ->  LANG
> > 
> > I.e., if LC_ALL is set use it.  Otherwise if LC_CTYPE is set, use this. 
> > Else use LANG if set.
> 
> I know.  It was a quick hack, as I suspected there had to be a better
> way to do it...

Ulrich's suggestions should be applied towards patches to fix up APR, rather
than encode this knowledge into SVN only.

> > But your problems won't stop there.  Charsets can have many different
> > names.  Other systems provide different mechanisms to determine the
> > current charset etc etc.
> > 
> > Look at the file localcharset.c which is used in some GNU packages (and
> > other GPL'ed ones).  I attach a copy.  This is what you'll have to do.
> 
> Nice.  I'll take a look at it.

I would encourage you to look at this. Ulrich is very well qualified to talk
about this stuff (since he is the primary glibc maintainer :-).

>...
> > > +  if (cd == (iconv_t)-1)
> > > +    return svn_error_create (0, errno, NULL, pool, "recoding string");
> > > +
> > > +  do {
> > > +
> > > +    char *destbuf = apr_palloc (pool, buflen);
> > > +
> > > +    /* Set up state variables for iconv */
> > > +    const char *srcptr = src_data;
> > > +    char *destptr = destbuf;
> > > +    size_t srclen = src_length;
> > > +    size_t destlen = buflen;
> > > +
> > > +    /* Attempt the conversion */
> > > +    if (iconv(cd, &srcptr, &srclen, &destptr, &destlen) != (size_t)-1) {

Again, note that iconv() is unportable, so the apr_xlate functions should be
used instead.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ulrich Drepper <dr...@redhat.com> writes:

> I cannot comment on the svn code, only on the i18n-specific code
> 
> > +/* Figure out what the name of the native character set is, so
> > +   that we can pass it to iconv_open.                          */
> > +static const char *
> > +find_native_charset (void)
> > +{
> > +  /* First try $LC_CTYPE */
> > +  const char *dot, *ctype = getenv ("LC_CTYPE");
> > +
> > +  /* Skip anything before the dot, if any */
> > +  if (ctype != NULL && (dot = strchr (ctype, '.')))
> > +    ctype = dot + 1;
> > +
> > +  if (ctype != NULL && !strcmp (ctype, "iso_8859_1"))
> > +    ctype = "ISO8859-1";
> > +
> > +  if (ctype == NULL || ! *ctype) {
> > +    /* Fall back to US-ASCII */
> > +    ctype = "646";
> > +  }
> > +
> > +  return ctype;
> > +}
> 
> That's horrible.  It looks like code from the early 90s when nobody knew
> how to use locales.

I _said_ that code was horrible.  :-)  If you want to make a better
one, please be my guest.  I have been concentrating on inserting the
appropriate calls all over svn, not on the actual implementation of
the conversion functions.


> First, Unix requires a function named nl_langinfo() which returns just
> the wanted information.  So you should have something like
> 
> static const char *
> find_native_charset (void)
> {
> #ifdef HAVE_NL_LANGINFO
>   return nl_langinfo (CODESET);
> #else
>   ...
> #endif
> }

Cool.  I sat for an whole hour trying to find a function that did
that, without succeeding.  :-)


> To support systems without nl_langinfo() you cannot simply look at the
> LC_CTYPE environment variable.  Its value need not have anything to do
> with the selected locale.  The order in which the setlocale() function
> looks at environment variables for the LC_CTYPE category is
> 
>   LC_ALL  ->  LC_CTYPE   ->  LANG
> 
> I.e., if LC_ALL is set use it.  Otherwise if LC_CTYPE is set, use this. 
> Else use LANG if set.

I know.  It was a quick hack, as I suspected there had to be a better
way to do it...


> But your problems won't stop there.  Charsets can have many different
> names.  Other systems provide different mechanisms to determine the
> current charset etc etc.
> 
> Look at the file localcharset.c which is used in some GNU packages (and
> other GPL'ed ones).  I attach a copy.  This is what you'll have to do.

Nice.  I'll take a look at it.


> > +static svn_error_t *
> > +svn_utf_convert_to_stringbuf (iconv_t cd,
> > +			      const char *src_data, apr_size_t src_length,
> > +			      svn_stringbuf_t **dest,
> > +			      apr_pool_t *pool)
> > +{
> > +  /* 2 bytes per character will be enough in most cases.
> > +     If not, we'll make a larger buffer and try again.   */
> > +  apr_size_t buflen = src_length * 2 + 1;
> > +
> > +  if (cd == (iconv_t)-1)
> > +    return svn_error_create (0, errno, NULL, pool, "recoding string");
> > +
> > +  do {
> > +
> > +    char *destbuf = apr_palloc (pool, buflen);
> > +
> > +    /* Set up state variables for iconv */
> > +    const char *srcptr = src_data;
> > +    char *destptr = destbuf;
> > +    size_t srclen = src_length;
> > +    size_t destlen = buflen;
> > +
> > +    /* Attempt the conversion */
> > +    if (iconv(cd, &srcptr, &srclen, &destptr, &destlen) != (size_t)-1) {
> > +
> > +      /* Conversion succeeded.  Return buffer */
> > +      *dest = svn_stringbuf_ncreate (destbuf, buflen - destlen, pool);
> > +
> > +      return SVN_NO_ERROR;
> 
> That doesn't seem right.  Note that iconv() can return a value !=
> (size_t)-1 and still have problems.  That is a stupidity in the POSIX
> spec which requires that characters which are valid in the input charset
> but have no corresponding position in the output charset must be
> ignored.  In glibc this does not happen (without the user explicitly
> saying so) since this is a security problem.
> 
> For you here it means that the loop should terminate of != (size_t)-1
> but an error saying that the result couldn't be represented must be
> returned if the value is != 0.  Otherwise you are silently loosing
> information.

Ah.  Right.  Should have read the RETURN VALUES section of the
manpage more closely.


Thanks for your input!


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Ulrich Drepper <dr...@redhat.com>.
I cannot comment on the svn code, only on the i18n-specific code

> +/* Figure out what the name of the native character set is, so
> +   that we can pass it to iconv_open.                          */
> +static const char *
> +find_native_charset (void)
> +{
> +  /* First try $LC_CTYPE */
> +  const char *dot, *ctype = getenv ("LC_CTYPE");
> +
> +  /* Skip anything before the dot, if any */
> +  if (ctype != NULL && (dot = strchr (ctype, '.')))
> +    ctype = dot + 1;
> +
> +  if (ctype != NULL && !strcmp (ctype, "iso_8859_1"))
> +    ctype = "ISO8859-1";
> +
> +  if (ctype == NULL || ! *ctype) {
> +    /* Fall back to US-ASCII */
> +    ctype = "646";
> +  }
> +
> +  return ctype;
> +}

That's horrible.  It looks like code from the early 90s when nobody knew
how to use locales.

First, Unix requires a function named nl_langinfo() which returns just
the wanted information.  So you should have something like

static const char *
find_native_charset (void)
{
#ifdef HAVE_NL_LANGINFO
  return nl_langinfo (CODESET);
#else
  ...
#endif
}

To support systems without nl_langinfo() you cannot simply look at the
LC_CTYPE environment variable.  Its value need not have anything to do
with the selected locale.  The order in which the setlocale() function
looks at environment variables for the LC_CTYPE category is

  LC_ALL  ->  LC_CTYPE   ->  LANG

I.e., if LC_ALL is set use it.  Otherwise if LC_CTYPE is set, use this. 
Else use LANG if set.

But your problems won't stop there.  Charsets can have many different
names.  Other systems provide different mechanisms to determine the
current charset etc etc.

Look at the file localcharset.c which is used in some GNU packages (and
other GPL'ed ones).  I attach a copy.  This is what you'll have to do.



> +static svn_error_t *
> +svn_utf_convert_to_stringbuf (iconv_t cd,
> +			      const char *src_data, apr_size_t src_length,
> +			      svn_stringbuf_t **dest,
> +			      apr_pool_t *pool)
> +{
> +  /* 2 bytes per character will be enough in most cases.
> +     If not, we'll make a larger buffer and try again.   */
> +  apr_size_t buflen = src_length * 2 + 1;
> +
> +  if (cd == (iconv_t)-1)
> +    return svn_error_create (0, errno, NULL, pool, "recoding string");
> +
> +  do {
> +
> +    char *destbuf = apr_palloc (pool, buflen);
> +
> +    /* Set up state variables for iconv */
> +    const char *srcptr = src_data;
> +    char *destptr = destbuf;
> +    size_t srclen = src_length;
> +    size_t destlen = buflen;
> +
> +    /* Attempt the conversion */
> +    if (iconv(cd, &srcptr, &srclen, &destptr, &destlen) != (size_t)-1) {
> +
> +      /* Conversion succeeded.  Return buffer */
> +      *dest = svn_stringbuf_ncreate (destbuf, buflen - destlen, pool);
> +
> +      return SVN_NO_ERROR;

That doesn't seem right.  Note that iconv() can return a value !=
(size_t)-1 and still have problems.  That is a stupidity in the POSIX
spec which requires that characters which are valid in the input charset
but have no corresponding position in the output charset must be
ignored.  In glibc this does not happen (without the user explicitly
saying so) since this is a security problem.

For you here it means that the loop should terminate of != (size_t)-1
but an error saying that the result couldn't be represented must be
returned if the value is != 0.  Otherwise you are silently loosing
information.


-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Hudson <gh...@MIT.EDU> writes:

> A possible compromise here is to base64-encode filenames and property
> names in XML documents (and any other externally-provided strings which
> might show up there), which should render Subversion as charset-neutral
> as CVS is.

Certainly.  Both solutions work.  It's just a matter of selecting
one.  As far as I have come to understand, the Subversion project has
chosen the UTF-8 thing.


> I'm... disappointed that you would equate the desire for simple code
> with the desire to coddle programmers.

Well, I don't think that the simplicity will be much affected with the
new wrapper functions you suggested.  The programmers still have to be
alert and not use some APR functions directly though.


> We want to eliminate the external diff fairly soon.  But even ignoring
> that, aren't the labels just revision numbers or something fairly
> innocuous like that?

When external diff is removed, the svn_io_run_cmd() function can be
removed, and the conversions in it will be gone.  No problem.
The labels typically contain the filename.


Anyway, I've added wrapper functions for apr_file_open, apr_stat,
apr_file_rename, apr_dir_make and apr_dir_open.  Now the only stuff in
libsvn_client that needs to explicilty do the UTF thing are

add.c: gets directory entries from APR
auth.c: gets username from APR
commit.c: gets directory entries from APR
commit_util.c: debugging code printing to stdout
diff.c: prints stuff on an application provided stream for the user to read
repos_diff.c: calls apr_file_remove from a cleanup handler where normal
              error handling might not be appropriate

I'll probably a wrapper for apr_dir_read too, to get rid of the
conversions in add.c and commit.c.  Adding a wrapper for
apr_get_username() probably wouldn't make much sense though since it's
only used in two places (once in libsvn_client, and once in libsvn_subr).

As for the debugging code, one could argue that conversions could be
omitted for code that is just for debugging anyway.  OTOH, I think
that debugging code should behave as similar to the "real" code as
possible.  In any case, this code should probably be completely
removed before long.

Finally, having a wrapper for apr_file_printf might be nice, but
slightly tricky to do right, since you'd either have to print to a
string first (ok, but not so neat) and convert that, or find out which
of the arguments are strings in order to convert them separately.
Hm, I see now that apr_file_printf prints to a string anyway, so
probably the first solution would not be so bad after all...


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.
A possible compromise here is to base64-encode filenames and property
names in XML documents (and any other externally-provided strings which
might show up there), which should render Subversion as charset-neutral
as CVS is.

On Thu, 2002-05-23 at 12:28, Marcus Comstedt wrote:
> > The "advantage" I was talking about was not having to do character set
> > conversion in applications, not any particular advantage to the user. 
> > (Although the user also benefits indirectly from having a single
> > character set for all languages.)

> The advantage of allowing the programmers to be lazy is rather
> insignificant compared to the disadvantage of imposing undesried
> charsets on the users.

I'm... disappointed that you would equate the desire for simple code
with the desire to coddle programmers.

> Not all arguments are provided by the OS or user.  When running diff,
> for example, the labels are provided by Subversion.

We want to eliminate the external diff fairly soon.  But even ignoring
that, aren't the labels just revision numbers or something fairly
innocuous like that?


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Hudson <gh...@MIT.EDU> writes:

> By itself, no.  But I think it's reasonable for application developers
> not to incur the hair of charset conversions when there is a superior
> approach available to the world.

To you, it may be superior.  To us who _actually use_ non-ASCII
characters, the prospect of replacing all our tools does not have a
particularly "superior" ring.  If anyone wants to take the step, the
support will be there.  Those who don't, will be supported as well.


> > If you want to use UTF-8 under Unix, all you have to do is select an
> > UTF-8 locale and the conversions will be identity conversions.
> 
> I note that wasn't the default you chose.

The default locale is chosen by the OS.  You can change it by
modifying your login files, or by selecting a locale on the login
screen in systems like CDE.  This is nothing Subversion-specific.


> The "advantage" I was talking about was not having to do character set
> conversion in applications, not any particular advantage to the user. 
> (Although the user also benefits indirectly from having a single
> character set for all languages.)

The advantage of allowing the programmers to be lazy is rather
insignificant compared to the disadvantage of imposing undesried
charsets on the users.


> It's true, anything which shows up in XML and isn't encoded will be
> checked for UTF-8 validity by expat.  As far as I know, the user-visible
> objects which meet this criterion are filenames and property names. 
> Property values and file data only show up in XML after being
> base64-encoded, as far as I am aware.

That's good.  But unless filenames and property names are known to be
UTF-8, they too would have to be base64-encoded.


> I'm guessing property names aren't a big issue.  If the issue is mainly
> filenames, then it might be okay to handle conversion, if you do so by
> wrapping an svn_file_open() around apr_file_open().  Don't do it by
> adding a conversion step before every file open call.

Yes, adding more wrapper functions to apr calls would probably be a
good idea.  There are already some in io.c.


> Does this apply to the libraries?  A library function which prints to
> stdout/stderr is useless for GUI programs, so we generally try to avoid
> that.  And I don't think we do much in the way of logging.

Well, you have svn_handle_error, for example.  And svn_client_diff.
Functions that the client calls explicitly to have stuff printed on a
stream.


> Anyway, messages presumably need to be localized, not just
> charset-converted.  (If the message contains a filename, the filename
> might need to be charset-converted.)  It doesn't add much value to
> charset-convert them, other than perhaps filenames, without doing
> localization as well.

Localized error messages would be nice, but are not as important.  The
main thing here is not having the error message say

  svn_wc_merge: `åiåaäeö' not under revision control

when the file is actually called `åiåaäeö'.  Since the string has to
be converted anyway, it makes a cleaner design to keep it as UTF-8 in
the error struct, and convert it on output instead.  


> >  · Name service calls such as getpwnam need conversion
> >  · Command line arguments passed to exec need conversion
> 
> I don't see a reason why we should be mucking around in any way with
> usernames and command-line arguments provided by the operating system or
> user.

Not all arguments are provided by the OS or user.  When running diff,
for example, the labels are provided by Subversion.  The argments
which _are_ provided by the user are converted to UTF-8 and then back
again simply for symmetry reasons.  It's _much_ easier to specify that
all strings entering the libsvn layer have to be UTF-8 encoded, than
trying to keep track of which ones are and which ones are not.


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.
(Thanks to Daniel Sternberg for the alternate mailing list archive
pointer.  I'll bookmark it.)

On Thu, 2002-05-23 at 10:03, Marcus Comstedt wrote:
> Incentive is one thing.  But Subversion is not in a position to
> _demand_ such a thing.

By itself, no.  But I think it's reasonable for application developers
not to incur the hair of charset conversions when there is a superior
approach available to the world.

> If you want to use UTF-8 under Unix, all you have to do is select an
> UTF-8 locale and the conversions will be identity conversions.

I note that wasn't the default you chose.

> It's should be up to each individual user to choose what kind of
> "advantage of Unicode" he would prefer.

The "advantage" I was talking about was not having to do character set
conversion in applications, not any particular advantage to the user. 
(Although the user also benefits indirectly from having a single
character set for all languages.)

> Although it's not 100% accurate to say thay UTF-8 validity of strings
> need not be enforced, since strings are being put in XML files without
> charset declarations, and such XML files must conform to UTF-8
> validity rules.

It's true, anything which shows up in XML and isn't encoded will be
checked for UTF-8 validity by expat.  As far as I know, the user-visible
objects which meet this criterion are filenames and property names. 
Property values and file data only show up in XML after being
base64-encoded, as far as I am aware.

I'm guessing property names aren't a big issue.  If the issue is mainly
filenames, then it might be okay to handle conversion, if you do so by
wrapping an svn_file_open() around apr_file_open().  Don't do it by
adding a conversion step before every file open call.

> Conversion needed:
> 
>  · Messages printed to stdout/stderr or non-XML logfiles need
>    conversion

Does this apply to the libraries?  A library function which prints to
stdout/stderr is useless for GUI programs, so we generally try to avoid
that.  And I don't think we do much in the way of logging.

Anyway, messages presumably need to be localized, not just
charset-converted.  (If the message contains a filename, the filename
might need to be charset-converted.)  It doesn't add much value to
charset-convert them, other than perhaps filenames, without doing
localization as well.

>  · Name service calls such as getpwnam need conversion
>  · Command line arguments passed to exec need conversion

I don't see a reason why we should be mucking around in any way with
usernames and command-line arguments provided by the operating system or
user.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Hudson <gh...@MIT.EDU> writes:

> (Major irritation: I can't get at Marcus's patch via the list archives
> to review it again.  When I click on the attachment, I get a server
> error.  This is in addition to having to frob the URL to get to the
> messages at the end of a month to find it in the first place.)

Hm, does this have anything to do with the text/x-patch MIME type
perhaps?


> Or they have more incentive to want tools which use UTF-8.  We lose most
> of the advantage of Unicode if we have to put in the conversion hooks
> everywhere anyway to support legacy character sets.

Incentive is one thing.  But Subversion is not in a position to
_demand_ such a thing.  It is a support tool that has to fit in with
the tools and operating systems already in use.  If you want to use
UTF-8 under Unix, all you have to do is select an UTF-8 locale and the
conversions will be identity conversions.  Us who want to use other
locales can do so, and have the strings converted accordingly.  To me
it's much more important to have characters like åäö show up properly
in other tools like the shell, than being able to use characters like
"happy little snowman with combining broom above".  It's should be up
to each individual user to choose what kind of "advantage of Unicode"
he would prefer.


> Also, even if we don't do conversion, I don't think we really hurt sites
> which want to use some alternate character set.  We can version
> arbitrary binary data, not just UTF-8 data.  We don't enforce UTF-8
> validity of strings anywhere in the code.  So as long as everyone is
> using the same character set, they'll be okay even if it isn't UTF-8.

That's the approach taken by CVS, and it works fairly well in
practice.  I can't say why the UTF-8 approach was chosen for
Subversion, since I was not part of taking that decision.  (Although
it's not 100% accurate to say thay UTF-8 validity of strings need not
be enforced, since strings are being put in XML files without charset
declarations, and such XML files must conform to UTF-8 validity
rules.  I think the actual validity test has been removed, but that
doesn't make it technically correct to dump binary strings straight
into XML data.  Things like base64 could be used to fix this though.)

For _file contents_, binary properties should be assumed of course.
No conversions are done for those.  (Although it might be neat to be
able to manually enable conversions for file contents through
properties.  That's probebly post 1.0 though.)


> "All system calls" is kind of vague, but the implication is that the
> Subversion libraries will write out and read in all data in the native
> character set--to files, and over the wire on the network.

I was too vague, yes.  It does _not_ apply to data written to sockets
and files that are internal to Subversion.  For example, the contents
of the property files are still UTF-8.  Neither does it apply to
external files/streams which are _supposed_ to be in UTF-8, such as
those specified with --xml-file.


> and that a
> working directory couldn't be moved between machines which used
> different character sets.

A working directory containing _filenames_ which are non-ASCII can
only be moved if the tool used to do the move takes care of
translating the filenames.  But I think that is acceptable, such is
the normal situation when moving stuff between systems.


> If you're taking a less expansive approach than that, please describe
> what the libraries do which needs to do character set conversion, and
> what doesn't.

Conversion needed:

 · Messages printed to stdout/stderr or non-XML logfiles need
   conversion
 · Use of pathnames to local files and directories need conversion
 · Name service calls such as getpwnam need conversion
 · Command line arguments passed to exec need conversion

Conversion not needed:

 · Data sent between client and server do not need conversion
 · Data stored in internal files do not need conversion
 · Data written to XML files/streams do not need conversion
 · Contents of files under version control do not need conversion (but
   may not be assumed to be UTF-8 by the code; something which the
   keyword expansion code has to be aware of.)

I hope this will make it a little more clear.  Apologies for the
previous vagueness.


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Mail archives (was Re: UTF-8)

Posted by Eric Gillespie <ep...@pretzelnet.org>.
Can we get this on the FAQ?

Daniel Stenberg <da...@haxx.se> writes:

> For reasons like this, I'm running a separate web archive of the mailing list
> here: http://www.contactor.se/~dast/svn/

Can someone add a link to this from the Subversion web site?  I
suspect the current mail archives are universally loathed.

--  
Eric Gillespie <*> epg@pretzelnet.org

Conformity is a sin.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Daniel Stenberg <da...@haxx.se>.
On 23 May 2002, Greg Hudson wrote:

> (Major irritation: I can't get at Marcus's patch via the list archives to
> review it again.  When I click on the attachment, I get a server error.
> This is in addition to having to frob the URL to get to the messages at the
> end of a month to find it in the first place.)

For reasons like this, I'm running a separate web archive of the mailing list
here: http://www.contactor.se/~dast/svn/

Marcus' particular mail and his patch are fully readable here:
http://www.contactor.se/~dast/svn/archive-2002-05/0699.shtml

I hope this helps.

-- 
      Daniel Stenberg - http://daniel.haxx.se - +46-705-44 31 77
   ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.
(Major irritation: I can't get at Marcus's patch via the list archives
to review it again.  When I click on the attachment, I get a server
error.  This is in addition to having to frob the URL to get to the
messages at the end of a month to find it in the first place.)

On Thu, 2002-05-23 at 08:26, Marcus Comstedt wrote:
> They are necessary because Subversion is not a closed world.  It needs
> to communicate with the operating system and the user.  If they do not
> use UTF-8, and Subversion does internally, conversions need to be
> made.

Or they have more incentive to want tools which use UTF-8.  We lose most
of the advantage of Unicode if we have to put in the conversion hooks
everywhere anyway to support legacy character sets.

Also, even if we don't do conversion, I don't think we really hurt sites
which want to use some alternate character set.  We can version
arbitrary binary data, not just UTF-8 data.  We don't enforce UTF-8
validity of strings anywhere in the code.  So as long as everyone is
using the same character set, they'll be okay even if it isn't UTF-8.

Your stated design was:
>  · Therefore, all system calls (direct, via APR, or via libc)
>    involving strings have to use converted versions of the strings.

"All system calls" is kind of vague, but the implication is that the
Subversion libraries will write out and read in all data in the native
character set--to files, and over the wire on the network.  That doesn't
seem right at all; it would mean that a client and server couldn't talk
to each other if they didn't use the same character set, and that a
working directory couldn't be moved between machines which used
different character sets.

If you're taking a less expansive approach than that, please describe
what the libraries do which needs to do character set conversion, and
what doesn't.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Hudson <gh...@MIT.EDU> writes:

> Er, I'm not sure if I missed an earlier conversation about this.
> 
> Why precisely are these conversions necessary?  They don't seem to be
> about UTF-8 support, but about supporting conversion from UTF-8 to other
> character sets.  It seems like a lot of hair for Subversion to be
> getting into that business.

They are necessary because Subversion is not a closed world.  It needs
to communicate with the operating system and the user.  If they do not
use UTF-8, and Subversion does internally, conversions need to be
made.

I don't know if there has been any lengthy discussion about the design
choice to always use UTF-8 internally regardless of the character
encoding used by the operating system and user, I've only seen Greg
Stein say that it is decided (but not documented) that it should work
that way.  And it does have real advantages from a version control
system perspective.  It ensures that if I commit a file "räksmörgås"
to the repository, then it will still be called "räksmörgås" when
somebody else checks it out, even if this someone is using a different
locale.  (Or, if the different locale doesn't even allow for the
characters "åäö", that he gets an error message rather than an
incorrectly named file.)

(Of course, other fixed internal representations than UTF-8 could have
 been chosen, but I assume UTF-8 was because A) it can handle all
 characters in frequent use and B) it is the default encoding for
 XML, and XML is used as a transport mechanism inside Subversion.)


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Hudson <gh...@MIT.EDU>.
Er, I'm not sure if I missed an earlier conversation about this.

Why precisely are these conversions necessary?  They don't seem to be
about UTF-8 support, but about supporting conversion from UTF-8 to other
character sets.  It seems like a lot of hair for Subversion to be
getting into that business.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Stein <gs...@lyra.org>.
On Thu, May 23, 2002 at 01:20:44AM +0200, Marcus Comstedt wrote:
>...
> > I will also recommend that you break the patch into smaller units. For
> > example, we can add the UTF-8 conversion functions before *any* other code
> > is changed.
> 
> Yeah, but that's the bit that's really needs some more work (although
> Ulrich Drepper did provide some opportunities for progress here).  :-)
> The other bits I think will be difficult to break into units,
> unfortunately.

Understood.

> The strings get passed around a lot, so it's hard to
> make a partitioning that doesn't require you to add extra
> reverse-conversions.  Of course, if we don't care about consistent
> handling of 8-bit characters during the merge phase, you could do it
> one library at a time, and just ignore the fact that things are not
> converted properly.  But that would also mean not being able to
> test/use 8-bit values in a reliable way during this time, which would
> kind of defeat the purpose.

Well, it isn't like we're doing it properly right now, anyways, so I'm not
sure how much more you could break things :-)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Stein <gs...@lyra.org> writes:

> Yup. However, your conversion functions take/return svn_stringbuf_t. Yet the
> functions outside of SVN are all 'const char *'. Thus, it makes a *lot* more
> sense for your conversion functions to start with the correct type.
> 
> [ and yes: the *implementation* might be a stringbuf, but that can be hidden
>   from the API ]
> 
> Unfortunately, that is going to be a pretty big change for your patch...

Good point.  The stringbuf variants might still be needed somewhere,
but it's simple enough to add a wrapper function which extracts the
->data for you.  At least I can use M-x grep to find all the places to
fix.  :-)


> I will also recommend that you break the patch into smaller units. For
> example, we can add the UTF-8 conversion functions before *any* other code
> is changed.

Yeah, but that's the bit that's really needs some more work (although
Ulrich Drepper did provide some opportunities for progress here).  :-)
The other bits I think will be difficult to break into units,
unfortunately.  The strings get passed around a lot, so it's hard to
make a partitioning that doesn't require you to add extra
reverse-conversions.  Of course, if we don't care about consistent
handling of 8-bit characters during the merge phase, you could do it
one library at a time, and just ignore the fact that things are not
converted properly.  But that would also mean not being able to
test/use 8-bit values in a reliable way during this time, which would
kind of defeat the purpose.


> When you attach patches, please ensure their MIME type is text/plain. If
> your mailer cannot adjust the mime type after attaching, but merely resides
> on a file extension, then just name the patch with '.txt' on the end.

Ok, will do.  I'm using Gnus, which asks me about the MIME type, but
so far I have considered the proposed default value ("text/x-patch")
to be appropriate.


> You should also include [PATCH] in the subject line, so that people know
> what is in there and can review it.

Right.  I'll try to remeber that.


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Greg Stein <gs...@lyra.org>.
On Wed, May 22, 2002 at 11:44:38PM +0200, Marcus Comstedt wrote:
>...
> Inside libsvn_*:
> 
>   · All strings are assumed to be UTF-8 encoded

Yup.

>   · Therefore, all system calls (direct, via APR, or via libc)
>     involving strings have to use converted versions of the strings.
>     The conversion is placed as close to the system call as possible
>     unless there is a compelling argument to do otherwise (arguments
>     to non-static svn_*-functions must always be UTF-8 though).

Yup. However, your conversion functions take/return svn_stringbuf_t. Yet the
functions outside of SVN are all 'const char *'. Thus, it makes a *lot* more
sense for your conversion functions to start with the correct type.

[ and yes: the *implementation* might be a stringbuf, but that can be hidden
  from the API ]

Unfortunately, that is going to be a pretty big change for your patch...

I will also recommend that you break the patch into smaller units. For
example, we can add the UTF-8 conversion functions before *any* other code
is changed.

> Inside clients/cmdline:
>...

This all sounds good.

> The current diff (against rev 2000) is included as an attachment.  The
> diff for all libs is not yet complete, but those for clients/cmdline
> and libsvn_client are, and may be unmercifully reviewed.  :-)

When you attach patches, please ensure their MIME type is text/plain. If
your mailer cannot adjust the mime type after attaching, but merely resides
on a file extension, then just name the patch with '.txt' on the end.

You should also include [PATCH] in the subject line, so that people know
what is in there and can review it.

[ see the "Patch Submission Guidelines" in HACKING. although, I note that it
  doesn't talk about attachments, nor about the dangers of line-wrapping on
  inlined patches ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> This is a memory optimisation, I don't know if it is significant or
> desirable, that depends on how often this function is called. It
> optimises by avoiding the second allocation that currently occurs when
> the stringbuf is created.  In addition if APR ever provides a realloc
> function then the svn_stringbuf_ensure call may be able to expand
> in-place when E2BIG occurs.

Memory optimization is significant and desirable, that function gets
called a lot.  I'll make the suggested changes.  Thanks!


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Philip Martin <ph...@codematters.co.uk>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> +static svn_error_t *
> +svn_utf_convert_to_stringbuf (iconv_t cd,
> +			      const char *src_data, apr_size_t src_length,
> +			      svn_stringbuf_t **dest,
> +			      apr_pool_t *pool)
> +{
> +  /* 2 bytes per character will be enough in most cases.
> +     If not, we'll make a larger buffer and try again.   */
> +  apr_size_t buflen = src_length * 2 + 1;
> +
> +  if (cd == (iconv_t)-1)
> +    return svn_error_create (0, errno, NULL, pool, "recoding string");

Create the dest stringbuf here.

> +
> +  do {
> +
> +    char *destbuf = apr_palloc (pool, buflen);

Get rid of destbuf and do do svn_stringbuf_ensure instead of apr_palloc.

> +
> +    /* Set up state variables for iconv */
> +    const char *srcptr = src_data;
> +    char *destptr = destbuf;

Use dest->data instead of destbuf.

> +    size_t srclen = src_length;
> +    size_t destlen = buflen;
> +
> +    /* Attempt the conversion */
> +    if (iconv(cd, &srcptr, &srclen, &destptr, &destlen) != (size_t)-1) {
> +
> +      /* Conversion succeeded.  Return buffer */
> +      *dest = svn_stringbuf_ncreate (destbuf, buflen - destlen, pool);

And set dest->len here instead of creating a stringbuf.

> +
> +      return SVN_NO_ERROR;
> +    }
> +
> +    /* In case we got here because the buffer was too small,
> +       double the size for the next iteration...              */
> +    buflen *= 2;
> +
> +  } while (errno == E2BIG);
> +
> +  /* Some other error occured. */
> +  return svn_error_create (0, errno, NULL, pool, "recoding string");
> +}
> +

This is a memory optimisation, I don't know if it is significant or
desirable, that depends on how often this function is called. It
optimises by avoiding the second allocation that currently occurs when
the stringbuf is created.  In addition if APR ever provides a realloc
function then the svn_stringbuf_ensure call may be able to expand
in-place when E2BIG occurs.

-- 
Philip

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ok, here's another progress report on inserting the missing UTF-8 code
into Subversion.

I have now completed the patches for clients/cmdline and
libsvn_client.  I have manually inspected every single file in each
and to the best of my knowledge, this is _all_ that is required for
these two components.  The other libs probably still need some more
patches, but I expect to have gone through them as well by mid next
week.  My guess is that only a rather small number of further patches
will be required.

I was thinking a bit about whether this should become a branch, but it
would probably get too messy to merge pretty quickly, so I'd advise
against it.  Better to put it in the trunk directly, with
--disable-utf8 as the default for the time being of course.

The coding guidelines I've been using are as follows:


Inside libsvn_*:

  · All strings are assumed to be UTF-8 encoded

  · Therefore, all system calls (direct, via APR, or via libc)
    involving strings have to use converted versions of the strings.
    The conversion is placed as close to the system call as possible
    unless there is a compelling argument to do otherwise (arguments
    to non-static svn_*-functions must always be UTF-8 though).


Inside clients/cmdline:

  · Strings are assumed to be encoded with the native character
    encoding unless explicitly marked as being UTF-8 coded with a
    comment. 

  · In addition, the following functions return UTF-8 encoded strings:

     - svn_cl__parse_num_args
     - svn_cl__parse_all_args
     - svn_cl__args_to_target_array
     - svn_cl__stringlist_to_array
     - svn_cl__newlinelist_to_array
     - svn_cl__edit_externally (which also expects UTF-8-encoded input)

  · Strings that are not part of the above exceptions are recoded
    before being passed to any svn_* function (except svn_cl__*).
    Strings passed to callbacks from libsvn_* are recoded before use.
    Strings that are UTF-8 encoded as part of the above exceptions but
    need to be printed to stdout or similar are decoded back again.

  (The reason for the various exceptions is that they make the
   foo-cmd.c files much, much cleaner.)


The current diff (against rev 2000) is included as an attachment.  The
diff for all libs is not yet complete, but those for clients/cmdline
and libsvn_client are, and may be unmercifully reviewed.  :-)


  // Marcus



Re: UTF-8

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Ben Collins-Sussman <su...@collab.net> writes:

> UTF-8 was the *plan*, as in, *someday*.  If you look throughout our
> code, you'll find essentially zero regard for i18n, l10n, or UTF-8
> conversions.  Nothing, nada, zip.
> 
> Someday, this will be a big project.  :-)

Yeah, that's why I intend to add some.  It doesn't look like _that_
big a project, the main point is deciding who convers where.  And if
we have decided that strings that go between the client and the svn
libs are UTF-8 encoded, and strings that go between the svn libs and
APR are not, then we have a fairly clean cut.  It's just a matter of
inserting conversions wherever a client calls svn libs, or svn libs
call APR.  (Natually, I'm thinking slices rather than code positions
here.)

Now, considering that the svn libs need to do conversions too, I'm
thinking that it would probably be a good idea to put the actual
conversion code in libsvn_subr.  That would mean having to make an
exception to the rule "strings passed from client to svn libs are
always UTF-8" though, as the strings passed to the "encode as UTF-8"
function are of course not very likely to be UTF-8 encoded already. :)

So, would you agree that an utf.c in libsvn_subr is a good idea?


  // Marcus



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: UTF-8

Posted by Ben Collins-Sussman <su...@collab.net>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> I started looking at inserting the missing UTF-8 translations in the
> cmdline client, but found some strangeness already.  All svn library
> functions take strings in UTF-8 encoding, wasn't that the plan?

Yes, that was the plan.

>  So how come svn_io_check_path() in libsubr just takes its first
> argument (which shoulde hence be an UTF-8 sting) and passes it to
> apr_stat(), which does not operate on UTF-8 strings?  Seems to be
> conversion calls missing inside the libs too...

UTF-8 was the *plan*, as in, *someday*.  If you look throughout our
code, you'll find essentially zero regard for i18n, l10n, or UTF-8
conversions.  Nothing, nada, zip.

Someday, this will be a big project.  :-)


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org