You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Peter Samuelson <pe...@p12n.org> on 2006/04/21 12:00:38 UTC

[PATCH] $LastChangedDate$ encoding

The $LastChangedDate$ keyword is localized to the correct language but
not to the correct encoding.  This patch seems to help - comments?
I tested with LC_CTYPE=hu_HU.iso88592.  Effect of patch:

  -$LastChangedDate: 2006-04-21 06:46:54 -0500 (p, 21 ápr 2006) $
  +$LastChangedDate: 2006-04-21 06:46:54 -0500 (p, 21 ápr 2006) $

I didn't check for other instances of the same bug.  I seem to recall a
bit of controversy recently regarding repository paths in keywords, but
I don't remember the conclusion.

[[[
Fix encoding of LastChangedDate keyword expansion.

* subversion/libsvn_subr/subst.c (keyword_printf):
  Convert the long-format revision date to the native encoding.
]]]

Index: subversion/libsvn_subr/subst.c
===================================================================
--- subversion/libsvn_subr/subst.c	(revisione 19429)
+++ subversion/libsvn_subr/subst.c	(copia locale)
@@ -224,8 +224,12 @@
           break;
         case 'D': /* long format of date of this revision */
           if (date)
-            svn_stringbuf_appendcstr(value,
-                                     svn_time_to_human_cstring(date, pool));
+            {
+              char *date_keyword;
+              char *date_utf8 = svn_time_to_human_cstring(date, pool);
+              svn_utf_cstring_from_utf8(&date_keyword, date_utf8, pool);
+              svn_stringbuf_appendcstr(value, date_keyword);
+            }
           break;
         case 'r': /* number of this revision */
           if (rev)

Re: [PATCH] $LastChangedDate$ encoding

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
Peter Samuelson writes:
 > 
 > Ivan argued for not localising the keyword expansion at all, but using
 > English with ASCII.  [Note: I think that is reasonable, but perhaps not
 > easy to implement efficiently, as it may involve switching locales.]
 > 
 > 

I thin we should go this route.  Specifically, we should drop the
human readable pdate part and use the same format as Id uses.  I think
it was a mistake to use a locale-dependent date format here.

This does not solve the problem, because we still use UTF8 for the
basename and author.  I think the best we can do is to stay with UTF8
until we know the encoding of the file - it will at least give
consistent results.

Changing the $Date$ expansion will break compatibility, but we will
keep the same format for the machine-parsable part as we have today.

I'd be +1 on a backport of this to 1.4.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] $LastChangedDate$ encoding

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2006-05-07 09:07:14 -0500, Peter Samuelson wrote:
> [Vincent Lefevre]
> > > The encoding should be consistent with filenames, which are also
> > > specific to a WC.
> > 
> > There's absolutely no reason why they should be the same.
> 
> I gave you a reason earlier.  There are many situations where you want
> to embed a filename inside a file.  (I said this in the context of XML
> files, but that's by no means the only example.)

You can still include UTF-8 encoded filenames. Anyway, how filenames
are encoded on the disk is OS / file system specific (e.g., under
Mac OS X / HFS+, it *must* be UTF-8). However, I don't think the
encoding used in the file contents should be OS specific.

> >     * Using UTF-8 (current behavior):
> >       + Pros: fixed encoding; no loss; compatible with file formats
> >         based on UTF-8, which are common (UTF-8 is more or less the
> >         default encoding nowadays).
> >       + Cons: may be incompatible with some documents.
> 
> Also may be incompatible with user expectations.

This is true for *any* choice, so that I didn't bother to mention it.

> I daresay it is very common to use Subversion in an environment where
> either you're only a single user, or all users have the same locale
> settings.

This may happen, but the opposite is much more common. All those
Subversion repositories publicly available on the Internet can be
accessed by anyone, with all the possible locales...

Even in a country, many encodings can be used. For instance, in France,
the most common are ISO-8859-1, ISO-8859-15 and UTF-8 (more and more
users are switching to UTF-8, but many of them are still using one of
the first two).

> Subversion localises everything very well - users never have
> to know or care that it is thinking in UTF-8 under the hood.  The only
> instance I know where it does not do this is in keyword expansions.

This is not true. With publicly available repositories, files may be
in various encodings (mainly ISO-8859-1 and UTF-8), and Subversion
won't convert their contents into the locale encoding.

> Also, you seem to assume that the common case is files with a
> well-defined encoding, like XML documents.  I doubt that.  I guess it
> is more common to use Subversion to store text documents and program
> source code, not XML.  And program source code rarely has a
> well-defined encoding; typically users write their comments in the same
> encoding they are using for the rest of their computing.

This may be true when a user writes a file for himself, but for
sources that are shared amongst many users (such as all free
software), this is no longer true.

> (Side note: if it were really true that "UTF-8 is more or less the
> default encoding nowadays", then this whole question would be a
> non-issue, as users would all be using UTF-8 for LC_CTYPE.)

No. LC_CTYPE may have some value, but files may have other encodings.
This is particularly true with XML, but can occur with other files
(e.g. ChangeLogs in Debian are encoded in UTF-8). Again, remember
that files may be shared, in particular when using Subversion.

> >     * Using the encoding specified by the locales:
> >       + Pros: compatible with tools that don't understand encodings
> >         different from the one specified by the locales.
> 
> Which is to say, most tools with most file formats.

I'd say the opposite. But this really depends on what tools people
are using.

> At least on my Unix box, very few tools I use automatically recode
> file content when outputting to my terminal. I can only think of
> vorbiscomment and iconv.

I can cite: emacs, mutt, various web browsers, screen.

> (And vorbiscomment doesn't count - I don't think you can put
> keywords into ogg vorbis files, since they expand to variable
> lengths.)

No problem. From the Subversion book:

  Subversion 1.2 introduced a new variant of the keyword syntax which
  brought additional, useful—though perhaps atypical—functionality.
  You can now tell Subversion to maintain a fixed length (in terms of
  the number of bytes consumed) for the substituted keyword. By using
  a double-colon (::) after the keyword name, followed by a number of
  space characters, you define that fixed width. When Subversion goes
  to substitute your keyword for the keyword and its value, it will
  essentially replace only those space characters, leaving the overall
  width of the keyword field unchanged. If the substituted value is
  shorter than the defined field width, there will be extra padding
  characters (spaces) at the end of the substituted field; if it is
  too long, it is truncated with a special hash (#) character just
  before the final dollar sign terminator.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] $LastChangedDate$ encoding

Posted by Peter Samuelson <pe...@p12n.org>.
[Vincent Lefevre]
> > The encoding should be consistent with filenames, which are also
> > specific to a WC.
> 
> There's absolutely no reason why they should be the same.

I gave you a reason earlier.  There are many situations where you want
to embed a filename inside a file.  (I said this in the context of XML
files, but that's by no means the only example.)

>     * Using UTF-8 (current behavior):
>       + Pros: fixed encoding; no loss; compatible with file formats
>         based on UTF-8, which are common (UTF-8 is more or less the
>         default encoding nowadays).
>       + Cons: may be incompatible with some documents.

Also may be incompatible with user expectations.

I daresay it is very common to use Subversion in an environment where
either you're only a single user, or all users have the same locale
settings.  Subversion localises everything very well - users never have
to know or care that it is thinking in UTF-8 under the hood.  The only
instance I know where it does not do this is in keyword expansions.

Also, you seem to assume that the common case is files with a
well-defined encoding, like XML documents.  I doubt that.  I guess it
is more common to use Subversion to store text documents and program
source code, not XML.  And program source code rarely has a
well-defined encoding; typically users write their comments in the same
encoding they are using for the rest of their computing.

In short, I bet it is _more_ common for a file under version control to
match the encoding of a user's LC_CTYPE than it is for the file to be
UTF-8 when the user's LC_CTYPE is not.

(Side note: if it were really true that "UTF-8 is more or less the
default encoding nowadays", then this whole question would be a
non-issue, as users would all be using UTF-8 for LC_CTYPE.)


>     * Using the encoding specified by the locales:
>       + Pros: compatible with tools that don't understand encodings
>         different from the one specified by the locales.

Which is to say, most tools with most file formats.  At least on my
Unix box, very few tools I use automatically recode file content when
outputting to my terminal.  I can only think of vorbiscomment and
iconv.  (And vorbiscomment doesn't count - I don't think you can put
keywords into ogg vorbis files, since they expand to variable lengths.)

Re: [PATCH] $LastChangedDate$ encoding

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2006-05-07 05:40:40 -0500, Peter Samuelson wrote:
[...]
> Furthermore, I view keyword expansion as an action specific to a WC.

I disagree, in particular because the content encoding is fixed,
and Subversion doesn't convert the encoding used on the repository
side into the user's locale encoding (and I think it shouldn't
IMHO, possibly except in a future extension, just like svn:eol).

> The encoding should be consistent with filenames, which are also
> specific to a WC.

There's absolutely no reason why they should be the same.

> Ivan argued for not localising the keyword expansion at all, but using
> English with ASCII.  [Note: I think that is reasonable, but perhaps not
> easy to implement efficiently, as it may involve switching locales.]

Well, Subversion already uses a locale-independent format for the date
in the Id keyword.

> The reason I bring this up is that I applied my patch to the Debian
> unstable version of svn 1.3.1, and I got an angry complaint about it,

[from me]

> partly because my package is now inconsistent with official svn.  I
> view my patch as a bug fix, but if consensus is reached that the status
> quo really is better, I'll seriously consider reverting the patch.
> 
> Further arguments: http://bugs.debian.org/290774

I've posted the pros and the cons of various solutions before advanced
keyword expansion is implemented:

    * Using UTF-8 (current behavior):
      + Pros: fixed encoding; no loss; compatible with file formats
        based on UTF-8, which are common (UTF-8 is more or less the
        default encoding nowadays).
      + Cons: may be incompatible with some documents.

    * Using US-ASCII (transliteration):
      + Pros: fixed encoding; compatible with any encoding (except
        EBCDIC, but this one is not tractable) and any file format.
      + Cons: small loss for non-ASCII characters.

    * Using the encoding specified by the locales:
      + Pros: compatible with tools that don't understand encodings
        different from the one specified by the locales.
      + Cons: all the documents using keywords should have the same
        encoding; also requires every user of the repository to use
        the same locales or compatible ones (which may require root
        access to install them, or may not even be available on some
        OS's); if externals are used, the corresponding repositories
        should assume compatible encodings; not backward compatible.

The first 2 are OK for me, not the 3rd one. Also, if this is
configurable, it is also OK for me.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] $LastChangedDate$ encoding

Posted by Peter Samuelson <pe...@p12n.org>.
Resurrecting this issue....  This thread died 2 weeks ago; I'd like it
to be resolved one way or another.  Could people comment further on
this and make some sort of decision?

I posted a patch for the $LastChangedDate$ keyword expansion in a WC,
to convert the expansion to the user's LC_CTYPE encoding.  The
expansion is already performed in the user's native language; it is my
view that it is inconsistent to localise to a native language (and
native date format) but _not_ to a native encoding.

Furthermore, I view keyword expansion as an action specific to a WC.
The encoding should be consistent with filenames, which are also
specific to a WC.


Julian (and Malcolm, speaking on IRC) basically agree with me.

Brane argued for the status quo: use the user's own language and date
format, but render it in UTF-8.  His point is that the desired encoding
of the file is not known, so a fixed default should be used.

Ivan argued for not localising the keyword expansion at all, but using
English with ASCII.  [Note: I think that is reasonable, but perhaps not
easy to implement efficiently, as it may involve switching locales.]


The reason I bring this up is that I applied my patch to the Debian
unstable version of svn 1.3.1, and I got an angry complaint about it,
partly because my package is now inconsistent with official svn.  I
view my patch as a bug fix, but if consensus is reached that the status
quo really is better, I'll seriously consider reverting the patch.

Further arguments: http://bugs.debian.org/290774

Thanks,
Peter

Re: [PATCH] $LastChangedDate$ encoding

Posted by Ivan Zhakov <ch...@gmail.com>.
On 4/23/06, Peter Samuelson <pe...@p12n.org> wrote:
>
> [Branko Cibej]
> > That's really not correct. Keywords should be expanded in the file's
> > internal encoding, not the locale encoding. Since we don't *know* the
> > file's encoding, UTF-8 is marginally better than something random,
> > IMHO.
>
> Let's take that further, then, and say that keywords should be expanded
> in the file's internal _language_, which we also don't know.  Should we
> always expand them as POSIX (English), then?
>
> Using the user's locale for the date _format_, but not for the date
> _encoding_, makes little sense to me.
>
+1 for always expanding keywords as English.

--
Ivan Zhakov

Re: [PATCH] $LastChangedDate$ encoding

Posted by Jesper Steen Møller <je...@selskabet.org>.
Peter Samuelson wrote:
> [Branko Cibej]
>   
>> That's really not correct. Keywords should be expanded in the file's
>> internal encoding, not the locale encoding. Since we don't *know* the
>> file's encoding, UTF-8 is marginally better than something random,
>> IMHO.
>>     
> [...]
>
> In my opinion, the keyword expansions are specific to the WC, and as
> such, should match the encoding of other things specific to the WC -
> like filenames.  It should be safe to assume that the user's editors
> and pagers and other tools will default to the encoding from the user's
> environment - at least, that seems to be safer than assuming that the
> user's tools will all treat files as UTF-8.
>   
Another use for the proposal to let Subversion handle file content 
encoding, see <http://svn.haxx.se/dev/archive-2006-03/1182.shtml>.
(I'm still willing to work on this, but the reception was mixed.)

-Jesprr



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] $LastChangedDate$ encoding

Posted by Julian Foad <ju...@btopenworld.com>.
Peter Samuelson wrote:
> [Branko Cibej]
> 
>>That's really not correct. Keywords should be expanded in the file's
>>internal encoding, not the locale encoding. Since we don't *know* the
>>file's encoding, UTF-8 is marginally better than something random,
>>IMHO.
> 
> 
> Let's take that further, then, and say that keywords should be expanded
> in the file's internal _language_, which we also don't know.  Should we
> always expand them as POSIX (English), then?
> 
> Using the user's locale for the date _format_, but not for the date
> _encoding_, makes little sense to me.
> 
> In my opinion, the keyword expansions are specific to the WC, and as
> such, should match the encoding of other things specific to the WC -
> like filenames.  It should be safe to assume that the user's editors
> and pagers and other tools will default to the encoding from the user's
> environment - at least, that seems to be safer than assuming that the
> user's tools will all treat files as UTF-8.

That makes sense.  +1 on using the same locale for both the format and the 
encoding of the keywords.  (I just saw Ivan's "+1 to English" - with UTF-8 - 
which I don't think I agree with, but it's a possibility and is better than 
mixed locales.)

- Julian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] $LastChangedDate$ encoding

Posted by Peter Samuelson <pe...@p12n.org>.
[Branko Cibej]
> That's really not correct. Keywords should be expanded in the file's
> internal encoding, not the locale encoding. Since we don't *know* the
> file's encoding, UTF-8 is marginally better than something random,
> IMHO.

Let's take that further, then, and say that keywords should be expanded
in the file's internal _language_, which we also don't know.  Should we
always expand them as POSIX (English), then?

Using the user's locale for the date _format_, but not for the date
_encoding_, makes little sense to me.

In my opinion, the keyword expansions are specific to the WC, and as
such, should match the encoding of other things specific to the WC -
like filenames.  It should be safe to assume that the user's editors
and pagers and other tools will default to the encoding from the user's
environment - at least, that seems to be safer than assuming that the
user's tools will all treat files as UTF-8.

Re: [PATCH] $LastChangedDate$ encoding

Posted by Branko Čibej <br...@xbc.nu>.
Peter Samuelson wrote:
> The $LastChangedDate$ keyword is localized to the correct language but
> not to the correct encoding.  This patch seems to help - comments?
> I tested with LC_CTYPE=hu_HU.iso88592.  Effect of patch:
>
>   -$LastChangedDate: 2006-04-21 06:46:54 -0500 (p, 21 ápr 2006) $
>   +$LastChangedDate: 2006-04-21 06:46:54 -0500 (p, 21 ápr 2006) $
>
> I didn't check for other instances of the same bug.  I seem to recall a
> bit of controversy recently regarding repository paths in keywords, but
> I don't remember the conclusion.
>
> [[[
> Fix encoding of LastChangedDate keyword expansion.
>
> * subversion/libsvn_subr/subst.c (keyword_printf):
>   Convert the long-format revision date to the native encoding.
> ]]]
>
> Index: subversion/libsvn_subr/subst.c
> ===================================================================
> --- subversion/libsvn_subr/subst.c	(revisione 19429)
> +++ subversion/libsvn_subr/subst.c	(copia locale)
> @@ -224,8 +224,12 @@
>            break;
>          case 'D': /* long format of date of this revision */
>            if (date)
> -            svn_stringbuf_appendcstr(value,
> -                                     svn_time_to_human_cstring(date, pool));
> +            {
> +              char *date_keyword;
> +              char *date_utf8 = svn_time_to_human_cstring(date, pool);
> +              svn_utf_cstring_from_utf8(&date_keyword, date_utf8, pool);
> +              svn_stringbuf_appendcstr(value, date_keyword);
> +            }
>   
That's really not correct. Keywords should be expanded in the file's 
internal encoding, not the locale encoding. Since we don't *know* the 
file's encoding, UTF-8 is marginally better than something random, IMHO.

-- Brane



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org