Posted to dev@tomcat.apache.org by Konstantin Preißer <kp...@apache.org> on 2013/09/25 16:52:02 UTC

International characters in source files and SVN commit messages (was: RE:r1525975)

Hi all,

> -----Original Message-----
> From: kpreisser@apache.org [mailto:kpreisser@apache.org]
> Sent: Tuesday, September 24, 2013 9:11 PM

> --- tomcat/site/trunk/xdocs/whoweare.xml (original)
> +++ tomcat/site/trunk/xdocs/whoweare.xml Tue Sep 24 19:10:44 2013
> @@ -100,6 +100,9 @@ A complete list of all the Apache Commit
>  <p><b>Costin Manolache</b> (costin at apache.org)<br/></p>
>  <!--Your bio goes here-->
> 
> +<p><b>Konstantin Preißer</b> (kpreisser at apache.org)<br/></p>

When editing the whoweare.xml, I wrote the "ß" character (sharp s) which is now displayed as "ÃŸ" in the commit message, because the source XML file is encoded in UTF-8 (the default encoding for XML files).

As far as I understand, SVN needs to treat changes in text files at byte-level, not at character-level, to be independent from character encodings. Therefore e.g. ".patch" files don't have a character encoding as they describe changes at byte-level.

However, when the Commit E-Mail is sent, the bytes need to be converted to characters, and it seems the SVN commit diff is interpreted as ISO-8859-1 (or Windows-1252). Therefore, the UTF-8 bytes 0xC3 0x9F are displayed as "ÃŸ", instead of "ß".
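
For illustration, the effect can be reproduced in a few lines of Python. This is only a minimal sketch of the byte-to-character mismatch; it says nothing about how the commit mailer is actually implemented, and the variable names are made up for the example.

```python
# "ß" (U+00DF) encoded as UTF-8 is the two bytes 0xC3 0x9F.
utf8_bytes = "ß".encode("utf-8")
print(utf8_bytes.hex())  # c39f

# Decoding the same two bytes with a single-byte encoding yields two characters.
as_cp1252 = utf8_bytes.decode("cp1252")      # Windows-1252
as_latin1 = utf8_bytes.decode("iso-8859-1")  # ISO-8859-1

print(as_cp1252)        # "Ã" followed by "Ÿ"
print(repr(as_latin1))  # "Ã" followed by the control character U+009F
```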

What would be the preferred way to handle such issues? One way I can think of would be to XML-encode such characters ("ß" as "&#xDF;"). However, personally I would rather not do this, but write such characters directly ("ß"), so that the source is more readable (and encodings like UTF-8 guarantee that the characters are interpreted the same on each system, independently of the system language or geographic location).
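
As a small sketch of why both spellings are equivalent to an XML parser (using Python's standard library parser purely for illustration; the element content is taken from the diff above):

```python
import xml.etree.ElementTree as ET

# The same element, once with the literal character and once with a
# numeric character reference.
literal = ET.fromstring("<b>Konstantin Preißer</b>")
escaped = ET.fromstring("<b>Konstantin Prei&#xDF;er</b>")

# After parsing, both forms carry identical text.
same = literal.text == escaped.text
print(same)
```

The difference is therefore only in how readable the source file is, not in what a consumer of the XML sees.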

Would it be possible to change the SVN Commit E-Mail system so that it interprets diffs as UTF-8 instead of ISO-8859-1 (assuming all files which contain bytes > 0x7F are encoded as UTF-8)? (Or so that it tries to decode them as UTF-8 and, if that fails, decodes them as ISO-8859-1?)

For example, when I use TortoiseSVN to view the unified diff of r1525975, then it prints the "ß" character, so it seems to interpret it as UTF-8.

Can you give me a hint?

Thanks!

Kind regards,
Konstantin Preißer


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Sebb,

On 9/26/13 1:13 PM, sebb wrote:
> On 25 September 2013 17:02, Konstantin Preißer <kp...@apache.org> wrote:
>> Mark,
>>
>>> -----Original Message-----
>>> From: Mark Thomas [mailto:markt@apache.org]
>>> Sent: Wednesday, September 25, 2013 5:54 PM
>>
>>> I'd say yes. Property files are a 'special' case:
>>> http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-
>>> resource-properties-with-resourcebundle
>>
>> OK, thank you for the clarification.
>>
>>> It doesn't bother me but I'm only one committer. I think this falls
>>> under the category if someone cares enough about the commit e-mails
>>> using UTF-8 then they need to work with infra to make that happen. I'm
>>> happy with things as they are.
> 
> There is a property that can be used to change the encoding used by
> the SVN mailer, for example:
> 
> svn:mime-type text/xml; charset=utf-8
> 
> Make sure this agrees with the contents and any xml encoding attribute.

The irony of the above is that text/xml implies that the file contains
an XML declaration (i.e. <?xml?>) which properly contains ...
a character encoding attribute, making the "charset" attribute of the
mime-type both superfluous and, in cases of disagreement, dangerous.

I am -0 for setting mime-type for XML files to something including a
character set because there's no enforcement of that character set
anywhere at all.

-chris


Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by sebb <se...@gmail.com>.
On 26 September 2013 23:29, Konstantin Kolinko <kn...@gmail.com> wrote:
> 2013/9/26 sebb <se...@gmail.com>:
>> On 25 September 2013 17:02, Konstantin Preißer <kp...@apache.org> wrote:
>>> Mark,
>>>
>>>> -----Original Message-----
>>>> From: Mark Thomas [mailto:markt@apache.org]
>>>> Sent: Wednesday, September 25, 2013 5:54 PM
>>>
>>>> I'd say yes. Property files are a 'special' case:
>>>> http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-
>>>> resource-properties-with-resourcebundle
>>>
>>> OK, thank you for the clarification.
>>>
>>>> It doesn't bother me but I'm only one committer. I think this falls
>>>> under the category if someone cares enough about the commit e-mails
>>>> using UTF-8 then they need to work with infra to make that happen. I'm
>>>> happy with things as they are.
>>
>> There is a property that can be used to change the encoding used by
>> the SVN mailer, for example:
>>
>> svn:mime-type text/xml; charset=utf-8
>>
>> Make sure this agrees with the contents and any xml encoding attribute.
>>
>
> -1 for changing svn:mime-type in such a way.
> Placing an encoding into svn:mime-type is wrong, as
> a) It is not portable. (Git does not have svn properties).

There are other svn properties that are required, so that does not make sense.

> b) It is hard to keep in sync.  Beware that case may matter for some
> software (UTF-8 vs utf-8).

How often does the encoding change?

> ( c) You may be relying on an undocumented feature. I remember some
> long discussions several years ago on whether file encoding can be
> part of svn:mime-type, or it should be a separate property, with no
> clear outcome.

See http://opensource.perlig.de/svnmailer/doc-1.0/#groups-charset-property

> http://subversion.tigris.org/issues/show_bug.cgi?id=2329
> http://subversion.tigris.org/issues/show_bug.cgi?id=2194
> )
>
> Regarding whoweare.xml file,  you need to add explicit encoding to the
> top of the file (like it is done in
> tc7.0.x/trunk/webapps/docs/changelog.xml).  Without that I consider
> those files as ISO-8859-1, like the rest of our sources.

The default for XML is UTF-8.

>
> I think commit mailer should treat the files as ISO-8859-1, as such

XML is UTF-8 by default

> interpretation does not lose any data and as that is the format of
> unified diff.

Not sure about those last two assertions.

> In the past there were several cases when accented characters in
> Tomcat's changelog files were corrupted during editing (due to a
> conversion done in someone's editor). It was seen in commit message.
> Last time it happened two or three years ago.

That may be so, but I'm not sure what bearing that has on the svn
commit message encoding.

> http://svn.apache.org/r999983
> http://svn.apache.org/r1196769
>
> As of now, several xml files in Tomcat (those changelogs) are
> officially UTF-8, and I am OK with people using accented characters
> for new text there until something breaks.
> (Personally, I will probably still use numeric entities, as I do not
> have those characters on my keyboard.)
>
> AFAIK, TortoiseSVN diff viewer has some logic to autodetect the use of UTF-8.
>
> Best regards,
> Konstantin Kolinko


RE: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Konstantin Preißer <kp...@apache.org>.
Hi Konstantin,

> -----Original Message-----
> From: Konstantin Kolinko [mailto:knst.kolinko@gmail.com]
> Sent: Friday, September 27, 2013 12:30 AM
> To: Tomcat Developers List
> Subject: Re: International characters in source files and SVN commit
> messages (was: RE:r1525975)
> 
> Regarding whoweare.xml file,  you need to add explicit encoding to the
> top of the file (like it is done in
> tc7.0.x/trunk/webapps/docs/changelog.xml).  Without that I consider
> those files as ISO-8859-1, like the rest of our sources.

Note that for XML files, if the "encoding" flag in the XML declaration is
missing, the encoding is determined by the file's BOM bytes.
If it has none, then the encoding is "UTF-8" [1]. So XML files which
have neither an "encoding" flag nor BOM bytes are UTF-8.
As such, "whoweare.xml" is already in UTF-8 (but personally I prefer to
explicitly declare the UTF-8 encoding in XML files).
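
A minimal sketch of this default, using Python's standard library XML parser as a stand-in for any conforming parser (the sample documents are invented for the example):

```python
import xml.etree.ElementTree as ET

name = "Konstantin Preißer"

# No XML declaration and no BOM: the parser assumes UTF-8.
utf8_doc = b"<b>" + name.encode("utf-8") + b"</b>"
parsed_default = ET.fromstring(utf8_doc).text

# ISO-8859-1 bytes are fine when the declaration says so.
latin1_doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?><b>'
    + name.encode("iso-8859-1")
    + b"</b>"
)
parsed_declared = ET.fromstring(latin1_doc).text

# The same ISO-8859-1 bytes without a declaration are not valid UTF-8,
# so the parser rejects the document.
try:
    ET.fromstring(b"<b>" + name.encode("iso-8859-1") + b"</b>")
    rejected = False
except ET.ParseError:
    rejected = True

print(parsed_default == name, parsed_declared == name, rejected)
```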

> In the past there were several cases when accented characters in
> Tomcat's changelog files were corrupted during editing (due to a
> conversion done in someone's editor). It was seen in commit message.
> Last time it happened two or three years ago.
> 
> http://svn.apache.org/r999983
> http://svn.apache.org/r1196769
> 
> As of now, several xml files in Tomcat (those changelogs) are
> officially UTF-8, and I am OK with people using accented characters
> for new text there until something breaks.
> (Personally, I will probably still use numeric entities, as I do not
> have those characters on my keyboard.)
> 
> AFAIK, TortoiseSVN diff viewer has some logic to autodetect the use of
> UTF-8.

Yes, I guess this is the "if it doesn't have a BOM, try to decode as UTF-8; if
that fails, decode as ANSI/ISO-8859-1" logic which I mentioned in another mail.
E.g., when a diff contains the text "aÃŸa" in ISO-8859-1, it will display it
as "aßa" (UTF-8), but when it contains "aÃŸaßa" in ISO-8859-1, then it
displays that text unchanged. This also seems to be used e.g. by Notepad++.
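
The ambiguity can be sketched as follows. `decode_guessing` is a hypothetical helper written for this mail, not TortoiseSVN's or Notepad++'s actual code, and it uses Windows-1252 as the fallback so that every character in the sample text is representable.

```python
def decode_guessing(data: bytes) -> str:
    """Hypothetical helper: try UTF-8 first, fall back to Windows-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")


# Single-byte text whose bytes (61 C3 9F 61) happen to be valid UTF-8
# is misread as UTF-8.
misread = decode_guessing("aÃŸa".encode("cp1252"))

# Adding a lone 0xDF byte makes the data invalid as UTF-8, so the
# fallback is used and the text is shown as written.
kept = decode_guessing("aÃŸaßa".encode("cp1252"))

print(misread)  # aßa
print(kept)     # aÃŸaßa
```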

I think such a logic could also be used by the commit mailer to decide if
the text is UTF-8 or ISO-8859-1 for better readability, but I have no strong
preference for it.


Kind regards,
Konstantin Preißer
 

[1] http://www.opentag.com/xfaq_enc.htm#enc_default


Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Konstantin Kolinko <kn...@gmail.com>.
2013/9/26 sebb <se...@gmail.com>:
> On 25 September 2013 17:02, Konstantin Preißer <kp...@apache.org> wrote:
>> Mark,
>>
>>> -----Original Message-----
>>> From: Mark Thomas [mailto:markt@apache.org]
>>> Sent: Wednesday, September 25, 2013 5:54 PM
>>
>>> I'd say yes. Property files are a 'special' case:
>>> http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-
>>> resource-properties-with-resourcebundle
>>
>> OK, thank you for the clarification.
>>
>>> It doesn't bother me but I'm only one committer. I think this falls
>>> under the category if someone cares enough about the commit e-mails
>>> using UTF-8 then they need to work with infra to make that happen. I'm
>>> happy with things as they are.
>
> There is a property that can be used to change the encoding used by
> the SVN mailer, for example:
>
> svn:mime-type text/xml; charset=utf-8
>
> Make sure this agrees with the contents and any xml encoding attribute.
>

-1 for changing svn:mime-type in such a way.
Placing an encoding into svn:mime-type is wrong, as
a) It is not portable. (Git does not have svn properties).
b) It is hard to keep in sync.  Beware that case may matter for some
software (UTF-8 vs utf-8).

( c) You may be relying on an undocumented feature. I remember some
long discussions several years ago on whether file encoding can be
part of svn:mime-type, or it should be a separate property, with no
clear outcome.
http://subversion.tigris.org/issues/show_bug.cgi?id=2329
http://subversion.tigris.org/issues/show_bug.cgi?id=2194
)

Regarding whoweare.xml file,  you need to add explicit encoding to the
top of the file (like it is done in
tc7.0.x/trunk/webapps/docs/changelog.xml).  Without that I consider
those files as ISO-8859-1, like the rest of our sources.


I think commit mailer should treat the files as ISO-8859-1, as such
interpretation does not lose any data and as that is the format of
unified diff.
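
The "does not lose any data" part can be checked with a short sketch (Python's "iso-8859-1" codec maps every byte value to exactly one character; whether ISO-8859-1 is really the format of unified diffs is a separate question that this does not address):

```python
# Every possible byte value decodes as ISO-8859-1 and encodes back unchanged.
every_byte = bytes(range(256))
round_trip = every_byte.decode("iso-8859-1").encode("iso-8859-1")
lossless = round_trip == every_byte

# By contrast, decoding invalid input as UTF-8 with replacement characters
# cannot be reversed: the original byte 0xE4 is gone.
mangled = b"\xe4".decode("utf-8", errors="replace").encode("utf-8")

print(lossless)          # True
print(mangled.hex())     # efbfbd
```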

In the past there were several cases when accented characters in
Tomcat's changelog files were corrupted during editing (due to a
conversion done in someone's editor). It was seen in commit message.
Last time it happened two or three years ago.

http://svn.apache.org/r999983
http://svn.apache.org/r1196769

As of now, several xml files in Tomcat (those changelogs) are
officially UTF-8, and I am OK with people using accented characters
for new text there until something breaks.
(Personally, I will probably still use numeric entities, as I do not
have those characters on my keyboard.)

AFAIK, TortoiseSVN diff viewer has some logic to autodetect the use of UTF-8.

Best regards,
Konstantin Kolinko


Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by sebb <se...@gmail.com>.
On 25 September 2013 17:02, Konstantin Preißer <kp...@apache.org> wrote:
> Mark,
>
>> -----Original Message-----
>> From: Mark Thomas [mailto:markt@apache.org]
>> Sent: Wednesday, September 25, 2013 5:54 PM
>
>> I'd say yes. Property files are a 'special' case:
>> http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-
>> resource-properties-with-resourcebundle
>
> OK, thank you for the clarification.
>
>> It doesn't bother me but I'm only one committer. I think this falls
>> under the category if someone cares enough about the commit e-mails
>> using UTF-8 then they need to work with infra to make that happen. I'm
>> happy with things as they are.

There is a property that can be used to change the encoding used by
the SVN mailer, for example:

svn:mime-type text/xml; charset=utf-8

Make sure this agrees with the contents and any xml encoding attribute.
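
A rough sketch of such a consistency check, assuming the property value and the first bytes of the file are already at hand. The helper names (`mime_charset`, `xml_declared_encoding`, `charsets_agree`) are hypothetical and are not part of Subversion or the SVN mailer.

```python
import codecs
import re
from email.message import Message


def mime_charset(prop_value: str):
    """Extract the charset parameter from an svn:mime-type style value."""
    msg = Message()
    msg["Content-Type"] = prop_value
    return msg.get_content_charset()  # lower-cased, or None if absent


def xml_declared_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    return match.group(1).decode("ascii") if match else None


def charsets_agree(prop_value: str, head: bytes) -> bool:
    """Compare both names after normalising aliases and case."""
    declared = xml_declared_encoding(head) or "utf-8"  # XML default, BOM ignored
    charset = mime_charset(prop_value)
    if charset is None:
        return False
    return codecs.lookup(charset).name == codecs.lookup(declared).name


ok = charsets_agree("text/xml; charset=utf-8",
                    b'<?xml version="1.0" encoding="UTF-8"?>')
mismatch = charsets_agree("text/xml; charset=utf-8",
                          b'<?xml version="1.0" encoding="ISO-8859-1"?>')
print(ok, mismatch)
```

Comparing through `codecs.lookup` sidesteps the UTF-8 vs utf-8 spelling difference; an unknown charset name would raise `LookupError`, which this sketch does not handle.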

> OK.
>
> Thanks!
>
> Konstantin


RE: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Konstantin Preißer <kp...@apache.org>.
Mark,

> -----Original Message-----
> From: Mark Thomas [mailto:markt@apache.org]
> Sent: Wednesday, September 25, 2013 5:54 PM

> I'd say yes. Property files are a 'special' case:
> http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-
> resource-properties-with-resourcebundle

OK, thank you for the clarification.

> It doesn't bother me but I'm only one committer. I think this falls
> under the category if someone cares enough about the commit e-mails
> using UTF-8 then they need to work with infra to make that happen. I'm
> happy with things as they are.

OK.

Thanks!

Konstantin


Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Mark Thomas <ma...@apache.org>.
On 25/09/2013 08:36, Konstantin Preißer wrote:
> Hi Mark,
> 
> thanks for the reply.
> 
>> -----Original Message-----
>> From: Mark Thomas [mailto:markt@apache.org]
>> Sent: Wednesday, September 25, 2013 5:01 PM
> 
>>> One way I can think of would be to XML-encode such characters ("ß"
>>> as "&#xDF;"). However, personally I would rather not do this, but
>>> write such characters directly ("ß"), so that the source is
>>> better readable (and encodings like UTF-8 guarantee that the
>>> characters are interpreted the same on each system, independently
>>> from the system language or geographic location).
>> 
>> I don't like the idea of using XML encoding at all.
> 
> Just to avoid a misunderstanding, with "XML encoding" you mean
> numeric character references like &#nnn; ?

Yes.

>>> Could it be possible to change SVN Commit E-Mail system so that
>>> it may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming
>>> all files which contain bytes > 0x7F are encoded as UTF-8)? (Or,
>>> that it tries to decode it as UTF-8, and if it fails, decode it
>>> as ISO-8859-1 ?)
>> 
>> This is a question for infra. If UTF-8 fails then ISO-8859-1 is
>> going to fail as well.
> 
> I mean, to guess a character encoding by first decoding it as UTF-8,
> and if it fails, assume the file was encoded as
> ISO-8859-1/Windows-1252. This approach seems to be used by some
> programs to decide if the file was encoded as UTF-8 or as ANSI when
> it doesn't have BOM bytes.
> 
> For example, consider a file that contains only ASCII characters (<=
> 0x7F) stored as single-byte-per-character. As UTF-8 is
> ASCII-compatible, you will get the same results if you interpret it as
> UTF-8 or as ISO-8859-1.
> 
> However, if you have a file that contains "äöü" (German umlaut
> characters) as ISO-8859-1 (Bytes: E4 F6 FC), then UTF-8 decoding will
> fail because the bytes after the one which starts with 11xxxxxx
> (binary) don't start with 10xxxxxx; but decoding as ISO-8859-1 will
> succeed.
> 
> This approach to guess the encoding (UTF-8 vs.
> ISO-8859-1/Windows-1252) seems to be used by programs like Notepad++
> when opening text files without a BOM, and by TortoiseSVN when
> displaying file changes, and seems to be working well if you have
> files with either UTF-8 or ISO-8859-1/Windows-1252 (or other local
> encodings). Of course, this will not always work, e.g. if your text
> file that is encoded with ISO-8859-1 actually contains text like
> "ÃŸ". (Personally, for my projects I use UTF-8 for everything :) )
> 
> 
> I was asking because I saw some i18n files like
> "LocalStrings_ja.properties" that encode non-ASCII characters with
> "\uXXXX", and I'd like to know if it is okay to put characters like "ß"
> in the XML file without encoding them by a numeric character
> reference,

I'd say yes. Property files are a 'special' case:
http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-resource-properties-with-resourcebundle
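
For context, the "\uXXXX" form seen in the properties files can be produced mechanically. The following is a minimal sketch with a hypothetical helper, `to_properties_escapes`, written for this mail; it is not the tool the project uses.

```python
def to_properties_escapes(text: str) -> str:
    """Hypothetical helper: escape non-ASCII characters as \\uXXXX."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        # Go through UTF-16 so that characters outside the BMP become
        # a surrogate pair, i.e. two \uXXXX escapes.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


escaped = to_properties_escapes("Preißer")
print(escaped)  # Prei\u00DFer
```

The result is pure ASCII, so it reads the same whichever single-byte or UTF-8 decoding is applied to the file.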

> while the Commit E-Mails don't use UTF-8. If you are okay
> with this, then I don't mind changing the encoding for the SVN Commit
> E-Mails.

It doesn't bother me but I'm only one committer. I think this falls
under the category if someone cares enough about the commit e-mails
using UTF-8 then they need to work with infra to make that happen. I'm
happy with things as they are.

Mark


RE: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Konstantin Preißer <kp...@apache.org>.
> If you are okay with this, then I don't mind changing the encoding for the SVN
> Commit E-Mails.

Sorry; I meant "then I don't care about changing it".

Konstantin


RE: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Konstantin Preißer <kp...@apache.org>.
Hi Mark,

thanks for the reply.

> -----Original Message-----
> From: Mark Thomas [mailto:markt@apache.org]
> Sent: Wednesday, September 25, 2013 5:01 PM

> > One way I can
> > think of would be to XML-encode such characters ("ß" as "&#xDF;").
> > However, personally I would rather not do this, but write such
> > characters directly ("ß"), so that the source is better readable (and
> > encodings like UTF-8 guarantee that the characters are interpreted
> > the same on each system, independently from the system language or
> > geographic location).
> 
> I don't like the idea of using XML encoding at all.

Just to avoid a misunderstanding, with "XML encoding" you mean numeric character references like &#nnn; ?


> > Could it be possible to change SVN Commit E-Mail system so that it
> > may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming all
> > files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> > tries to decode it as UTF-8, and if it fails, decode it as ISO-8859-1
> > ?)
> 
> This is a question for infra. If UTF-8 fails then ISO-8859-1 is going to
> fail as well.

I mean, to guess a character encoding by first decoding it as UTF-8, and if it fails, assume the file was encoded as ISO-8859-1/Windows-1252. This approach seems to be used by some programs to decide if the file was encoded as UTF-8 or as ANSI when it doesn't have BOM bytes.

For example, consider a file that contains only ASCII characters (<= 0x7F) stored as single-byte-per-character. As UTF-8 is ASCII-compatible, you will get the same results if you interpret it as UTF-8 or as ISO-8859-1.

However, if you have a file that contains "äöü" (German umlaut characters) as ISO-8859-1 (bytes: E4 F6 FC), then UTF-8 decoding will fail because the bytes after the one which starts with 11xxxxxx (binary) don't start with 10xxxxxx; but decoding as ISO-8859-1 will succeed.
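
The bit patterns in this example can be checked with a short sketch (the helper `is_continuation` is made up for illustration):

```python
data = "äöü".encode("iso-8859-1")           # bytes E4 F6 FC
bits = [format(b, "08b") for b in data]
print(bits)  # ['11100100', '11110110', '11111100']


def is_continuation(byte: int) -> bool:
    """True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


# 0xE4 starts with 11, so UTF-8 expects continuation bytes after it,
# but 0xF6 does not start with 10.
follows_rule = is_continuation(data[1])

try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The ISO-8859-1 fallback always succeeds.
fallback = data.decode("iso-8859-1")
print(follows_rule, valid_utf8, fallback)
```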

This approach to guessing the encoding (UTF-8 vs. ISO-8859-1/Windows-1252) seems to be used by programs like Notepad++ when opening text files without a BOM, and by TortoiseSVN when displaying file changes, and seems to work well if you have files with either UTF-8 or ISO-8859-1/Windows-1252 (or other local encodings). Of course, this will not always work, e.g. if your text file that is encoded with ISO-8859-1 actually contains text like "ÃŸ". (Personally, for my projects I use UTF-8 for everything :) )


I was asking because I saw some i18n files like "LocalStrings_ja.properties" that encode non-ASCII characters with "\uXXXX", and I'd like to know if it is okay to put characters like "ß" in the XML file without encoding them as numeric character references, while the Commit E-Mails don't use UTF-8. If you are okay with this, then I don't mind changing the encoding for the SVN Commit E-Mails.

Thanks!

Konstantin


Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Posted by Mark Thomas <ma...@apache.org>.
On 25/09/2013 07:52, Konstantin Preißer wrote:
> Hi all,
> 
>> -----Original Message-----
>> From: kpreisser@apache.org [mailto:kpreisser@apache.org]
>> Sent: Tuesday, September 24, 2013 9:11 PM
> 
>> --- tomcat/site/trunk/xdocs/whoweare.xml (original)
>> +++ tomcat/site/trunk/xdocs/whoweare.xml Tue Sep 24 19:10:44 2013
>> @@ -100,6 +100,9 @@ A complete list of all the Apache Commit
>>  <p><b>Costin Manolache</b> (costin at apache.org)<br/></p>
>>  <!--Your bio goes here-->
>> 
>> +<p><b>Konstantin Preißer</b> (kpreisser at apache.org)<br/></p>
> 
> When editing the whoweare.xml, I wrote the "ß" character (sharp s)
> which is now displayed as "ÃŸ" in the commit message, because the
> source XML file is encoded in UTF-8 (the default encoding for XML
> files).
> 
> As far as I understand, SVN needs to treat changes in text files at
> byte-level, not at character-level, to be independent from character
> encodings. Therefore e.g. ".patch" files don't have a character
> encoding as they describe changes at byte-level.
> 
> However, when the Commit E-Mail is sent, the bytes need to be
> converted to characters, and it seems the SVN commit diff is
> interpreted as ISO-8859-1 (or Windows-1252). Therefore, the UTF-8
> bytes 0xC3 0x9F are displayed as "ÃŸ", instead of "ß".
> 
> What would be the preferred way to handle such issues? One way I can
> think of would be to XML-encode such characters ("ß" as "&#xDF;").
> However, personally I would rather not do this, but write such
> characters directly ("ß"), so that the source is better readable (and
> encodings like UTF-8 guarantee that the characters are interpreted
> the same on each system, independently from the system language or
> geographic location).

I don't like the idea of using XML encoding at all.

> Could it be possible to change SVN Commit E-Mail system so that it
> may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming all
> files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> tries to decode it as UTF-8, and if it fails, decode it as ISO-8859-1
> ?)

This is a question for infra. If UTF-8 fails then ISO-8859-1 is going to
fail as well.

Mark
