Posted to dev@subversion.apache.org by Ivan Cenov <i_...@botevgrad.com> on 2010/08/24 19:19:43 UTC

About character encoding of the text files

 Hello,

This is my first post to this list. I was directed here from another
thread on the ViewVC site (http://viewvc.tigris.org/issues/show_bug.cgi?id=11).

The original problem is that ViewVC cannot correctly display text files
that contain Cyrillic characters (character set windows-1251). (The same
issue affects Western European characters too.)
I was told that ViewVC cannot do this because it lacks encoding
information. That information should come from Subversion, and
Subversion could have it if users supplied it.
The last posts in the thread mentioned above give more information about
the problem.

As I understand it, character encoding information could be supplied as
an svn: property, say, svn:encoding encoding_type. For example:
svn:encoding windows-1251.

So, do the Subversion developers and users have any intention of defining
such a property? Would it be a reliable way to do this?
If there is an issue filed for this problem, what is its priority?

-- 

Regards,

Ivan Cenov
OKTO-7 Co., Botevgrad, Bulgaria
i_cenov@botevgrad.com, imc@okto7.com
   GSM: +359 888 76 10 80
phone: +359 723 6 61 20, +359 723 6 61 61
   fax: +359 723 6 62 62

Re: About character encoding of the text files

Posted by Ivan Cenov <i_...@botevgrad.com>.
On 26.8.2010 at 17:26, Peter Samuelson wrote:
> Did you apply Mike's recent fix to ViewVC?  He described it:
No, I did not. I prefer to wait for official updates, which can be viewed at
http://mysite:3343/csvn/packagesUpdate/available
I am not very fluent with this software, so I am afraid of making a mess
instead of something good.
> | So, Ivan, if you missed my commit to ViewVC yesterday, the trunk and
> | 1.1.x branch tip code will parse svn:mime-type, extract the charset=
> | bit, and pass its value off to Pygments when doing syntax
> | highlighting for the markup and annotate views.
>
> Without that fix, I would expect to see exactly what you saw.
I wanted to be more informative with this post, because I understand that
the problem is not a trivial one.


Re: About character encoding of the text files

Posted by Peter Samuelson <pe...@p12n.org>.
[Ivan Cenov]
> I set svn:mime-type text/plain; charset=windows-1251 on several files.
> Also, I entered a commit message with Bulgarian (Cyrillic) text and
> English text.
> Then I viewed one of these committed files in ViewVC.
> 
> The page came in UTF-8. The log message showed properly - the Cyrillic text
> appeared and was readable. The file content was replaced with ?????????

Did you apply Mike's recent fix to ViewVC?  He described it:

| So, Ivan, if you missed my commit to ViewVC yesterday, the trunk and
| 1.1.x branch tip code will parse svn:mime-type, extract the charset=
| bit, and pass its value off to Pygments when doing syntax
| highlighting for the markup and annotate views.

Without that fix, I would expect to see exactly what you saw.
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/

Re: About character encoding of the text files

Posted by Ivan Cenov <i_...@botevgrad.com>.
On 26.8.2010 at 16:29, Mark Phippard wrote:
> Will this fix the problem?  Isn't there still the problem that the
> page advertises its encoding to the browser as UTF-8?  Does ViewVC
> convert from the encoding in the mime-type to UTF-8 before sending the
> content to the browser?  Or is that what Pygments is doing for you?
>
Hi,

Here is what happens in the browser (Firefox 3.6.8), on the page that shows
the file content.

I set svn:mime-type text/plain; charset=windows-1251 on several files.
I also entered a commit message with Bulgarian (Cyrillic) text and
English text.
Then I viewed one of these committed files in ViewVC.

The page came in UTF-8. The log message displayed properly: the Cyrillic text
appeared and was readable. The file content was replaced with ?????????

Then I changed the charset in Firefox, and things swapped:
the Cyrillic letters in the log message turned into some other symbols
(not ????????), while the file content appeared and became readable.
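
The swap described above is what one would expect when a single page carries
bytes in two different encodings. A minimal Python sketch, assuming the log
message is sent as UTF-8 and the file bytes are sent unchanged as windows-1251
(the sample word and the render() helper are made up for illustration):

```python
# Illustration only: one page carrying bytes in two different encodings.
SAMPLE = "Здравей"  # hypothetical sample text, not taken from the real files


def render(raw: bytes, page_charset: str) -> str:
    """Decode bytes roughly the way a browser would for the chosen charset."""
    return raw.decode(page_charset, errors="replace")


log_message = SAMPLE.encode("utf-8")    # assumption: log message sent as UTF-8
file_content = SAMPLE.encode("cp1251")  # assumption: file stored as windows-1251

# Page viewed as UTF-8: the log is readable, the file content becomes
# replacement characters (shown by browsers as '?' marks).
as_utf8 = (render(log_message, "utf-8"), render(file_content, "utf-8"))

# Page switched to windows-1251: the file is readable, the log message
# turns into unrelated symbols.
as_cp1251 = (render(log_message, "cp1251"), render(file_content, "cp1251"))
```

No single browser charset setting can make both parts readable; the file
content has to be converted to the page encoding on the server side.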

I have two screen captures (JPG), each about 100k in size. If this mailing
list allows attachments, I could attach them to a new mail. Please tell me
if this is OK.


Re: About character encoding of the text files

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/26/2010 09:29 AM, Mark Phippard wrote:
> On Thu, Aug 26, 2010 at 9:27 AM, C. Michael Pilato <cm...@collab.net> wrote:
> 
>> [And just wrap this up from the ViewVC side of things]
>>
>> As I saw this thread returning the old endorsement of tossing encoding
>> information into the svn:mime-type property, I went ahead and taught ViewVC
>> to look there for that information.  So, Ivan, if you missed my commit to
>> ViewVC yesterday, the trunk and 1.1.x branch tip code will parse
>> svn:mime-type, extract the charset= bit, and pass its value off to Pygments
>> when doing syntax highlighting for the markup and annotate views.
> 
> Will this fix the problem?  Isn't there still the problem that the
> page advertises its encoding to the browser as UTF-8?  Does ViewVC
> convert from the encoding in the mime-type to UTF-8 before sending the
> content to the browser?  Or is that what Pygments is doing for you?

File contents and encoding come into play in the following places in ViewVC:

  - the checkout (or download) view
  - the markup/annotate view
  - the diff view

The checkout view is a direct repository dump of the file contents without
any ViewVC manipulation, and has since 1.1.0 been able to present to the
browser the svn:mime-type property as-is, encoding value and all.  (Meaning,
it all works.)  My recent changes should have no visible effect on the
result here, though the svn:mime-type property is now parsed and the
Content-type of the response reconstructed -- a less direct route for that data.
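
As a rough illustration of "parse the property, then reconstruct the
Content-type", here is a minimal Python sketch. It is not ViewVC's actual
code: the function name and the text/plain default are assumptions, and it
relies only on the standard library's email.message module.

```python
from email.message import Message


def content_type_header(svn_mime_type, default="text/plain"):
    """Rebuild a Content-Type value from an svn:mime-type property value.

    Hypothetical helper: parses the property and re-emits the MIME type,
    keeping the charset parameter only for text/* types.
    """
    msg = Message()
    msg["Content-Type"] = svn_mime_type or default
    mime = msg.get_content_type()        # e.g. 'text/plain'
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset and msg.get_content_maintype() == "text":
        return f"{mime}; charset={charset}"
    return mime
```

For a property value of "text/plain;charset=Windows-1251" this yields a
normalized "text/plain; charset=windows-1251".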

The markup/annotate view optionally employs Pygments, and since 1.1.2 has
been coded to use the 'chardet' optional Python library to guess at file
content encodings for the purpose of conversion to UTF-8.  But guessing is
an inexact science, and it's possible that Pygments doesn't like when you
provide it a mime-type value that has parameters (such as 'charset')
attached.  Anyway, my changes now provide Pygments with the user-specified
(via svn:mime-type) encoding directly for that UTF-8 conversion.  Of course,
if you aren't using Pygments, then today you still get nothing.  (I'd like
to fix this by making the fallback code use 'chardet' directly or something.)
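
The order of preference described here (declared charset first, a guess from
'chardet' second) could be sketched in Python roughly as follows. This is an
illustration under assumptions, not ViewVC's implementation: the function
names are invented, and the final fallback to UTF-8 with replacement
characters is my own choice.

```python
from email.message import Message


def declared_charset(svn_mime_type):
    """Return the charset= parameter of an svn:mime-type value, or None."""
    msg = Message()
    msg["Content-Type"] = svn_mime_type
    return msg.get_content_charset()


def to_utf8(data: bytes, svn_mime_type: str = "") -> bytes:
    """Convert file bytes to UTF-8 for display (hypothetical helper)."""
    charset = declared_charset(svn_mime_type) if svn_mime_type else None
    if charset is None:
        try:
            import chardet  # optional third-party library; only a guess
            charset = chardet.detect(data).get("encoding")
        except ImportError:
            charset = None
    try:
        text = data.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # Unknown encoding name: fall back rather than fail (assumption).
        text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

With svn:mime-type set to "text/plain; charset=windows-1251", no guessing is
involved and the bytes are decoded with the user-specified encoding.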

Finally, the diff view has always been at a loss for anything decent in this
space, and that remains the case today.  I've been wanting to explore the
use of Pygments/chardet for this view, too, but I lack Round Tuits.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: About character encoding of the text files

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Aug 26, 2010 at 9:27 AM, C. Michael Pilato <cm...@collab.net> wrote:

> [And just wrap this up from the ViewVC side of things]
>
> As I saw this thread returning the old endorsement of tossing encoding
> information into the svn:mime-type property, I went ahead and taught ViewVC
> to look there for that information.  So, Ivan, if you missed my commit to
> ViewVC yesterday, the trunk and 1.1.x branch tip code will parse
> svn:mime-type, extract the charset= bit, and pass its value off to Pygments
> when doing syntax highlighting for the markup and annotate views.

Will this fix the problem?  Isn't there still the problem that the
page advertises its encoding to the browser as UTF-8?  Does ViewVC
convert from the encoding in the mime-type to UTF-8 before sending the
content to the browser?  Or is that what Pygments is doing for you?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: About character encoding of the text files

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/26/2010 04:35 AM, Ivan Cenov wrote:
> 
>  On 25.8.2010 at 22:19, Stefan Sperling wrote:
>>
>> Looks more like auto-props documentation needs a fix.
>> The escaping rules don't seem to be documented.
>> You can write ';;' to get a literal ';'.
>>
>> So you can use this in your config for auto-props:
>>
>>     svn:mime-type=text/plain;; charset=windows1251
>>
>> Stefan
>>
> Thanks,
> This is the easiest way.
> 

[And just to wrap this up from the ViewVC side of things]

As I saw this thread returning to the old endorsement of tossing encoding
information into the svn:mime-type property, I went ahead and taught ViewVC
to look there for that information.  So, Ivan, if you missed my commit to
ViewVC yesterday, the trunk and 1.1.x branch tip code will parse
svn:mime-type, extract the charset= bit, and pass its value off to Pygments
when doing syntax highlighting for the markup and annotate views.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: About character encoding of the text files

Posted by Ivan Cenov <i_...@botevgrad.com>.
On 25.8.2010 at 22:19, Stefan Sperling wrote:
>
> Looks more like auto-props documentation needs a fix.
> The escaping rules don't seem to be documented.
> You can write ';;' to get a literal ';'.
>
> So you can use this in your config for auto-props:
>
> 	svn:mime-type=text/plain;; charset=windows1251
>
> Stefan
>
Thanks,
This is the easiest way.


Re: About character encoding of the text files

Posted by Stefan Sperling <st...@elego.de>.
On Wed, Aug 25, 2010 at 06:58:50PM +0200, Branko Čibej wrote:
> On 25.08.2010 18:54, Ivan Cenov wrote:
> > Well, I tested with svn:mime-type=text/plain; charset=windows1251. I
> > tried to define it as
> > auto property in [auto-props] section of Subversion config file
> > (it resides in C:\Documents and Settings\username\Application
> > Data\Subversion).
> > This was not successful because ';' after 'plain' is a delimiter and
> > so "charset=windows1251"
> > is truncated. This is argument against svn:mime-type.
> It's more an argument for fixing autoprops, IMHO :)

Looks more like auto-props documentation needs a fix.
The escaping rules don't seem to be documented.
You can write ';;' to get a literal ';'.

So you can use this in your config for auto-props:

	svn:mime-type=text/plain;; charset=windows1251
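
To make the escaping rule concrete, here is a small Python sketch of how a
value could be split on single ';' while treating ';;' as a literal ';'.
This is only an illustration of the rule as stated above, not Subversion's
actual parser, and the function name is made up.

```python
def split_auto_props(value):
    """Split an auto-props value on ';', treating ';;' as a literal ';'.

    Hypothetical helper written from the rule described in this thread.
    """
    parts, current = [], []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == ";":
            if i + 1 < len(value) and value[i + 1] == ";":
                current.append(";")  # escaped semicolon stays in the value
                i += 2
                continue
            parts.append("".join(current).strip())  # real delimiter
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current).strip())
    return [p for p in parts if p]
```

With the doubled semicolon the whole "text/plain; charset=windows1251" value
stays attached to svn:mime-type; with a single one, the charset part is split
off as if it were a separate property.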

Stefan

Re: About character encoding of the text files

Posted by Branko Čibej <br...@xbc.nu>.
On 25.08.2010 18:54, Ivan Cenov wrote:
>  On 25.8.2010 at 19:27, Matthew Bentham wrote:
>>
>>
>> Maybe, but doing it this way is consistent with the way that the
>> charset is included in the "Content-Type" http header alongside the
>> mime type, described eg. here:
>>
>> http://www.w3.org/International/O-HTTP-charset
>>
>> It makes sense to include it alongside the mime-type, because it's
>> only valid to set it if the document is of type 'text', eg.
>> text/plain or text/html.
>>
>> Matthew
>>
>
> Ok, I have understood. It does not make big difference, more important
> is that the information exists.
> It is better to be compliant with the existing norms and rules, so
> svn:mime-type is OK too.
>
> Well, I tested with svn:mime-type=text/plain; charset=windows1251. I
> tried to define it as
> auto property in [auto-props] section of Subversion config file
> (it resides in C:\Documents and Settings\username\Application
> Data\Subversion).
> This was not successful because ';' after 'plain' is a delimiter and
> so "charset=windows1251"
> is truncated. This is argument against svn:mime-type.
It's more an argument for fixing autoprops, IMHO :)

-- Brane

Re: About character encoding of the text files

Posted by Ivan Cenov <i_...@botevgrad.com>.
On 25.8.2010 at 19:27, Matthew Bentham wrote:
>
>
> Maybe, but doing it this way is consistent with the way that the 
> charset is included in the "Content-Type" http header alongside the 
> mime type, described eg. here:
>
> http://www.w3.org/International/O-HTTP-charset
>
> It makes sense to include it alongside the mime-type, because it's 
> only valid to set it if the document is of type 'text', eg. text/plain 
> or text/html.
>
> Matthew
>

OK, I understand. It does not make a big difference; what matters more
is that the information exists.
It is better to comply with existing norms and rules, so
svn:mime-type is OK too.

Well, I tested with svn:mime-type=text/plain; charset=windows1251. I
tried to define it as an
auto property in the [auto-props] section of the Subversion config file
(it resides in C:\Documents and Settings\username\Application
Data\Subversion).
This was not successful because the ';' after 'plain' is a delimiter, so
"charset=windows1251"
is truncated. This is an argument against svn:mime-type.


Re: About character encoding of the text files

Posted by Matthew Bentham <mj...@artvps.com>.
On 25/08/2010 17:01, Ivan Cenov wrote:
> On 25.8.2010 at 08:54, B Smith-Mannschott wrote:
>> The property svn:mime-type carries charset information as an
>> additional field:
>>
>> $ svn propset svn:mime-type "text/plain;charset=Windows-1251"
>> file1.txt file2.txt ...
>>
>> // ben
>>
>
> Hi, I tried with
> svn:mime-type text/plain; charset=Windows-1251
> and
> svn:mime-type text/plain; charset=windows-1251
> and
> svn:mime-type text/plain;charset=windows-1251
> and
> svn:mime-type text/plain; charset=Windows-1251
>
> on a file d.c but without success. ViewVC continued to show ???? instead
> of cyrillic letters.
>
> Well, Subversion supplies the information. It is up to ViewVC to deal
> with it...
>
>   From my point of view, it would be better a dedicated property to be
> defined.
> svn:charset is only an example.
>

Maybe, but doing it this way is consistent with the way the charset
is included in the "Content-Type" HTTP header alongside the MIME type,
described e.g. here:

http://www.w3.org/International/O-HTTP-charset

It makes sense to include it alongside the mime-type, because it's only
valid to set it if the document is of type 'text', e.g. text/plain or
text/html.

Matthew

Re: About character encoding of the text files

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
B Smith-Mannschott wrote on Wed, Aug 25, 2010 at 07:54:26 +0200:
> On Tue, Aug 24, 2010 at 21:19, Ivan Cenov <i_...@botevgrad.com> wrote:
> 
> >  Hello,
> >
> > This is my first post in this list. I was pointed to post here in another
> > thread
> > ViewVC site (http://viewvc.tigris.org/issues/show_bug.cgi?id=11).
> >
> > The original reason was that ViewVC is unable to show correctly text files
> > that contain Cyrillic characters (character set windows-1251). (The same
> > issue
> > is related for Western Europe's characters too.)
> > People told me that ViewVC cannot do this because it lacks of encoding
> > information. This information should come from Subversion and
> > Subversion could have this information if the users have supplied it into
> > Subversion.
> > The last posts in above mentioned thread give more information about the
> > problem.
> >
> > As I understood, information about character encoding may be supplied as
> > svn: property, say, svn:encoding encoding_type. Par example:
> > svn:encoding windows-1251.
> >
> > So, are there any intentions among the Subversion developers and users to
> > be defined
> > such property? Would it be reliable way for this task?
> > If there is an issue about this problem, what is its priority?
> >
> >
> The property svn:mime-type carries charset information as an additional
> field:
> 
> $ svn propset svn:mime-type "text/plain;charset=Windows-1251" file1.txt
> file2.txt ...
> 

Ben, does svnbook document this syntax (and when to use it)?

(or the "$Keyword::$" syntax, while I'm at it)

> // ben

Re: About character encoding of the text files

Posted by Ivan Cenov <i_...@botevgrad.com>.
On 25.8.2010 at 08:54, B Smith-Mannschott wrote:
> The property svn:mime-type carries charset information as an 
> additional field:
>
> $ svn propset svn:mime-type "text/plain;charset=Windows-1251" 
> file1.txt file2.txt ...
>
> // ben
>

Hi, I tried with
svn:mime-type text/plain; charset=Windows-1251
and
svn:mime-type text/plain; charset=windows-1251
and
svn:mime-type text/plain;charset=windows-1251
and
svn:mime-type text/plain; charset=Windows-1251

on a file d.c, but without success. ViewVC continued to show ???? instead
of Cyrillic letters.

Well, Subversion supplies the information. It is up to ViewVC to deal
with it...

From my point of view, it would be better to define a dedicated property.
svn:charset is only an example.


Re: About character encoding of the text files

Posted by B Smith-Mannschott <bs...@gmail.com>.
On Tue, Aug 24, 2010 at 21:19, Ivan Cenov <i_...@botevgrad.com> wrote:

>  Hello,
>
> This is my first post in this list. I was pointed to post here in another
> thread
> ViewVC site (http://viewvc.tigris.org/issues/show_bug.cgi?id=11).
>
> The original reason was that ViewVC is unable to show correctly text files
> that contain Cyrillic characters (character set windows-1251). (The same
> issue
> is related for Western Europe's characters too.)
> People told me that ViewVC cannot do this because it lacks of encoding
> information. This information should come from Subversion and
> Subversion could have this information if the users have supplied it into
> Subversion.
> The last posts in above mentioned thread give more information about the
> problem.
>
> As I understood, information about character encoding may be supplied as
> svn: property, say, svn:encoding encoding_type. Par example:
> svn:encoding windows-1251.
>
> So, are there any intentions among the Subversion developers and users to
> be defined
> such property? Would it be reliable way for this task?
> If there is an issue about this problem, what is its priority?
>
>
The property svn:mime-type carries charset information as an additional
field:

$ svn propset svn:mime-type "text/plain;charset=Windows-1251" file1.txt
file2.txt ...
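
For applying the same command to many files from a script, a small Python
wrapper might look like the sketch below. The helper names are invented for
illustration, and it assumes the svn command-line client is on PATH.

```python
import subprocess


def propset_mime_type_argv(files, charset="windows-1251", mime="text/plain"):
    """Build the argv for: svn propset svn:mime-type "<mime>;charset=<charset>" FILES..."""
    return ["svn", "propset", "svn:mime-type", f"{mime};charset={charset}", *files]


def set_mime_type(files, **kwargs):
    """Run the command; raises CalledProcessError if svn reports a failure."""
    subprocess.run(propset_mime_type_argv(files, **kwargs), check=True)
```

Passing the property value as a single argv element avoids shell quoting
problems with the ';' character.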

// ben