Posted to users@subversion.apache.org by Tim Armes <ta...@fr.imaje.com> on 2004/02/19 08:52:22 UTC

Encoding problems

I'm stumped.  Perhaps someone from this list can help me out with this
problem:

Subversion keeps all the log messages in UTF-8 format.  This is very
sensible.  When you call a console command, the output is converted to the
console's charset.  This makes sense too.

Now, on my Windows server, if I call svnlook using PHP's popen command, the
string I get back is, correctly, in CP850.  If I want this to be displayed
correctly on the web page, I use iconv to convert it to ISO-8859-1.  This
works.
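
For illustration, the same conversion step can be sketched in Python rather
than PHP.  The function name cp850_to_latin1 is hypothetical, and replacing
unconvertible characters with "?" is an assumption about the desired
behaviour, not something the PHP iconv call is known to do here.

```python
def cp850_to_latin1(raw: bytes) -> bytes:
    """Re-encode console output from CP850 to ISO-8859-1."""
    # Decode the console bytes to text first, then encode for the web page.
    # errors="replace" substitutes "?" for characters that ISO-8859-1 lacks
    # (the CP850 box-drawing characters, for example).
    return raw.decode("cp850").encode("iso-8859-1", errors="replace")


if __name__ == "__main__":
    sample = b"caf\x82"  # "cafe" with e-acute, as CP850 bytes (0x82)
    print(cp850_to_latin1(sample))  # prints b'caf\xe9'
```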

However, on another user's machine (to which I don't have access,
unfortunately) which is a Unix box with the locale in Icelandic, his log
messages are being displayed like this:

T?\195?\179k ?\195?\186t if statement ?\195?\186r index.php sem hvort 
e?\195?\176 er gerast aldrei, en ef ?\195?\190?\195?\166r
skyldu gerast er $ID n?\195?\186 h?\195?\182ndla?\195?\176 ?\195?\161 sama 
h?\195?\161tt og $LANG, ?\195?\190.e.a.s. ?\195?\190a?\195?\176 er sett
?\195?\161
default ef userinn gefur ?\195?\190a?\195?\176 ekki.

It's the UTF-8 encoding, except that the non-ASCII bytes are being converted
into human-readable numbers.  My first assumption was that svnlook was for
some reason returning the string as UTF-8, and that PHP's print function was
printing the characters as above, expanded.  That's not the case, however: my
tests have shown that PHP doesn't do such a thing; it happily prints
top-bit-set characters as they are.
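
To make the pattern concrete: 195 and 179 are 0xC3 and 0xB3 in decimal, which
is the UTF-8 encoding of "ó", so "T?\195?\179k" reads as "Tók".  The Python
sketch below reverses the expansion on that assumption, namely that each
"?\NNN" stands for one byte written in decimal.  The name unescape_bytes is
hypothetical, and this only inspects the output; it says nothing about which
program produced the escapes.

```python
import re

# Matches one escape of the form ?\NNN, where NNN is a decimal number.
_ESCAPE = re.compile(rb"\?\\(\d{1,3})")


def unescape_bytes(text: str) -> str:
    """Turn ?\\NNN escapes back into bytes and decode the result as UTF-8."""

    def restore(match):
        value = int(match.group(1))
        # Leave anything that cannot be a single byte untouched.
        return bytes([value]) if value <= 255 else match.group(0)

    raw = text.encode("ascii")  # the escaped form is pure ASCII
    return _ESCAPE.sub(restore, raw).decode("utf-8")


if __name__ == "__main__":
    print(unescape_bytes(r"T?\195?\179k ?\195?\186t"))  # prints: Tók út
```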

The implication, then, is that it's the svnlook command that's returning the
string exactly as shown above, but I don't believe that either.  His locale
is set up for ISO-8859-1, so you would expect svnlook to return characters
in that encoding.

The sequence is:

Use popen to run svnlook, reading one line at a time with fgets.
Store these lines in an array.
Print them.
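
The sequence above can be sketched as follows, in Python rather than PHP, so
subprocess.Popen stands in for popen and iterating over the pipe stands in
for fgets.  The helper name read_command_lines is hypothetical, and the demo
runs a small Python child process instead of svnlook so that it works without
a repository.  Reading raw bytes and decoding them explicitly makes it
possible to see exactly what the child process emitted.

```python
import subprocess
import sys


def read_command_lines(argv, encoding):
    """Run argv, read its stdout one line at a time, return decoded lines."""
    lines = []
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        for raw in proc.stdout:  # one line of raw bytes per iteration
            text = raw.decode(encoding, errors="replace")
            lines.append(text.rstrip("\r\n"))
    return lines


if __name__ == "__main__":
    # Stand-in child process that writes one ISO-8859-1 line ("Tók").
    child = [sys.executable, "-c",
             "import sys; sys.stdout.buffer.write(b'T\\xf3k\\n')"]
    for line in read_command_lines(child, "iso-8859-1"):
        print(line)
```

With a real repository the call would look something like
read_command_lines(["svnlook", "log", "/path/to/repo"], "iso-8859-1"), where
the repository path is a placeholder.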

Does anyone have any idea at what point the UTF-8 encoding could be returned
with the top-bit-set characters simply expanded into numeric strings?
Indeed, why should the string be in UTF-8 at all?

Tim
