Posted to dev@subversion.apache.org by Gunter Ohrner <G....@post.rwth-aachen.de> on 2006/01/05 11:26:35 UTC

about character set / encoding conversion in subversion

Hi!

I want to get more familiar with character set management in subversion to
better understand some issues I'm experiencing. Maybe someone here can
enlighten me or point me to some clarifying documentation.

So far I've always believed, and the docs I found seemed to support it in my
eyes, that subversion converts all file names from a client's custom locale
to UTF8 and only manages UTF8-encoded names internally. I imagined this
conversion to UTF8 would happen in the client before supplying the data to
the server, and the conversion back to the client's locale would also be
carried out by the client after receiving the canonical UTF8-encoded names
from the server. That made sense to me, as this way the server would not
have to know the (possibly very different) locales / encodings used by the
clients accessing it.

Well, it seems I was wrong, and the server seems to do part (all?) of
the recoding. Now I'm puzzled - how and especially where does the "local
encoding <-> UTF8 conversion" happen inside subversion? Is there any
documentation detailing this and the design decisions behind it?

What character recoding is done within the server (svnserve 1.1.4) and how
do I have to configure the server to know all necessary encodings?

Greetings,

  Gunter

-- 
Tourist, Rincewind decided, meant "idiot".        -- (Terry Pratchett, 
The Colour of Magic)
*** PGP encryption for e-mails welcome :-) *** PGP: 0x1128F25F ***


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: about character set / encoding conversion in subversion

Posted by Gunter Ohrner <G....@post.rwth-aachen.de>.
Ben Collins-Sussman wrote:
>> So far I've always believed, and the docs I found seemed to support it in
>> my eyes, that subversion converts all file names from a client's custom
>> locale to UTF8 and only manages UTF8-encoded names internally. I imagined

> Yes, that's how it works.

Mh, ok. So that's good news that I have a correct idea of how svn should
work, and bad news as now I really do not understand what's up with my
server... :-/

>> What character recoding is done within the server (svnserve 1.1.4) and
>> how do I have to configure the server to know all necessary encodings?

> So 'svn' does this the most, obviously.  But even simple programs like
> 'svnserve'  or 'svnlook' do it to a small degree.  For example,
> svnserve has its own svnserve.conf file which contains locally-encoded
> system paths.  svnserve needs to convert them to UTF8 before passing
> them to internal APIs.

Ok. Well, generally, svnserve works fine with just this repository
(svnserve 1.1.4 using the bdb backend). However, when I try to commit changed
files whose names contain umlauts, svnserve (!) answers "Can't recode string"
to the client's request. Committing files / whole directories whose names do
not contain any special characters to this repository works just fine.

The client is using de_DE@euro (ISO 8859-15) as its locale, that is the
LANG-environment variable is set to this value, "locale -a" lists
"de_DE@euro" (besides others) and I have no problems working locally with
the offending files.

So how can I find what svnserve is trying to recode, between which charsets
it wants to recode, and why it's failing?

On the client it looks like this:

-----------------------------------------------------------------------
read(3, "( success ( 1 2 ( ANONYMOUS ) ( edit-pipeline ) ) ) ", 4096) = 52
write(3, "( 2 ( edit-pipeline )
   41:svn://192.168.42.1/home/misc/Spr%C3%BCche ) ", 69) = 69
read(3, "( failure ( ( 22 19:Can\'t recode string
   28:subversion/libsvn_subr/utf.c 363 ) ) ) ", 4096) = 82
-----------------------------------------------------------------------

Looking at the code reveals that a call to apr_xlate_conv_buffer inside
Subversion's convert_to_stringbuf seems to be failing.
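
To illustrate what the trace shows, here is a minimal Python sketch (not
Subversion code). It percent-decodes the path from the trace, confirms that
the bytes on the wire are UTF8, and then checks which target charsets could
represent the name. The helper can_recode and the list of charsets are
hypothetical; in particular, that the server process ran with a plain ASCII
charset is only an assumption, the trace does not say which charset the
server tried to recode into.

```python
from urllib.parse import unquote_to_bytes

# Path component copied from the trace above.
wire_path = "/home/misc/Spr%C3%BCche"

raw = unquote_to_bytes(wire_path)   # percent-decoding yields raw bytes
name = raw.decode("utf-8")          # the bytes on the wire are valid UTF8


def can_recode(text, charset):
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Assumed candidate charsets for the server process (illustration only).
for charset in ("ascii", "iso-8859-15", "utf-8"):
    print(charset, can_recode(name, charset))
```

Under that assumption, a recode into ASCII has no representation for the
"ü", while ISO 8859-15 and UTF8 both do, which would be one way to end up
with an error of the "Can't recode string" kind.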

Thanks a lot for your explanations!

Greetings,

  Gunter

-- 
Biers was where the undead drank. And when Igor the barman was asked 
for a Bloody Mary, he didn't mix a metaphor.        -- (Terry 
Pratchett, Hogfather)
*** PGP encryption for e-mails welcome :-) *** PGP: 0x1128F25F ***



Re: about character set / encoding conversion in subversion

Posted by Gunter Ohrner <G....@post.rwth-aachen.de>.
Ben Collins-Sussman wrote:
>> What character recoding is done within the server (svnserve 1.1.4) and
>> how do I have to configure the server to know all necessary encodings?

> So 'svn' does this the most, obviously.  But even simple programs like
> 'svnserve'  or 'svnlook' do it to a small degree.  For example,
> svnserve has its own svnserve.conf file which contains locally-encoded
> system paths.  svnserve needs to convert them to UTF8 before passing
> them to internal APIs.

Mh, ok. All local paths use standard ASCII characters, as can be seen from
the fact that svnserve worked fine with this repository as long as the
committed files didn't contain any non-ASCII characters.

However, it DID break if the committed files contained characters from richer
charsets.

My server process (svnserve) now runs with LANG=de_DE.utf8@euro and seems to
be happy so far... :-/  Though I still don't understand why it wanted to
transcode some of the committed files' names into its local encoding?

Now maybe someone could clarify: am I officially too stupid to configure
subversion, or is it really not documented anywhere that the subversion
server process needs to transcode some strings into its own locale / encoding
for some reason and thus needs to use a local encoding which can represent
all characters of all encodings used by any clients accessing this server?

Anyway, sorry for wasting your time, thanks a lot for your help :-))) and I
really hope it works reliably now. :-)

Greetings,

  Gunter

-- 
To mess up a Linux box, you need to work at it; to mess up your Windows 
box, you just need to work on it.        -- Scott Granneman, Security 
Focus
*** PGP encryption for e-mails welcome :-) *** PGP: 0x1128F25F ***



Re: about character set / encoding conversion in subversion

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 1/5/06, Gunter Ohrner <G....@post.rwth-aachen.de> wrote:
> Hi!
>
> I want to get more familiar with character set management in subversion to
> better understand some issues I'm experiencing. Maybe someone here can
> enlighten me or point me to some clarifying documentation.
>
> So far I've always believed, and the docs I found seemed to support it in my
> eyes, that subversion converts all file names from a client's custom locale
> to UTF8 and only manages UTF8-encoded names internally. I imagined this
> conversion to UTF8 would happen in the client before supplying the data to
> the server, and the conversion back to the client's locale would also be
> carried out by the client after receiving the canonical UTF8-encoded names
> from the server. That made sense to me as this way the server would not
> have to know the (possibly very different) locales / encodings used by the
> clients accessing it.

Yes, that's how it works.


> What character recoding is done within the server (svnserve 1.1.4) and how
> do I have to configure the server to know all necessary encodings?
>

Subversion is 90% a collection of libraries, and 10% binaries that
make calls to those libraries.  The design is:  for every API in every
library, file paths and log messages are always UTF8.  That means that
any application which (1) makes use of subversion APIs and (2) deals
with "local" system paths has the responsibility of converting the
local system paths to UTF8 before passing them to the subversion APIs.

So 'svn' does this the most, obviously.  But even simple programs like
'svnserve'  or 'svnlook' do it to a small degree.  For example,
svnserve has its own svnserve.conf file which contains locally-encoded
system paths.  svnserve needs to convert them to UTF8 before passing
them to internal APIs.
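
As a rough illustration of that boundary rule, here is a small Python sketch
(not the actual Subversion implementation, which is written in C). The
function names local_to_utf8 and utf8_to_local are made up for this example,
and ISO 8859-15 is used as the local charset only because it is the one
mentioned earlier in this thread.

```python
def local_to_utf8(raw: bytes, local_charset: str) -> bytes:
    """Convert a locally-encoded name to UTF8 before handing it to an API."""
    return raw.decode(local_charset).encode("utf-8")


def utf8_to_local(raw: bytes, local_charset: str) -> bytes:
    """Convert a UTF8 name coming back from an API to the local charset."""
    return raw.decode("utf-8").encode(local_charset)


# A name as it would be stored on a system using ISO 8859-15.
local_name = "Sprüche".encode("iso-8859-15")

# What the application would pass across the API boundary.
internal = local_to_utf8(local_name, "iso-8859-15")

print(local_name)   # b'Spr\xfcche'
print(internal)     # b'Spr\xc3\xbcche'
print(utf8_to_local(internal, "iso-8859-15") == local_name)   # True
```

The second direction raises UnicodeEncodeError if the UTF8 name contains a
character the local charset cannot represent, which is the general kind of
situation in which a recoding step can fail.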
