Posted to users@subversion.apache.org by Frédéric Hébert <fg...@gmail.com> on 2009/05/10 19:21:41 UTC

svnadmin dump - Erroneous UTF-8 encoding with binary files

Hello,

 Through a double SSH connection (home PC => firewall => dev server),
 I dumped an entire repository on my dev server with "svnadmin
 dump repos > file".

 Back on my home computer, I was surprised to see that my dump
 file contained badly encoded UTF-8 characters like the following
(see the svn:log property):

        Revision-number: 1
        Prop-content-length: 146
        Content-length: 146

        K 7
        svn:log
        V 46
        Création de l'arborescence de base du dépôt
        K 10
        svn:author
        V 5
        fredo
        K 8
        svn:date
        V 27
        2009-04-07T19:59:37.972139Z
        PROPS-END

 
These bad characters appeared both in svn:log properties and in file contents.
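
A quick way to check whether a property value in a dump is stored as UTF-8 is to compare the length declared on its "V n" line with the byte length of the value. The Python sketch below does this for the three properties in the excerpt above. It assumes the values are exactly as shown in the excerpt; the dictionary layout is only for illustration.

```python
# (declared length from the "V n" line, value) for each property in the excerpt
props = {
    "svn:log": (46, "Création de l'arborescence de base du dépôt"),
    "svn:author": (5, "fredo"),
    "svn:date": (27, "2009-04-07T19:59:37.972139Z"),
}

for name, (declared, value) in props.items():
    utf8_len = len(value.encode("utf-8"))  # bytes needed if stored as UTF-8
    char_len = len(value)                  # number of characters
    print(name, declared, utf8_len, char_len)
```

For svn:log the declared length (46) matches the UTF-8 byte length, not the character count (43): each of the three accented letters takes two bytes. The lengths in a dump are byte counts, which is consistent with the log message being stored as well-formed UTF-8.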

All three computers have UTF-8 locales, and the ssh clients and servers have
SendEnv and AcceptEnv set to LC_* and LANGUAGE.

Back on the dev server, I ran some tests, and it seems to me that the
encoding errors are due to the presence of binary files in the dump, e.g.
files with the svn:mime-type property set to application/octet-stream.
For example, my Django project contains plain text in
'trunk/templates' and images in 'trunk/media/images':

    dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | \
                     svndumpfilter include 'trunk/templates' > \
                     /tmp/svn_enseignements_r33_nobinary.dump


Its output through xxd looks like this ("Année" on the third line
contains the sequence c3a9, which is the UTF-8 encoding of the French "é"):


                001b450: 6872 6566 3d22 2f7b 7b20 616e 6e65 6575  href="/{{ anneeu
                001b460: 6e69 762e 616e 6e65 6520 7d7d 2f22 2074  niv.annee }}/" t
                001b470: 6974 6c65 3d22 416e 6ec3 a965 2075 6e69  itle="Ann..e uni
                001b480: 7665 7273 6974 6169 7265 207b 7b20 616e  versitaire {{ an
                001b490: 6e65 6575 6e69 7620 7d7d 223e 7b7b 2061  neeuniv }}">{{ a
                001b4a0: 6e6e 6565 756e 6976 207d 7d3c 2f61 3e3c  nneeuniv }}</a><
                001b4b0: 2f6c 693e 0a20 2020 2020 203c 6c69 3e3c  /li>.      <li><


    dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | 2>&1 \
                     svndumpfilter include 'trunk/templates' include 'trunk/media/images' > \
                     /tmp/svn_enseignements_r33.dump


On the third line, the letter 'é' is now made of four bytes, c3 83 c2 a9
(the UTF-8 encoding of the two characters "Ã©"):

                0000000: 6872 6566 3d22 2f7b 7b20 616e 6e65 6575  href="/{{ anneeu
                0000010: 6e69 762e 616e 6e65 6520 7d7d 2f22 2074  niv.annee }}/" t
                0000020: 6974 6c65 3d22 416e 6ec3 83c2 a965 2075  itle="Ann....e u
                0000030: 6e69 7665 7273 6974 6169 7265 207b 7b20  niversitaire {{ 
                0000040: 616e 6e65 6575 6e69 7620 7d7d 223e 7b7b  anneeuniv }}">{{
                0000050: 2061 6e6e 6565 756e 6976 207d 7d3c 2f61   anneeuniv }}</a
                0000060: 3e3c 2f6c 693e 0a20 2020 2020 203c 6c69  ></li>.      <li
                0000070: 3e43 6f6d 7074 6520 7265 6e64 753c 2f6c  >Compte rendu</l
                0000080: 0a
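
The four bytes c3 83 c2 a9 are what one gets when the two UTF-8 bytes of 'é' are misread as Latin-1 (ISO-8859-1) and then encoded to UTF-8 a second time. The Python sketch below reproduces that round trip and its inverse. It only shows that the bytes are consistent with double encoding; the hexdumps alone do not establish which tool or copy step did the re-encoding.

```python
original = bytes.fromhex("c3a9")      # 'é' as seen in the first hexdump
observed = bytes.fromhex("c383c2a9")  # same position in the second hexdump

# Misread the UTF-8 bytes as Latin-1, then encode the result as UTF-8 again.
double_encoded = original.decode("latin-1").encode("utf-8")

# Undo it: decode UTF-8, map each character back to one byte, decode UTF-8.
repaired = observed.decode("utf-8").encode("latin-1").decode("utf-8")

print(double_encoded.hex(), repr(repaired))
```

The first value printed is c383c2a9, matching the second hexdump, and the repaired string is 'é'.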


Does anyone have an idea about this?
The only solution I could come up with is to delete the binary files from the
repository...
 
Many thanks in advance and forgive me if I am totally wrong.

Frédéric

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2175222

To unsubscribe from this discussion, e-mail: [users-unsubscribe@subversion.tigris.org].


Re: svnadmin dump - Erroneous UTF-8 encoding with binary files

Posted by Frédéric Hébert <fg...@gmail.com>.
Stefan Sperling wrote:
> On Sun, May 10, 2009 at 09:21:41PM +0200, Frédéric Hébert wrote:
>> Back on the dev server, I ran some tests, and it seems to me that the
>> encoding errors are due to the presence of binary files in the dump, e.g.
>> files with the svn:mime-type property set to application/octet-stream.
>> For example, my Django project contains plain text in
>> 'trunk/templates' and images in 'trunk/media/images':
>>
>>     dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | \
>>                  svndumpfilter include 'trunk/templates' > \
>>                  /tmp/svn_enseignements_r33_nobinary.dump
> 

> Do I understand correctly that you propose that Subversion
> applies the wrong mime-type settings to your text files,
> using the mime-type setting of unrelated binary files for text files?

Of course not!

That's not what I meant. In my dump file there are two kinds of file
content: binary files with the 'application/octet-stream' mime-type, and
plain text files with no svn:mime-type set.

What I had in mind was that Subversion handles raw output and encoded
output "separately" in its dump file.

> 
> Have you tested what happens when you set the mime-type on text files
> to 'text/plain' explicitly? It would be interesting to see if this
> fixes the problem.
> 

I am sometimes (?) a kind of idiot. Of course, a dump file is itself a
binary file with no single encoding (a file that holds both raw binary
and text content cannot be treated as text). The character set is handled
at import time in Subversion's internals.

I had not run xxd on the FSFS files after an import. I have now, and it
works: they are UTF-8.
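
Treating the dump as binary means reading it as bytes, using the declared lengths to find each field, and decoding only the fields known to be text. The Python sketch below parses a property block that way. It is a simplified illustration, not Subversion's own parser: parse_props is a hypothetical helper, the sample block is rebuilt from the excerpt in the first message, and only "K"/"V" pairs are handled.

```python
def parse_props(block: bytes) -> dict:
    """Parse 'K n / key / V m / value ... PROPS-END' from raw bytes."""
    props = {}
    pos = 0

    def read_line() -> bytes:
        nonlocal pos
        end = block.index(b"\n", pos)
        line = block[pos:end]
        pos = end + 1
        return line

    def read_counted(n: int) -> bytes:
        nonlocal pos
        data = block[pos:pos + n]
        pos += n + 1  # skip the LF that follows the payload
        return data

    while True:
        header = read_line()
        if header == b"PROPS-END":
            return props
        key = read_counted(int(header[2:].decode("ascii")))
        value_header = read_line()
        value = read_counted(int(value_header[2:].decode("ascii")))
        props[key.decode("utf-8")] = value  # values stay as bytes


sample = (
    "K 7\nsvn:log\nV 46\n"
    "Création de l'arborescence de base du dépôt\n"
    "K 10\nsvn:author\nV 5\nfredo\n"
    "PROPS-END\n"
).encode("utf-8")

props = parse_props(sample)
log = props["svn:log"].decode("utf-8")  # decode only this text field
print(len(props["svn:log"]), log)
```

Values are kept as bytes on purpose: the caller decides which ones are text, which matches the idea that the dump as a whole has no encoding.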

> Stefan


Sorry about that.


Frédéric

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2176457


Re: svnadmin dump - Erroneous UTF-8 encoding with binary files

Posted by Stefan Sperling <st...@elego.de>.
On Sun, May 10, 2009 at 09:21:41PM +0200, Frédéric Hébert wrote:
> Back on the dev server, I ran some tests, and it seems to me that the
> encoding errors are due to the presence of binary files in the dump, e.g.
> files with the svn:mime-type property set to application/octet-stream.
> For example, my Django project contains plain text in
> 'trunk/templates' and images in 'trunk/media/images':
> 
>     dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | \
>                  svndumpfilter include 'trunk/templates' > \
>                  /tmp/svn_enseignements_r33_nobinary.dump

Do I understand correctly that you propose that Subversion
applies the wrong mime-type settings to your text files,
using the mime-type setting of unrelated binary files for text files?

Have you tested what happens when you set the mime-type on text files
to 'text/plain' explicitly? It would be interesting to see if this
fixes the problem.

Stefan


Re: svnadmin dump - Erroneous UTF-8 encoding with binary files

Posted by Henrik Sundberg <st...@gmail.com>.
2009/5/10 Frédéric Hébert <fg...@gmail.com>:
>  Through a double SSH connection (home PC => firewall => dev server),
>  I dumped an entire repository on my dev server with "svnadmin
>  dump repos > file".
>
>  Back on my home computer, I was surprised to see that my dump
>  file contained badly encoded UTF-8 characters like the following

Do you see anything wrong in the repository, or is it just the dump
format that seems odd?
The dump is not a text file.
http://svnbook.red-bean.com/en/1.5/svn.reposadmin.maint.html#svn.reposadmin.maint.migrate
says:
"While the Subversion repository dump format contains human-readable
portions and a familiar structure (it resembles an RFC 822 format, the
same type of format used for most email), it is not a plain-text file
format. It is a binary file format, highly sensitive to meddling. For
example, many text editors will corrupt the file by automatically
converting line endings."
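
The line-ending point can be made concrete with the property block from the first message. In the Python sketch below, the block is rebuilt from that excerpt under the assumption that every line ends with a single LF; converting LF to CRLF, as some editors do, changes the byte count so that it no longer matches the declared Prop-content-length.

```python
block = (
    "K 7\nsvn:log\nV 46\n"
    "Création de l'arborescence de base du dépôt\n"
    "K 10\nsvn:author\nV 5\nfredo\n"
    "K 8\nsvn:date\nV 27\n2009-04-07T19:59:37.972139Z\n"
    "PROPS-END\n"
).encode("utf-8")

declared = 146  # Prop-content-length from the revision header

# What a line-ending-converting editor would do to the block.
converted = block.replace(b"\n", b"\r\n")

print(len(block), len(converted), declared)
```

The untouched block is exactly 146 bytes; after conversion it is 159 bytes (one extra byte for each of the 13 lines), so a loader relying on the declared length would read the wrong span.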

/$

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2176463