Posted to users@subversion.apache.org by Frédéric Hébert <fg...@gmail.com> on 2009/05/10 19:21:41 UTC
svnadmin dump - Erroneous UTF-8 encoding with binary files
Hello,
Through a double SSH connection (home PC => firewall => dev server),
I made a dump of an entire repository from my dev server with "svnadmin
dump repos > file".
Back on my home computer, I was surprised to see that my dump
file contained badly encoded UTF-8 characters like the following
(see the svn:log property):
Revision-number: 1
Prop-content-length: 146
Content-length: 146
K 7
svn:log
V 46
Création de l'arborescence de base du dépôt
K 10
svn:author
V 5
fredo
K 8
svn:date
V 27
2009-04-07T19:59:37.972139Z
PROPS-END
These bad characters appeared both in svn:log properties and in file contents.
All three computers have UTF-8 locales, and the ssh clients and servers have
SendEnv and AcceptEnv set to LC_* and LANGUAGE.
Back on the dev server I ran some tests, and it seems to me that the
encoding errors are due to the presence of binary files in the dump, e.g.
files with the svn:mime-type property set to application/octet-stream.
For example, my Django project contains pure plain text in
'trunk/templates' and images in 'trunk/media/images':
dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | \
    svndumpfilter include 'trunk/templates' > \
    /tmp/svn_enseignements_r33_nobinary.dump
Its output through xxd looks something like this (on the third line, Année
contains the sequence c3a9, which is the UTF-8 encoding of the French é):
001b450: 6872 6566 3d22 2f7b 7b20 616e 6e65 6575  href="/{{ anneeu
001b460: 6e69 762e 616e 6e65 6520 7d7d 2f22 2074  niv.annee }}/" t
001b470: 6974 6c65 3d22 416e 6ec3 a965 2075 6e69  itle="Ann..e uni
001b480: 7665 7273 6974 6169 7265 207b 7b20 616e  versitaire {{ an
001b490: 6e65 6575 6e69 7620 7d7d 223e 7b7b 2061  neeuniv }}">{{ a
001b4a0: 6e6e 6565 756e 6976 207d 7d3c 2f61 3e3c  nneeuniv }}</a><
001b4b0: 2f6c 693e 0a20 2020 2020 203c 6c69 3e3c  /li>.      <li><
dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | 2>&1 \
    svndumpfilter include 'trunk/templates' include 'trunk/media/images' > \
    /tmp/svn_enseignements_r33.dump
On the third line, the letter 'é' is now made of four bytes, c3 83 c2 a9
(the two characters Ã©):
0000000: 6872 6566 3d22 2f7b 7b20 616e 6e65 6575  href="/{{ anneeu
0000010: 6e69 762e 616e 6e65 6520 7d7d 2f22 2074  niv.annee }}/" t
0000020: 6974 6c65 3d22 416e 6ec3 83c2 a965 2075  itle="Ann....e u
0000030: 6e69 7665 7273 6974 6169 7265 207b 7b20  niversitaire {{ 
0000040: 616e 6e65 6575 6e69 7620 7d7d 223e 7b7b  anneeuniv }}">{{
0000050: 2061 6e6e 6565 756e 6976 207d 7d3c 2f61   anneeuniv }}</a
0000060: 3e3c 2f6c 693e 0a20 2020 2020 203c 6c69  ></li>.      <li
0000070: 3e43 6f6d 7074 6520 7265 6e64 753c 2f6c  >Compte rendu</l
0000080: 0a
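For reference, the four bytes c3 83 c2 a9 are what you get when the two UTF-8 bytes of é are decoded with a single-byte charset and then encoded as UTF-8 a second time. The Python sketch below reproduces only that byte pattern. It assumes Latin-1 (ISO-8859-1) as the wrongly applied charset, which nothing in this thread confirms, and the function names are made up for the illustration.

```python
E_ACUTE = "\u00e9"  # the letter é


def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread the bytes as Latin-1, then encode as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # assumption: Latin-1 is the wrong charset
    return misread.encode("utf-8")


def undo_double_encoding(data: bytes) -> bytes:
    """Reverse the round trip above, recovering the original UTF-8 bytes."""
    return data.decode("utf-8").encode("latin-1")


once = E_ACUTE.encode("utf-8")
twice = double_encode(E_ACUTE)

print(once.hex())   # c3a9
print(twice.hex())  # c383c2a9
print(undo_double_encoding(twice) == once)  # True
```

The first value matches the c3a9 seen in the first xxd output and the second matches the c3 83 c2 a9 seen in the second one, so the pattern is consistent with a second round of encoding somewhere between the repository and the inspected bytes.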
Does anyone have an idea about this?
The only solution I could come up with is to delete the binary files from the
repository...
Many thanks in advance and forgive me if I am totally wrong.
Frédéric
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2175222
To unsubscribe from this discussion, e-mail: [users-unsubscribe@subversion.tigris.org].
Re: svnadmin dump - Erroneous UTF-8 encoding with binary files
Posted by Frédéric Hébert <fg...@gmail.com>.
Stefan Sperling wrote:
> On Sun, May 10, 2009 at 09:21:41PM +0200, Frédéric Hébert wrote:
>> Back on the dev server I ran some tests, and it seems to me that the
>> encoding errors are due to the presence of binary files in the dump, e.g.
>> files with the svn:mime-type property set to application/octet-stream.
>> For example, my Django project contains pure plain text in
>> 'trunk/templates' and images in 'trunk/media/images':
>>
>> dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | \
>>     svndumpfilter include 'trunk/templates' > \
>>     /tmp/svn_enseignements_r33_nobinary.dump
>
> Do I understand correctly that you propose that Subversion
> applies the wrong mime-type settings to your text files,
> using the mime-type setting of unrelated binary files for text files?
Of course not!
That's not what I meant. In my dump file there are two kinds of file
content: binary files with the 'application/octet-stream' mime-type and plain
text files with no svn:mime-type set.
What I had in mind was that Subversion handles raw output and encoded
output "separately" in its dump file.
>
> Have you tested what happens when you set the mime-type on text files
> to 'text/plain' explicitly? It would be interesting to see if this
> fixes the problem.
>
I am sometimes (?) a kind of idiot. Of course, a dump file is itself a
binary file without encoding (a file cannot contain both raw binary
and text contents). The character set is handled at import time in
Subversion internals.
I have now run xxd on the FSFS files after an import, and it works: they
are UTF-8.
> Stefan
Sorry about that.
Frédéric
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2176457
Re: svnadmin dump - Erroneous UTF-8 encoding with binary files
Posted by Stefan Sperling <st...@elego.de>.
On Sun, May 10, 2009 at 09:21:41PM +0200, Frédéric Hébert wrote:
> Back on the dev server I ran some tests, and it seems to me that the
> encoding errors are due to the presence of binary files in the dump, e.g.
> files with the svn:mime-type property set to application/octet-stream.
> For example, my Django project contains pure plain text in
> 'trunk/templates' and images in 'trunk/media/images':
>
> dev-server~:$ svnadmin dump -r 33 /var/svn/enseignements-dev.ehess.fr/ | \
>     svndumpfilter include 'trunk/templates' > \
>     /tmp/svn_enseignements_r33_nobinary.dump
Do I understand correctly that you propose that Subversion
applies the wrong mime-type settings to your text files,
using the mime-type setting of unrelated binary files for text files?
Have you tested what happens when you set the mime-type on text files
to 'text/plain' explicitly? It would be interesting to see if this
fixes the problem.
Stefan
Re: svnadmin dump - Erroneous UTF-8 encoding with binary files
Posted by Henrik Sundberg <st...@gmail.com>.
2009/5/10 Frédéric Hébert <fg...@gmail.com>:
> Through a double SSH connection (home PC => firewall => dev server),
> I made a dump of an entire repository from my dev server with "svnadmin
> dump repos > file".
>
> Back on my home computer, I was surprised to see that my dump
> file contained badly encoded UTF-8 characters like the following
Do you see anything wrong in the repository, or is it just the dump
format that seems odd?
The dump is not a text file.
http://svnbook.red-bean.com/en/1.5/svn.reposadmin.maint.html#svn.reposadmin.maint.migrate
says:
"While the Subversion repository dump format contains human-readable
portions and a familiar structure (it resembles an RFC 822 format, the
same type of format used for most email), it is not a plain-text file
format. It is a binary file format, highly sensitive to meddling. For
example, many text editors will corrupt the file by automatically
converting line endings."
/$
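One concrete consequence of the dump being a binary format: the lengths after K and V in a property block count bytes, not characters, so the block has to be read as bytes. The Python sketch below checks this against the revision 1 record quoted in the first message. `parse_props` is a hypothetical helper written for this illustration, not part of Subversion, and it handles only the K / V / PROPS-END lines that appear in this thread.

```python
import io


def parse_props(block: bytes) -> dict:
    """Parse 'K <len>' / 'V <len>' pairs up to PROPS-END, reading values by byte count."""
    stream = io.BytesIO(block)
    props = {}
    while True:
        header = stream.readline().rstrip(b"\n")
        if header == b"PROPS-END":
            return props
        if not header.startswith(b"K "):
            raise ValueError(f"unexpected line: {header!r}")
        key = stream.read(int(header[2:]))
        stream.read(1)  # newline that follows the key
        value_header = stream.readline().rstrip(b"\n")
        if not value_header.startswith(b"V "):
            raise ValueError(f"unexpected line: {value_header!r}")
        value = stream.read(int(value_header[2:]))
        stream.read(1)  # newline that follows the value
        props[key.decode("utf-8")] = value  # values stay as raw bytes


# Log message from the quoted record: 43 characters, 46 bytes in UTF-8.
LOG = "Cr\u00e9ation de l'arborescence de base du d\u00e9p\u00f4t"

SAMPLE = (
    "K 7\nsvn:log\nV 46\n" + LOG + "\n"
    "K 10\nsvn:author\nV 5\nfredo\n"
    "K 8\nsvn:date\nV 27\n2009-04-07T19:59:37.972139Z\n"
    "PROPS-END\n"
).encode("utf-8")

props = parse_props(SAMPLE)
print(len(SAMPLE))                        # 146, the Prop-content-length
print(len(LOG), len(props["svn:log"]))    # 43 46
print(props["svn:log"].decode("utf-8") == LOG)  # True
```

Reading the same block as text and slicing 46 characters instead of 46 bytes would run past the end of the log message, which is why tools that treat the dump as text can corrupt or misread it.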
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2176463