Posted to dev@subversion.apache.org by Barry Scott <ba...@barrys-emacs.org> on 2008/07/03 21:57:18 UTC

Are log messages Unicode?

In pysvn I have assumed that log messages are in UTF-8 and decode
them to Unicode.

A user has reported that their log messages are in Latin-1 and fail
to decode as UTF-8.

Did I make a bad assumption?

Barry


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Are log messages Unicode?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Barry Scott wrote on Sun, 13 Jul 2008 at 00:01 +0100:
> On Jul 12, 2008, at 15:54, Barry Scott wrote:
> > I have the dump of the repos that causes pysvn to fail. 
...
> > 
> > Is this proof that the repos has none UTF-8 log text?
> > 

Yes.

> > svn 1.4.6 is happy to show the log:
> > 

svn trunk shows the log too, in the same way (with ?\ddd escapes).  

> > $ svn log -r219 file:///Users/barry/tmp/repos/trunk/dotfiles
> > ------------------------------------------------------------------------
> > r219 | bortzmeyer | 2003-01-17 14:04:31 +0000 (Fri, 17 Jan 2003) | 3 lines
> > 
> > Bitbucket r?\233serv?\233 ?\224 dev/null
> > Classement dans Mail/spam seulement apr?\232s le localstart qui lance
> > spamc
> > 
> > ------------------------------------------------------------------------
> > 
> > But the \233 are supposed to be é I understand.
> > 

In UTF-8, a lone 0xE9 byte (decimal 233) followed by an ASCII letter is
an invalid sequence; é would be encoded as 0xC3 0xA9.
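
The byte-level point can be checked with a short Python 3 sketch. It assumes that the ?\ddd escapes printed by svn log are decimal byte values (233 is 0xE9), which matches the \xe9 seen in the Python repr of the same message. The helper name unescape_svn_log is made up for this illustration and is not part of svn or pysvn.

```python
import re


def unescape_svn_log(text: str) -> bytes:
    """Turn '?\\ddd' escapes back into raw bytes (ddd assumed to be decimal)."""
    out = bytearray()
    pos = 0
    for m in re.finditer(r"\?\\(\d{3})", text):
        out += text[pos:m.start()].encode("ascii")  # plain text before the escape
        out.append(int(m.group(1)))                  # the escaped byte itself
        pos = m.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


raw = unescape_svn_log(r"Bitbucket r?\233serv?\233 ?\224 dev/null")

# 0xE9 announces a three-byte UTF-8 sequence, but the next byte ('s') is not
# a continuation byte, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

as_latin1 = raw.decode("latin-1")           # every byte maps to one code point
e_acute_utf8 = "\u00e9".encode("utf-8")     # U+00E9 is the letter e-acute

print(valid_utf8)
print(ascii(as_latin1))
print(e_acute_utf8.hex())
```

Read as Latin-1 the bytes give the expected French text, which supports the view that the log message was stored in Latin-1 rather than UTF-8.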

Re: Are log messages Unicode?

Posted by Barry Scott <ba...@barrys-emacs.org>.
On Jul 12, 2008, at 15:54, Barry Scott wrote:

>
> On Jul 7, 2008, at 17:15, Karl Fogel wrote:
>
>> "Ben Collins-Sussman" <su...@red-bean.com> writes:
>>> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <barry@barrys-emacs.org> wrote:
>>>> Using the svn_client API is it possible for a client to write
>>>> none-UTF-8 log messages?
>>>> Clearly if this happened it would be a bug in the client given the
>>>> above statement.
>>>
>>> I don't recall the details, but it's actually the *programmers'*
>>> burden to convert paths and log messages from native locale to UTF8
>>> (and back again).  If you read the svn APIs, you'll notice that every
>>> path and log message passed into APIs (or passed around between APIs)
>>> are presumed to *already* be UTF8.  So if you're writing your own
>>> client, it's your job to convert user input to UTF8 before passing to
>>> svn_client_*().  Look at the commandline client to see how it's doing
>>> that;  I believe there a number of convenience routines in libsvn_subr
>>> to help with conversion.
>>
>> I think Barry's asking if the client and/or server do any validation.
>> That is, if the programmer supplies a non-UTF8 log message, our client
>> libraries should reject it; and if such a log message were to reach the
>> repository (perhaps because someone wrote their own client software from
>> scratch), the repository should reject it too.
>>
>> I don't know whether we do such validation or not, but agree we should.
>>
>> Barry, got time to test/trace it?
>>
>
> I have the dump of the repos that causes pysvn to fail. In the attachment is
> the fragment of the dump file for r219 that causes the problems. If you need the
> whole 3MB of the full dump I'll have to ask permission to pass it on to you.
>
> Python cannot decode the svn:log as utf-8.
>
>  $ python2.5 extract_log_text.py
> 'Bitbucket r\xe9serv\xe9 \xe0 dev/null\nClassement dans Mail/spam seulement apr\xe8s le localstart qui lance spamc\n'
> '\xe9s'
> Traceback (most recent call last):
>   File "extract_log_text.py", line 12, in <module>
>     print log.decode( 'utf-8' )
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 11-13: invalid data
>
> Is this proof that the repos has none UTF-8 log text?
>
> svn 1.4.6 is happy to show the log:
>
> $ svn log -r219 file:///Users/barry/tmp/repos/trunk/dotfiles
> ------------------------------------------------------------------------
> r219 | bortzmeyer | 2003-01-17 14:04:31 +0000 (Fri, 17 Jan 2003) | 3 lines
>
> Bitbucket r?\233serv?\233 ?\224 dev/null
> Classement dans Mail/spam seulement apr?\232s le localstart qui lance spamc
>
> ------------------------------------------------------------------------
>
> But the \233 are supposed to be é I understand.
>
> Barry
>
>

Re: Are log messages Unicode?

Posted by Barry Scott <ba...@barrys-emacs.org>.
On Jul 7, 2008, at 17:15, Karl Fogel wrote:

> "Ben Collins-Sussman" <su...@red-bean.com> writes:
>> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <barry@barrys-emacs.org> wrote:
>>> Using the svn_client API is it possible for a client to write
>>> none-UTF-8 log messages?
>>> Clearly if this happened it would be a bug in the client given the
>>> above statement.
>>
>> I don't recall the details, but it's actually the *programmers'*
>> burden to convert paths and log messages from native locale to UTF8
>> (and back again).  If you read the svn APIs, you'll notice that every
>> path and log message passed into APIs (or passed around between APIs)
>> are presumed to *already* be UTF8.  So if you're writing your own
>> client, it's your job to convert user input to UTF8 before passing to
>> svn_client_*().  Look at the commandline client to see how it's doing
>> that;  I believe there a number of convenience routines in libsvn_subr
>> to help with conversion.
>
> I think Barry's asking if the client and/or server do any validation.
> That is, if the programmer supplies a non-UTF8 log message, our client
> libraries should reject it; and if such a log message were to reach the
> repository (perhaps because someone wrote their own client software from
> scratch), the repository should reject it too.
>
> I don't know whether we do such validation or not, but agree we should.
>
> Barry, got time to test/trace it?
>

I have the dump of the repos that causes pysvn to fail. In the attachment is
the fragment of the dump file for r219 that causes the problems. If you need the
whole 3MB of the full dump I'll have to ask permission to pass it on to you.

Python cannot decode the svn:log as utf-8.

  $ python2.5 extract_log_text.py
'Bitbucket r\xe9serv\xe9 \xe0 dev/null\nClassement dans Mail/spam seulement apr\xe8s le localstart qui lance spamc\n'
'\xe9s'
Traceback (most recent call last):
   File "extract_log_text.py", line 12, in <module>
     print log.decode( 'utf-8' )
   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 11-13: invalid data
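
For a client that has to display such a repository anyway, one possible approach is to try UTF-8 first and fall back to a legacy encoding. The following is only a sketch in Python 3 syntax (the session above used Python 2.5). The function name decode_log_message and the choice of Latin-1 as the fallback are assumptions made for illustration, not pysvn behaviour.

```python
def decode_log_message(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode svn:log bytes as UTF-8, falling back to a legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # With the default latin-1 fallback every byte maps to a code point,
        # so this second decode cannot raise.
        return raw.decode(fallback)


raw = b"Bitbucket r\xe9serv\xe9 \xe0 dev/null\n"

# Strict decoding fails at the first 0xE9 byte, as in the traceback above.
try:
    raw.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first offending byte

text = decode_log_message(raw)
print(failed_at)
print(ascii(text))
```

The fallback is a guess about the author's locale: bytes in some other 8-bit encoding would decode without error but to the wrong characters, so a real client would probably want the fallback to be configurable.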

Is this proof that the repos has non-UTF-8 log text?

svn 1.4.6 is happy to show the log:

$ svn log -r219 file:///Users/barry/tmp/repos/trunk/dotfiles
------------------------------------------------------------------------
r219 | bortzmeyer | 2003-01-17 14:04:31 +0000 (Fri, 17 Jan 2003) | 3 lines

Bitbucket r?\233serv?\233 ?\224 dev/null
Classement dans Mail/spam seulement apr?\232s le localstart qui lance spamc

------------------------------------------------------------------------

But I understand the \233 bytes are supposed to be é.

Barry






Re: Are log messages Unicode?

Posted by Barry Scott <ba...@barrys-emacs.org>.
On Jul 7, 2008, at 17:15, Karl Fogel wrote:

> "Ben Collins-Sussman" <su...@red-bean.com> writes:
>> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <barry@barrys-emacs.org> wrote:
>>> Using the svn_client API is it possible for a client to write
>>> none-UTF-8 log messages?
>>> Clearly if this happened it would be a bug in the client given the
>>> above statement.
>>
>> I don't recall the details, but it's actually the *programmers'*
>> burden to convert paths and log messages from native locale to UTF8
>> (and back again).  If you read the svn APIs, you'll notice that every
>> path and log message passed into APIs (or passed around between APIs)
>> are presumed to *already* be UTF8.  So if you're writing your own
>> client, it's your job to convert user input to UTF8 before passing to
>> svn_client_*().  Look at the commandline client to see how it's doing
>> that;  I believe there a number of convenience routines in libsvn_subr
>> to help with conversion.
>
> I think Barry's asking if the client and/or server do any validation.
> That is, if the programmer supplies a non-UTF8 log message, our client
> libraries should reject it; and if such a log message were to reach the
> repository (perhaps because someone wrote their own client software from
> scratch), the repository should reject it too.
>
> I don't know whether we do such validation or not, but agree we should.
>
> Barry, got time to test/trace it?
>

Karl's correct: I'm not asking for programming help, I'm trying to understand
a user-reported problem with pysvn. If the SVN API does not check that
the strings are UTF-8 then I can close the bug as not pysvn's problem.
Another client must have messed up the user's repos.

The user has not given me their repos to test against.

Barry



Re: Are log messages Unicode?

Posted by Barry Scott <ba...@barrys-emacs.org>.
My user says that the repos was created by cvs2svn and wonders whether
that is the source of the bad log entry.

Barry



On Jul 13, 2008, at 22:37, Neels Janosch Hofmeyr wrote:

> Hi list, long time no see :)
>
> Daniel Shahaf wrote:
>> Karl Fogel wrote on Mon, 7 Jul 2008 at 12:15 -0400:
>>> "Ben Collins-Sussman" <su...@red-bean.com> writes:
>>>> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <barry@barrys-emacs.org> wrote:
>>>>> Using the svn_client API is it possible for a client to write
>>>>> none-UTF-8 log messages?
>>>>> Clearly if this happened it would be a bug in the client given the
>>>>> above statement.
>>>> I don't recall the details, but it's actually the *programmers'*
>>>> burden to convert paths and log messages from native locale to UTF8
>>>> (and back again).  If you read the svn APIs, you'll notice that every
>>>> path and log message passed into APIs (or passed around between APIs)
>>>> are presumed to *already* be UTF8.  So if you're writing your own
>>>> client, it's your job to convert user input to UTF8 before passing to
>>>> svn_client_*().  Look at the commandline client to see how it's doing
>>>> that;  I believe there a number of convenience routines in libsvn_subr
>>>> to help with conversion.
>>> I think Barry's asking if the client and/or server do any validation.
>>> That is, if the programmer supplies a non-UTF8 log message, our client
>>> libraries should reject it; and if such a log message were to reach the
>>> repository (perhaps because someone wrote their own client software from
>>> scratch), the repository should reject it too.
>>>
>>> I don't know whether we do such validation or not, but agree we should.
>>>
>>
>> Since r31614 (Neels' fix of issue #1796) we do UTF-8 validation of log
>> messages in libsvn_repos.  It has not been backported to 1.5.x.
>
> Quoting message "[PATCH] issue 1796: ..." from 03 Jun 2008 by me:
>
> "
> The subversion server and client do not validate props in places where
> they should:
> - where the server receives props from a client out there. (#1796)
> - where the server reads props from the repository file system.
> - where the svn client reads props from a server out there.
> (Approval by kfogel)
>
> [My] patch starts by fixing the specific problems of issue 1796, only:
> - where the server receives props from a client out there. (#1796)
> , and limited only to the log message prop (SVN_PROP_REVISION_LOG).
> "
>
> I am still intending to continue on these issues... (I have been
> diverted because of the social shock following a recent unexpected death
> in my close family)
>
> I am still at the point where I am trying to find out
>
> - the best place to validate props being read from the repository file
> system by the server;
>
> - how to write a unit test on whether the server validates props read
> from the file system (the code that writes *to* the file system now
> validates props; so, how do I get *unvalidated* props written to the
> file system in the first place?);
>
> - the best place to validate props in the client, reading from a server
> out there;
>
> - how to write a unit test on whether the client validates props read
> from a server out there;
>
> - which other props need to be validated;
>
> - what the formats for these other props are (are they, by chance, all
> UTF8 & LF? That would be nice.).
>
> Since other/more people are taking interest in these issues, maybe it
> would make sense to file separate issues in the issue tracker for the
> remaining two cases? :
>
> - where the server reads props from the repository file system.
> - where the svn client reads props from a server out there.
>
>>
>> The cmdline client also does some conversions; in my case, it
>> dropped the bytes it couldn't understand:
>>
>>     % svn ci iota -F dump-fragment.txt
>>     Sending        iota
>>     Transmitting file data .
>>     Committed revision 2.
>>
>>     # It should have failed.  Let's see...
>>     % xxd ../../repos1/db/revprops/0/2
>>     ...
>>     00000a0: 370a 7376 6e3a 6c6f 670a 5620 3130 310a  7.svn:log.V 101.
>>     00000b0: 4269 7462 7563 6b65 7420 7273 6572 7620  Bitbucket rserv
>>     00000c0: 2064 6576 2f6e 756c 6c0a 436c 6173 7365   dev/null.Classe
>>     ...
>>
>>     # Ah, but that's not the log message I specified!
>>     % xxd dump-fragment.txt
>>     0000040: 380a 0a4b 2037 0a73 766e 3a6c 6f67 0a56  8..K 7.svn:log.V
>>     0000050: 2031 3031 0a42 6974 6275 636b 6574 2072   101.Bitbucket r
>>     0000060: e973 6572 76e9 20e0 2064 6576 2f6e 756c  .serv. . dev/nul
>>     # It dropped these bytes:                         ^    ^ ^
>>
>>> Barry, got time to test/trace it?
>
> Hm, that's not nice. Silently dropped bytes aren't good. The user should
> at least be informed about what's happening...
>
> -- 
> Neels Hofmeyr -- elego Software Solutions GmbH
> Gustav-Meyer-Allee 25 / Gebäude 12, 13355 Berlin, Germany
> phone: +49 30 23458696  mobile: +49 177 2345869  fax: +49 30 23458695
> http://www.elegosoft.com | Geschäftsführer: Olaf Wagner | Sitz: Berlin
> Handelsreg: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194
>




Re: Are log messages Unicode?

Posted by Neels Janosch Hofmeyr <ne...@elego.de>.

Daniel Shahaf wrote:
> (patch manager hat on)
> 
> Neels Janosch Hofmeyr wrote on Sun, 13 Jul 2008 at 23:37 +0200:
>> Daniel Shahaf wrote:
>>> Karl Fogel wrote on Mon, 7 Jul 2008 at 12:15 -0400:
>>>> I think Barry's asking if the client and/or server do any validation.
>>>> That is, if the programmer supplies a non-UTF8 log message, our client
>>>> libraries should reject it; and if such a log message were to reach the
>>>> repository (perhaps because someone wrote their own client software from
>>>> scratch), the repository should reject it too.
>>>>
>>>> I don't know whether we do such validation or not, but agree we should.
>>>>
>>> Since r31614 (Neels' fix of issue #1796) we do UTF-8 validation of log 
>>> messages in libsvn_repos.  It has not been backported to 1.5.x.
>> Quoting message "[PATCH] issue 1796: ..." from 03 Jun 2008 by me:
>>
>> "
>> The subversion server and client do not validate props in places where
>> they should:
>> - where the server receives props from a client out there. (#1796)
>> - where the server reads props from the repository file system.
>> - where the svn client reads props from a server out there.
>> (Approval by kfogel)
>>
>> [My] patch starts by fixing the specific problems of issue 1796, only:
>> - where the server receives props from a client out there. (#1796)
>> , and limited only to the log message prop (SVN_PROP_REVISION_LOG).
>> "
>>
>> I am still intending to continue on these issues... (I have been
>> diverted because of the social shock following a recent unexpected death
>> in my close family)
>>
>> I am still at the point where I am trying to find out
>>
> 
> Comments, anyone?
> 
> Neels, I think you can answer some of these questions yourself :)
> 
>> - the best place to validate props being read from the repository file
>> system by the server;
>>
>> - how to write a unit test on whether the server validates props read
>> from the file system (the code that writes *to* the file system now
>> validates props; so, how do I get *unvalidated* props written to the
>> file system in the first place?);
>>
>> - the best place to validate props in the client, reading from a server
>> out there;
>>
>> - how to write a unit test on whether the client validates props read
>> from a server out there;
>>
>> - which other props need to be validated;
>>
>> - what the formats for these other props are (are they, by chance, all
>> UTF8 & LF? That would be nice.).
>>
>> Since other/more people are taking interest in these issues, maybe it
>> would make sense to file separate issues in the issue tracker for the
>> remaining two cases? :
>>
>> - where the server reads props from the repository file system.
>> - where the svn client reads props from a server out there.
>>
>>> The cmdline client also does some conversions; in my case, it
>>> dropped the bytes it couldn't understand:
>>>
>>>     % svn ci iota -F dump-fragment.txt
>>>     Sending        iota
>>>     Transmitting file data .
>>>     Committed revision 2.
>>>     
>>>     # It should have failed.  Let's see...
>>>     % xxd ../../repos1/db/revprops/0/2
>>>     ...
>>>     00000a0: 370a 7376 6e3a 6c6f 670a 5620 3130 310a  7.svn:log.V 101.
>>>     00000b0: 4269 7462 7563 6b65 7420 7273 6572 7620  Bitbucket rserv
>>>     00000c0: 2064 6576 2f6e 756c 6c0a 436c 6173 7365   dev/null.Classe
>>>     ...
>>>
>>>     # Ah, but that's not the log message I specified!
>>>     % xxd dump-fragment.txt
>>>     0000040: 380a 0a4b 2037 0a73 766e 3a6c 6f67 0a56  8..K 7.svn:log.V
>>>     0000050: 2031 3031 0a42 6974 6275 636b 6574 2072   101.Bitbucket r
>>>     0000060: e973 6572 76e9 20e0 2064 6576 2f6e 756c  .serv. . dev/nul
>>>     # It dropped these bytes:                         ^    ^ ^
>>>
>>>> Barry, got time to test/trace it?
>> Hm, that's not nice. Silently dropped bytes aren't good. The user should
>> at least be informed about what's happening...
>>
> 
> +1 (want to write the patch?)
> 
> Daniel
> (who won't have time to review patches in the near future)

There have been some new thoughts about the remaining validations in
http://subversion.tigris.org/servlets/ReadMsg?listName=dev&msgNo=141457
amounting to not validating log messages traveling towards the user.


Answering the original question:

On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <ba...@barrys-emacs.org> wrote:
> Using the svn_client API is it possible for a client to write
> none-UTF-8 log messages?

No, it is not possible to send a non-UTF8 log message using the svn
cmdline client, since it performs a conversion to UTF8-with-LF.

It is, however, possible to do so using any other, lenient client. But
since the patch for 1796 was committed (around 6 Jun 2008), the svn
*server* rejects all non-UTF8 log messages from whichever client.
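
The two sides described above can be sketched in Python. This is not the libsvn code. It assumes that "UTF8-with-LF" means valid UTF-8 with LF-only line endings, and both function names (normalize_log_message, is_acceptable_log_message) are hypothetical. Whether the check added for issue 1796 also rejects CR characters is an assumption here.

```python
def normalize_log_message(message: bytes, source_encoding: str) -> bytes:
    """Client side: recode to UTF-8 and convert CRLF/CR line endings to LF."""
    decoded = message.decode(source_encoding)  # raises if the bytes do not fit
    decoded = decoded.replace("\r\n", "\n").replace("\r", "\n")
    return decoded.encode("utf-8")


def is_acceptable_log_message(raw: bytes) -> bool:
    """Server side: accept only valid UTF-8 with LF-only line endings."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return b"\r" not in raw


forged = b"forged\r\ncommit\r\nmessage\r\n"   # what a lenient client might send
latin1 = b"r\xe9serv\xe9"                      # Latin-1 bytes, invalid as UTF-8

print(is_acceptable_log_message(forged))
print(is_acceptable_log_message(latin1))
print(is_acceptable_log_message(normalize_log_message(latin1, "latin-1")))
```

The point of the split is that a well-behaved client converts before sending, while the server only checks and rejects, so a lenient client cannot store bad data.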


The dropped bytes issue above is not yet accounted for, but is probably
caused by that conversion in the svn cmdline client.

(I guess it's that "translate_string" line of code that I switched off
in my 2nd attachment to issue 1796 on the issue tracker site, trying to
prove a point.

Index: subversion/svn/util.c
===================================================================
--- subversion/svn/util.c	(revision 31304)
+++ subversion/svn/util.c	(working copy)
@@ -651,14 +651,10 @@
  svn_stringbuf_appendcstr(default_msg, APR_EOL_STR APR_EOL_STR);

  *tmp_file = NULL;
- if (lmb->message)
+ if (1)
    {
-     svn_string_t *log_msg_string = svn_string_create(lmb->message, pool);
-
-     SVN_ERR_W(svn_subst_translate_string(&log_msg_string, log_msg_string,
-                                          lmb->message_encoding, pool),
-               _("Error normalizing log message to internal format"));
-
+		SVN_ERR(svn_cmdline_printf(pool, "*** TEST BUILD: FORGING COMMIT MESSAGE ***\n"));
+     svn_string_t *log_msg_string = svn_string_create("forged\r\ncommit\r\nmessage\r\n", pool);
       *log_msg = log_msg_string->data;

       /* Trim incoming messages the EOF marker text and the junk that

)

-- 
Neels Hofmeyr -- elego Software Solutions GmbH
Gustav-Meyer-Allee 25 / Gebäude 12, 13355 Berlin, Germany
phone: +49 30 23458696  mobile: +49 177 2345869  fax: +49 30 23458695
http://www.elegosoft.com | Geschäftsführer: Olaf Wagner | Sitz: Berlin
Handelsreg: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194


Re: Are log messages Unicode?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
(patch manager hat on)

Neels Janosch Hofmeyr wrote on Sun, 13 Jul 2008 at 23:37 +0200:
> Daniel Shahaf wrote:
> > Karl Fogel wrote on Mon, 7 Jul 2008 at 12:15 -0400:
> >> I think Barry's asking if the client and/or server do any validation.
> >> That is, if the programmer supplies a non-UTF8 log message, our client
> >> libraries should reject it; and if such a log message were to reach the
> >> repository (perhaps because someone wrote their own client software from
> >> scratch), the repository should reject it too.
> >>
> >> I don't know whether we do such validation or not, but agree we should.
> >>
> > 
> > Since r31614 (Neels' fix of issue #1796) we do UTF-8 validation of log 
> > messages in libsvn_repos.  It has not been backported to 1.5.x.
> 
> Quoting message "[PATCH] issue 1796: ..." from 03 Jun 2008 by me:
> 
> "
> The subversion server and client do not validate props in places where
> they should:
> - where the server receives props from a client out there. (#1796)
> - where the server reads props from the repository file system.
> - where the svn client reads props from a server out there.
> (Approval by kfogel)
> 
> [My] patch starts by fixing the specific problems of issue 1796, only:
> - where the server receives props from a client out there. (#1796)
> , and limited only to the log message prop (SVN_PROP_REVISION_LOG).
> "
> 
> I am still intending to continue on these issues... (I have been
> diverted because of the social shock following a recent unexpected death
> in my close family)
> 
> I am still at the point where I am trying to find out
> 

Comments, anyone?

Neels, I think you can answer some of these questions yourself :)

> - the best place to validate props being read from the repository file
> system by the server;
> 
> - how to write a unit test on whether the server validates props read
> from the file system (the code that writes *to* the file system now
> validates props; so, how do I get *unvalidated* props written to the
> file system in the first place?);
> 
> - the best place to validate props in the client, reading from a server
> out there;
> 
> - how to write a unit test on whether the client validates props read
> from a server out there;
> 
> - which other props need to be validated;
> 
> - what the formats for these other props are (are they, by chance, all
> UTF8 & LF? That would be nice.).
> 
> Since other/more people are taking interest in these issues, maybe it
> would make sense to file separate issues in the issue tracker for the
> remaining two cases? :
> 
> - where the server reads props from the repository file system.
> - where the svn client reads props from a server out there.
> 
> > 
> > The cmdline client also does some conversions; in my case, it
> > dropped the bytes it couldn't understand:
> > 
> >     % svn ci iota -F dump-fragment.txt
> >     Sending        iota
> >     Transmitting file data .
> >     Committed revision 2.
> >     
> >     # It should have failed.  Let's see...
> >     % xxd ../../repos1/db/revprops/0/2
> >     ...
> >     00000a0: 370a 7376 6e3a 6c6f 670a 5620 3130 310a  7.svn:log.V 101.
> >     00000b0: 4269 7462 7563 6b65 7420 7273 6572 7620  Bitbucket rserv
> >     00000c0: 2064 6576 2f6e 756c 6c0a 436c 6173 7365   dev/null.Classe
> >     ...
> > 
> >     # Ah, but that's not the log message I specified!
> >     % xxd dump-fragment.txt
> >     0000040: 380a 0a4b 2037 0a73 766e 3a6c 6f67 0a56  8..K 7.svn:log.V
> >     0000050: 2031 3031 0a42 6974 6275 636b 6574 2072   101.Bitbucket r
> >     0000060: e973 6572 76e9 20e0 2064 6576 2f6e 756c  .serv. . dev/nul
> >     # It dropped these bytes:                         ^    ^ ^
> > 
> >> Barry, got time to test/trace it?
> 
> Hm, that's not nice. Silently dropped bytes aren't good. The user should
> at least be informed about what's happening...
> 

+1 (want to write the patch?)

Daniel
(who won't have time to review patches in the near future)



Re: Are log messages Unicode?

Posted by Neels Janosch Hofmeyr <ne...@elego.de>.
Hi list, long time no see :)

Daniel Shahaf wrote:
> Karl Fogel wrote on Mon, 7 Jul 2008 at 12:15 -0400:
>> "Ben Collins-Sussman" <su...@red-bean.com> writes:
>>> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <ba...@barrys-emacs.org> wrote:
>>>> Using the svn_client API is it possible for a client to write
>>>> none-UTF-8 log messages?
>>>> Clearly if this happened it would be a bug in the client given the
>>>> above statement.
>>> I don't recall the details, but it's actually the *programmers'*
>>> burden to convert paths and log messages from native locale to UTF8
>>> (and back again).  If you read the svn APIs, you'll notice that every
>>> path and log message passed into APIs (or passed around between APIs)
>>> are presumed to *already* be UTF8.  So if you're writing your own
>>> client, it's your job to convert user input to UTF8 before passing to
>>> svn_client_*().  Look at the commandline client to see how it's doing
>>> that;  I believe there a number of convenience routines in libsvn_subr
>>> to help with conversion.
>> I think Barry's asking if the client and/or server do any validation.
>> That is, if the programmer supplies a non-UTF8 log message, our client
>> libraries should reject it; and if such a log message were to reach the
>> repository (perhaps because someone wrote their own client software from
>> scratch), the repository should reject it too.
>>
>> I don't know whether we do such validation or not, but agree we should.
>>
> 
> Since r31614 (Neels' fix of issue #1796) we do UTF-8 validation of log 
> messages in libsvn_repos.  It has not been backported to 1.5.x.

Quoting message "[PATCH] issue 1796: ..." from 03 Jun 2008 by me:

"
The subversion server and client do not validate props in places where
they should:
- where the server receives props from a client out there. (#1796)
- where the server reads props from the repository file system.
- where the svn client reads props from a server out there.
(Approval by kfogel)

[My] patch starts by fixing the specific problems of issue 1796, only:
- where the server receives props from a client out there. (#1796)
, and limited only to the log message prop (SVN_PROP_REVISION_LOG).
"

I am still intending to continue on these issues... (I have been
diverted because of the social shock following a recent unexpected death
in my close family)

I am still at the point where I am trying to find out

- the best place to validate props being read from the repository file
system by the server;

- how to write a unit test on whether the server validates props read
from the file system (the code that writes *to* the file system now
validates props; so, how do I get *unvalidated* props written to the
file system in the first place?);

- the best place to validate props in the client, reading from a server
out there;

- how to write a unit test on whether the client validates props read
from a server out there;

- which other props need to be validated;

- what the formats for these other props are (are they, by chance, all
UTF8 & LF? That would be nice.).

Since other/more people are taking interest in these issues, maybe it
would make sense to file separate issues in the issue tracker for the
remaining two cases? :

- where the server reads props from the repository file system.
- where the svn client reads props from a server out there.

> 
> The cmdline client also does some conversions; in my case, it
> dropped the bytes it couldn't understand:
> 
>     % svn ci iota -F dump-fragment.txt
>     Sending        iota
>     Transmitting file data .
>     Committed revision 2.
>     
>     # It should have failed.  Let's see...
>     % xxd ../../repos1/db/revprops/0/2
>     ...
>     00000a0: 370a 7376 6e3a 6c6f 670a 5620 3130 310a  7.svn:log.V 101.
>     00000b0: 4269 7462 7563 6b65 7420 7273 6572 7620  Bitbucket rserv
>     00000c0: 2064 6576 2f6e 756c 6c0a 436c 6173 7365   dev/null.Classe
>     ...
> 
>     # Ah, but that's not the log message I specified!
>     % xxd dump-fragment.txt
>     0000040: 380a 0a4b 2037 0a73 766e 3a6c 6f67 0a56  8..K 7.svn:log.V
>     0000050: 2031 3031 0a42 6974 6275 636b 6574 2072   101.Bitbucket r
>     0000060: e973 6572 76e9 20e0 2064 6576 2f6e 756c  .serv. . dev/nul
>     # It dropped these bytes:                         ^    ^ ^
> 
>> Barry, got time to test/trace it?

Hm, that's not nice. Silently dropped bytes aren't good. The user should
at least be informed about what's happening...

-- 
Neels Hofmeyr -- elego Software Solutions GmbH
Gustav-Meyer-Allee 25 / Gebäude 12, 13355 Berlin, Germany
phone: +49 30 23458696  mobile: +49 177 2345869  fax: +49 30 23458695
http://www.elegosoft.com | Geschäftsführer: Olaf Wagner | Sitz: Berlin
Handelsreg: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194


Re: Are log messages Unicode?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Karl Fogel wrote on Mon, 7 Jul 2008 at 12:15 -0400:
> "Ben Collins-Sussman" <su...@red-bean.com> writes:
> > On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <ba...@barrys-emacs.org> wrote:
> >> Using the svn_client API is it possible for a client to write
> >> none-UTF-8 log messages?
> >> Clearly if this happened it would be a bug in the client given the
> >> above statement.
> >
> > I don't recall the details, but it's actually the *programmers'*
> > burden to convert paths and log messages from native locale to UTF8
> > (and back again).  If you read the svn APIs, you'll notice that every
> > path and log message passed into APIs (or passed around between APIs)
> > are presumed to *already* be UTF8.  So if you're writing your own
> > client, it's your job to convert user input to UTF8 before passing to
> > svn_client_*().  Look at the commandline client to see how it's doing
> > that;  I believe there a number of convenience routines in libsvn_subr
> > to help with conversion.
> 
> I think Barry's asking if the client and/or server do any validation.
> That is, if the programmer supplies a non-UTF8 log message, our client
> libraries should reject it; and if such a log message were to reach the
> repository (perhaps because someone wrote their own client software from
> scratch), the repository should reject it too.
> 
> I don't know whether we do such validation or not, but agree we should.
> 

Since r31614 (Neels' fix of issue #1796) we do UTF-8 validation of log 
messages in libsvn_repos.  It has not been backported to 1.5.x.

The cmdline client also does some conversions; in my case, it
dropped the bytes it couldn't understand:

    % svn ci iota -F dump-fragment.txt
    Sending        iota
    Transmitting file data .
    Committed revision 2.
    
    # It should have failed.  Let's see...
    % xxd ../../repos1/db/revprops/0/2
    ...
    00000a0: 370a 7376 6e3a 6c6f 670a 5620 3130 310a  7.svn:log.V 101.
    00000b0: 4269 7462 7563 6b65 7420 7273 6572 7620  Bitbucket rserv
    00000c0: 2064 6576 2f6e 756c 6c0a 436c 6173 7365   dev/null.Classe
    ...

    # Ah, but that's not the log message I specified!
    % xxd dump-fragment.txt
    0000040: 380a 0a4b 2037 0a73 766e 3a6c 6f67 0a56  8..K 7.svn:log.V
    0000050: 2031 3031 0a42 6974 6275 636b 6574 2072   101.Bitbucket r
    0000060: e973 6572 76e9 20e0 2064 6576 2f6e 756c  .serv. . dev/nul
    # It dropped these bytes:                         ^    ^ ^
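
To make the byte-level picture concrete, here is a minimal Python sketch
built from the first line of the log message in the xxd output above.
It shows that the message is not valid UTF-8, that it reads sensibly as
latin-1, and that discarding the undecodable bytes gives the same text
that ended up in the revprops file.  It only imitates the observed
result; it is not a claim about how the svn client arrives at it.

```python
# Bytes taken from the xxd dump of dump-fragment.txt above.
raw = b"Bitbucket r\xe9serv\xe9 \xe0 dev/null"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Interpreting the bytes as latin-1 recovers the intended French text.
as_latin1 = raw.decode("latin-1")

# Discarding undecodable bytes mimics what was stored in the repository.
dropped = raw.decode("utf-8", errors="ignore")

# The correct UTF-8 form of the same text; each accented letter takes two bytes.
as_utf8 = as_latin1.encode("utf-8")

print(is_valid_utf8(raw))
print(as_latin1)
print(dropped)
print(as_utf8)
```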

> Barry, got time to test/trace it?


Re: Are log messages Unicode?

Posted by Karl Fogel <kf...@red-bean.com>.
"Ben Collins-Sussman" <su...@red-bean.com> writes:
> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <ba...@barrys-emacs.org> wrote:
>> Using the svn_client API is it possible for a client to write
>> non-UTF-8 log messages?
>> Clearly if this happened it would be a bug in the client given the
>> above statement.
>
> I don't recall the details, but it's actually the *programmers'*
> burden to convert paths and log messages from native locale to UTF8
> (and back again).  If you read the svn APIs, you'll notice that every
> path and log message passed into APIs (or passed around between APIs)
> is presumed to *already* be UTF8.  So if you're writing your own
> client, it's your job to convert user input to UTF8 before passing to
> svn_client_*().  Look at the commandline client to see how it's doing
> that;  I believe there are a number of convenience routines in libsvn_subr
> to help with conversion.

I think Barry's asking if the client and/or server do any validation.
That is, if the programmer supplies a non-UTF8 log message, our client
libraries should reject it; and if such a log message were to reach the
repository (perhaps because someone wrote their own client software from
scratch), the repository should reject it too.

I don't know whether we do such validation or not, but agree we should.

Barry, got time to test/trace it?
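
The kind of check meant here could look roughly like the Python sketch
below: reject the message outright and say where the first bad byte is,
rather than passing it through or silently altering it.  The function
name validate_utf8_log is hypothetical and this is not Subversion code.

```python
def validate_utf8_log(message: bytes) -> str:
    """Return the decoded log message, or raise ValueError if it is not UTF-8."""
    try:
        return message.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        bad = message[exc.start]
        raise ValueError(
            f"log message is not valid UTF-8: "
            f"byte 0x{bad:02x} at offset {exc.start}"
        ) from exc


if __name__ == "__main__":
    for candidate in (b"plain ASCII is fine", b"Bitbucket r\xe9serv\xe9"):
        try:
            print("accepted:", validate_utf8_log(candidate))
        except ValueError as err:
            print("rejected:", err)
```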


Re: Are log messages Unicode?

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <ba...@barrys-emacs.org> wrote:
>
> On Jul 3, 2008, at 23:12, Ben Collins-Sussman wrote:
>
>> The repository stores all log messages as UTF8, and they travel that
>> way to the clients.  The clients are responsible for converting UTF8
>> to the native locale.  Thus, if a user has latin-1 set as a native
>> locale, that's what 'svn log' will show him.  For more info, see:
>>
>>   http://svnbook.red-bean.com/nightly/en/svn.advanced.l10n.html
>>
>
> Using the svn_client API is it possible for a client to write non-UTF-8 log
> messages?
> Clearly if this happened it would be a bug in the client given the above
> statement.

I don't recall the details, but it's actually the *programmers'*
burden to convert paths and log messages from native locale to UTF8
(and back again).  If you read the svn APIs, you'll notice that every
path and log message passed into APIs (or passed around between APIs)
is presumed to *already* be UTF8.  So if you're writing your own
client, it's your job to convert user input to UTF8 before passing to
svn_client_*().  Look at the commandline client to see how it's doing
that;  I believe there are a number of convenience routines in libsvn_subr
to help with conversion.
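
As a rough illustration of that division of labour, the Python sketch
below does the conversion in both directions with the standard codecs.
The helper names native_to_utf8 and utf8_to_native are made up for this
example, and it stands in for, rather than reproduces, the libsvn_subr
routines.  It assumes the caller knows the native encoding, or is happy
with whatever the locale reports.

```python
import locale
from typing import Optional


def native_to_utf8(raw: bytes, native_encoding: Optional[str] = None) -> bytes:
    """Convert user input from the native encoding to UTF-8 before a commit."""
    if native_encoding is None:
        # Assumption: the locale's preferred encoding is what the user typed in.
        native_encoding = locale.getpreferredencoding(False)
    # Raises UnicodeDecodeError if raw is not valid in the native encoding.
    return raw.decode(native_encoding).encode("utf-8")


def utf8_to_native(data: bytes, native_encoding: Optional[str] = None) -> bytes:
    """Convert UTF-8 from the repository to the native encoding for display."""
    if native_encoding is None:
        native_encoding = locale.getpreferredencoding(False)
    # Characters the native encoding cannot represent become "?".
    return data.decode("utf-8").encode(native_encoding, errors="replace")


if __name__ == "__main__":
    utf8 = native_to_utf8(b"r\xe9serv\xe9", "latin-1")
    print(utf8)
    print(utf8_to_native(utf8, "latin-1"))
```

Converting on input is strict on purpose: a message that cannot be
decoded in the native encoding is better reported to the user than
guessed at.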


Re: Are log messages Unicode?

Posted by Barry Scott <ba...@barrys-emacs.org>.
On Jul 3, 2008, at 23:12, Ben Collins-Sussman wrote:

> The repository stores all log messages as UTF8, and they travel that
> way to the clients.  The clients are responsible for converting UTF8
> to the native locale.  Thus, if a user has latin-1 set as a native
> locale, that's what 'svn log' will show him.  For more info, see:
>
>    http://svnbook.red-bean.com/nightly/en/svn.advanced.l10n.html
>

Using the svn_client API, is it possible for a client to write
non-UTF-8 log messages?
Clearly if this happened it would be a bug in the client given the above
statement.

Barry



Re: Are log messages Unicode?

Posted by Ben Collins-Sussman <su...@red-bean.com>.
The repository stores all log messages as UTF8, and they travel that
way to the clients.  The clients are responsible for converting UTF8
to the native locale.  Thus, if a user has latin-1 set as a native
locale, that's what 'svn log' will show him.  For more info, see:

   http://svnbook.red-bean.com/nightly/en/svn.advanced.l10n.html

On Thu, Jul 3, 2008 at 4:57 PM, Barry Scott <ba...@barrys-emacs.org> wrote:
> In pysvn I have assumed that log messages are in UTF-8 and decode them to
> unicode.
>
> A user has reported that their logs are in latin-1 and fail to decode as
> UTF-8.
>
> Did I make a bad assumption?
>
> Barry
