You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Yasuhito FUTATSUKI <fu...@yf.bsdclub.org> on 2020/12/14 18:51:03 UTC

Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

First of all, thank you for doing this. I worried that mailer.py didn't
support Python 3 for a long time.

In message <20...@svn01-us-east.apache.org>, stsp@apache.o
rg writes:
>Author: stsp
>Date: Mon Dec 14 16:57:10 2020
>New Revision: 1884427
>
>URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
>Log:
>Make mailer.py work properly with Python 3, and drop Python 2 support.
>
>Most of the changes deal with the handling binary data vs Python strings.
>
>I've made sure that mailer.py will work in a UTF-8 environment. In general,
>UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
>Environments using other encodings may not work as expected, but those will
>be problematic for hook scripts in general.

Perhaps the problem on locale other than UTF-8 is only on iterpretation of
REPOS-PATH argument. Other paths in the script are all Subversion's
internal paths.

>                                            SVN repositories store internal
>data such as paths in UTF-8. Our Python3 bindings do not deal with encoding
>or decoding of such data, and thus need to work with raw UTF-8 strings, not
>Python strings.

I have no objection the code itself, however it is a bit incorrect.

Our Python3 bindings accept str objects as inputs for char *. It always
convert them to UTF-8 C strings on all APIs without regarding preferred
encoding or file system encoding on Python.  As some of APIs accept
local paths and they should be encoded as locale's charset/encoding,
I also prefer to encode to bytes on caller side explicitly before call
APIs, to avoid making bugs.  Of course, return values of wrapper APIs
corresponding char ** args and return value of C API are not decoded,
raw bytes objects. 


This commit also fixed an issue that it could truncate Subject on the
way of multi-byte character sequence, and parhaps it also changes the
unit of truncate_subject config parameter, from octets on raw subject
to characters on raw subject. They are not equal if a subject contains 
multi-byte characters.

The issue of RFC5321 violation of max line length I pointed out on
Janunary might be fixed by this commit because of fix on
email.header.Header implementation of Python 3 library, but I've not
confirmed yet.

Cheers,
-- 
Yasuhito FUTATSUKI <fu...@yf.bsdclub.org>

Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

Posted by Yasuhito FUTATSUKI <fu...@yf.bsdclub.org>.
On 2021/01/15 7:30, Stefan Sperling wrote:
> On Fri, Jan 15, 2021 at 01:44:43AM +0900, Yasuhito FUTATSUKI wrote:
>> On 2020/12/29 0:49, Stefan Sperling wrote:
>>> On Tue, Dec 15, 2020 at 07:47:59PM +0900, Yasuhito FUTATSUKI wrote:
>>>> As far as I read the code again, it seems this is already support
>>>> locales other than C/C.UTF-8 if mailer.py is kicked with appropriate
>>>> LC_CTYPE environment. The REPOS-PATH argment comes from sys.argv[2]
>>>> already decoded in file system encoding (i.e. encoding corresponding
>>>> to LC_CTYPE) and Python encodes it if access to file system. On the
>>>> other hand, svn.repos.open() wrapper function accepts str and encode
>>>> it to UTF-8 string, as the svn_repos_open() C API expects.
>>>>
>>>> Then I overlooked sys.argv[x] and cmd_args[x] in main() are
>>>> already str.
>>>>
>>>> Perhaps it needs (although the last hunk is not nessesary):
>>>
>>> Yes, indeed. If python already encodes arguments to strings then this
>>> patch looks correct to me. It also works as expected in my testing.
>> I revised the patch and checked in UTF-8 locale. Now it seems it works
>> with post-revprop-change, post-lock and post-unlock hooks.
>>
>> If this patch is OK, then I'll do:
>>
>> (1) Make it support Python 2 again, to merge it to 1.14.x.
>> (2) Fix an issue to encode Unicode paths in Subject header.
>> (3) Fix an issue of max line length in Subject header.
>>
>> Perhaps (2) and (3) will be resolved at the same time.
> 
> Yes, in my opinion this patch is OK.
> The mailer.py test is still passing and your changes look fine to me.
> 
> Thank you! My experience with python2/3 conversion is limited, and your
> help with this is much appreciated.

I committed it in r1885557. Perhaps If you did not commit r1884427,
I didn't work for it so quickly (yes, even this is quickly). Thank you.

Then I'll go ahead to resolve above issues.

Cheers,
-- 
Yasuhito FUTATSUKI <fu...@yf.bsclub.org>

Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Jan 15, 2021 at 01:44:43AM +0900, Yasuhito FUTATSUKI wrote:
> On 2020/12/29 0:49, Stefan Sperling wrote:
> > On Tue, Dec 15, 2020 at 07:47:59PM +0900, Yasuhito FUTATSUKI wrote:
> >> As far as I read the code again, it seems this is already support
> >> locales other than C/C.UTF-8 if mailer.py is kicked with appropriate
> >> LC_CTYPE environment. The REPOS-PATH argment comes from sys.argv[2]
> >> already decoded in file system encoding (i.e. encoding corresponding
> >> to LC_CTYPE) and Python encodes it if access to file system. On the
> >> other hand, svn.repos.open() wrapper function accepts str and encode
> >> it to UTF-8 string, as the svn_repos_open() C API expects.
> >>
> >> Then I overlooked sys.argv[x] and cmd_args[x] in main() are
> >> already str.
> >>
> >> Perhaps it needs (although the last hunk is not nessesary):
> > 
> > Yes, indeed. If python already encodes arguments to strings then this
> > patch looks correct to me. It also works as expected in my testing.
> I revised the patch and checked in UTF-8 locale. Now it seems it works
> with post-revprop-change, post-lock and post-unlock hooks.
> 
> If this patch is OK, then I'll do:
> 
> (1) Make it support Python 2 again, to merge it to 1.14.x.
> (2) Fix an issue to encode Unicode paths in Subject header.
> (3) Fix an issue of max line length in Subject header.
> 
> Perhaps (2) and (3) will be resolved at the same time.

Yes, in my opinion this patch is OK.
The mailer.py test is still passing and your changes look fine to me.

Thank you! My experience with python2/3 conversion is limited, and your
help with this is much appreciated.

Regards,
Stefan

> Cheers,
> -- 
> Yasuhito FUTATSUKI <fu...@yf.bsclub.org>

> Follow up to r1884427: mailer.py: Additional fix for Python 3 support 
> 
> * tools/hook-scripts/mailer/mailer.py
>   ():
>    - Don't include StringIO.
>    - import locale for output class 'StandardOutput'.
>   (main): Don't decode cmd_args[x], because they are already str.
>   (OutputBase.write_binary): New method, as the abstract base class of
>    output classes.
>   (OutputBase.write): Tweak comment. This method has implementation now.
>   (OutputBase.run): Use self.write_binary() to write content of buf.
>   (StandardOutput.write): Override default write method to use default
>    encoding for console output.
>   (PropChange.generate): Use sys.stdin.buffer instead of sys.stdin to read
>    from stdin as bytes.
>   (Lock.__init__):
>     - Use sys.stdin.buffer instead of sys.stdin to read from stdin as bytes.
>     - Strip new line character from items of self.dirlist
>   (DiffGenerator.__getitem__): Pass change.path as is to svn.fs.FileDiff.
>   (TextCommitRenderer.render): Don't decode data.author. data.author is
>    already bytes (or None) here.
>   (Repository.__init__): Ensure self.author is str or None.
>   ((entry routine)): Encode repository path as UTF-8 before canonicalize
>    explicitly.
>    
> Index: tools/hook-scripts/mailer/mailer.py
> ===================================================================
> --- tools/hook-scripts/mailer/mailer.py	(revision 1885374)
> +++ tools/hook-scripts/mailer/mailer.py	(working copy)
> @@ -50,10 +50,11 @@
>  from urllib.parse import quote as urllib_parse_quote
>  import time
>  import subprocess
> -from io import BytesIO, StringIO
> +from io import BytesIO
>  import smtplib
>  import re
>  import tempfile
> +import locale
>  
>  # Minimal version of Subversion's bindings required
>  _MIN_SVN_VERSION = [1, 5, 0]
> @@ -88,10 +89,10 @@
>      messenger = Commit(pool, cfg, repos)
>    elif cmd == 'propchange' or cmd == 'propchange2':
>      revision = int(cmd_args[0])
> -    author = cmd_args[1].decode('utf-8')
> -    propname = cmd_args[2].decode('utf-8')
> +    author = cmd_args[1]
> +    propname = cmd_args[2]
>      if cmd == 'propchange2' and cmd_args[3]:
> -      action = cmd_args[3].decode('utf-8')
> +      action = cmd_args[3]
>      else:
>        action = 'A'
>      repos = Repository(repos_dir, revision, pool)
> @@ -103,7 +104,7 @@
>                   })
>      messenger = PropChange(pool, cfg, repos, author, propname, action)
>    elif cmd == 'lock' or cmd == 'unlock':
> -    author = cmd_args[0].decode('utf-8')
> +    author = cmd_args[0]
>      repos = Repository(repos_dir, 0, pool) ### any old revision will do
>      # Override the repos revision author with the author of the lock/unlock
>      repos.author = author
> @@ -169,9 +170,13 @@
>      representation."""
>      raise NotImplementedError
>  
> +  def write_binary(self, output):
> +    """Override this method.
> +    Append the binary data OUTPUT to the output representation."""
> +    raise NotImplementedError
> +
>    def write(self, output):
> -    """Override this method.
> -    Append the literal text string OUTPUT to the output representation."""
> +    """Append the literal text string OUTPUT to the output representation."""
>      return self.write_binary(output.encode('utf-8'))
>  
>    def run(self, cmd):
> @@ -184,7 +189,7 @@
>  
>      buf = pipe_ob.stdout.read(self._CHUNKSIZE)
>      while buf:
> -      self.write(buf)
> +      self.write_binary(buf)
>        buf = pipe_ob.stdout.read(self._CHUNKSIZE)
>  
>      # wait on the child so we don't end up with a billion zombies
> @@ -360,7 +365,12 @@
>    def finish(self):
>      pass
>  
> +  def write(self, output):
> +    """Write text as *default* encoding string"""
> +    return self.write_binary(output.encode(locale.getpreferredencoding(),
> +                                           'backslashreplace'))
>  
> +
>  class PipeOutput(MailedOutput):
>    "Deliver a mail message to an MTA via a pipe."
>  
> @@ -532,7 +542,7 @@
>          elif self.action == 'M':
>            self.output.write('Property diff:\n')
>            tempfile1 = tempfile.NamedTemporaryFile()
> -          tempfile1.write(sys.stdin.read())
> +          tempfile1.write(sys.stdin.buffer.read())
>            tempfile1.flush()
>            tempfile2 = tempfile.NamedTemporaryFile()
>            tempfile2.write(self.repos.get_rev_prop(self.propname))
> @@ -594,9 +604,8 @@
>                          or 'unlock_subject_prefix'))
>  
>      # read all the locked paths from STDIN and strip off the trailing newlines
> -    self.dirlist = [x for x in sys.stdin.readlines()]
> -    for i in range(len(self.dirlist)):
> -      self.dirlist[i] = self.dirlist[i].decode('utf-8')
> +    self.dirlist = [x.decode('utf-8').rstrip()
> +                    for x in sys.stdin.buffer.readlines()]
>  
>      # collect the set of groups and the unique sets of params for the options
>      self.groups = { }
> @@ -910,7 +919,7 @@
>              kind = 'C'
>              if self.diffsels.copy:
>                diff = svn.fs.FileDiff(None, None, self.repos.root_this,
> -                                     change.path.decode('utf-8'), self.pool)
> +                                     change.path, self.pool)
>                label1 = '/dev/null\t00:00:00 1970\t' \
>                         '(empty, because file is newly added)'
>                label2 = '%s\t%s\t(r%s, copy of r%s, %s)' \
> @@ -1092,7 +1101,7 @@
>  
>      w = self.output.write
>  
> -    w('Author: %s\nDate: %s\nNew Revision: %s\n' % (data.author.decode('utf-8'),
> +    w('Author: %s\nDate: %s\nNew Revision: %s\n' % (data.author,
>                                                        data.date,
>                                                        data.rev))
>  
> @@ -1221,6 +1230,8 @@
>      self.root_this = self.get_root(rev)
>  
>      self.author = self.get_rev_prop(svn.core.SVN_PROP_REVISION_AUTHOR)
> +    if self.author is not None:
> +      self.author = self.author.decode('utf-8')
>  
>    def get_rev_prop(self, propname, rev = None):
>      if not rev:
> @@ -1510,7 +1521,7 @@
>      usage()
>  
>    cmd = sys.argv[1]
> -  repos_dir = svn.core.svn_path_canonicalize(sys.argv[2])
> +  repos_dir = svn.core.svn_path_canonicalize(sys.argv[2].encode('utf-8'))
>    repos_dir = repos_dir.decode('utf-8')
>    try:
>      expected_args = cmd_list[cmd]


Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

Posted by Yasuhito FUTATSUKI <fu...@yf.bsdclub.org>.
On 2020/12/29 0:49, Stefan Sperling wrote:
> On Tue, Dec 15, 2020 at 07:47:59PM +0900, Yasuhito FUTATSUKI wrote:
>> As far as I read the code again, it seems this is already support
>> locales other than C/C.UTF-8 if mailer.py is kicked with appropriate
>> LC_CTYPE environment. The REPOS-PATH argment comes from sys.argv[2]
>> already decoded in file system encoding (i.e. encoding corresponding
>> to LC_CTYPE) and Python encodes it if access to file system. On the
>> other hand, svn.repos.open() wrapper function accepts str and encode
>> it to UTF-8 string, as the svn_repos_open() C API expects.
>>
>> Then I overlooked sys.argv[x] and cmd_args[x] in main() are
>> already str.
>>
>> Perhaps it needs (although the last hunk is not nessesary):
> 
> Yes, indeed. If python already encodes arguments to strings then this
> patch looks correct to me. It also works as expected in my testing.
I revised the patch and checked in UTF-8 locale. Now it seems it works
with post-revprop-change, post-lock and post-unlock hooks.

If this patch is OK, then I'll do:

(1) Make it support Python 2 again, to merge it to 1.14.x.
(2) Fix an issue to encode Unicode paths in Subject header.
(3) Fix an issue of max line length in Subject header.

Perhaps (2) and (3) will be resolved at the same time.

Cheers,
-- 
Yasuhito FUTATSUKI <fu...@yf.bsclub.org>

Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Dec 15, 2020 at 07:47:59PM +0900, Yasuhito FUTATSUKI wrote:
> As far as I read the code again, it seems this is already support
> locales other than C/C.UTF-8 if mailer.py is kicked with appropriate
> LC_CTYPE environment. The REPOS-PATH argment comes from sys.argv[2]
> already decoded in file system encoding (i.e. encoding corresponding
> to LC_CTYPE) and Python encodes it if access to file system. On the
> other hand, svn.repos.open() wrapper function accepts str and encode
> it to UTF-8 string, as the svn_repos_open() C API expects.
> 
> Then I overlooked sys.argv[x] and cmd_args[x] in main() are
> already str.
> 
> Perhaps it needs (although the last hunk is not nessesary):

Yes, indeed. If python already encodes arguments to strings then this
patch looks correct to me. It also works as expected in my testing.

> [[[
> Index: tools/hook-scripts/mailer/mailer.py
> ===================================================================
> --- tools/hook-scripts/mailer/mailer.py (revision 1884434)
> +++ tools/hook-scripts/mailer/mailer.py (working copy)
> @@ -88,10 +88,10 @@
>      messenger = Commit(pool, cfg, repos)
>    elif cmd == 'propchange' or cmd == 'propchange2':
>      revision = int(cmd_args[0])
> -    author = cmd_args[1].decode('utf-8')
> -    propname = cmd_args[2].decode('utf-8')
> +    author = cmd_args[1]
> +    propname = cmd_args[2]
>      if cmd == 'propchange2' and cmd_args[3]:
> -      action = cmd_args[3].decode('utf-8')
> +      action = cmd_args[3]
>      else:
>        action = 'A'
>      repos = Repository(repos_dir, revision, pool)
> @@ -103,7 +103,7 @@
>                   })
>      messenger = PropChange(pool, cfg, repos, author, propname, action)
>    elif cmd == 'lock' or cmd == 'unlock':
> -    author = cmd_args[0].decode('utf-8')
> +    author = cmd_args[0]
>      repos = Repository(repos_dir, 0, pool) ### any old revision will do
>      # Override the repos revision author with the author of the lock/unlock
>      repos.author = author
> @@ -1510,7 +1510,7 @@
>      usage()
>  
>    cmd = sys.argv[1]
> -  repos_dir = svn.core.svn_path_canonicalize(sys.argv[2])
> +  repos_dir = svn.core.svn_path_canonicalize(sys.argv[2].encode('utf-8'))
>    repos_dir = repos_dir.decode('utf-8')
>    try:
>      expected_args = cmd_list[cmd]
> ]]]

Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

Posted by Yasuhito FUTATSUKI <fu...@yf.bsdclub.org>.
On 2020/12/15 5:06, Stefan Sperling wrote:
> On Tue, Dec 15, 2020 at 03:51:03AM +0900, Yasuhito FUTATSUKI wrote:
>> First of all, thank you for doing this. I worried that mailer.py didn't
>> support Python 3 for a long time.
>>
>> In message <20...@svn01-us-east.apache.org>, stsp@apache.o
>> rg writes:
>>> Author: stsp
>>> Date: Mon Dec 14 16:57:10 2020
>>> New Revision: 1884427
>>>
>>> URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
>>> Log:
>>> Make mailer.py work properly with Python 3, and drop Python 2 support.
>>>
>>> Most of the changes deal with the handling binary data vs Python strings.
>>>
>>> I've made sure that mailer.py will work in a UTF-8 environment. In general,
>>> UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
>>> Environments using other encodings may not work as expected, but those will
>>> be problematic for hook scripts in general.
>>
>> Perhaps the problem on locale other than UTF-8 is only on iterpretation of
>> REPOS-PATH argument. Other paths in the script are all Subversion's
>> internal paths.
> 
> Yes, I agree that the REPOS-PATH argument is an exceptional case.
> Unforunately, we have to decode repos_dir because it is used with the
> os.path.join() function.

As far as I read the code again, it seems this is already support
locales other than C/C.UTF-8 if mailer.py is kicked with appropriate
LC_CTYPE environment. The REPOS-PATH argment comes from sys.argv[2]
already decoded in file system encoding (i.e. encoding corresponding
to LC_CTYPE) and Python encodes it if access to file system. On the
other hand, svn.repos.open() wrapper function accepts str and encode
it to UTF-8 string, as the svn_repos_open() C API expects.

Then I overlooked sys.argv[x] and cmd_args[x] in main() are
already str.

Perhaps it needs (although the last hunk is not nessesary):
[[[
Index: tools/hook-scripts/mailer/mailer.py
===================================================================
--- tools/hook-scripts/mailer/mailer.py (revision 1884434)
+++ tools/hook-scripts/mailer/mailer.py (working copy)
@@ -88,10 +88,10 @@
     messenger = Commit(pool, cfg, repos)
   elif cmd == 'propchange' or cmd == 'propchange2':
     revision = int(cmd_args[0])
-    author = cmd_args[1].decode('utf-8')
-    propname = cmd_args[2].decode('utf-8')
+    author = cmd_args[1]
+    propname = cmd_args[2]
     if cmd == 'propchange2' and cmd_args[3]:
-      action = cmd_args[3].decode('utf-8')
+      action = cmd_args[3]
     else:
       action = 'A'
     repos = Repository(repos_dir, revision, pool)
@@ -103,7 +103,7 @@
                  })
     messenger = PropChange(pool, cfg, repos, author, propname, action)
   elif cmd == 'lock' or cmd == 'unlock':
-    author = cmd_args[0].decode('utf-8')
+    author = cmd_args[0]
     repos = Repository(repos_dir, 0, pool) ### any old revision will do
     # Override the repos revision author with the author of the lock/unlock
     repos.author = author
@@ -1510,7 +1510,7 @@
     usage()
 
   cmd = sys.argv[1]
-  repos_dir = svn.core.svn_path_canonicalize(sys.argv[2])
+  repos_dir = svn.core.svn_path_canonicalize(sys.argv[2].encode('utf-8'))
   repos_dir = repos_dir.decode('utf-8')
   try:
     expected_args = cmd_list[cmd]
]]]

(Also, I found svn.repos.open() and svn.core.svn_path_canonicalize() already
accepted str in commited code :))

> There is no way of knowing which character encoding is used for paths.
> At least on UNIX a filename is just a string of bytes and it is independent
> of the locale.
> 
> Note that hook scripts will run in an empty environment by default.
> Which means this script will run in the "C" locale by default.
> If another locale is desired, the administrator needs to configure the
> server as documented here:
> https://subversion.apache.org/docs/release-notes/1.8.html#hooks-env
> 
> My assumption is that for hook scripts like this one, the server will be
> configured with SVNUseUTF8 and the hook script environment will contain
> something like LANG=C.UTF-8.
> 
> If svnadmin is then also used in the UTF-8 locale to create repositories,
> the hook script should work fine.

I assumed that hook scripts kicked by file:// or svn+ssh:// scheme.

I think even if the repos path is neither UTF-8 nor ascii but appropriate
encoding string, hook scripts themselves know it and can invoke mailer.py
script with appropriate LC_CTYPE by using env(1) utility or directly setting
environment variable within hook scripts themselves.

Of course, we can easily move or make symlink of repository path from
non UTF-8 path to UTF-8 or ascii path, so I don't insist that we should
support non UTF-8 path.

>>> data such as paths in UTF-8. Our Python3 bindings do not deal with encoding
>>> or decoding of such data, and thus need to work with raw UTF-8 strings, not
>>> Python strings.
>>
>> I have no objection the code itself, however it is a bit incorrect.
>>
>> Our Python3 bindings accept str objects as inputs for char *.                                           ^ as well as bytes objects

                                     If str objects are passed, v
>>                                                               It always
>> convert them to UTF-8 C strings on all APIs without regarding preferred
>> encoding or file system encoding on Python.  As some of APIs accept
>> local paths and they should be encoded as locale's charset/encoding,
>> I also prefer to encode to bytes on caller side explicitly before call
>> APIs, to avoid making bugs.  Of course, return values of wrapper APIs
>> corresponding char ** args and return value of C API are not decoded,
>> raw bytes objects. 
> 
> Thank you, I did not know this. The patch below undoes changes which I
> made in order to encode input to SVN APIs as UTF-8.
> 
> The mailer.py test is still passing fine with this change.

Yes, however as I wrote above I prefer not to revert them.
 
I also found some bytes/str mixture still left around return values
for each call of Repository.get_rev_prop(), however I just found,
but not tested. So I continue to look over the code and I'll try
to fix the problems if they still exist.

Cheers,
-- 
Yasuhito FUTATSUKI <fu...@yf.bsclub.org>

Re: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Dec 15, 2020 at 03:51:03AM +0900, Yasuhito FUTATSUKI wrote:
> First of all, thank you for doing this. I worried that mailer.py didn't
> support Python 3 for a long time.
> 
> In message <20...@svn01-us-east.apache.org>, stsp@apache.o
> rg writes:
> >Author: stsp
> >Date: Mon Dec 14 16:57:10 2020
> >New Revision: 1884427
> >
> >URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
> >Log:
> >Make mailer.py work properly with Python 3, and drop Python 2 support.
> >
> >Most of the changes deal with the handling binary data vs Python strings.
> >
> >I've made sure that mailer.py will work in a UTF-8 environment. In general,
> >UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
> >Environments using other encodings may not work as expected, but those will
> >be problematic for hook scripts in general.
> 
> Perhaps the problem on locale other than UTF-8 is only on iterpretation of
> REPOS-PATH argument. Other paths in the script are all Subversion's
> internal paths.

Yes, I agree that the REPOS-PATH argument is an exceptional case.
Unforunately, we have to decode repos_dir because it is used with the
os.path.join() function.

There is no way of knowing which character encoding is used for paths.
At least on UNIX a filename is just a string of bytes and it is independent
of the locale.

Note that hook scripts will run in an empty environment by default.
Which means this script will run in the "C" locale by default.
If another locale is desired, the administrator needs to configure the
server as documented here:
https://subversion.apache.org/docs/release-notes/1.8.html#hooks-env

My assumption is that for hook scripts like this one, the server will be
configured with SVNUseUTF8 and the hook script environment will contain
something like LANG=C.UTF-8.

If svnadmin is then also used in the UTF-8 locale to create repositories,
the hook script should work fine.

> >data such as paths in UTF-8. Our Python3 bindings do not deal with encoding
> >or decoding of such data, and thus need to work with raw UTF-8 strings, not
> >Python strings.
> 
> I have no objection the code itself, however it is a bit incorrect.
> 
> Our Python3 bindings accept str objects as inputs for char *. It always
> convert them to UTF-8 C strings on all APIs without regarding preferred
> encoding or file system encoding on Python.  As some of APIs accept
> local paths and they should be encoded as locale's charset/encoding,
> I also prefer to encode to bytes on caller side explicitly before call
> APIs, to avoid making bugs.  Of course, return values of wrapper APIs
> corresponding char ** args and return value of C API are not decoded,
> raw bytes objects. 

Thank you, I did not know this. The patch below undoes changes which I
made in order to encode input to SVN APIs as UTF-8.

The mailer.py test is still passing fine with this change.

Index: tools/hook-scripts/mailer/mailer.py
===================================================================
--- tools/hook-scripts/mailer/mailer.py	(revision 1884427)
+++ tools/hook-scripts/mailer/mailer.py	(working copy)
@@ -624,8 +624,7 @@ class Lock(Messenger):
 
     # The lock comment is the same for all paths, so we can just pull
     # the comment for the first path in the dirlist and cache it.
-    self.lock = svn.fs.svn_fs_get_lock(self.repos.fs_ptr,
-                                       self.dirlist[0].encode('utf-8'),
+    self.lock = svn.fs.svn_fs_get_lock(self.repos.fs_ptr, self.dirlist[0],
                                        self.pool)
 
   def generate(self):
@@ -910,7 +909,7 @@ class DiffGenerator:
             kind = 'C'
             if self.diffsels.copy:
               diff = svn.fs.FileDiff(None, None, self.repos.root_this,
-                                     change.path.decode('utf-8'), self.pool)
+                                     change.path, self.pool)
               label1 = '/dev/null\t00:00:00 1970\t' \
                        '(empty, because file is newly added)'
               label2 = '%s\t%s\t(r%s, copy of r%s, %s)' \
Index: tools/hook-scripts/mailer/tests/mailer-tweak.py
===================================================================
--- tools/hook-scripts/mailer/tests/mailer-tweak.py	(revision 1884427)
+++ tools/hook-scripts/mailer/tests/mailer-tweak.py	(working copy)
@@ -53,7 +53,7 @@ def tweak_dates(pool, home='.'):
     date = core.svn_time_to_cstring((DATE_BASE+i*DATE_INCR) * 1000000, pool)
     #print date
     fs.change_rev_prop(fsob, i+1, core.SVN_PROP_REVISION_DATE, date, pool)
-    fs.change_rev_prop(fsob, i+1, core.SVN_PROP_REVISION_AUTHOR, b'mailer test', pool)
+    fs.change_rev_prop(fsob, i+1, core.SVN_PROP_REVISION_AUTHOR, 'mailer test', pool)
 
 def main():
   if len(sys.argv) != 2: