You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@subversion.apache.org by Daniel Shahaf <d....@daniel.shahaf.name> on 2020/12/21 07:15:53 UTC

mailer.py py2/py3 change: non-UTF-8 environments (was: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py)

stsp@apache.org wrote on Mon, 14 Dec 2020 16:57 -0000:
> URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
> Log:
> Make mailer.py work properly with Python 3, and drop Python 2 support.
> 
> Most of the changes deal with the handling binary data vs Python strings.
> 
> I've made sure that mailer.py will work in a UTF-8 environment. In general,
> UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
> Environments using other encodings may not work as expected, but those will
> be problematic for hook scripts in general.

Correct me if I'm wrong, but it sounds like you haven't ruled out the
possibility that this commit will constitute a regression for anyone
who runs mailer.py in a non-UTF-8 environment and will upgrade to this
commit.

I suppose it's fair to classify non-UTF-8 environments as "patches
welcome", following the precedent of libmagic support in the Windows
build, but:

1. Can we detect non-UTF-8 environments and warn or error out hard upon
them?  «locale.getlocale()[1]» seems promising?

2. A change that hasn't been confirmed *not* to constitute a regression
merits a release notes entry.  Would you do the honours?

Cheers,

Daniel

>                                            SVN repositories store internal
> data such as paths in UTF-8. Our Python3 bindings do not deal with encoding
> or decoding of such data, and thus need to work with raw UTF-8 strings, not
> Python strings.
> 
> The encoding of file and property contents is not guaranteed to be UTF-8.
> This was already a problem before this change. This hook script sends email
> with a content type header specifying the UTF-8 encoding. Diffs which contain
> non-UTF-8 text will most likely not render properly when viewed in an email
> reader. At least this problem is now obvious in mailer.py's implementation,
> since all unidiff text is now written out directly as binary data.
> 
> As an additional fix, iterate file groups in sorted order. This results in
> stable output and makes test cases in our tests/ subdirectory reproducible.
> 
> Tested with Python 3.7.5 which is the version I use in my SVN development
> setup at present. Tests with newer versions are welcome.
> 
> * tools/hook-scripts/mailer/mailer.py:
>   Drop Python2-specific includes. Adjust includes as per 2to3.
>   (main): Decode arguments from UTF-8 to string.
>   (OutputBase:write): Encode string to UTF-8 and pass to write_binary().
>    OutputBase implementations now need to provide a self.write_binary
>    member which implements a write() method for binary data.
>   (MailedOutput): email.Header package is gone, use email.header instead,
>    and likewise replace use of email.Utils with email.utils
>   (SMTPOutput): Provide self.write_binary in terms of a BytesIO() object.
>    We cannot use StringIO since diffs may contain data in arbitrary encodings.
>   (StandardOutput): Provide self.write_binary in terms of stdout.buffer.
>   (PipeOutput): Provide self.write_binary in terms of pipe.stdin.
>   (Commit): Decode log message and paths from UTF-8 to string, and iterate
>     path groups from mailer.conf in sorted order.
>   (Lock): Decode directory entries from UTF-8 to string. Encode paths back
>    to UTF-8 when we ask libsvn_fs for a lock on a path.
>     Iterate path groups from mailer.conf in sorted order.
>   (DiffGenerator): Decode repository paths from UTF-8 to string.
>   (TextCommitRenderer): Decode author, log message, and path from UTF-8 to
>     string. Write diff data via write_binary, bypassing the re-encoding step.
>   (Config): Decode paths from UTF-8 to string before matching them against
>    regular expressions. Also decode the repository directory path from UTF-8.
> 
> * tools/hook-scripts/mailer/tests/mailer-t1.output: Adjust expected output.
>    File groups are now provided in stable sorted order. This should fix
>    spurious test failures in the future.
> 
>  * tools/hook-scripts/mailer/tests/mailer-tweak.py: Drop L suffix from long
>     integers and pass binary data instead of strings into libsvn_fs.

Re: mailer.py py2/py3 change: non-UTF-8 environments (was: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py)

Posted by Stefan Sperling <st...@apache.org>.
On Mon, Dec 21, 2020 at 07:15:53AM +0000, Daniel Shahaf wrote:
> stsp@apache.org wrote on Mon, 14 Dec 2020 16:57 -0000:
> > URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
> > Log:
> > Make mailer.py work properly with Python 3, and drop Python 2 support.
> > 
> > Most of the changes deal with the handling binary data vs Python strings.
> > 
> > I've made sure that mailer.py will work in a UTF-8 environment. In general,
> > UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
> > Environments using other encodings may not work as expected, but those will
> > be problematic for hook scripts in general.
> 
> Correct me if I'm wrong, but it sounds like you haven't ruled out the
> possibility that this commit will constitute a regression for anyone
> who runs mailer.py in a non-UTF-8 environment and will upgrade to this
> commit.

Anyone who didn't explicitly set a locale in hook-envs will run this
script in the C locale today, and this case still works.

I would not count broken email generated in non-UTF-8 locales as a regression.
mailer.py's email content encoding header was set to UTF-8 17 years ago:
https://svn.apache.org/r847862

> I suppose it's fair to classify non-UTF-8 environments as "patches
> welcome", following the precedent of libmagic support in the Windows
> build, but:
> 
> 1. Can we detect non-UTF-8 environments and warn or error out hard upon
> them?  «locale.getlocale()[1]» seems promising?

I don't see what value this adds. When there is an encoding problem it
will already be obvious from gargabe characters in email notifications.
And many people can live with that, so there is no reason to error out.

Given that file diffs can contain an arbitrary mix of encodings, this script
cannot guarantee readable diff output unless ASCII/UTF-8 encoding is used
for all files. I don't see a good way around this limitation. Auto-detecting
file encoding is tricky, and requring people to tag file encodings via the
svn:mime-type property is simply not going to work in pratice.

> 2. A change that hasn't been confirmed *not* to constitute a regression
> merits a release notes entry.  Would you do the honours?

I think release notes should mention Python3 support for mailer.py and
recommend that mailer.py is run in the UTF-8 or C locale.

Re: mailer.py py2/py3 change: non-UTF-8 environments (was: svn commit: r1884427 - in /subversion/trunk/tools/hook-scripts/mailer: mailer.py tests/mailer-t1.output tests/mailer-tweak.py)

Posted by Stefan Sperling <st...@apache.org>.
On Mon, Dec 21, 2020 at 07:15:53AM +0000, Daniel Shahaf wrote:
> stsp@apache.org wrote on Mon, 14 Dec 2020 16:57 -0000:
> > URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
> > Log:
> > Make mailer.py work properly with Python 3, and drop Python 2 support.
> > 
> > Most of the changes deal with the handling binary data vs Python strings.
> > 
> > I've made sure that mailer.py will work in a UTF-8 environment. In general,
> > UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
> > Environments using other encodings may not work as expected, but those will
> > be problematic for hook scripts in general.
> 
> Correct me if I'm wrong, but it sounds like you haven't ruled out the
> possibility that this commit will constitute a regression for anyone
> who runs mailer.py in a non-UTF-8 environment and will upgrade to this
> commit.

Anyone who didn't explicitly set a locale in hook-envs will run this
script in the C locale today, and this case still works.

I would not count broken email generated in non-UTF-8 locales as a regression.
mailer.py's email content encoding header was set to UTF-8 17 years ago:
https://svn.apache.org/r847862

> I suppose it's fair to classify non-UTF-8 environments as "patches
> welcome", following the precedent of libmagic support in the Windows
> build, but:
> 
> 1. Can we detect non-UTF-8 environments and warn or error out hard upon
> them?  «locale.getlocale()[1]» seems promising?

I don't see what value this adds. When there is an encoding problem it
will already be obvious from gargabe characters in email notifications.
And many people can live with that, so there is no reason to error out.

Given that file diffs can contain an arbitrary mix of encodings, this script
cannot guarantee readable diff output unless ASCII/UTF-8 encoding is used
for all files. I don't see a good way around this limitation. Auto-detecting
file encoding is tricky, and requring people to tag file encodings via the
svn:mime-type property is simply not going to work in pratice.

> 2. A change that hasn't been confirmed *not* to constitute a regression
> merits a release notes entry.  Would you do the honours?

I think release notes should mention Python3 support for mailer.py and
recommend that mailer.py is run in the UTF-8 or C locale.

Re: mailer.py py2/py3 change: non-UTF-8 environments

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Yasuhito FUTATSUKI wrote on Wed, Dec 23, 2020 at 03:13:42 +0900:
> On 2020/12/21 16:15, Daniel Shahaf wrote:
> > stsp@apache.org wrote on Mon, 14 Dec 2020 16:57 -0000:
> >> URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
> >> Log:
> >> Make mailer.py work properly with Python 3, and drop Python 2 support.
> >>
> >> Most of the changes deal with the handling binary data vs Python strings.
> >>
> >> I've made sure that mailer.py will work in a UTF-8 environment. In general,
> >> UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
> >> Environments using other encodings may not work as expected, but those will
> >> be problematic for hook scripts in general.
> > 
> > Correct me if I'm wrong, but it sounds like you haven't ruled out the
> > possibility that this commit will constitute a regression for anyone
> > who runs mailer.py in a non-UTF-8 environment and will upgrade to this
> > commit.
> 
> It is not for the commit message, but for the committed code, it rather
> supports non-UTF-8 (or C) locale, at least the handling of REPOS-PATH
> (as I wrote another thread).
> 

*nod*, but that does leave outstanding the case of non-UTF-8 locale and data
other than REPOS-PATH.

> It seems the code before commit for Python 2 could not handle REPOS-PATH
> if it contains non ascii character on non-UTF-8 environment because it
> didn't care that svn.core.svn_path_canonicalize and svn.repos.open only
> accept bytes object of UTF-8 path. Even if a user might pass REPOS-PATH
> as transcoded to UTF-8, the script searches for a default config file 
> under the transcoded REPOS-PATH and it is not desirable behavior.
> 
> For Python 2/3 support, I don't want to drop Python 2 support for mailer.py
> if we can easily, but I also think Python 3 support is higher priority.

+1

> And as also I wrote another thread, it seems still incomplete to support
> Python 3. I'll take a look this, but it has not made progress at all
> since I then, sorry.

No worries!  I started a separate thread since I thought the issues were independent.

Thanks,

Daniel

Re: mailer.py py2/py3 change: non-UTF-8 environments

Posted by Yasuhito FUTATSUKI <fu...@yf.bsdclub.org>.
On 2020/12/21 16:15, Daniel Shahaf wrote:
> stsp@apache.org wrote on Mon, 14 Dec 2020 16:57 -0000:
>> URL: http://svn.apache.org/viewvc?rev=1884427&view=rev
>> Log:
>> Make mailer.py work properly with Python 3, and drop Python 2 support.
>>
>> Most of the changes deal with the handling binary data vs Python strings.
>>
>> I've made sure that mailer.py will work in a UTF-8 environment. In general,
>> UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn option.
>> Environments using other encodings may not work as expected, but those will
>> be problematic for hook scripts in general.
> 
> Correct me if I'm wrong, but it sounds like you haven't ruled out the
> possibility that this commit will constitute a regression for anyone
> who runs mailer.py in a non-UTF-8 environment and will upgrade to this
> commit.

It is not for the commit message, but for the committed code, it rather
supports non-UTF-8 (or C) locale, at least the handling of REPOS-PATH
(as I wrote another thread).

It seems the code before commit for Python 2 could not handle REPOS-PATH
if it contains non ascii character on non-UTF-8 environment because it
didn't care that svn.core.svn_path_canonicalize and svn.repos.open only
accept bytes object of UTF-8 path. Even if a user might pass REPOS-PATH
as transcoded to UTF-8, the script searches for a default config file 
under the transcoded REPOS-PATH and it is not desirable behavior.

For Python 2/3 support, I don't want to drop Python 2 support for mailer.py
if we can easily, but I also think Python 3 support is higher priority.

And as also I wrote another thread, it seems still incomplete to support
Python 3. I'll take a look this, but it has not made progress at all
since I then, sorry.

Cheers,
-- 
Yasuhito FUTATSUKI <fu...@yf.bsclub.org>