Posted to dev@subversion.apache.org by Arfrever Frehtes Taifersar Arahesis <Ar...@GMail.Com> on 2009/03/31 12:20:13 UTC

[RFC] str versus bytes in subversion/tests/cmdline

Python 3 contains major changes in handling of strings.
http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

str type was renamed to bytes type. ("string" -> b"string")
unicode type was renamed to str type. (u"string" -> "string")

I will use the Python 3 names of these types in this e-mail.

In Python 2:
>>> "abc" == u"abc"
True
>>>

In Python 3:
>>> b"abc" == "abc"
False
>>>

Explicit encoding / decoding between these types is now required.

bytes.decode() returns str.
str.encode() returns bytes.

(bytes type doesn't support encode(). str type doesn't support decode().)
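
As a rough illustration of the explicit conversions (a minimal sketch for Python 3 only; the variable names are made up for this example):

```python
# Python 3 semantics: text (str) and binary data (bytes) are distinct types
# and must be converted explicitly.
text = "abc"
data = text.encode("utf-8")         # str -> bytes
round_trip = data.decode("utf-8")   # bytes -> str

# In Python 3 the "wrong direction" methods do not exist at all.
bytes_has_encode = hasattr(b"abc", "encode")
str_has_decode = hasattr("abc", "decode")
```

Under Python 2 both hasattr() checks would report True, because both methods exist on both string types there.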

subversion/tests/cmdline tests use subprocess.Popen to obtain
the output of all commands and to send the input to them.
subprocess.Popen.{stdin,stdout,stderr} support only bytes type.
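
A small sketch of what this looks like in practice under Python 3. This is not the svntest code: sys.executable stands in for the svn binary, and the child command is invented purely for the demonstration.

```python
import subprocess
import sys

# Run a child Python process that writes a fixed string to stdout.
# sys.executable is a stand-in for the real svn binary (assumption).
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write('hello')"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()   # both values are bytes under Python 3

# The caller has to decode explicitly to get str.
decoded = out.decode("utf-8")
```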

Encoding / decoding doesn't work with invalid UTF-8 characters.

merge_tests.py 4 ("some simple property merges") test sets some
properties with invalid UTF-8 characters and later checks the output of svn.
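
To show the failure concretely, here is a minimal sketch under Python 3. The byte string is an invented example, not the actual property value used by merge_tests.py 4.

```python
# 0xFF can never appear in well-formed UTF-8, so strict decoding fails.
raw = b"valid \xff invalid"

try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Lossy decoding is possible, but the original byte is gone afterwards:
# it is replaced by U+FFFD, so the output can no longer be compared exactly.
lossy = raw.decode("utf-8", errors="replace")
```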

This problem has 2 solutions:

1. Internally store the output of commands in bytes type, perform some
encodings/decodings and convert a huge number of strings to bytes type
(i.e. "string" -> b"string" in source code).

Invalid UTF-8 characters would still be supported by
subversion/tests/cmdline/svntest.

See the attached, unfinished patch (subversion-svntest-python-3.patch) for
the "python-3-compatibility" branch which makes basic_tests.py 1 ("basic
checkout of a wc") test pass with both Python 2.6 and Python 3.0!

2. Internally store the output of commands in str type, decode the output
of commands immediately after obtaining it from subprocess.Popen, convert
a significantly smaller number of strings to bytes type and *ban invalid
UTF-8 characters* in subversion/tests/cmdline.

In this case merge_tests.py 4 test would have to be changed to no longer
set invalid UTF-8 characters in some properties.

See the attached patch (subversion-svntest-decode_subprocess_output.patch)
for trunk which implements decoding of output of commands.

--
Arfrever Frehtes Taifersar Arahesis

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1495347

Re: [RFC] str versus bytes in subversion/tests/cmdline

Posted by Greg Stein <gs...@gmail.com>.
On Fri, Apr 3, 2009 at 15:04, Arfrever Frehtes Taifersar Arahesis
<ar...@gmail.com> wrote:
> 2009-04-01 14:20 Greg Stein <gs...@gmail.com> napisał(a):
>> On Wed, Apr 1, 2009 at 15:04, Arfrever Frehtes Taifersar Arahesis
>> <Ar...@gmail.com> wrote:
>>>...
>>>> 2. Internally store the output of commands in str type, decode output
>>>> of commands quickly after obtaining it from subprocess.Popen, convert
>>>> significantly smaller number of strings to bytes type and *ban invalid UTF-8
>>>> characters* in subversion/tests/cmdline.
>>>>
>>>> In this case merge_tests.py 4 test would have to be changed to no longer
>>>> set invalid UTF-8 characters in some properties.
>>>>
>>>> See the attached patch (subversion-svntest-decode_subprocess_output.patch)
>>>> for trunk which implements decoding of output of commands.
>>>
>>> I have decided to implement the improved version of the second
>>> solution. svntest will try to store output of commands in str type,
>>> but will use bytes type for strings with invalid UTF-8 characters.
>>> bytes type will have to be used also when writing to files opened in
>>> binary mode.
>>
>> "bytes type" ?? there is no b"foo" syntax in 2.4, so I don't even know
>> how you're going to start on this.
>>
>>> subversion/tests/cmdline/svntest/wc.py:StateItem.tweak() will contain
>>> workaround for merge_tests.py 4. The properties set by merge_tests.py
>>> 4 (simple_property_merges()) will have bytes type. The expected output
>>> of error message with property values with invalid UTF-8 characters
>>> will depend on Python version.
>>>
>>> See the attached, unfinished patch (subversion-svntest-python-3-v2.patch).
>>
>> It relies on the b"foo" syntax, so how could this be applied?
>
> This part will be applied only on the "python-3-compatibility" branch.

Ah. Fair enough!

> (Python 2.6 supports b"string". In this version b"string" == "string".)
> Changes to function attributes (e.g. func_name -> __name__) will be
> applied also on this branch.
>
> The following changes will be applied on trunk:
>  - os.tempnam() -> tempfile.mkstemp()
>  - file() -> open()
>  - Addition of some calls to variable.encode() / variable.decode()
>  - sys.path.append(os.path.dirname(__file__))
>   It's a workaround for changed behavior of 'import' statement.
>   Changes generated by 2to3 seem to not work.
>  - try:\n  bytes\nexcept NameError:\n  bytes = str
>  - Some other improvements
>
> Workaround for problems (pristine_url) caused by global variables and
> circular imports won't be applied anywhere. Support for the --verbose
> option is still broken.
>
> Anyway, with the newest version of that patch, all tests pass with
> Python 2.6 and Python 3.0!

Cool. Cleanups in there are most welcome! I've been spending some time
in there recently :-( ... and have noticed some pretty ugly stuff. At
some point, I'll take a break from wc-ng again and keep cleaning that
up. Not that it *needs* it (non-core code doesn't have to be Perfect),
but more to demonstrate patterns and set precedent for when people
write more tests. I also have plans for all those globals which should
help to reduce some of the circular dependencies and whatnot.

The problem is in making serious architectural changes. There is just
*so much* code to change.

Cheers,
-g

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1532176


Re: [RFC] str versus bytes in subversion/tests/cmdline

Posted by Arfrever Frehtes Taifersar Arahesis <Ar...@GMail.Com>.
2009-04-01 14:20 Greg Stein <gs...@gmail.com> napisał(a):
> On Wed, Apr 1, 2009 at 15:04, Arfrever Frehtes Taifersar Arahesis
> <Ar...@gmail.com> wrote:
>>...
>>> 2. Internally store the output of commands in str type, decode output
>>> of commands quickly after obtaining it from subprocess.Popen, convert
>>> significantly smaller number of strings to bytes type and *ban invalid UTF-8
>>> characters* in subversion/tests/cmdline.
>>>
>>> In this case merge_tests.py 4 test would have to be changed to no longer
>>> set invalid UTF-8 characters in some properties.
>>>
>>> See the attached patch (subversion-svntest-decode_subprocess_output.patch)
>>> for trunk which implements decoding of output of commands.
>>
>> I have decided to implement the improved version of the second
>> solution. svntest will try to store output of commands in str type,
>> but will use bytes type for strings with invalid UTF-8 characters.
>> bytes type will have to be used also when writing to files opened in
>> binary mode.
>
> "bytes type" ?? there is no b"foo" syntax in 2.4, so I don't even know
> how you're going to start on this.
>
>> subversion/tests/cmdline/svntest/wc.py:StateItem.tweak() will contain
>> workaround for merge_tests.py 4. The properties set by merge_tests.py
>> 4 (simple_property_merges()) will have bytes type. The expected output
>> of error message with property values with invalid UTF-8 characters
>> will depend on Python version.
>>
>> See the attached, unfinished patch (subversion-svntest-python-3-v2.patch).
>
> It relies on the b"foo" syntax, so how could this be applied?

This part will be applied only on the "python-3-compatibility" branch.
(Python 2.6 supports b"string". In this version b"string" == "string".)
Changes to function attributes (e.g. func_name -> __name__) will be
applied also on this branch.

The following changes will be applied on trunk:
 - os.tempnam() -> tempfile.mkstemp()
 - file() -> open()
 - Addition of some calls to variable.encode() / variable.decode()
 - sys.path.append(os.path.dirname(__file__))
   It's a workaround for changed behavior of 'import' statement.
   Changes generated by 2to3 seem to not work.
 - try:\n  bytes\nexcept NameError:\n  bytes = str
 - Some other improvements
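
A rough sketch of how a few of these items can look together in code that runs under both Python 2 and Python 3. It is not taken from the patch; the "svntest-" prefix and the file contents are made up for illustration.

```python
import os
import tempfile

# Fallback for interpreters that lack the bytes builtin (older Python 2).
try:
    bytes
except NameError:
    bytes = str

# tempfile.mkstemp() creates the file and returns (fd, path), unlike
# os.tempnam(), which only returned a name and no longer exists in Python 3.
fd, path = tempfile.mkstemp(prefix="svntest-")
try:
    with os.fdopen(fd, "wb") as f:       # binary mode, so write bytes
        f.write("contents\n".encode("utf-8"))
    with open(path, "rb") as f:          # open() instead of file()
        read_back = f.read()
finally:
    os.remove(path)
```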

Workaround for problems (pristine_url) caused by global variables and
circular imports won't be applied anywhere. Support for the --verbose
option is still broken.

Anyway, with the newest version of that patch, all tests pass with
Python 2.6 and Python 3.0!

--
Arfrever Frehtes Taifersar Arahesis

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1532136


Re: [RFC] str versus bytes in subversion/tests/cmdline

Posted by Greg Stein <gs...@gmail.com>.
On Wed, Apr 1, 2009 at 15:04, Arfrever Frehtes Taifersar Arahesis
<Ar...@gmail.com> wrote:
>...
>> 2. Internally store the output of commands in str type, decode output
>> of commands quickly after obtaining it from subprocess.Popen, convert
>> significantly smaller number of strings to bytes type and *ban invalid UTF-8
>> characters* in subversion/tests/cmdline.
>>
>> In this case merge_tests.py 4 test would have to be changed to no longer
>> set invalid UTF-8 characters in some properties.
>>
>> See the attached patch (subversion-svntest-decode_subprocess_output.patch)
>> for trunk which implements decoding of output of commands.
>
> I have decided to implement the improved version of the second
> solution. svntest will try to store output of commands in str type,
> but will use bytes type for strings with invalid UTF-8 characters.
> bytes type will have to be used also when writing to files opened in
> binary mode.

"bytes type" ?? there is no b"foo" syntax in 2.4, so I don't even know
how you're going to start on this.

> subversion/tests/cmdline/svntest/wc.py:StateItem.tweak() will contain
> workaround for merge_tests.py 4. The properties set by merge_tests.py
> 4 (simple_property_merges()) will have bytes type. The expected output
> of error message with property values with invalid UTF-8 characters
> will depend on Python version.
>
> See the attached, unfinished patch (subversion-svntest-python-3-v2.patch).

It relies on the b"foo" syntax, so how could this be applied?

Cheers,
-g

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1507836

Re: [RFC] str versus bytes in subversion/tests/cmdline

Posted by Arfrever Frehtes Taifersar Arahesis <Ar...@GMail.Com>.
2009-03-31 14:20 Arfrever Frehtes Taifersar Arahesis
<ar...@gmail.com> napisał(a):
> Python 3 contains major changes in handling of strings.
> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>
> str type was renamed to bytes type. ("string" -> b"string")
> unicode type was renamed to str type. (u"string" -> "string")
>
> I will use Python 3 names of these types in present e-mail.
>
> In Python 2:
>>>> "abc" == u"abc"
> True
>>>>
>
> In Python 3:
>>>> b"abc" == "abc"
> False
>>>>
>
> Explicit encoding / decoding between these types is now required.
>
> bytes.decode() returns str.
> str.encode() returns bytes.
>
> (bytes type doesn't support encode(). str type doesn't support decode().)
>
> subversion/tests/cmdline tests use subprocess.Popen to obtain
> the output of all commands and to send the input to them.
> subprocess.Popen.{stdin,stdout,stderr} support only bytes type.
>
> Encoding / decoding doesn't work with invalid UTF-8 characters.
>
> merge_tests.py 4 ("some simple property merges") test sets some
> properties with invalid UTF-8 characters and later checks the output of svn.
>
> This problem has 2 solutions:
>
> 1. Internally store the output of commands in bytes type, perform some
> encodings/decodings and convert huge number of strings to bytes type
> (i.e. "string" -> b"string" in source code).
>
> Invalid UTF-8 characters would be still supported by
> subversion/tests/cmdline/svntest.
>
> See the attached, unfinished patch (subversion-svntest-python-3.patch) for
> the "python-3-compatibility" branch which makes basic_tests.py 1 ("basic
> checkout of a wc") test pass with both Python 2.6 and Python 3.0!
>
> 2. Internally store the output of commands in str type, decode output
> of commands quickly after obtaining it from subprocess.Popen, convert
> significantly smaller number of strings to bytes type and *ban invalid UTF-8
> characters* in subversion/tests/cmdline.
>
> In this case merge_tests.py 4 test would have to be changed to no longer
> set invalid UTF-8 characters in some properties.
>
> See the attached patch (subversion-svntest-decode_subprocess_output.patch)
> for trunk which implements decoding of output of commands.

I have decided to implement an improved version of the second
solution. svntest will try to store the output of commands in str type,
but will use bytes type for strings with invalid UTF-8 characters.
bytes type will also have to be used when writing to files opened in
binary mode.
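
The idea can be sketched roughly as follows. This is only an illustration for Python 3, not the code from the patch; decode_if_possible() and the sample lines are hypothetical.

```python
def decode_if_possible(data, encoding="utf-8"):
    """Return str when data is valid in `encoding`, else the original bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data


# Invented sample output: one well-formed line, one with an invalid byte.
lines = [b"A    iota\n", b"bad \xff value\n"]
converted = [decode_if_possible(line) for line in lines]
```

The price of this approach is that callers may receive either type, so any comparison against expected output has to use the matching type for the affected lines.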

subversion/tests/cmdline/svntest/wc.py:StateItem.tweak() will contain
workaround for merge_tests.py 4. The properties set by merge_tests.py
4 (simple_property_merges()) will have bytes type. The expected output
of error message with property values with invalid UTF-8 characters
will depend on Python version.

See the attached, unfinished patch (subversion-svntest-python-3-v2.patch).

Summary of test results with Python 3:
  994 tests PASSED
  24 tests SKIPPED
  28 tests XFAILED (1 WORK-IN-PROGRESS)
  43 tests FAILED

23 tests fail due to os.tempnam().

--
Arfrever Frehtes Taifersar Arahesis

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1506819

Re: [RFC] str versus bytes in subversion/tests/cmdline

Posted by Arfrever Frehtes Taifersar Arahesis <Ar...@GMail.Com>.
2009-03-31 14:20 Arfrever Frehtes Taifersar Arahesis
<ar...@gmail.com> napisał(a):
> Python 3 contains major changes in handling of strings.
> http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
>
> str type was renamed to bytes type. ("string" -> b"string")
> unicode type was renamed to str type. (u"string" -> "string")
>
> I will use Python 3 names of these types in present e-mail.
>
> In Python 2:
>>>> "abc" == u"abc"
> True
>>>>
>
> In Python 3:
>>>> b"abc" == "abc"
> False
>>>>
>
> Explicit encoding / decoding between these types is now required.
>
> bytes.decode() returns str.
> str.encode() returns bytes.
>
> (bytes type doesn't support encode(). str type doesn't support decode().)
>
> subversion/tests/cmdline tests use subprocess.Popen to obtain
> the output of all commands and to send the input to them.
> subprocess.Popen.{stdin,stdout,stderr} support only bytes type.
>
> Encoding / decoding doesn't work with invalid UTF-8 characters.
>
> merge_tests.py 4 ("some simple property merges") test sets some
> properties with invalid UTF-8 characters and later checks the output of svn.
>
> This problem has 2 solutions:
>
> 1. Internally store the output of commands in bytes type, perform some
> encodings/decodings and convert huge number of strings to bytes type
> (i.e. "string" -> b"string" in source code).
>
> Invalid UTF-8 characters would be still supported by
> subversion/tests/cmdline/svntest.
>
> See the attached, unfinished patch (subversion-svntest-python-3.patch) for
> the "python-3-compatibility" branch which makes basic_tests.py 1 ("basic
> checkout of a wc") test pass with both Python 2.6 and Python 3.0!

I forgot to say that, with this patch applied, paths are stored in str type.
The majority of other variables (e.g. values of properties) are stored
in bytes type.

(I noticed that this patch contains unrelated changes to
tools/po/l10n-report.py.)

I would like to mention that Python 2 is very tolerant when it comes to
encoding / decoding (e.g. it allows calling str.encode() or
unicode.decode()), so the changes related to encoding / decoding will
be merged to trunk.

PS. I would like to thank Arfrever for implementing __eq__() and __ne__() so
that comparison of instances of ExpectedOutput works at all.

--
Arfrever Frehtes Taifersar Arahesis

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1495758