You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Frederik Bode <fr...@ml6.eu> on 2019/07/15 09:30:19 UTC

[PROPOSAL] Standardize Gradle structure in Python SDK

Hi Mark & others,

+1 on using this structure. I don't see any other alternative to gradle as
some of the Python tasks have Java tasks as
a dependency. You can't debug that using just `python nosetests... or
tox`.  Parallelizing such tasks requires different
projects (and I don't think gradle supports multiple projects per
directory), so for each python
version we need a different folder. Having seperate build.gradle files for
each python version would also enable the
different versions to diverge on which tests they execute (e.g. not running
some tests for python 3.5 and 3.6 to
reduce Jenkins footprint in a PreCommit).

Valentyn, the code duplication problem can be addressed using a Gradle
script plugin , which
is a build.script (maybe another name is better?)  in
/sdks/python/test-suites/[runner], which you then import in
/sdks/python/test-suites/[runner]/pyXX/build.gradle with the
correct pythonVersion set, and then using `apply from`. See [1] for an
example. The location of
the python2 test-suite can be remedied by moving it to your suggestion. +1
on that as well.

On the coupling of projects or the structure not being natural, I think
that we can look at
this differently. Right now, in the Python SDK, all common code for tests
that needs to be parallelized is
in placed in BeamModulePlugin, which in turn couples all projects that use
it. It's one centralized
item that couples many different projects. It has code from Java, Python
and Go, and
currently has almost 2000 lines of code, which is IMHO not the way to go.
Moving the python code
from this binary plugin to a script plugin file defined in the the parent
directory of children
projects that use that code (as described in the paragraph above) moves the
coupling of a lot of projects through
BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local
coupling).
In short, yes it might be unnatural, but it's still better than before.

[1] https://github.com/apache/beam/pull/8877
<https://github.com/apache/beam/pull/8877/files>

Thanks,
Frederik Bode

On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Hey Mark & others,
>
> We've been following the structure proposed in this thread to extend test
> coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see [1].
>
> This structure allowed us to add 3.x suites without slowing down the
> pre/postcommit execution time. We can actually see a drop in precommit
> latency [2] around March 23 we first made some Python 3.x suites run in
> parallel, and we have added more suites since then without slowing down
> pre/postcommits. Therefore I am in favor of this proposal, especially since
> AFAIK we don't have better one. Thanks a lot!
>
> I do have some feedback on this proposal:
>
> 1. There is a duplication of gradle code between test suites for different
> python minor versions, for example see the identical definition of
> DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].
>
> Possible solution to reduce the duplication is to move common code that
> defines a task into a separate groovy file shared across multiple gradle
> files. We have an example of this, where enablePythonPerformanceTest() is
> defined in BeamModulePlugin.groovy, and used in several build.gradle files
> to create a gradle task required for performance tests, see: [6]. I
> followed the same example in a Python 3 test suite for Portable Flink
> Runner I am working on [3], however I am not sure if BeamModulePlugin is
> the best place to define common gradle tasks to needed for Python CI.
> Perhaps we can make a separate groovy file for this purpose in
> sdk/python/test-suites?
>

I would suggest placing the shared code in a new file in
https://github.com/apache/beam/tree/master/buildSrc/src/main/groovy/org/apache/beam/gradle,
we have several other groovy files related to building defined there
already.


> 2. Python 3 test suites currently live in sdks/python/test-suites, while
> most Python 2 suites are still defined in sdks/python/build.gradle.
>
> This may cause confusion for folks working on adding new Python suites. If
> there is an overall agreement on proposed structure I suggest to  start
> moving Python 2 CI tasks out of  sdks/python/build.gradle into
> sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
> file. If there are better alternatives we can continue discussing them here.
>

For the runner specific Java ITs, we have been trying to get the tests to
be placed inside the runners own directory instead of underneath the SDK
directories. This was to make it easy to associate test task -> runner. So
an alternative would be to have runners/[runner]/test-suites/py27/... be
that location.


> Thanks,
> Valenyn
>
>
> [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
> [2]
> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736
> [3] https://github.com/apache/beam/pull/8745
> [4]
> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
> [5]
> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
> [6]
> https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest
>
>
> On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote:
>
>> I don't use gradle commands for Python development either, because they
>> are slow (no incremental testing).
>>
>>
>>
>> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <ad...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <ad...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Really like the idea of improving here.
>>>> >
>>>> > Unfortunately, I haven't worked with python on that scale yet, so
>>>> bear with my naive understandings in this regard. If I understand
>>>> correctly, the suggestion will result in a couple of projects consisting
>>>> only of a build,gradle file to kind of workaround on gradles decision not
>>>> to parallelize within projects, right? In consequence, this also kind of
>>>> decouples projects from their content - they stuff which constitutes the
>>>> project - and forces the build file to 'somehow reach out to content of
>>>> other (only python root?) projects. E.g couples projects. This somehow
>>>> 'feels non natural' to me. But, of course, might be the path to go. As I
>>>> said before, never worked on python on that scale.
>>>>
>>>> It feels a bit odd to me as well. Is it possible to have multiple
>>>> projects per directory (e.g. a suite of testing ones) rather than
>>>> having to break things up like this, especially if the goal is
>>>> primarily to get parallel running of tests? Especially if we could
>>>> automatically create the cross-product rather than manually? There
>>>> also seems to be some redundancy with what tox is doing here.
>>>>
>>>
>>> Not sure, whether I understand correctly. But I do not think that's
>>> possible. If we are going to do some cross-product, we are probably better
>>> of doing that on tasks, e.g. by leveraging task rules or programmatically
>>> adding tasks (which is already done in parts). Of course, this will not
>>> help with parallelisation (but might enable that, see below).
>>>
>>>
>>>>
>>>> > But I believe to remember Robert talking about using in project
>>>> parallelisation for his development. Is this something which could also
>>>> work on CI? Of course, that will not help with different python versions,
>>>> but maybe that could be solved also by gradles variants which are
>>>> introduced in 5.3 - definitely need some time to investigate the
>>>> possibilities here. On first sight it feels like lots of duplication to
>>>> create 'builds' for any python version. Or wouldn't that be the case?
>>>> >
>>>> > And another naive thought on my side, isn't that non
>>>> parallelizability also caused by the monolithic setup of the python code
>>>> base? E.g. if I understand correctly, java sdk is split into
>>>> core/runners/ios etc, each encapsulate into full blown projects, i.e.
>>>> buckets of sources, tests and build file. Would it technically possible to
>>>> do something similar with python? I assume that being discussed before and
>>>> teared apart, but couldn't find on mailing list.
>>>>
>>>> Neither the culture nor the tooling of Python supports lots of
>>>> interdependent "sub-packages" for a single project--at least not
>>>> something smaller than one would want to deploy to Pypi. So while one
>>>> could do this, it'd be going against the grain. There are also much
>>>> lower-hanging opportunities for parallelization (e.g. running the test
>>>> suites for separate python versions in parallel).
>>>>
>>>> It's not very natural (as I understand it) with Go either. If we're
>>>> talking directory re-organization, I think it would make sense to
>>>> consider having top-level java, python, go, ... next to model,
>>>> website, etc.
>>>>
>>>
>>> Yes. We shouldn't work against common culture/practices, but try to
>>> embrace native tooling and add support where required.
>>>
>>> To reiterate on parallelisation, there are (at least) three
>>> opportunities:
>>>
>>> 1. Parallelise on test level. For python, this is detox?
>>>
>>
>> This is actually 2 levels. :)
>> 1a. Parallelise at the nosetest level - run unit tests in parallel in a
>> single tox environment. (I have a PR in progress to migrate to pytest, and
>> we should be able to do file-level parallelism provided we solve pickling
>> issues.)
>> 1b. Parallelise at the tox environment level, e..g, somehow running the
>> multiple tox environments (py27,py27-cython,py35,...) in parallel.
>>
>>
>>> 2. Parallelise on Gradle project level (--parallel option)
>>> 3. Parallelise on CI level
>>>
>>> So what I ve done before, if 1. does not help, nor 2. cause project is
>>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I
>>> could imagine right now would be - as suggested above - to split
>>> pythonPreCommit into something like pythonX_YPrecommit, which then runs
>>> those different python versions in parallel. Of course, that could be done
>>> ad infinitum by splitting further into runners, IOs whatever.
>>>
>>> OTOH, we do already have tons of jobs running as we seem to map Gradle
>>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI
>>> parallelization on jenkins pipeline level. Something like creating
>>> pythonPrecommit as a pipeline, which itself runs several steps in parallel.
>>> Did not contemplate on the impacts, though, on PRs and phrase triggering.
>>> Of course, we might need to improve on our 'verbosity' as currently it
>>> seems not always clear what command line actually triggered etc. With this
>>> approach, it seems possible to run tasks within same Gradle project in
>>> parallel with whatever granularity we might want to choose. (Of course we
>>> do add overhead here, but that's probably always the case by doing things
>>> in parallel)
>>>
>>> Of course 3. would not help on parallelisation on developers machine,
>>> but I never had a problem with that.
>>>
>>>
>>>>
>>>> > And as a last thought, will usage of pygradle help with better
>>>> python/gradle integration? Currently, we mainly use gradle to call into
>>>> shell scripts, which doesn't help gradle nor probably pythons tooling to do
>>>> the job very well? But deeper integration might cause problems on python
>>>> dev side, dunno :(
>>>>
>>>> Possibly.
>>>>
>>>> Are there any Python developers that primarily use the gradle
>>>> commands? Personally, I only use them if I'm using Java (or sometimes
>>>> work that is a mix of Java and Python, e.g. the Python-on-Flink
>>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]"
>>>> directly. Gradle primarily has value as a top-level orchestration (in
>>>> particular for CI) and easy way for those who only touch Python
>>>> occasionally to run all the tests. If that's the case, optimizing our
>>>> gradle scripts for CI seems best.
>>>>
>>>
>>> My impression till today is, that nobody is actually using Gradle on
>>> daily python development. So I assume it is currently used on CI only. But
>>> I believe gradle could add value for devs also. We just need to improve our
>>> current integration. And to be very clear, gradle should not replace python
>>> tooling, but support usage in a way which looks familiar to any python dev.
>>> And probably render some shell scripting obsolete.
>>>
>>>
>>>>
>>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <ma...@apache.org> wrote:
>>>> >>
>>>> >> Thank you Ahmet. Answer your questions below:
>>>> >>
>>>> >>>
>>>> >>> - Could you comment on what kind of parallelization we will gain by
>>>> this? In terms of real numbers, how would this affect build and test times?
>>>> >>
>>>> >>
>>>> >> The proposal is based on Gradle parallel execution: "you can force
>>>> Gradle to execute tasks in parallel as long as those tasks are in different
>>>> projects". In Beam, project is declared per build.gradle file and
>>>> registered in settings.gradle. Tasks that are included in single Gradle
>>>> execution will run in parallel only if they are declared in separate
>>>> build.gradle files.
>>>> >>
>>>> >> An example of applying parallel is beam_PreCommit_Python job which
>>>> runs :pythonPreCommit task that contains tasks distributed in 4
>>>> build.gradle. The execution graph looks like
>>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>>> >>
>>>> >> Without this proposal, all tasks will run in sequential which can be
>>>> ~2x longer. If more py36 and py37 tests added in the future, things will be
>>>> even worse.
>>>> >>
>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>> quantify the improvement related to this?
>>>> >>
>>>> >>
>>>> >> The general code complexity of function/method/property may not
>>>> change here since we basically group tasks in a different way without
>>>> changing inside logic. I don't know if there is any tool to measure Gradle
>>>> build complexity. Would love to try if there is.
>>>> >>
>>>> >>>
>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>> Just want to clarify this. In either case, would you need help?
>>>> >>
>>>> >>
>>>> >> Yes, I'd love to take on major refactor works. At the same time,
>>>> I'll create jira for each kind of tests (like flink/protable/hdfs tests) in
>>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone
>>>> interested to this work are welcome to contribute!
>>>> >>
>>>> >> Mark
>>>> >>
>>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com>
>>>> wrote:
>>>> >>>
>>>> >>> This sounds good to me. Thank you for doing this. Few questions:
>>>> >>> - Could you comment on what kind of parallelization we will gain by
>>>> this? In terms of real numbers, how would this affect build and test times?
>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>> quantify the improvement related to this?
>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>> Just want to clarify this. In either case, would you need help?
>>>> >>>
>>>> >>> Thank you,
>>>> >>> Ahmet
>>>> >>>
>>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <ma...@apache.org>
>>>> wrote:
>>>> >>>>
>>>> >>>> Hi Python SDK Developers,
>>>> >>>>
>>>> >>>> You may notice that Gradle files changed a lot recently as
>>>> parallelization applied to Python tests and more python versions were
>>>> enabled in testing. There are tricks over the build scripts and tests are
>>>> grown naturally and distributed under sdks/python, which caused frictions
>>>> (like rollback PR-8059).
>>>> >>>>
>>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works to
>>>> cleanup and standardize Gradle structure in Python SDK. In general, I think
>>>> we want to:
>>>> >>>>
>>>> >>>> - Apply parallel execution
>>>> >>>> - Share common tasks
>>>> >>>> - Centralize test related tasks
>>>> >>>> - Have a clear Gradle structure for projects/tasks
>>>> >>>>
>>>> >>>> This is Gradle directory structure I proposed:
>>>> >>>>
>>>> >>>> sdks/python/
>>>> >>>>
>>>> >>>> build.gradle    --> hold builds, snapshot, analytic tasks
>>>> >>>> test-suites/    --> all pre/post/VR test suites under here
>>>> >>>>
>>>> >>>> README.md
>>>> >>>>
>>>> >>>> dataflow/    --> grouped by runner or unit test (tox)
>>>> >>>>
>>>> >>>> py27/    --> grouped by py version
>>>> >>>>
>>>> >>>> build.gradle
>>>> >>>>
>>>> >>>> py35/
>>>> >>>>
>>>> >>>> ...
>>>> >>>>
>>>> >>>> direct/
>>>> >>>>
>>>> >>>> py27/
>>>> >>>>
>>>> >>>> ...
>>>> >>>>
>>>> >>>> flink/
>>>> >>>>
>>>> >>>> tox/
>>>> >>>> ...
>>>> >>>>
>>>> >>>>
>>>> >>>> The ideas are:
>>>> >>>> - Only keep builds, snapshot and analytic jobs in
>>>> sdks/python/build.gradle
>>>> >>>> - Move all test related tasks to sdks/python/test-suites/
>>>> >>>> - In sdks/python/test-suites, we first group by runners, unit test
>>>> or other testing that can't fit to them, and then group by py versions if
>>>> needed.
>>>> >>>> - An example of ../test-suites/../py35/build.gradle is this.
>>>> >>>>
>>>> >>>> Please feel free to explore existing Gradle scripts in Python SDK
>>>> and bring any thoughts on this proposal if you have.
>>>> >>>>
>>>> >>>> Thanks!
>>>> >>>> Mark
>>>
>>> [image: https://ml6.eu]
<https://urldefense.proofpoint.com/v2/url?u=https-3A__ml6.eu_&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=yd_him24QhfROm7uRZLbfSsUHaA68_8FMl6s1MgT5sM&e=>


* Frederik Bode*

ML6 Ghent
<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.be_maps_place_ML6_-4051.037408-2C3.7044893-2C17z_data-3D-213m1-214b1-214m5-213m4-211s0x47c37161feeca14b-3A0xb8f72585fdd21c90-218m2-213d51.037408-214d3.706678-3Fhl-3Dnl&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=26TZxPGXg0A_mqgeiw1lMeZYekpkExBAZ5MpavpUZmw&e=>
+32 4 92 78 96 18


**** DISCLAIMER ****

This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system manager.
This message contains confidential information and is intended only for the
individual named. If you are not the named addressee you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately by e-mail ifyou have received this e-mail by mistake and delete
this e-mail from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.

Re: [PROPOSAL] Standardize Gradle structure in Python SDK

Posted by Mark Liu <ma...@apache.org>.
Adding my comments below.

On Mon, Jul 15, 2019 at 2:52 PM Kenneth Knowles <ke...@apache.org> wrote:

> Gradle comments inline
>
> On Mon, Jul 15, 2019 at 2:30 AM Frederik Bode <fr...@ml6.eu>
> wrote:
>
>> Hi Mark & others,
>>
>> +1 on using this structure. I don't see any other alternative to gradle
>> as some of the Python tasks have Java tasks as
>> a dependency. You can't debug that using just `python nosetests... or
>> tox`.  Parallelizing such tasks requires different
>> projects (and I don't think gradle supports multiple projects per
>> directory), so for each python
>> version we need a different folder. Having seperate build.gradle files
>> for each python version would also enable the
>> different versions to diverge on which tests they execute (e.g. not
>> running some tests for python 3.5 and 3.6 to
>> reduce Jenkins footprint in a PreCommit).
>>
>> Valentyn, the code duplication problem can be addressed using a Gradle
>> script plugin , which
>> is a build.script (maybe another name is better?)  in
>> /sdks/python/test-suites/[runner], which you then import in
>> /sdks/python/test-suites/[runner]/pyXX/build.gradle with the
>> correct pythonVersion set, and then using `apply from`. See [1] for an
>> example. The location of
>> the python2 test-suite can be remedied by moving it to your
>> suggestion. +1 on that as well.
>>
>
> The `apply from` syntax is another way of authoring a gradle plugin, and a
> bit more limited. BeamModulePlugin used to be a script `build_rules.groovy`
> applied that way, and we moved it to the magical buildSrc/ directory so it
> could have its own clear dependencies (its the buildSrc/build.gradle) and
> be refactored into pieces over time. It is just as easy to make a plugin
> there and I would recommend it. This is what we did with the vendored
> artifacts.
>

+1 using Gradle plugin. New build.gradle also creates an unnecessary Gradle
project which adds overheads in execution. Since we already have some
plugin examples in buildSrc/, adding extra one for a particular purpose (or
using existing plugins) would be preferred.

>
> On the coupling of projects or the structure not being natural, I think
>> that we can look at
>> this differently. Right now, in the Python SDK, all common code for tests
>> that needs to be parallelized is
>> in placed in BeamModulePlugin, which in turn couples all projects that
>> use it. It's one centralized
>> item that couples many different projects. It has code from Java, Python
>> and Go, and
>> currently has almost 2000 lines of code, which is IMHO not the way to go.
>>
>
> Exactly. Each of BeamModulePlugin.applyXYZNature(...) are probably good to
> separate as other plugins, for example you could start BeamPythonPlugin.
> The reason these are separate methods instead of separate plugins is
> because we started off in one big `build_rules.groovy` file; it is
> historical and totally OK to fix up.
>

One extra bonus for separating BeamModulePlugin per language is that we can
avoid irrelevant precommit tests being triggered when making changes in
BeamModulePlugin. This came to me recently when I made Python specific
changes. Some Java and Go precommit tests triggered because I touched
BeamModulePlugin. Sometimes the test is flaky so I have to
investigate/verify the failure is not related to my changes.

>
>
Moving the python code
>> from this binary plugin to a script plugin file defined in the the parent
>> directory of children
>> projects that use that code (as described in the paragraph above) moves
>> the coupling of a lot of projects through
>> BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local
>> coupling).
>>
>
> Relative filesystem paths are also a weakness that caused pain plenty of
> times. I highly recommend building a plugin with an id rather than `apply
> from` relative paths such as parent directories. Or if there is a way to
> `apply from` without a relative path, that is probably even more confusing.
> buildSrc is the way to go. It is also much faster since it gains
> incremental compilation of the plugin.
>
> Kenn
>
>
>> In short, yes it might be unnatural, but it's still better than before.
>>
>> [1] https://github.com/apache/beam/pull/8877
>> <https://github.com/apache/beam/pull/8877/files>
>>
>> Thanks,
>> Frederik Bode
>>
>> On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Hey Mark & others,
>>>
>>> We've been following the structure proposed in this thread to extend
>>> test coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see
>>> [1].
>>>
>>> This structure allowed us to add 3.x suites without slowing down the
>>> pre/postcommit execution time. We can actually see a drop in precommit
>>> latency [2] around March 23 we first made some Python 3.x suites run in
>>> parallel, and we have added more suites since then without slowing down
>>> pre/postcommits. Therefore I am in favor of this proposal, especially since
>>> AFAIK we don't have better one. Thanks a lot!
>>>
>>> I do have some feedback on this proposal:
>>>
>>> 1. There is a duplication of gradle code between test suites for
>>> different python minor versions, for example see the identical definition
>>> of DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].
>>>
>>> Possible solution to reduce the duplication is to move common code that
>>> defines a task into a separate groovy file shared across multiple gradle
>>> files. We have an example of this, where enablePythonPerformanceTest() is
>>> defined in BeamModulePlugin.groovy, and used in several build.gradle files
>>> to create a gradle task required for performance tests, see: [6]. I
>>> followed the same example in a Python 3 test suite for Portable Flink
>>> Runner I am working on [3], however I am not sure if BeamModulePlugin is
>>> the best place to define common gradle tasks to needed for Python CI.
>>> Perhaps we can make a separate groovy file for this purpose in
>>> sdk/python/test-suites?
>>>
>>
>> I would suggest placing the shared code in a new file in
>> https://github.com/apache/beam/tree/master/buildSrc/src/main/groovy/org/apache/beam/gradle,
>> we have several other groovy files related to building defined there
>> already.
>>
>>
>>> 2. Python 3 test suites currently live in sdks/python/test-suites, while
>>> most Python 2 suites are still defined in sdks/python/build.gradle.
>>>
>>> This may cause confusion for folks working on adding new Python suites.
>>> If there is an overall agreement on proposed structure I suggest to  start
>>> moving Python 2 CI tasks out of  sdks/python/build.gradle into
>>> sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
>>> file. If there are better alternatives we can continue discussing them here.
>>>
>>
>> For the runner specific Java ITs, we have been trying to get the tests to
>> be placed inside the runners own directory instead of underneath the SDK
>> directories. This was to make it easy to associate test task -> runner. So
>> an alternative would be to have runners/[runner]/test-suites/py27/... be
>> that location.
>>
>>
>>> Thanks,
>>> Valenyn
>>>
>>>
>>> [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
>>> [2]
>>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736
>>> [3] https://github.com/apache/beam/pull/8745
>>> [4]
>>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
>>> [5]
>>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
>>> [6]
>>> https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest
>>>
>>>
>>> On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> I don't use gradle commands for Python development either, because they
>>>> are slow (no incremental testing).
>>>>
>>>>
>>>>
>>>> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <ad...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <ad...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Really like the idea of improving here.
>>>>>> >
>>>>>> > Unfortunately, I haven't worked with python on that scale yet, so
>>>>>> bear with my naive understandings in this regard. If I understand
>>>>>> correctly, the suggestion will result in a couple of projects consisting
>>>>>> only of a build,gradle file to kind of workaround on gradles decision not
>>>>>> to parallelize within projects, right? In consequence, this also kind of
>>>>>> decouples projects from their content - they stuff which constitutes the
>>>>>> project - and forces the build file to 'somehow reach out to content of
>>>>>> other (only python root?) projects. E.g couples projects. This somehow
>>>>>> 'feels non natural' to me. But, of course, might be the path to go. As I
>>>>>> said before, never worked on python on that scale.
>>>>>>
>>>>>> It feels a bit odd to me as well. Is it possible to have multiple
>>>>>> projects per directory (e.g. a suite of testing ones) rather than
>>>>>> having to break things up like this, especially if the goal is
>>>>>> primarily to get parallel running of tests? Especially if we could
>>>>>> automatically create the cross-product rather than manually? There
>>>>>> also seems to be some redundancy with what tox is doing here.
>>>>>>
>>>>>
>>>>> Not sure, whether I understand correctly. But I do not think that's
>>>>> possible. If we are going to do some cross-product, we are probably better
>>>>> of doing that on tasks, e.g. by leveraging task rules or programmatically
>>>>> adding tasks (which is already done in parts). Of course, this will not
>>>>> help with parallelisation (but might enable that, see below).
>>>>>
>>>>>
>>>>>>
>>>>>> > But I believe to remember Robert talking about using in project
>>>>>> parallelisation for his development. Is this something which could also
>>>>>> work on CI? Of course, that will not help with different python versions,
>>>>>> but maybe that could be solved also by gradles variants which are
>>>>>> introduced in 5.3 - definitely need some time to investigate the
>>>>>> possibilities here. On first sight it feels like lots of duplication to
>>>>>> create 'builds' for any python version. Or wouldn't that be the case?
>>>>>> >
>>>>>> > And another naive thought on my side, isn't that non
>>>>>> parallelizability also caused by the monolithic setup of the python code
>>>>>> base? E.g. if I understand correctly, java sdk is split into
>>>>>> core/runners/ios etc, each encapsulate into full blown projects, i.e.
>>>>>> buckets of sources, tests and build file. Would it technically possible to
>>>>>> do something similar with python? I assume that being discussed before and
>>>>>> teared apart, but couldn't find on mailing list.
>>>>>>
>>>>>> Neither the culture nor the tooling of Python supports lots of
>>>>>> interdependent "sub-packages" for a single project--at least not
>>>>>> something smaller than one would want to deploy to Pypi. So while one
>>>>>> could do this, it'd be going against the grain. There are also much
>>>>>> lower-hanging opportunities for parallelization (e.g. running the test
>>>>>> suites for separate python versions in parallel).
>>>>>>
>>>>>> It's not very natural (as I understand it) with Go either. If we're
>>>>>> talking directory re-organization, I think it would make sense to
>>>>>> consider having top-level java, python, go, ... next to model,
>>>>>> website, etc.
>>>>>>
>>>>>
>>>>> Yes. We shouldn't work against common culture/practices, but try to
>>>>> embrace native tooling and add support where required.
>>>>>
>>>>> To reiterate on parallelisation, there are (at least) three
>>>>> opportunities:
>>>>>
>>>>> 1. Parallelise on test level. For python, this is detox?
>>>>>
>>>>
>>>> This is actually 2 levels. :)
>>>> 1a. Parallelise at the nosetest level - run unit tests in parallel in a
>>>> single tox environment. (I have a PR in progress to migrate to pytest, and
>>>> we should be able to do file-level parallelism provided we solve pickling
>>>> issues.)
>>>> 1b. Parallelise at the tox environment level, e..g, somehow running the
>>>> multiple tox environments (py27,py27-cython,py35,...) in parallel.
>>>>
>>>>
>>>>> 2. Parallelise on Gradle project level (--parallel option)
>>>>> 3. Parallelise on CI level
>>>>>
>>>>> So what I ve done before, if 1. does not help, nor 2. cause project is
>>>>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I
>>>>> could imagine right now would be - as suggested above - to split
>>>>> pythonPreCommit into something like pythonX_YPrecommit, which then runs
>>>>> those different python versions in parallel. Of course, that could be done
>>>>> ad infinitum by splitting further into runners, IOs whatever.
>>>>>
>>>>> OTOH, we do already have tons of jobs running as we seem to map Gradle
>>>>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI
>>>>> parallelization on jenkins pipeline level. Something like creating
>>>>> pythonPrecommit as a pipeline, which itself runs several steps in parallel.
>>>>> Did not contemplate on the impacts, though, on PRs and phrase triggering.
>>>>> Of course, we might need to improve on our 'verbosity' as currently it
>>>>> seems not always clear what command line actually triggered etc. With this
>>>>> approach, it seems possible to run tasks within same Gradle project in
>>>>> parallel with whatever granularity we might want to choose. (Of course we
>>>>> do add overhead here, but that's probably always the case by doing things
>>>>> in parallel)
>>>>>
>>>>> Of course 3. would not help on parallelisation on developers machine,
>>>>> but I never had a problem with that.
>>>>>
>>>>>
>>>>>>
>>>>>> > And as a last thought, will usage of pygradle help with better
>>>>>> python/gradle integration? Currently, we mainly use gradle to call into
>>>>>> shell scripts, which doesn't help gradle nor probably pythons tooling to do
>>>>>> the job very well? But deeper integration might cause problems on python
>>>>>> dev side, dunno :(
>>>>>>
>>>>>> Possibly.
>>>>>>
>>>>>> Are there any Python developers that primarily use the gradle
>>>>>> commands? Personally, I only use them if I'm using Java (or sometimes
>>>>>> work that is a mix of Java and Python, e.g. the Python-on-Flink
>>>>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]"
>>>>>> directly. Gradle primarily has value as a top-level orchestration (in
>>>>>> particular for CI) and easy way for those who only touch Python
>>>>>> occasionally to run all the tests. If that's the case, optimizing our
>>>>>> gradle scripts for CI seems best.
>>>>>>
>>>>>
>>>>> My impression till today is, that nobody is actually using Gradle on
>>>>> daily python development. So I assume it is currently used on CI only. But
>>>>> I believe gradle could add value for devs also. We just need to improve our
>>>>> current integration. And to be very clear, gradle should not replace python
>>>>> tooling, but support usage in a way which looks familiar to any python dev.
>>>>> And probably render some shell scripting obsolete.
>>>>>
>>>>>
>>>>>>
>>>>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <ma...@apache.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Thank you Ahmet. Answer your questions below:
>>>>>> >>
>>>>>> >>>
>>>>>> >>> - Could you comment on what kind of parallelization we will gain
>>>>>> by this? In terms of real numbers, how would this affect build and test
>>>>>> times?
>>>>>> >>
>>>>>> >>
>>>>>> >> The proposal is based on Gradle parallel execution: "you can force
>>>>>> Gradle to execute tasks in parallel as long as those tasks are in different
>>>>>> projects". In Beam, project is declared per build.gradle file and
>>>>>> registered in settings.gradle. Tasks that are included in single Gradle
>>>>>> execution will run in parallel only if they are declared in separate
>>>>>> build.gradle files.
>>>>>> >>
>>>>>> >> An example of applying parallel is beam_PreCommit_Python job which
>>>>>> runs :pythonPreCommit task that contains tasks distributed in 4
>>>>>> build.gradle. The execution graph looks like
>>>>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>>>>> >>
>>>>>> >> Without this proposal, all tasks will run in sequential which can
>>>>>> be ~2x longer. If more py36 and py37 tests added in the future, things will
>>>>>> be even worse.
>>>>>> >>
>>>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>>>> quantify the improvement related to this?
>>>>>> >>
>>>>>> >>
>>>>>> >> The general code complexity of function/method/property may not
>>>>>> change here since we basically group tasks in a different way without
>>>>>> changing inside logic. I don't know if there is any tool to measure Gradle
>>>>>> build complexity. Would love to try if there is.
>>>>>> >>
>>>>>> >>>
>>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>>>> Just want to clarify this. In either case, would you need help?
>>>>>> >>
>>>>>> >>
>>>>>> >> Yes, I'd love to take on major refactor works. At the same time,
>>>>>> I'll create jira for each kind of tests (like flink/protable/hdfs tests) in
>>>>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone
>>>>>> interested to this work are welcome to contribute!
>>>>>> >>
>>>>>> >> Mark
>>>>>> >>
>>>>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> This sounds good to me. Thank you for doing this. Few questions:
>>>>>> >>> - Could you comment on what kind of parallelization we will gain
>>>>>> by this? In terms of real numbers, how would this affect build and test
>>>>>> times?
>>>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>>>> quantify the improvement related to this?
>>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>>>> Just want to clarify this. In either case, would you need help?
>>>>>> >>>
>>>>>> >>> Thank you,
>>>>>> >>> Ahmet
>>>>>> >>>
>>>>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <ma...@apache.org>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi Python SDK Developers,
>>>>>> >>>>
>>>>>> >>>> You may notice that Gradle files changed a lot recently as
>>>>>> parallelization applied to Python tests and more python versions were
>>>>>> enabled in testing. There are tricks over the build scripts and tests are
>>>>>> grown naturally and distributed under sdks/python, which caused frictions
>>>>>> (like rollback PR-8059).
>>>>>> >>>>
>>>>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works
>>>>>> to cleanup and standardize Gradle structure in Python SDK. In general, I
>>>>>> think we want to:
>>>>>> >>>>
>>>>>> >>>> - Apply parallel execution
>>>>>> >>>> - Share common tasks
>>>>>> >>>> - Centralize test related tasks
>>>>>> >>>> - Have a clear Gradle structure for projects/tasks
>>>>>> >>>>
>>>>>> >>>> This is Gradle directory structure I proposed:
>>>>>> >>>>
>>>>>> >>>> sdks/python/
>>>>>> >>>>
>>>>>> >>>> build.gradle    --> hold builds, snapshot, analytic tasks
>>>>>> >>>> test-suites/    --> all pre/post/VR test suites under here
>>>>>> >>>>
>>>>>> >>>> README.md
>>>>>> >>>>
>>>>>> >>>> dataflow/    --> grouped by runner or unit test (tox)
>>>>>> >>>>
>>>>>> >>>> py27/    --> grouped by py version
>>>>>> >>>>
>>>>>> >>>> build.gradle
>>>>>> >>>>
>>>>>> >>>> py35/
>>>>>> >>>>
>>>>>> >>>> ...
>>>>>> >>>>
>>>>>> >>>> direct/
>>>>>> >>>>
>>>>>> >>>> py27/
>>>>>> >>>>
>>>>>> >>>> ...
>>>>>> >>>>
>>>>>> >>>> flink/
>>>>>> >>>>
>>>>>> >>>> tox/
>>>>>> >>>> ...
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> The ideas are:
>>>>>> >>>> - Only keep builds, snapshot and analytic jobs in
>>>>>> sdks/python/build.gradle
>>>>>> >>>> - Move all test related tasks to sdks/python/test-suites/
>>>>>> >>>> - In sdks/python/test-suites, we first group by runners, unit
>>>>>> test or other testing that can't fit to them, and then group by py versions
>>>>>> if needed.
>>>>>> >>>> - An example of ../test-suites/../py35/build.gradle is this.
>>>>>> >>>>
>>>>>> >>>> Please feel free to explore existing Gradle scripts in Python
>>>>>> SDK and bring any thoughts on this proposal if you have.
>>>>>> >>>>
>>>>>> >>>> Thanks!
>>>>>> >>>> Mark
>>>>>
>>>>> [image: https://ml6.eu]
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ml6.eu_&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=yd_him24QhfROm7uRZLbfSsUHaA68_8FMl6s1MgT5sM&e=>
>>
>>
>> * Frederik Bode*
>>
>> ML6 Ghent
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.be_maps_place_ML6_-4051.037408-2C3.7044893-2C17z_data-3D-213m1-214b1-214m5-213m4-211s0x47c37161feeca14b-3A0xb8f72585fdd21c90-218m2-213d51.037408-214d3.706678-3Fhl-3Dnl&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=26TZxPGXg0A_mqgeiw1lMeZYekpkExBAZ5MpavpUZmw&e=>
>> +32 4 92 78 96 18
>>
>>
>> **** DISCLAIMER ****
>>
>> This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they are
>> addressed. If you have received this email in error please notify the
>> system manager. This message contains confidential information and is
>> intended only for the individual named. If you are not the named addressee
>> you should not disseminate, distribute or copy this e-mail. Please notify
>> the sender immediately by e-mail ifyou have received this e-mail by mistake
>> and delete this e-mail from your system. If you are not the intended
>> recipient you are notified that disclosing, copying, distributing or taking
>> any action in reliance on the contents of this information is strictly
>> prohibited.
>>
>

Re: [PROPOSAL] Standardize Gradle structure in Python SDK

Posted by Kenneth Knowles <ke...@apache.org>.
Gradle comments inline

On Mon, Jul 15, 2019 at 2:30 AM Frederik Bode <fr...@ml6.eu> wrote:

> Hi Mark & others,
>
> +1 on using this structure. I don't see any other alternative to gradle as
> some of the Python tasks have Java tasks as
> a dependency. You can't debug that using just `python nosetests... or
> tox`.  Parallelizing such tasks requires different
> projects (and I don't think gradle supports multiple projects per
> directory), so for each python
> version we need a different folder. Having seperate build.gradle files for
> each python version would also enable the
> different versions to diverge on which tests they execute (e.g. not
> running some tests for python 3.5 and 3.6 to
> reduce Jenkins footprint in a PreCommit).
>
> Valentyn, the code duplication problem can be addressed using a Gradle
> script plugin , which
> is a build.script (maybe another name is better?)  in
> /sdks/python/test-suites/[runner], which you then import in
> /sdks/python/test-suites/[runner]/pyXX/build.gradle with the
> correct pythonVersion set, and then using `apply from`. See [1] for an
> example. The location of
> the python2 test-suite can be remedied by moving it to your suggestion. +1
> on that as well.
>

The `apply from` syntax is another way of authoring a gradle plugin, and a
bit more limited. BeamModulePlugin used to be a script `build_rules.groovy`
applied that way, and we moved it to the magical buildSrc/ directory so it
could have its own clear dependencies (its the buildSrc/build.gradle) and
be refactored into pieces over time. It is just as easy to make a plugin
there and I would recommend it. This is what we did with the vendored
artifacts.

On the coupling of projects or the structure not being natural, I think
> that we can look at
> this differently. Right now, in the Python SDK, all common code for tests
> that needs to be parallelized is
> in placed in BeamModulePlugin, which in turn couples all projects that use
> it. It's one centralized
> item that couples many different projects. It has code from Java, Python
> and Go, and
> currently has almost 2000 lines of code, which is IMHO not the way to go.
>

Exactly. Each of BeamModulePlugin.applyXYZNature(...) are probably good to
separate as other plugins, for example you could start BeamPythonPlugin.
The reason these are separate methods instead of separate plugins is
because we started off in one big `build_rules.groovy` file; it is
historical and totally OK to fix up.


> Moving the python code
> from this binary plugin to a script plugin file defined in the the parent
> directory of children
> projects that use that code (as described in the paragraph above) moves
> the coupling of a lot of projects through
> BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local
> coupling).
>

Relative filesystem paths are also a weakness that caused pain plenty of
times. I highly recommend building a plugin with an id rather than `apply
from` relative paths such as parent directories. Or if there is a way to
`apply from` without a relative path, that is probably even more confusing.
buildSrc is the way to go. It is also much faster since it gains
incremental compilation of the plugin.

Kenn


> In short, yes it might be unnatural, but it's still better than before.
>
> [1] https://github.com/apache/beam/pull/8877
> <https://github.com/apache/beam/pull/8877/files>
>
> Thanks,
> Frederik Bode
>
> On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Hey Mark & others,
>>
>> We've been following the structure proposed in this thread to extend test
>> coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see [1].
>>
>> This structure allowed us to add 3.x suites without slowing down the
>> pre/postcommit execution time. We can actually see a drop in precommit
>> latency [2] around March 23 we first made some Python 3.x suites run in
>> parallel, and we have added more suites since then without slowing down
>> pre/postcommits. Therefore I am in favor of this proposal, especially since
>> AFAIK we don't have better one. Thanks a lot!
>>
>> I do have some feedback on this proposal:
>>
>> 1. There is a duplication of gradle code between test suites for
>> different python minor versions, for example see the identical definition
>> of DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].
>>
>> Possible solution to reduce the duplication is to move common code that
>> defines a task into a separate groovy file shared across multiple gradle
>> files. We have an example of this, where enablePythonPerformanceTest() is
>> defined in BeamModulePlugin.groovy, and used in several build.gradle files
>> to create a gradle task required for performance tests, see: [6]. I
>> followed the same example in a Python 3 test suite for Portable Flink
>> Runner I am working on [3], however I am not sure if BeamModulePlugin is
>> the best place to define common gradle tasks to needed for Python CI.
>> Perhaps we can make a separate groovy file for this purpose in
>> sdk/python/test-suites?
>>
>
> I would suggest placing the shared code in a new file in
> https://github.com/apache/beam/tree/master/buildSrc/src/main/groovy/org/apache/beam/gradle,
> we have several other groovy files related to building defined there
> already.
>
>
>> 2. Python 3 test suites currently live in sdks/python/test-suites, while
>> most Python 2 suites are still defined in sdks/python/build.gradle.
>>
>> This may cause confusion for folks working on adding new Python suites.
>> If there is an overall agreement on proposed structure I suggest to  start
>> moving Python 2 CI tasks out of  sdks/python/build.gradle into
>> sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
>> file. If there are better alternatives we can continue discussing them here.
>>
>
> For the runner specific Java ITs, we have been trying to get the tests to
> be placed inside the runners own directory instead of underneath the SDK
> directories. This was to make it easy to associate test task -> runner. So
> an alternative would be to have runners/[runner]/test-suites/py27/... be
> that location.
>
>
>> Thanks,
>> Valenyn
>>
>>
>> [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
>> [2]
>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736
>> [3] https://github.com/apache/beam/pull/8745
>> [4]
>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
>> [5]
>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
>> [6]
>> https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest
>>
>>
>> On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote:
>>
>>> I don't use gradle commands for Python development either, because they
>>> are slow (no incremental testing).
>>>
>>>
>>>
>>> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <ad...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <ad...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Really like the idea of improving here.
>>>>> >
>>>>> > Unfortunately, I haven't worked with python on that scale yet, so
>>>>> bear with my naive understandings in this regard. If I understand
>>>>> correctly, the suggestion will result in a couple of projects consisting
>>>>> only of a build,gradle file to kind of workaround on gradles decision not
>>>>> to parallelize within projects, right? In consequence, this also kind of
>>>>> decouples projects from their content - they stuff which constitutes the
>>>>> project - and forces the build file to 'somehow reach out to content of
>>>>> other (only python root?) projects. E.g couples projects. This somehow
>>>>> 'feels non natural' to me. But, of course, might be the path to go. As I
>>>>> said before, never worked on python on that scale.
>>>>>
>>>>> It feels a bit odd to me as well. Is it possible to have multiple
>>>>> projects per directory (e.g. a suite of testing ones) rather than
>>>>> having to break things up like this, especially if the goal is
>>>>> primarily to get parallel running of tests? Especially if we could
>>>>> automatically create the cross-product rather than manually? There
>>>>> also seems to be some redundancy with what tox is doing here.
>>>>>
>>>>
>>>> Not sure, whether I understand correctly. But I do not think that's
>>>> possible. If we are going to do some cross-product, we are probably better
>>>> of doing that on tasks, e.g. by leveraging task rules or programmatically
>>>> adding tasks (which is already done in parts). Of course, this will not
>>>> help with parallelisation (but might enable that, see below).
>>>>
>>>>
>>>>>
>>>>> > But I believe to remember Robert talking about using in project
>>>>> parallelisation for his development. Is this something which could also
>>>>> work on CI? Of course, that will not help with different python versions,
>>>>> but maybe that could be solved also by gradles variants which are
>>>>> introduced in 5.3 - definitely need some time to investigate the
>>>>> possibilities here. On first sight it feels like lots of duplication to
>>>>> create 'builds' for any python version. Or wouldn't that be the case?
>>>>> >
>>>>> > And another naive thought on my side, isn't that non
>>>>> parallelizability also caused by the monolithic setup of the python code
>>>>> base? E.g. if I understand correctly, java sdk is split into
>>>>> core/runners/ios etc, each encapsulate into full blown projects, i.e.
>>>>> buckets of sources, tests and build file. Would it technically possible to
>>>>> do something similar with python? I assume that being discussed before and
>>>>> teared apart, but couldn't find on mailing list.
>>>>>
>>>>> Neither the culture nor the tooling of Python supports lots of
>>>>> interdependent "sub-packages" for a single project--at least not
>>>>> something smaller than one would want to deploy to Pypi. So while one
>>>>> could do this, it'd be going against the grain. There are also much
>>>>> lower-hanging opportunities for parallelization (e.g. running the test
>>>>> suites for separate python versions in parallel).
>>>>>
>>>>> It's not very natural (as I understand it) with Go either. If we're
>>>>> talking directory re-organization, I think it would make sense to
>>>>> consider having top-level java, python, go, ... next to model,
>>>>> website, etc.
>>>>>
>>>>
>>>> Yes. We shouldn't work against common culture/practices, but try to
>>>> embrace native tooling and add support where required.
>>>>
>>>> To reiterate on parallelisation, there are (at least) three
>>>> opportunities:
>>>>
>>>> 1. Parallelise on test level. For python, this is detox?
>>>>
>>>
>>> This is actually 2 levels. :)
>>> 1a. Parallelise at the nosetest level - run unit tests in parallel in a
>>> single tox environment. (I have a PR in progress to migrate to pytest, and
>>> we should be able to do file-level parallelism provided we solve pickling
>>> issues.)
>>> 1b. Parallelise at the tox environment level, e..g, somehow running the
>>> multiple tox environments (py27,py27-cython,py35,...) in parallel.
>>>
>>>
>>>> 2. Parallelise on Gradle project level (--parallel option)
>>>> 3. Parallelise on CI level
>>>>
>>>> So what I ve done before, if 1. does not help, nor 2. cause project is
>>>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I
>>>> could imagine right now would be - as suggested above - to split
>>>> pythonPreCommit into something like pythonX_YPrecommit, which then runs
>>>> those different python versions in parallel. Of course, that could be done
>>>> ad infinitum by splitting further into runners, IOs whatever.
>>>>
>>>> OTOH, we do already have tons of jobs running as we seem to map Gradle
>>>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI
>>>> parallelization on jenkins pipeline level. Something like creating
>>>> pythonPrecommit as a pipeline, which itself runs several steps in parallel.
>>>> Did not contemplate on the impacts, though, on PRs and phrase triggering.
>>>> Of course, we might need to improve on our 'verbosity' as currently it
>>>> seems not always clear what command line actually triggered etc. With this
>>>> approach, it seems possible to run tasks within same Gradle project in
>>>> parallel with whatever granularity we might want to choose. (Of course we
>>>> do add overhead here, but that's probably always the case by doing things
>>>> in parallel)
>>>>
>>>> Of course 3. would not help on parallelisation on developers machine,
>>>> but I never had a problem with that.
>>>>
>>>>
>>>>>
>>>>> > And as a last thought, will usage of pygradle help with better
>>>>> python/gradle integration? Currently, we mainly use gradle to call into
>>>>> shell scripts, which doesn't help gradle nor probably pythons tooling to do
>>>>> the job very well? But deeper integration might cause problems on python
>>>>> dev side, dunno :(
>>>>>
>>>>> Possibly.
>>>>>
>>>>> Are there any Python developers that primarily use the gradle
>>>>> commands? Personally, I only use them if I'm using Java (or sometimes
>>>>> work that is a mix of Java and Python, e.g. the Python-on-Flink
>>>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]"
>>>>> directly. Gradle primarily has value as a top-level orchestration (in
>>>>> particular for CI) and easy way for those who only touch Python
>>>>> occasionally to run all the tests. If that's the case, optimizing our
>>>>> gradle scripts for CI seems best.
>>>>>
>>>>
>>>> My impression till today is, that nobody is actually using Gradle on
>>>> daily python development. So I assume it is currently used on CI only. But
>>>> I believe gradle could add value for devs also. We just need to improve our
>>>> current integration. And to be very clear, gradle should not replace python
>>>> tooling, but support usage in a way which looks familiar to any python dev.
>>>> And probably render some shell scripting obsolete.
>>>>
>>>>
>>>>>
>>>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <ma...@apache.org> wrote:
>>>>> >>
>>>>> >> Thank you Ahmet. Answer your questions below:
>>>>> >>
>>>>> >>>
>>>>> >>> - Could you comment on what kind of parallelization we will gain
>>>>> by this? In terms of real numbers, how would this affect build and test
>>>>> times?
>>>>> >>
>>>>> >>
>>>>> >> The proposal is based on Gradle parallel execution: "you can force
>>>>> Gradle to execute tasks in parallel as long as those tasks are in different
>>>>> projects". In Beam, project is declared per build.gradle file and
>>>>> registered in settings.gradle. Tasks that are included in single Gradle
>>>>> execution will run in parallel only if they are declared in separate
>>>>> build.gradle files.
>>>>> >>
>>>>> >> An example of applying parallel is beam_PreCommit_Python job which
>>>>> runs :pythonPreCommit task that contains tasks distributed in 4
>>>>> build.gradle. The execution graph looks like
>>>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>>>> >>
>>>>> >> Without this proposal, all tasks will run in sequential which can
>>>>> be ~2x longer. If more py36 and py37 tests added in the future, things will
>>>>> be even worse.
>>>>> >>
>>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>>> quantify the improvement related to this?
>>>>> >>
>>>>> >>
>>>>> >> The general code complexity of function/method/property may not
>>>>> change here since we basically group tasks in a different way without
>>>>> changing inside logic. I don't know if there is any tool to measure Gradle
>>>>> build complexity. Would love to try if there is.
>>>>> >>
>>>>> >>>
>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>>> Just want to clarify this. In either case, would you need help?
>>>>> >>
>>>>> >>
>>>>> >> Yes, I'd love to take on major refactor works. At the same time,
>>>>> I'll create jira for each kind of tests (like flink/protable/hdfs tests) in
>>>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone
>>>>> interested to this work are welcome to contribute!
>>>>> >>
>>>>> >> Mark
>>>>> >>
>>>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> This sounds good to me. Thank you for doing this. Few questions:
>>>>> >>> - Could you comment on what kind of parallelization we will gain
>>>>> by this? In terms of real numbers, how would this affect build and test
>>>>> times?
>>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>>> quantify the improvement related to this?
>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>>> Just want to clarify this. In either case, would you need help?
>>>>> >>>
>>>>> >>> Thank you,
>>>>> >>> Ahmet
>>>>> >>>
>>>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <ma...@apache.org>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi Python SDK Developers,
>>>>> >>>>
>>>>> >>>> You may notice that Gradle files changed a lot recently as
>>>>> parallelization applied to Python tests and more python versions were
>>>>> enabled in testing. There are tricks over the build scripts and tests are
>>>>> grown naturally and distributed under sdks/python, which caused frictions
>>>>> (like rollback PR-8059).
>>>>> >>>>
>>>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works
>>>>> to cleanup and standardize Gradle structure in Python SDK. In general, I
>>>>> think we want to:
>>>>> >>>>
>>>>> >>>> - Apply parallel execution
>>>>> >>>> - Share common tasks
>>>>> >>>> - Centralize test related tasks
>>>>> >>>> - Have a clear Gradle structure for projects/tasks
>>>>> >>>>
>>>>> >>>> This is Gradle directory structure I proposed:
>>>>> >>>>
>>>>> >>>> sdks/python/
>>>>> >>>>
>>>>> >>>> build.gradle    --> hold builds, snapshot, analytic tasks
>>>>> >>>> test-suites/    --> all pre/post/VR test suites under here
>>>>> >>>>
>>>>> >>>> README.md
>>>>> >>>>
>>>>> >>>> dataflow/    --> grouped by runner or unit test (tox)
>>>>> >>>>
>>>>> >>>> py27/    --> grouped by py version
>>>>> >>>>
>>>>> >>>> build.gradle
>>>>> >>>>
>>>>> >>>> py35/
>>>>> >>>>
>>>>> >>>> ...
>>>>> >>>>
>>>>> >>>> direct/
>>>>> >>>>
>>>>> >>>> py27/
>>>>> >>>>
>>>>> >>>> ...
>>>>> >>>>
>>>>> >>>> flink/
>>>>> >>>>
>>>>> >>>> tox/
>>>>> >>>> ...
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> The ideas are:
>>>>> >>>> - Only keep builds, snapshot and analytic jobs in
>>>>> sdks/python/build.gradle
>>>>> >>>> - Move all test related tasks to sdks/python/test-suites/
>>>>> >>>> - In sdks/python/test-suites, we first group by runners, unit
>>>>> test or other testing that can't fit to them, and then group by py versions
>>>>> if needed.
>>>>> >>>> - An example of ../test-suites/../py35/build.gradle is this.
>>>>> >>>>
>>>>> >>>> Please feel free to explore existing Gradle scripts in Python SDK
>>>>> and bring any thoughts on this proposal if you have.
>>>>> >>>>
>>>>> >>>> Thanks!
>>>>> >>>> Mark
>>>>
>>>> [image: https://ml6.eu]
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ml6.eu_&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=yd_him24QhfROm7uRZLbfSsUHaA68_8FMl6s1MgT5sM&e=>
>
>
> * Frederik Bode*
>
> ML6 Ghent
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.be_maps_place_ML6_-4051.037408-2C3.7044893-2C17z_data-3D-213m1-214b1-214m5-213m4-211s0x47c37161feeca14b-3A0xb8f72585fdd21c90-218m2-213d51.037408-214d3.706678-3Fhl-3Dnl&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=26TZxPGXg0A_mqgeiw1lMeZYekpkExBAZ5MpavpUZmw&e=>
> +32 4 92 78 96 18
>
>
> **** DISCLAIMER ****
>
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you have received this email in error please notify the system manager.
> This message contains confidential information and is intended only for the
> individual named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail ifyou have received this e-mail by mistake and delete
> this e-mail from your system. If you are not the intended recipient you are
> notified that disclosing, copying, distributing or taking any action in
> reliance on the contents of this information is strictly prohibited.
>