You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Mark Liu <ma...@apache.org> on 2019/08/08 23:19:11 UTC

Re: [PROPOSAL] Standardize Gradle structure in Python SDK

Adding my comments below.

On Mon, Jul 15, 2019 at 2:52 PM Kenneth Knowles <ke...@apache.org> wrote:

> Gradle comments inline
>
> On Mon, Jul 15, 2019 at 2:30 AM Frederik Bode <fr...@ml6.eu>
> wrote:
>
>> Hi Mark & others,
>>
>> +1 on using this structure. I don't see any other alternative to gradle
>> as some of the Python tasks have Java tasks as
>> a dependency. You can't debug that using just `python nosetests... or
>> tox`.  Parallelizing such tasks requires different
>> projects (and I don't think gradle supports multiple projects per
>> directory), so for each python
>> version we need a different folder. Having seperate build.gradle files
>> for each python version would also enable the
>> different versions to diverge on which tests they execute (e.g. not
>> running some tests for python 3.5 and 3.6 to
>> reduce Jenkins footprint in a PreCommit).
>>
>> Valentyn, the code duplication problem can be addressed using a Gradle
>> script plugin , which
>> is a build.script (maybe another name is better?)  in
>> /sdks/python/test-suites/[runner], which you then import in
>> /sdks/python/test-suites/[runner]/pyXX/build.gradle with the
>> correct pythonVersion set, and then using `apply from`. See [1] for an
>> example. The location of
>> the python2 test-suite can be remedied by moving it to your
>> suggestion. +1 on that as well.
>>
>
> The `apply from` syntax is another way of authoring a gradle plugin, and a
> bit more limited. BeamModulePlugin used to be a script `build_rules.groovy`
> applied that way, and we moved it to the magical buildSrc/ directory so it
> could have its own clear dependencies (its the buildSrc/build.gradle) and
> be refactored into pieces over time. It is just as easy to make a plugin
> there and I would recommend it. This is what we did with the vendored
> artifacts.
>

+1 using Gradle plugin. New build.gradle also creates an unnecessary Gradle
project which adds overheads in execution. Since we already have some
plugin examples in buildSrc/, adding extra one for a particular purpose (or
using existing plugins) would be preferred.

>
> On the coupling of projects or the structure not being natural, I think
>> that we can look at
>> this differently. Right now, in the Python SDK, all common code for tests
>> that needs to be parallelized is
>> in placed in BeamModulePlugin, which in turn couples all projects that
>> use it. It's one centralized
>> item that couples many different projects. It has code from Java, Python
>> and Go, and
>> currently has almost 2000 lines of code, which is IMHO not the way to go.
>>
>
> Exactly. Each of BeamModulePlugin.applyXYZNature(...) are probably good to
> separate as other plugins, for example you could start BeamPythonPlugin.
> The reason these are separate methods instead of separate plugins is
> because we started off in one big `build_rules.groovy` file; it is
> historical and totally OK to fix up.
>

One extra bonus for separating BeamModulePlugin per language is that we can
avoid irrelevant precommit tests being triggered when making changes in
BeamModulePlugin. This came to me recently when I made Python specific
changes. Some Java and Go precommit tests triggered because I touched
BeamModulePlugin. Sometimes the test is flaky so I have to
investigate/verify the failure is not related to my changes.

>
>
Moving the python code
>> from this binary plugin to a script plugin file defined in the the parent
>> directory of children
>> projects that use that code (as described in the paragraph above) moves
>> the coupling of a lot of projects through
>> BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local
>> coupling).
>>
>
> Relative filesystem paths are also a weakness that caused pain plenty of
> times. I highly recommend building a plugin with an id rather than `apply
> from` relative paths such as parent directories. Or if there is a way to
> `apply from` without a relative path, that is probably even more confusing.
> buildSrc is the way to go. It is also much faster since it gains
> incremental compilation of the plugin.
>
> Kenn
>
>
>> In short, yes it might be unnatural, but it's still better than before.
>>
>> [1] https://github.com/apache/beam/pull/8877
>> <https://github.com/apache/beam/pull/8877/files>
>>
>> Thanks,
>> Frederik Bode
>>
>> On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Hey Mark & others,
>>>
>>> We've been following the structure proposed in this thread to extend
>>> test coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see
>>> [1].
>>>
>>> This structure allowed us to add 3.x suites without slowing down the
>>> pre/postcommit execution time. We can actually see a drop in precommit
>>> latency [2] around March 23 we first made some Python 3.x suites run in
>>> parallel, and we have added more suites since then without slowing down
>>> pre/postcommits. Therefore I am in favor of this proposal, especially since
>>> AFAIK we don't have better one. Thanks a lot!
>>>
>>> I do have some feedback on this proposal:
>>>
>>> 1. There is a duplication of gradle code between test suites for
>>> different python minor versions, for example see the identical definition
>>> of DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].
>>>
>>> Possible solution to reduce the duplication is to move common code that
>>> defines a task into a separate groovy file shared across multiple gradle
>>> files. We have an example of this, where enablePythonPerformanceTest() is
>>> defined in BeamModulePlugin.groovy, and used in several build.gradle files
>>> to create a gradle task required for performance tests, see: [6]. I
>>> followed the same example in a Python 3 test suite for Portable Flink
>>> Runner I am working on [3], however I am not sure if BeamModulePlugin is
>>> the best place to define common gradle tasks to needed for Python CI.
>>> Perhaps we can make a separate groovy file for this purpose in
>>> sdk/python/test-suites?
>>>
>>
>> I would suggest placing the shared code in a new file in
>> https://github.com/apache/beam/tree/master/buildSrc/src/main/groovy/org/apache/beam/gradle,
>> we have several other groovy files related to building defined there
>> already.
>>
>>
>>> 2. Python 3 test suites currently live in sdks/python/test-suites, while
>>> most Python 2 suites are still defined in sdks/python/build.gradle.
>>>
>>> This may cause confusion for folks working on adding new Python suites.
>>> If there is an overall agreement on proposed structure I suggest to  start
>>> moving Python 2 CI tasks out of  sdks/python/build.gradle into
>>> sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
>>> file. If there are better alternatives we can continue discussing them here.
>>>
>>
>> For the runner specific Java ITs, we have been trying to get the tests to
>> be placed inside the runners own directory instead of underneath the SDK
>> directories. This was to make it easy to associate test task -> runner. So
>> an alternative would be to have runners/[runner]/test-suites/py27/... be
>> that location.
>>
>>
>>> Thanks,
>>> Valenyn
>>>
>>>
>>> [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
>>> [2]
>>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736
>>> [3] https://github.com/apache/beam/pull/8745
>>> [4]
>>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
>>> [5]
>>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
>>> [6]
>>> https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest
>>>
>>>
>>> On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> I don't use gradle commands for Python development either, because they
>>>> are slow (no incremental testing).
>>>>
>>>>
>>>>
>>>> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <ad...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <ad...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Really like the idea of improving here.
>>>>>> >
>>>>>> > Unfortunately, I haven't worked with python on that scale yet, so
>>>>>> bear with my naive understandings in this regard. If I understand
>>>>>> correctly, the suggestion will result in a couple of projects consisting
>>>>>> only of a build,gradle file to kind of workaround on gradles decision not
>>>>>> to parallelize within projects, right? In consequence, this also kind of
>>>>>> decouples projects from their content - they stuff which constitutes the
>>>>>> project - and forces the build file to 'somehow reach out to content of
>>>>>> other (only python root?) projects. E.g couples projects. This somehow
>>>>>> 'feels non natural' to me. But, of course, might be the path to go. As I
>>>>>> said before, never worked on python on that scale.
>>>>>>
>>>>>> It feels a bit odd to me as well. Is it possible to have multiple
>>>>>> projects per directory (e.g. a suite of testing ones) rather than
>>>>>> having to break things up like this, especially if the goal is
>>>>>> primarily to get parallel running of tests? Especially if we could
>>>>>> automatically create the cross-product rather than manually? There
>>>>>> also seems to be some redundancy with what tox is doing here.
>>>>>>
>>>>>
>>>>> Not sure, whether I understand correctly. But I do not think that's
>>>>> possible. If we are going to do some cross-product, we are probably better
>>>>> of doing that on tasks, e.g. by leveraging task rules or programmatically
>>>>> adding tasks (which is already done in parts). Of course, this will not
>>>>> help with parallelisation (but might enable that, see below).
>>>>>
>>>>>
>>>>>>
>>>>>> > But I believe to remember Robert talking about using in project
>>>>>> parallelisation for his development. Is this something which could also
>>>>>> work on CI? Of course, that will not help with different python versions,
>>>>>> but maybe that could be solved also by gradles variants which are
>>>>>> introduced in 5.3 - definitely need some time to investigate the
>>>>>> possibilities here. On first sight it feels like lots of duplication to
>>>>>> create 'builds' for any python version. Or wouldn't that be the case?
>>>>>> >
>>>>>> > And another naive thought on my side, isn't that non
>>>>>> parallelizability also caused by the monolithic setup of the python code
>>>>>> base? E.g. if I understand correctly, java sdk is split into
>>>>>> core/runners/ios etc, each encapsulate into full blown projects, i.e.
>>>>>> buckets of sources, tests and build file. Would it technically possible to
>>>>>> do something similar with python? I assume that being discussed before and
>>>>>> teared apart, but couldn't find on mailing list.
>>>>>>
>>>>>> Neither the culture nor the tooling of Python supports lots of
>>>>>> interdependent "sub-packages" for a single project--at least not
>>>>>> something smaller than one would want to deploy to Pypi. So while one
>>>>>> could do this, it'd be going against the grain. There are also much
>>>>>> lower-hanging opportunities for parallelization (e.g. running the test
>>>>>> suites for separate python versions in parallel).
>>>>>>
>>>>>> It's not very natural (as I understand it) with Go either. If we're
>>>>>> talking directory re-organization, I think it would make sense to
>>>>>> consider having top-level java, python, go, ... next to model,
>>>>>> website, etc.
>>>>>>
>>>>>
>>>>> Yes. We shouldn't work against common culture/practices, but try to
>>>>> embrace native tooling and add support where required.
>>>>>
>>>>> To reiterate on parallelisation, there are (at least) three
>>>>> opportunities:
>>>>>
>>>>> 1. Parallelise on test level. For python, this is detox?
>>>>>
>>>>
>>>> This is actually 2 levels. :)
>>>> 1a. Parallelise at the nosetest level - run unit tests in parallel in a
>>>> single tox environment. (I have a PR in progress to migrate to pytest, and
>>>> we should be able to do file-level parallelism provided we solve pickling
>>>> issues.)
>>>> 1b. Parallelise at the tox environment level, e..g, somehow running the
>>>> multiple tox environments (py27,py27-cython,py35,...) in parallel.
>>>>
>>>>
>>>>> 2. Parallelise on Gradle project level (--parallel option)
>>>>> 3. Parallelise on CI level
>>>>>
>>>>> So what I ve done before, if 1. does not help, nor 2. cause project is
>>>>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I
>>>>> could imagine right now would be - as suggested above - to split
>>>>> pythonPreCommit into something like pythonX_YPrecommit, which then runs
>>>>> those different python versions in parallel. Of course, that could be done
>>>>> ad infinitum by splitting further into runners, IOs whatever.
>>>>>
>>>>> OTOH, we do already have tons of jobs running as we seem to map Gradle
>>>>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI
>>>>> parallelization on jenkins pipeline level. Something like creating
>>>>> pythonPrecommit as a pipeline, which itself runs several steps in parallel.
>>>>> Did not contemplate on the impacts, though, on PRs and phrase triggering.
>>>>> Of course, we might need to improve on our 'verbosity' as currently it
>>>>> seems not always clear what command line actually triggered etc. With this
>>>>> approach, it seems possible to run tasks within same Gradle project in
>>>>> parallel with whatever granularity we might want to choose. (Of course we
>>>>> do add overhead here, but that's probably always the case by doing things
>>>>> in parallel)
>>>>>
>>>>> Of course 3. would not help on parallelisation on developers machine,
>>>>> but I never had a problem with that.
>>>>>
>>>>>
>>>>>>
>>>>>> > And as a last thought, will usage of pygradle help with better
>>>>>> python/gradle integration? Currently, we mainly use gradle to call into
>>>>>> shell scripts, which doesn't help gradle nor probably pythons tooling to do
>>>>>> the job very well? But deeper integration might cause problems on python
>>>>>> dev side, dunno :(
>>>>>>
>>>>>> Possibly.
>>>>>>
>>>>>> Are there any Python developers that primarily use the gradle
>>>>>> commands? Personally, I only use them if I'm using Java (or sometimes
>>>>>> work that is a mix of Java and Python, e.g. the Python-on-Flink
>>>>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]"
>>>>>> directly. Gradle primarily has value as a top-level orchestration (in
>>>>>> particular for CI) and easy way for those who only touch Python
>>>>>> occasionally to run all the tests. If that's the case, optimizing our
>>>>>> gradle scripts for CI seems best.
>>>>>>
>>>>>
>>>>> My impression till today is, that nobody is actually using Gradle on
>>>>> daily python development. So I assume it is currently used on CI only. But
>>>>> I believe gradle could add value for devs also. We just need to improve our
>>>>> current integration. And to be very clear, gradle should not replace python
>>>>> tooling, but support usage in a way which looks familiar to any python dev.
>>>>> And probably render some shell scripting obsolete.
>>>>>
>>>>>
>>>>>>
>>>>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <ma...@apache.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Thank you Ahmet. Answer your questions below:
>>>>>> >>
>>>>>> >>>
>>>>>> >>> - Could you comment on what kind of parallelization we will gain
>>>>>> by this? In terms of real numbers, how would this affect build and test
>>>>>> times?
>>>>>> >>
>>>>>> >>
>>>>>> >> The proposal is based on Gradle parallel execution: "you can force
>>>>>> Gradle to execute tasks in parallel as long as those tasks are in different
>>>>>> projects". In Beam, project is declared per build.gradle file and
>>>>>> registered in settings.gradle. Tasks that are included in single Gradle
>>>>>> execution will run in parallel only if they are declared in separate
>>>>>> build.gradle files.
>>>>>> >>
>>>>>> >> An example of applying parallel is beam_PreCommit_Python job which
>>>>>> runs :pythonPreCommit task that contains tasks distributed in 4
>>>>>> build.gradle. The execution graph looks like
>>>>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline:
>>>>>> >>
>>>>>> >> Without this proposal, all tasks will run in sequential which can
>>>>>> be ~2x longer. If more py36 and py37 tests added in the future, things will
>>>>>> be even worse.
>>>>>> >>
>>>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>>>> quantify the improvement related to this?
>>>>>> >>
>>>>>> >>
>>>>>> >> The general code complexity of function/method/property may not
>>>>>> change here since we basically group tasks in a different way without
>>>>>> changing inside logic. I don't know if there is any tool to measure Gradle
>>>>>> build complexity. Would love to try if there is.
>>>>>> >>
>>>>>> >>>
>>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>>>> Just want to clarify this. In either case, would you need help?
>>>>>> >>
>>>>>> >>
>>>>>> >> Yes, I'd love to take on major refactor works. At the same time,
>>>>>> I'll create jira for each kind of tests (like flink/protable/hdfs tests) in
>>>>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone
>>>>>> interested to this work are welcome to contribute!
>>>>>> >>
>>>>>> >> Mark
>>>>>> >>
>>>>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <al...@google.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> This sounds good to me. Thank you for doing this. Few questions:
>>>>>> >>> - Could you comment on what kind of parallelization we will gain
>>>>>> by this? In terms of real numbers, how would this affect build and test
>>>>>> times?
>>>>>> >>> - I am guessing this will reduce complexity. Is it possible to
>>>>>> quantify the improvement related to this?
>>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on.
>>>>>> Just want to clarify this. In either case, would you need help?
>>>>>> >>>
>>>>>> >>> Thank you,
>>>>>> >>> Ahmet
>>>>>> >>>
>>>>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <ma...@apache.org>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi Python SDK Developers,
>>>>>> >>>>
>>>>>> >>>> You may notice that Gradle files changed a lot recently as
>>>>>> parallelization applied to Python tests and more python versions were
>>>>>> enabled in testing. There are tricks over the build scripts and tests are
>>>>>> grown naturally and distributed under sdks/python, which caused frictions
>>>>>> (like rollback PR-8059).
>>>>>> >>>>
>>>>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works
>>>>>> to cleanup and standardize Gradle structure in Python SDK. In general, I
>>>>>> think we want to:
>>>>>> >>>>
>>>>>> >>>> - Apply parallel execution
>>>>>> >>>> - Share common tasks
>>>>>> >>>> - Centralize test related tasks
>>>>>> >>>> - Have a clear Gradle structure for projects/tasks
>>>>>> >>>>
>>>>>> >>>> This is Gradle directory structure I proposed:
>>>>>> >>>>
>>>>>> >>>> sdks/python/
>>>>>> >>>>
>>>>>> >>>> build.gradle    --> hold builds, snapshot, analytic tasks
>>>>>> >>>> test-suites/    --> all pre/post/VR test suites under here
>>>>>> >>>>
>>>>>> >>>> README.md
>>>>>> >>>>
>>>>>> >>>> dataflow/    --> grouped by runner or unit test (tox)
>>>>>> >>>>
>>>>>> >>>> py27/    --> grouped by py version
>>>>>> >>>>
>>>>>> >>>> build.gradle
>>>>>> >>>>
>>>>>> >>>> py35/
>>>>>> >>>>
>>>>>> >>>> ...
>>>>>> >>>>
>>>>>> >>>> direct/
>>>>>> >>>>
>>>>>> >>>> py27/
>>>>>> >>>>
>>>>>> >>>> ...
>>>>>> >>>>
>>>>>> >>>> flink/
>>>>>> >>>>
>>>>>> >>>> tox/
>>>>>> >>>> ...
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> The ideas are:
>>>>>> >>>> - Only keep builds, snapshot and analytic jobs in
>>>>>> sdks/python/build.gradle
>>>>>> >>>> - Move all test related tasks to sdks/python/test-suites/
>>>>>> >>>> - In sdks/python/test-suites, we first group by runners, unit
>>>>>> test or other testing that can't fit to them, and then group by py versions
>>>>>> if needed.
>>>>>> >>>> - An example of ../test-suites/../py35/build.gradle is this.
>>>>>> >>>>
>>>>>> >>>> Please feel free to explore existing Gradle scripts in Python
>>>>>> SDK and bring any thoughts on this proposal if you have.
>>>>>> >>>>
>>>>>> >>>> Thanks!
>>>>>> >>>> Mark
>>>>>
>>>>> [image: https://ml6.eu]
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ml6.eu_&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=yd_him24QhfROm7uRZLbfSsUHaA68_8FMl6s1MgT5sM&e=>
>>
>>
>> * Frederik Bode*
>>
>> ML6 Ghent
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.be_maps_place_ML6_-4051.037408-2C3.7044893-2C17z_data-3D-213m1-214b1-214m5-213m4-211s0x47c37161feeca14b-3A0xb8f72585fdd21c90-218m2-213d51.037408-214d3.706678-3Fhl-3Dnl&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=26TZxPGXg0A_mqgeiw1lMeZYekpkExBAZ5MpavpUZmw&e=>
>> +32 4 92 78 96 18
>>
>>
>> **** DISCLAIMER ****
>>
>> This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they are
>> addressed. If you have received this email in error please notify the
>> system manager. This message contains confidential information and is
>> intended only for the individual named. If you are not the named addressee
>> you should not disseminate, distribute or copy this e-mail. Please notify
>> the sender immediately by e-mail ifyou have received this e-mail by mistake
>> and delete this e-mail from your system. If you are not the intended
>> recipient you are notified that disclosing, copying, distributing or taking
>> any action in reliance on the contents of this information is strictly
>> prohibited.
>>
>