Posted to dev@beam.apache.org by Daniel Collins <dp...@google.com> on 2020/12/01 00:35:45 UTC

Any interest in sharding targets?

Hello all,

Any time I have the misfortune of creating a new beam branch, building a
subtarget (sdks/io/google-cloud-platform/.../pubsublite in my case) takes
O(30 mins) on my laptop. A lot of the steps seem to block on each other and
even the leaf rebuild can take minutes since all the GCP I/O transforms are
in one target. A couple of questions for the (hopefully?) gradle experts
here:

1) Do you think that sharding these targets would increase parallelism in
the underlying build?
2) Do you think doing so would have any knock-on negative effects, either
for compilation time or development speed?
3) Do you think this would be an hours, days or weeks time investment to do?

The above implicitly comes with "willing to help out O(hours/days), but no
gradle knowledge so I would need some guidance".

-Dan

Re: Any interest in sharding targets?

Posted by Kenneth Knowles <ke...@apache.org>.
On Tue, Dec 1, 2020 at 9:21 AM Daniel Collins <dp...@google.com> wrote:

> > High-level: ensure you have gradle cache enabled so only the first build
> > is slow. If you encounter nondeterministic or noncached targets upstream of
> > the module you are editing, that's worth discussing and probably fixing.
>
> I do have caching enabled (I do not rebuild non-gcp targets every time)
>

I'm not sure this answers the question, actually. With caching enabled,
you will not get rebuilds even after a `./gradlew clean` if the inputs did
not change. The difference is between the UP-TO-DATE (does not require
caching enabled) and FROM-CACHE statuses. But even the UP-TO-DATE check
should be enough to make the single-module rebuild the bottleneck.
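
One way to see which of the two statuses applies is to tally the outcome
label Gradle prints next to each task. The following is a minimal sketch,
not part of Beam's tooling: `tally_outcomes` is a hypothetical helper, and
it assumes the log came from a run with `--console=plain`, where each task
appears as a `> Task <path> <OUTCOME>` line and tasks that actually ran
carry no label.

```shell
# Hypothetical helper: tally Gradle task outcomes from plain console output on stdin.
# Assumes lines of the form "> Task :path OUTCOME"; a missing OUTCOME means the task ran.
tally_outcomes() {
  awk '/^> Task / {
         outcome = (NF >= 4) ? $4 : "EXECUTED"
         count[outcome]++
       }
       END { for (o in count) print o, count[o] }' | sort
}
```

From the root of a Beam checkout, usage would look something like
`./gradlew :sdks:java:io:google-cloud-platform:compileJava --build-cache
--console=plain | tally_outcomes`. A tally dominated by UP-TO-DATE or
FROM-CACHE suggests the upstream modules are not being rebuilt.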

> Can you share the exact gradle command?
>
> Not sure on gradle syntax, I'm running from intellij. But I think it is
> "gradle build --scan <top level>:sdks:java:io:google-cloud-platform" for
> the most recent run, will attach a scan soon.
>

Noting that `build` is a poorly chosen name and actually means "compile,
package, and test". So maybe a more narrowly defined command will be
useful. I also use IntelliJ. I tried for a minute to figure out what
command it was running, but I could not find it :-/
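
To make the distinction concrete, here is a sketch of narrower
invocations. It assumes the commands run from the root of a Beam checkout
and that the module exposes the standard Java plugin tasks; `GRADLEW`,
`PROJECT`, and the function names are placeholders introduced for
illustration.

```shell
# Assumptions: run from the root of a Beam checkout; GRADLEW and PROJECT are placeholders.
GRADLEW="${GRADLEW:-./gradlew}"
PROJECT=":sdks:java:io:google-cloud-platform"

# List the tasks that `build` would run for the module, without executing them.
list_build_tasks() {
  "$GRADLEW" "${PROJECT}:build" --dry-run
}

# Compile main and test sources only: no packaging and no test execution.
compile_only() {
  "$GRADLEW" "${PROJECT}:compileJava" "${PROJECT}:compileTestJava"
}
```

Running `list_build_tasks` first shows how much of the work behind `build`
is testing and packaging rather than compilation.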

> But most things aren't rebuilt anyhow.
>
> I'd wonder how much core being a monolithic target affects presubmit
> times. Wouldn't these all have to be rebuilt on every new build, or is
> there caching there as well?
>

Yes, everything depending on core impacts build times. It isn't all that
monolithic TBH. Are you modifying core? If not, it should only change when
you start a new branch or rebase. The gradle cache is key to making branch
changes not invalidate your build results.


> But primarily core build times just add latency to deciding to work on
> beam -> starting work on beam, you're right. It's the google-cloud-platform
> target whose size impedes my workflow the most.
>
> > That's going to be a separate issue from wanting to build a single part
> > of the GCP IO package without building the rest of the package
>
> It sounds like you'd be open to splitting up this target? Or am I reading
> the rest of your comment incorrectly?
>

Unfortunately I think our backwards-compatibility policies prohibit such a
thing. Otherwise, yes, I am in complete agreement. There may be a clever
way to only rebuild part of it during development.

So if the project you were to take on were a more complex build for the GCP
IO module that separately compiled the different IOs against their deps,
but brought the classes back together to assemble the final product... I
have no philosophical objections but there would be some risk of causing
accidental trouble. It isn't as straightforward as it would be with bazel.
I'd honestly try most anything else to just power through it.

Kenn



Re: Any interest in sharding targets?

Posted by Daniel Collins <dp...@google.com>.
> High-level: ensure you have gradle cache enabled so only the first build
> is slow. If you encounter nondeterministic or noncached targets upstream of
> the module you are editing, that's worth discussing and probably fixing.

I do have caching enabled (I do not rebuild non-gcp targets every time)

> Can you share the exact gradle command?

Not sure on gradle syntax, I'm running from intellij. But I think it is
"gradle build --scan <top level>:sdks:java:io:google-cloud-platform" for
the most recent run, will attach a scan soon.

> from clean (8m)

Sounds like you have a really strong desktop :)

> But most things aren't rebuilt anyhow.

I'd wonder how much core being a monolithic target affects presubmit times.
Wouldn't these all have to be rebuilt on every new build, or is there
caching there as well?

But primarily core build times just add latency to deciding to work on
beam -> starting work on beam, you're right. It's the google-cloud-platform
target whose size impedes my workflow the most.

> That's going to be a separate issue from wanting to build a single part
> of the GCP IO package without building the rest of the package

It sounds like you'd be open to splitting up this target? Or am I reading
the rest of your comment incorrectly?


On Tue, Dec 1, 2020 at 10:28 AM Kenneth Knowles <ke...@apache.org> wrote:

> High-level: ensure you have gradle cache enabled so only the first build
> is slow. If you encounter nondeterministic or noncached targets upstream of
> the module you are editing, that's worth discussing and probably fixing.
>
> That's going to be a separate issue from wanting to build a single part of
> the GCP IO package without building the rest of the package. Details and
> questions below.
>
> On Mon, Nov 30, 2020 at 4:36 PM Daniel Collins <dp...@google.com>
> wrote:
>
>> Hello all,
>>
>> Any time I have the misfortune of creating a new beam branch, building a
>> subtarget (sdks/io/google-cloud-platform/.../pubsublite in my case) takes
>> O(30 mins) on my laptop.
>>
>
> Can you share the exact gradle command?
>
>
>> A lot of the steps seem to block on each other and even the leaf rebuild
>> can take minutes since all the GCP I/O transforms are in one target. A
>> couple of questions for the (hopefully?) gradle experts here:
>>
>> 1) Do you think that sharding these targets would increase parallelism in
>> the underlying build?
>>
>
> I'd start with --scan so you can see some details and share it with others
> easily. I'm not sure if --profile gives even finer-grained telemetry.
>
> To demonstrate, here are two scans of `./gradlew
> :sdks:java:io:google-cloud-platform:compileTestJava`:
>
>  - from clean (8m): https://scans.gradle.com/s/j5jtqywn3uw4o/timeline
>  - after modifying a file in the module (1m):
> https://gradle.com/s/g74hsjddl6x5g/timeline
>
> These are certainly slow, and there are decidedly nonideal bits in the dep
> graph (most of the execution-oriented bits should not be needed to just
> *compile* the tests). But most things aren't rebuilt anyhow.
>
>> 2) Do you think doing so would have any knock-on negative effects, either
>> for compilation time or development speed?
>>
>
> The answer is always "avoid rebuilding" so smaller seems better. I'm not
> totally clear how much is to be gained in this case.
>
> The other answer is -PskipCheckerFramework which will net you a 4x speedup
> in Java compile time, at the cost of you probably having to rewrite your
> code once you un-disable it and discover you've got a bunch of lurking NPEs.
>
> Kenn
>
>
>> 3) Do you think this would be an hours, days or weeks time investment to
>> do?
>>
>
>
>>
>> The above implicitly comes with "willing to help out O(hours/days), but
>> no gradle knowledge so I would need some guidance".
>>
>> -Dan
>>
>

Re: Any interest in sharding targets?

Posted by Kenneth Knowles <ke...@apache.org>.
High-level: ensure you have gradle cache enabled so only the first build is
slow. If you encounter nondeterministic or noncached targets upstream of
the module you are editing, that's worth discussing and probably fixing.

That's going to be a separate issue from wanting to build a single part of
the GCP IO package without building the rest of the package. Details and
questions below.

On Mon, Nov 30, 2020 at 4:36 PM Daniel Collins <dp...@google.com> wrote:

> Hello all,
>
> Any time I have the misfortune of creating a new beam branch, building a
> subtarget (sdks/io/google-cloud-platform/.../pubsublite in my case) takes
> O(30 mins) on my laptop.
>

Can you share the exact gradle command?


> A lot of the steps seem to block on each other and even the leaf rebuild
> can take minutes since all the GCP I/O transforms are in one target. A
> couple of questions for the (hopefully?) gradle experts here:
>
> 1) Do you think that sharding these targets would increase parallelism in
> the underlying build?
>

I'd start with --scan so you can see some details and share it with others
easily. I'm not sure if --profile gives even finer-grained telemetry.

To demonstrate, here are two scans of `./gradlew
:sdks:java:io:google-cloud-platform:compileTestJava`:

 - from clean (8m): https://scans.gradle.com/s/j5jtqywn3uw4o/timeline
 - after modifying a file in the module (1m):
https://gradle.com/s/g74hsjddl6x5g/timeline

These are certainly slow, and there are decidedly nonideal bits in the dep
graph (most of the execution-oriented bits should not be needed to just
*compile* the tests). But most things aren't rebuilt anyhow.
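
For anyone wanting to reproduce this kind of measurement, a minimal
sketch follows. It assumes a Beam checkout as the working directory;
`GRADLEW`, `TASK`, and the function names are placeholders, while
`--scan` and `--profile` are the Gradle flags mentioned above.

```shell
# Assumptions: run from the root of a Beam checkout; GRADLEW and TASK are placeholders.
GRADLEW="${GRADLEW:-./gradlew}"
TASK=":sdks:java:io:google-cloud-platform:compileTestJava"

# Publish a build scan; Gradle prints a shareable link once the build finishes.
scan_build() {
  "$GRADLEW" "$TASK" --scan
}

# Write a local HTML timing report instead of publishing anything.
profile_build() {
  "$GRADLEW" "$TASK" --profile
}
```

The `--profile` report is written under `build/reports/profile` in the
root project, which may be the more convenient option when a scan cannot
be shared.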

> 2) Do you think doing so would have any knock-on negative effects, either
> for compilation time or development speed?
>

The answer is always "avoid rebuilding" so smaller seems better. I'm not
totally clear how much is to be gained in this case.

The other answer is -PskipCheckerFramework which will net you a 4x speedup
in Java compile time, at the cost of you probably having to rewrite your
code once you un-disable it and discover you've got a bunch of lurking NPEs.
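
As a sketch of how that trade-off might be handled in practice: use the
fast path while iterating, then compile once more with the checker on
before sending the change for review. `GRADLEW`, `TASK`, and the
function names are placeholders, and the two-step workflow itself is
only a suggestion.

```shell
# Assumptions: run from the root of a Beam checkout; GRADLEW and TASK are placeholders.
GRADLEW="${GRADLEW:-./gradlew}"
TASK=":sdks:java:io:google-cloud-platform:compileTestJava"

# Fast inner loop: compile with the Checker Framework disabled.
fast_compile() {
  "$GRADLEW" "$TASK" -PskipCheckerFramework
}

# Before review: compile again with the checker enabled to surface nullness errors.
checked_compile() {
  "$GRADLEW" "$TASK"
}
```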

Kenn


> 3) Do you think this would be an hours, days or weeks time investment to
> do?
>


>
> The above implicitly comes with "willing to help out O(hours/days), but no
> gradle knowledge so I would need some guidance".
>
> -Dan
>