You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Michał Walenia <mi...@polidea.com> on 2019/07/12 09:16:45 UTC

Circular dependencies between DataflowRunner and google cloud IO

Hi all,
recently when I was trying to implement a performance test of BigQueryIO, I
ran into an issue when trying to run the test on Dataflow.
The problem was that I encountered a circular dependency when compiling the
tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package, so
I also needed to add DataflowRunner as a dependency in order to launch the
test. The error was that DataflowRunner package depends on
org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]). Should it
be like that?
For now, in order to solve the problem, I intend to move the performance
test to its own package in my PR [2] I am wondering about the right
approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
If not, what’s the reason behind the way the modules are organized?
I noticed that 5 tests are excluded from the integrationTest task in
io.google-cloud-platform.bigquery build.gradle file [3]. Are they launched
on Dataflow anywhere? I couldn’t find their usage except for the exclusions.

[1] PubSubIO translations section in DataflowRunner.java
<https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1104>
[2] My PR <https://github.com/apache/beam/pull/9041>
[3] DefaultCoderCloudObjectTranslatorRegistrar
<https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DefaultCoderCloudObjectTranslatorRegistrar.java#L45>

Best regards
Michal

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.walenia@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: Circular dependencies between DataflowRunner and google cloud IO

Posted by Michał Walenia <mi...@polidea.com>.
By separate package, I mean a distinct Gradle module with a separate set of
dependencies.
I know about the movement away from Perfkit and I plan to create Jenkins
jobs utilizing only Gradle tasks for the tests.
Lukasz, should I add you as a reviewer for the pull request?

Thank you for the resources.
Michal

On Mon, Jul 15, 2019 at 3:50 PM Lukasz Cwik <lc...@google.com> wrote:

> When do you mean by separate package?
>
> Most of our perf tests are integrated with PerfkitBenchmarker such as our
> JDBC benchmark[1] and Python BigQuery benchmark[2] and point to a specific
> IT class/method that contains the benchmarks.
>
> Note that there was a recent discussion about moving away from
> PerfkitBenchmarker though and it would be good to follow up with Lukasz
> Gajowy about it since he has done a bunch of the existing benchmarks[3].
>
> Finally, there were some docs about benchmarking on our design docs[4]
> page that may provide additional details[5, 6].
>
> 1:
> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
> 2:
> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_BigQueryIO_Python.groovy
> 3:
> https://lists.apache.org/thread.html/dab1c093799248787e8b75e63b66d7389b594b649a4d9a4a5db1cfbb@%3Cdev.beam.apache.org%3E
> 4: https://beam.apache.org/contribute/design-documents/
> 5:
> https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE
> 6:
> https://docs.google.com/document/d/1Cb7XVmqe__nA_WCrriAifL-3WCzbZzV4Am5W_SkQLeA
>
> On Mon, Jul 15, 2019 at 8:03 AM Michał Walenia <mi...@polidea.com>
> wrote:
>
>> Thanks for the information. I don't think that the task you mentioned
>> fits my case - I want to run a performance test on a real service and
>> running it with a battery of other tests doesn't make much sense to me.
>> It's similar to the case of other integration tests that are excluded
>> from the task you linked.
>> Is keeping the test in a separate package viable in your opinion?
>>
>> Thanks!
>> Michal
>>
>> On Fri, Jul 12, 2019 at 3:45 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Yes, there is a dependency between Dataflow -> GCP IOs and this is
>>> expected since Dataflow depends on parts of those implementations for its
>>> own execution purposes. We definitely don't want GCP IOs depending on
>>> Dataflow since we would like users of other runners to still be able to use
>>> GCP IOs without bringing in Dataflow specific dependencies.
>>>
>>> There is already a test definition inside of the Dataflow runner package
>>> that is meant to run integration tests defined in the GCP IO package named
>>> googleCloudPlatformLegacyWorkerIntegrationTest[1] task, does this fit your
>>> needs?
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318
>>>
>>> On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia <
>>> michal.walenia@polidea.com> wrote:
>>>
>>>> Hi all,
>>>> recently when I was trying to implement a performance test of
>>>> BigQueryIO, I ran into an issue when trying to run the test on Dataflow.
>>>> The problem was that I encountered a circular dependency when compiling
>>>> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
>>>> so I also needed to add DataflowRunner as a dependency in order to launch
>>>> the test. The error was that DataflowRunner package depends on
>>>> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]).
>>>> Should it be like that?
>>>> For now, in order to solve the problem, I intend to move the
>>>> performance test to its own package in my PR [2] I am wondering about the
>>>> right approach to this - shouldn’t we decouple the DataflowRunner code from
>>>> IOs? If not, what’s the reason behind the way the modules are organized?
>>>> I noticed that 5 tests are excluded from the integrationTest task in
>>>> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
>>>> launched on Dataflow anywhere? I couldn’t find their usage except for the
>>>> exclusions.
>>>>
>>>> [1] PubSubIO translations section in DataflowRunner.java
>>>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1104>
>>>> [2] My PR <https://github.com/apache/beam/pull/9041>
>>>> [3] DefaultCoderCloudObjectTranslatorRegistrar
>>>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DefaultCoderCloudObjectTranslatorRegistrar.java#L45>
>>>>
>>>> Best regards
>>>> Michal
>>>>
>>>> --
>>>>
>>>> Michał Walenia
>>>> Polidea <https://www.polidea.com/> | Software Engineer
>>>>
>>>> M: +48 791 432 002 <+48791432002>
>>>> E: michal.walenia@polidea.com
>>>>
>>>> Unique Tech
>>>> Check out our projects! <https://www.polidea.com/our-work>
>>>>
>>>
>>
>> --
>>
>> Michał Walenia
>> Polidea <https://www.polidea.com/> | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.walenia@polidea.com
>>
>> Unique Tech
>> Check out our projects! <https://www.polidea.com/our-work>
>>
>

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.walenia@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: Circular dependencies between DataflowRunner and google cloud IO

Posted by Lukasz Cwik <lc...@google.com>.
When do you mean by separate package?

Most of our perf tests are integrated with PerfkitBenchmarker such as our
JDBC benchmark[1] and Python BigQuery benchmark[2] and point to a specific
IT class/method that contains the benchmarks.

Note that there was a recent discussion about moving away from
PerfkitBenchmarker though and it would be good to follow up with Lukasz
Gajowy about it since he has done a bunch of the existing benchmarks[3].

Finally, there were some docs about benchmarking on our design docs[4] page
that may provide additional details[5, 6].

1:
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
2:
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_BigQueryIO_Python.groovy
3:
https://lists.apache.org/thread.html/dab1c093799248787e8b75e63b66d7389b594b649a4d9a4a5db1cfbb@%3Cdev.beam.apache.org%3E
4: https://beam.apache.org/contribute/design-documents/
5:
https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE
6:
https://docs.google.com/document/d/1Cb7XVmqe__nA_WCrriAifL-3WCzbZzV4Am5W_SkQLeA

On Mon, Jul 15, 2019 at 8:03 AM Michał Walenia <mi...@polidea.com>
wrote:

> Thanks for the information. I don't think that the task you mentioned fits
> my case - I want to run a performance test on a real service and running it
> with a battery of other tests doesn't make much sense to me.
> It's similar to the case of other integration tests that are excluded from
> the task you linked.
> Is keeping the test in a separate package viable in your opinion?
>
> Thanks!
> Michal
>
> On Fri, Jul 12, 2019 at 3:45 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> Yes, there is a dependency between Dataflow -> GCP IOs and this is
>> expected since Dataflow depends on parts of those implementations for its
>> own execution purposes. We definitely don't want GCP IOs depending on
>> Dataflow since we would like users of other runners to still be able to use
>> GCP IOs without bringing in Dataflow specific dependencies.
>>
>> There is already a test definition inside of the Dataflow runner package
>> that is meant to run integration tests defined in the GCP IO package named
>> googleCloudPlatformLegacyWorkerIntegrationTest[1] task, does this fit your
>> needs?
>>
>> 1:
>> https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318
>>
>> On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia <
>> michal.walenia@polidea.com> wrote:
>>
>>> Hi all,
>>> recently when I was trying to implement a performance test of
>>> BigQueryIO, I ran into an issue when trying to run the test on Dataflow.
>>> The problem was that I encountered a circular dependency when compiling
>>> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
>>> so I also needed to add DataflowRunner as a dependency in order to launch
>>> the test. The error was that DataflowRunner package depends on
>>> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]).
>>> Should it be like that?
>>> For now, in order to solve the problem, I intend to move the performance
>>> test to its own package in my PR [2] I am wondering about the right
>>> approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
>>> If not, what’s the reason behind the way the modules are organized?
>>> I noticed that 5 tests are excluded from the integrationTest task in
>>> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
>>> launched on Dataflow anywhere? I couldn’t find their usage except for the
>>> exclusions.
>>>
>>> [1] PubSubIO translations section in DataflowRunner.java
>>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1104>
>>> [2] My PR <https://github.com/apache/beam/pull/9041>
>>> [3] DefaultCoderCloudObjectTranslatorRegistrar
>>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DefaultCoderCloudObjectTranslatorRegistrar.java#L45>
>>>
>>> Best regards
>>> Michal
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea <https://www.polidea.com/> | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.walenia@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! <https://www.polidea.com/our-work>
>>>
>>
>
> --
>
> Michał Walenia
> Polidea <https://www.polidea.com/> | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.walenia@polidea.com
>
> Unique Tech
> Check out our projects! <https://www.polidea.com/our-work>
>

Re: Circular dependencies between DataflowRunner and google cloud IO

Posted by Michał Walenia <mi...@polidea.com>.
Thanks for the information. I don't think that the task you mentioned fits
my case - I want to run a performance test on a real service and running it
with a battery of other tests doesn't make much sense to me.
It's similar to the case of other integration tests that are excluded from
the task you linked.
Is keeping the test in a separate package viable in your opinion?

Thanks!
Michal

On Fri, Jul 12, 2019 at 3:45 PM Lukasz Cwik <lc...@google.com> wrote:

> Yes, there is a dependency between Dataflow -> GCP IOs and this is
> expected since Dataflow depends on parts of those implementations for its
> own execution purposes. We definitely don't want GCP IOs depending on
> Dataflow since we would like users of other runners to still be able to use
> GCP IOs without bringing in Dataflow specific dependencies.
>
> There is already a test definition inside of the Dataflow runner package
> that is meant to run integration tests defined in the GCP IO package named
> googleCloudPlatformLegacyWorkerIntegrationTest[1] task, does this fit your
> needs?
>
> 1:
> https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318
>
> On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia <mi...@polidea.com>
> wrote:
>
>> Hi all,
>> recently when I was trying to implement a performance test of BigQueryIO,
>> I ran into an issue when trying to run the test on Dataflow.
>> The problem was that I encountered a circular dependency when compiling
>> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
>> so I also needed to add DataflowRunner as a dependency in order to launch
>> the test. The error was that DataflowRunner package depends on
>> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]). Should
>> it be like that?
>> For now, in order to solve the problem, I intend to move the performance
>> test to its own package in my PR [2] I am wondering about the right
>> approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
>> If not, what’s the reason behind the way the modules are organized?
>> I noticed that 5 tests are excluded from the integrationTest task in
>> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
>> launched on Dataflow anywhere? I couldn’t find their usage except for the
>> exclusions.
>>
>> [1] PubSubIO translations section in DataflowRunner.java
>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1104>
>> [2] My PR <https://github.com/apache/beam/pull/9041>
>> [3] DefaultCoderCloudObjectTranslatorRegistrar
>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DefaultCoderCloudObjectTranslatorRegistrar.java#L45>
>>
>> Best regards
>> Michal
>>
>> --
>>
>> Michał Walenia
>> Polidea <https://www.polidea.com/> | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.walenia@polidea.com
>>
>> Unique Tech
>> Check out our projects! <https://www.polidea.com/our-work>
>>
>

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.walenia@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: Circular dependencies between DataflowRunner and google cloud IO

Posted by Lukasz Cwik <lc...@google.com>.
Yes, there is a dependency between Dataflow -> GCP IOs and this is expected
since Dataflow depends on parts of those implementations for its own
execution purposes. We definitely don't want GCP IOs depending on Dataflow
since we would like users of other runners to still be able to use GCP IOs
without bringing in Dataflow specific dependencies.

There is already a test definition inside of the Dataflow runner package
that is meant to run integration tests defined in the GCP IO package named
googleCloudPlatformLegacyWorkerIntegrationTest[1] task, does this fit your
needs?

1:
https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318

On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia <mi...@polidea.com>
wrote:

> Hi all,
> recently when I was trying to implement a performance test of BigQueryIO,
> I ran into an issue when trying to run the test on Dataflow.
> The problem was that I encountered a circular dependency when compiling
> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
> so I also needed to add DataflowRunner as a dependency in order to launch
> the test. The error was that DataflowRunner package depends on
> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]). Should
> it be like that?
> For now, in order to solve the problem, I intend to move the performance
> test to its own package in my PR [2] I am wondering about the right
> approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
> If not, what’s the reason behind the way the modules are organized?
> I noticed that 5 tests are excluded from the integrationTest task in
> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
> launched on Dataflow anywhere? I couldn’t find their usage except for the
> exclusions.
>
> [1] PubSubIO translations section in DataflowRunner.java
> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1104>
> [2] My PR <https://github.com/apache/beam/pull/9041>
> [3] DefaultCoderCloudObjectTranslatorRegistrar
> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DefaultCoderCloudObjectTranslatorRegistrar.java#L45>
>
> Best regards
> Michal
>
> --
>
> Michał Walenia
> Polidea <https://www.polidea.com/> | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.walenia@polidea.com
>
> Unique Tech
> Check out our projects! <https://www.polidea.com/our-work>
>