Posted to dev@beam.apache.org by Elizaveta Lomteva <el...@akvelon.com> on 2022/07/26 18:24:18 UTC

[CDAP IO] SparkReceiverIO integration testing

Hi, community!
Our team has prepared the SparkReceiverIO Read via SDF PR [1]. We have started working on integration tests for the SparkReceiverIO connector, which will allow reading data from custom Spark Receivers in an Apache Beam pipeline.

A general Apache Beam recommendation is to implement “write then read” style integration tests. In our case, however, only the Read interface was implemented, because Spark Receivers cannot be used for writing.

Since SparkReceiverIO is an abstract IO that works with Spark Receivers, there is no concrete implementation for a particular source. Therefore, we propose choosing RabbitMQ as the test source for the following reasons:

  *   It’s possible to implement a custom Spark Receiver on top of RabbitMQ as a test streaming receiver (see the sketch after this list)
  *   RabbitMQ is lightweight and easy to deploy
  *   There is a test container for RabbitMQ
  *   It’s possible to generate as much test input to RabbitMQ as we need
  *   Apache Beam has a RabbitMQ IO [2] that could hypothetically be used in the “write” step of the test
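
To give an idea of the first point, a minimal sketch of such a test receiver could look like the following. Only the standard Spark Streaming Receiver API and the plain RabbitMQ Java client are assumed; the class name, queue handling, and threading details are made up for illustration:

import java.nio.charset.StandardCharsets;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

// Hypothetical test receiver: consumes string messages from a RabbitMQ queue
// and hands each one to Spark via store().
public class RabbitMqTestReceiver extends Receiver<String> {

  private final String amqpUri;
  private final String queue;
  private transient Connection connection;

  public RabbitMqTestReceiver(String amqpUri, String queue) {
    super(StorageLevel.MEMORY_AND_DISK_2());
    this.amqpUri = amqpUri;
    this.queue = queue;
  }

  @Override
  public void onStart() {
    // Consume on a separate thread so onStart() returns quickly,
    // as the Receiver contract requires.
    new Thread(this::consume, "rabbitmq-test-receiver").start();
  }

  private void consume() {
    try {
      ConnectionFactory factory = new ConnectionFactory();
      factory.setUri(amqpUri);
      connection = factory.newConnection();
      Channel channel = connection.createChannel();
      DeliverCallback onDelivery =
          (consumerTag, delivery) ->
              store(new String(delivery.getBody(), StandardCharsets.UTF_8));
      channel.basicConsume(queue, true, onDelivery, consumerTag -> {});
    } catch (Exception e) {
      // Let Spark restart the receiver if the broker is not reachable yet.
      restart("Error connecting to RabbitMQ", e);
    }
  }

  @Override
  public void onStop() {
    try {
      if (connection != null) {
        connection.close();
      }
    } catch (Exception e) {
      // Ignore errors on shutdown.
    }
  }
}

Only store() and the restart logic really matter for the test; the rest is standard Receiver boilerplate.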

Cons of this choice are:

  *   We would need a RabbitMQ test container and additional Kubernetes configuration in ./test-infra
  *   RabbitMQ’s peak throughput is lower than Kafka’s, for example [3]


Based on this, two questions arise:

  1.  Are there any restrictions when choosing a test source? Can we use RabbitMQ in our case?

  2.  If RabbitMQ is suitable for our purposes, can we use the RabbitMQ IO to write data in the integration test “write” step, or should we use the RabbitMQ API directly, without adding a dependency on the Apache Beam RabbitMQ IO?
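
For reference, the second option (writing test data directly with the RabbitMQ Java client against a broker started by the Testcontainers RabbitMQ module) could look roughly like this sketch. The image tag, queue name, and record contents are placeholders:

import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import org.testcontainers.containers.RabbitMQContainer;
import org.testcontainers.utility.DockerImageName;

// Hypothetical "write" step done directly with the RabbitMQ Java client,
// against a broker started by the Testcontainers RabbitMQ module.
public class RabbitMqTestDataWriter {

  public static void main(String[] args) throws Exception {
    try (RabbitMQContainer rabbit =
        new RabbitMQContainer(DockerImageName.parse("rabbitmq:3.9-management"))) {
      rabbit.start();

      ConnectionFactory factory = new ConnectionFactory();
      factory.setUri(rabbit.getAmqpUrl());
      factory.setUsername(rabbit.getAdminUsername());
      factory.setPassword(rabbit.getAdminPassword());

      // Publish a known set of records that the SparkReceiverIO read test can assert on.
      try (Connection connection = factory.newConnection();
          Channel channel = connection.createChannel()) {
        channel.queueDeclare("TEST_QUEUE", false, false, false, null);
        for (int i = 0; i < 100; i++) {
          channel.basicPublish(
              "", "TEST_QUEUE", null, ("record-" + i).getBytes(StandardCharsets.UTF_8));
        }
      }
    }
  }
}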


Any ideas or comments would be greatly appreciated!


Thank you in advance,

Elizaveta


[1] [BEAM-14378] [CdapIO] SparkReceiverIO Read via SDF #17828 – https://github.com/apache/beam/pull/17828

[2] Apache Beam RabbitMQ IO – https://github.com/apache/beam/tree/master/sdks/java/io/rabbitmq

[3] Benchmarking Apache Kafka, RabbitMQ article (2020) – https://www.confluent.io/blog/kafka-fastest-messaging-system/



Re: [CDAP IO] SparkReceiverIO integration testing

Posted by Chamikara Jayalath via dev <de...@beam.apache.org>.
On Tue, Jul 26, 2022 at 11:24 AM Elizaveta Lomteva <
elizaveta.lomteva@akvelon.com> wrote:

> Hi, community!
> Our team has prepared the SparkReceiverIO Read via SDF PR [1]. We have
> started working on integration tests for the SparkReceiverIO connector,
> which will allow reading data from custom Spark Receivers in an Apache
> Beam pipeline.
>
> A general Apache Beam recommendation is to implement “write then read”
> style integration tests. In our case, however, only the Read interface was
> implemented, because Spark Receivers cannot be used for writing.
>
> Since SparkReceiverIO is an abstract IO that works with Spark Receivers,
> there is no concrete implementation for a particular source. Therefore, we
> propose choosing RabbitMQ as the test source for the following reasons:
>
>    - It’s possible to implement a custom Spark Receiver on top of RabbitMQ
>    as a test streaming receiver
>    - RabbitMQ is lightweight and easy to deploy
>    - There is a test container for RabbitMQ
>    - It’s possible to generate as much test input to RabbitMQ as we need
>    - Apache Beam has a RabbitMQ IO [2] that could hypothetically be used
>    in the “write” step of the test
>
> Cons of this choice are:
>
>    - We would need a RabbitMQ test container and additional Kubernetes
>    configuration in ./test-infra
>    - RabbitMQ’s peak throughput is lower than Kafka’s, for example [3]
>
>
> Based on this, two questions arise:
>
>    1. Are there any restrictions when choosing a test source? Can we use
>    RabbitMQ in our case?
>
>
I think the main requirement is that we want to test SparkReceiverIO in a
way that is similar to the way it would be used by actual end users. So if a
RabbitMQ-based receiver is a good representative of a typical Spark
Receiver, this should be fine.



>
>    2. If RabbitMQ is suitable for our purposes, can we use the RabbitMQ IO
>    to write data in the integration test “write” step, or should we use the
>    RabbitMQ API directly, without adding a dependency on the Apache Beam
>    RabbitMQ IO?
>
>

I would use RabbitMqIO and implement a write-then-read type test, assuming
we can develop a non-flaky test that uses both connectors. If you run into
flakes, I think just developing a test for the source is fine.
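
For concreteness, such a write-then-read test could take roughly the following shape. This is only a sketch: the RabbitMqIO write step follows the pattern in its documentation (assuming RabbitMqMessage can be built from a raw body and Write supports withUri/withQueue), while the SparkReceiverIO read configuration and the RabbitMqTestReceiver class are hypothetical placeholders for whatever the PR finally exposes:

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.beam.sdk.io.rabbitmq.RabbitMqIO;
import org.apache.beam.sdk.io.rabbitmq.RabbitMqMessage;
import org.apache.beam.sdk.io.sparkreceiver.SparkReceiverIO; // package/name as proposed in the PR
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class SparkReceiverIOReadIT {

  // Separate pipelines for the "write" and "read" halves of the test.
  @Rule public final transient TestPipeline writePipeline = TestPipeline.create();
  @Rule public final transient TestPipeline readPipeline = TestPipeline.create();

  @Test
  public void testWriteThenRead() {
    String amqpUri = "amqp://guest:guest@localhost:5672"; // e.g. taken from a RabbitMQ test container
    List<RabbitMqMessage> records =
        IntStream.range(0, 100)
            .mapToObj(i -> new RabbitMqMessage(("record-" + i).getBytes(StandardCharsets.UTF_8)))
            .collect(Collectors.toList());

    // "Write" step: publish a known set of records with RabbitMqIO.
    writePipeline
        .apply(Create.of(records))
        .apply(RabbitMqIO.write().withUri(amqpUri).withQueue("TEST_QUEUE"));
    writePipeline.run().waitUntilFinish();

    // "Read" step: consume the same queue through SparkReceiverIO with a
    // RabbitMQ-based test receiver. The builder methods below are hypothetical.
    PCollection<String> readRecords =
        readPipeline.apply(
            SparkReceiverIO.<String>read()
                .withSparkReceiverClass(RabbitMqTestReceiver.class) // hypothetical
                .withValueClass(String.class));                     // hypothetical
    PAssert.thatSingleton(readRecords.apply(Count.globally())).isEqualTo(100L);
    readPipeline.run().waitUntilFinish();
  }
}

In practice the streaming read side would also have to be bounded somehow (by record count or time) so the assertion can complete, which is exactly where flakiness tends to show up.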

Thanks,
Cham


>
> Any ideas or comments would be greatly appreciated!
>
> Thank you in advance,
>
> Elizaveta
>
> [1] [BEAM-14378] [CdapIO] SparkReceiverIO Read via SDF #17828 –
> https://github.com/apache/beam/pull/17828
>
> [2] Apache Beam RabbitMQ IO –
> https://github.com/apache/beam/tree/master/sdks/java/io/rabbitmq
> [3] Benchmarking Apache Kafka, RabbitMQ article (2020) –
> https://www.confluent.io/blog/kafka-fastest-messaging-system/
>
>
>
>