Posted to dev@beam.apache.org by Elizaveta Lomteva <el...@akvelon.com> on 2022/06/01 18:35:06 UTC

Re: [EXTERNAL] Re: SDF to read from a Spark custom receiver

Thank you for your help, we will try to implement offset handling in the Hubspot custom receiver using the interface implementation and start the receivers from a specific offset.
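
For context, a rough sketch of the direction we are considering (the
HasOffset interface shape and the fetchRecordsFrom() helper below are
illustrative assumptions, not existing Spark or Beam APIs):

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Illustrative interface a receiver could implement so that a reader can
// position it at a specific offset; the name and methods are assumptions.
interface HasOffset {
  void setStartOffset(Long startOffset);
  Long getEndOffset();
}

// Minimal sketch of an offset-aware receiver, assuming the Hubspot API can
// be paged from a given offset; fetchRecordsFrom() is a hypothetical helper.
class OffsetHubspotReceiver extends Receiver<String> implements HasOffset {
  private Long startOffset = 0L;
  private volatile Long currentOffset = 0L;

  OffsetHubspotReceiver() {
    super(StorageLevel.MEMORY_AND_DISK_2());
  }

  @Override
  public void setStartOffset(Long startOffset) {
    this.startOffset = startOffset == null ? 0L : startOffset;
  }

  @Override
  public Long getEndOffset() {
    return currentOffset;
  }

  @Override
  public void onStart() {
    // onStart() must not block, so the read loop runs on its own thread.
    new Thread(this::receive).start();
  }

  @Override
  public void onStop() {
    // The receive loop checks isStopped(), so nothing extra to do here.
  }

  private void receive() {
    currentOffset = startOffset;
    while (!isStopped()) {
      // Hypothetical call against the paginated Hubspot endpoint.
      for (String record : fetchRecordsFrom(currentOffset)) {
        store(record);
        currentOffset++;
      }
    }
  }

  private java.util.List<String> fetchRecordsFrom(Long offset) {
    // ... fetch records starting at `offset` from the Hubspot API ...
    return java.util.Collections.emptyList();
  }
}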

________________________________
From: Robert Bradshaw <ro...@google.com>
Sent: Tuesday, May 31, 2022 9:17:29 PM
To: dev
Subject: [EXTERNAL] Re: SDF to read from a Spark custom receiver

Though the Spark custom receiver protocol doesn't look rich enough to
support SDFs, it does look like the Hubspot APIs do support pagination,
which could be used to build an SDF (using the page as the offset)
directly. That being said, it doesn't look like they support reading
in parallel, so you would only have one "split", and you might want to
follow the read with a Reshuffle to decouple the downstream
parallelism from the more constrained Read.
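
For example, a page-based SDF might look roughly like the following
sketch (fetchPage() stands in for a call against the paginated Hubspot
endpoint and is not a real API):

import java.util.Collections;
import java.util.List;

import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.joda.time.Duration;

// Unbounded SDF that uses the page number as the offset.
@DoFn.UnboundedPerElement
class ReadHubspotPagesFn extends DoFn<String, String> {

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String unused) {
    // Start from page 0; the upper bound is effectively unbounded.
    return new OffsetRange(0, Long.MAX_VALUE);
  }

  @NewTracker
  public OffsetRangeTracker newTracker(@Restriction OffsetRange range) {
    return new OffsetRangeTracker(range);
  }

  @ProcessElement
  public ProcessContinuation processElement(
      RestrictionTracker<OffsetRange, Long> tracker, OutputReceiver<String> out) {
    long page = tracker.currentRestriction().getFrom();
    while (true) {
      List<String> records = fetchPage(page); // hypothetical Hubspot call
      if (records.isEmpty()) {
        // No new page yet; check back later instead of finishing.
        return ProcessContinuation.resume()
            .withResumeDelay(Duration.standardSeconds(10));
      }
      if (!tracker.tryClaim(page)) {
        return ProcessContinuation.stop();
      }
      records.forEach(out::output);
      page++;
    }
  }

  private List<String> fetchPage(long page) {
    // ... fetch one page of results from the Hubspot API ...
    return Collections.emptyList();
  }
}

The read could then be followed by Reshuffle.viaRandomKey() so that
downstream stages aren't limited to the single split.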

On Mon, May 30, 2022 at 9:21 AM Elizaveta Lomteva
<el...@akvelon.com> wrote:
>
> Hi, Beam community!
>
> We are working on an SDF to read from an unbounded data source that is a Spark streaming custom receiver [1]. The source Spark custom receiver [2] does not offer offset support. This introduces a constraint for the Splittable DoFn approach, because it won’t be able to read from multiple receivers on a worker – they will all read the same data.
>
>
> What are the recommended practices for implementing reading via SplittableDoFn when the streaming source doesn’t work with offsets? Could someone please share some thoughts or recommendations on this?
>
>
> Thank you,
> Elizaveta
>
>
> [1] Spark Streaming Receiver – https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>
> [2] HubspotReceiver – https://github.com/data-integrations/hubspot/blob/develop/src/main/java/io/cdap/plugin/hubspot/source/streaming/HubspotReceiver.java