Posted to github@beam.apache.org by "lostluck (via GitHub)" <gi...@apache.org> on 2023/04/26 16:00:31 UTC

[GitHub] [beam] lostluck commented on issue #25212: [Feature Request]: Support streaming (pubsub) in RedisIO connector

lostluck commented on issue #25212:
URL: https://github.com/apache/beam/issues/25212#issuecomment-1523668920

   Hi @shhivam ! Great question! I've been meaning to write a more direct tutorial about this, but lately I've been focused on the Go Direct Runner replacement, [Prism](https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism), and on helping get Timers into the SDK. Here's what I think. Please do feel free to ask questions, and do @ mention me to get my attention. I can't promise I'll reply quickly, but I do try to help.
   
   There's one way to create an unbounded source: a [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) that also returns [process continuations](https://beam.apache.org/documentation/programming-guide/#user-initiated-checkpoint) and, critically, provides [watermark estimates](https://beam.apache.org/documentation/programming-guide/#watermark-estimation).
   
   Alternatively, the 2.47.0 release will include a [`periodic.Impulse`](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/transforms/periodic/periodic.go#L118) transform, which "impulses" downstream transforms periodically, making them unbounded. It can be thought of as a streaming replacement for `beam.Impulse`.
   It also serves as a good example of how to write an unbounded source. Note that while we use the term "unbounded", such sources can in fact terminate when there's no more work, which allows a streaming pipeline to terminate gracefully if desired.
   
   For any kind of "database" model, I'd recommend starting with a batch source whose input elements are configuration objects for the datastore.
   
   For a pubsub approach, it's a little trickier to manage splitting / scaling horizontally, and the appropriate ways to scale up and down depend on the underlying message broker's model. But as a first approximation, using the periodic impulse to "poll" whatever subscription is created for the job seems reasonable. You'll also want to use [Bundle Finalization](https://beam.apache.org/documentation/programming-guide/#bundle-finalization) to perform whatever acknowledgement the broker requires.
   
   It looks like [Redis uses at-most-once semantics](https://redis.io/docs/manual/pubsub/#delivery-semantics) for Pub/Sub (unless it's using [Redis streams](https://redis.io/docs/data-types/streams-tutorial/)), so if you want to add stronger guarantees to a Redis Pub/Sub source, you'll need to add a `beam.Reshuffle` or a GBK to provide some sort of checkpointing on that data, to avoid data loss.
   
   The last thing I'll say is that if you want a streaming job to drain quickly, you'll also want to add support for [Truncating on Drain](https://beam.apache.org/documentation/programming-guide/#truncating-during-drain), which allows restrictions to be "reduced" to some smaller size if necessary. Otherwise, bounded restrictions will be executed to completion to avoid data loss. This only applies to runners that support Drain, which I think is only Dataflow at the moment, but I do intend to support it in Prism eventually, because something open source should be able to test that behavior.
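   A sketch of what that truncation might look like over a simple offset-range restriction. This is a simplified stand-in (the real SDK wires this up through a truncation method on your DoFn, and `OffsetRange` / `truncateForDrain` here are illustrative, not the SDK types): on drain, cut the restriction down to the work already claimed so the pipeline can finish quickly without losing anything that was emitted.

```go
package main

import "fmt"

// OffsetRange is a simplified restriction over the half-open
// interval [Start, End).
type OffsetRange struct{ Start, End int64 }

// truncateForDrain reduces a large or unbounded restriction to only
// the positions already claimed, so a drain can complete quickly.
// The second return is false when nothing has been processed yet and
// the restriction can simply be dropped.
func truncateForDrain(r OffsetRange, claimed int64) (OffsetRange, bool) {
	if claimed <= r.Start {
		return OffsetRange{}, false // nothing claimed yet: drop it entirely.
	}
	if claimed >= r.End {
		return r, true // already finished: nothing to cut.
	}
	return OffsetRange{Start: r.Start, End: claimed}, true
}

func main() {
	// An effectively unbounded range, drained after claiming offset 100.
	r, ok := truncateForDrain(OffsetRange{Start: 0, End: 1 << 62}, 100)
	fmt.Println(ok, r.End) // true 100
}
```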

