Posted to user@beam.apache.org by Wiśniowski Piotr <co...@gmail.com> on 2023/10/03 10:23:15 UTC

Deployment approach for stateful streaming jobs

Hi,

Are there any typical patterns for deploying stateful streaming pipelines in 
Beam? We are targeting Dataflow and the Python SDK, with significant use of 
stateful processing and long windows (typically days long).
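
For context, here is a minimal toy sketch of the kind of stateful transform I 
have in mind (purely illustrative, not our real pipeline - the names, coder 
and window size are placeholders):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms import window
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class CountEventsFn(beam.DoFn):
    # Per-key, per-window counter - the kind of state whose coder/shape may
    # change between deployments.
    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _ = element
        current = (count.read() or 0) + 1
        count.write(current)
        yield key, current

with beam.Pipeline() as p:
    (p
     | beam.Create([('user-1', 1), ('user-1', 1), ('user-2', 1)])
     | beam.WindowInto(window.FixedWindows(3 * 24 * 60 * 60))  # days-long windows
     | beam.ParDo(CountEventsFn()))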

From our current practice of maintaining pipelines we have identified 3 
typical scenarios:

1. Deployment without breaking changes (no changes to the pipeline graph, 
coders, state, outputs, etc.) - just update the Dataflow job in place (see 
the sketch after this list).

2. Deployment with changes to internal state (changes to coders, state, or 
even the pipeline graph, but without changing the pipeline input/output 
schemas) - in this case updating the job in place does not work, since the 
state format has changed and reading state saved by the old pipeline would 
result in an error.

3. Deployment with changes to the output schema (and potentially to the 
internal state too) - here we need to take special care when changing the 
output schema so that downstream processes have time to switch from the old 
version of the data to the new one.
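
Regarding scenario 1 - this is roughly how we re-launch with an in-place 
update today (option names as I understand them; project, bucket and job 
names are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="europe-west1",                # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    streaming=True,
    job_name="my-stateful-job",           # must match the running job's name
    update=True,                          # ask Dataflow to update the job in place
    # Only needed when step names changed between versions:
    transform_name_mapping={"OldStepName": "NewStepName"},
)
# ...then build and run the pipeline with these options as usual.

For scenarios 2 and 3 this does not help, since Dataflow rejects the update 
when the state or graph is incompatible, so I assume we would have to drain 
the old job (e.g. gcloud dataflow jobs drain) and start a fresh one - hence 
the questions below.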

To be specific, I need some advice/patterns/knowledge on points 2 and 3. I 
guess they will require spinning up new pipelines with data back-filling or 
migration jobs? I would really appreciate detailed examples of how you deal 
with deploying similar stateful streaming pipelines, ideally with details on 
how much data to reprocess to populate the internal state, what needs to be 
done when changing the output schema of a pipeline, and how to orchestrate 
all these activities.

Best
Wiśniowski Piotr