Posted to user@beam.apache.org by "Koffman, Noa (Nokia - IL/Kfar Sava)" <no...@nokia.com> on 2022/02/03 20:33:19 UTC

Deployment of Beam pipelines on a Flink cluster

Hi all,
We are currently using Beam to create a few pipelines; we deploy some of them on our on-prem Flink cluster and some on GCP Dataflow. We have a few questions about automating the deployment of these pipelines:

Flink deployment:
We are currently running Beam as a k8s pod that starts a Java process for each pipeline with the FlinkRunner parameters; these Java processes deploy the pipelines as Flink jobs on a Flink cluster that is also deployed on k8s (a simplified sketch of this launch code follows at the end of this section).
The main issue with the current deployment is that every time the Beam pod gets restarted, it re-creates the Flink jobs, and this causes duplicate jobs and errors.
We were wondering what the recommended way is to automate deploying Beam pipelines to Flink, and is there any documentation on this?
We also tried generating jars from the Beam pipelines and then running them on Flink, but we had issues with the fat jars: they included some Flink libraries, and we were not able to run them as-is on Flink. Is there a way to get a ‘leaner’ fat jar?
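
For context, each of the Java processes mentioned above submits its pipeline to the cluster roughly like this (a simplified sketch; the class name and the JobManager address are placeholders for our actual values):

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchPipeline {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    // Placeholder: the REST address of the JobManager of our k8s Flink cluster.
    options.setFlinkMaster("flink-jobmanager:8081");

    Pipeline p = Pipeline.create(options);
    // ... pipeline transforms elided ...

    // Submits the pipeline to the Flink cluster as a job; this main method is
    // re-run whenever the Beam pod restarts, which is what re-creates the jobs.
    p.run();
  }
}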


Dataflow deployment:
We are trying to deploy and run a pipeline in Dataflow. We have a jar with a pipeline that runs locally, and we are deploying it using:

java $JVM_OPT -Dlog4j.configuration=file:log4j.properties -cp "/opt/apache/beam/jars/*" XXX.PipelinePubSubToBigQuery --runner=DataflowRunner --project=<our project> --region=<our region>  --stagingLocation=<CS staging bucket> --tempLocation=<CS temp bucket> --templateLocation=<CS location path> &

We noticed that the deployment uploads over 200 jars to CS, and even a minor change in a single jar results in a ~15-minute deployment.
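
For context, the entry point invoked by this command is a standard Beam main method, roughly like this (a simplified sketch; the actual transforms are elided):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PipelinePubSubToBigQuery {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    Pipeline p = Pipeline.create(options);
    // ... Pub/Sub read -> transforms -> BigQuery write, elided ...

    // With --templateLocation set, run() stages the pipeline as a template
    // rather than executing a job.
    p.run();
  }
}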

We have a few questions:

  1.  What is the recommended way of handling the deployment and job creation? Should we start the deployment from a microservice, or is there another preferred way?
  2.  How do we programmatically start running the job after the deployment? (A rough sketch of what we have in mind is included after this list.)
  3.  Prior to using Dataflow, our pipeline ended with result.waitUntilFinish(), but with Dataflow we get an exception about a missing jobId. Should we still call this method?
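
To make question 2 concrete: we are considering launching the staged template from a small service, along these lines. This is a rough sketch assuming the google-api-services-dataflow client library; the project, region, job name, and template path are placeholders, and the method names are from memory, so they may need checking:

import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.LaunchTemplateParameters;
import com.google.api.services.dataflow.model.LaunchTemplateResponse;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;

public class LaunchTemplateJob {
  public static void main(String[] args) throws Exception {
    GoogleCredentials credentials =
        GoogleCredentials.getApplicationDefault()
            .createScoped("https://www.googleapis.com/auth/cloud-platform");

    Dataflow dataflow =
        new Dataflow.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                GsonFactory.getDefaultInstance(),
                new HttpCredentialsAdapter(credentials))
            .setApplicationName("template-launcher") // placeholder
            .build();

    LaunchTemplateParameters launchParams =
        new LaunchTemplateParameters().setJobName("pubsub-to-bigquery"); // placeholder

    // Launch the template previously staged at --templateLocation.
    LaunchTemplateResponse response =
        dataflow
            .projects()
            .locations()
            .templates()
            .launch("my-project", "my-region", launchParams) // placeholders
            .setGcsPath("gs://my-bucket/templates/my-template") // placeholder
            .execute();

    System.out.println("Launched job: " + response.getJob().getId());
  }
}

Is starting jobs this way the recommended approach, or is there a preferred alternative?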
Any help regarding this subject would be appreciated.
Thanks
Noa