Posted to user@beam.apache.org by Eugene Kirpichov <ek...@gmail.com> on 2020/08/26 23:10:30 UTC

Getting Beam(Python)-on-Flink-on-k8s to work

Hi folks,

I'm still working with Pachama <https://pachama.com/> right now; we have a
Kubernetes Engine cluster on GCP and want to run Beam Python batch
pipelines with custom containers against it.
Flink and Cloud Dataflow are the two options; Cloud Dataflow doesn't
support custom containers for batch pipelines yet, so we're going with Flink.

I'm struggling to find complete documentation on how to do this. There
seems to be lots of conflicting or incomplete information: several ways to
deploy Flink, several ways to get Beam working with it, bizarre
StackOverflow questions, and no documentation explaining a complete working
example.

== My requests ==
* Could people briefly share their working setup? Would be good to know
which directions are promising.
* It would be particularly helpful if someone could volunteer an hour of
their time to talk to me about their working Beam/Flink/k8s setup. It's for
a good cause (fixing the planet :) ) and on my side I volunteer to write up
the findings to share with the community so others suffer less.

== Appendix: My findings so far ==
There are multiple ways to deploy Flink on k8s:
- The GCP marketplace Flink operator
<https://cloud.google.com/blog/products/data-analytics/open-source-processing-engines-for-kubernetes>
(couldn't get it to work) and the respective CLI version
<https://github.com/GoogleCloudPlatform/flink-on-k8s-operator> (buggy, but
I got it working)
- https://github.com/lyft/flinkk8soperator (haven't tried)
- Flink's native k8s support
<https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html>
(super easy to get working)

I confirmed that my Flink cluster was operational by running a simple
Wordcount job, initiated from my machine. However I wasn't yet able to get
Beam working:

- With the Flink operator, I was able to submit a Beam job, but hit the
issue that I need Docker installed on my Flink nodes. I haven't yet tried
changing the operator's yaml files to add Docker inside them. I also
haven't tried this
<https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
yet because it implies submitting jobs using "kubectl apply", which is
weird - why not just submit it through the Flink job server?

- With Flink's native k8s support, I tried two things (sketched below):
  - Creating a fat portable jar using --output_executable_path. The jar is
huge (200+MB) and takes forever to upload to my Flink cluster - this is a
non-starter. But even when I do upload it, I hit the same missing-Docker
issue. Haven't tried fixing it yet.
  - Simply running my pipeline with --runner=FlinkRunner
--environment_type=DOCKER --flink_master=$PUBLIC_IP:8081. The Java process
appears to send 1+GB of data somewhere, but the job never even starts.
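
For reference, the two submission modes above amount to roughly the
following (a sketch only; "mypipeline" stands in for the actual pipeline
module, the flags are the standard Beam Python pipeline options):

# Mode 1: build a self-contained fat jar to upload to Flink manually:
python -m mypipeline \
  --runner=FlinkRunner \
  --environment_type=DOCKER \
  --output_executable_path=/tmp/pipeline-flink.jar

# Mode 2: submit straight to the Flink master:
python -m mypipeline \
  --runner=FlinkRunner \
  --environment_type=DOCKER \
  --flink_master=$PUBLIC_IP:8081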

I looked at a few conference talks:
- https://www.cncf.io/wp-content/uploads/2020/02/CNCF-Webinar_-Apache-Flink-on-Kubernetes-Operator-1.pdf
seems to imply that I need to add a Beam worker "sidecar" to the Flink
workers, and that I need to submit my job using "kubectl apply".
- https://www.youtube.com/watch?v=8k1iezoc5Sc which mentions the sidecar
as well as the fat jar option.

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Sam Bourne <sa...@gmail.com>.
On Sat, Aug 29, 2020 at 10:59 AM Eugene Kirpichov <ek...@gmail.com>
wrote:


>
> On Fri, Aug 28, 2020 at 6:52 PM Sam Bourne <sa...@gmail.com> wrote:
>
>> Hi Eugene,
>>
>> Glad that helped you out and thanks for the PR tweaking it for GCP.
>>
>> To fetch the containers from GCR, I had to log into Docker inside the
>> Flink nodes, specifically inside the taskmanager container, using something
>> like "kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- docker
>> login -u oauth2accesstoken --password $(gcloud auth print-access-token)"
>>
>> Ouch that seems painful. I find this “precaching” step pretty silly and
>> have considered making the DockerEnvironmentFactory a little more
>> intelligent about how it deals with timeouts (e.g. no activity). It doesn’t
>> seem like it would be too difficult to also add first-class support for
>> pulling images from protected repositories. Extending the DockerPayload
>> protobuf to pass along the additional information and tweaking the
>> DockerEnvironmentFactory? I’m not a java expert but that might be worth
>> exploring if this continues to be problematic.
>>
> Yeah it makes sense to have some first-class support for Docker
> credentials. It's kind of a no-brainer that it's necessary with custom
> containers, many companies probably wouldn't want to push their custom
> containers to a public repo.
> I was thinking of embedding the credentials JSON file into the
> taskmanager container through its Dockerfile, that's workable but also
> pretty silly - having to rebuild this container just for the sake of
> putting in the credentials.
> DockerPayload might be the right place to put credentials, but I wonder if
> there's a way to do something more secure, with k8s secrets. I'm not too
> well-versed in credential management.
>
Using k8s secrets you could mount your credentials into the container and
tweak the pull/run command
<https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L77>
to first log in using a pattern like cat /tmp/password.txt | docker login
--username foo --password-stdin. Maybe the DockerPayload protobuf could
include the password as raw text or an absolute filepath, and switch the
login command depending on which is provided.
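
A minimal sketch of that pattern for GCR (the secret name and mount path
here are made up for illustration):

# create a secret holding a registry token:
kubectl create secret generic gcr-token \
  --from-literal=password.txt="$(gcloud auth print-access-token)"

# mount it into the taskmanager container at /tmp via the deployment spec,
# then log in without the password showing up in argv:
cat /tmp/password.txt | docker login --username oauth2accesstoken \
  --password-stdin us.gcr.io

Caveat: gcloud access tokens expire after about an hour, so this shows the
plumbing rather than a production setup.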

>> I found the time it takes to pull can be dramatically improved if you store
>> everything in memory
>> <https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/flink.yaml#L186>.
>>
>> In the end, I got my pipeline to start, create the uber jar (about 240MB
>> in size), take a few minutes to transmit it to Flink
>>
>> You could explore spinning up the beam-flink-job-server
>> <https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml>
>> and using the PortableRunner. In theory that should reduce the amount of
>> data you’re syncing to the cluster. It does require exposing at least two
>> ingress points (8099 and 8098) so you can hit the job and artifact services
>> respectively.
>>
> Right, good idea! Haven't tried spinning up the job server. Exposing the
> job and artifact services seems pretty easy; but would also need to replace
> the jobserver image "apache/beam_flink1.10_job_server:2.23.0" with a
> custom-built one with the Beam 2.24 snapshot we're using.
>

This may be necessary anyway, depending on how you handle the docker login
stuff. In any case, good luck!


>
>
>> Cheers,
>> Sam
>>
>> On Fri, Aug 28, 2020 at 5:50 PM Eugene Kirpichov <ek...@gmail.com>
>> wrote:
>>
>>> Woohoo thanks Kyle, adding --save_main_session made it work!!!
>>>
>>> On Fri, Aug 28, 2020 at 5:02 PM Kyle Weaver <kc...@google.com> wrote:
>>>
>>>> > rpc error: code = Unimplemented desc = Method not found:
>>>> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
>>>>
>>>> This is a known issue: https://issues.apache.org/jira/browse/BEAM-10762
>>>>
>>>> On Fri, Aug 28, 2020 at 4:57 PM Eugene Kirpichov <ek...@gmail.com>
>>>> wrote:
>>>>
>>>>> P.S. Ironic how back in 2018 I was TL-ing the portable runners effort
>>>>> for a few months on Google side, and now I need community help to get it to
>>>>> work at all.
>>>>> Still pretty miraculous how far Beam's portability has come since
>>>>> then, even if it has a steep learning curve.
>>>>>
>>>>> On Fri, Aug 28, 2020 at 4:54 PM Eugene Kirpichov <ek...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Sam,
>>>>>>
>>>>>> You're a wizard - this got me *way* farther than my previous
>>>>>> attempts. Here's a PR
>>>>>> https://github.com/sambvfx/beam-flink-k8s/pull/1 with a couple of
>>>>>> changes I had to make.
>>>>>>
>>>>>> I had to make some additional changes that do not make sense to
>>>>>> share, but here they are for the record:
>>>>>> - Because I'm running on k8s engine and not minikube, I had to put
>>>>>> the docker-flink image on GCR, changing flink.yaml "image:
>>>>>> docker-flink:1.10" -> "image: us.gcr.io/$PROJECT_ID/docker-flink:1.10".
>>>>>> I of course also had to build and push the container.
>>>>>> - Because I'm running with a custom container based on an unreleased
>>>>>> version of Beam, I had to push my custom container to GCR too, and change
>>>>>> your instructions to use that image name instead of the default one
>>>>>> - To fetch the containers from GCR, I had to log into Docker inside
>>>>>> the Flink nodes, specifically inside the taskmanager container, using
>>>>>> something like "kubectl exec pod/flink-taskmanager-blahblah -c taskmanager
>>>>>> -- docker login -u oauth2accesstoken --password $(gcloud auth
>>>>>> print-access-token)"
>>>>>> - Again because I'm using an unreleased Beam SDK (due to a bug whose
>>>>>> fix will be released in 2.24), I had to also build a custom Flink job
>>>>>> server jar and point to it via --flink_job_server_jar.
>>>>>>
>>>>>> In the end, I got my pipeline to start, create the uber jar (about
>>>>>> 240MB in size), take a few minutes to transmit it to Flink (which is a long
>>>>>> time, but it'll do for a prototype); the Flink UI was displaying the
>>>>>> pipeline, and was able to *start* the worker container - however it
>>>>>> quickly failed with the following error:
>>>>>>
>>>>>> 2020/08/28 15:49:09 Initializing python harness:
>>>>>> /opt/apache/beam/boot --id=1-1 --provision_endpoint=localhost:45111
>>>>>> 2020/08/28 15:49:09 Failed to retrieve staged files: failed to get
>>>>>> manifest
>>>>>> caused by:
>>>>>> rpc error: code = Unimplemented desc = Method not found:
>>>>>> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
>>>>>> (followed by a bunch of other garbage)
>>>>>>
>>>>>> I'm assuming this might be because I got tangled in my custom images
>>>>>> related to the unreleased Beam SDK, and should be fixed if running on clean
>>>>>> Beam 2.24.
>>>>>>
>>>>>> Thank you again!
>>>>>>
>>>>>> On Fri, Aug 28, 2020 at 10:21 AM Eugene Kirpichov <
>>>>>> ekirpichov@gmail.com> wrote:
>>>>>>
>>>>>>> Holy shit, thanks Sam, this is more help than I could have asked
>>>>>>> for!!
>>>>>>> I'll give this a shot later today and report back.
>>>>>>>
>>>>>>> On Thu, Aug 27, 2020 at 10:27 PM Sam Bourne <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Eugene!
>>>>>>>>
>>>>>>>> I’m struggling to find complete documentation on how to do this.
>>>>>>>> There seems to be lots of conflicting or incomplete information: several
>>>>>>>> ways to deploy Flink, several ways to get Beam working with it, bizarre
>>>>>>>> StackOverflow questions, and no documentation explaining a complete working
>>>>>>>> example.
>>>>>>>>
>>>>>>>> This *is* possible and I went through all the same frustrations of
>>>>>>>> sparse and confusing documentation. I’m glossing over a lot of details, but
>>>>>>>> the key thing was setting up the flink taskworker(s) to run docker. This
>>>>>>>> requires running docker-in-docker as the taskworker itself is a docker
>>>>>>>> container in k8s.
>>>>>>>>
>>>>>>>> First create a custom flink container with docker:
>>>>>>>>
>>>>>>>> # docker-flink Dockerfile
>>>>>>>>
>>>>>>>> FROM flink:1.10
>>>>>>>> # install docker
>>>>>>>> RUN apt-get ...
>>>>>>>>
>>>>>>>> Then setup the taskmanager deployment to use a sidecar
>>>>>>>> docker-in-docker service. This dind service is where the python sdk harness
>>>>>>>> container actually runs.
>>>>>>>>
>>>>>>>> kind: Deployment
>>>>>>>> ...
>>>>>>>>   containers:
>>>>>>>>   - name: docker
>>>>>>>>     image: docker:19.03.5-dind
>>>>>>>>     ...
>>>>>>>>   - name: taskmanager
>>>>>>>>     image: myregistry:5000/docker-flink:1.10
>>>>>>>>     env:
>>>>>>>>     - name: DOCKER_HOST
>>>>>>>>       value: tcp://localhost:2375
>>>>>>>> ...
>>>>>>>>
>>>>>>>> I quickly threw all these pieces together in a repo here:
>>>>>>>> https://github.com/sambvfx/beam-flink-k8s
>>>>>>>>
>>>>>>>> I added a working (via minikube) step-by-step in the README to
>>>>>>>> prove to myself that I didn’t miss anything, but feel free to submit any
>>>>>>>> PRs if you want to add anything useful.
>>>>>>>>
>>>>>>>> The documents you linked are very informative. It would be great to
>>>>>>>> aggregate all this into digestible documentation. Let me know if you have
>>>>>>>> any further questions!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sam
>>>>>>>>
>>>>>>>> On Thu, Aug 27, 2020 at 10:25 AM Eugene Kirpichov <
>>>>>>>> ekirpichov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Kyle,
>>>>>>>>>
>>>>>>>>> Thanks for the response!
>>>>>>>>>
>>>>>>>>> On Wed, Aug 26, 2020 at 5:28 PM Kyle Weaver <kc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > - With the Flink operator, I was able to submit a Beam job, but
>>>>>>>>>> hit the issue that I need Docker installed on my Flink nodes. I haven't yet
>>>>>>>>>> tried changing the operator's yaml files to add Docker inside them.
>>>>>>>>>>
>>>>>>>>>> Running Beam workers via Docker on the Flink nodes is not
>>>>>>>>>> recommended (and probably not even possible), since the Flink nodes are
>>>>>>>>>> themselves already running inside Docker containers. Running workers as
>>>>>>>>>> sidecars avoids that problem. For example:
>>>>>>>>>> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
>>>>>>>>>>
>>>>>>>>>> The main problem with the sidecar approach is that I can't use
>>>>>>>>> the Flink cluster as a "service" for anybody to submit their jobs with
>>>>>>>>> custom containers - the container version is fixed.
>>>>>>>>> Do I understand it correctly?
>>>>>>>>> Seems like the Docker-in-Docker approach is viable, and is
>>>>>>>>> mentioned in the Beam Flink K8s design doc
>>>>>>>>> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.dtj1gnks47dq>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> > I also haven't tried this
>>>>>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md> yet
>>>>>>>>>> because it implies submitting jobs using "kubectl apply"  which is weird -
>>>>>>>>>> why not just submit it through the Flink job server?
>>>>>>>>>>
>>>>>>>>>> I'm guessing it goes through k8s for monitoring purposes. I see
>>>>>>>>>> no reason it shouldn't be possible to submit to the job server directly
>>>>>>>>>> through Python, network permitting, though I haven't tried this.

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Eugene Kirpichov <ek...@gmail.com>.
On Fri, Aug 28, 2020 at 6:52 PM Sam Bourne <sa...@gmail.com> wrote:

> Hi Eugene,
>
> Glad that helped you out and thanks for the PR tweaking it for GCP.
>
> To fetch the containers from GCR, I had to log into Docker inside the
> Flink nodes, specifically inside the taskmanager container, using something
> like "kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- docker
> login -u oauth2accesstoken --password $(gcloud auth print-access-token)"
>
> Ouch that seems painful. I find this “precaching” step pretty silly and
> have considered making the DockerEnvironmentFactory a little more
> intelligent about how it deals with timeouts (e.g. no activity). It doesn’t
> seem like it would be too difficult to also add first-class support for
> pulling images from protected repositories. Extending the DockerPayload
> protobuf to pass along the additional information and tweaking the
> DockerEnvironmentFactory? I’m not a java expert but that might be worth
> exploring if this continues to be problematic.
>
Yeah it makes sense to have some first-class support for Docker
credentials. It's kind of a no-brainer that it's necessary with custom
containers; many companies probably wouldn't want to push their custom
containers to a public repo.
I was thinking of embedding the credentials JSON file into the
taskmanager container through its Dockerfile; that's workable but also
pretty silly - having to rebuild this container just for the sake of
putting in the credentials.
DockerPayload might be the right place to put credentials, but I wonder if
there's a way to do something more secure, with k8s secrets. I'm not too
well-versed in credential management.


> I found the time it takes to pull can be dramatically improved if you store
> everything in memory
> <https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/flink.yaml#L186>.
>
> In the end, I got my pipeline to start, create the uber jar (about 240MB
> in size), take a few minutes to transmit it to Flink
>
> You could explore spinning up the beam-flink-job-server
> <https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml>
> and using the PortableRunner. In theory that should reduce the amount of
> data you’re syncing to the cluster. It does require exposing at least two
> ingress points (8099 and 8098) so you can hit the job and artifact services
> respectively.
>
Right, good idea! Haven't tried spinning up the job server. Exposing the
job and artifact services seems pretty easy, but we would also need to
replace the jobserver image "apache/beam_flink1.10_job_server:2.23.0" with
a custom-built one containing the Beam 2.24 snapshot we're using.



-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Sam Bourne <sa...@gmail.com>.
Hi Eugene,

Glad that helped you out and thanks for the PR tweaking it for GCP.

To fetch the containers from GCR, I had to log into Docker inside the Flink
nodes, specifically inside the taskmanager container, using something like
"kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- docker login
-u oauth2accesstoken --password $(gcloud auth print-access-token)"

Ouch that seems painful. I find this “precaching” step pretty silly and
have considered making the DockerEnvironmentFactory a little more
intelligent about how it deals with timeouts (e.g. no activity). It doesn’t
seem like it would be too difficult to also add first-class support for
pulling images from protected repositories. Extending the DockerPayload
protobuf to pass along the additional information and tweaking the
DockerEnvironmentFactory? I’m not a java expert but that might be worth
exploring if this continues to be problematic.

I found the time it takes to pull can be dramatically improved if you store
everything in memory
<https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/flink.yaml#L186>.

In the end, I got my pipeline to start, create the uber jar (about 240MB in
size), take a few minutes to transmit it to Flink

You could explore spinning up the beam-flink-job-server
<https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml>
and using the PortableRunner. In theory that should reduce the amount of
data you’re syncing to the cluster. It does require exposing at least two
ingress points (8099 and 8098) so you can hit the job and artifact services
respectively.
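
Roughly, such a submission would look like this (a sketch; $JOBSERVER_IP
stands in for wherever the job service is exposed, "mypipeline" for the
pipeline module):

python -m mypipeline \
  --runner=PortableRunner \
  --job_endpoint=$JOBSERVER_IP:8099 \
  --environment_type=DOCKER

If I understand the portable flow correctly, the artifact endpoint is
normally handed back by the job service, which is why 8098 has to be
reachable as well.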

Cheers,
Sam


Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Eugene Kirpichov <ek...@gmail.com>.
Woohoo thanks Kyle, adding --save_main_session made it work!!!
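
For the record, that's just one extra flag on the submission command from
before, e.g. (with "mypipeline" again a placeholder):

python -m mypipeline \
  --runner=FlinkRunner \
  --environment_type=DOCKER \
  --flink_master=$PUBLIC_IP:8081 \
  --save_main_session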

On Fri, Aug 28, 2020 at 5:02 PM Kyle Weaver <kc...@google.com> wrote:

> > rpc error: code = Unimplemented desc = Method not found:
> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
>
> This is a known issue: https://issues.apache.org/jira/browse/BEAM-10762
>
> On Fri, Aug 28, 2020 at 4:57 PM Eugene Kirpichov <ek...@gmail.com>
> wrote:
>
>> P.S. Ironic how back in 2018 I was TL-ing the portable runners effort for
>> a few months on Google side, and now I need community help to get it to
>> work at all.
>> Still pretty miraculous how far Beam's portability has come since
>> then, even if it has a steep learning curve.
>>
>> On Fri, Aug 28, 2020 at 4:54 PM Eugene Kirpichov <ek...@gmail.com>
>> wrote:
>>
>>> Hi Sam,
>>>
>>> You're a wizard - this got me *way* farther than my previous attempts.
>>> Here's a PR https://github.com/sambvfx/beam-flink-k8s/pull/1 with a
>>> couple of changes I had to make.
>>>
>>> I had to make some additional changes that do not make sense to share,
>>> but here they are for the record:
>>> - Because I'm running on k8s engine and not minikube, I had to put the
>>> docker-flink image on GCR, changing flink.yaml "image: docker-flink:1.10"
>>> -> "image: us.gcr.io/$PROJECT_ID/docker-flink:1.10". I of course also
>>> had to build and push the container.
>>> - Because I'm running with a custom container based on an unreleased
>>> version of Beam, I had to push my custom container to GCR too, and change
>>> your instructions to use that image name instead of the default one
>>> - To fetch the containers from GCR, I had to log into Docker inside the
>>> Flink nodes, specifically inside the taskmanager container, using something
>>> like "kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- docker
>>> login -u oauth2accesstoken --password $(gcloud auth print-access-token)"
>>> - Again because I'm using an unreleased Beam SDK (due to a bug whose fix
>>> will be released in 2.24), I had to also build a custom Flink job server
>>> jar and point to it via --flink_job_server_jar.
>>>
>>> In the end, I got my pipeline to start, create the uber jar (about 240MB
>>> in size), take a few minutes to transmit it to Flink (which is a long time,
>>> but it'll do for a prototype); the Flink UI was displaying the pipeline,
>>> and was able to *start* the worker container - however it quickly
>>> failed with the following error:
>>>
>>> 2020/08/28 15:49:09 Initializing python harness: /opt/apache/beam/boot
>>> --id=1-1 --provision_endpoint=localhost:45111
>>> 2020/08/28 15:49:09 Failed to retrieve staged files: failed to get
>>> manifest
>>> caused by:
>>> rpc error: code = Unimplemented desc = Method not found:
>>> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
>>> (followed by a bunch of other garbage)
>>>
>>> I'm assuming this might be because I got tangled in my custom images
>>> related to the unreleased Beam SDK, and should be fixed if running on clean
>>> Beam 2.24.
>>>
>>> Thank you again!
>>>
>>> On Fri, Aug 28, 2020 at 10:21 AM Eugene Kirpichov <ek...@gmail.com>
>>> wrote:
>>>
>>>> Holy shit, thanks Sam, this is more help than I could have asked for!!
>>>> I'll give this a shot later today and report back.
>>>>
>>>> On Thu, Aug 27, 2020 at 10:27 PM Sam Bourne <sa...@gmail.com> wrote:
>>>>
>>>>> Hi Eugene!
>>>>>
>>>>> I’m struggling to find complete documentation on how to do this. There
>>>>> seems to be lots of conflicting or incomplete information: several ways to
>>>>> deploy Flink, several ways to get Beam working with it, bizarre
>>>>> StackOverflow questions, and no documentation explaining a complete working
>>>>> example.
>>>>>
>>>>> This *is* possible and I went through all the same frustrations of
>>>>> sparse and confusing documentation. I’m glossing over a lot of details, but
>>>>> the key thing was setting up the flink taskworker(s) to run docker. This
>>>>> requires running docker-in-docker as the taskworker itself is a docker
>>>>> container in k8s.
>>>>>
>>>>> First create a custom flink container with docker:
>>>>>
>>>>> # docker-flink Dockerfile
>>>>>
>>>>> FROM flink:1.10
>>>>> # install docker
>>>>> RUN apt-get ...
>>>>>
>>>>> Then setup the taskmanager deployment to use a sidecar
>>>>> docker-in-docker service. This dind service is where the python sdk harness
>>>>> container actually runs.
>>>>>
>>>>> kind: Deployment
>>>>> ...
>>>>>   containers:
>>>>>   - name: docker
>>>>>     image: docker:19.03.5-dind
>>>>>     ...
>>>>>   - name: taskmanger
>>>>>     image: myregistry:5000/docker-flink:1.10
>>>>>     env:
>>>>>     - name: DOCKER_HOST
>>>>>       value: tcp://localhost:2375
>>>>> ...
>>>>>
>>>>> I quickly threw all these pieces together in a repo here:
>>>>> https://github.com/sambvfx/beam-flink-k8s
>>>>>
>>>>> I added a working (via minikube) step-by-step in the README to prove
>>>>> to myself that I didn’t miss anything, but feel free to submit any PRs if
>>>>> you want to add anything useful.
>>>>>
>>>>> The documents you linked are very informative. It would be great to
>>>>> aggregate all this into digestible documentation. Let me know if you have
>>>>> any further questions!
>>>>>
>>>>> Cheers,
>>>>> Sam

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Kyle Weaver <kc...@google.com>.
> rpc error: code = Unimplemented desc = Method not found:
> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest

This is a known issue: https://issues.apache.org/jira/browse/BEAM-10762


Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Eugene Kirpichov <ek...@gmail.com>.
P.S. Ironic how back in 2018 I was TL-ing the portable runners effort for a
few months on the Google side, and now I need community help to get it to
work at all.
Still pretty miraculous how far Beam's portability has come since then,
even if it has a steep learning curve.

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Eugene Kirpichov <ek...@gmail.com>.
Hi Sam,

You're a wizard - this got me *way* farther than my previous attempts.
Here's a PR https://github.com/sambvfx/beam-flink-k8s/pull/1 with a couple
of changes I had to make.

I had to make some additional changes that are too specific to my setup to
be worth upstreaming, but here they are for the record:
- Because I'm running on Google Kubernetes Engine and not minikube, I had
to put the docker-flink image on GCR, changing flink.yaml "image:
docker-flink:1.10" -> "image: us.gcr.io/$PROJECT_ID/docker-flink:1.10". I
of course also had to build and push the container (see the sketch below).
- Because I'm running with a custom container based on an unreleased
version of Beam, I had to push my custom container to GCR too, and change
your instructions to use that image name instead of the default one.
- To fetch the containers from GCR, I had to log into Docker inside the
Flink nodes, specifically inside the taskmanager container, using something
like "kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- docker
login -u oauth2accesstoken --password $(gcloud auth print-access-token)"
- Again because I'm using an unreleased Beam SDK (due to a bug whose fix
will be released in 2.24), I had to also build a custom Flink job server
jar and point to it via --flink_job_server_jar.
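
For the record, the build-and-push step was roughly the following (a
sketch - the image name and project are placeholders, and it assumes
gcloud is set up as a Docker credential helper on the build machine):

docker build -t us.gcr.io/$PROJECT_ID/docker-flink:1.10 .
docker push us.gcr.io/$PROJECT_ID/docker-flink:1.10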

In the end, I got my pipeline to start, create the uber jar (about 240MB
in size), and take a few minutes to transmit it to Flink (which is a long
time, but it'll do for a prototype); the Flink UI was displaying the
pipeline, and the worker container did *start* - however it quickly failed
with the following error:

2020/08/28 15:49:09 Initializing python harness: /opt/apache/beam/boot
--id=1-1 --provision_endpoint=localhost:45111
2020/08/28 15:49:09 Failed to retrieve staged files: failed to get manifest
caused by:
rpc error: code = Unimplemented desc = Method not found:
org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
(followed by a bunch of other garbage)

I'm assuming this might be because I got tangled in my custom images
related to the unreleased Beam SDK, and should be fixed if running on clean
Beam 2.24.
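
For reference, here's roughly what the full invocation adds up to (a
sketch - the image, project, jar path and master address are placeholders
for my actual values):

python my_pipeline.py \
  --runner=FlinkRunner \
  --flink_master=$FLINK_MASTER_IP:8081 \
  --flink_job_server_jar=/path/to/beam-runners-flink-1.10-job-server.jar \
  --environment_type=DOCKER \
  --environment_config=us.gcr.io/$PROJECT_ID/my-beam-sdk:latest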

Thank you again!

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Sam Bourne <sa...@gmail.com>.
Hi Eugene!

I’m struggling to find complete documentation on how to do this. There
seems to be lots of conflicting or incomplete information: several ways to
deploy Flink, several ways to get Beam working with it, bizarre
StackOverflow questions, and no documentation explaining a complete working
example.

This *is* possible and I went through all the same frustrations of sparse
and confusing documentation. I'm glossing over a lot of details, but the
key thing was setting up the flink taskmanager(s) to run docker. This
requires running docker-in-docker, as the taskmanager itself is a docker
container in k8s.

First create a custom flink container with docker:

# docker-flink Dockerfile

FROM flink:1.10
# install docker
RUN apt-get ...
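
For completeness, on the Debian-based flink:1.10 image the install step
might look something like this (a sketch; only the Docker client is
strictly needed, since the daemon runs in the dind sidecar):

FROM flink:1.10
# install the Docker CLI; the daemon itself runs in the dind sidecar
RUN apt-get update && \
    apt-get install -y --no-install-recommends docker.io && \
    rm -rf /var/lib/apt/lists/*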

Then set up the taskmanager deployment to use a sidecar docker-in-docker
service. This dind service is where the python sdk harness container
actually runs.

kind: Deployment
...
  containers:
  - name: docker
    image: docker:19.03.5-dind
    ...
  - name: taskmanager
    image: myregistry:5000/docker-flink:1.10
    env:
    - name: DOCKER_HOST
      value: tcp://localhost:2375
...
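
A couple of details worth noting: the dind container has to run
privileged, and newer dind images enable TLS by default (listening on
2376), so plain tcp://localhost:2375 typically needs TLS disabled
explicitly. Roughly (a sketch of the relevant bits, not a verbatim copy
from the repo):

  - name: docker
    image: docker:19.03.5-dind
    securityContext:
      privileged: true  # dind requires a privileged container
    env:
    - name: DOCKER_TLS_CERTDIR
      value: ""  # disable TLS so the daemon listens on tcp://localhost:2375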

I quickly threw all these pieces together in a repo here:
https://github.com/sambvfx/beam-flink-k8s

I added a working (via minikube) step-by-step in the README to prove to
myself that I didn’t miss anything, but feel free to submit any PRs if you
want to add anything useful.

The documents you linked are very informative. It would be great to
aggregate all this into digestible documentation. Let me know if you have
any further questions!

Cheers,
Sam


Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Eugene Kirpichov <ek...@gmail.com>.
Hi Kyle,

Thanks for the response!

On Wed, Aug 26, 2020 at 5:28 PM Kyle Weaver <kc...@google.com> wrote:

> > - With the Flink operator, I was able to submit a Beam job, but hit the
> > issue that I need Docker installed on my Flink nodes. I haven't yet
> > tried changing the operator's yaml files to add Docker inside them.
>
> Running Beam workers via Docker on the Flink nodes is not recommended (and
> probably not even possible), since the Flink nodes are themselves already
> running inside Docker containers. Running workers as sidecars avoids that
> problem. For example:
> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
>
The main problem with the sidecar approach is that I can't use the Flink
cluster as a "service" for anybody to submit their jobs with custom
containers - the container version is fixed.
Do I understand it correctly?
Seems like the Docker-in-Docker approach is viable, and is mentioned in
the Beam Flink K8s design doc
<https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.dtj1gnks47dq>.


> > I also haven't tried this
> > <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
> > yet because it implies submitting jobs using "kubectl apply" which is
> > weird - why not just submit it through the Flink job server?
>
> I'm guessing it goes through k8s for monitoring purposes. I see no reason
> it shouldn't be possible to submit to the job server directly through
> Python, network permitting, though I haven't tried this.

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov

Re: Getting Beam(Python)-on-Flink-on-k8s to work

Posted by Kyle Weaver <kc...@google.com>.
> - With the Flink operator, I was able to submit a Beam job, but hit the
> issue that I need Docker installed on my Flink nodes. I haven't yet tried
> changing the operator's yaml files to add Docker inside them.

Running Beam workers via Docker on the Flink nodes is not recommended (and
probably not even possible), since the Flink nodes are themselves already
running inside Docker containers. Running workers as sidecars avoids that
problem. For example:
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
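
For reference, the sidecar part of that manifest looks roughly like the
sketch below (not copied verbatim; image tags are illustrative). The
--worker_pool flag makes the Beam Python SDK container run as a
long-lived worker pool, listening on port 50000 by default, which the
pipeline then targets with --environment_type=EXTERNAL (see the Python
sketch further down):

  # Sketch of a FlinkCluster resource for the flink-on-k8s-operator,
  # running the Beam SDK harness as a sidecar on each task manager.
  apiVersion: flinkoperator.k8s.io/v1beta1
  kind: FlinkCluster
  metadata:
    name: beam-flink-cluster
  spec:
    image:
      name: flink:1.10.1                           # illustrative tag
    jobManager:
      accessScope: Cluster
    taskManager:
      replicas: 2
      sidecars:
        - name: beam-worker-pool
          image: apache/beam_python3.7_sdk:2.23.0  # illustrative tag
          args: ["--worker_pool"]                  # serves on :50000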

> I also haven't tried this
> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
> yet because it implies submitting jobs using "kubectl apply" which is
> weird - why not just submit it through the Flink job server?

I'm guessing it goes through k8s for monitoring purposes. I see no reason
it shouldn't be possible to submit to the job server directly through
Python, network permitting, though I haven't tried this.
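
Network permitting, a direct submission from Python would look
something like this untested sketch (<job-server-address> is a
placeholder; 8099 is the job server's default gRPC port, and
localhost:50000 is where the worker pool sidecar listens):

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  # Untested sketch: submit straight to the Flink job server's gRPC
  # endpoint; workers get the SDK harness from the sidecar pool.
  options = PipelineOptions([
      "--runner=PortableRunner",
      "--job_endpoint=<job-server-address>:8099",  # placeholder
      "--environment_type=EXTERNAL",
      "--environment_config=localhost:50000",      # sidecar pool
  ])

  with beam.Pipeline(options=options) as p:
      p | beam.Create(["smoke", "test"]) | beam.Map(print)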


