Posted to user@beam.apache.org by Janek Bevendorff <ja...@uni-weimar.de> on 2023/12/08 14:15:22 UTC
SparkRunner / PortableRunner Spark config besides Spark Master
Hi,
I'm struggling to figure out the best way to make Python Beam jobs
execute on a Spark cluster running on Kubernetes. Unfortunately, the
available documentation is incomplete and confusing at best.
The most flexible way I found was to compile a JAR from my Python job
and submit that via spark-submit. Unfortunately, this seems to be
extremely buggy and I cannot get it to feed logs from the SDK
containers through the Spark executors back to the driver. See:
https://github.com/apache/beam/issues/29683
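For reference, my spark-submit invocation looks roughly like the
following (the master URL, paths, and jar name are placeholders; the
main class is the one the Beam Spark runner docs name for running
portable pipeline jars):

```shell
# Submit the pre-compiled Beam pipeline jar to the cluster.
# --properties-file points spark-submit at a spark-defaults.conf
# explicitly instead of relying on $SPARK_HOME/conf.
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --properties-file ./spark-defaults.conf \
  --class org.apache.beam.runners.spark.SparkPipelineRunner \
  ./beam-pipeline.jar
```

This at least solves the Spark config problem, since spark-submit
reads the properties file natively; it's the logging that's broken.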
The other way would be to use a Beam job server, but there I cannot
find a sensible way to set any Spark config options other than the
master URL. I
have a spark-defaults.conf with vital configuration, which needs to be
passed to the job. I see two ways forward here:
1) I could let users run the job server locally in a Docker container.
This way they could potentially mount their spark-defaults.conf
somewhere, but I don't really see where (pointers here?). They would
also need to mount their Kubernetes access credentials somehow,
otherwise the job server cannot access the cluster.
2) I could run the Job server in the Kubernetes cluster, which would
resolve the Kubernetes credential issue but not the Spark config issue.
And even if that were solved, I would be forcing all users to use the
same Spark config (not ideal).
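For option 1, the closest I've come is something like the sketch
below. Note this is untested: whether the job server's embedded Spark
context actually honors SPARK_CONF_DIR is an assumption on my part,
and the image tag, ports, and mount paths are just placeholders.

```shell
# Run the Beam Spark job server locally. Mount the user's Spark conf
# directory (containing spark-defaults.conf) and their kubeconfig into
# the container. SPARK_CONF_DIR being picked up by the embedded
# SparkContext is an unverified assumption.
docker run --rm -p 8099:8099 -p 8098:8098 \
  -v "$HOME/spark-conf":/opt/spark/conf \
  -e SPARK_CONF_DIR=/opt/spark/conf \
  -v "$HOME/.kube/config":/root/.kube/config \
  apache/beam_spark_job_server:latest \
  --spark-master-url=k8s://https://my-k8s-apiserver:6443
```

If someone can confirm whether the job server reads spark-defaults.conf
at all (via SPARK_CONF_DIR or otherwise), that would already help.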
Is there a better way? From what I can see, the compiled JAR is the only
viable option, but the log issue is a deal breaker.
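In case it helps frame the question: what I conceptually need is to
take every entry of a spark-defaults.conf and turn it into explicit
settings on whatever process submits the job. As a plain-stdlib sketch
of just that translation (function names are mine, not a Beam API):

```python
from pathlib import Path


def load_spark_defaults(path):
    """Parse a spark-defaults.conf into a dict.

    The file format is one whitespace-separated 'key value' pair per
    line; '#' comment lines and blank lines are ignored.
    """
    conf = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on first run of whitespace
        if len(parts) != 2:
            continue  # skip malformed lines rather than fail
        key, value = parts
        conf[key.strip()] = value.strip()
    return conf


def to_submit_args(conf):
    """Render the config as repeated '--conf key=value' flags,
    as understood by spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

If the job server (or the PortableRunner pipeline options) grew a way
to accept such key/value pairs per submission, both of my options
above would become workable.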
Thanks
Janek