Posted to user@beam.apache.org by Janek Bevendorff <ja...@uni-weimar.de> on 2023/12/08 14:15:22 UTC

SparkRunner / PortableRunner Spark config besides Spark Master

Hi,

I'm struggling to figure out the best way to make Python Beam jobs 
execute on a Spark cluster running on Kubernetes. Unfortunately, the 
available documentation is incomplete and confusing at best.

The most flexible way I found was to compile a JAR from my Python job 
and submit that via spark-submit. Unfortunately, this seems to be 
extremely buggy, and I cannot get logs from the SDK containers forwarded 
through the Spark executors back to the driver. See: 
https://github.com/apache/beam/issues/29683
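
For reference, the jar route currently looks roughly like this on my 
end (module name, paths and master URL are placeholders, and this is 
only a sketch of how I wire up the portable Spark runner options):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Build an executable jar instead of running the pipeline directly.
    options = PipelineOptions([
        "--runner=SparkRunner",
        "--environment_type=DOCKER",
        "--output_executable_path=/tmp/my_pipeline.jar",
    ])

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create(["a", "b"]) | beam.Map(print)

    # The jar is then handed to the cluster like any other Spark app,
    # which is also where my spark-defaults.conf gets applied today:
    #   spark-submit --master k8s://https://<apiserver>:6443 \
    #       --deploy-mode cluster \
    #       --properties-file spark-defaults.conf \
    #       /tmp/my_pipeline.jar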

The other way would be to use a Beam job server, but here I cannot find 
a sensible way to set any Spark config options besides the master URL. I 
have a spark-defaults.conf with vital configuration, which needs to be 
passed to the job. I see two ways forward here:

1) I could let users run the job server locally in a Docker container. 
That way they could potentially mount their spark-defaults.conf 
somewhere, but I don't really see where (pointers welcome). They would 
also need to mount their Kubernetes access credentials somehow, since 
otherwise the job server cannot access the cluster. A rough sketch of 
what I have in mind follows after option 2.

2) I could run the job server in the Kubernetes cluster, which would 
resolve the Kubernetes credential issue but not the Spark config issue. 
And even if that were solved, I would be forcing all users to use the 
same Spark config (not ideal).
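
To make option 1 concrete, the mechanics I have in mind look roughly 
like the sketch below. The image name and job endpoint port are the 
Beam defaults; the spark-defaults.conf mount target is pure guesswork, 
which is exactly the part I cannot find documented:

    # Job server run locally in Docker, mounting kube credentials and
    # the Spark config (the conf mount path below is a guess, not a
    # documented location):
    #
    #   docker run --net=host \
    #       -v $HOME/.kube:/root/.kube \
    #       -v $PWD/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf \
    #       apache/beam_spark3_job_server:latest \
    #       --spark-master-url=k8s://https://<apiserver>:6443
    #
    # The Python pipeline is then submitted against that job server:
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",   # job server default port
        "--environment_type=DOCKER",
    ])

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create(["a", "b"]) | beam.Map(print)

As far as I can tell, the master URL is the only Spark setting I can 
pass to the job server this way, which is what leads to my question.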

Is there a better way? From what I can see, the compiled JAR is the only 
viable option, but the log issue is a deal breaker.

Thanks
Janek