Posted to user@beam.apache.org by Florian Pinault <Fl...@ecmwf.int> on 2022/03/28 12:15:58 UTC

[Question] Spark: standard setup to use beam-spark to parallelize python code

Greetings,

We are setting up an Apache Beam cluster using Spark as a backend to run Python code. This is currently a toy example with 4 virtual machines running CentOS (a client, a Spark main node, and two Spark workers).
We are running into version issues (details below) and would need help on which versions to set up.
We are currently trying spark-2.4.8-bin-hadoop2.7, with the pip package beam 2.37.0 on the client, and using a job-server to create the Docker image.


I saw here https://beam.apache.org/blog/beam-2.33.0/ that "Spark 2.x users will need to update Spark's Jackson runtime dependencies (spark.jackson.version) to at least version 2.9.2, due to Beam updating its dependencies."

But it looks like the jackson-core version in the job-server is 2.13.0, whereas the jars in spark-2.4.8-bin-hadoop2.7/jars are:

-. 1 mluser mluser 46986 May 8 2021 jackson-annotations-2.6.7.jar
-. 1 mluser mluser 258919 May 8 2021 jackson-core-2.6.7.jar
-. 1 mluser mluser 232248 May 8 2021 jackson-core-asl-1.9.13.jar
-. 1 mluser mluser 1166637 May 8 2021 jackson-databind-2.6.7.3.jar
-. 1 mluser mluser 320444 May 8 2021 jackson-dataformat-yaml-2.6.7.jar
-. 1 mluser mluser 18336 May 8 2021 jackson-jaxrs-1.9.13.jar
-. 1 mluser mluser 780664 May 8 2021 jackson-mapper-asl-1.9.13.jar
-. 1 mluser mluser 32612 May 8 2021 jackson-module-jaxb-annotations-2.6.7.jar
-. 1 mluser mluser 42858 May 8 2021 jackson-module-paranamer-2.7.9.jar
-. 1 mluser mluser 515645 May 8 2021 jackson-module-scala_2.11-2.6.7.1.jar

There must be something to update, but I am not sure how to update these jar files along with their dependencies, and I am not sure whether this would get us very far.

Would you have a list of binaries that work together, or some running CI from the Apache Foundation similar to what we are trying to achieve?


Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

Posted by Alexey Romanenko <ar...@gmail.com>.
This is caused by a recent Jackson version update in Beam [1], so when using Spark 2.x the Jackson runtime dependencies have to be updated manually (to at least 2.9.2).

Alternatively, use Spark 3.x if possible, since it already provides Jackson jars of version 2.10.0.
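
To see which jars are affected, something along these lines might help. It is only a rough sketch, not an official Beam or Spark tool: the function names and the jars path are made up for illustration, and it assumes that artifacts with major version 1 (such as jackson-core-asl) belong to the separately versioned legacy Jackson 1.x line and can be skipped.

```python
import re
from pathlib import Path

# Minimum Jackson 2.x version quoted in the Beam 2.33.0 release notes.
MIN_JACKSON = (2, 9, 2)

# Matches names such as "jackson-databind-2.6.7.3.jar" or
# "jackson-module-scala_2.11-2.6.7.1.jar".
JAR_PATTERN = re.compile(r"^(jackson-.+?)-(\d+(?:\.\d+)+)\.jar$")


def parse_jackson_jar(filename):
    """Return (artifact, version_tuple), or None if the name does not match."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group(2).split("."))
    return match.group(1), version


def outdated_jackson_jars(filenames, minimum=MIN_JACKSON):
    """List Jackson 2.x jars whose version is below `minimum`."""
    outdated = []
    for name in sorted(filenames):
        parsed = parse_jackson_jar(name)
        if parsed is None:
            continue
        artifact, version = parsed
        # Assumption: major version 1 is the legacy Jackson 1.x line,
        # which is versioned separately, so it is not checked here.
        if version[0] < 2:
            continue
        if version < minimum:
            outdated.append((artifact, ".".join(map(str, version))))
    return outdated


if __name__ == "__main__":
    # Hypothetical install location; adjust to the actual Spark directory.
    jars_dir = Path("spark-2.4.8-bin-hadoop2.7/jars")
    if jars_dir.is_dir():
        names = [p.name for p in jars_dir.glob("jackson-*.jar")]
        for artifact, version in outdated_jackson_jars(names):
            print(f"{artifact} {version} is older than 2.9.2")
```

Run against the directory listing from the original message, this should flag every Jackson 2.x jar shown there, since all of them are below 2.9.2. It only reports the mismatch; replacing the jars consistently is still a manual step.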
 
[1] https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c

—
Alexey
