Posted to user@spark.apache.org by karan alang <ka...@gmail.com> on 2022/05/24 21:30:52 UTC
GCP Dataproc - adding multiple packages(kafka, mongodb) while submitting spark jobs not working
Hello All,
I have a Structured Streaming job on GCP Dataproc, and I'm trying to pass
multiple packages (Kafka, MongoDB) to the dataproc submit command, but
that is not working.
Command that works (when I add a single dependency, e.g. Kafka):
```
gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true
```
However, when I add the MongoDB package as well (I tried a few options), it
seems to fail.
E.g.:
```
Option 1 :
gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  --files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/reloadpred-chkpoint-user.p12,gs://kafka-certs/reloadpred-user.p12,gs://dataproc-spark-configs/topic-customer-map.cfg,gs://dataproc-spark-configs/params.cfg \
  --region us-east1 \
  --py-files streams.zip,utils.zip
Option 2 :
gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --properties spark.jars.packages='org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2',spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  --files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/reloadpred-chkpoint-user.p12,gs://kafka-certs/reloadpred-user.p12,gs://dataproc-spark-configs/topic-customer-map.cfg,gs://dataproc-spark-configs/params.cfg \
  --region us-east1 \
  --py-files streams.zip,utils.zip
```
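For what it's worth, my understanding (untested assumption, based on `gcloud topic escaping`) is that the `^#^` prefix makes `#` the separator *between* properties, so the remaining separators in Option 1 would need to be `#` rather than commas, which then lets commas stay inside the `spark.jars.packages` value. A sketch of how the properties string would be built under that assumption (package versions copied from above):

```shell
# Assumption: with the ^#^ prefix, '#' separates key=value pairs, so commas
# are safe inside a single property value such as spark.jars.packages.
PACKAGES="org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2"
PROPERTIES="^#^spark:spark.jars.packages=${PACKAGES}#spark:spark.dynamicAllocation.enabled=true#spark:spark.shuffle.service.enabled=true"

# Print the string that would be passed to --properties "${PROPERTIES}"
echo "${PROPERTIES}"
```

The idea being that the whole string is then passed as `--properties "${PROPERTIES}"` in the submit command above; I haven't verified this against Dataproc myself.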
Any pointers on how to fix/debug this?
Details are also in this Stack Overflow question:
https://stackoverflow.com/questions/72369619/gcp-dataproc-adding-multiple-packageskafka-mongodb-while-submitting-jobs-no
tia!