Posted to user@spark.apache.org by karan alang <ka...@gmail.com> on 2022/05/24 21:30:52 UTC

GCP Dataproc - adding multiple packages (kafka, mongodb) while submitting spark jobs not working

Hello All,
I have a Structured Streaming job on GCP Dataproc, and I'm trying to pass
multiple packages (Kafka, MongoDB) to the dataproc submit command, but it
is not working.

Command that works (when I add a single dependency, e.g. Kafka):
```

gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true

```

However, when I add the MongoDB package as well, the job fails. I tried a
few options, e.g.:
```
Option 1:
gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  --files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/reloadpred-chkpoint-user.p12,gs://kafka-certs/reloadpred-user.p12,gs://dataproc-spark-configs/topic-customer-map.cfg,gs://dataproc-spark-configs/params.cfg \
  --region us-east1 \
  --py-files streams.zip,utils.zip

Option 2:
gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --properties spark.jars.packages='org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2',spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  --files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/reloadpred-chkpoint-user.p12,gs://kafka-certs/reloadpred-user.p12,gs://dataproc-spark-configs/topic-customer-map.cfg,gs://dataproc-spark-configs/params.cfg \
  --region us-east1 \
  --py-files streams.zip,utils.zip
```
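
From `gcloud topic escaping`, my understanding is that the `^#^` prefix switches
the --properties list delimiter from `,` to `#`, so the key=value pairs themselves
then need to be separated with `#`, while the comma between the two Maven
coordinates is left untouched. Below is a sketch of what I plan to try next
(untested, trimmed to the relevant flags, with the property keys in the same form
as in the working single-package command):
```

# Untested sketch: '^#^' changes the --properties delimiter to '#', so the
# comma between the kafka and mongodb coordinates is no longer split on.
gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream \
  --region us-east1 \
  --properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.executor.memory=20g#spark.driver.memory=5g#spark.executor.cores=2 \
  --py-files streams.zip,utils.zip

```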

Any pointers on how to fix/debug this?

Details are also in the Stack Overflow question:
https://stackoverflow.com/questions/72369619/gcp-dataproc-adding-multiple-packageskafka-mongodb-while-submitting-jobs-no

tia!