Posted to by Mich Talebzadeh <> on 2021/01/14 16:51:01 UTC

Adding third party specific jars to Spark

The trigger for this was developing code to access BigQuery data from
PyCharm on premises, so that advanced analytics and graphics can be done
locally.

Writes are an issue because the data is buffered in temporary storage in a
GCS bucket before being pushed into BigQuery.
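
In code, the buffering step shows up as an extra option on the write. The
following is only a minimal sketch, assuming the spark-bigquery connector jar
is already on the classpath. The helper names and the bucket name are
hypothetical, introduced here for illustration; temporaryGcsBucket is the
connector option that names the staging bucket.

```python
def bigquery_write_options(temp_bucket: str) -> dict:
    """Build the write options; temporaryGcsBucket is the GCS staging bucket."""
    if not temp_bucket:
        raise ValueError("a temporary GCS bucket is required for buffered writes")
    return {"temporaryGcsBucket": temp_bucket}


def write_to_bigquery(df, table: str, temp_bucket: str, mode: str = "append") -> None:
    """Write a Spark DataFrame to BigQuery, staging the data in temp_bucket first.

    `df` is expected to be a pyspark.sql.DataFrame; `table` is "dataset.table".
    """
    (df.write
       .format("bigquery")
       .options(**bigquery_write_options(temp_bucket))
       .mode(mode)
       .save(table))


# Hypothetical bucket name, for illustration only.
print(bigquery_write_options("my-temp-bucket"))
```

The bucket is given by name only, without a gs:// prefix.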

One option is to use Dataproc clusters for the write-intensive work ($$$)
and then do the reads on premises (Linux) and locally (assuming you have a
powerful enough Windows box). The issue was more with

Making this work is, believe it or not, a bit of an art, as you need to
find the correct version of Spark plus the correct versions of the BigQuery
JAR files that work in tandem.

Anyhow the read and write to BigQuery work with Spark-3.0.1-bin-hadoop3.2/
and the following two JAR files

-rwxr--r--  1 hduser hadoop 33943429 Jan 12 23:30 spark-bigquery-latest_2.12.jar
-rwxr--r--  1 hduser hadoop 17663298 Jan 13 19:20
lrwxrwxrwx  1 hduser hadoop       38 Jan 13 19:22 gcs-connector.jar ->

For me the option that worked *was to put these two jar files in directory *

Adding them to spark.driver.extraClassPath in
$SPARK_HOME/conf/spark-defaults.conf did not work. Using spark-submit in
the PyCharm terminal with --jars introduced other issues.
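
When juggling these mechanisms, it can help to confirm which connector jars a
given directory (for example $SPARK_HOME/jars) actually contains. A small
sketch using only the standard library follows; the two file names are the
ones visible in the listing above, and the function name is hypothetical.

```python
import os
from pathlib import Path

# File names taken from the directory listing in this post.
REQUIRED_JARS = ["spark-bigquery-latest_2.12.jar", "gcs-connector.jar"]


def missing_jars(jars_dir, required=REQUIRED_JARS):
    """Return the required jar names that are not present in jars_dir.

    Path.exists() follows symlinks, so a dangling gcs-connector.jar link
    is reported as missing too.
    """
    jars_dir = Path(jars_dir)
    return [name for name in required if not (jars_dir / name).exists()]


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None:
        print("SPARK_HOME is not set")
    else:
        print("missing:", missing_jars(Path(spark_home) / "jars"))
```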

So in short I put these two files in $SPARK_HOME/jars and it worked. I am
not sure this is ideal, but one advantage is that you can then create a
container jar file, spark-libs.jar,

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

and put it in an HDFS directory so that all nodes of the cluster can access
it. You then need to add it to $SPARK_HOME/conf/spark-defaults.conf
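
The exact line is not shown in this message. On YARN, one property that
serves this purpose is spark.yarn.archive, which points at an archive of
Spark jars to be distributed to the cluster. Whether that is the setting
meant here is an assumption, and the HDFS path below is a hypothetical
placeholder. The sketch only renders the entry in spark-defaults.conf
style (property name, whitespace, value):

```python
def spark_defaults_lines(props: dict) -> str:
    """Render properties in spark-defaults.conf style, one 'key value' per line."""
    return "\n".join(f"{key} {value}" for key, value in props.items())


# Assumption: spark.yarn.archive is the intended property.
# The HDFS path is a hypothetical placeholder, not taken from the post.
props = {"spark.yarn.archive": "hdfs:///jars/spark-libs.jar"}
print(spark_defaults_lines(props))
# prints: spark.yarn.archive hdfs:///jars/spark-libs.jar
```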


If anyone has any suggestions please let me know.
