Posted to user@spark.apache.org by karan alang <ka...@gmail.com> on 2022/02/02 06:50:34 UTC

Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

Hello All,

I'm running a simple Structured Streaming job on GCP Dataproc, which reads data from
Kafka and prints it to the console.

Command:

gcloud dataproc jobs submit pyspark \
    /Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb1.py \
    --cluster dataproc-ss-poc \
    --jars gs://spark-jars-karan/spark-sql-kafka-0-10_2.12-3.1.2.jar,gs://spark-jars-karan/spark-core_2.12-3.1.2.jar \
    --region us-central1

I'm getting this error:

File "/tmp/01c16a55009a42a0a29da6dde9aae4d5/StructuredStreaming_Kafka_GCP-Batch-feb1.py", line 49, in <module>
    df = spark.read.format('kafka')\
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
	at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:599)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateBatchOptions(KafkaSourceProvider.scala:348)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.createRelation(KafkaSourceProvider.scala:128)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Additional details are on StackOverflow:

https://stackoverflow.com/questions/70951195/gcp-dataproc-java-lang-noclassdeffounderror-org-apache-kafka-common-serializa

Do we need to pass any other jars?
What needs to be done to debug/fix this?

Thanks in advance!

Re: Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well, you are now using a package instead of a jar. There is a difference
between the two in spark-submit: --jars adds only the jars you list, while
--packages adds the jar plus all of its dependencies as listed in Maven.
Packages resolve those dependencies via Ivy
<https://ant.apache.org/ivy/>, which is a dependency manager. You can of
course do it manually, working out the missing jars yourself by searching
the .ivy2/jars sub-directory step by step:

cd ~/.ivy2/jars
grep -lRi <missing class>

# example
grep -lRi HttpRequestInitializer

Then add the missing jars to --jars and repeat until everything
works (a bit tedious).
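That hunt can be scripted. A rough sketch, assuming the default Ivy cache
location and using example class names (substitute whatever classes your
stack trace reports as missing):

```shell
# Rough sketch of the manual dependency hunt described above. Assumes the
# default Ivy cache at ~/.ivy2/jars; the class names are just examples.
IVY_CACHE="${HOME}/.ivy2/jars"
for cls in ByteArraySerializer HttpRequestInitializer; do
  echo "== jars containing ${cls}:"
  # grep -lRi searches the jar files recursively (jars are binary, but
  # class names survive as plain strings) and prints only matching file names.
  grep -lRi "${cls}" "${IVY_CACHE}" 2>/dev/null || echo "   (none found)"
done
```

Each jar that turns up can then be added to --jars, repeating until the
NoClassDefFoundError disappears.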


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 2 Feb 2022 at 23:30, karan alang <ka...@gmail.com> wrote:

> Hi Mich, All -
>
> Thanks, I was able to resolve this using the command below:
>
> ---
> gcloud dataproc jobs submit pyspark \
>     /Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb2.py \
>     --cluster dataproc-ss-poc \
>     --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
>     --region us-central1
> ----

Re: Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

Posted by karan alang <ka...@gmail.com>.
Hi Mich, All -

Thanks, I was able to resolve this using the command below:

---
gcloud dataproc jobs submit pyspark \
    /Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb2.py \
    --cluster dataproc-ss-poc \
    --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
    --region us-central1
----
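For comparison, doing the same thing with --jars instead of
spark.jars.packages would mean listing every transitive dependency by hand.
A sketch of what that roughly looks like; the dependency versions are
assumptions read off the spark-sql-kafka-0-10_2.12:3.1.2 POM (verify
against Maven Central), and the bucket paths are illustrative:

```shell
# Hypothetical --jars equivalent of the --packages command above.
# Versions and bucket paths are assumptions, not taken from the thread.
BUCKET="gs://spark-jars-karan"
JARS="${BUCKET}/spark-sql-kafka-0-10_2.12-3.1.2.jar"
JARS="${JARS},${BUCKET}/spark-token-provider-kafka-0-10_2.12-3.1.2.jar"
JARS="${JARS},${BUCKET}/kafka-clients-2.6.0.jar"
JARS="${JARS},${BUCKET}/commons-pool2-2.6.2.jar"
echo "gcloud dataproc jobs submit pyspark job.py --cluster dataproc-ss-poc --region us-central1 --jars ${JARS}"
```

This is exactly the bookkeeping that --packages delegates to Ivy, which is
why the spark.jars.packages route resolved the missing
ByteArraySerializer class (it lives in kafka-clients).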



Re: Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

Posted by Mich Talebzadeh <mi...@gmail.com>.
The current Spark version on GCP Dataproc is 3.1.2.

Try using this jar file instead:

spark-sql-kafka-0-10_2.12-3.0.1.jar
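As an aside, the artifact name encodes both the Scala build and the Spark
version, and both should match the cluster. A tiny sketch of deriving the
--packages coordinate; the version values are assumptions for a Dataproc
image that ships Spark 3.1.2 built against Scala 2.12:

```shell
# Hypothetical helper: build the Kafka connector coordinate from the
# cluster's Scala and Spark versions. Values are assumptions for a
# Dataproc cluster running Spark 3.1.2 on Scala 2.12.
SCALA_VERSION="2.12"
SPARK_VERSION="3.1.2"
COORDINATE="org.apache.spark:spark-sql-kafka-0-10_${SCALA_VERSION}:${SPARK_VERSION}"
echo "${COORDINATE}"
# → org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
```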


HTH



