Posted to user@spark.apache.org by Shahab Yunus <sh...@gmail.com> on 2018/03/16 05:17:16 UTC

Accessing Scala RDD from pyspark

Hi there.

I am calling custom Scala code from the pyspark interpreter. The custom
Scala code is simple: it just reads a text file using sparkContext.textFile
and returns an RDD[String].

In pyspark, I am using sc._jvm to make the call to the Scala code:

    s_rdd = sc._jvm.package_name.class_name.method()

It returns a py4j.JavaObject. Now I want to use this in pyspark, so I wrap
it as follows:

    py_rdd = RDD(s_rdd, sparkSession)
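
For reference, here is a minimal, self-contained sketch of what I am
running on the pyspark side (package_name, class_name and method are
placeholders for my actual Scala package, object and method, and I create
the SparkSession explicitly here instead of using the one the pyspark
shell already provides):

    from pyspark.sql import SparkSession
    from pyspark.rdd import RDD

    # In the pyspark shell these already exist; created here only so the
    # snippet is self-contained.
    sparkSession = SparkSession.builder.appName("scala-rdd-test").getOrCreate()
    sc = sparkSession.sparkContext

    # Call into the custom Scala code through the py4j gateway. This
    # returns a py4j.JavaObject wrapping the Scala RDD[String].
    s_rdd = sc._jvm.package_name.class_name.method()

    # Wrap the JavaObject in a pyspark RDD. No error is raised here.
    py_rdd = RDD(s_rdd, sparkSession)

    # This is the call that fails with "Method rdd([]) does not exist".
    py_rdd.count()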

There is no error yet. But when I call any RDD method on py_rdd (e.g.
py_rdd.count()), I get the following error:

    py4j.protocol.Py4JError: An error occurred while calling o50.rdd. Trace:
    py4j.Py4JException: Method rdd([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)

Why is that? What am I doing wrong?

Using:
Scala version 2.11.8
(OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Spark 2.0.2
Hadoop 2.7.3-amzn-0


Thanks & Regards,
Shahab