Posted to user@spark.apache.org by Nick Ruest <ru...@gmail.com> on 2020/05/26 01:04:20 UTC

PySpark .collect() output to Scala Array[Row]

Hi,

I've hit a wall with trying to implement a couple of Scala methods in a
Python version of our project. I've implemented a number of these
already, but I'm getting hung up with this one.

My Python function looks like this:

def Write_Graphml(data, graphml_path, sc):
    return sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML(
        data, graphml_path).apply


Where data is a DataFrame that has been collected, i.e. the output of data.collect().

On the Scala side it is basically:

object WriteGraphML {
  def apply(data: Array[Row], graphmlPath: String): Boolean = {
    // ... massages an Array[Row] into GraphML ...
    true
  }
}

When I try to use it in PySpark, I end up getting this error message:

Py4JError: An error occurred while calling
None.io.archivesunleashed.app.WriteGraphML. Trace:
py4j.Py4JException: Constructor
io.archivesunleashed.app.WriteGraphML([class java.util.ArrayList, class
java.lang.String]) does not exist
	at
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
	at
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
	at py4j.Gateway.invoke(Gateway.java:237)
	at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


I originally dug into what the error message stated, and tried a variety
of tweaks such as:

sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML.apply(
    data, graphml_path)

I even went as far as trying getattr, calling "WriteGraphML$", and a few
other variations on that approach.

All of these attempts produced some variety of the error message above: that
the constructor or method does not exist.

Based on lots of Googling and Stack Overflow searches I came across this[1],
and it has me thinking that the problem is how Py4J hands the Python list
(data) off to the JVM and then on to Scala: it ends up as an ArrayList
instead of an Array[Row].

Do I need to tweak data before it is passed to Write_Graphml? Or am I
doing something else wrong here?

I had originally posted a version of this message to the dev list, and
Sean Owen suggested WriteGraphML should be implemented as a class, not
an object. Is that the right path? I have a number of other Scala
functions implemented in the PySpark side of our project that are
objects, and everything works fine.

...and is there a best practices guide or documentation for implementing
Scala functions in PySpark? I've found a number of blog posts that have
been helpful.

Thanks in advance for any help!

cheers!

-nruest

[1]
https://stackoverflow.com/questions/61928886/pyspark-list-to-scala-sequence


Re: PySpark .collect() output to Scala Array[Row]

Posted by Wim Van Leuven <wi...@highestpoint.biz>.
Looking at the stack trace, your data from Spark gets serialized to an
ArrayList (of something), whereas in your Scala code you are expecting an
Array of Rows. So the types don't line up. That's the exception you are
seeing: the JVM searches for a signature that simply does not exist.

Try to turn the Array into a java.util.ArrayList?
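
Something like this on the Scala side, as an untested sketch (it also assumes
the list elements really do arrive as org.apache.spark.sql.Row, which I have
not checked):

import java.util.{List => JList}
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

object WriteGraphML {
  // Accept what Py4J actually sends (a java.util.List), then convert it.
  def apply(data: JList[Row], graphmlPath: String): Boolean = {
    val rows: Array[Row] = data.asScala.toArray
    // ... existing GraphML generation over rows ...
    true
  }
}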
-w
