Posted to issues@spark.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2018/04/02 06:34:00 UTC

[jira] [Created] (SPARK-23842) accessing java from PySpark lambda functions

Ruslan Dautkhanov created SPARK-23842:
-----------------------------------------

             Summary: accessing java from PySpark lambda functions
                 Key: SPARK-23842
                 URL: https://issues.apache.org/jira/browse/SPARK-23842
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.3.0, 2.2.1, 2.2.0
            Reporter: Ruslan Dautkhanov


Copied from https://github.com/bartdag/py4j/issues/311 but it seems to be more of a Spark issue than a py4j one.

|We have a third-party Java library that is distributed to Spark executors through the {{--jars}} parameter.
We want to call a static Java method from that library on the executor side through Spark's {{map()}}, or to create an object of one of that library's classes through a {{mapPartitions()}} call.
None of the approaches we tried has worked so far. It seems Spark tries to serialize everything it sees in a lambda function and distribute it to the executors.
I am aware of an older py4j issue/question [#171|https://github.com/bartdag/py4j/issues/171], but that discussion isn't helpful here.
We thought we could create a reference to that "class" through a call like {{genmodel = spark._jvm.hex.genmodel}} and then use py4j to expose the library's functionality inside PySpark executors' lambda functions.
We don't want Spark to try to serialize the Spark session variable {{spark}} or its reference to the py4j gateway {{spark._jvm}} (that leads to the expected non-serializable exceptions), so we tried to "trick" Spark into not serializing them by nesting the above {{genmodel = spark._jvm.hex.genmodel}} inside an {{exec()}} call.
That led to another issue: neither the {{spark}} (Spark session) nor the {{sc}} (Spark context) variable is available inside executors' lambda functions. So we're stuck and don't know how to call a generic Java class through py4j on the executor side (from within {{map}} or {{mapPartitions}} lambda functions).
This would be an easier adventure from Scala/Java, since those can call the third-party library's methods directly, but our users are asking for a way to do the same from PySpark.|
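
A minimal sketch of the pattern described above, for clarity. The {{hex.genmodel}} package name comes from the report; the class and method names used inside the lambda are purely hypothetical placeholders for the third-party library's API.

{code:python}
# Sketch of the reported problem; not a working solution.
# hex.genmodel is the package mentioned in the report; SomeModelClass.score()
# is a hypothetical stand-in for the third-party library's static method.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])

# Works on the driver: spark._jvm is a py4j view into the driver-side JVM,
# so classes from a jar passed via --jars are reachable here.
genmodel = spark._jvm.hex.genmodel

# Fails: the lambda closes over spark / spark._jvm, which are not picklable,
# and the Python worker processes on the executors have no py4j gateway to a
# JVM even if serialization could somehow be avoided.
result = rdd.map(
    lambda x: spark._jvm.hex.genmodel.SomeModelClass.score(x)
).collect()
{code}

One commonly suggested workaround (not what the reporter is asking for, but related): wrap the library call in a small Java class implementing Spark's Java UDF interface, register it from Python (e.g. {{spark.udf.registerJavaFunction}} in 2.3), and let the call run entirely inside the executor JVM; or do the per-partition Java work in Scala/Java and expose only the result to PySpark.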




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org