Posted to user@livy.apache.org by "Rabe, Jens" <je...@iwes.fraunhofer.de> on 2018/10/08 10:31:20 UTC

Submitting a PySpark batch job ignores jars sent with it

Hello,

I defined a custom format for reading data into Spark. It works from Scala Spark and from e.g. Zeppelin, including with PySpark.

I am now trying to use it from Livy. I POST something like this to http://mylivy:8998/batches:

{
  "file":"/path/to/myjob.py",
  "args":["foo", "bar"],
  "jars":"/path/to/myformat-assembly.jar"
}
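As a side note, Livy's REST documentation describes "jars" as a list of strings rather than a single string, so a JSON array is the safer form. A minimal sketch of submitting the same batch from Python (the host name and paths are the placeholders from the example above; nothing here is executed against a real server):

```python
import json
from urllib.request import Request, urlopen


def submit_batch(livy_url, payload):
    """POST a batch job to a Livy server's /batches endpoint."""
    req = Request(
        f"{livy_url}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urlopen(req)  # returns the HTTP response object


payload = {
    "file": "/path/to/myjob.py",
    "args": ["foo", "bar"],
    # "jars" as a JSON array of strings, per the Livy REST API docs.
    "jars": ["/path/to/myformat-assembly.jar"],
}

# submit_batch("http://mylivy:8998", payload)  # uncomment against a real server
```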

In the log I can see that the jar is loaded and added:

    "2018-10-08 12:23:28 INFO  SparkContext:54 - Added JAR file:/// path/to/myformat-assembly.jar at spark://172.30.10.10:45613/jars/ myformat-assembly.jar with timestamp 1538994208755"

But my PySpark job doesn't find the format:

        "Traceback (most recent call last):",
        "  File \"/path/to/myjob.py \", line 13, in <module>",
        "    data = spark.read.format(\"my.custom.format\").load(path)",
        "  File \"/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 166, in load",
        "  File \"/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py\", line 1257, in __call__",
        "  File \"/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 63, in deco",
        "  File \"/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py\", line 328, in get_return_value",
        "py4j.protocol.Py4JJavaError: An error occurred while calling o29.load.",
        ": java.lang.ClassNotFoundException: Failed to find data source: my.custom.format. Please find packages at http://spark.apache.org/third-party-projects.html",

Opening an interactive session (which loads the same library jar) and running the equivalent command fails in the same way.

However, when I add a simple object to this library, calling it works (e.g. via sc._jvm.somepackage.Foo.bar()).

What am I missing?

RE: Submitting a PySpark batch job ignores jars sent with it

Posted by "Rabe, Jens" <je...@iwes.fraunhofer.de>.
Please disregard, I was using an obsolete version of the jar which indeed did not contain the classes...
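For anyone hitting the same stale-jar problem, the jar's contents can be checked directly without rerunning the job, since a jar is just a zip archive. A small sketch (the class name my.custom.format.DefaultSource is illustrative; Spark resolves V1 data sources by looking for a DefaultSource class in the named package):

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar archive contains the given fully qualified class."""
    # Class files are stored under their package path, e.g.
    # my.custom.format.DefaultSource -> my/custom/format/DefaultSource.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Example (paths and class name as in the thread):
# jar_contains_class("/path/to/myformat-assembly.jar",
#                    "my.custom.format.DefaultSource")
```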

From: Rabe, Jens <je...@iwes.fraunhofer.de>
Sent: Monday, 8 October 2018 12:31
To: user@livy.incubator.apache.org
Subject: Submitting a PySpark batch job ignores jars sent with it

[quoted original message omitted]