Posted to issues@spark.apache.org by "David Lassiter (Jira)" <ji...@apache.org> on 2022/07/19 04:54:00 UTC

[jira] [Updated] (SPARK-39813) Unable to connect to Presto in Pyspark: java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver

     [ https://issues.apache.org/jira/browse/SPARK-39813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Lassiter updated SPARK-39813:
-----------------------------------
    Description: 
My team has a bash script and a Python script that use PySpark to extract data from a Hive database. The scripts work when run on a server, but we need to containerize the job since we will not have access to that server in the future.

Thus, I am trying to get the job running from a container.

When I try to run the scripts locally or in a Docker container, I run into driver issues. Unfortunately, nobody on my team helped set up the environment on the server, where everything works, so we are having a hard time figuring out what is wrong with our local/containerized environments and cannot replicate a successful run.
From a container, I run a bash script that does the following:

```bash
$SPARK_HOME/bin/spark-submit etl_job.py
```

The contents of `etl_job.py` are as follows:

```python
print('\n\nStarting python job\n\n')

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from credentials import PRESTO_USER, PRESTO_PASSWORD, PRESTO_URL, PRESTO_SSL, SSL_PASSWORD, TEST_SCHEMA, TEST_TABLE
import pandas as pd

print('\n\nStarting spark session\n\n')
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

print('\n\nConnecting to Presto\n\n')
# Build a JDBC reader configured for Presto.
Prestoprod = (
    spark.read.format("jdbc")
    .option("url", PRESTO_URL)
    .option("user", PRESTO_USER)
    .option("password", PRESTO_PASSWORD)
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
    .option("SSL", "true")
    .option("SSLKeyStorePath", PRESTO_SSL)
    .option("SSLKeyStorePassword", SSL_PASSWORD)
)

print('\n\nSuccessfully connected to Presto.\n\n')
results = (
    Prestoprod.option(
        "query",
        f"select * from hive.{TEST_SCHEMA}.{TEST_TABLE}",
    )
    .load()
)
```
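
Note that the chained `.option(...)` calls above only build a reader configuration; Spark does not attempt to load the driver or open a connection until `.load()` runs, which is why the "Successfully connected to Presto." message prints even though the driver class is missing.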

 

However, when I run the job, I get the following error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver
```
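
As I understand it, this exception means the JVM cannot find the Presto JDBC driver jar on the classpath inside the container; on the server it was presumably supplied through some classpath setting we have not identified. A minimal sketch of one way to supply it when building the session (the Maven coordinate is real, but the version below is only an example and would need to match our Presto server):

```python
from pyspark.sql import SparkSession

# Sketch only: ask Spark to fetch the Presto JDBC driver from Maven Central at
# startup so the jar ends up on both the driver and executor classpaths.
# The version is illustrative, not taken from our environment.
spark = (
    SparkSession.builder
    .appName("etl_job")
    .config("spark.jars.packages", "com.facebook.presto:presto-jdbc:0.272")
    .getOrCreate()
)
```

Alternatively, the jar could be baked into the image and passed explicitly, e.g. `$SPARK_HOME/bin/spark-submit --jars /path/to/presto-jdbc.jar etl_job.py` (path hypothetical). Either way, the classpath is fixed when the JVM starts, so this must be set before the session is created, not on the reader.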

 

Here is the full log:

```

Starting python job

 

Starting spark session 

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/07/19 04:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Connecting to Presto

 

Successfully connected to Presto.

Traceback (most recent call last):
  File "/home/jovyan/etl_job.py", line 30, in <module>
    .load()
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in _{_}call{_}_
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver
        at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:101)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:101)
        at scala.Option.foreach(Option.scala:437)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:101)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:39)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:833)

```
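
A quick sanity check that should raise the same ClassNotFoundException if the jar is missing (debugging sketch only; `_jvm` is a private PySpark attribute):

```python
# Debugging sketch: raises a Py4J error wrapping ClassNotFoundException
# if the Presto driver jar is not visible on the driver's classpath.
spark._jvm.java.lang.Class.forName("com.facebook.presto.jdbc.PrestoDriver")
```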

> Unable to connect to Presto in Pyspark: java.lang.ClassNotFoundException: com.facebook.presto.jdbc.PrestoDriver
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39813
>                 URL: https://issues.apache.org/jira/browse/SPARK-39813
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Environment: I am running this in a Docker container built from the jupyter/all-spark-notebook image.
>            Reporter: David Lassiter
>            Priority: Major
>              Labels: AWS
>   Original Estimate: 24h
>  Remaining Estimate: 24h


