You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/21 21:42:21 UTC

[GitHub] [hudi] lewisrobbins opened a new issue, #6171: [SUPPORT] ClassNotFound Exception when saving DataFrame as Hudi

lewisrobbins opened a new issue, #6171:
URL: https://github.com/apache/hudi/issues/6171

   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create an EMR Cluster with EMR release 6.6.0
   2. Create a simple Spark/Hudi job3. 
   3. Submit to spark with `spark-submit hudi.py --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar`
   4. Observe stack trace error
   
   
   ```python
   import sys
   from pyspark.sql import SparkSession
   
   
   spark = (SparkSession
        .builder
        .appName("PythonMnMCount")
        .getOrCreate())
   
   # Create a DataFrame
   inputDF = spark.createDataFrame(
       [
           ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
           ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
           ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
           ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
           ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
           ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
       ],
       ["id", "creation_date", "last_update_time"]
   )
   
   # Specify common DataSourceWriteOptions in the single hudiOptions variable
   hudiOptions = {
   'hoodie.table.name': 'my_hudi_table',
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.partitionpath.field': 'creation_date',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.table': 'my_hudi_table',
   'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
   }
   
   # Write a DataFrame as a Hudi dataset
   inputDF.write \
   .format('org.apache.hudi') \
   .option('hoodie.datasource.write.operation', 'insert') \
   .options(**hudiOptions) \
   .mode('overwrite') \
   .save('s3://my-bucket/myhudidataset/')
   ```
   
   I have also tried saving with `inputDF.write.format('hudi`), but I get the same error
   
   **Expected behavior**
   
   Spark saves data as Hudi in S3
   
   **Environment Description**
   
   * Hudi version : 0.10.1-amzn-0
   
   * Spark version : 3.2.0
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   Stack trace:
   
   ```
   Traceback (most recent call last):
     File "/home/hadoop/hudi.py", line 42, in <module>
       .save('s3://us-flights-data/myhudidataset/')
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 740, in save
     File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
     File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o88.save.
   : java.lang.ClassNotFoundException: 
   Failed to find data source: org.apache.hudi. Please find packages at
   http://spark.apache.org/third-party-projects.html
          
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:443)
   	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:670)
   	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
   	at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   	at java.lang.Thread.run(Thread.java:750)
   Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource
   	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
   	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
   	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
   	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:656)
   	at scala.util.Try$.apply(Try.scala:213)
   	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:656)
   	at scala.util.Failure.orElse(Try.scala:224)
   	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656)
   	... 16 more
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] mahesh2247 commented on issue #6171: [SUPPORT] ClassNotFound Exception when saving DataFrame as Hudi in EMR

Posted by "mahesh2247 (via GitHub)" <gi...@apache.org>.

mahesh2247 commented on issue #6171:
URL: https://github.com/apache/hudi/issues/6171#issuecomment-1407386524

   Please help @rmahindra123 . Getting same error while running in Glue Notebook. I have specified conf files using %spark_conf magic
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] rmahindra123 commented on issue #6171: [SUPPORT] ClassNotFound Exception when saving DataFrame as Hudi in EMR

Posted by GitBox <gi...@apache.org>.

rmahindra123 commented on issue #6171:
URL: https://github.com/apache/hudi/issues/6171#issuecomment-1192218126

   @lewis262626 While i try to reproduce the steps, could you try specifying the confs before the main class?
   
   park-submit  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar hudi.py


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] lewis262626 commented on issue #6171: [SUPPORT] ClassNotFound Exception when saving DataFrame as Hudi in EMR

Posted by GitBox <gi...@apache.org>.

lewis262626 commented on issue #6171:
URL: https://github.com/apache/hudi/issues/6171#issuecomment-1192972444

   @rmahindra123 That fixed it, cheers.
   
   Closing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] lewis262626 closed issue #6171: [SUPPORT] ClassNotFound Exception when saving DataFrame as Hudi in EMR

Posted by GitBox <gi...@apache.org>.

lewis262626 closed issue #6171: [SUPPORT] ClassNotFound Exception when saving DataFrame as Hudi in EMR
URL: https://github.com/apache/hudi/issues/6171


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org