Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/17 21:29:30 UTC

[GitHub] [hudi] zafer-sahin opened a new issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

zafer-sahin opened a new issue #2498:
URL: https://github.com/apache/hudi/issues/2498


   
   - Hudi is not able to read a MERGE_ON_READ table when using versions [0.6.0] and [0.7.0]. When I run the same code with version [0.5.3], I am able to read the table written with the merge-on-read option.
   
   
   **Steps to reproduce the behavior:**
   
   **1.** Start a pyspark shell:
   **2.** `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   or
   `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   **3.**
   ```
   >>> S3_SNAPSHOT = <snapshot location>
   >>> S3_MERGE_ON_READ = <location to replicate data>
   >>> from pyspark.sql.functions import *
   >>> df = spark.read.parquet(S3_SNAPSHOT)
   >>> df.count()
   21/01/27 14:49:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   950897550
   >>> hudi_options_insert = {
   ...     "hoodie.table.name": "sample_schema.table_name",
   ...     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   ...     "hoodie.datasource.write.recordkey.field": "id",
   ...     "hoodie.datasource.write.operation": "bulk_insert",
   ...     "hoodie.datasource.write.partitionpath.field": "ds",
   ...     "hoodie.datasource.write.precombine.field": "id",
   ...     "hoodie.insert.shuffle.parallelism": 135
   ...     }
   >>> df.write.format("hudi").options(**hudi_options_insert).mode("overwrite").save(S3_MERGE_ON_READ)
   ```
   
   
   **4.** Load the table back into a dataframe (this is the step that fails; see **Stacktrace** below).
   
   **Expected behavior**
   
   Data loads into a dataframe without error when the spark shell is created with the parameters:
   `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   
   **Environment Description** (EMR)

   * Hudi version : [0.6.0] and [0.7.0] give the error; [0.5.3] runs fine
   
   * Spark version : [2.4.4], [3.0.1]
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no 
   
   
   
   
   **Stacktrace**
   
   ```
           >>> df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o86.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   


vinothchandar commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-775719782


   We will keep investigating. 


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-792904115


   We know spark3 bundling has some issues; they will be fixed in an upcoming release: https://issues.apache.org/jira/browse/HUDI-1568
   If you are also facing the issue w/ spark2, then we might need to investigate further. 


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-900643746


   Not sure if hudi has been tested or certified with spark 3.1.2. Can you give it a try with 3.0.1? Then we can go from there. 


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-904370726


   @pete91z : I guess we did not expose separate artifacts for spark2 and spark3 in 0.6.0 and 0.7.0, which we fixed in 0.8.0. So the spark-bundle you get with 0.7.0 is compiled against spark2 and scala 2.11. I have asked @codope to verify that. If that's the case, the only option I can think of is to download the source and build hudi-spark-bundle for spark3 and then use that bundle. 
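   If you go that route, the build invocation another user reports later in this thread was `mvn clean package -DskipTests -Dspark3 -Dscala-2.12`, run from the root of a Hudi source checkout (a sketch; the exact profile flags can vary by release).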
   


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-999991849


   Closing out the issue, as it's a jar mismatch issue and we have some proposed solutions above. 


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-769423274


   Can you try the config "hoodie.datasource.write.table.type" and set it to MERGE_ON_READ?


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-903879741


   @codope is looking into this. we will get back to you in a day or two. 


codope commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-904323753


   I can reproduce the same behavior with Hudi 0.7.0 but not with 0.8.0 on Apache Spark 3.0.1.


zafer-sahin edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-770769437


   @nsivabalan I was able to execute all steps in the [quick start](https://hudi.apache.org/docs/quick-start-guide.html) successfully, and I could reproduce the issue by changing the storage type in the hudi options: I changed the storage type of the quick start example to merge_on_read and it failed as well. Here is the modification I applied. 
   
   `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   
   ```
   tableName = "hudi_trips_cow"
   basePath = "s3://<bucket>/tmp/hudi_trips_mor"
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   ```
   
   The **storage type** below has been modified, and I am getting an error **when I read the file.**
   ```
   hudi_options = {
     'hoodie.table.name': tableName,
     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",  
     'hoodie.datasource.write.recordkey.field': 'uuid',
     'hoodie.datasource.write.partitionpath.field': 'partitionpath',
     'hoodie.datasource.write.table.name': tableName,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'ts',
     'hoodie.upsert.shuffle.parallelism': 2, 
     'hoodie.insert.shuffle.parallelism': 2
   }
   
   df.write.format("hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(basePath)
   
   tripsSnapshotDF = spark. \
     read. \
     format("hudi"). \
     load(basePath + "/*/*/*/*")
   ```
   
   
   Please find the error stack below.
   ```
   An error occurred while calling o267.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   ```


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-942292643


   @saqibalimalik : As mentioned before, if you are running EMR's version of spark, you might have to route your support request to EMR folks (CC @umehrot2 ). If it's apache spark, we can look into it. 
   


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-904371497


   btw, 0.8.0 has 3 different bundles for hudi-spark-bundle, so use the one meant for spark3 if that's your requirement. 
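   For example, on spark3 the shell launch would look something like this (a sketch: the `hudi-spark3-bundle_2.12` artifact name matches the one used elsewhere in this thread, and the spark-avro coordinate is an assumption that should match your runtime Spark):
   `pyspark --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`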


jmnatzaganian edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-974507962


   I'm also having the same type of issue in EMR 6.4 after building and deploying Hudi 0.9.0. Note that as mentioned [above](https://github.com/apache/hudi/issues/2498#issuecomment-969228521), the default binaries work just fine (EMR 6.4 with Hudi 0.8.0).
   
   It seems that there's likely something off with the build or referencing. I used `mvn clean package -DskipTests -Dspark3 -Dscala-2.12 -T 30`.
   
   What's really interesting is that I can create an MoR table w/o issue, but doing a `load` renders the loaded DF unusable: the DF appears to load, but then fails on use.
   
   This [tip](https://github.com/apache/hudi/issues/2498#issuecomment-942282671) also worked for me (i.e. using `spark.sql` and referencing the table from the Glue data catalog). Unfortunately, querying the data this way seems to be *much* slower (compared to 0.8.0).
   
   I documented my build and installation process in [this](https://apache-hudi.slack.com/archives/C4D716NPQ/p1637354714476100) slack thread.
   
   Edit:
   I tested this with a CoW table and I did not have the issue, i.e. the following works just fine. It did, however, take 2.7x longer to do the read than it did in 0.8.0.
   ```
   df = spark.read.format("org.apache.hudi").load(path)
   df.show()
   ```


pete91z commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-899427179


   I am seeing this issue with MOR tables using Apache Spark 3.1.2 (not using AWS EMR) and Hudi 0.7.0. Is it possible to re-open, please? Or is it fixed in 0.8.0?
   
   Context: I have created a table using deltastreamer. Deltastreamer appears to work fine, but later, when I try to create a dataframe to read the table in pyspark, I get the following error:
   
    ```
    Traceback (most recent call last):
     File "./init_hudi_for_billing_ds", line 45, in <module>
       billingDF=sqlContext.read.format("hudi").load(basePath+"/*/*")
     File "/home/spark_311/py1/lib64/python3.6/dist-packages/pyspark/sql/readwriter.py", line 204, in load
       return self._df(self._jreader.load(path))
     File "/home/spark_311/py1/lib64/python3.6/dist-packages/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/home/spark_311/py1/lib64/python3.6/dist-packages/pyspark/sql/utils.py", line 111, in deco
       return f(*a, **kw)
     File "/home/spark_311/py1/lib64/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o30.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
    	at java.lang.Thread.run(Thread.java:748)
    ```
    
    I've tried setting the table.type as MERGE_ON_READ in the hudi options, but it has no effect. These errors are not seen with COPY_ON_WRITE tables.


JohnEngelhart commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-969228521


   My team was running into this issue. We are running EMR 6.4, Spark 3.1.2, Hudi 0.8.0. We eventually found that you need to provide the EMR dependencies in your spark submit/shell/notebook. Amazon makes its own slightly modified version of Spark/Hudi etc., so in your POM set spark hudi bundle, spark avro, spark core, and spark sql all to provided. EMR does not include hudi/avro natively on its classpath, though, so you need to include them in your --jars config as noted here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
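   For example, the shell launch on EMR would look something like this (a sketch; these are the jar paths the linked AWS doc uses, and they may differ across EMR releases):
   `pyspark --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`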
   
   Hopefully it saves someone the trouble that my team went through.


nsivabalan closed issue #2498:
URL: https://github.com/apache/hudi/issues/2498


   


pete91z commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-901071172


   Same error experienced with Spark 3.0.1:
   
    ```
    [2021-08-18 12:24:50,117] WARN Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties (org.apache.hadoop.metrics2.impl.MetricsConfig:134)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/spark_301/py1/lib64/python3.6/dist-packages/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/home/spark_301/py1/lib64/python3.6/dist-packages/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/home/spark_301/py1/lib64/python3.6/dist-packages/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/home/spark_301/py1/lib64/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o30.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
    	at java.lang.Thread.run(Thread.java:748)
    ```


green2k commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-785274437


   There's exactly the same problem with Spark hosted on Databricks.


garyli1019 commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-786364265


   I am seeing the same problem when the compiled spark distribution is different from the runtime spark distribution. Compiling the hudi jar against the runtime spark distribution should fix this problem. @green2k @andormarkus 


nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-904371098


   oops, missed reading Sagar's full comment. Looks like he has verified it. 


saqibalimalik edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-916483215


   I see the same error when I try to read a MOR table using spark. No issues querying using hive/presto.
   
   `Spark 3.0.1 and hudi-spark3-bundle_2.12:0.9.0`
   
   Reading as 
   ```python
   queryHudiRead = spark.read.format("org.apache.hudi").load("s3://bucket/table")
   queryHudiRead.show()
   ```
   
   Getting below error
   ```bash
   An error was encountered:
   An error occurred while calling o85.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$6(MergeOnReadSnapshotRelation.scala:217)
   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
   	at scala.collection.immutable.List.foreach(List.scala:392)
   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
   	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
   	at scala.collection.immutable.List.map(List.scala:298)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:209)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildScan(MergeOnReadSnapshotRelation.scala:110)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$apply$4(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:331)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:408)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:330)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:160)
   	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1429)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:75)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:365)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:94)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:94)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:87)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:107)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:104)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$5(QueryExecution.scala:219)
   	at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:381)
   	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:219)
   	at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:99)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:132)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:104)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:132)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:131)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3665)
   	at org.apache.spark.sql.Dataset.head(Dataset.scala:2737)
   	at org.apache.spark.sql.Dataset.take(Dataset.scala:2944)
   	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
   	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   ```


vinothchandar commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-789071166


   let me bump up the severity for this issue and we will try to repro and fix it properly in 0.8.0


pete91z commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-900945358


   Thanks for the response, I will try on 3.0.1 and get back to you.


parisni commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-942282671


   same issue here:
   ```
   # This fails with the above error
   spark.sql("select * from my_table_rt").show() 
   # also this fails with the same error
   spark.read.format("hudi").load(my_table_path).show()
   # this works 
   spark.sql("select * from my_table_ro").show()
   ```
   
   Using our own spark 2.4.4 build, compiled with the glue metastore, on EMR 5. 
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o138.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
           at org.apache.hudi.MergeOnReadSnapshotRelation$$anonfun$7.apply(MergeOnReadSnapshotRelation.scala:217)
           at org.apache.hudi.MergeOnReadSnapshotRelation$$anonfun$7.apply(MergeOnReadSnapshotRelation.scala:209)
           at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
           at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
           at scala.collection.immutable.List.foreach(List.scala:392)
           at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
           at scala.collection.immutable.List.map(List.scala:296)
           at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:209)
           at org.apache.hudi.MergeOnReadSnapshotRelation.buildScan(MergeOnReadSnapshotRelation.scala:110)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:309)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:309)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:342)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:341)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:419)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:337)
           at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:305)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
           at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
           at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
           at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
           at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
           at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
           at scala.collection.Iterator$class.foreach(Iterator.scala:891)
           at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
           at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
           at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
           at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
           at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
           at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
           at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
           at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
           at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
           at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
           at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
           at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
           at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
           at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
           at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
           at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
           at py4j.Gateway.invoke(Gateway.java:282)
           at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
           at py4j.commands.CallCommand.execute(CallCommand.java:79)
           at py4j.GatewayConnection.run(GatewayConnection.java:238)
           at java.lang.Thread.run(Thread.java:748)
   
   ```


dmenin commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-975581549


   I can confirm I have the same problem with Spark 3.0 and Hudi 0.9:
   
   ```
   py4j.protocol.Py4JJavaError: An error occurred while calling o158.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$3(MergeOnReadSnapshotRelation.scala:173)
   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
   ```
   





[GitHub] [hudi] nsivabalan edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-769918590


   @zafer-sahin : not sure if it's some env issue. Were you able to run the pyspark examples given in the [quick start](https://hudi.apache.org/docs/quick-start-guide.html)? If that works but just MOR fails, then we can look into it. If you haven't tried it, can you try it and let us know? Also, your precombine field should be something like a timestamp and can't be the same as the record key; it is used to determine the ordering of multiple entries for the same record key.
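   
   A minimal sketch of that separation (a hedged example; the `uuid`/`ts` field names are borrowed from the quick start, not from this issue, and `df`/`basePath` are assumed to be the quick start's):
   
   ```python
   # Hedged sketch: the precombine field ("ts") is a timestamp column distinct
   # from the record key ("uuid"). When two writes carry the same record key,
   # the row with the larger "ts" wins during merging.
   hudi_options = {
       "hoodie.table.name": "hudi_trips_mor",
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.recordkey.field": "uuid",
       "hoodie.datasource.write.partitionpath.field": "partitionpath",
       "hoodie.datasource.write.precombine.field": "ts",  # not the record key
       "hoodie.datasource.write.operation": "upsert",
   }
   df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
   ```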





[GitHub] [hudi] parisni commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
parisni commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-942307668


   @nsivabalan 
   
   We are using OSS Spark 2.4.4 on EMR 5.
   The Hudi bundle is: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.9.0,org.apache.spark:spark-avro_2.11:2.4.4`





[GitHub] [hudi] nsivabalan commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-772510104


   I have filed a JIRA and will follow up there: https://issues.apache.org/jira/browse/HUDI-1578. Thanks.
   Closing this for now.





[GitHub] [hudi] pete91z commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
pete91z commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-900945358









[GitHub] [hudi] andormarkus commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
andormarkus commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-775454500


   Hi @nsivabalan 
   
   We created a MOR Hudi table with the Hudi DeltaStreamer [0.7.0].
   We tried to read the table with PySpark (Python) and Spark (Scala), and in both cases we got the above-mentioned error.
   
   We created a COW Hudi table with the DeltaStreamer and could read it with both PySpark (Python) and Spark (Scala).





[GitHub] [hudi] andormarkus commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
andormarkus commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-776172403


   Thanks @vinothchandar 
   As soon as someone confirms Apache Spark is not affected by this issue, I can raise an AWS Support ticket.





[GitHub] [hudi] Magicbeanbuyer commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
Magicbeanbuyer commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-775872798


   Hey @vinothchandar,
   
   we've come across the same issue when reading a MERGE_ON_READ table using Spark. We consume data from our AWS MSK topic, write the data using `deltastreamer` on AWS EMR, and store the data in an S3 bucket.
   
   Following is our implementation.
   
   ### Write data
   
   ```
   spark-submit \
     --jars /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"\
     --conf "spark.sql.hive.convertMetastoreParquet=false" \
     /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
     --spark-master yarn \
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
     --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
     --table-type MERGE_ON_READ \
     --source-ordering-field id \
     --target-base-path $target_base_path \
     --target-table $target_table \
     --hoodie-conf "hoodie.deltastreamer.schemaprovider.source.schema.file=$schema_file_path" \
     --hoodie-conf "hoodie.deltastreamer.schemaprovider.target.schema.file=$schema_file_path" \
     --hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
     --hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
     --hoodie-conf "hoodie.datasource.write.partitionpath.field=partitiontime:TIMESTAMP" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ss.SSSZ" \
     --hoodie-conf "hoodie.datasource.write.hive_style_partitioning=true" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=milliseconds" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timezone=UTC" \
     --hoodie-conf "hoodie.deltastreamer.source.kafka.topic=$kafka_topic" \
     --hoodie-conf "bootstrap.servers=$kafka_bootstrap_servers" \
     --hoodie-conf "auto.offset.reset=earliest"
   ```
   The Hudi table is generated in our S3 bucket without a problem. However, an error is thrown when we try to read it using either `python` or `scala`.
   
   ### Read Data
   #### Scala
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf "spark.sql.hive.convertMetastoreParquet=false"
   ```
   Trying to load data
   ```
   val basePath="s3://path/to/base/table"
   val df = spark.read.format("hudi").load(basePath + "/*/*/*/*")
   ```
   Error message 
   ```
   java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
     at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
     at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
     at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
     ... 47 elided
   ```
   
   #### Python
   ```
   pyspark \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf "spark.sql.hive.convertMetastoreParquet=false"
   ```
   Trying to load data
   ```
   basePath="s3://path/to/base/table"
   df = spark.read.format("hudi").load(basePath + "/*/*/*/*")
   ```
   Error message 
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
           at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
           at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
           at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
           at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
           at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
           at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
           at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
           at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
           at scala.Option.getOrElse(Option.scala:189)
           at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
           at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
           at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
           at py4j.Gateway.invoke(Gateway.java:282)
           at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
           at py4j.commands.CallCommand.execute(CallCommand.java:79)
           at py4j.GatewayConnection.run(GatewayConnection.java:238)
           at java.lang.Thread.run(Thread.java:748)
   ```





[GitHub] [hudi] zafer-sahin commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
zafer-sahin commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-770769437


   @nsivabalan I was able to execute all steps in the [quick start](https://hudi.apache.org/docs/quick-start-guide.html) successfully, and I could reproduce the issue by changing the storage type in the Hudi options: with the quick start example switched to MERGE_ON_READ, reading fails as well. Here is the modification I applied. 
   
   ```
   tableName = "hudi_trips_cow"
   basePath = "S3:///tmp/hudi_trips_mor"
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   ```
   
   The **storage type** below has been modified, and I am getting an error **when I read the table**.
   ```
   hudi_options = {
     'hoodie.table.name': tableName,
     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",  
     'hoodie.datasource.write.recordkey.field': 'uuid',
     'hoodie.datasource.write.partitionpath.field': 'partitionpath',
     'hoodie.datasource.write.table.name': tableName,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'ts',
     'hoodie.upsert.shuffle.parallelism': 2, 
     'hoodie.insert.shuffle.parallelism': 2
   }
   
   df.write.format("hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(basePath)
   
   tripsSnapshotDF = spark. \
     read. \
     format("hudi"). \
     load(basePath + "/*/*/*/*")
   ```
   
   
   Please find the error stack below.
   ```
   An error occurred while calling o267.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   ```
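   
   A quick cross-check (a hedged sketch, reusing `basePath` from above): forcing the read-optimized query type bypasses the MOR log-merge relation that throws here, so it should behave like the COW read:
   
   ```python
   # Hedged sketch: read only the compacted base files (read-optimized view).
   # This path does not go through MergeOnReadSnapshotRelation, where the
   # NoSuchMethodError above is raised, but it also won't see records that
   # are still only in the log files.
   ro_df = spark.read.format("hudi") \
       .option("hoodie.datasource.query.type", "read_optimized") \
       .load(basePath + "/*/*/*/*")
   ro_df.show()
   ```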





[GitHub] [hudi] andormarkus commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
andormarkus commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-776047022


   @vinothchandar 
   We are using EMR 6.2.0, which ships AWS Spark 3.0.1, and the latest Apache Spark release is also 3.0.1.
   I don't see a version mismatch from this perspective.





[GitHub] [hudi] Magicbeanbuyer edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
Magicbeanbuyer edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-775872798


   Hey @vinothchandar,
   
   we've come across the same issue when reading a MERGE_ON_READ table using Spark. We consume data from our AWS MSK topic, write the data using `deltastreamer` on AWS EMR, and store the data in an S3 bucket.
   
   Following is our implementation.
   
   ### Write data
   
   ```
   spark-submit \
     --jars /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"\
     --conf "spark.sql.hive.convertMetastoreParquet=false" \
     /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
     --spark-master yarn \
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
     --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
     --table-type MERGE_ON_READ \
     --source-ordering-field id \
     --target-base-path $target_base_path \
     --target-table $target_table \
     --hoodie-conf "hoodie.deltastreamer.schemaprovider.source.schema.file=$schema_file_path" \
     --hoodie-conf "hoodie.deltastreamer.schemaprovider.target.schema.file=$schema_file_path" \
     --hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
     --hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
     --hoodie-conf "hoodie.datasource.write.partitionpath.field=partitiontime:TIMESTAMP" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ss.SSSZ" \
     --hoodie-conf "hoodie.datasource.write.hive_style_partitioning=true" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=milliseconds" \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timezone=UTC" \
     --hoodie-conf "hoodie.deltastreamer.source.kafka.topic=$kafka_topic" \
     --hoodie-conf "bootstrap.servers=$kafka_bootstrap_servers" \
     --hoodie-conf "auto.offset.reset=earliest"
   ```
   The Hudi table is generated in our S3 bucket without a problem. However, an error is thrown when we try to read it using either `python` or `scala`.
   
   ### Read Data
   #### Scala
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf "spark.sql.hive.convertMetastoreParquet=false"
   ```
   Trying to load data
   ```
   val basePath="s3://path/to/base/table"
   val df = spark.read.format("hudi").load(basePath + "/*/*/*/*")
   ```
   Error message 
   ```
   java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
     at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
     at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
     at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
     ... 47 elided
   ```
   
   #### Python
   ```
   pyspark \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf "spark.sql.hive.convertMetastoreParquet=false"
   ```
   Trying to load data
   ```
   basePath="s3://path/to/base/table"
   df = spark.read.format("hudi").load(basePath + "/*/*/*/*")
   ```
   Error message 
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
           at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
           at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
           at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
           at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
           at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
           at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
           at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
           at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
           at scala.Option.getOrElse(Option.scala:189)
           at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
           at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
           at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
           at py4j.Gateway.invoke(Gateway.java:282)
           at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
           at py4j.commands.CallCommand.execute(CallCommand.java:79)
           at py4j.GatewayConnection.run(GatewayConnection.java:238)
           at java.lang.Thread.run(Thread.java:748)
   ```





[GitHub] [hudi] codope edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
codope edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-904323753


   I can reproduce the same behavior with Hudi 0.7.0 but not with 0.8.0 on Apache Spark 3.0.1.
   
   UPDATE: Built the 0.7.0 code for Spark 3: `mvn clean package -DskipTests -Dscala-2.12 -Dspark3`. Now I can query MOR tables as well. So, probably we never pushed a hudi-spark-bundle built for Spark 3 to the Maven repo? @nsivabalan 
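   
   One way to tell which bundle a session actually loaded (a hedged sketch using plain Java reflection through py4j; nothing Hudi-specific is assumed beyond the class name):
   
   ```python
   # Hedged sketch: print the jar that provides Hudi's DefaultSource on the
   # driver classpath, to confirm whether a Spark 2 or Spark 3 build is in use.
   jvm = spark.sparkContext._jvm
   cls = jvm.java.lang.Class.forName("org.apache.hudi.DefaultSource")
   print(cls.getProtectionDomain().getCodeSource().getLocation())
   ```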





[GitHub] [hudi] nsivabalan commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-942292643









[GitHub] [hudi] parisni commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
parisni commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-943156528


   @nsivabalan Also, I should mention this is OSS Spark 2.4.4 with the metastore overridden via AWS Glue to connect Spark to Glue: https://github.com/awslabs/aws-glue-libs
   This might be related.





[GitHub] [hudi] saqibalimalik commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
saqibalimalik commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-926167413


   @vinothchandar I am running on EMR's Spark.





[GitHub] [hudi] vinothchandar commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-776028785


   Folks, this is due to a version mismatch between AWS Spark and Apache Spark. Hudi releases are built against Apache Spark, and AWS typically follows up with an EMR release. The problematic access is only on the MOR query path, which explains why MOR is problematic while COW is not. See https://dev.to/bytearray/using-your-own-apache-spark-hudi-versions-with-aws-emr-40a0 if interested in the steps.
   
   If one of you could help verify that the issue does not exist when querying from Apache Spark, we can route the issue accordingly. cc @umehrot2 
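   
   To check the mismatch directly, a hedged sketch (pure reflection, no Hudi code involved) that prints the constructors Hudi is trying to link against:
   
   ```python
   # Hedged sketch: compare spark.version with the constructors actually
   # present on InMemoryFileIndex. If none of them matches the signature in
   # the NoSuchMethodError, the running Spark differs from the Spark the
   # Hudi bundle was compiled against.
   print(spark.version)
   jvm = spark.sparkContext._jvm
   cls = jvm.java.lang.Class.forName(
       "org.apache.spark.sql.execution.datasources.InMemoryFileIndex")
   for ctor in cls.getConstructors():
       print(ctor)
   ```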





[GitHub] [hudi] nsivabalan closed issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2498:
URL: https://github.com/apache/hudi/issues/2498


   





[GitHub] [hudi] nsivabalan edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-769918590


   @zafer-sahin : not sure if it's some env issue. Were you able to run the pyspark examples given in the [quick start](https://hudi.apache.org/docs/quick-start-guide.html)? If that works but just MOR fails, then we can look into it. If you haven't tried it, can you try it and let us know? Also, your precombine field should be something like a timestamp and can't be the same as the record key field; it is used to determine the ordering of multiple entries for the same record key.





[GitHub] [hudi] nsivabalan commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-769918590


   @zafer-sahin : not sure if it's some env issue. Were you able to run the pyspark examples given in the [quick start](https://hudi.apache.org/docs/quick-start-guide.html)? If that works but just MOR fails, then we can look into it. 





[GitHub] [hudi] zafer-sahin commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
zafer-sahin commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-769721260


   Hi, I am still getting a similar error.
   
   
   
   ```
   >>> hudi_options_insert = {
   ...     "hoodie.table.name": "the_table_name",
   ...     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   ...     "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   ...     "hoodie.datasource.write.recordkey.field": "id",
   ...     "hoodie.datasource.write.operation": "bulk_insert",
   ...     "hoodie.datasource.write.partitionpath.field": "ds",
   ...     "hoodie.datasource.write.precombine.field": "id",
   ...     "hoodie.insert.shuffle.parallelism": 135
   ...     }
   >>> df.write.format("hudi").options(**hudi_options_insert).mode("overwrite").save(S3_MERGE_ON_READ)
   ```
   
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o87.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   ```





[GitHub] [hudi] saqibalimalik commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
saqibalimalik commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-917040262


   I tested with both 0.8.0 and 0.9.0 and get the same error. I should also mention that my table is partitioned on two fields.
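   
   For context, a hedged sketch of how such two-field partitioning is typically configured (the `country`/`ds` field names are illustrative, not the real ones):
   
   ```python
   # Hedged sketch: multi-field partitioning usually pairs a comma-separated
   # partitionpath with ComplexKeyGenerator; the field names here are made up.
   hudi_options = {
       "hoodie.table.name": "my_table",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.partitionpath.field": "country,ds",
       "hoodie.datasource.write.keygenerator.class":
           "org.apache.hudi.keygen.ComplexKeyGenerator",
   }
   ```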





[GitHub] [hudi] pete91z commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
pete91z commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-916724389


   OK, thank you @nsivabalan and @codope. I will look into building 0.7.0 and re-testing. I will also look at 0.8.0 when I get time.





[GitHub] [hudi] saqibalimalik commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
saqibalimalik commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-916483215


   I see the same error when I try to read a MOR table using Spark. There are no issues querying via Hive/Presto.
   
   `Spark 3.0.1 and hudi-spark3-bundle_2.12:0.9.0`
   
   Reading with:
   ```python
   queryHudiRead = spark.read.format("org.apache.hudi").load("s3://bucket/table")
   queryHudiRead.show()
   ```
   
   I get the error below:
   ```bash
   An error was encountered:
   An error occurred while calling o85.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$6(MergeOnReadSnapshotRelation.scala:217)
   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
   	at scala.collection.immutable.List.foreach(List.scala:392)
   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
   	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
   	at scala.collection.immutable.List.map(List.scala:298)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:209)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildScan(MergeOnReadSnapshotRelation.scala:110)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$apply$4(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:331)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:408)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:330)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:160)
   	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1429)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:75)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:365)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:94)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:94)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:87)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:107)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:104)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$5(QueryExecution.scala:219)
   	at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:381)
   	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:219)
   	at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:99)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:132)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:104)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:132)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:131)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3665)
   	at org.apache.spark.sql.Dataset.head(Dataset.scala:2737)
   	at org.apache.spark.sql.Dataset.take(Dataset.scala:2944)
   	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
   	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   Traceback (most recent call last):
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 441, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o85.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$6(MergeOnReadSnapshotRelation.scala:217)
   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
   	at scala.collection.immutable.List.foreach(List.scala:392)
   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
   	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
   	at scala.collection.immutable.List.map(List.scala:298)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:209)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildScan(MergeOnReadSnapshotRelation.scala:110)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$apply$4(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:331)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:408)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:330)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:160)
   	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1429)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:75)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:365)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:94)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:94)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:87)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:107)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:104)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$5(QueryExecution.scala:219)
   	at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:381)
   	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:219)
   	at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:99)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:132)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:104)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:132)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:131)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3665)
   	at org.apache.spark.sql.Dataset.head(Dataset.scala:2737)
   	at org.apache.spark.sql.Dataset.take(Dataset.scala:2944)
   	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
   	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   ```





[GitHub] [hudi] zafer-sahin edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
zafer-sahin edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-769721260


   Hi, I am still getting a similar error at read time.
   
   
   
   ```
   >>> hudi_options_insert = {
   ...     "hoodie.table.name": "the_table_name",
   ...     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   ...     "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   ...     "hoodie.datasource.write.recordkey.field": "id",
   ...     "hoodie.datasource.write.operation": "bulk_insert",
   ...     "hoodie.datasource.write.partitionpath.field": "ds",
   ...     "hoodie.datasource.write.precombine.field": "id",
   ...     "hoodie.insert.shuffle.parallelism": 135
   ...     }
   >>> df.write.format("hudi").options(**hudi_options_insert).mode("overwrite").save(S3_MERGE_ON_READ)
   >>> df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")
   ```
   
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o87.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)`
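   
   A `NoSuchMethodError` on a Spark constructor like this usually indicates a binary mismatch between the Spark version the Hudi bundle was compiled against and the Spark actually running the shell. As a quick sanity check (a minimal sketch, pure PySpark and independent of Hudi), the runtime versions can be printed and compared against the bundle coordinates used at launch:
   
   ```
   # Compare these against the launch coordinates: for example,
   # hudi-spark-bundle_2.12:0.7.0 expects Scala 2.12 and a compatible Spark 3.x.
   print(spark.version)  # runtime Spark version, e.g. "3.0.1"
   print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # runtime Scala version, via py4j
   ```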



[GitHub] [hudi] jmnatzaganian commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
jmnatzaganian commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-974507962


   I'm also hitting the same type of issue on EMR 6.4 after building and deploying Hudi 0.9.0. Note that, as mentioned [above](https://github.com/apache/hudi/issues/2498#issuecomment-969228521), the default binaries work just fine (EMR 6.4 with Hudi 0.8.0).
   
   It seems likely that something is off with the build or with how the jars are referenced. I used `mvn clean package -DskipTests -Dspark3 -Dscala-2.12 -T 30`.
   
   What's really interesting is that I can create an MoR table without issue, but the DataFrame returned by `load` is unusable: the load itself appears to succeed, yet any subsequent action on the DataFrame fails.
   
   This [tip](https://github.com/apache/hudi/issues/2498#issuecomment-942282671) also worked for me (i.e. using `spark.sql` and referencing the table from the Glue data catalog), as sketched below. Unfortunately, querying the data this way seems to be *much* slower than it was on 0.8.0.
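   
   For illustration, a minimal sketch of that `spark.sql` workaround, assuming the table was synced to the Glue data catalog as `my_db.my_table` (both names are hypothetical; Hudi's hive sync typically registers `_ro`/`_rt` suffixed views for MoR tables):
   
   ```
   # Read through the catalog instead of the path-based datasource;
   # my_table_rt is the hypothetical snapshot (real-time) view of the MoR table.
   df_mor = spark.sql("SELECT * FROM my_db.my_table_rt")
   df_mor.count()
   ```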
   
   I documented my build and installation process in [this](https://apache-hudi.slack.com/archives/C4D716NPQ/p1637354714476100) slack thread.



[GitHub] [hudi] vinothchandar commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-926135063


    Looks like there is some mismatch between @codope's comment and what @saqibalimalik is reporting.
   
   @saqibalimalik, to confirm: you are running Apache Spark, not EMR's Spark?
   
   > Spark 3.0.1 and hudi-spark3-bundle_2.12:0.9.0
   
   



[GitHub] [hudi] parisni edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
parisni edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-942307668


   @nsivabalan 
   
    We are using OSS Spark 2.4.4 on AWS S3.
   The Hudi bundle is: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.9.0,org.apache.spark:spark-avro_2.11:2.4.4`



[GitHub] [hudi] vinothchandar commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-776127759


   @andormarkus AFAIK, AWS has its own Spark fork. IIRC, @umehrot2 mentioned on Slack that this is related.




[GitHub] [hudi] nsivabalan commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-809502029


   @zafer-sahin @Magicbeanbuyer: Can you folks try with Spark 2 and let us know whether you still encounter the same issue?



[GitHub] [hudi] nsivabalan commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-810436565


   Once you respond, can you please remove the "awaiting-user-response" label from the issue and, if possible, add the "awaiting-community-help" label?



[GitHub] [hudi] Magicbeanbuyer commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
Magicbeanbuyer commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-816555746


   Hey @nsivabalan, 
   
   We have wrapped up our POC and no longer have the setup, so unfortunately we can't contribute further to this issue.



[GitHub] [hudi] nsivabalan edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-997339302


   For `NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>`, please refer to https://github.com/apache/hudi/issues/2498#issuecomment-969228521 for the proposed fix: you have to provide the jars explicitly on the classpath, or add them to the spark/jars directory.
   
   We hit this with the EMR build of Spark; copying the open-source spark-sql jar into the spark/jars directory resolved it. We don't face this issue when using open-source Spark, so it appears to be specific to EMR's Spark.
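   
   To confirm which spark-sql jar is actually being picked up, one hedged diagnostic (plain py4j, nothing Hudi-specific) is to ask the JVM where the offending class was loaded from:
   
   ```
   # Resolve the class named in the NoSuchMethodError and print the jar it was
   # loaded from; on EMR this typically points at the EMR-patched spark-sql jar
   # rather than the open-source one. Note getCodeSource() can return None for
   # bootstrap classes, but not for classes shipped in the spark-sql jar.
   jvm = spark.sparkContext._jvm
   clazz = jvm.java.lang.Class.forName(
       "org.apache.spark.sql.execution.datasources.PartitionedFile")
   print(clazz.getProtectionDomain().getCodeSource().getLocation())
   ```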


[GitHub] [hudi] nsivabalan closed issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2498:
URL: https://github.com/apache/hudi/issues/2498


   

