Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/01 11:00:23 UTC

[GitHub] [hudi] zafer-sahin edited a comment on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

zafer-sahin edited a comment on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-770769437


   @nsivabalan I was able to run all steps of the [quick start](https://hudi.apache.org/docs/quick-start-guide.html) successfully, and I can reproduce the issue there as well: after changing the storage type in the Hudi options of the quick start example to `MERGE_ON_READ`, the read fails. Here are the modifications I applied.
   
   `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   
   ```
   tableName = "hudi_trips_cow"
   basePath = "S3:///tmp/hudi_trips_mor"
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   ```
   
   Below, the **storage type** has been modified. The write succeeds, but I get an error **when I read the data back.**
   ```
   hudi_options = {
     'hoodie.table.name': tableName,
     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",  
     'hoodie.datasource.write.recordkey.field': 'uuid',
     'hoodie.datasource.write.partitionpath.field': 'partitionpath',
     'hoodie.datasource.write.table.name': tableName,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'ts',
     'hoodie.upsert.shuffle.parallelism': 2, 
     'hoodie.insert.shuffle.parallelism': 2
   }
   
   df.write.format("hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(basePath)
   
   tripsSnapshotDF = spark. \
     read. \
     format("hudi"). \
     load(basePath + "/*/*/*/*")
   ```
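To make the single difference from the quick start explicit, here is a minimal sketch that builds the same options dictionary with the storage type as a parameter. The helper name `make_hudi_options` is hypothetical and not part of Hudi; the option keys and values are copied from the snippet above.

```python
def make_hudi_options(table_name, storage_type="COPY_ON_WRITE"):
    """Build the write options used above; only storage_type varies.

    Hypothetical helper for illustration, not a Hudi API.
    """
    return {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.storage.type': storage_type,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': table_name,
        'hoodie.datasource.write.operation': 'insert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
    }


cow_options = make_hudi_options("hudi_trips_cow")
mor_options = make_hudi_options("hudi_trips_cow", "MERGE_ON_READ")

# List the keys whose values differ between the two configurations.
changed = sorted(k for k in cow_options if cow_options[k] != mor_options[k])
print(changed)
```

Passing `mor_options` to `df.write.format("hudi").options(**mor_options)` corresponds to the failing case reported here; `cow_options` corresponds to the unmodified quick start.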
   
   
   Please find the error stack trace below.
   ```
   An error occurred while calling o267.load.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
   	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   ```
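A `NoSuchMethodError` generally means that a class was compiled against a different version of a dependency than the one loaded at runtime. As a first diagnostic, the sketch below compares the `--packages` coordinates from the command above with the Spark and Scala versions of the running session. The helper names are hypothetical, the runtime versions passed in are placeholders rather than values from this report, and the check is only a heuristic: it does not establish the cause of the failure.

```python
import re


def parse_coordinate(coord):
    """Split 'group:artifact_scalaVersion:version' into its parts."""
    group, artifact, version = coord.split(":")
    match = re.match(r"^(.*)_(\d+\.\d+)$", artifact)
    if match:
        name, scala = match.group(1), match.group(2)
    else:
        name, scala = artifact, None
    return {"group": group, "name": name, "scala": scala, "version": version}


def major_minor(version):
    """Reduce '2.12.10' to '2.12'."""
    return ".".join(version.split(".")[:2])


def check_packages(packages, spark_version, scala_version):
    """Return warnings for coordinates that do not match the runtime versions."""
    warnings = []
    for coord in packages.split(","):
        info = parse_coordinate(coord.strip())
        # The artifact's Scala suffix should match the runtime Scala major.minor.
        if info["scala"] and info["scala"] != major_minor(scala_version):
            warnings.append(
                f"{info['name']}: built for Scala {info['scala']}, "
                f"runtime is Scala {scala_version}"
            )
        # Artifacts published by Spark itself should match the Spark version.
        if info["group"] == "org.apache.spark" and info["version"] != spark_version:
            warnings.append(
                f"{info['name']}: version {info['version']}, "
                f"runtime is Spark {spark_version}"
            )
    return warnings


packages = (
    "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,"
    "org.apache.spark:spark-avro_2.12:3.0.0"
)

# Placeholder runtime versions. In a live pyspark session they could be read
# from spark.version and sc._jvm.scala.util.Properties.versionNumberString().
for warning in check_packages(packages, "3.0.1", "2.12.10"):
    print(warning)
```

An empty result only shows that the declared coordinates line up with the runtime. A vendor-patched Spark build can still differ in internal classes such as `InMemoryFileIndex`.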

