Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/09 22:26:46 UTC

[GitHub] [hudi] saqibalimalik commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

saqibalimalik commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-916483215


   I see the same error when I try to read a MOR table using Spark; there are no issues querying the same table through Hive/Presto.
   
   `Spark 3.0.1 and hudi-spark3-bundle_2.12:0.9.0`
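   
   For context, the session has the bundle on the classpath. A minimal sketch of the setup (the app name and exact builder flags here are my illustration; only the bundle coordinates above are exact):
   ```python
   from pyspark.sql import SparkSession

   # Minimal session setup for reading Hudi tables; KryoSerializer is what the
   # Hudi docs recommend for Spark jobs.
   spark = (
       SparkSession.builder
       .appName("hudi-mor-read")  # hypothetical app name
       .config("spark.jars.packages",
               "org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .getOrCreate()
   )
   ```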
   
   Reading as follows:
   ```python
   # Plain snapshot read of the table straight from S3
   queryHudiRead = spark.read.format("org.apache.hudi").load("s3://bucket/table")
   queryHudiRead.show()
   ```
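   
   On a MOR table this defaults to a snapshot query; for reference, the verbose form with the query type spelled out would look like this (the option key is standard Hudi config, the variable name is just for illustration):
   ```python
   # Explicit snapshot query on the MOR table; equivalent to the plain read above.
   morSnapshot = (
       spark.read.format("org.apache.hudi")
       .option("hoodie.datasource.query.type", "snapshot")
       .load("s3://bucket/table")
   )
   morSnapshot.show()
   ```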
   
   Getting the error below:
   ```bash
   An error was encountered:
   An error occurred while calling o85.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   	at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$6(MergeOnReadSnapshotRelation.scala:217)
   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
   	at scala.collection.immutable.List.foreach(List.scala:392)
   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
   	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
   	at scala.collection.immutable.List.map(List.scala:298)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:209)
   	at org.apache.hudi.MergeOnReadSnapshotRelation.buildScan(MergeOnReadSnapshotRelation.scala:110)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$apply$4(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:331)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:408)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:330)
   	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:298)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:162)
   	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:160)
   	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1429)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:75)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
   	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
   	at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:365)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:94)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:94)
   	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:87)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:107)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:149)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:104)
   	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
   	at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$5(QueryExecution.scala:219)
   	at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:381)
   	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:219)
   	at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:99)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:132)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:104)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:227)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:132)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:131)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3665)
   	at org.apache.spark.sql.Dataset.head(Dataset.scala:2737)
   	at org.apache.spark.sql.Dataset.take(Dataset.scala:2944)
   	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
   	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   Traceback (most recent call last):
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 441, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o85.showString.
   : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
   
   ```
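   
   A `java.lang.NoSuchMethodError` on a constructor like this normally means the bundle was compiled against a different `PartitionedFile` signature than the Spark build on the cluster exposes, so the runtime version is worth double-checking (a quick sanity check, not taken from the failing job):
   ```python
   # Confirm the Spark version actually running on the cluster; a minor-version
   # mismatch with the version the bundle targets is the classic cause of a
   # constructor NoSuchMethodError.
   print(spark.version)  # "3.0.1" here
   ```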

