Posted to commits@hudi.apache.org by "alexeykudinkin (via GitHub)" <gi...@apache.org> on 2023/02/24 07:01:03 UTC

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7847: [HUDI-5697] Revisiting refreshing of Hudi relations after write operations on the tables

alexeykudinkin commented on code in PR #7847:
URL: https://github.com/apache/hudi/pull/7847#discussion_r1116571566


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalogUtils.scala:
##########
@@ -17,8 +17,76 @@
 
 package org.apache.spark.sql
 
+import org.apache.spark.sql.catalyst.catalog.CatalogTableType
+import org.apache.spark.sql.catalyst.{QualifiedTableName, TableIdentifier}
+
 /**
  * NOTE: Since support for [[TableCatalog]] was only added in Spark 3, this trait
  *       is going to be an empty one simply serving as a placeholder (for compatibility w/ Spark 2)
  */
 trait HoodieCatalogUtils {}
+
+object HoodieCatalogUtils {
+
+  /**
+   * Please check the scala-doc of the other overloaded [[refreshTable()]] operation
+   */
+  def refreshTable(spark: SparkSession, qualifiedTableName: String): Unit = {
+    val tableId = spark.sessionState.sqlParser.parseTableIdentifier(qualifiedTableName)
+    refreshTable(spark, tableId)
+  }
+
+  /**
+   * Refreshes metadata and flushes cached data (resolved [[LogicalPlan]] representation,
+   * already loaded [[InMemoryRelation]]) for the table identified by [[tableId]].
+   *
+   * This method is usually invoked at the end of the write operation to make sure cached
+   * data/metadata are synchronized with the state on storage.
+   *
+   * NOTE: PLEASE READ CAREFULLY BEFORE CHANGING
+   *       This is borrowed from Spark 3.1.3 and modified to satisfy Hudi needs:
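
As a usage illustration of the string-based overload above, here is a minimal sketch (not part of the PR; the table name and the preceding write are placeholders):

```scala
import org.apache.spark.sql.{HoodieCatalogUtils, SparkSession}

// Hypothetical caller: after a write commits, refresh the table's cached
// relation so subsequent reads reflect the new state on storage.
object RefreshAfterWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-refresh-sketch")
      .master("local[*]")
      .getOrCreate()

    // ... perform a write to `my_db.my_hudi_table` here ...

    // The String overload parses the qualified name into a TableIdentifier
    // and delegates to the TableIdentifier-based overload shown above.
    HoodieCatalogUtils.refreshTable(spark, "my_db.my_hudi_table")
  }
}
```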

Review Comment:
   Great question!
   
   This seems to be the PR that changed it: https://github.com/apache/spark/pull/31206
   
   I don't see any particular rationale for changing the part that triggers `relation.refresh()`. I suspect the reason Spark's core doesn't really care much about it is simply that, after listing a (Parquet, for example) table, it just creates an `InMemoryFileIndex` that is passed into `HadoopFsRelation`; in that case you wouldn't notice the refresh, as it actually happens purely in memory.
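   
   To illustrate the in-memory listing being referred to, here is a rough sketch (not Spark's actual implementation; the path is a placeholder):
   
   ```scala
   import org.apache.hadoop.fs.Path
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
   
   object InMemoryRefreshSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").getOrCreate()
   
       // For a file-based (e.g. Parquet) table, Spark wraps the file listing
       // in an InMemoryFileIndex, which then backs the HadoopFsRelation.
       val index = new InMemoryFileIndex(
         spark,
         Seq(new Path("/tmp/some_parquet_table")), // placeholder path
         Map.empty,
         userSpecifiedSchema = None)
   
       // refresh() just re-lists the files and updates in-memory state; no
       // relation objects are swapped out, so the refresh is effectively
       // invisible from the outside.
       index.refresh()
     }
   }
   ```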



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org