You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/08 06:15:39 UTC

[GitHub] [hudi] xushiyan commented on a diff in pull request #5791: [MINOR] follow up HUDI-4178 automatically enable schema evolution when read hoodie table.

xushiyan commented on code in PR #5791:
URL: https://github.com/apache/hudi/pull/5791#discussion_r891957547


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -509,12 +510,22 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     StructType(dataStructSchema.filterNot(f => partitionColumns.contains(f.name)))
 
   private def isSchemaEvolutionEnabled = {
+    // Auto set schema evolution
+    // Wrapped as a function, this function is triggered only when called, minimize the time cosumption.
+    def detectSchemaEvolution(): Boolean = {
+      val result = new FileBasedInternalSchemaStorageManager(metaClient).isValidHistorySchemaExist

Review Comment:
   is it possible to avoid creating a new FileBasedInternalSchemaStorageManager every check? the actual info is retrieved by the existing metaclient, so `isValidHistorySchemaExist` can be a static util taking the metaclient?. This would involve more refactoring work so it's optional here.



##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/io/FileBasedInternalSchemaStorageManager.java:
##########
@@ -131,6 +131,27 @@ private List<String> getValidInstants() {
         .filterCompletedInstants().getInstants().map(f -> f.getTimestamp()).collect(Collectors.toList());
   }
 
+  /**
+   * Return whether an available historySchema file exist in schema folder or not.
+   */
+  public boolean isValidHistorySchemaExist() {
+    try {
+      List<String> validateCommits = getValidInstants();
+      FileSystem fs = FSUtils.getFs(baseSchemaPath.toString(), conf);
+      if (fs.exists(baseSchemaPath)) {
+        List<String> validaSchemaFiles = Arrays.stream(fs.listStatus(baseSchemaPath))
+            .filter(f -> f.isFile() && f.getPath().getName().endsWith(SCHEMA_COMMIT_ACTION))
+            .map(file -> file.getPath().getName()).filter(f -> validateCommits.contains(f.split("\\.")[0])).sorted().collect(Collectors.toList());

Review Comment:
   if every flag check `isSchemaEvolutionEnabled` invokes this, any performance concern? Not sure if the usability improvement can justify the perf cost or not.



##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/io/FileBasedInternalSchemaStorageManager.java:
##########
@@ -131,6 +131,27 @@ private List<String> getValidInstants() {
         .filterCompletedInstants().getInstants().map(f -> f.getTimestamp()).collect(Collectors.toList());
   }
 
+  /**
+   * Return whether an available historySchema file exist in schema folder or not.
+   */
+  public boolean isValidHistorySchemaExist() {
+    try {
+      List<String> validateCommits = getValidInstants();
+      FileSystem fs = FSUtils.getFs(baseSchemaPath.toString(), conf);
+      if (fs.exists(baseSchemaPath)) {
+        List<String> validaSchemaFiles = Arrays.stream(fs.listStatus(baseSchemaPath))
+            .filter(f -> f.isFile() && f.getPath().getName().endsWith(SCHEMA_COMMIT_ACTION))
+            .map(file -> file.getPath().getName()).filter(f -> validateCommits.contains(f.split("\\.")[0])).sorted().collect(Collectors.toList());

Review Comment:
   pls try simplifying this method chain to improve readability.. you can leverage `Stream::anyMatch()` ? also better to illustrate in the javadoc with an example showing file names



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org