Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/21 22:09:44 UTC

[GitHub] [hudi] zhedoubushishi commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

zhedoubushishi commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r927140605


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {
+          val isBootstrapTable = BootstrapIndex.getBootstrapIndex(metaClient).useIndex()
+          if (isBootstrapTable) {
+            // For bootstrapped tables its possible the schema does not contain partition field when source table

Review Comment:
   Hi @nsivabalan.
   In this case, let's say the source table is a Hive-style partitioned Parquet table (the partition column is not stored in the Parquet files), and after bootstrapping we have generated a partitioned Hudi table. But when reading this Hudi table, we now read it as a non-partitioned table, because the partition column is not present in the data files.
   
   Yes, in the long term we should be able to infer the partition column and its schema type for bootstrapped tables, but that is a more complex issue than we can resolve at this time.
   
   We identified that the partition validation logic mainly exists to enable partition pruning in HoodieFileIndex.
   
   Rather than breaking the bootstrap feature entirely, we have decided that for bootstrapped tables we skip this validation and treat the table as non-partitioned when querying. The impact is that queries on such tables will not benefit from partition pruning through Hudi.
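
   The fallback described above can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: `Field` is a hypothetical stand-in for Spark's `StructField`, `resolvePartitionFields` is an invented helper name, and `isBootstrapTable` is passed in as a plain flag instead of being read from the bootstrap index. Throwing for non-bootstrapped tables is an assumption carried over from the removed lines in the diff, since the hunk is cut off before that branch.

```scala
// Hypothetical stand-in for Spark's StructField, so the sketch has no dependencies.
final case class Field(name: String, dataType: String)

object PartitionSchemaSketch {

  /** Resolve the partition columns against the fields found in the data files.
    *
    * Returns the matching fields when every partition column is present.
    * For a bootstrapped table with missing columns, returns an empty list,
    * i.e. the table is treated as non-partitioned and no pruning happens.
    * For any other table with missing columns, fails (assumed behaviour).
    */
  def resolvePartitionFields(
      partitionColumns: Seq[String],
      schema: Seq[Field],
      isBootstrapTable: Boolean): Seq[Field] = {
    val nameFieldMap: Map[String, Field] = schema.map(f => f.name -> f).toMap

    // Keep only the partition columns that actually exist in the schema.
    val found = partitionColumns.flatMap(c => nameFieldMap.get(c))

    if (found.size == partitionColumns.size) {
      found
    } else if (isBootstrapTable) {
      // Partition column lives only in the directory path of the source table.
      Seq.empty
    } else {
      val missing = partitionColumns.filterNot(nameFieldMap.contains)
      throw new IllegalArgumentException(
        s"Cannot find column(s) ${missing.mkString(", ")} in the schema " +
          s"[${schema.map(_.name).mkString(",")}]")
    }
  }
}
```

   With a schema that lacks the partition column `dt`, a call with `isBootstrapTable = true` yields an empty partition schema, while the same call with `isBootstrapTable = false` raises `IllegalArgumentException`.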
   