
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/21 07:18:13 UTC

[GitHub] [hudi] rahil-c opened a new pull request, #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

rahil-c opened a new pull request, #6163:
URL: https://github.com/apache/hudi/pull/6163

   …ion column is missing from schema
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] yihua commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

yihua commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r927189394


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {

Review Comment:
   This check is hacky.  Could we remove it?  At a minimum, for bootstrapped tables, we could disable the partition schema.  Better still, we should find a way to get the schema from the bootstrap base path.  How is the schema fetched when reading a bootstrapped table?




[GitHub] [hudi] hudi-bot commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193048723

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135",
       "triggerID" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f67bf25d9653616fe4a882762e477d3bd116400c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135) 
   * 08c3aef746ec9037ac1f1f7314d4a04a7e931ace UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1191488197

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135",
       "triggerID" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f67bf25d9653616fe4a882762e477d3bd116400c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193049295

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135",
       "triggerID" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241",
       "triggerID" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f67bf25d9653616fe4a882762e477d3bd116400c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135) 
   * 08c3aef746ec9037ac1f1f7314d4a04a7e931ace Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193134785

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 08c3aef746ec9037ac1f1f7314d4a04a7e931ace UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] rahil-c commented on pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

rahil-c commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1191132853

   cc @zhedoubushishi @umehrot2 



[GitHub] [hudi] umehrot2 commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r928032598


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {

Review Comment:
   @yihua I don't think we should remove this check. It was added deliberately to cover cases where a bootstrapped table has had upserts. After the initial bootstrap, new upserts will write all the columns into the Hudi table. At that point I believe the data will also include the partition column, and we should then start treating it as a normal table.




[GitHub] [hudi] hudi-bot commented on pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1191157852

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f67bf25d9653616fe4a882762e477d3bd116400c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193135533

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241",
       "triggerID" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 08c3aef746ec9037ac1f1f7314d4a04a7e931ace Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193144103

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241",
       "triggerID" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 08c3aef746ec9037ac1f1f7314d4a04a7e931ace Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1191162346

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135",
       "triggerID" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f67bf25d9653616fe4a882762e477d3bd116400c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] zhedoubushishi commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

zhedoubushishi commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r927140605


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {
+          val isBootstrapTable = BootstrapIndex.getBootstrapIndex(metaClient).useIndex()
+          if (isBootstrapTable) {
+            // For bootstrapped tables its possible the schema does not contain partition field when source table

Review Comment:
   Hi @nsivabalan.
   In this case, let's say the source table is a Hive-style partitioned Parquet table (the partition column is not included in the Parquet files) and, after bootstrapping, we generated a partitioned Hudi table. But when reading this Hudi table, we now read it as a non-partitioned table because the partition column is not included in the data files.
   
   Yes, in the long term we should be able to infer the partition column and schema type for bootstrapped tables, but that is a more complex issue to resolve at this time.
   
   We identified that the partition validation logic mainly serves to enable partition pruning in HoodieFileIndex.
   
   Rather than entirely breaking the bootstrap feature, we have decided, for bootstrapped tables, to ignore this validation and treat the tables as non-partitioned at query time. The impact is that queries will not benefit from partition pruning through Hudi.
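
   The fallback described in this comment can be sketched roughly as follows. This is a minimal illustration, not the merged code: `Field` and `Schema` are stand-ins for Spark's `StructField` and `StructType` so the snippet runs without Spark, `resolvePartitionSchema` is a hypothetical helper name, and the assumption that non-bootstrapped tables keep failing with `IllegalArgumentException` comes from the removed lines of the diff, since the rest of the new branch is not quoted in this thread.

```scala
// Stand-ins for Spark's StructField / StructType (assumption: simplified for illustration).
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

object PartitionSchemaSketch {
  // Hypothetical helper mirroring the check under review: keep only the partition
  // columns that exist in the table schema, then decide what to do if some are missing.
  def resolvePartitionSchema(partitionColumns: Seq[String],
                             tableSchema: Schema,
                             isBootstrapTable: Boolean): Schema = {
    val nameFieldMap: Map[String, Field] = tableSchema.fields.map(f => f.name -> f).toMap
    val partitionFields: Seq[Field] = partitionColumns.flatMap(nameFieldMap.get)

    if (partitionFields.size == partitionColumns.size) {
      // Every partition column is present: treat the table as a normal partitioned table.
      Schema(partitionFields)
    } else if (isBootstrapTable) {
      // Bootstrapped from a Hive-style source whose files lack the partition column:
      // fall back to an empty partition schema, i.e. a non-partitioned view.
      Schema(Seq.empty)
    } else {
      // Assumed behaviour for regular tables: still fail, as the removed code did.
      val missing = partitionColumns.filterNot(nameFieldMap.contains).mkString(",")
      val known = tableSchema.fields.map(_.name).mkString(",")
      throw new IllegalArgumentException(s"Cannot find columns: '$missing' in the schema[$known]")
    }
  }
}
```

   Under these assumptions, a bootstrapped table whose schema lacks `dt` resolves to an empty partition schema, while the same lookup on a regular table still raises an error.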
   
   
   




[GitHub] [hudi] yihua merged pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

yihua merged PR #6163:
URL: https://github.com/apache/hudi/pull/6163



[GitHub] [hudi] umehrot2 commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r928037390


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {

Review Comment:
   I agree that in general this is just a temporary solution to avoid breaking bootstrap tables. This is tricky to handle, because it's not just about obtaining the partition schema from the source, but also about extracting the partition column values from the source path and writing them with the correct data type in the target location. I remember having several discussions about it a year back.
   
   As for your question => https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/ParquetBootstrapMetadataHandler.java#L61 we are simply reading the source file footer right now to get the source schema. I think the EMR team can take it up in the next release, but for now we should at least prevent failures. @rahil-c can you create a JIRA to track this?
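
   To make the "extracting the partition column values from the source path" part concrete, here is a small sketch of what parsing a Hive-style relative path and casting the raw values might look like. It is only an illustration under stated assumptions: `parsePartitionPath` and `castValue` are hypothetical names, the type names are plain strings rather than Spark data types, and escaped characters in partition values are not handled.

```scala
object HivePartitionPathSketch {
  // Split a Hive-style relative path such as "year=2022/month=07" into
  // ordered (column, raw string value) pairs. Segments without "=" are skipped.
  def parsePartitionPath(relativePath: String): Seq[(String, String)] =
    relativePath.split('/').toSeq.filter(_.nonEmpty).flatMap { segment =>
      segment.split("=", 2) match {
        case Array(name, value) if name.nonEmpty => Some(name -> value)
        case _                                   => None
      }
    }

  // Hypothetical cast step: path values are always strings, so a declared
  // type is needed to write them back with the correct data type.
  def castValue(raw: String, dataType: String): Any = dataType match {
    case "int"  => raw.toInt
    case "long" => raw.toLong
    case _      => raw // fall back to the raw string for anything else
  }
}
```

   The cast step is where the difficulty mentioned above shows up: the path alone does not say whether `07` is a string or an integer, so the target type has to come from somewhere other than the file footer.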




[GitHub] [hudi] hudi-bot commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193064644

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10135",
       "triggerID" : "f67bf25d9653616fe4a882762e477d3bd116400c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241",
       "triggerID" : "08c3aef746ec9037ac1f1f7314d4a04a7e931ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 08c3aef746ec9037ac1f1f7314d4a04a7e931ace Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10241) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] rahil-c commented on pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

rahil-c commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1191705310

   retriggering run https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10135&view=results 



[GitHub] [hudi] zhedoubushishi commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

zhedoubushishi commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r927140605


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {
+          val isBootstrapTable = BootstrapIndex.getBootstrapIndex(metaClient).useIndex()
+          if (isBootstrapTable) {
+            // For bootstrapped tables its possible the schema does not contain partition field when source table

Review Comment:
   Hi @nsivabalan.
   In this case, let's say the source table is a Hive-style partitioned Parquet table (the partition column is not included in the Parquet files) and, after bootstrapping, we generated a partitioned Hudi table. But when reading this Hudi table, within this FileIndex code path, we now treat it as a non-partitioned table because the partition column is not included in the data files.
   
   Yes, in the long term we should be able to infer the partition column and schema type for bootstrapped tables, but that is a more complex issue to resolve at this time.
   
   We identified that the partition validation logic in FileIndex mainly serves to enable partition pruning in HoodieFileIndex.
   
   Rather than entirely breaking the bootstrap feature, we have decided, for bootstrapped tables, to ignore this validation and treat the tables as non-partitioned at query time. The impact is that queries will not benefit from partition pruning through Hudi.
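
   The cost of losing partition pruning can be illustrated with a toy file listing. This is a simplified sketch and not how HoodieFileIndex is implemented: `DataFile` and `listCandidateFiles` are hypothetical names, and the filter is reduced to a single column/value equality.

```scala
// Hypothetical, simplified view of a data file and the partition directory it lives in.
final case class DataFile(partitionPath: String, name: String)

object PartitionPruningSketch {
  // With a known partition column, files outside the requested partition are skipped.
  // With an empty partition schema (the bootstrap fallback), nothing can be skipped
  // at this level, so every file remains a candidate.
  def listCandidateFiles(files: Seq[DataFile],
                         partitionColumns: Seq[String],
                         filter: Option[(String, String)]): Seq[DataFile] =
    filter match {
      case Some((column, value)) if partitionColumns.contains(column) =>
        files.filter(_.partitionPath.split('/').contains(s"$column=$value"))
      case _ =>
        files
    }
}
```

   With `partitionColumns = Seq("dt")` a filter on `dt` narrows the listing to one directory; with an empty partition schema the same filter leaves the full listing in place.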
   
   
   




[GitHub] [hudi] nsivabalan commented on a diff in pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r926699569


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {
+          val isBootstrapTable = BootstrapIndex.getBootstrapIndex(metaClient).useIndex()
+          if (isBootstrapTable) {
+            // For bootstrapped tables its possible the schema does not contain partition field when source table

Review Comment:
   If yes, I agree. But if it's feasible to generate a partitioned Hudi table, we can't proceed with this fix, right? Can you help me understand, please?




[GitHub] [hudi] umehrot2 commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193004534

   Created a tracking jira => https://issues.apache.org/jira/browse/HUDI-4453



[GitHub] [hudi] codope commented on pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

codope commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1191152323

    @yihua @alexeykudinkin Can you please review this patch?



[GitHub] [hudi] yihua commented on pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

yihua commented on PR #6163:
URL: https://github.com/apache/hudi/pull/6163#issuecomment-1193171020

   CI is green.  Merging the PR.
   Screenshot: https://user-images.githubusercontent.com/2497195/180618764-40c44926-7e00-47dd-a177-ba2b15c55a9d.png



[GitHub] [hudi] nsivabalan commented on a diff in pull request #6163: Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r926698953


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {
+          val isBootstrapTable = BootstrapIndex.getBootstrapIndex(metaClient).useIndex()
+          if (isBootstrapTable) {
+            // For bootstrapped tables its possible the schema does not contain partition field when source table

Review Comment:
   I haven't worked much with bootstrapped tables, so help me clarify something. In this case, the Hudi table is actually non-partitioned, isn't it? That is, the source table uses Hive-style partitioning but does not contain the actual partition field in the DataFrame?
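
Illustrative sketch (not part of the original review thread): the hunk above is cut off before the bootstrap branch finishes, so the snippet below is only a rough, Spark-free approximation of the check being discussed. `Field`, `PartitionSchemaSketch`, and `resolvePartitionFields` are hypothetical names, and `isBootstrapTable` is passed in as a plain flag rather than derived from `BootstrapIndex`. Returning an empty partition schema for bootstrapped tables is an assumption based on the PR title; throwing for other tables is an assumption based on the removed lines.

```scala
// Hypothetical stand-in for Spark's StructField so the sketch runs without Spark.
final case class Field(name: String, dataType: String)

object PartitionSchemaSketch {

  // Resolves partition fields against the table schema.
  // isBootstrapTable is a plain flag here; the patch reads it from the bootstrap index.
  def resolvePartitionFields(partitionColumns: Seq[String],
                             schema: Seq[Field],
                             isBootstrapTable: Boolean): Seq[Field] = {
    val nameFieldMap: Map[String, Field] = schema.map(f => f.name -> f).toMap

    // Same shape as the patch: keep only the partition columns found in the schema.
    val partitionFields = partitionColumns.filter(nameFieldMap.contains).map(nameFieldMap)

    if (partitionFields.size == partitionColumns.size) {
      // Every partition column was found in the schema.
      partitionFields
    } else if (isBootstrapTable) {
      // Assumption (from the PR title): treat the table as non-partitioned.
      Seq.empty
    } else {
      // Assumption: keep the pre-patch behavior of failing loudly.
      val missing = partitionColumns.filterNot(nameFieldMap.contains)
      throw new IllegalArgumentException(
        s"Cannot find columns: '${missing.mkString(",")}' in the schema[${schema.mkString(",")}]")
    }
  }
}
```

Under these assumptions, a bootstrapped table whose schema lacks the Hive-style partition column resolves to an empty partition schema, while a regular table with the same mismatch still fails with an `IllegalArgumentException`.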




[GitHub] [hudi] yihua commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Posted by GitBox <gi...@apache.org>.

yihua commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r928069250


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {

Review Comment:
   Got it.  For the time being, I'll land this fix.  As a follow-up, could one of you add docs in the code to clarify why the check is needed?  It's not clear from reading the code at first glance.


