Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/14 15:32:34 UTC

[GitHub] [hudi] codope opened a new pull request, #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

codope opened a new pull request, #6673:
URL: https://github.com/apache/hudi/pull/6673

   ### Change Logs
   
   This PR fixes the following issues in the bootstrap operation:
   * Partition discovery in `SparkFullBootstrapDataProviderBase`
   * Handling of `FULL_RECORD` mode in `SparkBootstrapCommitActionExecutor`
   * Schema resolution in `HoodieBootstrapRelation`
   
   ### Impact
   
   No public API change.
   
   **Risk level: none | low | medium | high**
   
   Medium. Only bootstrapped tables are affected.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on a diff in pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6673:
URL: https://github.com/apache/hudi/pull/6673#discussion_r972517320


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBootstrapRelation.scala:
##########
@@ -147,7 +146,7 @@ class HoodieBootstrapRelation(@transient val _sqlContext: SQLContext,
     if (fullSchema == null) {
       logInfo("Inferring schema..")
       val schemaResolver = new TableSchemaResolver(metaClient)
-      val tableSchema = schemaResolver.getTableAvroSchemaWithoutMetadataFields
+      val tableSchema = TableSchemaResolver.appendPartitionColumns(schemaResolver.getTableAvroSchemaWithoutMetadataFields, metaClient.getTableConfig.getPartitionFields)

Review Comment:
   We should also fix the table schema stored in the commit metadata so that it includes the partition column with the correct inferred type; that is fixed in #6676.
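
For readers unfamiliar with the change above, the idea of appending partition columns to a resolved table schema can be sketched in plain Java. This is only an illustration under stated assumptions: the class `AppendPartitionColumnsSketch`, its map-based schema representation, and the `"string"` placeholder type are hypothetical and are not Hudi's `TableSchemaResolver` API. As the comment above notes, the real fix also has to deal with the correct inferred type, which this sketch does not attempt.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hudi's TableSchemaResolver.
final class AppendPartitionColumnsSketch {

  // Returns a copy of `fields` (column name -> type name) with every partition
  // column that is missing appended at the end. The "string" type used for the
  // appended columns is an assumption of this sketch.
  static Map<String, String> appendPartitionColumns(Map<String, String> fields,
                                                    List<String> partitionFields) {
    Map<String, String> result = new LinkedHashMap<>(fields);
    for (String partitionField : partitionFields) {
      // Existing columns keep their declared type and position.
      result.putIfAbsent(partitionField, "string");
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> dataFileSchema = new LinkedHashMap<>();
    dataFileSchema.put("id", "long");
    dataFileSchema.put("name", "string");
    // The partition column "dt" lives in the directory path, not in the data files.
    System.out.println(appendPartitionColumns(dataFileSchema, List.of("dt")));
  }
}
```

The sketch keeps insertion order so that appended partition columns always follow the data columns.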



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -50,9 +53,25 @@ public class HoodieBootstrapConfig extends HoodieConfig {
       .sinceVersion("0.6.0")
       .withDocumentation("Base path of the dataset that needs to be bootstrapped as a Hudi table");
 
+  public static final ConfigProperty<String> PARTITION_SELECTOR_REGEX_MODE = ConfigProperty
+      .key("hoodie.bootstrap.mode.selector.regex.mode")
+      .defaultValue(METADATA_ONLY.name())
+      .sinceVersion("0.6.0")
+      .withValidValues(METADATA_ONLY.name(), FULL_RECORD.name())
+      .withDocumentation("Bootstrap mode to apply for partition paths, that match regex above. "
+          + "METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset. "
+          + "FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.");
+
   public static final ConfigProperty<String> MODE_SELECTOR_CLASS_NAME = ConfigProperty
       .key("hoodie.bootstrap.mode.selector")
       .defaultValue(MetadataOnlyBootstrapModeSelector.class.getCanonicalName())
+      /*.withInferFunction(cfg -> {

Review Comment:
   nit: remove unused code
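
The documentation string of `hoodie.bootstrap.mode.selector.regex.mode` in the hunk above describes choosing a bootstrap mode per partition path by regex match. A minimal plain-Java sketch of that idea follows. The class `RegexModeSelectionSketch`, its `Mode` enum, and the `select` signature are hypothetical stand-ins, not Hudi's `BootstrapModeSelector` interface, and the rule that non-matching partitions receive the opposite mode is an assumption of this sketch rather than something the diff states.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Hudi's BootstrapModeSelector API.
final class RegexModeSelectionSketch {

  enum Mode { METADATA_ONLY, FULL_RECORD }

  // Partition paths that fully match `regex` are assigned `modeOnMatch`.
  // All other paths get the opposite mode (an assumption of this sketch).
  static Map<Mode, List<String>> select(List<String> partitionPaths,
                                        String regex,
                                        Mode modeOnMatch) {
    Mode otherMode = modeOnMatch == Mode.FULL_RECORD ? Mode.METADATA_ONLY : Mode.FULL_RECORD;
    Pattern pattern = Pattern.compile(regex);
    Map<Mode, List<String>> result = new LinkedHashMap<>();
    result.put(modeOnMatch, new ArrayList<>());
    result.put(otherMode, new ArrayList<>());
    for (String path : partitionPaths) {
      Mode mode = pattern.matcher(path).matches() ? modeOnMatch : otherMode;
      result.get(mode).add(path);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> partitions = List.of("2022/09/13", "2022/09/14", "2021/01/01");
    // Fully rewrite the 2022 partitions; bootstrap everything else as metadata only.
    System.out.println(select(partitions, "2022/.*", Mode.FULL_RECORD));
  }
}
```

With `METADATA_ONLY` as the mode on match, which is the default shown in the diff, the same call would instead send the matching partitions to skeleton-file bootstrap and the rest to a full rewrite.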





[GitHub] [hudi] hudi-bot commented on pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6673:
URL: https://github.com/apache/hudi/pull/6673#issuecomment-1247016066

   ## CI report:
   
   * d549379aa13fdd32255ab4b47b184ae98014d44f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11366) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua merged pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by GitBox <gi...@apache.org>.
yihua merged PR #6673:
URL: https://github.com/apache/hudi/pull/6673




[GitHub] [hudi] yihua commented on pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #6673:
URL: https://github.com/apache/hudi/pull/6673#issuecomment-1248801714

   Merging this, as the rebase only touches `TestDataSourceForBootstrap`, which passes locally.




[GitHub] [hudi] hudi-bot commented on pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6673:
URL: https://github.com/apache/hudi/pull/6673#issuecomment-1247480519

   ## CI report:
   
   * d549379aa13fdd32255ab4b47b184ae98014d44f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11366) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6673:
URL: https://github.com/apache/hudi/pull/6673#issuecomment-1247008823

   ## CI report:
   
   * d549379aa13fdd32255ab4b47b184ae98014d44f UNKNOWN
   




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #6673:
URL: https://github.com/apache/hudi/pull/6673#discussion_r1257984792


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/SparkBootstrapCommitActionExecutor.java:
##########
@@ -307,12 +321,21 @@ private Map<BootstrapMode, List<Pair<String, List<HoodieFileStatus>>>> listAndPr
     BootstrapModeSelector selector =
         (BootstrapModeSelector) ReflectionUtils.loadClass(config.getBootstrapModeSelectorClass(), config);
 
-    Map<BootstrapMode, List<String>> result = selector.select(folders);
+    Map<BootstrapMode, List<String>> result = new HashMap<>();

Review Comment:
   No changes should be made here


