Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/02 07:27:52 UTC

[GitHub] [hudi] yihua opened a new pull request, #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

yihua opened a new pull request, #6851:
URL: https://github.com/apache/hudi/pull/6851

   ### Change Logs
   
   This PR fixes the issue reported in #6281.
   
   In Deltastreamer, the meta sync fails when `TimestampBasedKeyGenerator` is used with a custom output date format (`hoodie.deltastreamer.keygen.timebased.output.dateformat`) that contains slashes, e.g., "yyyy/MM/dd", and hive-style partitioning is disabled (the default).  The relevant key generator configs are:
   
   ```
   --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS 
   ```
   Hive Sync exception:
   ```
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table test_table
   ...
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table add partition failed
   ...
   Caused by: MetaException(message:Invalid partition key & values; keys [createddate, ], values [2022, 10, 02, ])
       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
   ...
   ```
   Glue Sync exception:
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
   ...
   Caused by: org.apache.hudi.aws.sync.HoodieGlueSyncException: Fail to add partitions to default.test_table
   	at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.addPartitionsToTable(AWSGlueCatalogSyncClient.java:147)
   ...
   Caused by: org.apache.hudi.com.amazonaws.services.glue.model.InvalidInputException: The number of partition keys do not match the number of partition values (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: <>; Proxy: null)
   	at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
   ...
   ```
   
   The exception is thrown because the partition values for meta sync are not extracted properly.  In the current logic, `hoodie.datasource.hive_sync.partition_extractor_class` determines the partition extractor to use, and in this case `MultiPartKeysValueExtractor` is inferred.  The root cause is that this extractor splits the partition path on slashes, i.e., `2022/10/02` -> `[2022, 10, 02]`, instead of treating it as a single value, even though there is only one partition column.  As a result, the sync sends three partition values for one partition key (`createddate`), which the metastore rejects.  In general, extraction fails whenever the user specifies an output date format that contains slashes.
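   
   To make the mismatch concrete, here is a minimal Java sketch.  It is not Hudi code: the class `SlashSplitDemo` and its methods are hypothetical, and it only assumes what is described above, namely that the inferred extractor splits the partition path on `/` while the table has a single partition column.
   
   ```java
   import java.text.SimpleDateFormat;
   import java.util.Arrays;
   import java.util.Date;
   import java.util.List;
   import java.util.Locale;
   import java.util.TimeZone;

   // Hypothetical illustration only; not part of Hudi.
   final class SlashSplitDemo {

     // Formats an epoch-millis timestamp as the configs above request:
     // output dateformat "yyyy/MM/dd" in the GMT timezone.
     static String partitionPath(long epochMillis) {
       SimpleDateFormat format = new SimpleDateFormat("yyyy/MM/dd", Locale.ROOT);
       format.setTimeZone(TimeZone.getTimeZone("GMT"));
       return format.format(new Date(epochMillis));
     }

     // Mimics an extractor that treats every slash-separated part as its own value.
     static List<String> splitOnSlashes(String partitionPath) {
       return Arrays.asList(partitionPath.split("/"));
     }

     public static void main(String[] args) {
       List<String> partitionKeys = Arrays.asList("createddate");
       String path = partitionPath(1664668800000L); // 2022-10-02 00:00:00 GMT
       List<String> values = splitOnSlashes(path);
       System.out.println(path + " -> " + values);
       System.out.println(partitionKeys.size() + " key(s) vs " + values.size() + " value(s)");
     }
   }
   ```
   
   Running `main` prints `2022/10/02 -> [2022, 10, 02]` followed by `1 key(s) vs 3 value(s)`, which is the key/value count mismatch that both the Hive and Glue exceptions complain about.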
   
   This PR fixes the problem by introducing a new partition extractor, `SinglePartPartitionValueExtractor`, which treats the partition value as a whole when there is only a single partition column, instead of relying on `MultiPartKeysValueExtractor`.  The slash (`/`) is replaced by a dash (`-`) because slashes are encoded by default, which makes the partition values inconvenient to query.
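   
   The intended behavior can be sketched as follows.  This is a rough approximation under the assumptions stated in this description, not the actual `SinglePartPartitionValueExtractor` source: the class `SingleValueSketch` and the method `extractSingleValue` are hypothetical names.
   
   ```java
   import java.util.Collections;
   import java.util.List;

   // Hypothetical sketch; the real SinglePartPartitionValueExtractor may differ.
   final class SingleValueSketch {

     // Returns the whole partition path as exactly one value,
     // with every '/' replaced by '-'.
     static List<String> extractSingleValue(String partitionPath) {
       return Collections.singletonList(partitionPath.replace('/', '-'));
     }

     public static void main(String[] args) {
       System.out.println(extractSingleValue("2022/10/02")); // prints [2022-10-02]
     }
   }
   ```
   
   With one value per partition column, the number of partition values matches the number of partition keys again.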
   
   ### Impact
   
   **Risk level: low**
   
   The fix was tested locally with Hive sync and on EMR with Glue sync.  Before the fix, the meta sync fails.  After the fix, the meta sync succeeds and the correct partitions are shown in beeline (with `show partitions test_table;`) for Hive and in the Glue web UI for the Glue Data Catalog.
   
   ### Documentation Update
   
   We need to improve the docs for meta sync with `TimestampBasedKeyGenerator`.  Docs update is tracked in HUDI-4967.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6851:
URL: https://github.com/apache/hudi/pull/6851#issuecomment-1264581009

   ## CI report:
   
   * 1e9d7d3b20a72047d0d5b6e385d9078ffc7fdb65 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on pull request #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #6851:
URL: https://github.com/apache/hudi/pull/6851#issuecomment-1264718033

   > @yihua thanks for looking into this. I think the user's problem can also be resolved by using `SlashEncodedDayPartitionValueExtractor` ? probably need to follow the [migration guide](https://hudi.apache.org/releases/release-0.12.0#configuration-updates)
   
   If the date output format is “yyyy/MM/dd”, yes.  But we should also allow users to specify any format, e.g., “yyyyMM/dd/HH” or “MM/dd/yyyy”, for which no existing partition extractor works.  The new partition extractor addresses this problem.  Even for “yyyy/MM/dd”, there is no need to specify the extractor after the fix.
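   
   As a rough illustration of why a format-agnostic approach helps, the Java sketch below formats one timestamp with each of the formats mentioned above and keeps the result as a single value by replacing `/` with `-`.  The class `AnyFormatDemo` and the method `singleValue` are hypothetical and are not part of Hudi.
   
   ```java
   import java.text.SimpleDateFormat;
   import java.util.Date;
   import java.util.Locale;
   import java.util.TimeZone;

   // Hypothetical illustration only; not part of Hudi.
   final class AnyFormatDemo {

     // Formats the timestamp in GMT with the given pattern, then keeps the
     // result as one partition value by replacing '/' with '-'.
     static String singleValue(String pattern, long epochMillis) {
       SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
       format.setTimeZone(TimeZone.getTimeZone("GMT"));
       return format.format(new Date(epochMillis)).replace('/', '-');
     }

     public static void main(String[] args) {
       long ts = 1664695672000L; // 2022-10-02 07:27:52 GMT
       for (String pattern : new String[] {"yyyy/MM/dd", "yyyyMM/dd/HH", "MM/dd/yyyy"}) {
         System.out.println(pattern + " -> " + singleValue(pattern, ts));
       }
     }
   }
   ```
   
   The three formats yield `2022-10-02`, `202210-02-07`, and `10-02-2022` respectively, each a single value regardless of how many slashes the format contains.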




[GitHub] [hudi] xushiyan merged pull request #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

Posted by GitBox <gi...@apache.org>.
xushiyan merged PR #6851:
URL: https://github.com/apache/hudi/pull/6851




[GitHub] [hudi] hudi-bot commented on pull request #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6851:
URL: https://github.com/apache/hudi/pull/6851#issuecomment-1264581861

   ## CI report:
   
   * 1e9d7d3b20a72047d0d5b6e385d9078ffc7fdb65 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11962) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

