Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2022/03/15 19:17:00 UTC
[jira] [Updated] (HUDI-3625) [Umbrella] Optimized storage layout for cloud object stores
[ https://issues.apache.org/jira/browse/HUDI-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Udit Mehrotra updated HUDI-3625:
--------------------------------
Labels: hudi-umbrellas (was: )
> [Umbrella] Optimized storage layout for cloud object stores
> -----------------------------------------------------------
>
> Key: HUDI-3625
> URL: https://issues.apache.org/jira/browse/HUDI-3625
> Project: Apache Hudi
> Issue Type: New Feature
> Components: core
> Reporter: Udit Mehrotra
> Assignee: Udit Mehrotra
> Priority: Major
> Labels: hudi-umbrellas
>
> Amazon S3, among other cloud object stores, throttles requests based on object prefix: [https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/]. Hudi follows the traditional Hive storage layout, with files stored under separate partition paths beneath a common table path/prefix. When a significant number of files are written concurrently, the request limits for that common table path/prefix can be reached, which leads to throttling.
> We propose implementing an alternate storage layout that is better suited to cloud object stores like S3, so that throttling is avoided as the data scales. At a high level, we need to distribute data files evenly across randomly generated prefixes, so that requests, and the limits applied to them, are spread across those prefixes rather than concentrated on a single table prefix.
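
To illustrate the idea of spreading files across prefixes, here is a minimal Python sketch. It is not Hudi's implementation. The function name `prefixed_key`, its parameters, and the key layout are all hypothetical, and it assumes a prefix derived from a SHA-256 hash of the logical file path as a stand-in for the "randomly generated prefixes" mentioned in the description.

```python
import hashlib


def prefixed_key(table_name: str, partition_path: str, file_name: str,
                 prefix_len: int = 8) -> str:
    """Build an object key that starts with a hash-derived prefix.

    Hypothetical layout for illustration only:
        <hash prefix>/<table>/<partition>/<file>
    """
    # The logical path is what a Hive-style layout would use as the key.
    logical_path = f"{table_name}/{partition_path}/{file_name}"
    # Hashing the logical path makes the prefix deterministic, so the same
    # file always maps to the same key, while different files scatter
    # across many prefixes.
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{logical_path}"


if __name__ == "__main__":
    for i in range(3):
        print(prefixed_key("my_table", "date=2022-03-15", f"file-{i}.parquet"))
```

Because the prefix comes first in the key, files from the same table and partition no longer share one leading prefix. A real design would also need a way to map a table's files back from these scattered keys, which this sketch does not address.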
--
This message was sent by Atlassian Jira
(v8.20.1#820001)