Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2022/03/15 19:17:00 UTC

[jira] [Updated] (HUDI-3625) [Umbrella] Optimized storage layout for cloud object stores

     [ https://issues.apache.org/jira/browse/HUDI-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HUDI-3625:
--------------------------------
    Labels: hudi-umbrellas  (was: )

> [Umbrella] Optimized storage layout for cloud object stores
> -----------------------------------------------------------
>
>                 Key: HUDI-3625
>                 URL: https://issues.apache.org/jira/browse/HUDI-3625
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: core
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Major
>              Labels: hudi-umbrellas
>
> Amazon S3, among other cloud object stores, throttles requests based on object prefix: [https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/]. Hudi follows the traditional Hive storage layout, in which files are stored under separate partition paths beneath a common table path/prefix. Writing a significant number of files concurrently can therefore reach the request limits of that common table path/prefix and cause throttling.
> We propose implementing an alternate storage layout that is better suited to cloud object stores like S3 and avoids throttling as the data scales. At a high level, we need to distribute data files evenly across randomly generated prefixes, so that requests are spread across those prefixes rather than counted against the limit of a single table prefix.
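
To illustrate the idea of spreading files across prefixes, here is a minimal Python sketch. It is not Hudi code and not the design proposed in this issue. The function name prefixed_key, the key layout <prefix>/<table>/<partition>/<file>, and the prefix length are all hypothetical. It also derives the prefix from a SHA-256 hash of the partition and file name instead of generating it randomly, purely so that the location can be recomputed without a lookup; the issue itself does not specify how prefixes would be generated or tracked.

```python
import hashlib
from collections import Counter


def prefixed_key(table: str, partition: str, file_name: str, prefix_len: int = 4) -> str:
    """Build an object key that starts with a hash-derived prefix.

    Hypothetical layout: <prefix>/<table>/<partition>/<file_name>.
    The prefix is the first `prefix_len` hex characters of a SHA-256 digest,
    so files in the same partition land under many different leading prefixes.
    """
    digest = hashlib.sha256(f"{partition}/{file_name}".encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{table}/{partition}/{file_name}"


# Example: 1000 files written to a single partition of a hypothetical table.
keys = [
    prefixed_key("trips", "date=2022-03-15", f"file-{i}.parquet")
    for i in range(1000)
]

# Count how many files fall under each leading prefix.
files_per_prefix = Counter(key.split("/", 1)[0] for key in keys)
print(f"{len(keys)} files spread over {len(files_per_prefix)} distinct prefixes")
```

With the traditional layout, all 1000 keys would share the single leading prefix of the table path; in this sketch they are scattered over many leading prefixes, which is the property the proposal is after. A real implementation would also need a way for readers to locate files, which this sketch does not address.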



--
This message was sent by Atlassian Jira
(v8.20.1#820001)