Posted to commits@hudi.apache.org by "Shimin Yang (Jira)" <ji...@apache.org> on 2021/01/04 08:45:00 UTC

[jira] [Updated] (HUDI-1503) Implement a Hash(Bucket)-based Index

     [ https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shimin Yang updated HUDI-1503:
------------------------------
    Description: 
This ticket introduces a new hash-based index, which can improve write performance and speed up queries at the same time (by removing the shuffle for Spark/Hive).

The new hash-based index works with a customized hash-based partitioner, which partitions records based on the hash value of the index keys and a fixed number of buckets. As a result, there is no need to read the existing files to determine which file group each record belongs to.
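
The bucket assignment described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation proposed in this ticket: the class and method names are hypothetical, Java's String.hashCode() stands in for the Hive hash function, and the zero-padded file group naming scheme is invented for the example.

```java
import java.util.Locale;

// Hypothetical sketch of hash-based bucket assignment; not Hudi's actual API.
final class BucketIndexSketch {

    // Maps an index key to a bucket in [0, numBuckets).
    // Assumption: String.hashCode() is a stand-in for the Hive hash function.
    static int bucketId(String indexKey, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // Clear the sign bit so the modulo result is never negative.
        return (indexKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Derives a file group identifier from the bucket alone, so no existing
    // file has to be read. The zero-padded format is a made-up convention.
    static String fileGroupIdFor(String indexKey, int numBuckets) {
        return String.format(Locale.ROOT, "%08d", bucketId(indexKey, numBuckets));
    }
}
```

For example, BucketIndexSketch.bucketId("a", 4) evaluates to 1, because "a".hashCode() is 97 and 97 % 4 is 1. The same key always maps to the same bucket as long as the bucket number stays fixed, which is why the lookup needs no file access.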

Meanwhile, query engines can use the file group id, hash mode, and bucket number to eliminate the shuffle introduced by aggregations and joins.
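
To illustrate why matching bucketing removes the shuffle, here is a minimal sketch of a bucket-wise join over in-memory maps. It assumes both inputs were bucketed with the same hash function and the same bucket number; all names are hypothetical, String.hashCode() again stands in for the Hive hash function, and a real query engine would apply the same idea to files rather than maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a bucket-local join; not a query engine's actual API.
final class BucketLocalJoinSketch {

    // Same stand-in bucket function for both sides of the join.
    static int bucketId(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Splits key/value rows into numBuckets maps, as a bucketed writer would.
    static List<Map<String, String>> bucketize(Map<String, String> rows, int numBuckets) {
        List<Map<String, String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new HashMap<>());
        }
        for (Map.Entry<String, String> e : rows.entrySet()) {
            buckets.get(bucketId(e.getKey(), numBuckets)).put(e.getKey(), e.getValue());
        }
        return buckets;
    }

    // Inner join on key. Bucket i of the left side is compared only with
    // bucket i of the right side, so no rows move between buckets.
    static Map<String, String> joinBucketwise(List<Map<String, String>> left,
                                              List<Map<String, String>> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("both sides must use the same bucket number");
        }
        Map<String, String> joined = new HashMap<>();
        for (int i = 0; i < left.size(); i++) {
            Map<String, String> rightBucket = right.get(i);
            for (Map.Entry<String, String> e : left.get(i).entrySet()) {
                String rightValue = rightBucket.get(e.getKey());
                if (rightValue != null) {
                    joined.put(e.getKey(), e.getValue() + "|" + rightValue);
                }
            }
        }
        return joined;
    }
}
```

Because equal keys always hash to the same bucket index on both sides, joining bucket by bucket yields the same result as a global join, without redistributing rows first.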

We have implemented a HoodieIndex based on the Hive hash function, which is used in the production environment of ByteDance for many very large datasets, and we hope this feature can be contributed to the community soon.

  was:
This ticket introduces a new hash-based index, which can improve write performance and speed up queries at the same time (by removing the shuffle for Spark/Hive).

The new hash-based index works with a customized hash-based partitioner, which partitions records based on the hash value of the index keys and a fixed number of buckets. As a result, there is no need to read the existing files to determine which file group each record belongs to.

Meanwhile, query engines can use the file group id, hash mode, and bucket number to eliminate the shuffle introduced by aggregations and joins.

We have implemented a HoodieIndex based on the Hive hash function, which is used in our production environment at ByteDance for many very large datasets, and we hope this feature can be contributed to the community soon.


> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
>                 Key: HUDI-1503
>                 URL: https://issues.apache.org/jira/browse/HUDI-1503
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: Index, Performance
>            Reporter: Shimin Yang
>            Priority: Major
>
> This ticket introduces a new hash-based index, which can improve write performance and speed up queries at the same time (by removing the shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner, which partitions records based on the hash value of the index keys and a fixed number of buckets. As a result, there is no need to read the existing files to determine which file group each record belongs to.
> Meanwhile, query engines can use the file group id, hash mode, and bucket number to eliminate the shuffle introduced by aggregations and joins.
> We have implemented a HoodieIndex based on the Hive hash function, which is used in the production environment of ByteDance for many very large datasets, and we hope this feature can be contributed to the community soon.


