Posted to dev@hudi.apache.org by Manoj Govindassamy <ma...@gmail.com> on 2021/11/05 16:30:38 UTC

[DISCUSS] Metadata based bloom index

Hi Hudi Community,

Hudi has several indices to help look up records. The most commonly used one
is the BloomFilter-based index. Today this index works by loading the bloom
filters from all the data files in the partitions of interest, which is a
time-consuming operation. It would be better if we could leverage the metadata
table infrastructure of Hudi tables. That is, if all the bloom filters could
be loaded directly from a single metadata table partition, it would greatly
speed up the entire record key lookup process.
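
To illustrate the lookup side of the idea, here is a minimal Python sketch. It
is not Hudi code: the SimpleBloomFilter class, the candidate_files helper, the
file names, and the dict standing in for a metadata table partition are all
hypothetical, and the filter itself is a toy. It only shows how record key
lookup could prune data files once every bloom filter is available from one
place instead of being read from each data file.

```python
import hashlib


class SimpleBloomFilter:
    """Toy Bloom filter; a stand-in for the filter kept per data file."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_filter(keys):
    """Build a filter containing every key in keys."""
    bloom = SimpleBloomFilter()
    for key in keys:
        bloom.add(key)
    return bloom


def candidate_files(filters_by_file, record_key):
    """Return the files whose filter cannot rule out record_key."""
    return sorted(
        name
        for name, bloom in filters_by_file.items()
        if bloom.might_contain(record_key)
    )


# Hypothetical stand-in for a single metadata table partition:
# data file name -> bloom filter, all loaded in one read.
metadata_bloom_partition = {
    "2021/11/05/file-a.parquet": build_filter(["key-1", "key-2"]),
    "2021/11/05/file-b.parquet": build_filter(["key-3"]),
}

print(candidate_files(metadata_bloom_partition, "key-1"))
```

A bloom filter never gives false negatives, so the file that really holds a
key is always among the candidates. Other files may occasionally show up as
false positives and still have to be checked against the actual data.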

Let me know your thoughts on this high-level idea. I am planning to start an
RFC on this and can share more details on the design and implementation.

Regards,
Manoj

Re: [DISCUSS] Metadata based bloom index

Posted by Vinoth Chandar <vi...@apache.org>.

+1 on this. I think cloud storage throttling is more of an issue, causing
degradation when tables are enormous, but this approach should handle that
nicely as well.

On Fri, Nov 5, 2021 at 9:31 AM Manoj Govindassamy <
manoj.govindassamy@gmail.com> wrote:

> Hi Hudi Community,
>
> Hudi has several indices to help look up records. The most commonly used one
> is the BloomFilter-based index. Today this index works by loading the bloom
> filters from all the data files in the partitions of interest, which is a
> time-consuming operation. It would be better if we could leverage the metadata
> table infrastructure of Hudi tables. That is, if all the bloom filters could
> be loaded directly from a single metadata table partition, it would greatly
> speed up the entire record key lookup process.
>
> Let me know your thoughts on this high-level idea. I am planning to start an
> RFC on this and can share more details on the design and implementation.
>
> Regards,
> Manoj
>