Posted to dev@hudi.apache.org by Sivabalan <n....@gmail.com> on 2020/02/22 16:52:00 UTC

[DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

As Apache Hudi is getting widely adopted, performance has become the need
of the hour. This RFC focuses on improving the performance of the Hudi
index by introducing a record-level index. The proposal is to implement a
new index format that is a mapping of either recordKey <-> (partition,
fileId) or (recordKey, partitionPath) → fileId. This mapping will be stored
and maintained by Hudi as another implementation of HoodieIndex.
Record-level indexing is expected to boost both read and write performance.

The RFC is here:
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets>
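
To make the two mapping shapes concrete, here is a minimal in-memory sketch in
Python. It is only an illustration, not the RFC's design and not Hudi code:
the class names (RecordLocation, GlobalRecordIndex,
PartitionScopedRecordIndex) and their methods are hypothetical, and the
reading that the first mapping assumes dataset-wide unique keys while the
second assumes per-partition uniqueness is an assumption on top of the text
above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RecordLocation:
    """Hypothetical value type (not a Hudi class): where a record lives."""
    partition_path: str
    file_id: str


class GlobalRecordIndex:
    """Sketch of recordKey <-> (partition, fileId).

    Assumes a record key is unique across the whole dataset, so the key
    alone is enough to find both the partition and the file.
    """

    def __init__(self) -> None:
        self._by_key: Dict[str, RecordLocation] = {}

    def put(self, record_key: str, partition_path: str, file_id: str) -> None:
        # A later put for the same key overwrites the earlier location.
        self._by_key[record_key] = RecordLocation(partition_path, file_id)

    def lookup(self, record_key: str) -> Optional[RecordLocation]:
        # Returns None when the key has never been indexed.
        return self._by_key.get(record_key)


class PartitionScopedRecordIndex:
    """Sketch of (recordKey, partitionPath) -> fileId.

    Assumes a record key is unique only within its partition, so the
    caller must supply the partition path along with the key.
    """

    def __init__(self) -> None:
        self._by_key_and_partition: Dict[Tuple[str, str], str] = {}

    def put(self, record_key: str, partition_path: str, file_id: str) -> None:
        self._by_key_and_partition[(record_key, partition_path)] = file_id

    def lookup(self, record_key: str, partition_path: str) -> Optional[str]:
        # Returns None when the (key, partition) pair has never been indexed.
        return self._by_key_and_partition.get((record_key, partition_path))


if __name__ == "__main__":
    global_index = GlobalRecordIndex()
    global_index.put("key-1", "2020/02/22", "file-a")
    print(global_index.lookup("key-1"))

    scoped_index = PartitionScopedRecordIndex()
    scoped_index.put("key-1", "2020/02/22", "file-a")
    print(scoped_index.lookup("key-1", "2020/02/22"))
```

In this sketch a lookup is a single dictionary access in both cases; the
difference is only in what the caller has to know up front (the key alone, or
the key plus its partition path).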

Appreciate your review and thoughts.

-- 
Regards,
-Sivabalan

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

Posted by Balaji Varadarajan <va...@gmail.com>.

+1. Let's do it :)

Balaji.V

On Mon, Feb 24, 2020 at 6:36 PM Shiyan Xu <xu...@gmail.com> wrote:

> +1, great read and great value!

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

Posted by Shiyan Xu <xu...@gmail.com>.

+1, great read and great value!

On Mon, 24 Feb 2020, 15:31 nishith agarwal, <n3...@gmail.com> wrote:

> +100
> - Reduces index lookup time, which improves job runtime
> - Paves the way for streaming-style ingestion
> - Eliminates the dependency on HBase (the alternative for "global index"
>   support at the moment)
>
> -Nishith

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

Posted by nishith agarwal <n3...@gmail.com>.

+100
- Reduces index lookup time, which improves job runtime
- Paves the way for streaming-style ingestion
- Eliminates the dependency on HBase (the alternative for "global index"
  support at the moment)

-Nishith

On Mon, Feb 24, 2020 at 10:56 AM Vinoth Chandar <vi...@apache.org> wrote:

> +1 from me as well. This will be a product-defining feature if we can do
> it.

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

Posted by Vinoth Chandar <vi...@apache.org>.

+1 from me as well. This will be a product-defining feature if we can do
it.

On Sun, Feb 23, 2020 at 6:27 PM vino yang <ya...@gmail.com> wrote:

> Hi Sivabalan,
>
> Thanks for your proposal.
>
> Big +1 from my side. Indexing at record granularity is really good for
> performance. It is also a step towards stream processing.
>
> Best,
> Vino

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

Posted by vino yang <ya...@gmail.com>.

Hi Sivabalan,

Thanks for your proposal.

Big +1 from my side. Indexing at record granularity is really good for
performance. It is also a step towards stream processing.

Best,
Vino

On Sun, Feb 23, 2020 at 12:52 AM Sivabalan <n....@gmail.com> wrote:

> As Apache Hudi is getting widely adopted, performance has become the need
> of the hour. This RFC focuses on improving the performance of the Hudi
> index by introducing a record-level index. The proposal is to implement a
> new index format that is a mapping of either recordKey <-> (partition,
> fileId) or (recordKey, partitionPath) → fileId. This mapping will be stored
> and maintained by Hudi as another implementation of HoodieIndex.
> Record-level indexing is expected to boost both read and write performance.
>
> The RFC is here:
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets>
>
> Appreciate your review and thoughts.
>
> --
> Regards,
> -Sivabalan
>