Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2019/10/27 07:13:54 UTC

DISCUSS RFC 6 - Add indexing support to the log file

https://cwiki.apache.org/confluence/display/HUDI/RFC-6+Add+indexing+support+to+the+log+file


Feedback is welcome on this RFC, which tackles HUDI-86.

Re: DISCUSS RFC 6 - Add indexing support to the log file

Posted by Vinoth Chandar <vi...@apache.org>.
Since attachments don't really work on the mailing list, can you maybe
attach them to comments on the RFC itself?

In this scenario, we will get a range that is probably larger than what is
in the newly compacted base file, correct? Current thinking is that, yes, it
will lead to less efficient pruning by ranges, but it should still be
correct. Do you have correctness concerns? If so, I would really like to
understand, since there is a good chance it's not fixable :)
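
A minimal sketch of why a stale, wider range should cost efficiency but not correctness. Everything here is hypothetical: `RangeInfo`, `lookup`, and the integer keys are illustration only and are not Hudi APIs. The diagrams are not in the archive, so the key values are made up, with only the ideal min and max of 6 and 11 taken from the question quoted below. The sketch assumes range pruning is followed by an exact check against the keys that are actually present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeInfo:
    """Hypothetical min/max record-key summary logged with a block."""
    min_key: int
    max_key: int

    def may_contain(self, key: int) -> bool:
        # A range can only say "definitely not here" or "maybe here".
        return self.min_key <= key <= self.max_key


def lookup(key: int, range_info: RangeInfo, live_keys: set) -> bool:
    """Prune by range first, then confirm with an exact check."""
    if not range_info.may_contain(key):
        return False          # pruned, no further work
    return key in live_keys   # candidate, confirm against actual keys


# Assumed scenario: keys 1..5 were compacted away, 6..11 remain in the log.
live_keys = set(range(6, 12))
stale = RangeInfo(1, 11)   # range logged before compaction
tight = RangeInfo(6, 11)   # range a "corrected" rangeInfo would carry

# Both ranges give the same answer for every key.
for key in range(0, 14):
    assert lookup(key, stale, live_keys) == lookup(key, tight, live_keys)

# Keys that pass the stale range but would have been pruned by the tight one.
wasted = [k for k in range(0, 14)
          if stale.may_contain(k) and not tight.may_contain(k)]
print(wasted)  # [1, 2, 3, 4, 5]
```

Under these assumptions the stale range only adds extra exact checks for keys 1 to 5; it never turns a correct "not found" into "found".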

We could have the next delta commit log a "corrected" rangeInfo down the
line as an optimization. Maybe leave the initial design simpler, since this
adds complexity around compatibility handling and so on? Good point, we
can think about it more.

On Thu, Nov 14, 2019 at 6:37 AM Sivabalan <n....@gmail.com> wrote:

> I have a doubt on the design. I guess this is the right place to discuss.
>
> I want to understand how compaction interplays with this new scheme.
> Let's assume all log blocks are of the new format only. Once compaction
> completes, those log blocks/files not compacted will have range info
> pertaining to the compacted ones, right? When will this get fixed? Won't the
> lookup return true for those keys from compacted log files? I have
> attached two diagrams depicting before and after compaction. If you look at
> the 2nd pic (after compaction), ideally min and max should have been 6 and 11.
>
> In general, when will the key range pruning happen? And will the
> bloom filter also be adjusted?
>

Re: DISCUSS RFC 6 - Add indexing support to the log file

Posted by Sivabalan <n....@gmail.com>.
I have a doubt on the design. I guess this is the right place to discuss.

I want to understand how compaction interplays with this new scheme.
Let's assume all log blocks are of the new format only. Once compaction
completes, those log blocks/files not compacted will have range info
pertaining to the compacted ones, right? When will this get fixed? Won't the
lookup return true for those keys from compacted log files? I have
attached two diagrams depicting before and after compaction. If you look at
the 2nd pic (after compaction), ideally min and max should have been 6 and 11.

In general, when will the key range pruning happen? And will the bloom
filter also be adjusted?




-- 
Regards,
-Sivabalan

Re: DISCUSS RFC 6 - Add indexing support to the log file

Posted by Nishith <n3...@gmail.com>.
Thanks for the detailed design write-up, Vinoth. I concur with the others on option 2: default indexing to off and enable it when we have enough confidence in stability & performance. Although, I do think that practically it might be good to have the code in place for users who might revert to an older build as part of some build rollback mechanisms that they may have in place (for reasons not even related to hudi). The latest data block (denoted by the latest version) being a new block, as suggested by Balaji, sounds like one option - not sure how complicated the code will become though...
I will comment on the RFC about some doubts/concerns regarding first migrating customers from canIndexLogFiles = false to true and then rolling back, to ensure my understanding is correct.

-Nishith


Re: DISCUSS RFC 6 - Add indexing support to the log file

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Thanks Vinoth for proposing a clean and extendable design. The overall design looks great. Another rollout option is to use the consolidated log index for index lookup only if the latest "valid" log block has been written in the new format. If that is not the case, we can revert to scanning previous log blocks for index lookup.

Balaji.V
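
A rough sketch of the rollout option in this message: consult the consolidated log index only when the latest valid log block is in the new format, and otherwise scan the previous blocks. All names (`LogBlock`, `OLD_FORMAT`, `NEW_FORMAT`, `consolidated_keys`, `lookup`) are hypothetical and not taken from the RFC or from Hudi. It also assumes that a new-format block carries an index consolidated over every earlier valid block, which the thread does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Set

OLD_FORMAT = 1  # hypothetical version: data block without index info
NEW_FORMAT = 2  # hypothetical version: data block with a consolidated index


@dataclass
class LogBlock:
    version: int
    valid: bool       # False for rolled-back or corrupt blocks
    keys: Set[str]    # record keys written in this block
    # Assumed: covers this block and all earlier valid blocks (new format only).
    consolidated_keys: Set[str] = field(default_factory=set)


def lookup(key: str, blocks: List[LogBlock]) -> str:
    """Report which path answered the lookup and whether the key was found."""
    valid_blocks = [b for b in blocks if b.valid]
    if valid_blocks and valid_blocks[-1].version == NEW_FORMAT:
        # Latest valid block is new format: trust its consolidated index.
        found = key in valid_blocks[-1].consolidated_keys
        return "index-hit" if found else "index-miss"
    # Otherwise fall back to scanning every valid block.
    found = any(key in b.keys for b in valid_blocks)
    return "scan-hit" if found else "scan-miss"


log = [
    LogBlock(OLD_FORMAT, True, {"a"}),
    LogBlock(NEW_FORMAT, True, {"b"}, consolidated_keys={"a", "b"}),
]
print(lookup("a", log))  # index-hit

# A block appended after reverting to an older build forces the scan path.
log.append(LogBlock(OLD_FORMAT, True, {"c"}))
print(lookup("c", log))  # scan-hit
```

The check depends only on the latest valid block, so a rollback to an older build degrades lookups to scanning rather than returning stale index results.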
  

Re: DISCUSS RFC 6 - Add indexing support to the log file

Posted by Bhavani Sudha <bh...@gmail.com>.
I vote for the second option. It also gives us time to analyze how to
deal with backwards compatibility. I'll take a look at the RFC later
tonight and get back.



Re: DISCUSS RFC 6 - Add indexing support to the log file

Posted by Vinoth Chandar <vi...@apache.org>.
One issue on which I have some open questions myself:

Is it ok to assume the log will have old data block versions followed by new
data block versions? For example, if we roll out new code and then revert,
there could be an arbitrary mix of new and old data blocks. Handling this
might make the design/code fairly complex. Alternatively, we can keep it simple
for now: disable it by default and only advise enabling it for new tables or
when the hudi version is stable.
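
To make the ordering assumption concrete, here is a small sketch that checks whether a sequence of block versions is "old followed by new" or an arbitrary mix. The constants `OLD` and `NEW` and the function name are hypothetical; real log blocks would carry their own version field.

```python
from typing import List

OLD, NEW = 1, 2  # hypothetical data block versions


def is_old_then_new(versions: List[int]) -> bool:
    """Return True if no old-format block appears after a new-format block."""
    seen_new = False
    for version in versions:
        if version == NEW:
            seen_new = True
        elif seen_new:
            # An old block after a new one: e.g. written after a revert.
            return False
    return True


print(is_old_then_new([OLD, OLD, NEW, NEW]))  # True
print(is_old_then_new([OLD, NEW, OLD, NEW]))  # False
```

A reader could use a check of this kind to decide whether the simple path applies, although whether that is worth the added complexity is exactly the open question.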

