Posted to dev@hudi.apache.org by Nicolas Paris <ni...@riseup.net> on 2021/11/02 09:27:19 UTC

Re: Limitations of non unique keys

For example, does the move of bloom filters into HFiles (a 0.10.0 feature)
make unique bloom keys mandatory?



On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
>
> > Are you asking if there are advantages to allowing duplicates or not having keys in your table?
> It's all about allowing duplicates.
>
> The use case is, say, an Order table with key = customer_id, which
> allows indexed deletes without prescanning the dataset.
>
> I wonder whether such a trick will cause trouble I am unaware of.
>
> On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > Hi,
> >
> > Are you asking if there are advantages to allowing duplicates or not
> > having keys in your table?
> >
> > Having keys helps with other practical scenarios, in addition to what
> > you called out.
> > E.g., oftentimes you would want to backfill an insert-only table and you
> > don't want to introduce duplicates when doing so.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <ni...@riseup.net>
> > wrote:
> >
> > > Hi devs,
> > >
> > > AFAIK, Hudi has been designed to use a primary key as the Hudi key.
> > > However, it is also possible to choose a non-unique field. I have listed
> > > several problems with such a design.
> > >
> > > A non-unique key means you:
> > > - cannot delete / update a single record
> > > - cannot apply a primary key for the new SQL tables feature
> > >
> > > Are there other downsides to choosing a non-unique key that you have in
> > > mind?
> > >
> > > In my case, having user_id as the Hudi key will help to apply deletions at
> > > the user level in any user table. The tables are insert-only, so the
> > > drawbacks listed above do not really apply. In case of errors in the
> > > tables I have several options:
> > >
> > > - roll back to a previous commit
> > > - read partition / filter / overwrite partition
> > >
> > > Thanks
> > >

Re: Limitations of non unique keys

Posted by Sivabalan <n....@gmail.com>.

Got you. Thanks for the clarification.

On Fri, Nov 5, 2021 at 3:53 PM Vinoth Chandar <ma...@gmail.com>
wrote:

> Hi Siva,
>
> I think this is more about bloom filters and the record-level index, which
> is different from RFC-27.
>
> RFC-08 talks about record-level indexing. A discussion thread on bloom
> filter indexes was just kicked off.
>
> The main thing we are trying to solidify in 0.10.0 is the foundational
> metadata table and the concurrency mechanisms needed to, say, add an index
> in the background.
>
> Thanks
> Vinoth


-- 
Regards,
-Sivabalan

Re: Limitations of non unique keys

Posted by Vinoth Chandar <ma...@gmail.com>.

Hi Siva,

I think this is more about bloom filters and the record-level index, which is
different from RFC-27.

RFC-08 talks about record-level indexing. A discussion thread on bloom filter
indexes was just kicked off.

The main thing we are trying to solidify in 0.10.0 is the foundational
metadata table and the concurrency mechanisms needed to, say, add an index in
the background.

Thanks
Vinoth

On Fri, Nov 5, 2021 at 8:47 AM Sivabalan <n....@gmail.com> wrote:

> Thanks for bringing this up. We have RFC-27 on data skipping
> <https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance>,
> which is the secondary indexing being discussed here. We are fleshing out
> a few more details on this end and will put up patches once we figure out
> the unknowns. We have a WIP patch here
> <https://github.com/apache/hudi/pull/3475>, but it needs some refactoring
> and updates before it is ready for review.
> We are also thinking of moving the existing bloom filters (from data
> files) into the metadata table and reusing them instead of reading them
> from all data files, with the expectation that this will speed up index
> lookups. We will start a discussion thread around this and go from there.

Re: Limitations of non unique keys

Posted by Sivabalan <n....@gmail.com>.

Thanks for bringing this up. We have RFC-27 on data skipping
<https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance>,
which is the secondary indexing being discussed here. We are fleshing out
a few more details on this end and will put up patches once we figure out
the unknowns. We have a WIP patch here
<https://github.com/apache/hudi/pull/3475>, but it needs some refactoring and
updates before it is ready for review.
We are also thinking of moving the existing bloom filters (from data
files) into the metadata table and reusing them instead of reading them from
all data files, with the expectation that this will speed up index lookups.
We will start a discussion thread around this and go from there.
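
As background for the question about unique bloom keys, here is a minimal generic Bloom filter sketch. It is not Hudi's implementation: the `BloomFilter` class, its sizes, the file names, and the `candidate_files` helper are all assumptions made for illustration. It only shows that a Bloom filter, as a data structure, answers "might this key be present?", so adding the same key twice changes nothing. Whether the metadata-table layout adds any uniqueness requirement is not answered in this thread.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Toy Bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str) -> Iterable[int]:
        # Derive k positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def candidate_files(filters: Dict[str, BloomFilter], key: str) -> List[str]:
    """Files whose filter cannot rule out `key` (central lookup, toy version)."""
    return sorted(name for name, bf in filters.items() if bf.might_contain(key))


once = BloomFilter()
once.add("customer-1")

twice = BloomFilter()
twice.add("customer-1")
twice.add("customer-1")  # duplicate key: sets the same bits again

# A non-unique key that lives in two files yields two candidate files.
per_file = {"file_a": twice, "file_b": once}
hits = candidate_files(per_file, "customer-1")
```

Keeping the per-file filters in one lookup structure, as in `per_file`, avoids opening each data file just to read its filter, which is the performance motivation given above.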



On Wed, Nov 3, 2021 at 5:36 PM Nicolas Paris <ni...@riseup.net>
wrote:

>
> > In other words, we are generalizing this so Hudi feels more like
> > MySQL and not HBase/Cassandra (key-value stores). That's the direction
> > we are heading in.
>
> Wow, this is amazing. I haven't yet found an RFC about this, nor a PR
> ready to test.
>
> This answers my initial question: with the secondary index options
> coming, the Hudi key should be a primary key (if one exists). There is no
> reason to choose anything else.

-- 
Regards,
-Sivabalan

Re: Limitations of non unique keys

Posted by Nicolas Paris <ni...@riseup.net>.

> In other words, we are generalizing this so Hudi feels more like
> MySQL and not HBase/Cassandra (key-value stores). That's the direction
> we are heading in.

Wow, this is amazing. I haven't yet found an RFC about this, nor a PR
ready to test.

This answers my initial question: with the secondary index options
coming, the Hudi key should be a primary key (if one exists). There is no
reason to choose anything else.
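
As an illustration of that conclusion, the sketch below builds Hudi Spark datasource write options with the record key set to a unique column. The `hoodie.*` option names are standard Hudi write configs; the `hudi_write_options` helper, the table name, and the column names (`order_id`, `order_date`, `updated_at`) are hypothetical.

```python
from typing import Dict


def hudi_write_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    precombine_field: str,
    operation: str = "upsert",
) -> Dict[str, str]:
    """Assemble a Hudi write-option map (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


# Record key = the table's primary key (assumed here to be order_id).
order_opts = hudi_write_options("orders", "order_id", "order_date", "updated_at")

# With PySpark and a Hudi bundle available, these options would typically be
# passed as (not executed here; df and base_path are placeholders):
#   df.write.format("hudi").options(**order_opts).mode("append").save(base_path)
```

Choosing `customer_id` instead would only require changing the `record_key` argument, which is what makes the non-unique-key trick tempting in the first place.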

On Wed Nov 3, 2021 at 9:03 PM CET, Vinoth Chandar wrote:
> Hi.
>
> With the indexing approach we are taking, you should be able to add
> secondary indexes on any column, not just the key.
> In other words, we are generalizing this so Hudi feels more like MySQL
> and not HBase/Cassandra (key-value stores). That's the direction we are
> heading in.
>
> Would love to hear more feedback.

Re: Limitations of non unique keys

Posted by Vinoth Chandar <vi...@apache.org>.

Hi.

With the indexing approach we are taking, you should be able to add
secondary indexes on any column, not just the key.
In other words, we are generalizing this so Hudi feels more like MySQL
and not HBase/Cassandra (key-value stores). That's the direction we are
heading in.

Would love to hear more feedback.
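
The idea of a unique key plus a secondary index on another column can be sketched with a toy in-memory table. This is a conceptual model only, not Hudi's indexing design: `ToyTable` and its methods are hypothetical, and the rows are made up.

```python
from collections import defaultdict
from typing import Dict, Set


class ToyTable:
    """Rows stored by unique key, plus one secondary index on another column."""

    def __init__(self, key_field: str, indexed_field: str) -> None:
        self.key_field = key_field
        self.indexed_field = indexed_field
        self.rows: Dict[str, Dict[str, str]] = {}
        # Secondary index: column value -> set of primary keys.
        self.secondary: Dict[str, Set[str]] = defaultdict(set)

    def insert(self, row: Dict[str, str]) -> None:
        key = row[self.key_field]
        if key in self.rows:
            raise KeyError(f"duplicate key: {key}")
        self.rows[key] = row
        self.secondary[row[self.indexed_field]].add(key)

    def delete_by_key(self, key: str) -> None:
        row = self.rows.pop(key)
        value = row[self.indexed_field]
        self.secondary[value].discard(key)
        if not self.secondary[value]:
            del self.secondary[value]

    def delete_by_indexed_value(self, value: str) -> int:
        """Delete every row matching `value` via the index, without a scan."""
        keys = list(self.secondary.get(value, ()))
        for key in keys:
            self.delete_by_key(key)
        return len(keys)


table = ToyTable(key_field="order_id", indexed_field="customer_id")
table.insert({"order_id": "o1", "customer_id": "c1"})
table.insert({"order_id": "o2", "customer_id": "c1"})
table.insert({"order_id": "o3", "customer_id": "c2"})

table.delete_by_key("o1")                        # single-record delete
removed = table.delete_by_indexed_value("c1")    # user-level delete
```

In this model the unique key keeps single-record updates and deletes possible, while the secondary index covers the user-level delete that motivated the non-unique key.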

On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris <ni...@riseup.net>
wrote:

> For example, does the move of bloom filters into HFiles (a 0.10.0 feature)
> make unique bloom keys mandatory?