Posted to dev@hudi.apache.org by Tanuj <ta...@gmail.com> on 2020/10/15 08:06:18 UTC

HUDI Table Primary Key - UUID or Custom For Better Performance

Hi all,
We don't have an "UPDATE" use case; all ingested rows will be "INSERT"s. What is the best way to define the primary key? As of now we have designed the primary key from the domain object keys plus create_date:
<domain_object_key_1>,<domain_object_key_2>,<create_date>

Since it's always an INSERT for us, I could potentially use a UUID as well.

We use the keys for the Bloom index in Hudi, so I wanted to know whether I would get better write performance with a UUID than with composite domain keys.
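
For reference, here is a minimal sketch of how a composite key like the one above might be expressed as Hudi Spark datasource write options. The table name, the partition column `partition_col`, and the use of `create_date` as the precombine field are assumptions made for illustration only; the option keys themselves are standard Hudi datasource options.

```python
# Hypothetical names: "domain_events" and "partition_col" are placeholders,
# not taken from the thread. Replace them with the real table and schema.
hudi_options = {
    "hoodie.table.name": "domain_events",
    # Every row is new, so the write operation is a plain insert.
    "hoodie.datasource.write.operation": "insert",
    # Composite record key built from the three fields named in the post.
    "hoodie.datasource.write.recordkey.field":
        "domain_object_key_1,domain_object_key_2,create_date",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.precombine.field": "create_date",
    # ComplexKeyGenerator supports record keys made of several fields.
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.index.type": "BLOOM",
}

# With a SparkSession and the Hudi bundle on the classpath, the write would
# look roughly like this (df and base_path are assumed to exist):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Switching to a UUID key would only change the record key field (and allow the simpler single-field key generator); the rest of the options would stay the same.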

I believe reads are not affected by the choice of primary key, since it is not considered on the read path?

Please suggest.


Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by Vinoth Chandar <vi...@apache.org>.
Got it; please feel free to raise a JIRA for the future.

On Wed, Oct 21, 2020 at 9:47 PM tanu dua <ta...@gmail.com> wrote:

> Thanks got it. Unfortunately it’s not very straightforward for me to
> provide ordered keys. So far I am getting a decent write performance so
> will revisit if required.

Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by tanu dua <ta...@gmail.com>.
Thanks, got it. Unfortunately, it's not very straightforward for me to
provide ordered keys. So far I am getting decent write performance, so I
will revisit this if required.

On Wed, 21 Oct 2020 at 7:45 AM, Vinoth Chandar <
mail.vinoth.chandar@gmail.com> wrote:

> For now, bloom filters are not actually leveraged in the read/query path
> but only by the writer performing the index lookup for upserting. Hudi is
> write optimized like an OLTP store and read optimized like OLAP, if
> that makes sense.
>
> As for bloom index performance, our tuning guide and FAQ talk about this.
> If you eventually want to support de-duplication say, it might be good to
> pick a key that is ordered. Something like _hoodie_seq_no that keeps
> increasing with new commits, then the bloom indexing mechanism will be also
> able to do range pruning effectively improving performance significantly.
> Pure uuid keys are not very conducive for range pruning ie files written
> during each commit will over lap in key range with almost every other file.
>
> Thanks
> Vinoth

Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by Vinoth Chandar <ma...@gmail.com>.
For now, bloom filters are not actually leveraged in the read/query path
but only by the writer performing the index lookup for upserting. Hudi is
write optimized like an OLTP store and read optimized like OLAP, if
that makes sense.

As for bloom index performance, our tuning guide and FAQ talk about this.
If you eventually want to support de-duplication, say, it might be good to
pick a key that is ordered, something like _hoodie_seq_no that keeps
increasing with new commits. The bloom indexing mechanism will then also be
able to do range pruning effectively, improving performance significantly.
Pure UUID keys are not very conducive to range pruning, i.e. files written
during each commit will overlap in key range with almost every other file.
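
The range-pruning argument can be illustrated with a small simulation. This is a sketch, not Hudi code: it assumes each file tracks only the minimum and maximum key it holds, that keys land in files in arrival order, and it uses made-up sizes (10,000 rows, 500 rows per file). It counts how many files remain as candidates for a single key lookup after pruning by key range.

```python
import random
import uuid


def make_files(keys, rows_per_file):
    """Group keys in arrival order into files; keep each file's (min, max) key."""
    files = []
    for start in range(0, len(keys), rows_per_file):
        chunk = keys[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def candidate_files(files, key):
    """Count files whose [min, max] key range could contain the key."""
    return sum(1 for low, high in files if low <= key <= high)


random.seed(7)
n_rows, rows_per_file = 10_000, 500  # 20 files in total

# Ordered keys: a zero-padded counter that grows with every write.
ordered_keys = [f"{i:010d}" for i in range(n_rows)]
# Random keys: version-4 UUIDs built from seeded random bits (reproducible).
uuid_keys = [
    str(uuid.UUID(int=random.getrandbits(128), version=4))
    for _ in range(n_rows)
]

ordered_files = make_files(ordered_keys, rows_per_file)
uuid_files = make_files(uuid_keys, rows_per_file)

# Look up one existing key of each kind.
ordered_hits = candidate_files(ordered_files, ordered_keys[1234])
uuid_hits = candidate_files(uuid_files, uuid_keys[1234])

print(f"ordered keys: {ordered_hits} of {len(ordered_files)} files to check")
print(f"uuid keys:    {uuid_hits} of {len(uuid_files)} files to check")
```

With ordered keys the file ranges do not overlap, so exactly one file survives pruning. With random UUIDs each file's range spans nearly the whole key space, so most files remain candidates and their bloom filters still have to be consulted.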

Thanks
Vinoth

On Fri, Oct 16, 2020 at 8:42 PM Tanuj <ta...@gmail.com> wrote:

> Thanks Prashant. To answer your questions -
> 1) Yes size of keys are something around 5-8 alphanumeric but since its
> composite key of 3 domain keys I believe it will be almost equal to UUID
> 4) Thats the business need. We need to keep a track/audit for every
> insertion of new record. We had 2 options - Update Existing Record , make
> an Audit Table to store old records or keep pushing in the same table with
> timestamp so that it always works with Append mode. We choose Option 2
> 5) Thats what I want to understand how Bloom Filters will be useful here.
> And in general also is bloom filter used in HUDI for read. I understand the
> write process where its being used but does it use in read as well as I
> believe after picking up the correct parquet file Hudi delegates the read
> to Spark . Please correct me if I am wrong here
> 6) We will only query on domain object keys excluding create_date.

Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by Tanuj <ta...@gmail.com>.
Thanks Prashant. To answer your questions:
1) Yes, each key is around 5-8 alphanumeric characters, but since it's a composite key of 3 domain keys I believe it will be almost equal in size to a UUID.
4) That's the business need. We need to keep a track/audit of every insertion of a new record. We had 2 options: (1) update the existing record and maintain an audit table to store old records, or (2) keep pushing into the same table with a timestamp so that it always works in append mode. We chose Option 2.
5) That's what I want to understand: how will bloom filters be useful here? And in general, is the bloom filter used in Hudi for reads? I understand the write process where it is used, but is it used in reads as well? I believe that after picking the correct Parquet file, Hudi delegates the read to Spark. Please correct me if I am wrong here.
6) We will only query on the domain object keys, excluding create_date.
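
The size and compressibility question from points 1 and 2 can be checked roughly with the sketch below. The key layout (`CUSTnnnn`, `ORDnnnnn`, a date in October 2020) is a hypothetical stand-in for the real domain keys, and zlib is used only as a convenient general-purpose compressor; Parquet's own encodings and codecs will give different absolute numbers.

```python
import random
import uuid
import zlib

random.seed(11)
n_keys = 5_000


def uuid_key():
    """Version-4 UUID string built from seeded random bits (reproducible)."""
    return str(uuid.UUID(int=random.getrandbits(128), version=4))


def composite_key():
    """Hypothetical composite key: two 8-character domain keys plus a date."""
    key_1 = f"CUST{random.randrange(200):04d}"
    key_2 = f"ORD{random.randrange(5000):05d}"
    create_date = f"2020-10-{random.randrange(1, 29):02d}"
    return f"{key_1},{key_2},{create_date}"


uuid_blob = "\n".join(uuid_key() for _ in range(n_keys)).encode()
composite_blob = "\n".join(composite_key() for _ in range(n_keys)).encode()

# Compressed size divided by raw size; lower means better compression.
uuid_ratio = len(zlib.compress(uuid_blob)) / len(uuid_blob)
composite_ratio = len(zlib.compress(composite_blob)) / len(composite_blob)

print(f"uuid keys:      {len(uuid_blob)} raw bytes, ratio {uuid_ratio:.2f}")
print(f"composite keys: {len(composite_blob)} raw bytes, ratio {composite_ratio:.2f}")
```

Under these assumptions the raw sizes are in the same ballpark, but the composite keys compress noticeably better because their prefixes and dates repeat, while the hex digits of a random UUID carry almost no repetition.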

On 2020/10/16 18:53:21, Prashant Wason <pw...@uber.com.INVALID> wrote: 
> Hi Tanu,
> 
> Some points to consider:
> 1. UUID is fixed size compared to domain_object_keys (dont know the size).
> Smaller keys will reduce the storage requirements.
> 2. UUIDs don't compress. Your domain object keys may compress better.
> 3. From the bloom filter perspective, I dont think there is any difference
> unless the size difference of keys is very large.
> 4. If the domain object keys are already unique, what is the use of
> suffixing the create_date?
> 5. If you query by "primary key minus timestamp", the entire record key
> column will have to be read to match it. So bloom filters won't be useful
> here.
> 6. What do the domain object keys look like? Are they going to be included
> in any other field in the record? Would you ever want to query on domain
> object keys?
> 
> Thanks
> Prashant

Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by Prashant Wason <pw...@uber.com.INVALID>.
Hi Tanu,

Some points to consider:
1. UUID is fixed size compared to domain_object_keys (don't know the size).
Smaller keys will reduce the storage requirements.
2. UUIDs don't compress. Your domain object keys may compress better.
3. From the bloom filter perspective, I don't think there is any difference
unless the size difference of keys is very large.
4. If the domain object keys are already unique, what is the use of
suffixing the create_date?
5. If you query by "primary key minus timestamp", the entire record key
column will have to be read to match it. So bloom filters won't be useful
here.
6. What do the domain object keys look like? Are they going to be included
in any other field in the record? Would you ever want to query on domain
object keys?
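
Point 5 can be illustrated with a toy bloom filter. This is a generic sketch, not Hudi's implementation: the class `TinyBloom`, its sizes, and the sample keys are all made up for illustration. It shows that a bloom filter answers membership only for the exact string that was inserted, so a lookup by "primary key minus timestamp" has to fall back to scanning the key column.

```python
import hashlib


class TinyBloom:
    """Toy bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical record keys in the <key_1>,<key_2>,<create_date> layout.
keys = [
    "CUST01,ORD42,2020-10-15",
    "CUST01,ORD42,2020-10-16",
    "CUST02,ORD07,2020-10-15",
]
bloom = TinyBloom()
for key in keys:
    bloom.add(key)

# The full key is always reported as possibly present (no false negatives).
full_key_hit = bloom.might_contain("CUST01,ORD42,2020-10-15")
# The key without its timestamp hashes to unrelated bits, so the filter
# cannot tell us that two records share this prefix.
prefix_hit = bloom.might_contain("CUST01,ORD42")
# Finding those records means comparing against every stored key.
prefix_matches = [k for k in keys if k.startswith("CUST01,ORD42,")]

print(full_key_hit, prefix_hit, prefix_matches)
```

The scan at the end is the toy equivalent of reading the entire record key column: the filter offers no shortcut for prefix queries.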

Thanks
Prashant


On Thu, Oct 15, 2020 at 8:21 PM tanu dua <ta...@gmail.com> wrote:

> read query pattern will be (partition key + primary key minus timestamp)
> where my primary key is domain keys + timestamp.
>
> Read Write queries are as per dataset but mostly all the tables are read
> and write frequently and equally
>
> Read will be mostly done by providing the partitions and not by blanket
> query.
>
> If we have to choose between read and write I will choose write but I want
> to stick only with COW table.
>
> Please let me know if you need more information.

Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by tanu dua <ta...@gmail.com>.
The read query pattern will be (partition key + primary key minus timestamp),
where my primary key is domain keys + timestamp.

Read and write patterns vary by dataset, but almost all the tables are read
and written frequently and about equally.

Reads will mostly be done by specifying the partitions, not by blanket
queries.

If we have to choose between read and write, I will choose write, but I want
to stick with COW tables only.

Please let me know if you need more information.


On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan <n....@gmail.com> wrote:

> Can you give us a sense of how your read workload looks like? Depending on
> that read perf could vary.

Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Posted by Sivabalan <n....@gmail.com>.
Can you give us a sense of what your read workload looks like? Read
performance could vary depending on that.

On Thu, Oct 15, 2020 at 4:06 AM Tanuj <ta...@gmail.com> wrote:

> Hi all,
> We don't have an "UPDATE" use case and all ingested rows will be "INSERT"
> so what is the best way to define PRIMARY key. As of now we have designed
> primary key as per domain object with create_date which is -
> <domain_object_key_1>,<domain_object_key_2>,<create_date>
>
> Since its always an INSERT for us , I can potentially use UUID as well .
>
> We use keys for Bloom Index in HUDI so just wanted to know if I get a
> better performance in writing if I will have the UUID vs composite domain
> keys.
>
> I believe read is not impacted as per the Primary Key as its not being
> considered ?
>
> Please suggest
>
>

-- 
Regards,
-Sivabalan