Posted to dev@hudi.apache.org by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID> on 2020/11/15 04:46:21 UTC

Hudi Record Key Best Practices

Hello All,

I have asked generic questions about record keys in the Slack channel, but I want to consolidate everything about the Record Key and the suggested best practices for Record Key construction to get better write performance.

Table Type: COW
Partition Path: Date

My record uniqueness is derived from a combination of 4 fields:

  1.  F1: Datetime (record’s origination datetime)
  2.  F2: String (11-char serial number)
  3.  F3: UUID (User Identifier)
  4.  F4: String (12-char statistic name)

Note: My record is a nested document, and some of the above fields are nested fields.

My Write Use Cases:
1. Writes to the partitioned HUDI table every 15 minutes

  1.  where 95% of writes are inserts and 5% are updates,
  2.  and 95% of writes go to the same partition (current date); 5% can span multiple partitions
2. GDPR requests to delete records from the table using the User Identifier field (F3)


Record Key Construction:
Approach 1:
Generate a UUID from the concatenated string of all 4 fields [e.g., str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4)] and use that newly generated field as the Record Key.

Approach 2:
Generate a UUID from the concatenated string of 3 fields, excluding the datetime field (F1) [e.g., str(F2) + “_” + str(F3) + “_” + str(F4)], then prepend the datetime field to the generated UUID and use that as the Record Key: F1_<uuid>

Approach 3:
Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
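
For illustration, here is a minimal Python sketch of how the keys in Approach 1 and Approach 2 could be derived (the deterministic uuid5 choice and the field handling are my own assumptions, not something Hudi mandates). Approach 3 would roughly correspond to listing all four fields in hoodie.datasource.write.recordkey.field with a complex key generator.

import uuid
from datetime import datetime, timezone

def approach1_key(f1: datetime, f2: str, f3: uuid.UUID, f4: str) -> str:
    # Approach 1: a deterministic UUID over all four fields, so re-sending the
    # same record always yields the same key.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{f1.isoformat()}_{f2}_{f3}_{f4}"))

def approach2_key(f1: datetime, f2: str, f3: uuid.UUID, f4: str) -> str:
    # Approach 2: a deterministic UUID over F2/F3/F4 only, prefixed with the
    # event time as epoch millis (int64) so keys within a partition are roughly
    # ordered and per-file min/max key ranges can be used for pruning.
    suffix = uuid.uuid5(uuid.NAMESPACE_URL, f"{f2}_{f3}_{f4}")
    return f"{int(f1.timestamp() * 1000)}_{suffix}"

f1 = datetime(2020, 11, 15, 4, 46, 21, tzinfo=timezone.utc)
f3 = uuid.uuid4()
print(approach1_key(f1, "ABCDE123456", f3, "stat_name_01"))
print(approach2_key(f1, "ABCDE123456", f3, "stat_name_01"))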

Which approach would you suggest? Could you please help me?

Regards,
Felix K Jose











Re: Hudi Record Key Best Practices

Posted by Vinoth Chandar <vi...@apache.org>.
Sounds good to me. We are always looking to add more contributors.

https://github.com/apache/hudi/pull/2263 is the PR under review for clustering.

RFC 18/19 have the details as well

On Wed, Nov 25, 2020 at 6:20 AM Kizhakkel Jose, Felix <
felix.jose@philips.com> wrote:

> Hi Vinoth, Siva,
>
> I know you guys are so busy. But I always get quick response from one of
> hoodiers. Thank you so much for the detailed information.
>
> Yes, as suggested for UPSERTs I will go with *Approach 2*.
>
> For deletes clustering can help me. Also happy to see that we don’t need
> to duplicate that field as part of Record Key to get it clustered. Where
> can I find PR/RFC for clustering implementation to read about it and get a
> better understanding? And I believe this is something similar to bucketing
> in Hive?
>
> Also RFC-21 is going to help on the storage footprint a lot.
>
>
> All interesting stuffs. Once I complete my major Data Lake Implementation
> project I definetly would like to start contributing to HUDI.
>
>
>
> Thank you @Vinoth Chandar <vi...@apache.org> @Siva once again for all of
> your help.  And @Raymond, thank you for answering and clarifying things
> throughout this.
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Vinoth Chandar <vi...@apache.org>
> *Date: *Tuesday, November 24, 2020 at 5:52 PM
> *To: *Sivabalan <n....@gmail.com>
> *Cc: *Kizhakkel Jose, Felix <fe...@philips.com>, Raymond Xu <
> xu.shiyan.raymond@gmail.com>, dev@hudi.apache.org <de...@hudi.apache.org>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Agree with Siva's suggestions.
>
>
>
> For clustering, it's not necessary for it to be part of the key. (Satish
> can correct if I missed something)
>
>
>
> On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <n....@gmail.com> wrote:
>
> here are the discussions points we had in slack.
>
>
>
> Suggestion is to go with approach 2 based on these points.
>
> - Prefixing F1 (including timestamp), will help pruning some file slices
> even within a day (within a partition) if records are properly ordered
> based on timestamp.
>
> - Deletes are occasional compared to upserts. So, optimizing for upserts
> makes sense and hence approach 2 is fine. Also, anyways to delete records,
> its two part execution. First a query to hudi like "select HoodieKey from
> hudi_tbl where user_id = 'X'), and the a DELETE operation to hudi for these
> HoodieKeys. For first query, I assume embedding user_id in record keys does
> not matter, bcoz, this query does filtering for a specific column in the
> dataset.
>
> So, initially thought not much of value embedding user id in record key.
> But as vinoth suggested, clustering could come in handy and so lets have
> userId too as part of record keys.
>
> - In approach3, the record keys could be too large and so may not want to
> go this route.
>
>
>
>
>
>
>
>
>
>
>
> On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <vi...@apache.org> wrote:
>
> Hi Felix,
>
>
>
> I will try to be faster going forward. Apologies for the late reply.
> Thanks Raymond for all the great clarifications.
>
>
>
> On RFC-21, I think it's safe to assume it will be available by Jan or so.
> 0.8.0 (Uber folks, correct me if I am wrong)
>
>
>
> >>For approach 2 – the reason for prepending datetime is to have an
> incrementing id, otherwise your uuid is a purely random id and wont support
> range pruning, while writing, correct?
>
> You are right. In general, we only have the following levers to control
> performance. I take it that "origination datetime" is not monotonically
> increasing? Otherwise Approach 1 is good, right?
>
>
>
> If you want to optimize for upsert performance,
>
> - prepending a timestamp field would help. if you simply prepend the date,
> which is already also the partition path, then all keys in that partition
> will have the same prefix and no additional pruning opportunities exist.
>
> - Advise using dynamic bloom filters
> (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom
> filters filter our enough files after range pruning.
>
>
>
> For good delete performance, we can cluster records by user_id for older
> partitions, such that all records a user is packed into the smallest number
> of files. This way,  when only a small number of users leave,
>
> your delete won't rewrite the entire partition's files. Clustering support
> is landing by the end of year in 0.7.0. (There is a PR out already, if you
> want to test/play).
>
>
>
> All of this is also highly workload specific. So we can get into those
> details, if that helps. MOR is a much better alternative for dealing with
> deletes IMO.
>
> It was specifically designed, used for those, since it can absorb the
> deletes into log files and apply them later amortizing costs.
>
>
>
> Future is good, since we are investing in record level indexes that could
> also natively index secondary fields like user_id. Again expect that to be
> there in 0.9.0 or something, around Mar.
>
> For now, we have to play with how we lay out the data to squeeze
> performance.
>
>
>
> Hope that helps.
>
>
>
> thanks
>
> vinoth
>
>
>
>
>
>
>
>
>
>
>
> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <
> felix.jose@philips.com> wrote:
>
> Hi Raymond,
>
> Thanks a lot for the reply.
>
> For approach 2 – the reason for prepending datetime is to have a
> incrementing id, otherwise your uuid is a purely random id and wont support
> range pruning, while writing, correct? In a given date partition I am
> expected to get 10s of billions records, and by having an incrementing id
> helps BLOOM filtering? This is the only intend of having the prefix of
> datetime (int64 representation)
>
> Yes, I also see Approach 3 really too big and causing lot in storage
> footprint.
>
> My initial approach was Approach 1 (generated uuid from all the 4 fields),
> then heard that the range pruning can make write faster – so thought of
> datetime as prefix. Do you see any benefit or the UUID can itself be
> sufficient -since it’s been generated from the 4 input fields?
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Raymond Xu <xu...@gmail.com>
> *Date: *Tuesday, November 24, 2020 at 2:20 AM
> *To: *Kizhakkel Jose, Felix <fe...@philips.com>
> *Cc: *dev@hudi.apache.org <de...@hudi.apache.org>, vinoth@apache.org <
> vinoth@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Hi Felix,
>
> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your
> dataset.
>
> For 2, not very sure about the intention of prepending the datetime, looks
> like duplicate info knowing that you already partitioned it by that field.
>
> For 3, it seems too long for a primary id.
>
> Hope this helps.
>
>
>
> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
> felix.jose@philips.com> wrote:
>
> @Vinoth Chandar <vi...@apache.org>,
>
> Could you please take a look at and let me know what is the best approach
> or could you see whom can help me on this?
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
> *Date: *Thursday, November 19, 2020 at 12:04 PM
> *To: *dev@hudi.apache.org <de...@hudi.apache.org>, Vinoth Chandar <
> vinoth@apache.org>, xu.shiyan.raymond@gmail.com <
> xu.shiyan.raymond@gmail.com>
> *Cc: *vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Sure. I will see about partition key.
>
> Since RFC 21 is not yet implemented and available to consume, can anyone
> please suggest what is the best approach I should be following to construct
> the record key I asked in the  original question:
>
> “
> My Write Use Cases:
> 1. Writes to partitioned HUDI table every 15 minutes
>
>   1.  where 95% inserts and 5% updates,
>   2.  Also 95% write goes to same partition (current date) 5% write can
> span across multiple partitions
> 2. GDPR request to delete records from the table using User Identifier
> field (F3)
>
>
> Record Key Construction:
> Approach 1:
> Generate a UUID  from the concatenated String of all these 4 fields [eg:
> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> newly generated field as Record Key
>
> Approach 2:
> Generate a UUID  from the concatenated String of 3 fields except datetime
> field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> datetime field to the generated UUID and use that newly generated field as
> Record Key •F1_<uuid>
>
> Approach 3:
> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> “
>
> Regards,
> Felix K Jose
> From: Raymond Xu <xu...@gmail.com>
> Date: Wednesday, November 18, 2020 at 5:30 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
> just saying maybe making the writes more evenly spreaded could be
> better. Effectively, with 95% writes, it's like writing to a single
> partition dataset. Hourly partition could mitigate the situation, since you
> also have date-range queries. Just some rough ideas, the strategy really
> depends on your data pattern and requirements.
>
> For the development timeline on RFC 21, probably Vinoth or Balaji
> could give more info.
>
> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hi Raymond,
> > Thank you for the response.
> >
> > Yes, the virtual key definitely going to help reducing the storage
> > footprint. When do you think it is going to be available and will it be
> > compatible with all downstream processing engines (Presto, Redshift
> > Spectrum etc.)? We have started our development activities and expecting
> to
> > get into PROD by March-April timeframe.
> >
> > Regarding the partition key,  we get data every day from 10-20 million
> > users and currently the data we are planning to partition is by Date
> > (YYYY-MM-DD) and thereby we will have consistent partitions for
> downstream
> > systems(every partition has same amount of data [20 million user data in
> > each partition, rather than skewed partitions]). And most of our queries
> > are date range queries for a given user-Id
> >
> > If I partition by user-Id, then I will have millions of partitions, and I
> > have read that having large number of partition has major read impact
> (meta
> > data management etc.), what do you think? Is my understanding correct?
> >
> > Yes, for current day most of the data will be for that day – so do you
> > think it’s going to be a problem while writing (wont the BLOOM index
> help)?
> > And that’s what I am trying to understand to land in a better performant
> > solution.
> >
> > Meanwhile I would like to see my record Key construct as well, to see how
> > it can help on write performance and downstream requirement to support
> > GDPR.  To avoid any reprocessing/migration down the line.
> >
> > Regards,
> > Felix K Jose
> >
> > From: Raymond Xu <xu...@gmail.com>
> > Date: Tuesday, November 17, 2020 at 6:18 PM
> > To: dev@hudi.apache.org <de...@hudi.apache.org>
> > Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> > n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
> > <v....@ymail.com.invalid>
> > Subject: Re: Hudi Record Key Best Practices
> > Hi Felix, looks like the use case will benefit from virtual key feature
> in
> > this RFC
> >
> >
> >
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
> >
> > Once this is implemented, you don't have to create a separate key.
> >
> > A rough thought: you mentioned 95% writes go to the same partition.
> Rather
> > than the record key, maybe consider improving on the partition field? to
> > have more even writes across partitions for eg?
> >
> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> > <fe...@philips.com.invalid> wrote:
> >
> > > Hello All,
> > >
> > > I have asked generic questions regarding record key in slack channel,
> but
> > > I just want to consolidate everything regarding Record Key and the
> > > suggested best practices of Record Key construction to get better write
> > > performance.
> > >
> > > Table Type: COW
> > > Partition Path: Date
> > >
> > > My record uniqueness is derived from a combination of 4 fields:
> > >
> > >   1.  F1: Datetime (record’s origination datetime)
> > >   2.  F2: String       (11 char  long serial number)
> > >   3.  F3: UUID        (User Identifier)
> > >   4.  F4: String.       (12 CHAR statistic name)
> > >
> > > Note: My record is a nested document and some of the above fields are
> > > nested fields
> > >
> > > My Write Use Cases:
> > > 1. Writes to partitioned HUDI table every 15 minutes
> > >
> > >   1.  where 95% inserts and 5% updates,
> > >   2.  Also 95% write goes to same partition (current date) 5% write can
> > > span across multiple partitions
> > > 2. GDPR request to delete records from the table using User Identifier
> > > field (F3)
> > >
> > >
> > > Record Key Construction:
> > > Approach 1:
> > > Generate a UUID  from the concatenated String of all these 4 fields
> [eg:
> > > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> > > newly generated field as Record Key
> > >
> > > Approach 2:
> > > Generate a UUID  from the concatenated String of 3 fields except
> datetime
> > > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> > > datetime field to the generated UUID and use that newly generated field
> > as
> > > Record Key •F1_<uuid>
> > >
> > > Approach 3:
> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> > >
> > > Which is the approach you will suggest? Could you please help me?
> > >
> > > Regards,
> > > Felix K Jose
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
>
>
>
>
> --
>
> Regards,
> -Sivabalan
>
>

Re: Hudi Record Key Best Practices

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Vinoth, Siva,

I know you guys are so busy, but I always get a quick response from one of the Hudi folks. Thank you so much for the detailed information.

Yes, as suggested for UPSERTs I will go with Approach 2.

For deletes, clustering can help me. I'm also happy to see that we don’t need to duplicate that field as part of the Record Key to get it clustered. Where can I find the PR/RFC for the clustering implementation, to read about it and get a better understanding? And I believe this is something similar to bucketing in Hive?

Also, RFC-21 is going to help a lot with the storage footprint.


All interesting stuff. Once I complete my major Data Lake implementation project, I definitely would like to start contributing to HUDI.

Thank you @Vinoth Chandar<ma...@apache.org> @Siva once again for all of your help.  And @Raymond, thank you for answering and clarifying things throughout this.

Regards,
Felix K Jose
From: Vinoth Chandar <vi...@apache.org>
Date: Tuesday, November 24, 2020 at 5:52 PM
To: Sivabalan <n....@gmail.com>
Cc: Kizhakkel Jose, Felix <fe...@philips.com>, Raymond Xu <xu...@gmail.com>, dev@hudi.apache.org <de...@hudi.apache.org>
Subject: Re: Hudi Record Key Best Practices
Agree with Siva's suggestions.

For clustering, it's not necessary for it to be part of the key. (Satish can correct if I missed something)

On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <n....@gmail.com>> wrote:
Here are the discussion points we had in Slack.

The suggestion is to go with approach 2, based on these points.
- Prefixing F1 (including the timestamp) will help prune some file slices even within a day (within a partition), if records are properly ordered based on timestamp.
- Deletes are occasional compared to upserts, so optimizing for upserts makes sense and hence approach 2 is fine. Also, deleting records is a two-part execution anyway: first a query to Hudi like "select HoodieKey from hudi_tbl where user_id = 'X'", and then a DELETE operation to Hudi for those HoodieKeys. For the first query, I assume embedding user_id in record keys does not matter, because this query filters on a specific column in the dataset.
So, initially I thought there was not much value in embedding user id in the record key. But as Vinoth suggested, clustering could come in handy, so let's have userId as part of the record keys too.
- In approach 3, the record keys could be too large, so we may not want to go this route.





On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <vi...@apache.org>> wrote:
Hi Felix,

I will try to be faster going forward. Apologies for the late reply. Thanks Raymond for all the great clarifications.

On RFC-21, I think it's safe to assume it will be available by Jan or so, in 0.8.0 (Uber folks, correct me if I am wrong).

>>For approach 2 – the reason for prepending datetime is to have an incrementing id, otherwise your uuid is a purely random id and wont support range pruning, while writing, correct?
You are right. In general, we only have the following levers to control performance. I take it that "origination datetime" is not monotonically increasing? Otherwise Approach 1 is good, right?

If you want to optimize for upsert performance:
- Prepending a timestamp field would help. If you simply prepend the date, which is already the partition path, then all keys in that partition will have the same prefix and no additional pruning opportunities exist.
- I'd advise using dynamic bloom filters (config hoodie.bloom.index.filter.type=DYNAMIC_V0) to ensure the bloom filters filter out enough files after range pruning.
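
For illustration, a rough PySpark sketch of how those write-side settings could be wired up (the table name, path, and the precombine column event_ts are assumptions made for this example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi Spark bundle is on the classpath

# Illustrative one-row batch; a real batch would carry the nested document plus
# the derived record_key (e.g. the F1_<uuid> form) and partition_date columns.
df = spark.createDataFrame(
    [("1605415581000_3b2f1c9e-1111-2222-3333-444455556666",
      "2020-11-15", 1605415581000, "user-123")],
    ["record_key", "partition_date", "event_ts", "user_id"],
)

hudi_options = {
    "hoodie.table.name": "events_cow",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "record_key",
    "hoodie.datasource.write.partitionpath.field": "partition_date",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",  # dynamic bloom filters, as advised above
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/lake/events_cow"))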

For good delete performance, we can cluster records by user_id for older partitions, such that all records of a user are packed into the smallest number of files. This way, when only a small number of users leave,
your delete won't rewrite the entire partition's files. Clustering support is landing by the end of the year in 0.7.0. (There is a PR out already, if you want to test/play.)
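
For reference, a hedged sketch of how that clustering layout could be expressed once the feature lands (these clustering config names come from the 0.7.0+ clustering work, and the values here are purely illustrative):

# Inline clustering that sorts/packs records by user_id within older file groups,
# so a GDPR delete for a single user touches as few files as possible.
clustering_options = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "user_id",
}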

All of this is also highly workload specific, so we can get into those details if that helps. MOR is a much better alternative for dealing with deletes, IMO.
It was specifically designed for and used in those cases, since it can absorb the deletes into log files and apply them later, amortizing costs.
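
If MOR were chosen for that reason, the write-side change is essentially just the table type plus a compaction policy; a minimal hedged sketch (values illustrative):

# Merge-on-Read variant: deletes/updates land in log files and are folded in at
# compaction time, amortizing the rewrite cost described above.
mor_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "10",
}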

The future is good, since we are investing in record-level indexes that could also natively index secondary fields like user_id. Again, expect that to be there in 0.9.0 or so, around March.
For now, we have to play with how we lay out the data to squeeze out performance.

Hope that helps.

thanks
vinoth





On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <fe...@philips.com>> wrote:
Hi Raymond,

Thanks a lot for the reply.

For approach 2 – the reason for prepending the datetime is to have an incrementing id; otherwise your UUID is a purely random id and won't support range pruning while writing, correct? In a given date partition I expect to get 10s of billions of records, and having an incrementing id helps BLOOM filtering? This is the only intent of having the datetime prefix (int64 representation).

Yes, I also see Approach 3 as really too big, costing a lot in storage footprint.

My initial approach was Approach 1 (a UUID generated from all 4 fields), then I heard that range pruning can make writes faster – so I thought of the datetime prefix. Do you see any benefit, or can the UUID itself be sufficient, since it’s generated from the 4 input fields?

Regards,
Felix K Jose
From: Raymond Xu <xu...@gmail.com>>
Date: Tuesday, November 24, 2020 at 2:20 AM
To: Kizhakkel Jose, Felix <fe...@philips.com>>
Cc: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>, vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <n....@gmail.com>>
Subject: Re: Hudi Record Key Best Practices
Hi Felix,
I'd prefer approach 1. The logic is simple: to ensure uniqueness in your dataset.
For 2, I'm not very sure about the intention of prepending the datetime; it looks like duplicate info, given that you already partition by that field.
For 3, it seems too long for a primary id.
Hope this helps.

On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <fe...@philips.com>> wrote:
@Vinoth Chandar<ma...@apache.org>,

Could you please take a look and let me know the best approach, or point me to someone who can help with this?

Regards,
Felix K Jose
From: Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
Date: Thursday, November 19, 2020 at 12:04 PM
To: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>, Vinoth Chandar <vi...@apache.org>>, xu.shiyan.raymond@gmail.com<ma...@gmail.com> <xu...@gmail.com>>
Cc: vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <n....@gmail.com>>
Subject: Re: Hudi Record Key Best Practices
Sure. I will see about partition key.

Since RFC 21 is not yet implemented and available to consume, can anyone please suggest the best approach I should follow to construct the record key, as asked in the original question:

“
My Write Use Cases:
1. Writes to partitioned HUDI table every 15 minutes

  1.  where 95% inserts and 5% updates,
  2.  Also 95% write goes to same partition (current date) 5% write can span across multiple partitions
2. GDPR request to delete records from the table using User Identifier field (F3)


Record Key Construction:
Approach 1:
Generate a UUID  from the concatenated String of all these 4 fields [eg: str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that newly generated field as Record Key

Approach 2:
Generate a UUID  from the concatenated String of 3 fields except datetime field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend datetime field to the generated UUID and use that newly generated field as Record Key •F1_<uuid>

Approach 3:
Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
“

Regards,
Felix K Jose
From: Raymond Xu <xu...@gmail.com>>
Date: Wednesday, November 18, 2020 at 5:30 PM
To: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>
Cc: vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <n....@gmail.com>>
Subject: Re: Hudi Record Key Best Practices
Hi Felix, I wasn't suggesting partitioning by user id; that would be too many
partitions. I was just saying that making the writes more evenly spread could be
better. Effectively, with 95% of writes going to one partition, it's like writing
to a single-partition dataset. An hourly partition could mitigate the situation,
since you also have date-range queries. Just some rough ideas; the strategy really
depends on your data pattern and requirements.
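
For illustration, a small PySpark sketch of deriving an hourly partition path from the event time (column names are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2020-11-15 04:46:21",)], ["f1_event_time"])

# Hourly partition path (yyyy-MM-dd-HH) derived from F1, so the current day's
# writes spread over up to 24 partitions instead of one.
df = df.withColumn(
    "partition_path",
    F.date_format(F.to_timestamp("f1_event_time"), "yyyy-MM-dd-HH"),
)
df.show(truncate=False)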

For the development timeline on RFC 21, probably Vinoth or Balaji
could give more info.

On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Raymond,
> Thank you for the response.
>
> Yes, the virtual key is definitely going to help reduce the storage
> footprint. When do you think it will be available, and will it be
> compatible with all downstream processing engines (Presto, Redshift
> Spectrum, etc.)? We have started our development activities and expect to
> get into PROD in the March-April timeframe.
>
> Regarding the partition key, we get data every day from 10-20 million
> users, and currently we plan to partition the data by Date
> (YYYY-MM-DD); thereby we will have consistent partitions for downstream
> systems (every partition has the same amount of data [20 million users' data
> in each partition], rather than skewed partitions). Most of our queries
> are date-range queries for a given user-Id.
>
> If I partition by user-Id, then I will have millions of partitions, and I
> have read that having a large number of partitions has a major read impact
> (metadata management etc.). What do you think? Is my understanding correct?
>
> Yes, for the current day most of the data will be for that day – so do you
> think it's going to be a problem while writing (won't the BLOOM index help)?
> That's what I am trying to understand, to land on a more performant
> solution.
>
> Meanwhile I would like to settle my record key construct as well, to see how
> it can help with write performance and the downstream requirement to support
> GDPR, and to avoid any reprocessing/migration down the line.
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu...@gmail.com>>
> Date: Tuesday, November 17, 2020 at 6:18 PM
> To: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>
> Cc: vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <
> n.siva.b@gmail.com<ma...@gmail.com>>, v.balaji@ymail.com.invalid
> <v....@ymail.com.invalid>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, it looks like the use case will benefit from the virtual key feature in
> this RFC:
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>
> Once this is implemented, you don't have to create a separate key.
>
> A rough thought: you mentioned 95% of writes go to the same partition. Rather
> than the record key, maybe consider improving the partition field, to
> have more even writes across partitions, for example?
>
> On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hello All,
> >
> > I have asked generic questions regarding record key in slack channel, but
> > I just want to consolidate everything regarding Record Key and the
> > suggested best practices of Record Key construction to get better write
> > performance.
> >
> > Table Type: COW
> > Partition Path: Date
> >
> > My record uniqueness is derived from a combination of 4 fields:
> >
> >   1.  F1: Datetime (record’s origination datetime)
> >   2.  F2: String       (11 char  long serial number)
> >   3.  F3: UUID        (User Identifier)
> >   4.  F4: String.       (12 CHAR statistic name)
> >
> > Note: My record is a nested document and some of the above fields are
> > nested fields
> >
> > My Write Use Cases:
> > 1. Writes to partitioned HUDI table every 15 minutes
> >
> >   1.  where 95% inserts and 5% updates,
> >   2.  Also 95% write goes to same partition (current date) 5% write can
> > span across multiple partitions
> > 2. GDPR request to delete records from the table using User Identifier
> > field (F3)
> >
> >
> > Record Key Construction:
> > Approach 1:
> > Generate a UUID  from the concatenated String of all these 4 fields [eg:
> > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> > newly generated field as Record Key
> >
> > Approach 2:
> > Generate a UUID  from the concatenated String of 3 fields except datetime
> > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> > datetime field to the generated UUID and use that newly generated field
> as
> > Record Key •F1_<uuid>
> >
> > Approach 3:
> > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> >
> > Which is the approach you will suggest? Could you please help me?
> >
> > Regards,
> > Felix K Jose
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >


--
Regards,
-Sivabalan

Re: Hudi Record Key Best Practices

Posted by Vinoth Chandar <vi...@apache.org>.
Agree with Siva's suggestions.

For clustering, it's not necessary for it to be part of the key. (Satish
can correct if I missed something)

On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <n....@gmail.com> wrote:

> here are the discussions points we had in slack.
>
> Suggestion is to go with approach 2 based on these points.
> - Prefixing F1 (including timestamp), will help pruning some file slices
> even within a day (within a partition) if records are properly ordered
> based on timestamp.
> - Deletes are occasional compared to upserts. So, optimizing for upserts
> makes sense and hence approach 2 is fine. Also, anyways to delete records,
> its two part execution. First a query to hudi like "select HoodieKey from
> hudi_tbl where user_id = 'X'), and the a DELETE operation to hudi for these
> HoodieKeys. For first query, I assume embedding user_id in record keys does
> not matter, bcoz, this query does filtering for a specific column in the
> dataset.
> So, initially thought not much of value embedding user id in record key.
> But as vinoth suggested, clustering could come in handy and so lets have
> userId too as part of record keys.
> - In approach3, the record keys could be too large and so may not want to
> go this route.
>
>
>
>
>
> On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <vi...@apache.org> wrote:
>
>> Hi Felix,
>>
>> I will try to be faster going forward. Apologies for the late reply.
>> Thanks Raymond for all the great clarifications.
>>
>> On RFC-21, I think it's safe to assume it will be available by Jan or so.
>> 0.8.0 (Uber folks, correct me if I am wrong)
>>
>> >>For approach 2 – the reason for prepending datetime is to have an
>> incrementing id, otherwise your uuid is a purely random id and wont support
>> range pruning, while writing, correct?
>> You are right. In general, we only have the following levers to control
>> performance. I take it that "origination datetime" is not monotonically
>> increasing? Otherwise Approach 1 is good, right?
>>
>> If you want to optimize for upsert performance,
>> - prepending a timestamp field would help. if you simply prepend the
>> date, which is already also the partition path, then all keys in that
>> partition will have the same prefix and no additional pruning opportunities
>> exist.
>> - Advise using dynamic bloom filters
>> (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom
>> filters filter our enough files after range pruning.
>>
>> For good delete performance, we can cluster records by user_id for older
>> partitions, such that all records a user is packed into the smallest number
>> of files. This way,  when only a small number of users leave,
>> your delete won't rewrite the entire partition's files. Clustering
>> support is landing by the end of year in 0.7.0. (There is a PR out already,
>> if you want to test/play).
>>
>> All of this is also highly workload specific. So we can get into those
>> details, if that helps. MOR is a much better alternative for dealing with
>> deletes IMO.
>> It was specifically designed, used for those, since it can absorb the
>> deletes into log files and apply them later amortizing costs.
>>
>> Future is good, since we are investing in record level indexes that could
>> also natively index secondary fields like user_id. Again expect that to be
>> there in 0.9.0 or something, around Mar.
>> For now, we have to play with how we lay out the data to squeeze
>> performance.
>>
>> Hope that helps.
>>
>> thanks
>> vinoth
>>
>>
>>
>>
>>
>> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <
>> felix.jose@philips.com> wrote:
>>
>>> Hi Raymond,
>>>
>>> Thanks a lot for the reply.
>>>
>>> For approach 2 – the reason for prepending datetime is to have a
>>> incrementing id, otherwise your uuid is a purely random id and wont support
>>> range pruning, while writing, correct? In a given date partition I am
>>> expected to get 10s of billions records, and by having an incrementing id
>>> helps BLOOM filtering? This is the only intend of having the prefix of
>>> datetime (int64 representation)
>>>
>>> Yes, I also see Approach 3 really too big and causing lot in storage
>>> footprint.
>>>
>>> My initial approach was Approach 1 (generated uuid from all the 4
>>> fields), then heard that the range pruning can make write faster – so
>>> thought of datetime as prefix. Do you see any benefit or the UUID can
>>> itself be sufficient -since it’s been generated from the 4 input fields?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Felix K Jose
>>>
>>> *From: *Raymond Xu <xu...@gmail.com>
>>> *Date: *Tuesday, November 24, 2020 at 2:20 AM
>>> *To: *Kizhakkel Jose, Felix <fe...@philips.com>
>>> *Cc: *dev@hudi.apache.org <de...@hudi.apache.org>, vinoth@apache.org <
>>> vinoth@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
>>> *Subject: *Re: Hudi Record Key Best Practices
>>>
>>> Hi Felix,
>>>
>>> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your
>>> dataset.
>>>
>>> For 2, not very sure about the intention of prepending the datetime,
>>> looks like duplicate info knowing that you already partitioned it by that
>>> field.
>>>
>>> For 3, it seems too long for a primary id.
>>>
>>> Hope this helps.
>>>
>>>
>>>
>>> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
>>> felix.jose@philips.com> wrote:
>>>
>>> @Vinoth Chandar <vi...@apache.org>,
>>>
>>> Could you please take a look at and let me know what is the best
>>> approach or could you see whom can help me on this?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Felix K Jose
>>>
>>> *From: *Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
>>> *Date: *Thursday, November 19, 2020 at 12:04 PM
>>> *To: *dev@hudi.apache.org <de...@hudi.apache.org>, Vinoth Chandar <
>>> vinoth@apache.org>, xu.shiyan.raymond@gmail.com <
>>> xu.shiyan.raymond@gmail.com>
>>> *Cc: *vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
>>> n.siva.b@gmail.com>
>>> *Subject: *Re: Hudi Record Key Best Practices
>>>
>>> Sure. I will see about partition key.
>>>
>>> Since RFC 21 is not yet implemented and available to consume, can anyone
>>> please suggest what is the best approach I should be following to construct
>>> the record key I asked in the  original question:
>>>
>>> “
>>> My Write Use Cases:
>>> 1. Writes to partitioned HUDI table every 15 minutes
>>>
>>>   1.  where 95% inserts and 5% updates,
>>>   2.  Also 95% write goes to same partition (current date) 5% write can
>>> span across multiple partitions
>>> 2. GDPR request to delete records from the table using User Identifier
>>> field (F3)
>>>
>>>
>>> Record Key Construction:
>>> Approach 1:
>>> Generate a UUID  from the concatenated String of all these 4 fields [eg:
>>> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
>>> newly generated field as Record Key
>>>
>>> Approach 2:
>>> Generate a UUID  from the concatenated String of 3 fields except
>>> datetime field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and
>>> prepend datetime field to the generated UUID and use that newly generated
>>> field as Record Key •F1_<uuid>
>>>
>>> Approach 3:
>>> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>>> “
>>>
>>> Regards,
>>> Felix K Jose
>>> From: Raymond Xu <xu...@gmail.com>
>>> Date: Wednesday, November 18, 2020 at 5:30 PM
>>> To: dev@hudi.apache.org <de...@hudi.apache.org>
>>> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
>>> n.siva.b@gmail.com>
>>> Subject: Re: Hudi Record Key Best Practices
>>> Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
>>> just saying maybe making the writes more evenly spreaded could be
>>> better. Effectively, with 95% writes, it's like writing to a single
>>> partition dataset. Hourly partition could mitigate the situation, since
>>> you
>>> also have date-range queries. Just some rough ideas, the strategy really
>>> depends on your data pattern and requirements.
>>>
>>> For the development timeline on RFC 21, probably Vinoth or Balaji
>>> could give more info.
>>>
>>> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
>>> <fe...@philips.com.invalid> wrote:
>>>
>>> > Hi Raymond,
>>> > Thank you for the response.
>>> >
>>> > Yes, the virtual key definitely going to help reducing the storage
>>> > footprint. When do you think it is going to be available and will it be
>>> > compatible with all downstream processing engines (Presto, Redshift
>>> > Spectrum etc.)? We have started our development activities and
>>> expecting to
>>> > get into PROD by March-April timeframe.
>>> >
>>> > Regarding the partition key,  we get data every day from 10-20 million
>>> > users and currently the data we are planning to partition is by Date
>>> > (YYYY-MM-DD) and thereby we will have consistent partitions for
>>> downstream
>>> > systems(every partition has same amount of data [20 million user data
>>> in
>>> > each partition, rather than skewed partitions]). And most of our
>>> queries
>>> > are date range queries for a given user-Id
>>> >
>>> > If I partition by user-Id, then I will have millions of partitions,
>>> and I
>>> > have read that having large number of partition has major read impact
>>> (meta
>>> > data management etc.), what do you think? Is my understanding correct?
>>> >
>>> > Yes, for current day most of the data will be for that day – so do you
>>> > think it’s going to be a problem while writing (wont the BLOOM index
>>> help)?
>>> > And that’s what I am trying to understand to land in a better
>>> performant
>>> > solution.
>>> >
>>> > Meanwhile I would like to see my record Key construct as well, to see
>>> how
>>> > it can help on write performance and downstream requirement to support
>>> > GDPR.  To avoid any reprocessing/migration down the line.
>>> >
>>> > Regards,
>>> > Felix K Jose
>>> >
>>> > From: Raymond Xu <xu...@gmail.com>
>>> > Date: Tuesday, November 17, 2020 at 6:18 PM
>>> > To: dev@hudi.apache.org <de...@hudi.apache.org>
>>> > Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
>>> > n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
>>> > <v....@ymail.com.invalid>
>>> > Subject: Re: Hudi Record Key Best Practices
>>> > Hi Felix, looks like the use case will benefit from virtual key
>>> feature in
>>> > this RFC
>>> >
>>> >
>>> >
>>> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>>> >
>>> > Once this is implemented, you don't have to create a separate key.
>>> >
>>> > A rough thought: you mentioned 95% writes go to the same partition.
>>> Rather
>>> > than the record key, maybe consider improving on the partition field?
>>> to
>>> > have more even writes across partitions for eg?
>>> >
>>> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
>>> > <fe...@philips.com.invalid> wrote:
>>> >
>>> > > Hello All,
>>> > >
>>> > > I have asked generic questions regarding record key in slack
>>> channel, but
>>> > > I just want to consolidate everything regarding Record Key and the
>>> > > suggested best practices of Record Key construction to get better
>>> write
>>> > > performance.
>>> > >
>>> > > Table Type: COW
>>> > > Partition Path: Date
>>> > >
>>> > > My record uniqueness is derived from a combination of 4 fields:
>>> > >
>>> > >   1.  F1: Datetime (record’s origination datetime)
>>> > >   2.  F2: String       (11 char  long serial number)
>>> > >   3.  F3: UUID        (User Identifier)
>>> > >   4.  F4: String.       (12 CHAR statistic name)
>>> > >
>>> > > Note: My record is a nested document and some of the above fields are
>>> > > nested fields
>>> > >
>>> > > My Write Use Cases:
>>> > > 1. Writes to partitioned HUDI table every 15 minutes
>>> > >
>>> > >   1.  where 95% inserts and 5% updates,
>>> > >   2.  Also 95% write goes to same partition (current date) 5% write
>>> can
>>> > > span across multiple partitions
>>> > > 2. GDPR request to delete records from the table using User
>>> Identifier
>>> > > field (F3)
>>> > >
>>> > >
>>> > > Record Key Construction:
>>> > > Approach 1:
>>> > > Generate a UUID  from the concatenated String of all these 4 fields
>>> [eg:
>>> > > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use
>>> that
>>> > > newly generated field as Record Key
>>> > >
>>> > > Approach 2:
>>> > > Generate a UUID  from the concatenated String of 3 fields except
>>> datetime
>>> > > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
>>> > > datetime field to the generated UUID and use that newly generated
>>> field
>>> > as
>>> > > Record Key •F1_<uuid>
>>> > >
>>> > > Approach 3:
>>> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>>> > >
>>> > > Which is the approach you will suggest? Could you please help me?
>>> > >
>>> > > Regards,
>>> > > Felix K Jose
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>>
>
> --
> Regards,
> -Sivabalan
>

Re: Hudi Record Key Best Practices

Posted by Sivabalan <n....@gmail.com>.
Here are the discussion points we had in Slack.

The suggestion is to go with approach 2, based on these points.
- Prefixing F1 (including the timestamp) will help prune some file slices
even within a day (within a partition), if records are properly ordered
based on timestamp.
- Deletes are occasional compared to upserts, so optimizing for upserts
makes sense and hence approach 2 is fine. Also, deleting records is a
two-part execution anyway: first a query to Hudi like "select HoodieKey from
hudi_tbl where user_id = 'X'", and then a DELETE operation to Hudi for those
HoodieKeys (see the sketch after these points). For the first query, I assume
embedding user_id in record keys does not matter, because this query filters
on a specific column in the dataset.
So, initially I thought there was not much value in embedding user id in the
record key. But as Vinoth suggested, clustering could come in handy, so
let's have userId as part of the record keys too.
- In approach 3, the record keys could be too large, so we may not want to
go this route.
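
For illustration, a rough PySpark sketch of that two-step delete (the table path and column names are assumptions; the "delete" write operation is the standard Hudi datasource operation for hard deletes):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi Spark bundle is available
base_path = "/tmp/lake/events_cow"          # illustrative table location

# Step 1: look up the keys of all records belonging to the user leaving.
# (On older Hudi versions a path glob like base_path + "/*/*" may be needed.)
to_delete = (spark.read.format("hudi").load(base_path)
             .where("user_id = 'X'")
             .select("record_key", "partition_date", "event_ts"))

# Step 2: issue a DELETE for exactly those keys.
(to_delete.write.format("hudi")
    .option("hoodie.table.name", "events_cow")
    .option("hoodie.datasource.write.operation", "delete")
    .option("hoodie.datasource.write.recordkey.field", "record_key")
    .option("hoodie.datasource.write.partitionpath.field", "partition_date")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .mode("append")
    .save(base_path))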





On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Felix,
>
> I will try to be faster going forward. Apologies for the late reply.
> Thanks Raymond for all the great clarifications.
>
> On RFC-21, I think it's safe to assume it will be available by Jan or so.
> 0.8.0 (Uber folks, correct me if I am wrong)
>
> >>For approach 2 – the reason for prepending datetime is to have an
> incrementing id, otherwise your uuid is a purely random id and wont support
> range pruning, while writing, correct?
> You are right. In general, we only have the following levers to control
> performance. I take it that "origination datetime" is not monotonically
> increasing? Otherwise Approach 1 is good, right?
>
> If you want to optimize for upsert performance,
> - prepending a timestamp field would help. if you simply prepend the date,
> which is already also the partition path, then all keys in that partition
> will have the same prefix and no additional pruning opportunities exist.
> - Advise using dynamic bloom filters
> (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom
> filters filter our enough files after range pruning.
>
> For good delete performance, we can cluster records by user_id for older
> partitions, such that all records a user is packed into the smallest number
> of files. This way,  when only a small number of users leave,
> your delete won't rewrite the entire partition's files. Clustering support
> is landing by the end of year in 0.7.0. (There is a PR out already, if you
> want to test/play).
>
> All of this is also highly workload specific. So we can get into those
> details, if that helps. MOR is a much better alternative for dealing with
> deletes IMO.
> It was specifically designed, used for those, since it can absorb the
> deletes into log files and apply them later amortizing costs.
>
> Future is good, since we are investing in record level indexes that could
> also natively index secondary fields like user_id. Again expect that to be
> there in 0.9.0 or something, around Mar.
> For now, we have to play with how we lay out the data to squeeze
> performance.
>
> Hope that helps.
>
> thanks
> vinoth
>
>
>
>
>
> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <
> felix.jose@philips.com> wrote:
>
>> Hi Raymond,
>>
>> Thanks a lot for the reply.
>>
>> For approach 2 – the reason for prepending datetime is to have a
>> incrementing id, otherwise your uuid is a purely random id and wont support
>> range pruning, while writing, correct? In a given date partition I am
>> expected to get 10s of billions records, and by having an incrementing id
>> helps BLOOM filtering? This is the only intend of having the prefix of
>> datetime (int64 representation)
>>
>> Yes, I also see Approach 3 really too big and causing lot in storage
>> footprint.
>>
>> My initial approach was Approach 1 (generated uuid from all the 4
>> fields), then heard that the range pruning can make write faster – so
>> thought of datetime as prefix. Do you see any benefit or the UUID can
>> itself be sufficient -since it’s been generated from the 4 input fields?
>>
>>
>>
>> Regards,
>>
>> Felix K Jose
>>
>> *From: *Raymond Xu <xu...@gmail.com>
>> *Date: *Tuesday, November 24, 2020 at 2:20 AM
>> *To: *Kizhakkel Jose, Felix <fe...@philips.com>
>> *Cc: *dev@hudi.apache.org <de...@hudi.apache.org>, vinoth@apache.org <
>> vinoth@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
>> *Subject: *Re: Hudi Record Key Best Practices
>>
>> Hi Felix,
>>
>> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your
>> dataset.
>>
>> For 2, not very sure about the intention of prepending the datetime,
>> looks like duplicate info knowing that you already partitioned it by that
>> field.
>>
>> For 3, it seems too long for a primary id.
>>
>> Hope this helps.
>>
>>
>>
>> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
>> felix.jose@philips.com> wrote:
>>
>> @Vinoth Chandar <vi...@apache.org>,
>>
>> Could you please take a look at and let me know what is the best approach
>> or could you see whom can help me on this?
>>
>>
>>
>> Regards,
>>
>> Felix K Jose
>>
>> *From: *Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
>> *Date: *Thursday, November 19, 2020 at 12:04 PM
>> *To: *dev@hudi.apache.org <de...@hudi.apache.org>, Vinoth Chandar <
>> vinoth@apache.org>, xu.shiyan.raymond@gmail.com <
>> xu.shiyan.raymond@gmail.com>
>> *Cc: *vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
>> n.siva.b@gmail.com>
>> *Subject: *Re: Hudi Record Key Best Practices
>>
>> Sure. I will see about partition key.
>>
>> Since RFC 21 is not yet implemented and available to consume, can anyone
>> please suggest what is the best approach I should be following to construct
>> the record key I asked in the  original question:
>>
>> “
>> My Write Use Cases:
>> 1. Writes to partitioned HUDI table every 15 minutes
>>
>>   1.  where 95% inserts and 5% updates,
>>   2.  Also 95% write goes to same partition (current date) 5% write can
>> span across multiple partitions
>> 2. GDPR request to delete records from the table using User Identifier
>> field (F3)
>>
>>
>> Record Key Construction:
>> Approach 1:
>> Generate a UUID  from the concatenated String of all these 4 fields [eg:
>> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
>> newly generated field as Record Key
>>
>> Approach 2:
>> Generate a UUID  from the concatenated String of 3 fields except datetime
>> field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
>> datetime field to the generated UUID and use that newly generated field as
>> Record Key •F1_<uuid>
>>
>> Approach 3:
>> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>> “
>>
>> Regards,
>> Felix K Jose
>> From: Raymond Xu <xu...@gmail.com>
>> Date: Wednesday, November 18, 2020 at 5:30 PM
>> To: dev@hudi.apache.org <de...@hudi.apache.org>
>> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
>> n.siva.b@gmail.com>
>> Subject: Re: Hudi Record Key Best Practices
>> Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
>> just saying maybe making the writes more evenly spreaded could be
>> better. Effectively, with 95% writes, it's like writing to a single
>> partition dataset. Hourly partition could mitigate the situation, since
>> you
>> also have date-range queries. Just some rough ideas, the strategy really
>> depends on your data pattern and requirements.
>>
>> For the development timeline on RFC 21, probably Vinoth or Balaji
>> could give more info.
>>
>> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
>> <fe...@philips.com.invalid> wrote:
>>
>> > Hi Raymond,
>> > Thank you for the response.
>> >
>> > Yes, the virtual key definitely going to help reducing the storage
>> > footprint. When do you think it is going to be available and will it be
>> > compatible with all downstream processing engines (Presto, Redshift
>> > Spectrum etc.)? We have started our development activities and
>> expecting to
>> > get into PROD by March-April timeframe.
>> >
>> > Regarding the partition key,  we get data every day from 10-20 million
>> > users and currently the data we are planning to partition is by Date
>> > (YYYY-MM-DD) and thereby we will have consistent partitions for
>> downstream
>> > systems(every partition has same amount of data [20 million user data in
>> > each partition, rather than skewed partitions]). And most of our queries
>> > are date range queries for a given user-Id
>> >
>> > If I partition by user-Id, then I will have millions of partitions, and
>> I
>> > have read that having large number of partition has major read impact
>> (meta
>> > data management etc.), what do you think? Is my understanding correct?
>> >
>> > Yes, for current day most of the data will be for that day – so do you
>> > think it’s going to be a problem while writing (wont the BLOOM index
>> help)?
>> > And that’s what I am trying to understand to land in a better performant
>> > solution.
>> >
>> > Meanwhile I would like to see my record Key construct as well, to see
>> how
>> > it can help on write performance and downstream requirement to support
>> > GDPR.  To avoid any reprocessing/migration down the line.
>> >
>> > Regards,
>> > Felix K Jose
>> >
>> > From: Raymond Xu <xu...@gmail.com>
>> > Date: Tuesday, November 17, 2020 at 6:18 PM
>> > To: dev@hudi.apache.org <de...@hudi.apache.org>
>> > Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
>> > n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
>> > <v....@ymail.com.invalid>
>> > Subject: Re: Hudi Record Key Best Practices
>> > Hi Felix, looks like the use case will benefit from virtual key feature
>> in
>> > this RFC
>> >
>> >
>> >
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>> >
>> > Once this is implemented, you don't have to create a separate key.
>> >
>> > A rough thought: you mentioned 95% writes go to the same partition.
>> Rather
>> > than the record key, maybe consider improving on the partition field? to
>> > have more even writes across partitions for eg?
>> >
>> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
>> > <fe...@philips.com.invalid> wrote:
>> >
>> > > Hello All,
>> > >
>> > > I have asked generic questions regarding record key in slack channel,
>> but
>> > > I just want to consolidate everything regarding Record Key and the
>> > > suggested best practices of Record Key construction to get better
>> write
>> > > performance.
>> > >
>> > > Table Type: COW
>> > > Partition Path: Date
>> > >
>> > > My record uniqueness is derived from a combination of 4 fields:
>> > >
>> > >   1.  F1: Datetime (record’s origination datetime)
>> > >   2.  F2: String       (11 char  long serial number)
>> > >   3.  F3: UUID        (User Identifier)
>> > >   4.  F4: String.       (12 CHAR statistic name)
>> > >
>> > > Note: My record is a nested document and some of the above fields are
>> > > nested fields
>> > >
>> > > My Write Use Cases:
>> > > 1. Writes to partitioned HUDI table every 15 minutes
>> > >
>> > >   1.  where 95% inserts and 5% updates,
>> > >   2.  Also 95% write goes to same partition (current date) 5% write
>> can
>> > > span across multiple partitions
>> > > 2. GDPR request to delete records from the table using User Identifier
>> > > field (F3)
>> > >
>> > >
>> > > Record Key Construction:
>> > > Approach 1:
>> > > Generate a UUID  from the concatenated String of all these 4 fields
>> [eg:
>> > > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
>> > > newly generated field as Record Key
>> > >
>> > > Approach 2:
>> > > Generate a UUID  from the concatenated String of 3 fields except
>> datetime
>> > > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
>> > > datetime field to the generated UUID and use that newly generated
>> field
>> > as
>> > > Record Key •F1_<uuid>
>> > >
>> > > Approach 3:
>> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>> > >
>> > > Which is the approach you will suggest? Could you please help me?
>> > >
>> > > Regards,
>> > > Felix K Jose
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>>
>>

-- 
Regards,
-Sivabalan

Re: Hudi Record Key Best Practices

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Felix,

I will try to be faster going forward. Apologies for the late reply. Thanks
Raymond for all the great clarifications.

On RFC-21, I think it's safe to assume it will be available by Jan or so,
in 0.8.0 (Uber folks, correct me if I am wrong).

>>For approach 2 – the reason for prepending datetime is to have an
incrementing id, otherwise your uuid is a purely random id and wont support
range pruning, while writing, correct?
You are right. In general, we only have the following levers to control
performance. I take it that "origination datetime" is not monotonically
increasing? Otherwise Approach 1 is good, right?

If you want to optimize for upsert performance:
- Prepending a timestamp field would help. If you simply prepend the date,
which is already also the partition path, then all keys in that partition
will have the same prefix and no additional pruning opportunities exist.
- I advise using dynamic bloom filters
(config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom
filters filter out enough files after range pruning (see the config sketch below).
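
To make that concrete, here is a minimal spark-shell style sketch of the write
options being discussed. The table name, path, the DataFrame "df" and the column
names record_key / date / event_ts are assumptions for illustration; only the
option keys and the DYNAMIC_V0 value come from the advice above.

  import org.apache.spark.sql.SaveMode

  // df holds the incoming 15-minute batch, with a pre-built record_key column
  // (e.g. "<epoch-millis>_<uuid>"), a "date" partition column and an "event_ts"
  // precombine column -- all hypothetical names.
  df.write.format("hudi").
    option("hoodie.table.name", "stats_cow").
    option("hoodie.datasource.write.operation", "upsert").
    option("hoodie.datasource.write.recordkey.field", "record_key").
    option("hoodie.datasource.write.partitionpath.field", "date").
    option("hoodie.datasource.write.precombine.field", "event_ts").
    option("hoodie.index.type", "BLOOM").
    option("hoodie.bloom.index.filter.type", "DYNAMIC_V0").   // dynamic bloom filters
    mode(SaveMode.Append).
    save("/tmp/hudi/stats_cow")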

For good delete performance, we can cluster records by user_id for older
partitions, such that all the records of a user are packed into the smallest
number of files. This way, when only a small number of users leave,
your delete won't rewrite the entire partition's files. Clustering support
is landing by the end of the year in 0.7.0. (There is a PR out already, if you
want to test/play.)
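
A hedged sketch of what that could look like once clustering lands. The option
names are as they appear in the clustering PR/RFC at the time of writing; treat
them as illustrative and verify against the released 0.7.0 configs. The "userId"
column name is an assumption.

  // Extra options layered on the same upsert as in the earlier sketch; sorting on
  // the user id column packs each user's records into few files per partition.
  val clusteringOpts = Map(
    "hoodie.clustering.inline" -> "true",
    "hoodie.clustering.inline.max.commits" -> "4",
    "hoodie.clustering.plan.strategy.sort.columns" -> "userId"
  )
  df.write.format("hudi").options(clusteringOpts).
    option("hoodie.table.name", "stats_cow").
    option("hoodie.datasource.write.operation", "upsert").
    option("hoodie.datasource.write.recordkey.field", "record_key").
    option("hoodie.datasource.write.partitionpath.field", "date").
    option("hoodie.datasource.write.precombine.field", "event_ts").
    mode(SaveMode.Append).
    save("/tmp/hudi/stats_cow")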

All of this is also highly workload specific, so we can get into those
details if that helps. MOR is a much better alternative for dealing with
deletes IMO.
It was specifically designed and used for those, since it can absorb the
deletes into log files and apply them later, amortizing the cost.
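
For reference, a rough sketch of how the GDPR deletes (use case 2) can be issued
through the datasource delete operation, whether the table is COW or MOR. The
path and column names are the same illustrative assumptions as above, and
gdprUserId is hypothetical.

  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.functions.col

  val basePath = "/tmp/hudi/stats_cow"
  val gdprUserId = "3f2504e0-4f89-11d3-9a0c-0305e82c3301"

  // Collect the keys belonging to the user (older releases may need path globbing,
  // e.g. load(basePath + "/*/*"), to read a partitioned table).
  val toDelete = spark.read.format("hudi").load(basePath).
    filter(col("userId") === gdprUserId).
    select("record_key", "date", "event_ts")

  toDelete.write.format("hudi").
    option("hoodie.table.name", "stats_cow").
    option("hoodie.datasource.write.operation", "delete").
    option("hoodie.datasource.write.recordkey.field", "record_key").
    option("hoodie.datasource.write.partitionpath.field", "date").
    option("hoodie.datasource.write.precombine.field", "event_ts").
    mode(SaveMode.Append).
    save(basePath)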

The future looks good, since we are investing in record-level indexes that could
also natively index secondary fields like user_id. Again, expect that to be
there in 0.9.0 or so, around March.
For now, we have to play with how we lay out the data to squeeze out
performance.

Hope that helps.

thanks
vinoth





On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <
felix.jose@philips.com> wrote:

> Hi Raymond,
>
> Thanks a lot for the reply.
>
> For approach 2 – the reason for prepending datetime is to have a
> incrementing id, otherwise your uuid is a purely random id and wont support
> range pruning, while writing, correct? In a given date partition I am
> expected to get 10s of billions records, and by having an incrementing id
> helps BLOOM filtering? This is the only intend of having the prefix of
> datetime (int64 representation)
>
> Yes, I also see Approach 3 really too big and causing lot in storage
> footprint.
>
> My initial approach was Approach 1 (generated uuid from all the 4 fields),
> then heard that the range pruning can make write faster – so thought of
> datetime as prefix. Do you see any benefit or the UUID can itself be
> sufficient -since it’s been generated from the 4 input fields?
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Raymond Xu <xu...@gmail.com>
> *Date: *Tuesday, November 24, 2020 at 2:20 AM
> *To: *Kizhakkel Jose, Felix <fe...@philips.com>
> *Cc: *dev@hudi.apache.org <de...@hudi.apache.org>, vinoth@apache.org <
> vinoth@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Hi Felix,
>
> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your
> dataset.
>
> For 2, not very sure about the intention of prepending the datetime, looks
> like duplicate info knowing that you already partitioned it by that field.
>
> For 3, it seems too long for a primary id.
>
> Hope this helps.
>
>
>
> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
> felix.jose@philips.com> wrote:
>
> @Vinoth Chandar <vi...@apache.org>,
>
> Could you please take a look at and let me know what is the best approach
> or could you see whom can help me on this?
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
> *Date: *Thursday, November 19, 2020 at 12:04 PM
> *To: *dev@hudi.apache.org <de...@hudi.apache.org>, Vinoth Chandar <
> vinoth@apache.org>, xu.shiyan.raymond@gmail.com <
> xu.shiyan.raymond@gmail.com>
> *Cc: *vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Sure. I will see about partition key.
>
> Since RFC 21 is not yet implemented and available to consume, can anyone
> please suggest what is the best approach I should be following to construct
> the record key I asked in the  original question:
>
> “
> My Write Use Cases:
> 1. Writes to partitioned HUDI table every 15 minutes
>
>   1.  where 95% inserts and 5% updates,
>   2.  Also 95% write goes to same partition (current date) 5% write can
> span across multiple partitions
> 2. GDPR request to delete records from the table using User Identifier
> field (F3)
>
>
> Record Key Construction:
> Approach 1:
> Generate a UUID  from the concatenated String of all these 4 fields [eg:
> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> newly generated field as Record Key
>
> Approach 2:
> Generate a UUID  from the concatenated String of 3 fields except datetime
> field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> datetime field to the generated UUID and use that newly generated field as
> Record Key •F1_<uuid>
>
> Approach 3:
> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> “
>
> Regards,
> Felix K Jose
> From: Raymond Xu <xu...@gmail.com>
> Date: Wednesday, November 18, 2020 at 5:30 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
> just saying maybe making the writes more evenly spreaded could be
> better. Effectively, with 95% writes, it's like writing to a single
> partition dataset. Hourly partition could mitigate the situation, since you
> also have date-range queries. Just some rough ideas, the strategy really
> depends on your data pattern and requirements.
>
> For the development timeline on RFC 21, probably Vinoth or Balaji
> could give more info.
>
> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hi Raymond,
> > Thank you for the response.
> >
> > Yes, the virtual key definitely going to help reducing the storage
> > footprint. When do you think it is going to be available and will it be
> > compatible with all downstream processing engines (Presto, Redshift
> > Spectrum etc.)? We have started our development activities and expecting
> to
> > get into PROD by March-April timeframe.
> >
> > Regarding the partition key,  we get data every day from 10-20 million
> > users and currently the data we are planning to partition is by Date
> > (YYYY-MM-DD) and thereby we will have consistent partitions for
> downstream
> > systems(every partition has same amount of data [20 million user data in
> > each partition, rather than skewed partitions]). And most of our queries
> > are date range queries for a given user-Id
> >
> > If I partition by user-Id, then I will have millions of partitions, and I
> > have read that having large number of partition has major read impact
> (meta
> > data management etc.), what do you think? Is my understanding correct?
> >
> > Yes, for current day most of the data will be for that day – so do you
> > think it’s going to be a problem while writing (wont the BLOOM index
> help)?
> > And that’s what I am trying to understand to land in a better performant
> > solution.
> >
> > Meanwhile I would like to see my record Key construct as well, to see how
> > it can help on write performance and downstream requirement to support
> > GDPR.  To avoid any reprocessing/migration down the line.
> >
> > Regards,
> > Felix K Jose
> >
> > From: Raymond Xu <xu...@gmail.com>
> > Date: Tuesday, November 17, 2020 at 6:18 PM
> > To: dev@hudi.apache.org <de...@hudi.apache.org>
> > Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> > n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
> > <v....@ymail.com.invalid>
> > Subject: Re: Hudi Record Key Best Practices
> > Hi Felix, looks like the use case will benefit from virtual key feature
> in
> > this RFC
> >
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
> >
> > Once this is implemented, you don't have to create a separate key.
> >
> > A rough thought: you mentioned 95% writes go to the same partition.
> Rather
> > than the record key, maybe consider improving on the partition field? to
> > have more even writes across partitions for eg?
> >
> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> > <fe...@philips.com.invalid> wrote:
> >
> > > Hello All,
> > >
> > > I have asked generic questions regarding record key in slack channel,
> but
> > > I just want to consolidate everything regarding Record Key and the
> > > suggested best practices of Record Key construction to get better write
> > > performance.
> > >
> > > Table Type: COW
> > > Partition Path: Date
> > >
> > > My record uniqueness is derived from a combination of 4 fields:
> > >
> > >   1.  F1: Datetime (record’s origination datetime)
> > >   2.  F2: String       (11 char  long serial number)
> > >   3.  F3: UUID        (User Identifier)
> > >   4.  F4: String.       (12 CHAR statistic name)
> > >
> > > Note: My record is a nested document and some of the above fields are
> > > nested fields
> > >
> > > My Write Use Cases:
> > > 1. Writes to partitioned HUDI table every 15 minutes
> > >
> > >   1.  where 95% inserts and 5% updates,
> > >   2.  Also 95% write goes to same partition (current date) 5% write can
> > > span across multiple partitions
> > > 2. GDPR request to delete records from the table using User Identifier
> > > field (F3)
> > >
> > >
> > > Record Key Construction:
> > > Approach 1:
> > > Generate a UUID  from the concatenated String of all these 4 fields
> [eg:
> > > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> > > newly generated field as Record Key
> > >
> > > Approach 2:
> > > Generate a UUID  from the concatenated String of 3 fields except
> datetime
> > > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> > > datetime field to the generated UUID and use that newly generated field
> > as
> > > Record Key •F1_<uuid>
> > >
> > > Approach 3:
> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> > >
> > > Which is the approach you will suggest? Could you please help me?
> > >
> > > Regards,
> > > Felix K Jose
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
>
>

Re: Hudi Record Key Best Practices

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Raymond,

Thanks a lot for the reply.

For approach 2 – the reason for prepending the datetime is to have an incrementing id; otherwise the uuid is a purely random id and won't support range pruning while writing, correct? In a given date partition I expect to get tens of billions of records, and having an incrementing id helps BLOOM filtering? That is the only intent of prefixing the datetime (int64 representation).

Yes, I also see that Approach 3 is really too big and costs a lot in storage footprint.

My initial approach was Approach 1 (a uuid generated from all 4 fields); then I heard that range pruning can make writes faster – so I thought of the datetime as a prefix. Do you see any benefit, or can the UUID itself be sufficient, since it's generated from the 4 input fields?
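
For concreteness, a small Scala sketch of the two candidate keys. Field names and
values are made up, and a name-based UUID is just one way to derive a
deterministic id from the concatenation.

  import java.nio.charset.StandardCharsets
  import java.util.UUID

  // Deterministic (name-based) UUID over the concatenated fields.
  def uuidOf(parts: String*): String =
    UUID.nameUUIDFromBytes(parts.mkString("_").getBytes(StandardCharsets.UTF_8)).toString

  val f1Millis = 1606200000000L                            // F1: origination datetime (epoch millis)
  val f2Serial = "SN123456789"                             // F2: 11-char serial number
  val f3UserId = "3f2504e0-4f89-11d3-9a0c-0305e82c3301"    // F3: user identifier (UUID)
  val f4Stat   = "HEART_RATE01"                            // F4: 12-char statistic name

  val approach1Key = uuidOf(f1Millis.toString, f2Serial, f3UserId, f4Stat)
  val approach2Key = s"${f1Millis}_" + uuidOf(f2Serial, f3UserId, f4Stat)   // int64 prefix + uuid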

Regards,
Felix K Jose
From: Raymond Xu <xu...@gmail.com>
Date: Tuesday, November 24, 2020 at 2:20 AM
To: Kizhakkel Jose, Felix <fe...@philips.com>
Cc: dev@hudi.apache.org <de...@hudi.apache.org>, vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
Subject: Re: Hudi Record Key Best Practices
Hi Felix,
I'd prefer approach 1. The logic is simple: to ensure uniqueness in your dataset.
For 2, not very sure about the intention of prepending the datetime, looks like duplicate info knowing that you already partitioned it by that field.
For 3, it seems too long for a primary id.
Hope this helps.

On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <fe...@philips.com>> wrote:
@Vinoth Chandar<ma...@apache.org>,

Could you please take a look at and let me know what is the best approach or could you see whom can help me on this?

Regards,
Felix K Jose
From: Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
Date: Thursday, November 19, 2020 at 12:04 PM
To: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>, Vinoth Chandar <vi...@apache.org>>, xu.shiyan.raymond@gmail.com<ma...@gmail.com> <xu...@gmail.com>>
Cc: vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <n....@gmail.com>>
Subject: Re: Hudi Record Key Best Practices
Sure. I will see about partition key.

Since RFC 21 is not yet implemented and available to consume, can anyone please suggest what is the best approach I should be following to construct the record key I asked in the  original question:

“
My Write Use Cases:
1. Writes to partitioned HUDI table every 15 minutes

  1.  where 95% inserts and 5% updates,
  2.  Also 95% write goes to same partition (current date) 5% write can span across multiple partitions
2. GDPR request to delete records from the table using User Identifier field (F3)


Record Key Construction:
Approach 1:
Generate a UUID  from the concatenated String of all these 4 fields [eg: str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that newly generated field as Record Key

Approach 2:
Generate a UUID  from the concatenated String of 3 fields except datetime field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend datetime field to the generated UUID and use that newly generated field as Record Key •F1_<uuid>

Approach 3:
Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
“

Regards,
Felix K Jose
From: Raymond Xu <xu...@gmail.com>>
Date: Wednesday, November 18, 2020 at 5:30 PM
To: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>
Cc: vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <n....@gmail.com>>
Subject: Re: Hudi Record Key Best Practices
Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
just saying maybe making the writes more evenly spreaded could be
better. Effectively, with 95% writes, it's like writing to a single
partition dataset. Hourly partition could mitigate the situation, since you
also have date-range queries. Just some rough ideas, the strategy really
depends on your data pattern and requirements.

For the development timeline on RFC 21, probably Vinoth or Balaji
could give more info.

On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Raymond,
> Thank you for the response.
>
> Yes, the virtual key definitely going to help reducing the storage
> footprint. When do you think it is going to be available and will it be
> compatible with all downstream processing engines (Presto, Redshift
> Spectrum etc.)? We have started our development activities and expecting to
> get into PROD by March-April timeframe.
>
> Regarding the partition key,  we get data every day from 10-20 million
> users and currently the data we are planning to partition is by Date
> (YYYY-MM-DD) and thereby we will have consistent partitions for downstream
> systems(every partition has same amount of data [20 million user data in
> each partition, rather than skewed partitions]). And most of our queries
> are date range queries for a given user-Id
>
> If I partition by user-Id, then I will have millions of partitions, and I
> have read that having large number of partition has major read impact (meta
> data management etc.), what do you think? Is my understanding correct?
>
> Yes, for current day most of the data will be for that day – so do you
> think it’s going to be a problem while writing (wont the BLOOM index help)?
> And that’s what I am trying to understand to land in a better performant
> solution.
>
> Meanwhile I would like to see my record Key construct as well, to see how
> it can help on write performance and downstream requirement to support
> GDPR.  To avoid any reprocessing/migration down the line.
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu...@gmail.com>>
> Date: Tuesday, November 17, 2020 at 6:18 PM
> To: dev@hudi.apache.org<ma...@hudi.apache.org> <de...@hudi.apache.org>>
> Cc: vinoth@apache.org<ma...@apache.org> <vi...@apache.org>>, n.siva.b@gmail.com<ma...@gmail.com> <
> n.siva.b@gmail.com<ma...@gmail.com>>, v.balaji@ymail.com.invalid
> <v....@ymail.com.invalid>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, looks like the use case will benefit from virtual key feature in
> this RFC
>
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>
> Once this is implemented, you don't have to create a separate key.
>
> A rough thought: you mentioned 95% writes go to the same partition. Rather
> than the record key, maybe consider improving on the partition field? to
> have more even writes across partitions for eg?
>
> On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hello All,
> >
> > I have asked generic questions regarding record key in slack channel, but
> > I just want to consolidate everything regarding Record Key and the
> > suggested best practices of Record Key construction to get better write
> > performance.
> >
> > Table Type: COW
> > Partition Path: Date
> >
> > My record uniqueness is derived from a combination of 4 fields:
> >
> >   1.  F1: Datetime (record’s origination datetime)
> >   2.  F2: String       (11 char  long serial number)
> >   3.  F3: UUID        (User Identifier)
> >   4.  F4: String.       (12 CHAR statistic name)
> >
> > Note: My record is a nested document and some of the above fields are
> > nested fields
> >
> > My Write Use Cases:
> > 1. Writes to partitioned HUDI table every 15 minutes
> >
> >   1.  where 95% inserts and 5% updates,
> >   2.  Also 95% write goes to same partition (current date) 5% write can
> > span across multiple partitions
> > 2. GDPR request to delete records from the table using User Identifier
> > field (F3)
> >
> >
> > Record Key Construction:
> > Approach 1:
> > Generate a UUID  from the concatenated String of all these 4 fields [eg:
> > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> > newly generated field as Record Key
> >
> > Approach 2:
> > Generate a UUID  from the concatenated String of 3 fields except datetime
> > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> > datetime field to the generated UUID and use that newly generated field
> as
> > Record Key •F1_<uuid>
> >
> > Approach 3:
> > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> >
> > Which is the approach you will suggest? Could you please help me?
> >
> > Regards,
> > Felix K Jose
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >

Re: Hudi Record Key Best Practices

Posted by Raymond Xu <xu...@gmail.com>.
Hi Felix,
I'd prefer approach 1. The logic is simple: to ensure uniqueness in your
dataset.
For 2, I'm not very sure about the intention of prepending the datetime; it looks
like duplicate info, given that you already partition by that field.
For 3, it seems too long for a primary id.
Hope this helps.

On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
felix.jose@philips.com> wrote:

> @Vinoth Chandar <vi...@apache.org>,
>
> Could you please take a look at and let me know what is the best approach
> or could you see whom can help me on this?
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
> *Date: *Thursday, November 19, 2020 at 12:04 PM
> *To: *dev@hudi.apache.org <de...@hudi.apache.org>, Vinoth Chandar <
> vinoth@apache.org>, xu.shiyan.raymond@gmail.com <
> xu.shiyan.raymond@gmail.com>
> *Cc: *vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Sure. I will see about partition key.
>
> Since RFC 21 is not yet implemented and available to consume, can anyone
> please suggest what is the best approach I should be following to construct
> the record key I asked in the  original question:
>
> “
> My Write Use Cases:
> 1. Writes to partitioned HUDI table every 15 minutes
>
>   1.  where 95% inserts and 5% updates,
>   2.  Also 95% write goes to same partition (current date) 5% write can
> span across multiple partitions
> 2. GDPR request to delete records from the table using User Identifier
> field (F3)
>
>
> Record Key Construction:
> Approach 1:
> Generate a UUID  from the concatenated String of all these 4 fields [eg:
> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> newly generated field as Record Key
>
> Approach 2:
> Generate a UUID  from the concatenated String of 3 fields except datetime
> field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> datetime field to the generated UUID and use that newly generated field as
> Record Key •F1_<uuid>
>
> Approach 3:
> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> “
>
> Regards,
> Felix K Jose
> From: Raymond Xu <xu...@gmail.com>
> Date: Wednesday, November 18, 2020 at 5:30 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
> just saying maybe making the writes more evenly spreaded could be
> better. Effectively, with 95% writes, it's like writing to a single
> partition dataset. Hourly partition could mitigate the situation, since you
> also have date-range queries. Just some rough ideas, the strategy really
> depends on your data pattern and requirements.
>
> For the development timeline on RFC 21, probably Vinoth or Balaji
> could give more info.
>
> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hi Raymond,
> > Thank you for the response.
> >
> > Yes, the virtual key definitely going to help reducing the storage
> > footprint. When do you think it is going to be available and will it be
> > compatible with all downstream processing engines (Presto, Redshift
> > Spectrum etc.)? We have started our development activities and expecting
> to
> > get into PROD by March-April timeframe.
> >
> > Regarding the partition key,  we get data every day from 10-20 million
> > users and currently the data we are planning to partition is by Date
> > (YYYY-MM-DD) and thereby we will have consistent partitions for
> downstream
> > systems(every partition has same amount of data [20 million user data in
> > each partition, rather than skewed partitions]). And most of our queries
> > are date range queries for a given user-Id
> >
> > If I partition by user-Id, then I will have millions of partitions, and I
> > have read that having large number of partition has major read impact
> (meta
> > data management etc.), what do you think? Is my understanding correct?
> >
> > Yes, for current day most of the data will be for that day – so do you
> > think it’s going to be a problem while writing (wont the BLOOM index
> help)?
> > And that’s what I am trying to understand to land in a better performant
> > solution.
> >
> > Meanwhile I would like to see my record Key construct as well, to see how
> > it can help on write performance and downstream requirement to support
> > GDPR.  To avoid any reprocessing/migration down the line.
> >
> > Regards,
> > Felix K Jose
> >
> > From: Raymond Xu <xu...@gmail.com>
> > Date: Tuesday, November 17, 2020 at 6:18 PM
> > To: dev@hudi.apache.org <de...@hudi.apache.org>
> > Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> > n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
> > <v....@ymail.com.invalid>
> > Subject: Re: Hudi Record Key Best Practices
> > Hi Felix, looks like the use case will benefit from virtual key feature
> in
> > this RFC
> >
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
> >
> > Once this is implemented, you don't have to create a separate key.
> >
> > A rough thought: you mentioned 95% writes go to the same partition.
> Rather
> > than the record key, maybe consider improving on the partition field? to
> > have more even writes across partitions for eg?
> >
> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> > <fe...@philips.com.invalid> wrote:
> >
> > > Hello All,
> > >
> > > I have asked generic questions regarding record key in slack channel,
> but
> > > I just want to consolidate everything regarding Record Key and the
> > > suggested best practices of Record Key construction to get better write
> > > performance.
> > >
> > > Table Type: COW
> > > Partition Path: Date
> > >
> > > My record uniqueness is derived from a combination of 4 fields:
> > >
> > >   1.  F1: Datetime (record’s origination datetime)
> > >   2.  F2: String       (11 char  long serial number)
> > >   3.  F3: UUID        (User Identifier)
> > >   4.  F4: String.       (12 CHAR statistic name)
> > >
> > > Note: My record is a nested document and some of the above fields are
> > > nested fields
> > >
> > > My Write Use Cases:
> > > 1. Writes to partitioned HUDI table every 15 minutes
> > >
> > >   1.  where 95% inserts and 5% updates,
> > >   2.  Also 95% write goes to same partition (current date) 5% write can
> > > span across multiple partitions
> > > 2. GDPR request to delete records from the table using User Identifier
> > > field (F3)
> > >
> > >
> > > Record Key Construction:
> > > Approach 1:
> > > Generate a UUID  from the concatenated String of all these 4 fields
> [eg:
> > > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> > > newly generated field as Record Key
> > >
> > > Approach 2:
> > > Generate a UUID  from the concatenated String of 3 fields except
> datetime
> > > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> > > datetime field to the generated UUID and use that newly generated field
> > as
> > > Record Key •F1_<uuid>
> > >
> > > Approach 3:
> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> > >
> > > Which is the approach you will suggest? Could you please help me?
> > >
> > > Regards,
> > > Felix K Jose
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
>

Re: Hudi Record Key Best Practices

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
@Vinoth Chandar<ma...@apache.org>,

Could you please take a look and let me know what the best approach is, or who can help me with this?

Regards,
Felix K Jose
From: Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
Date: Thursday, November 19, 2020 at 12:04 PM
To: dev@hudi.apache.org <de...@hudi.apache.org>, Vinoth Chandar <vi...@apache.org>, xu.shiyan.raymond@gmail.com <xu...@gmail.com>
Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
Subject: Re: Hudi Record Key Best Practices
Sure. I will see about partition key.

Since RFC 21 is not yet implemented and available to consume, can anyone please suggest what is the best approach I should be following to construct the record key I asked in the  original question:

“
My Write Use Cases:
1. Writes to partitioned HUDI table every 15 minutes

  1.  where 95% inserts and 5% updates,
  2.  Also 95% write goes to same partition (current date) 5% write can span across multiple partitions
2. GDPR request to delete records from the table using User Identifier field (F3)


Record Key Construction:
Approach 1:
Generate a UUID  from the concatenated String of all these 4 fields [eg: str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that newly generated field as Record Key

Approach 2:
Generate a UUID  from the concatenated String of 3 fields except datetime field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend datetime field to the generated UUID and use that newly generated field as Record Key •F1_<uuid>

Approach 3:
Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
“

Regards,
Felix K Jose
From: Raymond Xu <xu...@gmail.com>
Date: Wednesday, November 18, 2020 at 5:30 PM
To: dev@hudi.apache.org <de...@hudi.apache.org>
Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
Subject: Re: Hudi Record Key Best Practices
Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
just saying maybe making the writes more evenly spreaded could be
better. Effectively, with 95% writes, it's like writing to a single
partition dataset. Hourly partition could mitigate the situation, since you
also have date-range queries. Just some rough ideas, the strategy really
depends on your data pattern and requirements.

For the development timeline on RFC 21, probably Vinoth or Balaji
could give more info.

On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Raymond,
> Thank you for the response.
>
> Yes, the virtual key definitely going to help reducing the storage
> footprint. When do you think it is going to be available and will it be
> compatible with all downstream processing engines (Presto, Redshift
> Spectrum etc.)? We have started our development activities and expecting to
> get into PROD by March-April timeframe.
>
> Regarding the partition key,  we get data every day from 10-20 million
> users and currently the data we are planning to partition is by Date
> (YYYY-MM-DD) and thereby we will have consistent partitions for downstream
> systems(every partition has same amount of data [20 million user data in
> each partition, rather than skewed partitions]). And most of our queries
> are date range queries for a given user-Id
>
> If I partition by user-Id, then I will have millions of partitions, and I
> have read that having large number of partition has major read impact (meta
> data management etc.), what do you think? Is my understanding correct?
>
> Yes, for current day most of the data will be for that day – so do you
> think it’s going to be a problem while writing (wont the BLOOM index help)?
> And that’s what I am trying to understand to land in a better performant
> solution.
>
> Meanwhile I would like to see my record Key construct as well, to see how
> it can help on write performance and downstream requirement to support
> GDPR.  To avoid any reprocessing/migration down the line.
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu...@gmail.com>
> Date: Tuesday, November 17, 2020 at 6:18 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
> <v....@ymail.com.invalid>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, looks like the use case will benefit from virtual key feature in
> this RFC
>
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>
> Once this is implemented, you don't have to create a separate key.
>
> A rough thought: you mentioned 95% writes go to the same partition. Rather
> than the record key, maybe consider improving on the partition field? to
> have more even writes across partitions for eg?
>
> On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hello All,
> >
> > I have asked generic questions regarding record key in slack channel, but
> > I just want to consolidate everything regarding Record Key and the
> > suggested best practices of Record Key construction to get better write
> > performance.
> >
> > Table Type: COW
> > Partition Path: Date
> >
> > My record uniqueness is derived from a combination of 4 fields:
> >
> >   1.  F1: Datetime (record’s origination datetime)
> >   2.  F2: String       (11 char  long serial number)
> >   3.  F3: UUID        (User Identifier)
> >   4.  F4: String.       (12 CHAR statistic name)
> >
> > Note: My record is a nested document and some of the above fields are
> > nested fields
> >
> > My Write Use Cases:
> > 1. Writes to partitioned HUDI table every 15 minutes
> >
> >   1.  where 95% inserts and 5% updates,
> >   2.  Also 95% write goes to same partition (current date) 5% write can
> > span across multiple partitions
> > 2. GDPR request to delete records from the table using User Identifier
> > field (F3)
> >
> >
> > Record Key Construction:
> > Approach 1:
> > Generate a UUID  from the concatenated String of all these 4 fields [eg:
> > str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> > newly generated field as Record Key
> >
> > Approach 2:
> > Generate a UUID  from the concatenated String of 3 fields except datetime
> > field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> > datetime field to the generated UUID and use that newly generated field
> as
> > Record Key •F1_<uuid>
> >
> > Approach 3:
> > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> >
> > Which is the approach you will suggest? Could you please help me?
> >
> > Regards,
> > Felix K Jose
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >

Re: Hudi Record Key Best Practices

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Sure. I will look into the partition key.

Since RFC 21 is not yet implemented and available to consume, can anyone please suggest the best approach I should follow to construct the record key I asked about in the original question:

“
My Write Use Cases:
1. Writes to partitioned HUDI table every 15 minutes

  1.  where 95% inserts and 5% updates,
  2.  Also 95% write goes to same partition (current date) 5% write can span across multiple partitions
2. GDPR request to delete records from the table using User Identifier field (F3)


Record Key Construction:
Approach 1:
Generate a UUID  from the concatenated String of all these 4 fields [eg: str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that newly generated field as Record Key

Approach 2:
Generate a UUID  from the concatenated String of 3 fields except datetime field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend datetime field to the generated UUID and use that newly generated field as Record Key •F1_<uuid>

Approach 3:
Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
“

Regards,
Felix K Jose
From: Raymond Xu <xu...@gmail.com>
Date: Wednesday, November 18, 2020 at 5:30 PM
To: dev@hudi.apache.org <de...@hudi.apache.org>
Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <n....@gmail.com>
Subject: Re: Hudi Record Key Best Practices
Hi Felix, I wasn't suggesting partition by user id, that'll be too many;
just saying maybe making the writes more evenly spreaded could be
better. Effectively, with 95% writes, it's like writing to a single
partition dataset. Hourly partition could mitigate the situation, since you
also have date-range queries. Just some rough ideas, the strategy really
depends on your data pattern and requirements.

For the development timeline on RFC 21, probably Vinoth or Balaji
could give more info.

On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Raymond,
> Thank you for the response.
>
> Yes, the virtual key definitely going to help reducing the storage
> footprint. When do you think it is going to be available and will it be
> compatible with all downstream processing engines (Presto, Redshift
> Spectrum etc.)? We have started our development activities and expecting to
> get into PROD by March-April timeframe.
>
> Regarding the partition key,  we get data every day from 10-20 million
> users and currently the data we are planning to partition is by Date
> (YYYY-MM-DD) and thereby we will have consistent partitions for downstream
> systems(every partition has same amount of data [20 million user data in
> each partition, rather than skewed partitions]). And most of our queries
> are date range queries for a given user-Id
>
> If I partition by user-Id, then I will have millions of partitions, and I
> have read that having large number of partition has major read impact (meta
> data management etc.), what do you think? Is my understanding correct?
>
> Yes, for current day most of the data will be for that day – so do you
> think it’s going to be a problem while writing (wont the BLOOM index help)?
> And that’s what I am trying to understand to land in a better performant
> solution.
>
> Meanwhile I would like to see my record Key construct as well, to see how
> it can help on write performance and downstream requirement to support
> GDPR.  To avoid any reprocessing/migration down the line.
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu...@gmail.com>
> Date: Tuesday, November 17, 2020 at 6:18 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
> <v....@ymail.com.invalid>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, looks like the use case will benefit from virtual key feature in
> this RFC
>
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>
> Once this is implemented, you don't have to create a separate key.
>
> A rough thought: you mentioned 95% writes go to the same partition. Rather
> than the record key, maybe consider improving on the partition field? to
> have more even writes across partitions for eg?
>


Re: Hudi Record Key Best Practices

Posted by Raymond Xu <xu...@gmail.com>.
Hi Felix, I wasn't suggesting partitioning by user id; that would be far
too many partitions. I was just saying that making the writes spread more
evenly could be better. Effectively, with 95% of writes going to one
partition, it's like writing to a single-partition dataset. Hourly
partitioning could mitigate the situation, since you also have date-range
queries. Just some rough ideas; the strategy really depends on your data
pattern and requirements.

For the development timeline on RFC 21, probably Vinoth or Balaji
could give more info.

On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Raymond,
> Thank you for the response.
>
> Yes, the virtual key is definitely going to help reduce the storage
> footprint. When do you think it will be available, and will it be
> compatible with all downstream processing engines (Presto, Redshift
> Spectrum, etc.)? We have started our development activities and expect to
> get into PROD in the March-April timeframe.
>
> Regarding the partition key, we get data every day from 10-20 million
> users, and we are currently planning to partition the data by Date
> (YYYY-MM-DD). That way we will have consistent partitions for downstream
> systems (every partition has roughly the same amount of data [data for 20
> million users in each partition] rather than skewed partitions). Most of
> our queries are date-range queries for a given user-Id.
>
> If I partition by user-Id, then I will have millions of partitions, and I
> have read that having a large number of partitions has a major read
> impact (metadata management, etc.). What do you think? Is my
> understanding correct?
>
> Yes, for the current day most of the data will be for that day, so do you
> think it's going to be a problem while writing (won't the BLOOM index
> help)? And that's what I am trying to understand in order to land on a
> more performant solution.
>
> Meanwhile I would like to settle my record key construction as well, to
> see how it can help with write performance and the downstream requirement
> to support GDPR, and to avoid any reprocessing/migration down the line.
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu...@gmail.com>
> Date: Tuesday, November 17, 2020 at 6:18 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <
> n.siva.b@gmail.com>, v.balaji@ymail.com.invalid
> <v....@ymail.com.invalid>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, looks like the use case will benefit from the virtual key
> feature in this RFC
>
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>
> Once this is implemented, you don't have to create a separate key.
>
> A rough thought: you mentioned that 95% of writes go to the same
> partition. Rather than the record key, maybe consider improving on the
> partition field, to have more even writes across partitions, for example?
>

Re: Hudi Record Key Best Practices

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Raymond,
Thank you for the response.

Yes, the virtual key is definitely going to help reduce the storage footprint. When do you think it will be available, and will it be compatible with all downstream processing engines (Presto, Redshift Spectrum, etc.)? We have started our development activities and expect to get into PROD in the March-April timeframe.

Regarding the partition key, we get data every day from 10-20 million users, and we are currently planning to partition the data by Date (YYYY-MM-DD). That way we will have consistent partitions for downstream systems (every partition has roughly the same amount of data [data for 20 million users in each partition] rather than skewed partitions). Most of our queries are date-range queries for a given user-Id.

If I partition by user-Id, then I will have millions of partitions, and I have read that having a large number of partitions has a major read impact (metadata management, etc.). What do you think? Is my understanding correct?

Yes, for the current day most of the data will be for that day, so do you think it's going to be a problem while writing (won't the BLOOM index help)? And that's what I am trying to understand in order to land on a more performant solution.
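
For reference, the BLOOM index behaviour is controlled by a handful of write configs; a minimal sketch below uses the option names as documented around Hudi 0.6.x, so they are worth double-checking against the deployed version.

    // Sketch only: bloom-index related options to pass alongside the write.
    // With record keys prefixed by the event timestamp, per-file key-range
    // pruning lets the index skip most files when looking up updates.
    val indexOpts = Map(
      "hoodie.index.type" -> "BLOOM",                  // default; GLOBAL_BLOOM enforces uniqueness across partitions
      "hoodie.bloom.index.prune.by.ranges" -> "true",  // use min/max key ranges per file
      "hoodie.bloom.index.filter.type" -> "DYNAMIC_V0" // filter auto-sizes as file groups grow
    )
    // then: df.write.format("hudi").options(indexOpts). ... as usual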

Meanwhile I would like to settle my record key construction as well, to see how it can help with write performance and the downstream requirement to support GDPR, and to avoid any reprocessing/migration down the line.
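
On the GDPR point, a minimal sketch of deleting by user identifier (F3) through the datasource "delete" operation is shown below; the base path, the nested column path and the key/partition field names are placeholders, not the actual schema.

    // Sketch only: delete all records for one user id (F3) across partitions.
    import org.apache.spark.sql.functions.col
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig

    val basePath = "s3://bucket/warehouse/my_table"
    // one glob level per partition level, plus one for the files (pre-0.9 read style)
    val toDelete = spark.read.format("hudi").load(basePath + "/*/*")
      .filter(col("user.id") === "example-user-uuid")  // placeholder path to the nested F3 field

    toDelete.write.format("hudi").
      option(HoodieWriteConfig.TABLE_NAME, "my_table").
      option(OPERATION_OPT_KEY, "delete").              // issues deletes for the matched keys
      option(RECORDKEY_FIELD_OPT_KEY, "record_key").
      option(PARTITIONPATH_FIELD_OPT_KEY, "partition_date").
      option(PRECOMBINE_FIELD_OPT_KEY, "F1").
      mode("append").
      save(basePath)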

Regards,
Felix K Jose

From: Raymond Xu <xu...@gmail.com>
Date: Tuesday, November 17, 2020 at 6:18 PM
To: dev@hudi.apache.org <de...@hudi.apache.org>
Cc: vinoth@apache.org <vi...@apache.org>, n.siva.b@gmail.com <n....@gmail.com>, v.balaji@ymail.com.invalid <v....@ymail.com.invalid>
Subject: Re: Hudi Record Key Best Practices
Hi Felix, looks like the use case will benefit from the virtual key
feature in this RFC

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual

Once this is implemented, you don't have to create a separate key.

A rough thought: you mentioned that 95% of writes go to the same partition.
Rather than the record key, maybe consider improving on the partition field,
to have more even writes across partitions, for example?



Re: Hudi Record Key Best Practices

Posted by Raymond Xu <xu...@gmail.com>.
Hi Felix, looks like the use case will benefit from the virtual key
feature in this RFC

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual

Once this is implemented, you don't have to create a separate key.

A rough thought: you mentioned that 95% of writes go to the same partition.
Rather than the record key, maybe consider improving on the partition field,
to have more even writes across partitions, for example?
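
For the key approaches discussed in the original post, a minimal sketch of what Approach 2 and Approach 3 could look like on the write path follows; the F1-F4 names are the thread's placeholders (assumed flattened here for brevity), and the key generator class/option names should be verified against the Hudi version in use.

    // Sketch only: two ways to express the record key.
    import java.util.UUID
    import org.apache.spark.sql.functions.{col, concat_ws, udf}

    // Approach 2: deterministic UUID over F2_F3_F4, prefixed with the event
    // time F1, so keys within a partition sort roughly by time (which helps
    // bloom-index range pruning).
    val nameUuid = udf((s: String) =>
      UUID.nameUUIDFromBytes(s.getBytes("UTF-8")).toString)
    val keyed = inputDf.withColumn("record_key",
      concat_ws("_", col("F1"),
        nameUuid(concat_ws("_", col("F2"), col("F3"), col("F4")))))
    // then write with: option(RECORDKEY_FIELD_OPT_KEY, "record_key")

    // Approach 3: let Hudi compose the key from the raw fields instead
    //   option("hoodie.datasource.write.keygenerator.class",
    //          "org.apache.hudi.keygen.ComplexKeyGenerator")
    //   option(RECORDKEY_FIELD_OPT_KEY, "F1,F2,F3,F4")  // nested fields use dotted paths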

On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hello All,
>
> I have asked generic questions regarding record key in slack channel, but
> I just want to consolidate everything regarding Record Key and the
> suggested best practices of Record Key construction to get better write
> performance.
>
> Table Type: COW
> Partition Path: Date
>
> My record uniqueness is derived from a combination of 4 fields:
>
>   1.  F1: Datetime (record’s origination datetime)
>   2.  F2: String       (11 char  long serial number)
>   3.  F3: UUID        (User Identifier)
>   4.  F4: String.       (12 CHAR statistic name)
>
> Note: My record is a nested document and some of the above fields are
> nested fields
>
> My Write Use Cases:
> 1. Writes to partitioned HUDI table every 15 minutes
>
>   1.  where 95% inserts and 5% updates,
>   2.  Also 95% write goes to same partition (current date) 5% write can
> span across multiple partitions
> 2. GDPR request to delete records from the table using User Identifier
> field (F3)
>
>
> Record Key Construction:
> Approach 1:
> Generate a UUID  from the concatenated String of all these 4 fields [eg:
> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
> newly generated field as Record Key
>
> Approach 2:
> Generate a UUID  from the concatenated String of 3 fields except datetime
> field(F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and prepend
> datetime field to the generated UUID and use that newly generated field as
> Record Key •F1_<uuid>
>
> Approach 3:
> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>
> Which is the approach you will suggest? Could you please help me?
>
> Regards,
> Felix K Jose
>