Posted to dev@hudi.apache.org by Abhishek Modi <mo...@uber.com.INVALID> on 2020/08/18 23:36:50 UTC

[DISCUSS] Support for `_hoodie_record_key` as a virtual column

Hi everyone!

I was hoping to discuss adding support for making `_hoodie_record_key` a
virtual column :)

Context:
Currently, _hoodie_record_key is written to DFS as a column in the Parquet
file. In our production systems at Uber, however, _hoodie_record_key
contains data that can be found in a different column (or set of columns).
This means that we are storing duplicated data.

Proposal:
In the interest of improving storage efficiency, we could add confs /
abstract classes that can construct the _hoodie_record_key given other
columns. That way we do not have to store duplicated data on DFS.
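
To make the idea concrete, here is a minimal sketch of rebuilding a key from other columns instead of storing it. It is not Hudi code: the class name, the `field:value` key format, and the column names (`trip_id`, `city`, `fare`) are assumptions made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names come from Hudi itself.
final class VirtualKeySketch {

    // Rebuilds the record key from the configured source columns, so the
    // key does not have to be stored as its own column next to the row.
    static String buildKey(Map<String, Object> row, List<String> keyFields) {
        return keyFields.stream()
                .map(field -> {
                    if (!row.containsKey(field)) {
                        throw new IllegalArgumentException("missing key field: " + field);
                    }
                    // Assumed format: "field:value" pairs joined by commas.
                    return field + ":" + row.get(field);
                })
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("trip_id", "t-42", "city", "sf", "fare", 12.5);
        // The key is derived on demand from columns that are stored anyway.
        System.out.println(buildKey(row, List.of("trip_id", "city")));
    }
}
```

Whatever format is chosen, it would have to stay stable over time, since changing it would change the keys of rows that are already written.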

Any thoughts on this?

Best,
Modi

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Vinoth Chandar <vi...@apache.org>.
+1

Good point btw, Gary! We need some mechanism to reconstruct the key on the
reader side.

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by "vbalaji@apache.org" <vb...@apache.org>.
Regarding queries taking a hit, we need to distinguish queries that do not need the _hoodie_record_key column from those that do. I would think the most common case is queries that do not require _hoodie_record_key. This is similar to how Hive query integration for bootstrapped tables is supported. We should implement this along similar lines to preserve the fast path.
At the end of the day, this is going to be an optional feature. For cases where the storage overhead of _hoodie_record_key is higher, this feature would help.
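
The fast-path distinction above could look roughly like the sketch below. This is only an illustration under assumptions: `KeyProjectionSketch`, its methods, and the single source field are hypothetical, and only the column name `_hoodie_record_key` comes from this thread. The point is that a query which does not project the key never pays for rebuilding it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader-side sketch; not an actual Hudi or query-engine API.
final class KeyProjectionSketch {
    static final String KEY_COLUMN = "_hoodie_record_key";

    private final String keySourceField; // assumed: the key mirrors one stored column
    private int keyRebuilds = 0;         // counts how often the slow path ran

    KeyProjectionSketch(String keySourceField) {
        this.keySourceField = keySourceField;
    }

    // Returns only the requested columns. The virtual key is rebuilt
    // only when the projection explicitly asks for it.
    Map<String, Object> project(Map<String, Object> storedRow, List<String> projection) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String column : projection) {
            if (KEY_COLUMN.equals(column)) {
                keyRebuilds++;
                result.put(column, String.valueOf(storedRow.get(keySourceField)));
            } else {
                result.put(column, storedRow.get(column));
            }
        }
        return result;
    }

    int keyRebuilds() {
        return keyRebuilds;
    }
}
```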
Thanks,
Balaji.V

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Gary Li <ga...@outlook.com>.
Yes, it works this way. IMO, reading certain fields from the row and then building the key from them is not as efficient as having a key stored separately from the actual record. But if we only make it virtual in the Parquet file, then this should be fine.

For Parquet, the downside I can see is that we won’t be able to use ColumnBatch to fetch the key anymore. To get the key, we might need to scan the entire row. But since most of the data is not touched during an upsert, I can see the storage benefit outweighing the read performance cost.

From the reader side, are we trying to keep _hoodie_record_key in the table view, or to exclude it from the query side? If we still want to include it, then we might need to support the virtual feature in multiple query engines to get a consistent view.

I don't mean to stop this from making progress. These are just some questions that popped into my head, and I think they are all solvable. Happy to discuss more in the RFC if we move forward :)

Best,
Gary

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Gary,

It should not be that bad, right? We can just create a new implementation
of HoodieRecord where getKey() dynamically fetches a field's value as the
recordKey.
Then the rest of the code should work as-is?
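
A rough sketch of that idea follows. It deliberately does not use the real HoodieRecord class: `KeyedRecord`, `StoredKeyRecord`, and `VirtualKeyRecord` are hypothetical stand-ins, and the assumption is simply that callers only ever go through `getKey()`.

```java
import java.util.Map;

// Hypothetical stand-ins for illustration; not Hudi's actual classes.
final class VirtualKeyRecordSketch {

    interface KeyedRecord {
        String getKey();
    }

    // Today's behaviour: the key is materialized and carried with the record.
    static final class StoredKeyRecord implements KeyedRecord {
        private final String key;

        StoredKeyRecord(String key) {
            this.key = key;
        }

        @Override
        public String getKey() {
            return key;
        }
    }

    // Virtual variant: the key is read from a data field on every call.
    static final class VirtualKeyRecord implements KeyedRecord {
        private final Map<String, Object> data;
        private final String keyField;

        VirtualKeyRecord(Map<String, Object> data, String keyField) {
            this.data = data;
            this.keyField = keyField;
        }

        @Override
        public String getKey() {
            return String.valueOf(data.get(keyField));
        }
    }

    // Caller code that depends only on getKey() does not need to change.
    static boolean sameRecord(KeyedRecord a, KeyedRecord b) {
        return a.getKey().equals(b.getKey());
    }
}
```

The trade-off is that the virtual variant recomputes the key on each call, so a caller that looks it up repeatedly might want to cache the result.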

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Gary Li <ga...@outlook.com>.
Thanks for the proposal. I am a bit concerned about this feature. _hoodie_record_key is the primary key of the Hudi table, and we need this field for indexing, sorting, merging, etc. We use _hoodie_record_key very frequently, so I can see overhead on both the write side and the read side. I am not sure this is worth it. Making _hoodie_partition_path virtual makes more sense to me.

On the implementation side, if we support this as an option, we need special handling in many places. For example, in the LogScanner, _hoodie_record_key is the key of the lookup HashMap, and this Map could be spilled to disk as well. This would be a significant amount of work from my perspective.
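
To illustrate the kind of code path affected, here is a simplified sketch of a lookup map keyed by record key in which the key comes from a pluggable function rather than a stored column. It is not the actual LogScanner: `LogMergeSketch` and `mergeByKey` are hypothetical names, records are plain maps, and spilling to disk is not modelled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of key-based log merging; not Hudi's LogScanner.
final class LogMergeSketch {

    // Keeps the latest record per key. The key is obtained through keyOf,
    // so it may be read from a stored column or derived from data fields.
    static Map<String, Map<String, Object>> mergeByKey(
            List<Map<String, Object>> logRecords,
            Function<Map<String, Object>, String> keyOf) {
        Map<String, Map<String, Object>> latest = new LinkedHashMap<>();
        for (Map<String, Object> record : logRecords) {
            // Later records overwrite earlier ones with the same key.
            latest.put(keyOf.apply(record), record);
        }
        return latest;
    }
}
```

Every place that builds or consults such a map would need to use the same derivation, which is where the extra handling comes from.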

-1 for making _hoodie_record_key virtual.
+1 for _hoodie_partition_path

Best Regards,
Gary Li
 

On 8/22/20, 9:09 PM, "Sivabalan" <n....@gmail.com> wrote:

    Aah, yes. That’s right.

    On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <vi...@apache.org> wrote:

    > All of the remaining meta fields compress very very nicely. They have
    > almost no overhead.
    >
    > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <mo...@uber.com.invalid>
    > wrote:
    >
    > > @sivabalan the current plan is to only add this for hoodie_record_key.
    > > But I'm hoping to make the implementation general enough to add other
    > > columns as well going forward :)
    > >
    > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <n....@gmail.com> wrote:
    > >
    > > > +1 for virtual record keys. Do you also propose to generalize this for
    > > > partition path as well?
    > > >
    > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <pratyaksh13@gmail.com>
    > > > wrote:
    > > >
    > > > > This is a good option to have. :)
    > > > >
    > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <vi...@apache.org>
    > > > > wrote:
    > > > >
    > > > > > IIRC _hoodie_record_key was supposed to be this standardized key
    > > > > > field. :)
    > > > > > Anyways, it's good to provide this option to the user.
    > > > > > So +1 for RFC/further discussion.
    > > > > >
    > > > > > To level set, I want to also share some of the benefits of having
    > > > > > an explicit key column.
    > > > > > a) if you build your data lake using a bunch of hudi tables, now
    > > > > > you have a standardized data model
    > > > > > b) Even if your key generator changes, it does not affect the
    > > > > > existing data's keys, and updates will be matched correctly.
    > > > > >
    > > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
    > > > > > <v....@ymail.com.invalid> wrote:
    > > > > >
    > > > > > >  +1. This should be good to have as an option. If everybody
    > > > > > > agrees, please go ahead with RFC and we can discuss details there.
    > > > > > > Balaji.V
    > > > > > >
    > > > > > > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
    > > > > > > <mo...@uber.com.invalid> wrote:
    > > > > > >
    > > > > > >  Hi everyone!
    > > > > > >
    > > > > > > I was hoping to discuss adding support for making
    > > > > > > `_hoodie_record_key` a virtual column :)
    > > > > > >
    > > > > > > Context:
    > > > > > > Currently, _hoodie_record_key is written to DFS, as a column in
    > > > > > > the Parquet file. In our production systems at Uber however,
    > > > > > > _hoodie_record_key contains data that can be found in a different
    > > > > > > column (or set of columns).
    > > > > > > This means that we are storing duplicated data.
    > > > > > >
    > > > > > > Proposal:
    > > > > > > In the interest of improving storage efficiency, we could add
    > > > > > > confs / abstract classes that can construct the
    > > > > > > _hoodie_record_key given other columns. That way we do not have
    > > > > > > to store duplicated data on DFS.
    > > > > > >
    > > > > > > Any thoughts on this?
    > > > > > >
    > > > > > > Best,
    > > > > > > Modi
    > > >
    > > > --
    > > > Regards,
    > > > -Sivabalan
    >
    > --
    Regards,
    -Sivabalan


Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Sivabalan <n....@gmail.com>.
Aah, yes. That’s right.

On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar <vi...@apache.org> wrote:

> All of the remaining meta fields compress very very nicely. They have
> almost no overhead.
>
> On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <mo...@uber.com.invalid>
> wrote:
>
> > @sivabalan the current plan is to only add this for _hoodie_record_key. But
> > I'm hoping to make the implementation general enough to add other columns
> > as well going forward :)
> >
> > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <n....@gmail.com> wrote:
> >
> > > +1 for virtual record keys. Do you also propose to generalize this for
> > > partition path as well?
> > >
> > >
> > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <pratyaksh13@gmail.com>
> > > wrote:
> > >
> > > > This is a good option to have. :)
> > > >
> > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <vi...@apache.org>
> > > > wrote:
> > > >
> > > > > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > > > > Anyways, it's good to provide this option to the user.
> > > > > So +1 for RFC/further discussion.
> > > > >
> > > > > To level set, I want to also share some of the benefits of having an
> > > > > explicit key column.
> > > > > a) if you build your data lake using a bunch of hudi tables, now you have a
> > > > > standardized data model
> > > > > b) Even if your key generator changes, it does not affect the existing
> > > > > data's keys, and updates will be matched correctly.
> > > > >
> > > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> > > > > <v....@ymail.com.invalid> wrote:
> > > > >
> > > > > > +1. This should be good to have as an option. If everybody agrees,
> > > > > > please go ahead with RFC and we can discuss details there.
> > > > > > Balaji.V
> > > > > >
> > > > > > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > > > > > <mo...@uber.com.invalid> wrote:
> > > > > >
> > > > > > Hi everyone!
> > > > > >
> > > > > > I was hoping to discuss adding support for making `_hoodie_record_key` a
> > > > > > virtual column :)
> > > > > >
> > > > > > Context:
> > > > > > Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
> > > > > > file. In our production systems at Uber however, _hoodie_record_key
> > > > > > contains data that can be found in a different column (or set of columns).
> > > > > > This means that we are storing duplicated data.
> > > > > >
> > > > > > Proposal:
> > > > > > In the interest of improving storage efficiency, we could add confs /
> > > > > > abstract classes that can construct the _hoodie_record_key given other
> > > > > > columns. That way we do not have to store duplicated data on DFS.
> > > > > >
> > > > > > Any thoughts on this?
> > > > > >
> > > > > > Best,
> > > > > > Modi
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >

--
Regards,
-Sivabalan

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Vinoth Chandar <vi...@apache.org>.
All of the remaining meta fields compress very very nicely. They have
almost no overhead.

On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi <mo...@uber.com.invalid>
wrote:

> @sivabalan the current plan is to only add this for _hoodie_record_key. But
> I'm hoping to make the implementation general enough to add other columns
> as well going forward :)
>
> On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <n....@gmail.com> wrote:
>
> > +1 for virtual record keys. Do you also propose to generalize this for
> > partition path as well?
> >
> >
> > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <pr...@gmail.com>
> > wrote:
> >
> > > This is a good option to have. :)
> > >
> > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <vi...@apache.org>
> > wrote:
> > >
> > > > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > > > Anyways, it's good to provide this option to the user.
> > > > So +1 for RFC/further discussion.
> > > >
> > > > To level set, I want to also share some of the benefits of having an
> > > > explicit key column.
> > > > a) if you build your data lake using a bunch of hudi tables, now you have a
> > > > standardized data model
> > > > b) Even if your key generator changes, it does not affect the existing
> > > > data's keys, and updates will be matched correctly.
> > > >
> > > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> > > > <v....@ymail.com.invalid> wrote:
> > > >
> > > > >  +1. This should be good to have as an option. If everybody agrees,
> > > > please
> > > > > go ahead with RFC and we can discuss details there.
> > > > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek
> > Modi
> > > > > <mo...@uber.com.invalid> wrote:
> > > > >
> > > > >  Hi everyone!
> > > > >
> > > > > I was hoping to discuss adding support for making
> > `_hoodie_record_key`
> > > a
> > > > > virtual column :)
> > > > >
> > > > > Context:
> > > > > Currently, _hoodie_record_key is written to DFS, as a column in the
> > > > Parquet
> > > > > file. In our production systems at Uber however, _hoodie_record_key
> > > > > contains data that can be found in a different column (or set of
> > > > columns).
> > > > > This means that we are storing duplicated data.
> > > > >
> > > > > Proposal:
> > > > > In the interest of improving storage efficiency, we could add
> confs /
> > > > > abstract classes that can construct the _hoodie_record_key given
> > other
> > > > > columns. That way we do not have to store duplicated data on DFS.
> > > > >
> > > > > Any thoughts on this?
> > > > >
> > > > > Best,
> > > > > Modi
> > > > >
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Abhishek Modi <mo...@uber.com.INVALID>.
@sivabalan the current plan is to only add this for _hoodie_record_key. But
I'm hoping to make the implementation general enough to add other columns
as well going forward :)

On Fri, Aug 21, 2020 at 11:49 AM Sivabalan <n....@gmail.com> wrote:

> +1 for virtual record keys. Do you also propose to generalize this for
> partition path as well?
>
>
> On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <pr...@gmail.com>
> wrote:
>
> > This is a good option to have. :)
> >
> > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <vi...@apache.org>
> wrote:
> >
> > > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > > Anyways, it's good to provide this option to the user.
> > > So +1 for RFC/further discussion.
> > >
> > > To level set, I want to also share some of the benefits of having an
> > > explicit key column.
> > > a) if you build your data lake using a bunch of hudi tables, now you have a
> > > standardized data model
> > > b) Even if your key generator changes, it does not affect the existing
> > > data's keys, and updates will be matched correctly.
> > >
> > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> > > <v....@ymail.com.invalid> wrote:
> > >
> > > >  +1. This should be good to have as an option. If everybody agrees,
> > > please
> > > > go ahead with RFC and we can discuss details there.
> > > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek
> Modi
> > > > <mo...@uber.com.invalid> wrote:
> > > >
> > > >  Hi everyone!
> > > >
> > > > I was hoping to discuss adding support for making
> `_hoodie_record_key`
> > a
> > > > virtual column :)
> > > >
> > > > Context:
> > > > Currently, _hoodie_record_key is written to DFS, as a column in the
> > > Parquet
> > > > file. In our production systems at Uber however, _hoodie_record_key
> > > > contains data that can be found in a different column (or set of
> > > columns).
> > > > This means that we are storing duplicated data.
> > > >
> > > > Proposal:
> > > > In the interest of improving storage efficiency, we could add confs /
> > > > abstract classes that can construct the _hoodie_record_key given
> other
> > > > columns. That way we do not have to store duplicated data on DFS.
> > > >
> > > > Any thoughts on this?
> > > >
> > > > Best,
> > > > Modi
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Sivabalan <n....@gmail.com>.
+1 for virtual record keys. Do you also propose to generalize this for
partition path as well?


On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <pr...@gmail.com>
wrote:

> This is a good option to have. :)
>
> On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > Anyways, it's good to provide this option to the user.
> > So +1 for RFC/further discussion.
> >
> > To level set, I want to also share some of the benefits of having an
> > explicit key column.
> > a) if you build your data lake using a bunch of hudi tables, now you have a
> > standardized data model
> > b) Even if your key generator changes, it does not affect the existing
> > data's keys, and updates will be matched correctly.
> >
> > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> > <v....@ymail.com.invalid> wrote:
> >
> > >  +1. This should be good to have as an option. If everybody agrees,
> > please
> > > go ahead with RFC and we can discuss details there.
> > > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > > <mo...@uber.com.invalid> wrote:
> > >
> > >  Hi everyone!
> > >
> > > I was hoping to discuss adding support for making `_hoodie_record_key`
> a
> > > virtual column :)
> > >
> > > Context:
> > > Currently, _hoodie_record_key is written to DFS, as a column in the
> > Parquet
> > > file. In our production systems at Uber however, _hoodie_record_key
> > > contains data that can be found in a different column (or set of
> > columns).
> > > This means that we are storing duplicated data.
> > >
> > > Proposal:
> > > In the interest of improving storage efficiency, we could add confs /
> > > abstract classes that can construct the _hoodie_record_key given other
> > > columns. That way we do not have to store duplicated data on DFS.
> > >
> > > Any thoughts on this?
> > >
> > > Best,
> > > Modi
> > >
> >
>


-- 
Regards,
-Sivabalan

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Pratyaksh Sharma <pr...@gmail.com>.
This is a good option to have. :)

On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <vi...@apache.org> wrote:

> IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> Anyways, it's good to provide this option to the user.
> So +1 for RFC/further discussion.
>
> To level set, I want to also share some of the benefits of having an
> explicit key column.
> a) if you build your data lake using a bunch of hudi tables, now you have a
> standardized data model
> b) Even if your key generator changes, it does not affect the existing
> data's keys, and updates will be matched correctly.
>
> On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> <v....@ymail.com.invalid> wrote:
>
> >  +1. This should be good to have as an option. If everybody agrees,
> please
> > go ahead with RFC and we can discuss details there.
> > Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > <mo...@uber.com.invalid> wrote:
> >
> >  Hi everyone!
> >
> > I was hoping to discuss adding support for making `_hoodie_record_key` a
> > virtual column :)
> >
> > Context:
> > Currently, _hoodie_record_key is written to DFS, as a column in the
> Parquet
> > file. In our production systems at Uber however, _hoodie_record_key
> > contains data that can be found in a different column (or set of
> columns).
> > This means that we are storing duplicated data.
> >
> > Proposal:
> > In the interest of improving storage efficiency, we could add confs /
> > abstract classes that can construct the _hoodie_record_key given other
> > columns. That way we do not have to store duplicated data on DFS.
> >
> > Any thoughts on this?
> >
> > Best,
> > Modi
> >
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Vinoth Chandar <vi...@apache.org>.
IIRC _hoodie_record_key was supposed to be this standardized key field. :)
Anyways, it's good to provide this option to the user.
So +1 for RFC/further discussion.

To level set, I want to also share some of the benefits of having an
explicit key column.
a) If you build your data lake using a bunch of Hudi tables, you now have a
standardized data model.
b) Even if your key generator changes, it does not affect the existing
data's keys, and updates will be matched correctly.
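
Point (b) can be illustrated with a minimal sketch. The two key functions
and the sample row below are hypothetical and are not Hudi's actual key
generator classes; the only name taken from this thread is the
_hoodie_record_key column.

```python
from typing import Any, Dict


def key_v1(row: Dict[str, Any]) -> str:
    # Hypothetical original key generator: the key is the trip id alone.
    return str(row["trip_id"])


def key_v2(row: Dict[str, Any]) -> str:
    # Hypothetical replacement generator: the key is trip id plus city.
    return f"{row['trip_id']}:{row['city']}"


# A record that was written while key_v1 was the configured generator.
row = {"trip_id": 42, "city": "sf"}

# Explicit key column: the key is materialized once, at write time.
stored_explicit = {**row, "_hoodie_record_key": key_v1(row)}

# Virtual key column: only the data columns are stored.
stored_virtual = dict(row)

# Later the table is reconfigured to use key_v2.
explicit_key_after_change = stored_explicit["_hoodie_record_key"]
virtual_key_after_change = key_v2(stored_virtual)

print(explicit_key_after_change)  # key as originally written
print(virtual_key_after_change)   # key recomputed with the new generator
```

Under these assumptions the stored key is unaffected by the change, while
the recomputed key differs from the one the record was written with, so a
virtual key would only stay stable if the table keeps using the generator
it was written with.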

On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>  +1. This should be good to have as an option. If everybody agrees, please
> go ahead with RFC and we can discuss details there.
> Balaji.V    On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> <mo...@uber.com.invalid> wrote:
>
>  Hi everyone!
>
> I was hoping to discuss adding support for making `_hoodie_record_key` a
> virtual column :)
>
> Context:
> Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
> file. In our production systems at Uber however, _hoodie_record_key
> contains data that can be found in a different column (or set of columns).
> This means that we are storing duplicated data.
>
> Proposal:
> In the interest of improving storage efficiency, we could add confs /
> abstract classes that can construct the _hoodie_record_key given other
> columns. That way we do not have to store duplicated data on DFS.
>
> Any thoughts on this?
>
> Best,
> Modi
>

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
+1. This should be good to have as an option. If everybody agrees, please go ahead with RFC and we can discuss details there.
Balaji.V

On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi <mo...@uber.com.invalid> wrote:

Hi everyone!

I was hoping to discuss adding support for making `_hoodie_record_key` a
virtual column :)

Context:
Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
file. In our production systems at Uber however, _hoodie_record_key
contains data that can be found in a different column (or set of columns).
This means that we are storing duplicated data.

Proposal:
In the interest of improving storage efficiency, we could add confs /
abstract classes that can construct the _hoodie_record_key given other
columns. That way we do not have to store duplicated data on DFS.

Any thoughts on this?

Best,
Modi
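
The proposal above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not Hudi's API: the class names
VirtualKeyGenerator and ColumnsKeyGenerator, the helpers strip_key and
with_virtual_key, the ":" separator, and the sample columns are all
hypothetical. Only the _hoodie_record_key column name comes from the thread.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

RECORD_KEY_FIELD = "_hoodie_record_key"  # meta column discussed in this thread


class VirtualKeyGenerator(ABC):
    """Hypothetical base class that derives the record key from data columns."""

    @abstractmethod
    def record_key(self, row: Dict[str, Any]) -> str:
        """Return the record key computed from the row's data columns."""


class ColumnsKeyGenerator(VirtualKeyGenerator):
    """Hypothetical generator configured with the columns that form the key."""

    def __init__(self, key_fields: List[str], separator: str = ":") -> None:
        self.key_fields = key_fields
        self.separator = separator

    def record_key(self, row: Dict[str, Any]) -> str:
        return self.separator.join(str(row[f]) for f in self.key_fields)


def strip_key(row: Dict[str, Any]) -> Dict[str, Any]:
    # Writer side: drop the key column so it is not stored a second time.
    return {k: v for k, v in row.items() if k != RECORD_KEY_FIELD}


def with_virtual_key(
    stored_row: Dict[str, Any], keygen: VirtualKeyGenerator
) -> Dict[str, Any]:
    # Reader side: rebuild the key column from the stored data columns.
    return {RECORD_KEY_FIELD: keygen.record_key(stored_row), **stored_row}


keygen = ColumnsKeyGenerator(["trip_id", "city"])
incoming = {RECORD_KEY_FIELD: "42:sf", "trip_id": 42, "city": "sf", "fare": 9.5}

stored = strip_key(incoming)                 # what would be written to DFS
restored = with_virtual_key(stored, keygen)  # what a reader would see
```

In this sketch the key is never persisted, and a reader that needs it pays
the cost of recomputing it from the key columns on every row.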