Posted to dev@hudi.apache.org by Yue Zhang <zh...@163.com> on 2022/04/18 06:15:10 UTC

[DISCUSS][NEW FEATURE] Hudi Lake Manager


Hi all, 
    I would like to discuss and contribute a new feature named Hudi Lake Manager.


    As more and more users from different companies and businesses begin to use Hudi pipelines to write data, data governance has gradually become one of the biggest pain points for users. To get better query performance or better timeliness, users need to carefully configure clustering, compaction, the cleaner, and archival for each ingestion pipeline, which undoubtedly brings higher learning and maintenance costs. Imagine a business with hundreds or thousands of ingestion pipelines: users then need to maintain hundreds or thousands of sets of configurations and possibly keep tuning them.
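
    To make this burden concrete, here is a rough sketch of the kind of per-pipeline table service knobs each ingestion job has to carry and tune today (the config keys below reflect recent Hudi releases and may differ across versions):

import java.util.Properties;

// Illustration only: every pipeline carries its own table service tuning,
// multiplied across hundreds of tables.
public class PerPipelineTableServiceConfig {
  public static Properties forPipeline(String tableName) {
    Properties props = new Properties();
    props.setProperty("hoodie.table.name", tableName);
    // Cleaning: how many commits of older file versions to retain.
    props.setProperty("hoodie.clean.automatic", "true");
    props.setProperty("hoodie.cleaner.commits.retained", "10");
    // Compaction (merge-on-read tables): trigger after N delta commits.
    props.setProperty("hoodie.compact.inline", "false");
    props.setProperty("hoodie.compact.inline.max.delta.commits", "5");
    // Async clustering for storage layout optimization.
    props.setProperty("hoodie.clustering.async.enabled", "true");
    // Archival of old instants on the timeline.
    props.setProperty("hoodie.keep.min.commits", "20");
    props.setProperty("hoodie.keep.max.commits", "30");
    return props;
  }
}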


    This new feature, Hudi Lake Manager, decouples Hudi ingestion from Hudi table services, including the cleaner, archival, clustering, compaction, and any table services added in the future.


    Users only need to care about their own ingestion pipelines and leave all the table services to the manager, which automatically discovers and manages the Hudi tables, thereby greatly reducing the operation and maintenance burden and the onboarding cost.


    This lake manager plays the role of a Hudi table master/coordinator: it can discover Hudi tables and automatically trigger services such as cleaner/clustering/compaction/archival (multi-writer and async) in a unified way, based on certain conditions.
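
    Just to illustrate the coordinator role, here is a minimal sketch of the main loop; TableCatalog, ServicePolicy and TableServiceRunner are hypothetical interfaces used only to show the shape of "discover tables, evaluate conditions, trigger existing table services":

import java.util.List;

// Sketch only, not a design: the manager plans, existing table services execute.
public class LakeManagerSketch {
  interface TableCatalog { List<String> discoverHudiTablePaths(); }
  interface ServicePolicy {
    boolean needsCompaction(String basePath);
    boolean needsCleaning(String basePath);
  }
  interface TableServiceRunner {
    void scheduleCompaction(String basePath);
    void scheduleCleaning(String basePath);
  }

  private final TableCatalog catalog;
  private final ServicePolicy policy;
  private final TableServiceRunner runner;

  LakeManagerSketch(TableCatalog catalog, ServicePolicy policy, TableServiceRunner runner) {
    this.catalog = catalog;
    this.policy = policy;
    this.runner = runner;
  }

  // One coordinator pass: discover, decide, schedule (async, multi-writer safe).
  public void runOnce() {
    for (String basePath : catalog.discoverHudiTablePaths()) {
      if (policy.needsCompaction(basePath)) {
        runner.scheduleCompaction(basePath);
      }
      if (policy.needsCleaning(basePath)) {
        runner.scheduleCleaning(basePath);
      }
      // clustering and archival would follow the same pattern
    }
  }
}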


    A common and interesting example: in our production environment, we basically use the date as the partition key and have specific data retention requirements. Today we need to write a script for each pipeline to delete the data and the corresponding Hive metadata. With this lake manager, we could expand the scope of the cleaner and implement a mechanism for data retention based on date partitions.
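
    For illustration, here is a minimal sketch of the retention planning part, assuming date partitions formatted as yyyy-MM-dd; the actual deletion of data files and the corresponding Hive partitions would be delegated to the existing cleaner / delete-partition machinery:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch only: decide which date partitions fall outside the retention window.
public class DateRetentionPlanner {
  private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

  public static List<String> expiredPartitions(List<String> datePartitions,
                                               int retentionDays,
                                               LocalDate today) {
    LocalDate cutoff = today.minusDays(retentionDays);
    List<String> expired = new ArrayList<>();
    for (String partition : datePartitions) {
      if (LocalDate.parse(partition, FMT).isBefore(cutoff)) {
        expired.add(partition);
      }
    }
    return expired;
  }
}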


    I found there is a very valuable RFC-36 in progress now, https://github.com/apache/hudi/pull/4718, a proposal for a Hudi metastore server that will store the metadata of Hudi tables. Maybe we could expand that RFC's scope to design and develop the lake manager, or we could raise a new RFC and take RFC-36 as an input.


    I hope we can discuss the feasibility of this idea; any feedback would be greatly appreciated.
    I also volunteer to contribute my part if possible.
Yue Zhang
zhangyue921010@163.com


Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Vinoth Chandar <vi...@apache.org>.
I left my thoughts on the RFC https://github.com/apache/hudi/pull/4309

I just see this as another deployment model where a centralized set of
microservices takes up scheduling and execution of Hudi's table services.

+1 on thinking about sharding, locking and HA upfront.

Thanks
Vinoth

On Thu, Apr 21, 2022 at 3:31 PM Alexey Kudinkin <al...@onehouse.ai> wrote:

> Hey, folks!
>
> I feel there's quite a bit of confusion in this thread, so let's try to
> clear it: my understanding (please correct me if I'm wrong) is that
> Lake Manager was referred to as a service in a similar interpretation of
> how we call compaction, clustering and cleaning a* table services.*
>
> So, i'd suggest for us to be extra careful in operating familiar terms to
> avoid stirring up the confusion: for all things related to *RPC services *
> (like Metastore Server) we can call them "servers"*, *and for compaction,
> clustering and the rest we stick w/ "table services".
>
> If my understanding of the proposal is correct, then I think the proposal
> is to consolidate knobs and levers for Data Governance, Data Management,
> etc
> w/in the layer called *Lake Manager, *which will be orchestrating already
> existing table services through a nicely abstracted high-level API.
>
> Regarding adding any new *server* components: given Hudi's *stateless*
> architecture where we rely on standalone execution engines (like Spark or
> Flink) to operate, i don't really see us introducing a server component
> directly into Hudi's core. Metastore Server on the other hand will be a
> *standalone* component, that Hudi (as well as other processes) could be
> relying on to access the metadata.
>
> On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang <zh...@apache.org>
> wrote:
>
> > Thanks for all your attention.
> > Sure, we do need to take care of high availability in design.
> >
> > Also in my opinion this lake manager wouldn't drive hudi into a database
> > on the cloud. It is just an official option. Something like
> > HoodieDeltaStreamer and help users to reduce maintenance and hudi data
> > governance efforts.
> >
> > As for resource and performance concerns, this lake manager should be
> > designed as a planner/master, for example, lake manager will call out
> > cleaner apis to launch a (spark/flink) execution to delete files under
> > certain conditions based on table metadata information, rather than doing
> > works itself. So that the workload and resources requirement is much
> less.
> > But in general, I agree that we have to consider failure recovery and
> high
> > availability, etc.
> >
> > On 2022/04/19 04:30:22 Simon Su wrote:
> > > >
> > > > I agree with Danny said. IMO, there are two points that should be
> > > > considered
> > >
> > > 1. If Lake Manager is designed as a service, so we should consider its
> > High
> > > Availability, Dynamic Expanding/Shrinking, and state consistency.
> > > 2. How many resources will Lake Manager used to execute those actions
> of
> > > HUDI such as compaction, clustering, etc..
> > >
> >
>

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Alexey Kudinkin <al...@onehouse.ai>.
Hey, folks!

I feel there's quite a bit of confusion in this thread, so let's try to
clear it up: my understanding (please correct me if I'm wrong) is that
Lake Manager was referred to as a service in the same sense in which we
call compaction, clustering and cleaning *table services*.

So, I'd suggest we be extra careful when using familiar terms, to avoid
stirring up confusion: for all things related to *RPC services* (like the
Metastore Server) we can call them "servers", and for compaction,
clustering and the rest we stick w/ "table services".

If my understanding of the proposal is correct, then I think the proposal
is to consolidate the knobs and levers for Data Governance, Data
Management, etc. within a layer called *Lake Manager*, which will
orchestrate the already existing table services through a nicely
abstracted high-level API.

Regarding adding any new *server* components: given Hudi's *stateless*
architecture, where we rely on standalone execution engines (like Spark or
Flink) to operate, I don't really see us introducing a server component
directly into Hudi's core. The Metastore Server, on the other hand, will
be a *standalone* component that Hudi (as well as other processes) could
rely on to access the metadata.

On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang <zh...@apache.org>
wrote:

> Thanks for all your attention.
> Sure, we do need to take care of high availability in design.
>
> Also in my opinion this lake manager wouldn't drive hudi into a database
> on the cloud. It is just an official option. Something like
> HoodieDeltaStreamer and help users to reduce maintenance and hudi data
> governance efforts.
>
> As for resource and performance concerns, this lake manager should be
> designed as a planner/master, for example, lake manager will call out
> cleaner apis to launch a (spark/flink) execution to delete files under
> certain conditions based on table metadata information, rather than doing
> works itself. So that the workload and resources requirement is much less.
> But in general, I agree that we have to consider failure recovery and high
> availability, etc.
>
> On 2022/04/19 04:30:22 Simon Su wrote:
> > >
> > > I agree with Danny said. IMO, there are two points that should be
> > > considered
> >
> > 1. If Lake Manager is designed as a service, so we should consider its
> High
> > Availability, Dynamic Expanding/Shrinking, and state consistency.
> > 2. How many resources will Lake Manager used to execute those actions of
> > HUDI such as compaction, clustering, etc..
> >
>

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Yue Zhang <zh...@apache.org>.
Thanks for all your attention.
Sure, we do need to take care of high availability in the design.

Also, in my opinion, this lake manager wouldn't turn Hudi into a database on the cloud. It is just an official option, something like HoodieDeltaStreamer, that helps users reduce maintenance and Hudi data governance efforts.

As for the resource and performance concerns, this lake manager should be designed as a planner/master. For example, the lake manager will call the cleaner APIs to launch a (Spark/Flink) execution that deletes files under certain conditions based on table metadata, rather than doing the work itself, so its own workload and resource requirements are much lower. But in general, I agree that we have to consider failure recovery, high availability, etc.
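
Just to illustrate the planner idea, here is a rough sketch of the manager submitting a Spark application that runs the existing cleaner. The class and flag names (org.apache.hudi.utilities.HoodieCleaner, --target-base-path, --hoodie-conf) are from hudi-utilities as I remember them and should be double-checked.

import org.apache.spark.launcher.SparkLauncher;

// Sketch only: the lake manager plans and submits, the cluster does the work.
public class CleanerLauncherSketch {
  public static void launchCleaner(String hudiUtilitiesBundleJar, String tableBasePath)
      throws Exception {
    Process cleaner = new SparkLauncher()
        .setAppResource(hudiUtilitiesBundleJar)
        .setMainClass("org.apache.hudi.utilities.HoodieCleaner")
        .setMaster("yarn")
        .addAppArgs("--target-base-path", tableBasePath,
                    "--hoodie-conf", "hoodie.cleaner.commits.retained=10")
        .launch();
    cleaner.waitFor();  // the manager itself only tracks job status
  }
}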
 
On 2022/04/19 04:30:22 Simon Su wrote:
> >
> > I agree with Danny said. IMO, there are two points that should be
> > considered
> 
> 1. If Lake Manager is designed as a service, so we should consider its High
> Availability, Dynamic Expanding/Shrinking, and state consistency.
> 2. How many resources will Lake Manager used to execute those actions of
> HUDI such as compaction, clustering, etc..
> 

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Simon Su <ba...@gmail.com>.
I agree with what Danny said. IMO, there are two points that should be
considered:

1. If Lake Manager is designed as a service, we should consider its high
availability, dynamic expansion/shrinking, and state consistency.
2. How many resources will Lake Manager use to execute those Hudi actions,
such as compaction, clustering, etc.?

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Y Ethan Guo <et...@gmail.com>.
From my point of view, this Lake Manager should be more like a
centralized management layer on top of Hudi tables that schedules
different table services and does data governance.  The scheduling /
managing part should be lightweight.  The execution should still happen in
the cluster; it should not be a single node executing all services,
creating bottlenecks.  And I agree that there should be fallbacks to
achieve high availability, e.g., if the main manager is down, there should
be a backup, or each table falls back to executing its table services
independently.  How to achieve this can be discussed later in the detailed
design.
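
Just to illustrate one option among many (not a design decision), the
"backup manager takes over" behavior could be sketched with ZooKeeper
leader election via Apache Curator; the detailed design may well choose a
different mechanism, e.g. the lock providers Hudi already supports:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch only: the active manager holds leadership; a standby instance
// waiting on the same latch takes over if the active one dies.
public class ManagerLeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    try (LeaderLatch latch = new LeaderLatch(zk, "/hudi/lake-manager/leader")) {
      latch.start();
      latch.await();  // block until this instance becomes the active manager
      // run the scheduling loop here while leadership is held
    } finally {
      zk.close();
    }
  }
}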

IMO, we should still keep the mode of running independent table services
and let users decide whether or not they want to use the Lake Manager to
manage table services (providing one more option here), not making it a
compulsory move, as you said.


On Mon, Apr 18, 2022 at 8:01 PM Danny Chan <da...@apache.org> wrote:

> I have different concerns here, the Lake Manager seems like a single
> node service here, and there is a risk that it becomes a bottleneck
> for handling too many table services. And for every single node
> service we should consider how to achieve high availability.
>
> What is the final state of the Hudi service here ? Should we drop the
> advantage of the server-less/light-weight architecture and moves
> forward to a service mode ?
> I mean will Hudi be more and more like a database on the cloud ?
>
> Best,
> Danny
>
> > Y Ethan Guo <et...@gmail.com> wrote on Tue, Apr 19, 2022 at 01:38:
> >
> > +1 This is a great idea! The proposed lake manager and centralized
> > management layer are essential to ease the burden of carrying out data
> > governance and optimizing the storage layout, making them independent of
> > ingestion and streaming.  I see that this provides a better abstraction
> for
> > any potential centralized maintenance and optimization beyond existing
> > table services.
> >
> > It would be good to have this centralized Lake Manager component in the
> > metastore server proposed by RFC-36.  RFC-43 can also somehow be part of
> > it.  The Lake Manager implementation can be self-contained in some way.
> >
> > On Mon, Apr 18, 2022 at 2:11 AM Shiyan Xu <xu...@gmail.com>
> > wrote:
> >
> > > Great idea, Zhang Yue! I see more potential collaborations in the work
> for
> > > the table management service in this RFC 43
> > > https://github.com/apache/hudi/pull/4309
> > >
> > > On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zh...@163.com>
> wrote:
> > >
> > > >
> > > >
> > > > Hi all,
> > > >     I would like to discuss and contribute a new feature named Hudi
> Lake
> > > > Manager.
> > > >
> > > >
> > > >     As more and more users from different companies and different
> > > > businesses begin to use the hudi pipeline to write data, data
> governance
> > > > has gradually become one of the most pain points for users. In order
> to
> > > get
> > > > better query performance or better timeliness, users need to
> carefully
> > > > configure clustering, compaction, cleaner and archive for each
> ingestion
> > > > pipeline, which will undoubtedly bring higher learning costs and
> > > > maintenance costs. Imagine that if a business has hundreds or
> thousands
> > > of
> > > > ingestion piplines, then users even need to maintain hundreds or
> > > thousands
> > > > of sets of configurations and keep tuning them maybe.
> > > >
> > > >
> > > >     This new Feature Hudi Lake Manager is to decouple hudi ingestion
> and
> > > > hudi table service, including cleaner, archival, clustering,
> comapction
> > > and
> > > > any table services in the feature.
> > > >
> > > >
> > > >     Users only need to care about their own ingest pipline and leave
> all
> > > > the table services to the manager to automatically discover and
> manage
> > > the
> > > > hudi table, thereby greatly reducing the pressure of operation and
> > > > maintenance and the cost of on board.
> > > >
> > > >
> > > >     This lake manager is  the role of a hudi table
> master/coordinator,
> > > > which can discover hudi tables and unify and automatically call out
> > > > services such as cleaner/clustering/compaction/archive(multi-writer
> and
> > > > async) based on certain conditions.
> > > >
> > > >
> > > >     A common and interesting example is that in our production
> > > > environment, we basically use date as the partition key and have
> specific
> > > > data retention requests. To do this we need to write a script for
> each
> > > > pipline to delete the data and the corresponding hive metadata. With
> this
> > > > lake manager, we can expand the scope of the cleaner, implement a
> > > mechanism
> > > > for data retention based on date partition.
> > > >
> > > >
> > > >     I found there is a very valuable RFC-36 on going now
> > > > https://github.com/apache/hudi/pull/4718 Proposal for hudi metastore
> > > > server, which will store the metadata of the hudi table, maybe we
> could
> > > > expand this RFC's scope to design and develop lake manager or we
> could
> > > > raise a new RFC and take this RFC-36 as information inputs.
> > > >
> > > >
> > > >     I hope we can discuss the feasibility of this idea, it would be
> > > > greatly appreciated.
> > > >     I also volunteer my part if it is possible.
> > > > | |
> > > > Yue Zhang
> > > > |
> > > > |
> > > > zhangyue921010@163.com
> > > > |
> > > >
> > > > --
> > > Best,
> > > Shiyan
> > >
>

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Danny Chan <da...@apache.org>.
I have different concerns here: the Lake Manager seems like a single-node
service, and there is a risk that it becomes a bottleneck when handling
too many table services. And for every single-node service we should
consider how to achieve high availability.

What is the final state of the Hudi service here? Should we drop the
advantage of the serverless/lightweight architecture and move forward to a
service mode? I mean, will Hudi become more and more like a database on
the cloud?

Best,
Danny

Y Ethan Guo <et...@gmail.com> wrote on Tue, Apr 19, 2022 at 01:38:
>
> +1 This is a great idea! The proposed lake manager and centralized
> management layer are essential to ease the burden of carrying out data
> governance and optimizing the storage layout, making them independent of
> ingestion and streaming.  I see that this provides a better abstraction for
> any potential centralized maintenance and optimization beyond existing
> table services.
>
> It would be good to have this centralized Lake Manager component in the
> metastore server proposed by RFC-36.  RFC-43 can also somehow be part of
> it.  The Lake Manager implementation can be self-contained in some way.
>
> On Mon, Apr 18, 2022 at 2:11 AM Shiyan Xu <xu...@gmail.com>
> wrote:
>
> > Great idea, Zhang Yue! I see more potential collaborations in the work for
> > the table management service in this RFC 43
> > https://github.com/apache/hudi/pull/4309
> >
> > On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zh...@163.com> wrote:
> >
> > >
> > >
> > > Hi all,
> > >     I would like to discuss and contribute a new feature named Hudi Lake
> > > Manager.
> > >
> > >
> > >     As more and more users from different companies and different
> > > businesses begin to use the hudi pipeline to write data, data governance
> > > has gradually become one of the most pain points for users. In order to
> > get
> > > better query performance or better timeliness, users need to carefully
> > > configure clustering, compaction, cleaner and archive for each ingestion
> > > pipeline, which will undoubtedly bring higher learning costs and
> > > maintenance costs. Imagine that if a business has hundreds or thousands
> > of
> > > ingestion piplines, then users even need to maintain hundreds or
> > thousands
> > > of sets of configurations and keep tuning them maybe.
> > >
> > >
> > >     This new Feature Hudi Lake Manager is to decouple hudi ingestion and
> > > hudi table service, including cleaner, archival, clustering, comapction
> > and
> > > any table services in the feature.
> > >
> > >
> > >     Users only need to care about their own ingest pipline and leave all
> > > the table services to the manager to automatically discover and manage
> > the
> > > hudi table, thereby greatly reducing the pressure of operation and
> > > maintenance and the cost of on board.
> > >
> > >
> > >     This lake manager is  the role of a hudi table master/coordinator,
> > > which can discover hudi tables and unify and automatically call out
> > > services such as cleaner/clustering/compaction/archive(multi-writer and
> > > async) based on certain conditions.
> > >
> > >
> > >     A common and interesting example is that in our production
> > > environment, we basically use date as the partition key and have specific
> > > data retention requests. To do this we need to write a script for each
> > > pipline to delete the data and the corresponding hive metadata. With this
> > > lake manager, we can expand the scope of the cleaner, implement a
> > mechanism
> > > for data retention based on date partition.
> > >
> > >
> > >     I found there is a very valuable RFC-36 on going now
> > > https://github.com/apache/hudi/pull/4718 Proposal for hudi metastore
> > > server, which will store the metadata of the hudi table, maybe we could
> > > expand this RFC's scope to design and develop lake manager or we could
> > > raise a new RFC and take this RFC-36 as information inputs.
> > >
> > >
> > >     I hope we can discuss the feasibility of this idea, it would be
> > > greatly appreciated.
> > >     I also volunteer my part if it is possible.
> > > | |
> > > Yue Zhang
> > > |
> > > |
> > > zhangyue921010@163.com
> > > |
> > >
> > > --
> > Best,
> > Shiyan
> >

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Y Ethan Guo <et...@gmail.com>.
+1 This is a great idea! The proposed lake manager and centralized
management layer are essential to ease the burden of carrying out data
governance and optimizing the storage layout, making them independent of
ingestion and streaming.  I see that this provides a better abstraction for
any potential centralized maintenance and optimization beyond existing
table services.

It would be good to have this centralized Lake Manager component in the
metastore server proposed by RFC-36.  RFC-43 can also somehow be part of
it.  The Lake Manager implementation can be self-contained in some way.

On Mon, Apr 18, 2022 at 2:11 AM Shiyan Xu <xu...@gmail.com>
wrote:

> Great idea, Zhang Yue! I see more potential collaborations in the work for
> the table management service in this RFC 43
> https://github.com/apache/hudi/pull/4309
>
> On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zh...@163.com> wrote:
>
> >
> >
> > Hi all,
> >     I would like to discuss and contribute a new feature named Hudi Lake
> > Manager.
> >
> >
> >     As more and more users from different companies and different
> > businesses begin to use the hudi pipeline to write data, data governance
> > has gradually become one of the most pain points for users. In order to
> get
> > better query performance or better timeliness, users need to carefully
> > configure clustering, compaction, cleaner and archive for each ingestion
> > pipeline, which will undoubtedly bring higher learning costs and
> > maintenance costs. Imagine that if a business has hundreds or thousands
> of
> > ingestion piplines, then users even need to maintain hundreds or
> thousands
> > of sets of configurations and keep tuning them maybe.
> >
> >
> >     This new Feature Hudi Lake Manager is to decouple hudi ingestion and
> > hudi table service, including cleaner, archival, clustering, comapction
> and
> > any table services in the feature.
> >
> >
> >     Users only need to care about their own ingest pipline and leave all
> > the table services to the manager to automatically discover and manage
> the
> > hudi table, thereby greatly reducing the pressure of operation and
> > maintenance and the cost of on board.
> >
> >
> >     This lake manager is  the role of a hudi table master/coordinator,
> > which can discover hudi tables and unify and automatically call out
> > services such as cleaner/clustering/compaction/archive(multi-writer and
> > async) based on certain conditions.
> >
> >
> >     A common and interesting example is that in our production
> > environment, we basically use date as the partition key and have specific
> > data retention requests. To do this we need to write a script for each
> > pipline to delete the data and the corresponding hive metadata. With this
> > lake manager, we can expand the scope of the cleaner, implement a
> mechanism
> > for data retention based on date partition.
> >
> >
> >     I found there is a very valuable RFC-36 on going now
> > https://github.com/apache/hudi/pull/4718 Proposal for hudi metastore
> > server, which will store the metadata of the hudi table, maybe we could
> > expand this RFC's scope to design and develop lake manager or we could
> > raise a new RFC and take this RFC-36 as information inputs.
> >
> >
> >     I hope we can discuss the feasibility of this idea, it would be
> > greatly appreciated.
> >     I also volunteer my part if it is possible.
> > | |
> > Yue Zhang
> > |
> > |
> > zhangyue921010@163.com
> > |
> >
> > --
> Best,
> Shiyan
>

Re: [DISCUSS][NEW FEATURE] Hudi Lake Manager

Posted by Shiyan Xu <xu...@gmail.com>.
Great idea, Zhang Yue! I see more potential for collaboration with the
work on the table management service in RFC-43:
https://github.com/apache/hudi/pull/4309

On Mon, Apr 18, 2022 at 2:15 PM Yue Zhang <zh...@163.com> wrote:

>
>
> Hi all,
>     I would like to discuss and contribute a new feature named Hudi Lake
> Manager.
>
>
>     As more and more users from different companies and different
> businesses begin to use the hudi pipeline to write data, data governance
> has gradually become one of the most pain points for users. In order to get
> better query performance or better timeliness, users need to carefully
> configure clustering, compaction, cleaner and archive for each ingestion
> pipeline, which will undoubtedly bring higher learning costs and
> maintenance costs. Imagine that if a business has hundreds or thousands of
> ingestion piplines, then users even need to maintain hundreds or thousands
> of sets of configurations and keep tuning them maybe.
>
>
>     This new Feature Hudi Lake Manager is to decouple hudi ingestion and
> hudi table service, including cleaner, archival, clustering, comapction and
> any table services in the feature.
>
>
>     Users only need to care about their own ingest pipline and leave all
> the table services to the manager to automatically discover and manage the
> hudi table, thereby greatly reducing the pressure of operation and
> maintenance and the cost of on board.
>
>
>     This lake manager is  the role of a hudi table master/coordinator,
> which can discover hudi tables and unify and automatically call out
> services such as cleaner/clustering/compaction/archive(multi-writer and
> async) based on certain conditions.
>
>
>     A common and interesting example is that in our production
> environment, we basically use date as the partition key and have specific
> data retention requests. To do this we need to write a script for each
> pipline to delete the data and the corresponding hive metadata. With this
> lake manager, we can expand the scope of the cleaner, implement a mechanism
> for data retention based on date partition.
>
>
>     I found there is a very valuable RFC-36 on going now
> https://github.com/apache/hudi/pull/4718 Proposal for hudi metastore
> server, which will store the metadata of the hudi table, maybe we could
> expand this RFC's scope to design and develop lake manager or we could
> raise a new RFC and take this RFC-36 as information inputs.
>
>
>     I hope we can discuss the feasibility of this idea, it would be
> greatly appreciated.
>     I also volunteer my part if it is possible.
> | |
> Yue Zhang
> |
> |
> zhangyue921010@163.com
> |
>
> --
Best,
Shiyan