Posted to dev@hudi.apache.org by vino yang <vi...@apache.org> on 2020/06/20 00:13:32 UTC

[DISCUSS] Introduce a write committed callback hook

Hi all,

We currently need to incrementally process an original hoodie table and
build a new table from it. Once a new commit completes on the original
hoodie table, we want it to be retrievable as soon as possible so that it
can be used for incremental view queries. With the existing capabilities,
one approach is to continuously poll Hoodie's Timeline for new commits.
Polling is a very common pattern, but it wastes resources whenever no new
commit has arrived.

We propose introducing a proactive notification (event callback) mechanism.
For example, a hook could be invoked after a successful commit. External
processors interested in the commit, such as scheduling systems, could use
the hook as their trigger. When a commit completes, the scheduling system
can launch the task that fetches the incremental data through the API
invoked in the callback, thereby completing the incremental processing.
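
To make the idea concrete, here is a minimal sketch of what such a hook could
look like. It is only an illustration: `CommitCallback`,
`CommitCallbackRegistry` and their methods are hypothetical names, not
existing Hudi classes, and the real design may differ.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical listener interface; this is not an existing Hudi API. */
interface CommitCallback {
    void onCommitCompleted(String tablePath, String commitTime);
}

/** Hypothetical registry that a writer could notify after a successful commit. */
final class CommitCallbackRegistry {
    private final List<CommitCallback> callbacks = new CopyOnWriteArrayList<>();

    void register(CommitCallback callback) {
        callbacks.add(callback);
    }

    /**
     * Notifies every registered callback and returns how many completed normally.
     * A failing callback is logged and skipped so that it cannot affect the
     * writer or the remaining callbacks.
     */
    int notifyCommitCompleted(String tablePath, String commitTime) {
        int delivered = 0;
        for (CommitCallback callback : callbacks) {
            try {
                callback.onCommitCompleted(tablePath, commitTime);
                delivered++;
            } catch (RuntimeException e) {
                System.err.println("Commit callback failed for " + commitTime + ": " + e.getMessage());
            }
        }
        return delivered;
    }
}
```

A scheduling system would register a callback that submits its incremental
job, instead of polling the Timeline on a fixed interval.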

Hudi's client module already has a `postCommit` method, but the existing
implementation is mainly used for compaction and cleanup after the commit.
It is also triggered a little too early: not everything has been processed
at that point, and we found that an exception thrown afterwards can still
cause the commit to be rolled back. We need to find a new location to
trigger this hook so that the commit is final by the time the hook fires.
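
The ordering concern can be sketched as follows. This is a simplified
illustration under assumed names (`CommitFlowSketch`, `commitThenNotify`), not
Hudi's actual write path: the hook fires only after both the commit and the
post-commit work have finished without an exception, so a rollback can never
follow a notification.

```java
import java.util.function.Consumer;

/** Hypothetical sketch of the write flow; not Hudi's actual client code. */
final class CommitFlowSketch {
    /**
     * Runs the commit and the post-commit work, and fires the hook only if
     * both succeed. Returns true when the hook was fired, false when the
     * commit was rolled back instead.
     */
    static boolean commitThenNotify(Runnable commit,
                                    Runnable postCommitWork,
                                    Runnable rollback,
                                    Consumer<String> hook,
                                    String commitTime) {
        try {
            commit.run();
            postCommitWork.run();
        } catch (RuntimeException e) {
            // Something failed before the commit became final: undo it, no notification.
            rollback.run();
            return false;
        }
        // Nothing after this point can roll the commit back.
        hook.accept(commitTime);
        return true;
    }
}
```

Placing the hook call after the `try` block is the whole point of the sketch:
downstream systems are told about a commit only once it can no longer be
undone.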

This is one of our own use cases, and we believe it would be a very useful
feature: combined with incremental queries, it would make incremental
processing more timely.

We hope to hear what the community thinks of this proposal. Any comments
and opinions are appreciated.

Best,
Vino

Re: [DISCUSS] Introduce a write committed callback hook

Posted by Vinoth Chandar <vi...@apache.org>.
This is a great discussion! Thanks!

On Mon, Jun 22, 2020 at 6:33 PM vino yang <ya...@gmail.com> wrote:

> Hi everyone,
>
> Thanks for sharing your thoughts.
>
> We have created a Jira issue to track this work.[1]
>
> Best,
> Vino
>
> [1]: https://issues.apache.org/jira/browse/HUDI-1037
>
> Vinoth Chandar <vi...@apache.org> wrote on Tue, Jun 23, 2020 at 6:38 AM:
>
> > Great, looks like a JIRA is in order? :), given we all agree
> > enthusiastically
> >
> > On Sun, Jun 21, 2020 at 8:10 PM Gary Li <ya...@gmail.com> wrote:
> >
> > > +1.
> > > That would be great to have a communication mechanism between
> > > downstream CDC applications chain.
> > > e.g. A->B->C->D. Right now I am using the commit timestamp to identify
> > > whether there is a new commit came in. But if I need to recompute app
> > > B, it’s difficult for C and D to aware they have to recompute as well,
> > > especially when the triggering frequencies are different.
> > >
> > > On Sun, Jun 21, 2020 at 6:11 PM hddong <hongdd2020@gmail.com> wrote:
> > > +1. a great feature.
> > >
> > > Sivabalan <n....@gmail.com> wrote on Mon, Jun 22, 2020 at 7:50 AM:
> > >
> > > > > +1. would be a nice addition.
> > > > >
> > > > > On Sun, Jun 21, 2020 at 12:02 PM vbalaji@apache.org
> > > > > <vb...@apache.org> wrote:
> > > > >
> > > > > >
> > > > > > +1. This would be a really good feature to have when building
> > > > > > dependent ETL pipelines.
> > > > > >
> > > > > >     On Friday, June 19, 2020, 05:13:45 PM PDT, vino yang
> > > > > > <vinoyang@apache.org> wrote:
> > > > > >
> > > > >  Hi all,
> > > > >
> > > > > Currently, we have a need to incrementally process and build a new
> > > table
> > > > > based on an original hoodie table. We expect that after a new
> commit
> > is
> > > > > completed on the original hoodie table, it could be retrieved ASAP,
> > so
> > > > that
> > > > > it can be used for incremental view queries. Based on the existing
> > > > > capabilities, one approach we can use is to continuously poll
> > Hoodie's
> > > > > Timeline to check for new commits. This is a very common processing
> > > way,
> > > > > but it will cause unnecessary waste of resources.
> > > > >
> > > > > We expect to introduce a proactive notification(event callback)
> > > > mechanism.
> > > > > For example, a hook can be introduced after a successful commit.
> > > External
> > > > > processors interested in the commit, such as scheduling systems,
> can
> > > use
> > > > > the hook as their own trigger. When a certain commit is completed,
> > the
> > > > > scheduling system can pull up the task of obtaining incremental
> data
> > > > > through the API in the callback. Thereby completing the processing
> of
> > > > > incremental data.
> > > > >
> > > > > There is currently a `postCommit` method in Hudi's client module,
> and
> > > the
> > > > > existing implementation is mainly used for compression and cleanup
> > > after
> > > > > commit. And the triggering time is a little early. Not after
> > everything
> > > > is
> > > > > processed, we found that it may still cause the rollback of the
> > commit
> > > > due
> > > > > to the exception. We need to find a new location to trigger this
> hook
> > > to
> > > > > ensure that the commit is deterministic.
> > > > >
> > > > > This is one of our scene requirements, and it will be a very useful
> > > > feature
> > > > > combined with the incremental query, it can make the incremental
> > > > processing
> > > > > more timely.
> > > > >
> > > > > We hope to hear what the community thinks of this proposal. Any
> > > comments
> > > > > and opinions are appreciated.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>

Re: [DISCUSS] Introduce a write committed callback hook

Posted by vino yang <ya...@gmail.com>.
Hi everyone,

Thanks for sharing your thoughts.

We have created a Jira issue to track this work.[1]

Best,
Vino

[1]: https://issues.apache.org/jira/browse/HUDI-1037


Re: [DISCUSS] Introduce a write committed callback hook

Posted by Vinoth Chandar <vi...@apache.org>.
Great, looks like a JIRA is in order? :), given we all agree
enthusiastically


Re: [DISCUSS] Introduce a write committed callback hook

Posted by Gary Li <ya...@gmail.com>.
+1.
It would be great to have a communication mechanism for a chain of downstream CDC applications,
e.g. A->B->C->D. Right now I use the commit timestamp to detect whether a new commit has come in. But if I need to recompute app B, it is difficult for C and D to know that they have to recompute as well, especially when their triggering frequencies differ.


Re: [DISCUSS] Introduce a write committed callback hook

Posted by hddong <ho...@gmail.com>.
+1. A great feature.


Re: [DISCUSS] Introduce a write committed callback hook

Posted by Sivabalan <n....@gmail.com>.
+1. Would be a nice addition.




-- 
Regards,
-Sivabalan

Re: [DISCUSS] Introduce a write committed callback hook

Posted by "vbalaji@apache.org" <vb...@apache.org>.
 
+1. This would be a really good feature to have when building dependent ETL pipelines.


Re: [DISCUSS] Introduce a write committed callback hook

Posted by Vinoth Chandar <vi...@apache.org>.
FWIW, we built this out at Uber at the ingest tool level (i.e.,
DeltaStreamer) and used it to notify the workflow scheduler so that
pipelines are triggered by data availability rather than by time. If we
can do some Airflow integration, that would be awesome (though it is
probably outside the scope of this work).

Not sure if Nick is still actively following this list. This is a feature
he has brought up time and again as well.

On Sun, Jun 21, 2020 at 8:15 AM Shiyan Xu <xu...@gmail.com>
wrote:

> +1. It is a great complement to the pull model; helpful to fan-out
> scenarios
>
> On Sun, Jun 21, 2020 at 8:07 AM Bhavani Sudha <bh...@gmail.com>
> wrote:
>
> > +1 . I think this is a valid use case and would be useful in general.
> >
> > On Sun, Jun 21, 2020 at 7:11 AM Vinoth Chandar <vi...@apache.org>
> > wrote:
> >
> > > +1 as well
> > >
> > > > We expect to introduce a proactive notification(event callback)
> > > > mechanism. For example, a hook can be introduced after a successful
> > > > commit.
> > >
> > > This would be very useful. We could write to a variety of event buses
> > > and announce the arrival of new data.
> > >
> > > On Sat, Jun 20, 2020 at 2:51 AM wangxianghu <wx...@126.com> wrote:
> > >
> > > > +1 for this, I think this is a feature worth doing.
> > > > Think about it in the field of offline computing: data changes happen
> > > > hourly or daily. If there is no notification mechanism to inform the
> > > > downstream, the downstream tasks will keep running all day long, even
> > > > though the time spent actually processing data may be very short.
> > > > This situation will surely waste resources.
> > > > > On Jun 20, 2020, at 8:13 AM, vino yang <vi...@apache.org> wrote:
> > > > >
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Introduce a write committed callback hook

Posted by Shiyan Xu <xu...@gmail.com>.
+1. It is a great complement to the pull model and helpful for fan-out scenarios.


Re: [DISCUSS] Introduce a write committed callback hook

Posted by Bhavani Sudha <bh...@gmail.com>.
+1. I think this is a valid use case and would be useful in general.


Re: [DISCUSS] Introduce a write committed callback hook

Posted by Vinoth Chandar <vi...@apache.org>.
+1 as well

> We expect to introduce a proactive notification(event callback)
mechanism. For example, a hook can be introduced after a successful commit.

This would be very useful. We could write to a variety of event buses and
notify downstream consumers of new data arrival.
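
To make the idea concrete, here is a minimal Java sketch of such a hook. It
is only an illustration under stated assumptions: `CommitEvent`,
`CommitCallback` and `SketchWriter` are hypothetical names made up for this
example, not existing Hudi classes, and the in-memory list merely stands in
for an event bus. The point it shows is the ordering discussed in the
proposal: listeners are notified only after every step that could still roll
the commit back has finished.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitCallbackSketch {

    /** Hypothetical payload describing a completed commit. */
    static final class CommitEvent {
        final String tablePath;
        final String commitTime;

        CommitEvent(String tablePath, String commitTime) {
            this.tablePath = tablePath;
            this.commitTime = commitTime;
        }
    }

    /** Hypothetical hook interface; not an existing Hudi API. */
    interface CommitCallback {
        void onCommitted(CommitEvent event);
    }

    /** Stand-in for a write client that fires callbacks after a commit. */
    static final class SketchWriter {
        private final String tablePath;
        private final List<CommitCallback> callbacks = new ArrayList<>();

        SketchWriter(String tablePath) {
            this.tablePath = tablePath;
        }

        void register(CommitCallback callback) {
            callbacks.add(callback);
        }

        /** Returns true if the commit completed and listeners were notified. */
        boolean commit(String commitTime, Runnable finalizeCommit) {
            try {
                // Everything that may still fail and trigger a rollback runs first.
                finalizeCommit.run();
            } catch (RuntimeException e) {
                return false; // treated as rolled back: nobody is notified
            }
            CommitEvent event = new CommitEvent(tablePath, commitTime);
            for (CommitCallback callback : callbacks) {
                try {
                    callback.onCommitted(event);
                } catch (RuntimeException e) {
                    // A failing listener must not undo an already completed commit.
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> bus = new ArrayList<>(); // stand-in for an event bus
        SketchWriter writer = new SketchWriter("/tmp/hoodie/source_table");
        writer.register(e -> bus.add(e.tablePath + "@" + e.commitTime));

        writer.commit("20200620001332", () -> { });
        writer.commit("20200620011332", () -> {
            throw new IllegalStateException("simulated failure before completion");
        });

        // Only the first commit reaches the bus.
        System.out.println(bus); // prints [/tmp/hoodie/source_table@20200620001332]
    }
}
```

A scheduler could register a callback like the one in `main` and start its
incremental pull from the reported commit time instead of polling the
timeline.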

On Sat, Jun 20, 2020 at 2:51 AM wangxianghu <wx...@126.com> wrote:

> +1 for this, I think this is a feature worth doing.
> Think about it in the field of offline computing: data changes happen
> hourly or daily. If there is no notification mechanism to inform the
> downstream, the downstream tasks will keep running all day long, even
> though the time spent actually processing data may be very short. This
> will surely waste resources.
> > On Jun 20, 2020, at 8:13 AM, vino yang <vi...@apache.org> wrote:
> >
> > Hi all,
> >
> > Currently, we have a need to incrementally process and build a new table
> > based on an original hoodie table. We expect that after a new commit is
> > completed on the original hoodie table, it could be retrieved ASAP, so that
> > it can be used for incremental view queries. Based on the existing
> > capabilities, one approach we can use is to continuously poll Hoodie's
> > Timeline to check for new commits. This is a very common processing way,
> > but it will cause unnecessary waste of resources.
> >
> > We expect to introduce a proactive notification(event callback) mechanism.
> > For example, a hook can be introduced after a successful commit. External
> > processors interested in the commit, such as scheduling systems, can use
> > the hook as their own trigger. When a certain commit is completed, the
> > scheduling system can pull up the task of obtaining incremental data
> > through the API in the callback. Thereby completing the processing of
> > incremental data.
> >
> > There is currently a `postCommit` method in Hudi's client module, and the
> > existing implementation is mainly used for compression and cleanup after
> > commit. And the triggering time is a little early. Not after everything is
> > processed, we found that it may still cause the rollback of the commit due
> > to the exception. We need to find a new location to trigger this hook to
> > ensure that the commit is deterministic.
> >
> > This is one of our scene requirements, and it will be a very useful feature
> > combined with the incremental query, it can make the incremental processing
> > more timely.
> >
> > We hope to hear what the community thinks of this proposal. Any comments
> > and opinions are appreciated.
> >
> > Best,
> > Vino
>
>

Re: [DISCUSS] Introduce a write committed callback hook

Posted by wangxianghu <wx...@126.com>.
+1 for this, I think this is a feature worth doing.
Think about it in the field of offline computing: data changes happen
hourly or daily. If there is no notification mechanism to inform the
downstream, the downstream tasks will keep running all day long, even
though the time spent actually processing data may be very short. This
will surely waste resources.
> On Jun 20, 2020, at 8:13 AM, vino yang <vi...@apache.org> wrote:
> 
> Hi all,
> 
> Currently, we have a need to incrementally process and build a new table
> based on an original hoodie table. We expect that after a new commit is
> completed on the original hoodie table, it could be retrieved ASAP, so that
> it can be used for incremental view queries. Based on the existing
> capabilities, one approach we can use is to continuously poll Hoodie's
> Timeline to check for new commits. This is a very common processing way,
> but it will cause unnecessary waste of resources.
> 
> We expect to introduce a proactive notification(event callback) mechanism.
> For example, a hook can be introduced after a successful commit. External
> processors interested in the commit, such as scheduling systems, can use
> the hook as their own trigger. When a certain commit is completed, the
> scheduling system can pull up the task of obtaining incremental data
> through the API in the callback. Thereby completing the processing of
> incremental data.
> 
> There is currently a `postCommit` method in Hudi's client module, and the
> existing implementation is mainly used for compression and cleanup after
> commit. And the triggering time is a little early. Not after everything is
> processed, we found that it may still cause the rollback of the commit due
> to the exception. We need to find a new location to trigger this hook to
> ensure that the commit is deterministic.
> 
> This is one of our scene requirements, and it will be a very useful feature
> combined with the incremental query, it can make the incremental processing
> more timely.
> 
> We hope to hear what the community thinks of this proposal. Any comments
> and opinions are appreciated.
> 
> Best,
> Vino