Posted to dev@hudi.apache.org by vino yang <ya...@gmail.com> on 2020/03/04 10:28:41 UTC

[DISCUSS] Restructure hudi-utilities module

Hi folks,

Currently, the content of hudi-utilities looks a bit mixed. To summarize,
it falls into the two groups listed below:


   - Delta streamer and its related packages, e.g. deltastreamer, sources,
   schema, transform; these packages serve the delta streamer.
   - Some utility tools such as HDFSParquetImporter, HiveIncrementalPuller,
   HoodieCleaner and so on


We are trying to refactor the business logic that is tied to the compute
engine. Delta Streamer will be affected, especially the sources package,
which is the starting point of a Spark/Flink job. Doing this restructuring
would make that work clearer and more focused.

I would like to propose restructuring the hudi-utilities module. Delta
streamer is a great feature of Hudi, and its logic lives largely in
hudi-utilities. Can we raise its prominence by making the delta streamer a
module of its own? It could be named e.g. hudi-delta or something else.
hudi-utilities would then be a real utilities module hosting the
HDFSParquetImporter, HiveIncrementalPuller and HoodieCleaner tools.

In short, the restructuring would consist of:


   - create a new module named “hudi-delta” (or another name?) and move the
   deltastreamer, sources, schema, transform … packages into it
   - leave HDFSParquetImporter, HiveIncrementalPuller, HoodieCleaner … in
   their current place (the utilities module)

What do you think?

Any comments and suggestions are welcome and appreciated.

Best,
Vino

Re: [DISCUSS] Restructure hudi-utilities module

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
+1 on Vinoth's suggestion to wait until the lower level (write-client) has
been refactored and reorganized. We can then look at Data-Source and
DeltaStreamer and decide how best to organize them.

Balaji.V

On Sunday, March 8, 2020, 11:06:13 PM PDT, Vinoth Chandar <vi...@apache.org> wrote:

> >> make delta streamer an engine-agnostic component so that Spark and
> >> Flink can share some common logic.
>
> If we make the change at the Write Client level to make it engine-agnostic,
> it should help with most of the cases. I believe there will still be
> Spark-specific pieces in the Source abstraction, since those use Spark
> datasources underneath in some cases. My opinion is that we can first
> focus our efforts on making hudi-client engine-agnostic and pluggable with
> different engines. We can tackle deltastreamer down the line once we have
> that.

Re: [DISCUSS] Restructure hudi-utilities module

Posted by Vinoth Chandar <vi...@apache.org>.
>> make delta streamer an engine-agnostic component so that Spark and
>> Flink can share some common logic.

If we make the change at the Write Client level to make it engine-agnostic,
it should help with most of the cases. I believe there will still be
Spark-specific pieces in the Source abstraction, since those use Spark
datasources underneath in some cases. My opinion is that we can first
focus our efforts on making hudi-client engine-agnostic and pluggable with
different engines. We can tackle deltastreamer down the line once we have
that.
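
The idea of a write client that is engine-agnostic and pluggable with
different engines could be sketched roughly as below. This is only an
illustration under assumptions: EngineContext, LocalSparkLikeEngine,
LocalFlinkLikeEngine and WriteClientSketch are hypothetical names invented
for this sketch, not Hudi classes, and the "engines" are plain in-memory
stand-ins rather than Spark or Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// All names below are hypothetical; none of them are Hudi APIs.

/** The only view of the compute engine that the write path is allowed to see. */
interface EngineContext {
    String name();
    <I, O> List<O> map(List<I> input, Function<I, O> fn);
}

/** In-memory stand-in for a Spark-backed engine. */
final class LocalSparkLikeEngine implements EngineContext {
    public String name() { return "spark-like"; }
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        List<O> out = new ArrayList<>();
        for (I item : input) {
            out.add(fn.apply(item));
        }
        return out;
    }
}

/** In-memory stand-in for a Flink-backed engine. */
final class LocalFlinkLikeEngine implements EngineContext {
    public String name() { return "flink-like"; }
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }
}

/** Write logic that depends on EngineContext only, never on a concrete engine. */
final class WriteClientSketch {
    private final EngineContext engine;

    WriteClientSketch(EngineContext engine) {
        this.engine = engine;
    }

    /** Tags each record key with the engine that processed it. */
    List<String> upsert(List<String> recordKeys) {
        return engine.map(recordKeys, key -> key + "@" + engine.name());
    }
}

final class EngineAgnosticDemo {
    public static void main(String[] args) {
        List<String> keys = Arrays.asList("k1", "k2");
        // The same client code runs unchanged against either engine.
        System.out.println(new WriteClientSketch(new LocalSparkLikeEngine()).upsert(keys));
        System.out.println(new WriteClientSketch(new LocalFlinkLikeEngine()).upsert(keys));
    }
}
```

In a layout like this, engine-specific code (such as a Source built on Spark
datasources) would stay behind its own implementation, while the shared
write logic compiles against the interface only.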

On Wed, Mar 4, 2020 at 6:51 PM vino yang <ya...@gmail.com> wrote:

> Hi guys,
>
> My original thought was to make delta streamer an engine-agnostic component
> so that Spark and Flink can share some common logic.
>
> >> I am not sure the ROI is there for renaming to hudi-deltastreamer and
> >> pulling this out. Every time we change a module name
>
> Actually, my suggestion here is to move the delta streamer into a new
> module and keep the current hudi-utilities module, although, in a way,
> moving classes is similar to renaming the module.
>
> >> I propose we leave this module Spark-specific, i.e. depending on
> >> hudi-spark alone
>
> OK, I will think about building a delta streaming mode on Flink and leave
> the current implementation of delta streamer alone.
>
> Best,
> Vino

Re: [DISCUSS] Restructure hudi-utilities module

Posted by vino yang <ya...@gmail.com>.
Hi guys,

My original thought was to make delta streamer an engine-agnostic component
so that Spark and Flink can share some common logic.

>> I am not sure the ROI is there for renaming to hudi-deltastreamer and
>> pulling this out. Every time we change a module name

Actually, my suggestion here is to move the delta streamer into a new
module and keep the current hudi-utilities module, although, in a way,
moving classes is similar to renaming the module.

>> I propose we leave this module Spark-specific, i.e. depending on
>> hudi-spark alone

OK, I will think about building a delta streaming mode on Flink and leave
the current implementation of delta streamer alone.

Best,
Vino

On Thu, Mar 5, 2020 at 12:47 AM Vinoth Chandar <vi...@apache.org> wrote:

> I am not sure the ROI is there for renaming to hudi-deltastreamer and
> pulling this out. Every time we change a module name, it's a breaking
> change, and I would prefer that we reserve those for really pressing
> issues, or let the natural course of development get us there.
>
> Regarding how multi-framework support would affect this module, I propose
> we leave this module Spark-specific, i.e. depending on hudi-spark alone,
> until we can make Flink work end to end. This feels kind of premature to
> me.

Re: [DISCUSS] Restructure hudi-utilities module

Posted by Vinoth Chandar <vi...@apache.org>.
I am not sure the ROI is there for renaming to hudi-deltastreamer and
pulling this out. Every time we change a module name, it's a breaking
change, and I would prefer that we reserve those for really pressing
issues, or let the natural course of development get us there.

Regarding how multi-framework support would affect this module, I propose
we leave this module Spark-specific, i.e. depending on hudi-spark alone,
until we can make Flink work end to end. This feels kind of premature to
me.

On Wed, Mar 4, 2020 at 8:37 AM Gary Li <ya...@gmail.com> wrote:

> +1. hudi-delta gives me the feeling that it has something to do with other
> frameworks... I’d vote for another name: hudi-deltastreamer, hudi-streamer
> or hudi-stream.

Re: [DISCUSS] Restructure hudi-utilities module

Posted by Gary Li <ya...@gmail.com>.
+1. hudi-delta gives me the feeling that it has something to do with other
frameworks... I’d vote for another name: hudi-deltastreamer, hudi-streamer
or hudi-stream.
