Posted to dev@hudi.apache.org by vino yang <ya...@gmail.com> on 2020/02/07 04:24:54 UTC

Refactor and enhance Hudi Transformer

Currently, Hudi has a component that has not been widely used: Transformer.
As we all know, before raw data lands in the data lake, a very common
operation is data preprocessing and ETL. This is also one of the most common
use scenarios for computing engines such as Flink and Spark. Since Hudi
already builds on the power of a computing engine, it can naturally take
advantage of that engine's data-preprocessing ability as well. We can
refactor the Transformer to make it more flexible. To summarize, we can
refactor along the following aspects:

   - Decouple Transformer from Spark
   - Enrich the Transformer and provide built-in transformers
   - Support Transformer chains

For the first point, the Transformer interface is tightly coupled with
Spark in design, and it contains a Spark-specific context. This makes it
impossible for us to take advantage of the transform capabilities provided
by other engines (such as Flink) once we support multiple engines.
Therefore, we need to decouple its design from Spark.
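
As a purely illustrative sketch, a decoupled interface might look something
like the following; `EngineContext` and `DataCollection` are hypothetical
abstractions for this discussion, not existing Hudi APIs:

```java
// Hypothetical engine-agnostic sketch: EngineContext and DataCollection are
// illustrative abstractions, not existing Hudi APIs. DataCollection<T> would
// wrap an engine-specific dataset, e.g. a Spark Dataset<Row> or a Flink
// DataStream<T>, so that implementations never touch engine classes directly.
public interface Transformer<T> {

  // Applies this transformation to the incoming records and returns the
  // transformed records; all engine specifics stay behind the context.
  DataCollection<T> apply(EngineContext context,
                          DataCollection<T> input,
                          TypedProperties properties);
}
```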

For the second point, we can enhance the Transformer and provide some
out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer,
and so on.
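
For instance, a FilterTransformer on top of today's Spark-based interface
could be as small as the sketch below. The property key is made up for
illustration, and the Hudi import paths are from memory, so they may differ
by version:

```java
import org.apache.hudi.common.util.TypedProperties;      // package path may differ by Hudi version
import org.apache.hudi.utilities.transform.Transformer;  // the current Spark-based interface

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of a built-in transformer that keeps only the rows matching a
// user-supplied SQL predicate, e.g. "event_type = 'click' AND amount > 0".
public class FilterTransformer implements Transformer {

  // Illustrative property key, not an existing Hudi config.
  private static final String FILTER_CONDITION = "hoodie.transformer.filter.condition";

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    return rowDataset.filter(properties.getString(FILTER_CONDITION));
  }
}
```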

For the third point, the most common pattern for data processing is the
pipeline model, and the common implementation of the pipeline model is the
chain-of-responsibility pattern, comparable to Apache Commons Chain[1].
Combining multiple Transformers can make data processing more flexible and
extensible.
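
A minimal chain could look like the following sketch; `ChainedTransformer` is
a hypothetical name, again assuming the current Spark-based signature:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hudi.common.util.TypedProperties;      // package path may differ by Hudi version
import org.apache.hudi.utilities.transform.Transformer;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical chain-of-responsibility wrapper: applies a list of
// transformers in order, feeding each one's output into the next.
public class ChainedTransformer implements Transformer {

  private final List<Transformer> transformers;

  public ChainedTransformer(Transformer... transformers) {
    this.transformers = Arrays.asList(transformers);
  }

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    Dataset<Row> current = rowDataset;
    for (Transformer t : transformers) {
      current = t.apply(jsc, sparkSession, current, properties);
    }
    return current;
  }
}
```

Usage could then be, e.g., `new ChainedTransformer(new FilterTransformer(),
new FlatMapTransformer())`.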

If we enhance the capabilities of the Transformer component, Hudi will
provide richer data-processing capabilities on top of the computing engine.

What do you think?

Any opinions and feedback are welcome and appreciated.

Best,
Vino

[1]: https://commons.apache.org/proper/commons-chain/

Re: Refactor and enhance Hudi Transformer

Posted by Shiyan Xu <xu...@gmail.com>.
Thanks. After reading the discussion in HUDI-561, I just realized that the
previously mentioned built-in partition transformer is better suited to a
custom key generator. Hopefully other suitable ideas for built-in
transformers will come up later.

Re: Refactor and enhance Hudi Transformer

Posted by vino yang <ya...@gmail.com>.
Hi Shiyan,

Really sorry, I forgot to attach the reference. The relevant Jira ID is
HUDI-561: https://issues.apache.org/jira/browse/HUDI-561

It seems both of you faced the same issue, though the solutions are not the
same. Feel free to move the discussion to that issue.

Best,
Vino


Re: Refactor and enhance Hudi Transformer

Posted by Shiyan Xu <xu...@gmail.com>.
Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella
task, given its big scope? (BTW, it is labeled as a "bug", which should be
fixed too.) I can create another specific task under it for the idea of a
datetime -> partition path transformer, if that makes sense.

Re: Refactor and enhance Hudi Transformer

Posted by vino yang <ya...@gmail.com>.
Hi Shiyan,

Thanks for raising this thread again and sharing your thoughts. They are
valuable.

Regarding the date-time specific transform, there is an issue[1] that
describes this business requirement.

Best,
Vino

Re: Refactor and enhance Hudi Transformer

Posted by Shiyan Xu <xu...@gmail.com>.
Late to the party. :P

I really favor the idea of enriching the built-in support. It is a very
common case to use datetime fields for the partition path. We could have
built-in support to normalize ISO format / unix timestamps. For example,
an `HourlyPartitionTransformer` would normalize whatever field the user
specifies as the partition path. Let's say the user sets `create_ts` as the
partition path field; the transformer will apply the change create_ts =>
_hoodie_partition_path


   - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
   - 1582497702.123456789 => 2020/02/23/22

Does that make sense? If so, I may file a jira for this.
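
To make the idea concrete, here is a rough sketch of how such a transformer
might normalize both forms into an hourly partition path, assuming the
current Spark-based Transformer signature; the class name, property key, and
output column are illustrative only:

```java
import org.apache.hudi.common.util.TypedProperties;      // Hudi package paths may differ by version
import org.apache.hudi.utilities.transform.Transformer;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.*;

// Sketch only: derives an hourly partition path (yyyy/MM/dd/HH) from a
// user-specified field holding either an ISO-8601 string or a unix epoch.
public class HourlyPartitionTransformer implements Transformer {

  // Illustrative property key, not an existing Hudi config.
  private static final String PARTITION_FIELD = "hoodie.transformer.partition.field";

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    String field = properties.getString(PARTITION_FIELD); // e.g. "create_ts"
    // Numeric-looking values are treated as epoch seconds, everything else
    // as an ISO-8601 string; both branches become a Spark timestamp.
    Column ts = when(col(field).cast("string").rlike("^[0-9]+(\\.[0-9]+)?$"),
                     col(field).cast("double").cast("timestamp"))
        .otherwise(to_timestamp(col(field)));
    return rowDataset.withColumn("partition_path", date_format(ts, "yyyy/MM/dd/HH"));
  }
}
```

Note the sketch writes a plain derived column; in practice the partition
value flows through the configured key generator, which, as noted elsewhere
in this thread, may be the better home for this logic anyway.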

As for FilterTransformer or FlatMapTransformer, which are designed for
generic purposes, they seem to belong to Spark or Flink's realm.
You can do these two transformations with the Spark Dataset API now. Or, once
decoupled from Spark, you'd probably have an abstract Dataset class
to perform engine-agnostic transformations.

My understanding is that the transformer in Hudi is more specifically
purposed, with the underlying transformation handled by the actual
processing engine (Spark or Flink).



Re: Refactor and enhance Hudi Transformer

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks Hamid and Vinoyang for the great discussion

Re: Refactor and enhance Hudi Transformer

Posted by vino yang <ya...@gmail.com>.
I have filed a Jira issue[1] to track this work.

[1]: https://issues.apache.org/jira/browse/HUDI-613

Re: Refactor and enhance Hudi Transformer

Posted by vino yang <ya...@gmail.com>.
Hi hamid,

I agree with your opinion.

Let's move forward step by step.

I will file an issue to track the Transformer refactoring.

Best,
Vino

Re: Refactor and enhance Hudi Transformer

Posted by hamid pirahesh <hp...@gmail.com>.
I think it is a good idea to decouple the transformer from Spark so that
it can be used with other flow engines.
Once you do that, then it is worth considering a much bigger play rather
than another incremental play.
Given the scale of Hudi, we need to look at Airflow, particularly in the
context of what Google is doing with Composer, addressing autoscaling,
scheduling, monitoring, etc.
You need all of that to manage a serious ETL/ELT flow.

Re: Refactor and enhance Hudi Transformer

Posted by Vinoth Chandar <vi...@apache.org>.
Yes, familiar with those systems (I actually named it marmaray :))..

I am not opposed to building a set of built-in transformers for common
things, per se. It can actually help adoption of the delta streamer as well.

Re: Refactor and enhance Hudi Transformer

Posted by vino yang <ya...@gmail.com>.
Hi Vinoth,

Thanks for summarizing both of our opinions. Your summary is good.

>> How about we first focus our discussion on how we can expand ETL support,
rather than zooming on this interface (which is a lower value conversation
IMO).

Of course, yes.

Regarding your summary:

The initial goal of this discussion was to talk about improving the
Transformer component so that Hudi itself provides more powerful ETL
capabilities. If it involves third-party frameworks (such as a scheduling
engine), then the direction shifts to an "ecosystem" perspective, namely how
to integrate with a third-party framework to better enable Hudi to support
strong ETL capabilities. Of course, both directions are worth developing.

Recently, I saw a data ingestion framework named marmaray[1] that was
open-sourced by Uber (maybe you are familiar with this project). Hudi's
Transformer is similar to the converter component it provides, so I
launched this proposal to see whether we can enhance the Transformer so
that Hudi can have strong ETL characteristics while not relying on any
other services.

Of course, I absolutely agree with your second point. It is also necessary
for us to provide the convenience and flexibility for third parties to
conduct ETL.

Best,
Vino

[1]: https://github.com/uber/marmaray


Re: Refactor and enhance Hudi Transformer

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks for kicking this discussion off...

At a high level, improving the DeltaStreamer tool to better support ETL
pipelines is a great goal, and we can do a lot more here to help.

> Currently, Hudi has a component that has not been widely used:
Transformer.
Not true, actually. The recent DMS integration was based on it, and I know
of at least two other users. At the end of the day, it's a very simple
interface that takes a DataFrame in and hands a DataFrame out, and it need
not be any more complicated than that. On Flink, we can first get a
custom/handwritten pipeline working (a la hudi-spark) before we extend
DeltaStreamer to Flink. That is a much larger effort.
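
For reference, the current contract is roughly the following (a sketch
based on Hudi's Spark-coupled Transformer interface; the exact package of
TypedProperties varies across Hudi versions):

  import org.apache.hudi.common.util.TypedProperties; // package varies by version
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  // Takes a DataFrame in and hands a DataFrame out. The JavaSparkContext
  // and SparkSession arguments are exactly the Spark coupling that the
  // decoupling proposal would need to abstract away.
  public interface Transformer {
    Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
        Dataset<Row> rowDataset, TypedProperties properties);
  }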

How about we first focus our discussion on how we can expand ETL support,
rather than zooming in on this interface (which is a lower-value
conversation, IMO)?
Here are some thoughts:

- (vino yang's point) We could build a set of standard transformations into
Hudi itself. These could include common operations such as timestamp
extraction and field masking/filtering (see the first sketch after this
list).
- (hamid's point) Airflow is a workflow scheduler, and people can use it to
schedule DeltaStreamer/Spark jobs. IMO it is orthogonal/complementary to
the transforms we would support ourselves. But I think we could provide
some real value by implementing a way to trigger data pipelines in Airflow
when a Hudi dataset receives new commits; e.g., we could run an incremental
ETL every time a new commit lands on the source Hudi table (see the second
sketch below).
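
To make the first idea concrete, here is a minimal sketch of what a
built-in transformer could look like against the current interface;
FilterTransformer and the property key are hypothetical, not existing Hudi
classes or configs:

  import org.apache.hudi.common.util.TypedProperties; // package varies by version
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  // Hypothetical built-in transformer: keeps only the rows that match a
  // SQL predicate supplied via configuration.
  public class FilterTransformer implements Transformer {

    // Assumed property name, for illustration only.
    private static final String FILTER_CONDITION_PROP =
        "hoodie.deltastreamer.transformer.filter.condition";

    @Override
    public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
        Dataset<Row> rowDataset, TypedProperties properties) {
      // e.g. condition = "event_ts IS NOT NULL AND country = 'US'"
      String condition = properties.getString(FILTER_CONDITION_PROP);
      return rowDataset.filter(condition);
    }
  }

And for the second idea, a very rough sketch of the polling side of a
commit-based trigger, using Hudi's timeline API (HoodieTableMetaClient
construction differs across versions, and triggerAirflowDag /
lastTriggeredInstant are placeholders for whatever integration and
bookkeeping we would actually build):

  import org.apache.hudi.common.table.HoodieTableMetaClient;
  import org.apache.hudi.common.table.timeline.HoodieInstant;
  import org.apache.hudi.common.util.Option;

  // Inside some periodically scheduled watcher:
  HoodieTableMetaClient metaClient =
      new HoodieTableMetaClient(hadoopConf, basePath); // constructor varies by version
  Option<HoodieInstant> latest = metaClient.getActiveTimeline()
      .getCommitsTimeline().filterCompletedInstants().lastInstant();
  if (latest.isPresent()
      && latest.get().getTimestamp().compareTo(lastTriggeredInstant) > 0) {
    // A new commit landed since we last looked: trigger the downstream
    // incremental ETL (e.g., an Airflow DAG run) and record the instant
    // so the same commit is not processed twice.
    triggerAirflowDag(latest.get().getTimestamp());
  }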

Thanks
Vinoth

Re: Refactor and enhance Hudi Transformer

Posted by vino yang <ya...@gmail.com>.
Hi hamid,

AFAIK, the Transformer currently runs as a task (a Spark task) within
Hudi's data-reading path; it is not a standalone component outside of Hudi.
Can you describe in more detail how Apache Airflow would be used?
I would suggest we keep one premise in mind: our goal is to enhance the
data preprocessing capabilities of Hudi.

Best,
Vino



Re: Refactor and enhance Hudi Transformer

Posted by hamid pirahesh <hp...@gmail.com>.
What about using Apache Airflow to create a DAG of transformer operators?
