Posted to user@beam.apache.org by Matt Casters <ma...@gmail.com> on 2019/02/24 21:13:07 UTC

Kettle Beam 0.5.0

Folks, it's not my habit, but playing around with running Kettle
transformations on Flink w/ Beam was so cool I had to blog about it.

http://sandbox.kettle.be/wordpress/index.php/2019/02/24/kettle-beam-update-0-5-0/

Allow me to again extend my thanks to all the developers involved.  Some
really cool things are happening right now.
Version 0.5.0 of Kettle Beam now supports all Kettle steps, including
third-party connectors like Salesforce, SAP, Neo4j and so on.  Obviously
they don't always make sense in a big data context, but side-loading the
data for in-memory lookup and similar patterns can make a lot of sense in
many scenarios.
For the batched output I also managed to get performance on par with
expectations, specifically for Neo4j since I work for the company after
all.  I really appreciate all the help I've gotten so far in getting to
this point.  In record time we've gone from conceptual work to something
we can consider stable. Apache Beam has really made a huge difference.

Cheers,

Matt
---
Matt Casters <mc...@gmail.com>
Senior Solution Architect, Kettle Project Founder

Re: Kettle Beam 0.5.0

Posted by Matt Casters <ma...@gmail.com>.
Hi Kenn,

It's fundamentally a question I've asked myself a few times when I see
questions on this very mailing list.  Automatic column detection, weird
data sources... all these things were solved in Kettle a long time ago.

The core Kettle API for a transformation step (as it is called) follows
logic similar to an Apache Beam Transform, in the sense that a step reads
rows of data and writes them. Side-loading is supported, along with a
bunch of other options like directing rows to specific target steps
(switch/case) or reading from specific source steps (a merge join
specifying left/right).
These similarities have made it "fairly easy" to wrap steps in a
Transform/DoFn and ultimately convert Kettle transformations into Beam
pipelines.
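
To make the wrapping idea concrete, here is a minimal sketch. It is not
the actual Kettle Beam code: KettleStepFn, KettleRow and StepLogic are
hypothetical stand-ins for the real Kettle classes.

import org.apache.beam.sdk.transforms.DoFn;

// Conceptual sketch: wrap one Kettle step's row logic in a Beam DoFn.
// KettleRow and StepLogic are hypothetical, not real Kettle types.
public class KettleStepFn extends DoFn<KettleRow, KettleRow> {
  private transient StepLogic step;

  @Setup
  public void setup() {
    // Initialize the wrapped step once per DoFn instance.
    step = new StepLogic();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Feed one input row through the step; emit every row it produces.
    for (KettleRow out : step.processRow(c.element())) {
      c.output(out);
    }
  }
}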

I think we can make this easier in the future by making some changes to
the core API of Kettle itself. The API has been working fine for over 15
years, but there are things we've learned along the way and there are
more options available now.
Before we do something like that, however, we (the core Kettle community)
are contemplating making Kettle itself an Apache incubator project.
Kettle is pretty widely used in large organisations across the globe, and
the Apache cooperation model is something we think would work better than
what is currently in place, for all sorts of reasons I won't go into as
I'm trying to phrase this as diplomatically as possible.  If anyone has
suggestions on this subject, please reach out to me.

But to the core of your question: I do see a lot of value in the reverse
approach, a generic IO wrapper around a bunch of Kettle input and output
step plugins. Instead of converting Kettle metadata into the Beam API you
would convert Beam properties to Kettle metadata in some smart way,
probably simply by sub-classing some Kettle metadata beans to implement
Input or Output interfaces.
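
Conceptually, that sub-classing could look something like the sketch
below. Every name here (BeamInput, KettleInputMeta, BeamTextInputMeta) is
hypothetical, purely to illustrate the direction.

import java.util.Map;

// Hypothetical interface the Beam side would configure against.
interface BeamInput {
  void configureFrom(Map<String, String> beamProperties);
}

// Hypothetical Kettle metadata bean for some input step.
class KettleInputMeta {
  private String fileName;

  public void setFileName(String fileName) { this.fileName = fileName; }
}

// Sub-class the metadata bean so Beam properties map onto Kettle metadata.
class BeamTextInputMeta extends KettleInputMeta implements BeamInput {
  @Override
  public void configureFrom(Map<String, String> beamProperties) {
    // Translate a Beam-side property into the Kettle metadata it drives.
    setFileName(beamProperties.get("filePath"));
  }
}
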
What would be an issue is that any data integration tool running off of
metadata (any ETL tool really) requires input and output formats to be
predictable. This means there needs to be a contract, in some shape or
form, as to what goes in and out of steps.  Because of this, the current
pipelines we build pass around data in the form of a KettleRow
(PCollection<KettleRow>). A KettleRow is just an Object[] wrapper, and
you get a description of what's in there.  If folks can live with that,
they can easily convert this data to other formats.
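
For example, converting such a collection to another format is just a
map. A rough sketch, where kettleRows is the PCollection<KettleRow> from
above and getObjects() is an assumed accessor for the underlying Object[]:

import java.util.Arrays;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Sketch: flatten each KettleRow's Object[] into a comma-separated line.
PCollection<String> lines = kettleRows.apply(
    MapElements.into(TypeDescriptors.strings())
        .via(row -> String.join(",",
            Arrays.stream(row.getObjects())
                .map(String::valueOf)
                .toArray(String[]::new))));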

All the best,

Matt

On Mon, Feb 25, 2019 at 00:25 Kenneth Knowles <ke...@apache.org> wrote:

> Nice work! I'm impressed at how quickly this has come together.
>
> Did you build a generic adapter for using Kettle connectors in Beam? (I
> don't know what a Kettle connector API looks like)
>
> It would be cool to make these connectors more broadly available to Beam
> users, though maybe not optimal for parallel big data reads.
>
> Kenn
>
> On Sun, Feb 24, 2019 at 1:13 PM Matt Casters <ma...@gmail.com>
> wrote:
>
>>
>> Folks, it's not my habit, but playing around with running Kettle
>> transformations on Flink w/ Beam was so cool I had to blog about it.
>>
>>
>> http://sandbox.kettle.be/wordpress/index.php/2019/02/24/kettle-beam-update-0-5-0/
>>
>> Allow me to again extend my thanks to all the developers involved.  Some
>> really cool things are happening right now.
>> Version 0.5.0 of Kettle Beam now supports all Kettle steps, including
>> third-party connectors like Salesforce, SAP, Neo4j and so on.  Obviously
>> they don't always make sense in a big data context, but side-loading the
>> data for in-memory lookup and similar patterns can make a lot of sense
>> in many scenarios.
>> For the batched output I also managed to get performance on par with
>> expectations, specifically for Neo4j since I work for the company after
>> all.  I really appreciate all the help I've gotten so far in getting to
>> this point.  In record time we've gone from conceptual work to something
>> we can consider stable. Apache Beam has really made a huge difference.
>>
>> Cheers,
>>
>> Matt
>> ---
>> Matt Casters <mc...@gmail.com>
>> Senior Solution Architect, Kettle Project Founder
>>
>>
>>

Re: Kettle Beam 0.5.0

Posted by Kenneth Knowles <ke...@apache.org>.
Nice work! I'm impressed at how quickly this has come together.

Did you build a generic adapter for using Kettle connectors in Beam? (I
don't know what a Kettle connector API looks like)

It would be cool to make these connectors more broadly available to Beam
users, though maybe not optimal for parallel big data reads.

Kenn

On Sun, Feb 24, 2019 at 1:13 PM Matt Casters <ma...@gmail.com> wrote:

>
> Folks, it's not my habit, but playing around with running Kettle
> transformations on Flink w/ Beam was so cool I had to blog about it.
>
>
> http://sandbox.kettle.be/wordpress/index.php/2019/02/24/kettle-beam-update-0-5-0/
>
> Allow me to again extend my thanks to all the developers involved.  Some
> really cool things are happening right now.
> Version 0.5.0 of Kettle Beam now supports all Kettle steps, including
> third-party connectors like Salesforce, SAP, Neo4j and so on.  Obviously
> they don't always make sense in a big data context, but side-loading the
> data for in-memory lookup and similar patterns can make a lot of sense
> in many scenarios.
> For the batched output I also managed to get performance on par with
> expectations, specifically for Neo4j since I work for the company after
> all.  I really appreciate all the help I've gotten so far in getting to
> this point.  In record time we've gone from conceptual work to something
> we can consider stable. Apache Beam has really made a huge difference.
>
> Cheers,
>
> Matt
> ---
> Matt Casters <mc...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
>
