Posted to users@hop.apache.org by "Koffman, Noa (Nokia - IL/Kfar Sava)" <no...@nokia.com> on 2022/01/17 08:51:41 UTC

Some HOP pipelines questions

Hi all,
We are trying to move our pipelines off plain Beam and recreate them in Hop.
We have a few simple working Beam pipelines, running on Flink, with the following basic flow:

  1.  Read Avro messages from a Kafka topic into Java objects, using KafkaIO.read and serializers/deserializers
  2.  Extract some data and create a different object
  3.  Write the object to a Kafka topic in Avro format
We are using Apicurio as a schema registry.
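For reference, this is roughly what the current flow looks like in plain Beam Java. The Apicurio serde class names and registry property key are written from memory, and the schemas, topics and field names are just placeholders, so please read it as a sketch rather than exact code:

import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

// Apicurio registry serde classes (2.x serde module); the package name was
// different in 1.x, so double-check against the version in use.
import io.apicurio.registry.serde.avro.AvroKafkaDeserializer;
import io.apicurio.registry.serde.avro.AvroKafkaSerializer;

public class AvroKafkaFlow {

  // Toy schemas, only for illustration; ours are resolved through the registry.
  private static final Schema IN_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"In\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"status\",\"type\":\"string\"},"
      + "{\"name\":\"payload\",\"type\":\"string\"}]}");
  private static final Schema OUT_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Out\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"status\",\"type\":\"string\"}]}");

  @SuppressWarnings({"unchecked", "rawtypes"})
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Registry location for the Apicurio serde; the property name is taken
    // from the serde documentation, so verify it for your Apicurio version.
    Map<String, Object> registryConfig =
        Map.of("apicurio.registry.url", "http://my-registry:8080/apis/registry/v2");

    p.apply("Read Avro from Kafka", KafkaIO.<String, GenericRecord>read()
            .withBootstrapServers("broker:9092")
            .withTopic("in-topic")
            .withKeyDeserializer(StringDeserializer.class)
            // Beam cannot infer a coder for GenericRecord, so set one explicitly.
            .withValueDeserializerAndCoder(
                (Class) AvroKafkaDeserializer.class, AvroCoder.of(IN_SCHEMA))
            .withConsumerConfigUpdates(registryConfig)
            .withoutMetadata())

        .apply("Extract fields", ParDo.of(
            new DoFn<KV<String, GenericRecord>, KV<String, GenericRecord>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                GenericRecord in = c.element().getValue();
                GenericRecord out = new GenericData.Record(OUT_SCHEMA);
                out.put("id", in.get("id"));          // field names are made up
                out.put("status", in.get("status"));
                c.output(KV.of(c.element().getKey(), out));
              }
            }))
        .setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(OUT_SCHEMA)))

        .apply("Write Avro to Kafka", KafkaIO.<String, GenericRecord>write()
            .withBootstrapServers("broker:9092")
            .withTopic("out-topic")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer((Class) AvroKafkaSerializer.class)
            .withProducerConfigUpdates(registryConfig));

    p.run().waitUntilFinish();
  }
}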

So far, we have been able to build a simple pipeline in Hop that writes to and reads from Kafka using the “Beam Kafka Produce” and “Beam Kafka Consume” transforms.
However, we still need some help to figure out:

  1.  How to read and write Avro messages while keeping the same logic
  2.  Is it possible to use Apicurio as a schema registry, or do we need to develop a new plugin for that? Is a different schema registry supported (e.g. the Confluent Schema Registry)?
  3.  After reading from Kafka, which transform can be used to process the message and extract some of the data before writing it back to Kafka?

Another thing we were wondering about is dependency management:
To use the ‘Beam Kafka Consume’ transform I had to change multiple pom.xml files and exclude an older org.apache.avro dependency (1.7.7) that is pulled in by a Hadoop dependency.
Before excluding those jars, the pipeline failed while trying to load the class org.apache.avro.Conversions.
Is this a bug, or am I building the project wrong?
Also, is there a recommended procedure to package and run the pipeline while keeping only the dependencies we need?
For example, if we create a jar to run on a Flink cluster, is there a way to package only the dependencies our pipeline requires (and not include Spark, Hadoop, etc.)?

Sorry for the long email
Thanks for your help
Noa


Re: Some HOP pipelines questions

Posted by Matt Casters <ma...@neo4j.com>.
As a follow-up I'm happy to report that support for Avro messages,
including encoding them and handling them with Kafka (and the Confluent
Schema Registry), has been implemented for both the regular Hop pipeline
engine and Beam.

https://hop.apache.org/manual/next/pipeline/transforms/beamkafkaconsume.html
https://hop.apache.org/manual/next/pipeline/transforms/beamkafkaproduce.html

Feedback is more than welcome.  If you have any issues, please let us know.

Enjoy,
Matt


On Tue, Jan 18, 2022 at 1:43 PM Matt Casters <ma...@neo4j.com> wrote:

> Hi Noa,
>
> Thanks a lot for trying out Hop and for the feedback.
>
> Your questions really point to some missing links in our support for Avro:
> - The ability to retrieve Avro fields from a "Beam Kafka consume"
> transform [1]
> - The ability to create Avro fields, so an "Avro Encode" transform [2]
>
> Both things have been on the horizon, we just hadn't gotten round to it
> yet.
> As for the library updates: those can be quite the puzzle.  The upcoming
> version 1.1.0 of Hop (we're voting on it) will contain version 1.8.2 of
> Avro.  Like you said, it's a dependency which comes from other places but
> it is used in multiple locations in the Hop build.  So whenever we're
> updating the library it's best to bring all the plugins onto a new API.
> I'd be happy to look into Avro 1.11.  As long as the API is compatible it
> shouldn't be a problem.  Usually it isn't.  [3]
> The library splits for Spark, Flink and Dataflow is something we've talked
> about among the developers.  I think it mostly requires some small build
> changes and some metadata for the fat jar builder.
> In general we've been thinking about implementing software definitions and
> policies which would be a more structural way of dealing with included
> plugins and features.  [4]
>
> Finally, I'm sure some apicur.io transforms would be great to have.  We
> can add some metadata to make it easy to use.  I personally never used it
> so I can't comment on that beyond the observation that APL libraries are
> available to make things easy.  Perhaps you could detail a bit more about
> what's needed precisely?
>
> The same goes for the new JIRA cases: don't hesitate to comment with
> additional requirements.  This will prevent us from getting on the wrong
> track along the way.
>
> All the best,
> Matt
>
> [1] https://issues.apache.org/jira/browse/HOP-3687
> [2] https://issues.apache.org/jira/browse/HOP-3686
> [3] https://issues.apache.org/jira/browse/HOP-3688
> [4] https://issues.apache.org/jira/browse/HOP-3689
>
>
> On Mon, Jan 17, 2022 at 9:51 AM Koffman, Noa (Nokia - IL/Kfar Sava) <
> noa.koffman@nokia.com> wrote:
>
>> Hi all,
>>
>> We are trying to move our pipeline from beam and create them in HOP,
>>
>> We have a few simple working beam pipelines which are running on flink
>> and have the following basic flow:
>>
>>    1. Read Avro messages from kafka topic into Java objects using
>>    KafkaIO.read and serializers/deserializers
>>    2. Extract some data and create different object
>>    3. Write object to kafka topic in Avro format
>>
>> We are using ‘ApiCurio’ as a schema registry.
>>
>>
>>
>> So far, we were able to do a simple pipeline in HOP which writes and
>> reads from kafka using: “Beam Kafka Produce” and “Beam Kafka Consume”
>>
>> However, we need some help to figure out:
>>
>>    1. How to integrate the same logic and reading/writing AVRO messages
>>    2. Is it possible to use ‘ApiCurio’ as a Schema registry, do we need
>>    to develop a new plugin for that? Is a different schema registry supported
>>    (like confluent Schema registry)?
>>    3. After reading from Kafka transform could be used to process the
>>    message and extract some of the data before writing it back to kafka?
>>
>>
>>
>> Another thing we were wondering about is dependency management:
>>
>> To use the ‘Beam Kafka Consume’ I had to change multiple ‘pom.xml’ files
>> and exclude an ‘org.apache.avro’ dependency of an older version (1.7.7) –
>> this is brought in form Hadoop dependency.
>>
>> Before excluding that jars the pipeline failed trying to load the class
>> ‘org.apache.avro.Conversions’
>>
>> Is this a bug? Or am I building the project wrong?
>>
>> Also,  is there a recommended procedure to pack and run the pipeline,
>> only keeping the dependencies we need?
>>
>> For example, if we create a jar, to run on a ‘Flink’ cluster, is there a
>> way to only package the required dependencies from our pipeline (not
>> include, spark, Hadoop, etc..)
>>
>>
>>
>> Sorry for the long email
>>
>> Thanks for your help
>>
>> Noa
>>
>>
>>
>
>
> --
> Neo4j Chief Solutions Architect
> *✉   *matt.casters@neo4j.com
>
>
>
>

-- 
Neo4j Chief Solutions Architect
✉  matt.casters@neo4j.com


Re: Some HOP pipelines questions

Posted by Matt Casters <ma...@neo4j.com>.
Hi Noa,

Thanks a lot for trying out Hop and for the feedback.

Your questions really point to some missing links in our support for Avro:
- The ability to retrieve Avro fields from a "Beam Kafka consume" transform [1]
- The ability to create Avro fields, so an "Avro Encode" transform [2]
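
To make those two items a bit more concrete, this is essentially what the
missing transforms have to do with the message payload, using nothing but the
plain Avro API (the schema and field names below are invented for the example):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public final class AvroEncodeDecodeSketch {

  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"status\",\"type\":\"string\"}]}");

  // "Avro Encode": turn a record into the bytes we would put on the wire.
  static byte[] encode(GenericRecord record) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(record.getSchema());
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(record, encoder);
    encoder.flush();
    return out.toByteArray();
  }

  // Decode side: turn message bytes back into a record we can take fields from.
  static GenericRecord decode(byte[] payload, Schema schema) throws Exception {
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
    return reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));
  }

  public static void main(String[] args) throws Exception {
    GenericRecord msg = new GenericData.Record(SCHEMA);
    msg.put("id", "42");
    msg.put("status", "OK");

    byte[] bytes = encode(msg);                 // what the produce side would write
    GenericRecord back = decode(bytes, SCHEMA); // what the consume side would do

    // From here the individual fields can become regular Hop field values.
    System.out.println(back.get("id") + " -> " + back.get("status"));
  }
}

The registry serdes additionally prefix the payload with a schema identifier,
but that part is handled by the serde libraries themselves.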

Both things have been on the horizon; we just hadn't gotten round to them
yet.
As for the library updates: those can be quite the puzzle.  The upcoming
version 1.1.0 of Hop (we're voting on it) will contain version 1.8.2 of
Avro.  Like you said, it's a dependency which comes from other places, but
it is used in multiple locations in the Hop build.  So whenever we update
the library it's best to bring all the plugins onto the new API.
I'd be happy to look into Avro 1.11.  As long as the API is compatible it
shouldn't be a problem.  Usually it isn't.  [3]
The library splits for Spark, Flink and Dataflow are something we've talked
about among the developers.  I think it mostly requires some small build
changes and some metadata for the fat jar builder.
In general we've been thinking about implementing software definitions and
policies, which would be a more structural way of dealing with included
plugins and features.  [4]

Finally, I'm sure some apicur.io transforms would be great to have.  We can
add some metadata to make them easy to use.  I have personally never used it,
so I can't comment beyond the observation that Apache-licensed (APL) client
libraries are available to make things easy.  Perhaps you could detail a bit
more precisely what's needed?

The same goes for the new JIRA cases: don't hesitate to comment with
additional requirements.  This will prevent us from getting on the wrong
track along the way.

All the best,
Matt

[1] https://issues.apache.org/jira/browse/HOP-3687
[2] https://issues.apache.org/jira/browse/HOP-3686
[3] https://issues.apache.org/jira/browse/HOP-3688
[4] https://issues.apache.org/jira/browse/HOP-3689


On Mon, Jan 17, 2022 at 9:51 AM Koffman, Noa (Nokia - IL/Kfar Sava) <
noa.koffman@nokia.com> wrote:

> Hi all,
>
> We are trying to move our pipeline from beam and create them in HOP,
>
> We have a few simple working beam pipelines which are running on flink and
> have the following basic flow:
>
>    1. Read Avro messages from kafka topic into Java objects using
>    KafkaIO.read and serializers/deserializers
>    2. Extract some data and create different object
>    3. Write object to kafka topic in Avro format
>
> We are using ‘ApiCurio’ as a schema registry.
>
>
>
> So far, we were able to do a simple pipeline in HOP which writes and reads
> from kafka using: “Beam Kafka Produce” and “Beam Kafka Consume”
>
> However, we need some help to figure out:
>
>    1. How to integrate the same logic and reading/writing AVRO messages
>    2. Is it possible to use ‘ApiCurio’ as a Schema registry, do we need
>    to develop a new plugin for that? Is a different schema registry supported
>    (like confluent Schema registry)?
>    3. After reading from Kafka transform could be used to process the
>    message and extract some of the data before writing it back to kafka?
>
>
>
> Another thing we were wondering about is dependency management:
>
> To use the ‘Beam Kafka Consume’ I had to change multiple ‘pom.xml’ files
> and exclude an ‘org.apache.avro’ dependency of an older version (1.7.7) –
> this is brought in form Hadoop dependency.
>
> Before excluding that jars the pipeline failed trying to load the class
> ‘org.apache.avro.Conversions’
>
> Is this a bug? Or am I building the project wrong?
>
> Also,  is there a recommended procedure to pack and run the pipeline, only
> keeping the dependencies we need?
>
> For example, if we create a jar, to run on a ‘Flink’ cluster, is there a
> way to only package the required dependencies from our pipeline (not
> include, spark, Hadoop, etc..)
>
>
>
> Sorry for the long email
>
> Thanks for your help
>
> Noa
>
>
>


-- 
Neo4j Chief Solutions Architect
✉  matt.casters@neo4j.com
