Posted to users@hop.apache.org by po...@gmx.com on 2022/05/30 15:42:18 UTC

How to send pipeline to Flink?


Hello community!





I'm learning Apache Flink. I launched a test installation on my PC (version
1.13.6), I have Hop running, and now I want to run my first pipeline on Flink.

Yet for some reason it does not work. What I did:



1. I exported a fat jar according to this tutorial:

https://hop.apache.org/manual/latest/pipeline/beam/running-the-beam-samples.html#_prerequisites



2. I created a new Pipeline Run Configuration with these details:



Name: Local Flink

Description:

Engine type: Beam Flink pipeline engine

The Flink master: [local]

Parallelism: 2

...

...

User agent: Hop

Temp file: file://C:/Temp

..

Fat jar file location: ${PROJECT_HOME}/fat_jar.jar



3. I opened the 'generate-synthetic-data' pipeline and executed it with the
Pipeline run configuration: Local Flink



Looks like it runs; I see it in the Flink dashboard.





I have a few questions:



- I guess there is no way to see the data somewhere in the Flink GUI?

- How can I modify this 'generate-synthetic-data' pipeline so that Flink saves
the sample data to a file or a database? In the last transform, 'Beam output', I see
the variable ${DATA_OUTPUT} - does that mean I can do some mapping in the 'Variables'
tab of the Pipeline Run configuration to point to a file? But what about a database?

- Hop exports the pipeline (I know it can be executed with 'run' as well) and it
runs permanently until it is stopped, right? Or is there some parameter to
execute it just once?



Thanks a lot for your help



Mike








Re: How to send pipeline to Flink?

Posted by Matt Casters <ma...@neo4j.com>.
Thanks Mike.
I don't know about legends and stuff but don’t forget what happened to the
man who suddenly got everything he wanted. Cheers,
Matt

On Tue, May 31, 2022 at 3:24 PM <po...@gmx.com> wrote:

> Wow! That's really you Matt!? You're a living legend to me :-)
>
> Kettle was (still is) amazing piece of software. I was sad that Hitachi
> Vantara is killing this product. But when I started looking on the
> Internet to see if there was any alternative and it turned out that there
> is Hop, I was delighted. This software, so Kettle, is amazing! I use it
> as a door to any system, to work on any data.
>
> Thanks a lot for help - I will test solution you propose and get back if
> any problem (hope not :-))
>
> Mike
>
> *Sent:* Monday, May 30, 2022 at 5:59 PM
> *From:* "Matt Casters" <ma...@neo4j.com>
> *To:* users@hop.apache.org
> *Subject:* Re: How to send pipeline to Flink?
> Hi Mike,
>
> the variable PROJECT_HOME is actually not set by flink so what I had to do
> was add it to conf/flink-conf.yml:
>
> env.java.opts: -DPROJECT_HOME=/path/to/project/home/folder
>
> My script to run right now looks something like this:
>
> bin/flink run \
>  --class org.apache.hop.beam.run.MainBeam \
>  /tmp/hop-fatjar.jar \
>  /path/to/project/home/folder/beam/pipelines/input-process-output.hpl \
>  /tmp/hop-metadata.json \
>  Flink
>
> As to your questions...
>
>
>> - I guess there is no way to see the data somewhere in Flink gui?
>
>
> No but I'm thinking of adding a Hop call-back service (on Hop Server
> probably) so that we can capture progress and allow Hop GUI to see that.
>
>
>> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
>> these sample data to file or database? In last transform 'Beam output' i
>> see variable ${DATA_OUTPUT} - does it mean I can do some mapping in
>> 'Variables' tab of Pipeline Run configuration to point to file? But what
>> about database?
>
>
> The Table Output transform will allow you to write to a relational
> database.  Performance-wise, please read up
> <https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html>
> on the basics specifically the section on "Row batching with non-Beam
> transforms".
>
>
>> - Hop exports pipeline (I know it can be executed as well with 'run') and
>> it runs permanently till it is stopped, right? Or there is some parameter
>> to execute it just once?
>
>
> Flink Run will execute only once as shown in the script above.
>
> HTH,
> Matt
>
> On Mon, May 30, 2022 at 5:42 PM <po...@gmx.com> wrote:
>
>>
>> Hello community!
>>
>>
>> I'm learning Apache Flink. I launched test installation on my PC (version
>> 1.13.6), I have HOP running and now I want to run first pipeline on Flink.
>> Yet for some reason it does not work. What I did:
>>
>> 1. I exported fat jar according to this tutorial:
>>
>> https://hop.apache.org/manual/latest/pipeline/beam/running-the-beam-samples.html#_prerequisites
>>
>> 2. I created new Pipeline Run Configuration with details:
>>
>> Name: Local Flink
>> Description:
>> Engine type: Beam Flink pipeline engine
>> The Flink master: [local]
>> Parallelism: 2
>> ...
>> ...
>> User agent: Hop
>> Temp file: file://C:/Temp
>> ..
>> Fat jar file location: ${PROJECT_HOME}/fat_jar.jar
>>
>> 3. I did open 'generate-synthetic-data' pipeline and executed it with
>> Pipeline run configuration: Local Flink
>>
>> Looks like it runs; I see it in the Flink dashboard.
>>
>>
>> I have two questions:
>>
>> - I guess there is no way to see the data somewhere in Flink gui?
>> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
>> these sample data to file or database? In last transform 'Beam output' i
>> see variable ${DATA_OUTPUT} - does it mean I can do some mapping in
>> 'Variables' tab of Pipeline Run configuration to point to file? But what
>> about database?
>> - Hop exports pipeline (I know it can be executed as well with 'run') and
>> it runs permanently till it is stopped, right? Or there is some parameter
>> to execute it just once?
>>
>> Thanks a lot for your help
>>
>> Mike


-- 
Neo4j Chief Solutions Architect
*✉   *matt.casters@neo4j.com


Re: How to send pipeline to Flink?

Posted by po...@gmx.com.
Wow! That's really you Matt!? You're a living legend to me :-)



Kettle was (and still is) an amazing piece of software. I was sad that Hitachi
Vantara was killing the product. But when I started looking on the Internet to
see if there was any alternative and it turned out that there is Hop, I was
delighted. This software, like Kettle, is amazing! I use it as a door to any
system, to work on any data.



Thanks a lot for the help - I will test the solution you propose and get back if
there's any problem (hope not :-))



Mike



**Sent:**  Monday, May 30, 2022 at 5:59 PM  
**From:**  "Matt Casters" <ma...@neo4j.com>  
**To:**  users@hop.apache.org  
**Subject:**  Re: How to send pipeline to Flink?

Hi Mike,



The variable PROJECT_HOME is actually not set by Flink, so what I had to do was
add it to conf/flink-conf.yml:



env.java.opts: -DPROJECT_HOME=/path/to/project/home/folder  


My script to run right now looks something like this:



bin/flink run \
 --class org.apache.hop.beam.run.MainBeam \
 /tmp/hop-fatjar.jar \
 /path/to/project/home/folder/beam/pipelines/input-process-output.hpl \
 /tmp/hop-metadata.json \
 Flink


As to your questions...



> - I guess there is no way to see the data somewhere in Flink gui?



No but I'm thinking of adding a Hop call-back service (on Hop Server probably)
so that we can capture progress and allow Hop GUI to see that.



> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
> these sample data to file or database? In last transform 'Beam output' I see
> variable ${DATA_OUTPUT} - does it mean I can do some mapping in 'Variables'
> tab of Pipeline Run configuration to point to file? But what about database?



The Table Output transform will allow you to write to a relational database.
Performance-wise, please [read
up](https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html)
on the basics, specifically the section on "Row batching with non-Beam
transforms".



> - Hop exports the pipeline (I know it can be executed with 'run' as well) and
> it runs permanently until it is stopped, right? Or is there some parameter to
> execute it just once?



Flink Run will execute only once as shown in the script above.



HTH,  
Matt
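The submit command above can be wrapped in a small script so the paths are not
hard-coded. This is only a sketch: the default values for FLINK_HOME and the
jar/pipeline/metadata paths are assumptions to adjust for your own layout, not
anything prescribed by Hop or Flink.

```shell
# Sketch of a wrapper around the submit command above. All defaults below are
# assumptions; override them via the environment or edit them for your layout.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"
FAT_JAR="${FAT_JAR:-/tmp/hop-fatjar.jar}"
PIPELINE="${PIPELINE:-/path/to/project/home/folder/beam/pipelines/input-process-output.hpl}"
METADATA="${METADATA:-/tmp/hop-metadata.json}"

# Build the same command line as above and print it for review before running.
FLINK_CMD="$FLINK_HOME/bin/flink run --class org.apache.hop.beam.run.MainBeam $FAT_JAR $PIPELINE $METADATA Flink"
echo "$FLINK_CMD"
# Uncomment to actually submit the pipeline:
# $FLINK_CMD
```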



On Mon, May 30, 2022 at 5:42 PM <[podunk@gmx.com](mailto:podunk@gmx.com)>
wrote:

>  
>
> Hello community!
>
>  
>
>  
>
> I'm learning Apache Flink. I launched test installation on my PC (version
> 1.13.6), I have HOP running and now I want to run first pipeline on Flink.
>
> Yet for some reason it does not work. What I did:
>
>  
>
> 1. I exported a fat jar according to this tutorial:
>
> <https://hop.apache.org/manual/latest/pipeline/beam/running-the-beam-samples.html#_prerequisites>
>
>  
>
> 2. I created a new Pipeline Run Configuration with these details:
>
>  
>
> Name: Local Flink
>
> Description:
>
> Engine type: Beam Flink pipeline engine
>
> The Flink master: [local]
>
> Parallelism: 2
>
> ...
>
> ...
>
> User agent: Hop
>
> Temp file: file://C:/Temp
>
> ..
>
> Fat jar file location: ${PROJECT_HOME}/fat_jar.jar
>
>  
>
> 3. I opened the 'generate-synthetic-data' pipeline and executed it with the
> Pipeline run configuration: Local Flink
>
>  
>
> Looks like it runs; I see it in the Flink dashboard.
>
>  
>
>  
>
> I have two questions:
>
>  
>
> - I guess there is no way to see the data somewhere in Flink gui?
>
> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
> these sample data to file or database? In last transform 'Beam output' I see
> variable ${DATA_OUTPUT} - does it mean I can do some mapping in 'Variables'
> tab of Pipeline Run configuration to point to file? But what about database?
>
> - Hop exports the pipeline (I know it can be executed with 'run' as well) and
> it runs permanently until it is stopped, right? Or is there some parameter to
> execute it just once?
>
>  
>
> Thanks a lot for your help
>
>  
>
> Mike
>
>  
>
>  
>
>  


















Re: Beam input - problem with data format

Posted by Matt Casters <ma...@neo4j.com>.
If you think it's a bug and you have a reproduction case we would very much
welcome a JIRA case for this.
Most likely we're dealing with an issue with the file encoding or the line
endings but it's still worth looking into.

Thanks in advance!

Matt

On Tue, Jun 7, 2022 at 15:56, <po...@gmx.com> wrote:

>
> After another few days spent on it, I'm fairly sure this is a bug in the
> software:
>
> Capture1.PNG - when executed 'Local'
>
> "2022/06/07 22:53:53 - merge_join_test - client_id Integer(15) : There was
> a data type error: the data type of java.lang.String object ["Some customer
> 4"] does not correspond to value meta [Integer(15)]"
>
> When executed with the "Beam Flink pipeline engine"
>
> I tried everything.
>
> M.
> *Sent:* Friday, June 03, 2022 at 5:31 PM
> *From:* podunk@gmx.com
> *To:* users@hop.apache.org
> *Subject:* Beam input - problem with data format
>
> Hi,
>
> For 3 hours I've been trying to do such a trivial thing as opening a file in
> Beam input.
>
> My file is:
>
> 80010;Some customer 1;2344;Address 1
> 80011;Some customer 2;7546;Address 2
> 80012;Some customer 3;4564;Address 3
> 80013;Some customer 4;7564;Address 4
> 80014;Some customer 5;2354;Address 5
>
> I defined input file in Pipeline Run configuration - ${CUSTOMERS}
> I have Beam file definition 'customers' (see attachment)
> In Beam input i have - Input location: ${CUSTOMERS}, File definition to:
> customers
>
> When I start pipeline I receive error:
>
> 2022/06/03 17:24:40 - merge_join - client_no Integer(5) : There was a data
> type error: the data type of java.lang.String object [Some customer 1] does
> not correspond to value meta [Integer(5)]
>
> For some reason the first column, client_no, is completely ignored (yes, I have
> ';' as the field separator) and the second is taken.
> It does not matter what line endings the file has, nor what kind of
> encoding.
>
> I open the same file in 'Text file input' and all is OK.
>
> Any suggestion?
>
>
>

Re: Beam input - problem with data format

Posted by po...@gmx.com.

After another few days spent on it, I'm fairly sure this is a bug in the
software:



Capture1.PNG - when executed 'Local'



"2022/06/07 22:53:53 - merge_join_test - client_id Integer(15) : There was a
data type error: the data type of java.lang.String object ["Some customer 4"]
does not correspond to value meta [Integer(15)]"



When executed with the "Beam Flink pipeline engine"



I tried everything.



M.

**Sent:**  Friday, June 03, 2022 at 5:31 PM  
**From:**  podunk@gmx.com  
**To:**  users@hop.apache.org  
**Subject:**  Beam input - problem with data format



Hi,



For 3 hours I've been trying to do such a trivial thing as opening a file in Beam
input.



My file is:



80010;Some customer 1;2344;Address 1  
80011;Some customer 2;7546;Address 2  
80012;Some customer 3;4564;Address 3  
80013;Some customer 4;7564;Address 4  
80014;Some customer 5;2354;Address 5



I defined input file in Pipeline Run configuration - ${CUSTOMERS}

I have Beam file definition 'customers' (see attachment)

In Beam input I have - Input location: ${CUSTOMERS}, File definition to:
customers



When I start pipeline I receive error:

  
2022/06/03 17:24:40 - merge_join - client_no Integer(5) : There was a data
type error: the data type of java.lang.String object [Some customer 1] does
not correspond to value meta [Integer(5)]



For some reason the first column, client_no, is completely ignored (yes, I have ';'
as the field separator) and the second is taken.

It does not matter what line endings the file has, nor what kind of
encoding.



I open the same file in 'Text file input' and all is OK.



Any suggestion?






Beam input - problem with data format

Posted by po...@gmx.com.

Hi,



For 3 hours I've been trying to do such a trivial thing as opening a file in Beam
input.



My file is:



80010;Some customer 1;2344;Address 1  
80011;Some customer 2;7546;Address 2  
80012;Some customer 3;4564;Address 3  
80013;Some customer 4;7564;Address 4  
80014;Some customer 5;2354;Address 5



I defined input file in Pipeline Run configuration - ${CUSTOMERS}

I have Beam file definition 'customers' (see attachment)

In Beam input I have - Input location: ${CUSTOMERS}, File definition to:
customers



When I start pipeline I receive error:

  
2022/06/03 17:24:40 - merge_join - client_no Integer(5) : There was a data
type error: the data type of java.lang.String object [Some customer 1] does
not correspond to value meta [Integer(5)]



For some reason the first column, client_no, is completely ignored (yes, I have ';'
as the field separator) and the second is taken.

It does not matter what line endings the file has, nor what kind of
encoding.



I open the same file in 'Text file input' and all is OK.



Any suggestion?
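Since 'Text file input' parses the file correctly while Beam Input shifts the
columns, a quick structural check of the file itself can help rule out delimiter
surprises. A minimal sketch: the name customers.csv is a stand-in for the actual
${CUSTOMERS} path, and the sample rows from above are recreated so the snippet
is self-contained.

```shell
# Recreate the sample rows from the mail for illustration; in practice, point
# the awk command below at the real file behind ${CUSTOMERS}.
cat > customers.csv <<'EOF'
80010;Some customer 1;2344;Address 1
80011;Some customer 2;7546;Address 2
80012;Some customer 3;4564;Address 3
80013;Some customer 4;7564;Address 4
80014;Some customer 5;2354;Address 5
EOF

# Flag any row that does not split into exactly 4 fields on ';' or whose first
# field (client_no) is not an integer; no output means the file looks clean.
awk -F';' '
  NF != 4          { print "line " NR ": expected 4 fields, got " NF }
  $1 !~ /^[0-9]+$/ { print "line " NR ": client_no is not numeric: " $1 }
' customers.csv
```

If this check passes and Beam Input still misreads the first column, that makes
the bug theory more plausible and worth attaching to the JIRA case.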


Re: How to send pipeline to Flink?

Posted by Matt Casters <ma...@neo4j.com>.
Just set the Flink master to [auto]. It should work for your scenarios from
both the Hop GUI and Flink submit.

[local] is indeed the embedded Flink engine you can use for testing.

Beam Input will work faster as it will read files in parallel. For Excel
Output please check the Beam getting started guide as you need to specify
BEAM_SINGLE as the number of copies if you want to end up with a single
file. Given the embarrassingly parallel nature of Flink that is not a
trivial thing.

Good luck!

On Wed, Jun 1, 2022 at 17:51, <po...@gmx.com> wrote:

> I tried several things but I need help:
>
> "[local], [collection] or [auto]":
> - 'local' means use embedded Flink (so Hop has a Flink engine built in?)?
> - 'auto' will generate some engine independent Flink job?
> - 'collection' ?
>
> It is executed in my PC Flink installation only if i set The Flink Master
> to '127.0.0.1:8081'
>
>
> If I want to execute following pipeline in Flink:
>
> text input => some transform => table output
>
> It will run or I have to specify 'Beam Output' at the end (I think not)?
> Something like:
>
> text file input => some transform => table output => Beam output
>
> 'Text file input' will not work and I have to insert 'Beam input' ?
> Similarly about 'Beam output' - are 'Beam input/output' just faster than
> 'text file input/output'?
>
> If I need finally Excel file as a result - what will be faster; save
> directly using 'Excel writer' or save to txt file and run another pipeline
> that will open this file and save to Excel?
>
>
> Regards
>
> M.
>
> *Sent:* Monday, May 30, 2022 at 5:59 PM
> *From:* "Matt Casters" <ma...@neo4j.com>
> *To:* users@hop.apache.org
> *Subject:* Re: How to send pipeline to Flink?
> Hi Mike,
>
> the variable PROJECT_HOME is actually not set by flink so what I had to do
> was add it to conf/flink-conf.yml:
>
> env.java.opts: -DPROJECT_HOME=/path/to/project/home/folder
>
> My script to run right now looks something like this:
>
> bin/flink run \
>  --class org.apache.hop.beam.run.MainBeam \
>  /tmp/hop-fatjar.jar \
>  /path/to/project/home/folder/beam/pipelines/input-process-output.hpl \
>  /tmp/hop-metadata.json \
>  Flink
>
> As to your questions...
>
>
>> - I guess there is no way to see the data somewhere in Flink gui?
>
>
> No but I'm thinking of adding a Hop call-back service (on Hop Server
> probably) so that we can capture progress and allow Hop GUI to see that.
>
>
>> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
>> these sample data to file or database? In last transform 'Beam output' i
>> see variable ${DATA_OUTPUT} - does it mean I can do some mapping in
>> 'Variables' tab of Pipeline Run configuration to point to file? But what
>> about database?
>
>
> The Table Output transform will allow you to write to a relational
> database.  Performance-wise, please read up
> <https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html>
> on the basics specifically the section on "Row batching with non-Beam
> transforms".
>
>
>> - Hop exports pipeline (I know it can be executed as well with 'run') and
>> it runs permanently till it is stopped, right? Or there is some parameter
>> to execute it just once?
>
>
> Flink Run will execute only once as shown in the script above.
>
> HTH,
> Matt
>
> On Mon, May 30, 2022 at 5:42 PM <po...@gmx.com> wrote:
>
>>
>> Hello community!
>>
>>
>> I'm learning Apache Flink. I launched test installation on my PC (version
>> 1.13.6), I have HOP running and now I want to run first pipeline on Flink.
>> Yet for some reason it does not work. What I did:
>>
>> 1. I exported fat jar according to this tutorial:
>>
>> https://hop.apache.org/manual/latest/pipeline/beam/running-the-beam-samples.html#_prerequisites
>>
>> 2. I created new Pipeline Run Configuration with details:
>>
>> Name: Local Flink
>> Description:
>> Engine type: Beam Flink pipeline engine
>> The Flink master: [local]
>> Parallelism: 2
>> ...
>> ...
>> User agent: Hop
>> Temp file: file://C:/Temp
>> ..
>> Fat jar file location: ${PROJECT_HOME}/fat_jar.jar
>>
>> 3. I did open 'generate-synthetic-data' pipeline and executed it with
>> Pipeline run configuration: Local Flink
>>
>> Looks like it runs; I see it in the Flink dashboard.
>>
>>
>> I have two questions:
>>
>> - I guess there is no way to see the data somewhere in Flink gui?
>> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
>> these sample data to file or database? In last transform 'Beam output' i
>> see variable ${DATA_OUTPUT} - does it mean I can do some mapping in
>> 'Variables' tab of Pipeline Run configuration to point to file? But what
>> about database?
>> - Hop exports pipeline (I know it can be executed as well with 'run') and
>> it runs permanently till it is stopped, right? Or there is some parameter
>> to execute it just once?
>>
>> Thanks a lot for your help
>>
>> Mike
>>
>>
>>
>>

Re: How to send pipeline to Flink?

Posted by po...@gmx.com.
I tried several things but I need help:



"[local], [collection] or [auto]":

- 'local' means use embedded Flink (so Hop has a Flink engine built in?)?

- 'auto' will generate some engine-independent Flink job?

- 'collection' ?



It is executed on my PC's Flink installation only if I set the Flink Master to
'127.0.0.1:8081'





If I want to execute following pipeline in Flink:



text input => some transform => table output



Will it run, or do I have to specify 'Beam Output' at the end (I think not)?
Something like:



text file input => some transform => table output => Beam output



'Text file input' will not work and I have to insert 'Beam input'? Similarly
about 'Beam output' - are 'Beam input/output' just faster than 'text file
input/output'?



If I finally need an Excel file as a result - what will be faster: saving directly
using 'Excel writer', or saving to a txt file and running another pipeline that
opens this file and saves it to Excel?





Regards



M.



**Sent:**  Monday, May 30, 2022 at 5:59 PM  
**From:**  "Matt Casters" <ma...@neo4j.com>  
**To:**  users@hop.apache.org  
**Subject:**  Re: How to send pipeline to Flink?

Hi Mike,



The variable PROJECT_HOME is actually not set by Flink, so what I had to do was
add it to conf/flink-conf.yaml:



env.java.opts: -DPROJECT_HOME=/path/to/project/home/folder  


My script to run right now looks something like this:



bin/flink run \  
 \--class org.apache.hop.beam.run.MainBeam \  
 /tmp/hop-fatjar.jar \  
 /path/to/project/home/folder/beam/pipelines/input-process-output.hpl \  
 /tmp/hop-metadata.json \  
 Flink  


As to your questions...



> \- I guess there is no way to see the data somewhere in Flink gui?



No, but I'm thinking of adding a Hop call-back service (on Hop Server probably)
so that we can capture progress and allow Hop GUI to see that.



> \- How can I modify this 'generate-synthetic-data' pipeline so Flink saves
> these sample data to file or database? In the last transform 'Beam output' I see
> variable ${DATA_OUTPUT} - does it mean I can do some mapping in 'Variables'
> tab of Pipeline Run configuration to point to file? But what about database?



The Table Output transform will allow you to write to a relational database.
Performance-wise, please [read
up](https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html)
on the basics, specifically the section on "Row batching with non-Beam
transforms".



> \- Hop exports pipeline (I know it can be executed as well with 'run') and
> it runs permanently till it is stopped, right? Or is there some parameter to
> execute it just once?



Flink Run will execute only once as shown in the script above.



HTH,  
Matt



On Mon, May 30, 2022 at 5:42 PM <[podunk@gmx.com](mailto:podunk@gmx.com)>
wrote:

>  
>
> Hello community!
>
>  
>
>  
>
> I'm learning Apache Flink. I launched test installation on my PC (version
> 1.13.6), I have HOP running and now I want to run first pipeline on Flink.
>
> Yet for some reason it does not work. What I did:
>
>  
>
> 1\. I exported fat jar according to this tutorial:
>
> <https://hop.apache.org/manual/latest/pipeline/beam/running-the-beam-
> samples.html#_prerequisites>
>
>  
>
> 2\. I created new Pipeline Run Configuration with details:
>
>  
>
> Name: Local Flink
>
> Description:
>
> Engine type: Beam Flink pipeline engine
>
> The Flink master: [local]
>
> Parallelism: 2
>
> ...
>
> ...
>
> User agent: Hop
>
> Temp file: file://C:/Temp
>
> ..
>
> Fat jar file location: ${PROJECT_HOME}/fat_jar.jar
>
>  
>
> 3\. I opened the 'generate-synthetic-data' pipeline and executed it with
> Pipeline run configuration: Local Flink
>
>  
>
> Looks like it runs; I see it in the Flink dashboard.
>
>  
>
>  
>
> I have a few questions:
>
>  
>
> \- I guess there is no way to see the data somewhere in Flink gui?
>
> \- How can I modify this 'generate-synthetic-data' pipeline so Flink saves
> these sample data to file or database? In the last transform 'Beam output' I see
> variable ${DATA_OUTPUT} - does it mean I can do some mapping in 'Variables'
> tab of Pipeline Run configuration to point to file? But what about database?
>
> \- Hop exports the pipeline (I know it can be executed as well with 'run')
> and it runs permanently till it is stopped, right? Or is there some parameter
> to execute it just once?
>
>  
>
> Thanks a lot for your help
>
>  
>
> Mike
>
>  
>
>  
>
>  


Re: How to send pipeline to Flink?

Posted by Matt Casters <ma...@neo4j.com>.
Hi Mike,

The variable PROJECT_HOME is actually not set by Flink, so what I had to do
was add it to conf/flink-conf.yaml:

env.java.opts: -DPROJECT_HOME=/path/to/project/home/folder

My script to run right now looks something like this:

bin/flink run \
 --class org.apache.hop.beam.run.MainBeam \
 /tmp/hop-fatjar.jar \
 /path/to/project/home/folder/beam/pipelines/input-process-output.hpl \
 /tmp/hop-metadata.json \
 Flink

As to your questions...

- I guess there is no way to see the data somewhere in Flink gui?


No, but I'm thinking of adding a Hop call-back service (on Hop Server
probably) so that we can capture progress and allow Hop GUI to see that.

- How can I modify this 'generate-synthetic-data' pipeline so Flink saves
> these sample data to file or database? In the last transform 'Beam output' I
> see variable ${DATA_OUTPUT} - does it mean I can do some mapping in
> 'Variables' tab of Pipeline Run configuration to point to file? But what
> about database?
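On the ${DATA_OUTPUT} part: this is plain variable substitution. A value from the run configuration's 'Variables' tab replaces the token before the transform uses it. A minimal sketch of the idea (illustrative only, not Hop's actual resolver):

```python
import re

def resolve_variables(text: str, variables: dict) -> str:
    """Replace each ${NAME} token with its value from `variables`.
    Unknown tokens are left untouched. Illustrative sketch only --
    Hop's real resolver handles more cases than this."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  text)

# e.g. mapping DATA_OUTPUT to a local file prefix in the run configuration:
resolve_variables("${DATA_OUTPUT}-data", {"DATA_OUTPUT": "/tmp/synthetic"})
```

So yes, pointing the variable at a file path works for file output; for a database you switch the final transform instead, as described below.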


The Table Output transform will allow you to write to a relational
database.  Performance-wise, please read up
<https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html>
on the basics, specifically the section on "Row batching with non-Beam
transforms".
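The row-batching idea from that section can be sketched as: collect rows into fixed-size groups before a non-Beam transform such as Table Output, so the database receives bulk writes rather than one row at a time (a conceptual sketch, not Hop's code):

```python
def batched(rows, batch_size):
    """Yield rows in groups of batch_size; the final partial batch
    is flushed at the end. Conceptual sketch of Beam row batching."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder
        yield batch
```

Each yielded batch would then map to a single bulk insert on the database side, which is what makes the non-Beam transform viable at scale.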

- Hop exports the pipeline (I know it can be executed as well with 'run') and
> it runs permanently till it is stopped, right? Or is there some parameter
> to execute it just once?


Flink Run will execute only once as shown in the script above.

HTH,
Matt

On Mon, May 30, 2022 at 5:42 PM <po...@gmx.com> wrote:

>
> Hello community!
>
>
> I'm learning Apache Flink. I launched test installation on my PC (version
> 1.13.6), I have HOP running and now I want to run first pipeline on Flink.
> Yet for some reason it does not work. What I did:
>
> 1. I exported fat jar according to this tutorial:
>
> https://hop.apache.org/manual/latest/pipeline/beam/running-the-beam-samples.html#_prerequisites
>
> 2. I created new Pipeline Run Configuration with details:
>
> Name: Local Flink
> Description:
> Engine type: Beam Flink pipeline engine
> The Flink master: [local]
> Parallelism: 2
> ...
> ...
> User agent: Hop
> Temp file: file://C:/Temp
> ..
> Fat jar file location: ${PROJECT_HOME}/fat_jar.jar
>
> 3. I opened the 'generate-synthetic-data' pipeline and executed it with
> Pipeline run configuration: Local Flink
>
> Looks like it runs; I see it in the Flink dashboard.
>
>
> I have a few questions:
>
> - I guess there is no way to see the data somewhere in Flink gui?
> - How can I modify this 'generate-synthetic-data' pipeline so Flink saves
> these sample data to file or database? In the last transform 'Beam output' I
> see variable ${DATA_OUTPUT} - does it mean I can do some mapping in
> 'Variables' tab of Pipeline Run configuration to point to file? But what
> about database?
> - Hop exports the pipeline (I know it can be executed as well with 'run') and
> it runs permanently till it is stopped, right? Or is there some parameter
> to execute it just once?
>
> Thanks a lot for your help
>
> Mike
>
>
>
>
