Posted to users@hop.apache.org by Jochen Gatternig <Jo...@adebo.ch> on 2022/10/18 09:39:11 UTC

Scaling Pipelines & Workflows

Dear all

Are there options/parameters in Hop Server that allow parallelization and scaling of the processing?
We tested it with a pipeline configuration that reads data from a source table, creates XML files, and writes them to a filesystem. Additionally, it copies a document to the very same directory.
Our server has 8 cores (VM).

When running it as a single job, the system caps at 400-450% CPU load.
We then thought to modify the where-clause and run 2-4 jobs separately; however, each job seems to be capped at 100-150% CPU load.

Any idea how to increase performance?

Regards
Jochen

Best regards

Jochen Gatternig
Head of Advisory

Phone +41 76 431 00 94
jochen.gatternig@adebo.ch


Re: Scaling Pipelines & Workflows

Posted by Diego Mainou <di...@bizcubed.com.au>.
Sorry,

fat-thumb issue with my previous message...

Regarding this: "Additionally, it copies a document to the very same directory."
I've seen endless issues with that behaviour, particularly in cloud-based solutions.

Try copying to a different folder.

Diego Mainou
Project Delivery Manager
M. +61 415 152 091
E. diego.mainou@bizcubed.com.au
www.bizcubed.com.au




Re: Scaling Pipelines & Workflows

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Jochen,

Hop Server runs in a JVM; you can edit the hop-server.sh script to pass
extra options to the JVM, for example to allocate more memory (the default
heap is 2048 MB).

Regarding parallelisation and scaling: each Hop transform runs in its own
thread, consuming records on its input side and placing them on its output
side after processing. You can increase the number of instances/threads of
a transform by clicking on it and changing the "number of copies". One
thing to keep in mind: if you, for example, add more copies to a Table
Input transform, the query will be executed x times and will return the
same rows x times, unless you add logic to your query to distribute the
data over these multiple instances
(you can use ${Internal.Transform.CopyNr} and a mod function on an ID
column, for example).
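The modulo trick above can be sketched outside Hop. Each copy keeps only the rows whose ID modulo the copy count equals its own copy number, which is what a WHERE clause using ${Internal.Transform.CopyNr} achieves; the names below are illustrative, not Hop API:

```python
# Sketch: how N transform copies can partition rows by ID using modulo,
# mirroring a query like: WHERE MOD(id, <copies>) = ${Internal.Transform.CopyNr}
rows = [{"id": i, "payload": f"row-{i}"} for i in range(10)]

def rows_for_copy(rows, copy_nr, copies):
    """Return only the rows this particular copy should process."""
    return [r for r in rows if r["id"] % copies == copy_nr]

copies = 3
partitions = [rows_for_copy(rows, n, copies) for n in range(copies)]

# Every row lands in exactly one partition: no duplicate processing
# across copies, unlike running the unmodified query x times.
assert sum(len(p) for p in partitions) == len(rows)
print([len(p) for p in partitions])
```

Without the modulo filter, each copy would see all ten rows, which is exactly the "x times the same rows" problem described above.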

What we usually see in the field is that CPU is not the bottleneck of
pipelines; usually I/O is the limiting factor.
When you look at the status of a pipeline via the Hop Server UI, there are
indications of where the bottleneck is: each transform has a field showing
its input/output buffers. A transform with max rows (default: 10000) on its
input and 0 on its output is your bottleneck. If you see no data piling up
in the buffers, the pipeline is processing data just as fast as it receives
it (your database can't feed rows any faster than it does).
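That buffer heuristic can be sketched as a tiny function. The data shape and names here are illustrative, not the actual Hop Server status API:

```python
# Sketch: spot the bottleneck transform from input/output buffer sizes,
# as shown in the Hop Server pipeline status view (names are made up).
BUFFER_MAX = 10000  # Hop's default row-set size

def find_bottleneck(transforms):
    """A transform with a full input buffer and an empty output buffer
    is consuming rows slower than upstream can produce them."""
    for name, buf_in, buf_out in transforms:
        if buf_in >= BUFFER_MAX and buf_out == 0:
            return name
    return None  # no pile-up anywhere: the source itself is the limit

status = [("Table input", 0, 10000),
          ("XML output", 10000, 0)]
print(find_bottleneck(status))  # -> XML output
```

If find_bottleneck returns None, adding copies won't help; the database or the disk is the pacing factor, as described below.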

It might be that the pipeline can't go faster because the DB doesn't
deliver records any faster, or because the XML writer can't write to disk
any faster than it already does.

When dealing with performance issues:
- The transform metrics will show you which transform is the culprit
- Look at memory/CPU usage (as you are already doing)
- Increase copies/threads, but be mindful of the implications, as rows
will be split over multiple instances (input, output, sorting, grouping)

Hope this helps,
Hans
