Posted to users@hop.apache.org by po...@gmx.com on 2022/11/04 18:26:18 UTC

Beam File Input - unexpected result

Hi,
 
I'm playing with a Beam pipeline. My goal is to merge two big files.
So I have a source file (one of two) like:
 
column_one|colum_two
0099"|"0080199111"
...
 
My trivial pipeline is:
Beam File Input => Text file output
I created a definition for the Beam File Input: separator "|", column_one - string, column_two - string.
 
But this is what I get in the result (Text file output):
 
column_one|colum_two
0|0|9|9|"|"|0|0|8|0|1|9|9|1|1|1|"
...
 
Why is each character separated by "|"?
 
I also get 51 result files, even though I set 'Number of workers: 3' in the Pipeline Run Configuration for the 'Beam Direct pipeline engine'.
 
Also, this source file is really big and building the definition is quite a time-consuming process - it would be great to have options like in Text file input, where Hop detects the fields and can preview them.
 
Best
 

Re: Beam File Input - unexpected result

Posted by Hans Van Akelyen <ha...@gmail.com>.
 Hi,

The Beam engines will only give you a performance gain if the data set is
truly "large data". Spinning up the nodes and spreading the load has a
certain time cost.

In most cases, if your source files do not exceed gigabytes, the Hop engine
will perform better. When you are hitting tens of GBs, or TBs/PBs, or you
want to process thousands of files at the same time, the scaling of
Dataflow/Spark/Flink will probably outperform the Hop engine.

Depending on the file type/IO/operations the Hop engine can easily reach
300-500K rows/s.
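(As a rough illustration of that figure: at 300K rows/s, two files of a few
million rows each, as in your scenario, would stream through the local Hop
engine in well under a minute each - often less time than it takes a Beam
cluster to spin up in the first place.)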

Cheers,
Hans


Re: Beam File Input - unexpected result

Posted by po...@gmx.com.
Thank you very much for your help!
I will check out the options you mention - I did not know that I could preview the Beam File Input result (though I'm aware that I can do this in Text File Input).
Before I go deeper with this: will the Beam engine give me a performance gain compared to the Hop processing engine? I mean in my scenario, where I need to join/merge information from two or more files (large files, each on the order of a million lines).

Best 
 


Re: Beam File Input - unexpected result

Posted by Hans Van Akelyen <ha...@gmail.com>.
 Hi,

We made improvements in 2.2 to avoid confusion (we added defaults to the
default run configurations).
For this tab to work you first need to create an execution information
location (https://hop.apache.org/manual/latest/metadata-types/execution-information-location.html)
and attach it to your run configuration.

A full walkthrough can be found on YouTube:
https://www.youtube.com/watch?v=HCbW2TB3pEo

Cheers,
Hans





Re: Beam File Input - unexpected result

Posted by po...@gmx.com.
I opened an issue in Jira: HOP-4575 (https://issues.apache.org/jira/browse/HOP-4575).
It looks like a bug.

This feature ([2] https://hop.apache.org/manual/latest/hop-gui/perspective-execution-information.html#top) does not work here. Clicking on the localization icon gives nothing.
Pressing Ctrl-Shift-I takes me to the pane, but nothing is there (a blank pane). There is also no Data tab, even after pipeline execution.
Regards 
 


Re: Beam File Input - unexpected result

Posted by Hans Van Akelyen <ha...@gmail.com>.
 Hi,

That seems like an odd thing you are encountering.
I'm not sure how you are ending up with that result; if you think you are
hitting a bug, feel free to create a ticket with a reproduction path.

As for debugging, you are right that the Beam file definitions currently
have no way of making a best guess at the file structure.
This is possible using the Text file input, but that transform is not
optimised for Beam usage; you can, however, build the definition in the
Text file input and then copy it to the file definition.

When using the direct runner you can preview the data flowing through a
transform: click on the transform and use the “preview” output [1].
It will launch the pipeline and show you the result. When executing on
Dataflow/Spark/Flink we also have a way to “see” what is happening inside
the pipeline: the Execution Information perspective [2].
It can save execution information and sample data; when running on a remote
cluster it is best to also have a Hop Server running as an endpoint to save
the execution information.

As for the final part: because of the distributed/retry-on-failure and other
mechanisms in Beam, for transforms like a Text file output we let every
bundle/instance write to a new file. This is the safest way to do it (and it
is the default and recommended approach). If the data must go to a single
file, there is a workaround: change the number of copies on the output
transform and enter the value “SINGLE_BEAM”. This adds a group-by to the
Beam pipeline, forcing it onto a single thread and thus allowing it to write
to a single file, but it also carries a performance penalty. For more
information you can take a look at how we handle our transforms [3].
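To make that trade-off concrete, here is a minimal plain-Beam Java sketch
(illustration only - this is not the Hop transform's actual code, and the
file paths are made up) showing the default sharded write next to a write
forced down to a single file:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ShardedVsSingleFile {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        PCollection<String> lines =
            p.apply("Read", TextIO.read().from("/tmp/input*.csv"));

        // Default behaviour: every bundle/worker writes its own shard, so a
        // parallel run produces many output files (safe, fully parallel).
        lines.apply("WriteSharded", TextIO.write().to("/tmp/out/result"));

        // Forcing one output file: all data is funnelled through a single
        // writer (an implicit reshuffle/group), which costs performance -
        // roughly the effect the SINGLE_BEAM setting describes.
        lines.apply("WriteSingle",
            TextIO.write().to("/tmp/out/single").withNumShards(1));

        p.run().waitUntilFinish();
      }
    }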

Cheers,
Hans

[1]
https://hop.apache.org/manual/latest/pipeline/run-preview-debug-pipeline.html#top
[2]
https://hop.apache.org/manual/latest/hop-gui/perspective-execution-information.html#top
[3]
https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_all_others
