Posted to users@hop.apache.org by Fabian Peters <po...@mercadu.de> on 2022/08/25 09:00:20 UTC

Common local and Beam file output

Hi all,

During development I used the "Serialize to file" output to share data among pipelines <https://hop.apache.org/manual/latest/best-practices/index.html#_size_matters>. Unfortunately that transform only creates empty files when running on Beam, as do the Parquet <https://issues.apache.org/jira/browse/HOP-3557>, Avro and Text file outputs. The Beam output on the other hand only works on Beam.

Is there any output that works with the local runner and Beam/Dataflow?

cheers

Fabian

Re: Common local and Beam file output

Posted by Matt Casters <ma...@neo4j.com>.
Fabian, you are not the first to notice that SINGLE_BEAM doesn't work in
2.0.0 so I took the liberty of creating a JIRA case for me to investigate:

https://issues.apache.org/jira/browse/HOP-4172

I think the issue isn't the Beam pipeline as such but the behavior of the
non-Beam transform in this specific scenario. I think that we might fail
to close the file properly at the teardown of the Beam transform and
function.
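
Purely as an illustration of that hypothesis (this is not Hop or Beam code; the
function names and the 64 KiB buffer size are made up for the example), the
following Python sketch shows how a buffered writer that is never closed can
leave an empty file behind even though rows were written to it:

```python
import os
import tempfile


def write_rows(path, rows):
    """Write rows through a buffered handle and return it still open."""
    # 64 KiB buffer: small writes stay in memory until flush() or close().
    handle = open(path, "wb", buffering=64 * 1024)
    for row in rows:
        handle.write(row)
    # Deliberately left open, mimicking a teardown step that never runs.
    return handle


def demo():
    """Return the on-disk file size before and after the handle is closed."""
    rows = [b"id,name\n", b"1,hop\n", b"2,beam\n"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.csv")
        handle = write_rows(path, rows)
        size_before_close = os.path.getsize(path)
        handle.close()  # the step suspected to be missing
        size_after_close = os.path.getsize(path)
    return size_before_close, size_after_close


if __name__ == "__main__":
    before, after = demo()
    print("bytes on disk before close:", before)
    print("bytes on disk after close:", after)
```

Whether this is what actually happens in the non-Beam transforms is exactly
what the JIRA case is meant to find out.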

All the best,
Matt


On Thu, Aug 25, 2022 at 1:45 PM Fabian Peters <po...@mercadu.de> wrote:

> Hi Hans,
>
> Thanks for the quick reply! The "Supported Engines" box in the
> documentation is much appreciated.
>
> I had read about the SINGLE_BEAM option but forgotten about it since. I
> just tried it on the "Avro file output" and the "Serialize to file"
> transforms, but still get empty files on GCS when running with BeamDirect.
> I configured the SINGLE_BEAM option via the "Specify copies" on the output,
> is that correct?
>
> When SINGLE_BEAM is configured, it looks like the transform is being
> ignored by the local runner?
>
> cheers
>
> Fabian
>
> On 25.08.2022 at 11:43, Hans Van Akelyen <
> hans.van.akelyen@gmail.com> wrote:
>
> Hi Fabian,
>
> Did you try running those transforms with the "SINGLE_BEAM" option in the
> number of copies? (for more info here
> <https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_all_others> in
> the Non-Beam output transforms section)
>
> This being said, we are working on getting this tested on all runners, I
> have written textfiles on Flink in the past and that worked but DataFlow is
> another beast and might produce other results. To improve transparency we
> are adding indicators in the next version of our docs on each and every
> transform on what we have tested (you can already see this by switching to
> our pre-release docs example
> <https://hop.apache.org/manual/next/pipeline/transforms/dummy.html>).
> These docs are still very much alive so expect changes in these flags up
> until release.
>
> For the Avro and Parquet transforms we can implement the Beam equivalent
> in the backend so they should definitely start working once that work is
> done (tickets HOP-4168 and HOP-4169).
>
> Once we have tested everything the plan is to include an advisory or
> warnings in the application that some transforms do not work on the
> specified engine.
>
> Cheers,
> Hans
>
> On Thu, 25 Aug 2022 at 11:00, Fabian Peters <po...@mercadu.de> wrote:
>
>> Hi all,
>>
>> During development I used the "Serialize to file" output to share data
>> among pipelines
>> <https://hop.apache.org/manual/latest/best-practices/index.html#_size_matters>.
>> Unfortunately that transform only creates empty files when running on Beam,
>> as do the Parquet <https://issues.apache.org/jira/browse/HOP-3557>, Avro
>> and Text file outputs. The Beam output on the other hand only works on Beam.
>>
>> Is there any output that works with the local runner and Beam/Dataflow?
>>
>> cheers
>>
>> Fabian
>>
>
>

-- 
Neo4j Chief Solutions Architect
*✉   *matt.casters@neo4j.com

Re: Common local and Beam file output

Posted by Fabian Peters <po...@mercadu.de>.
Hi Hans,

Thanks for the quick reply! The "Supported Engines" box in the documentation is much appreciated.

I had read about the SINGLE_BEAM option but had since forgotten about it. I just tried it on the "Avro file output" and the "Serialize to file" transforms, but I still get empty files on GCS when running with BeamDirect. I configured the SINGLE_BEAM option via "Specify copies" on the output transform. Is that correct?

When SINGLE_BEAM is configured, it looks like the transform is being ignored by the local runner?

cheers

Fabian

> On 25.08.2022 at 11:43, Hans Van Akelyen <ha...@gmail.com> wrote:
> 
> Hi Fabian,
> 
> Did you try running those transforms with the "SINGLE_BEAM" option in the number of copies? (for more info here <https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_all_others> in the Non-Beam output transforms section)
> 
> This being said, we are working on getting this tested on all runners, I have written textfiles on Flink in the past and that worked but DataFlow is another beast and might produce other results. To improve transparency we are adding indicators in the next version of our docs on each and every transform on what we have tested (you can already see this by switching to our pre-release docs example <https://hop.apache.org/manual/next/pipeline/transforms/dummy.html>). These docs are still very much alive so expect changes in these flags up until release.
> 
> For the Avro and Parquet transforms we can implement the Beam equivalent in the backend so they should definitely start working once that work is done (tickets HOP-4168 and HOP-4169).
> 
> Once we have tested everything the plan is to include an advisory or warnings in the application that some transforms do not work on the specified engine.
> 
> Cheers,
> Hans
> 
> On Thu, 25 Aug 2022 at 11:00, Fabian Peters <po...@mercadu.de> wrote:
> Hi all,
> 
> During development I used the "Serialize to file" output to share data among pipelines <https://hop.apache.org/manual/latest/best-practices/index.html#_size_matters>. Unfortunately that transform only creates empty files when running on Beam, as do the Parquet <https://issues.apache.org/jira/browse/HOP-3557>, Avro and Text file outputs. The Beam output on the other hand only works on Beam.
> 
> Is there any output that works with the local runner and Beam/Dataflow?
> 
> cheers
> 
> Fabian


Re: Common local and Beam file output

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Fabian,

Did you try running those transforms with the "SINGLE_BEAM" option in the
number of copies? There is more info in the Non-Beam output transforms
section here:
<https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_all_others>

That said, we are still working on getting this tested on all runners. I
have written text files on Flink in the past and that worked, but Dataflow is
another beast and might produce other results. To improve transparency, we
are adding indicators to each and every transform in the next version of our
docs showing what we have tested (you can already see this by switching to
our pre-release docs, for example
<https://hop.apache.org/manual/next/pipeline/transforms/dummy.html>). These
docs are still very much in flux, so expect these flags to change up until
release.

For the Avro and Parquet transforms we can implement the Beam equivalent in
the backend so they should definitely start working once that work is done
(tickets HOP-4168 and HOP-4169).

Once we have tested everything the plan is to include an advisory or
warnings in the application that some transforms do not work on the
specified engine.

Cheers,
Hans

On Thu, 25 Aug 2022 at 11:00, Fabian Peters <po...@mercadu.de> wrote:

> Hi all,
>
> During development I used the "Serialize to file" output to share data
> among pipelines
> <https://hop.apache.org/manual/latest/best-practices/index.html#_size_matters>.
> Unfortunately that transform only creates empty files when running on Beam,
> as do the Parquet <https://issues.apache.org/jira/browse/HOP-3557>, Avro
> and Text file outputs. The Beam output on the other hand only works on Beam.
>
> Is there any output that works with the local runner and Beam/Dataflow?
>
> cheers
>
> Fabian
>