Posted to users@hop.apache.org by Fabian Peters <po...@mercadu.de> on 2022/08/10 13:46:11 UTC

Dataflow template creation

Hi all!

Thanks to Hans' work on the REST transform, I can now deploy my jobs to Dataflow.

Next, I'd like to schedule a batch job <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>, but for this I need to create a template <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>. I've searched the Hop documentation but haven't found anything on this. I'm guessing that flex-templates <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are the way to go, due to the fat-jar, but I'm wondering what to pass as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.

cheers

Fabian

Re: Dataflow template creation

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Fabian,

Glad we could help!
I think we can still make some improvements in the longer run. When using
the Pipeline it is actually also providing all the information you now have
to insert in the run configuration. I made a ticket with some things we can
do when we have a bit more spare time
https://issues.apache.org/jira/browse/HOP-4144

Cheers,
Hans

On Wed, 24 Aug 2022 at 14:25, Fabian Peters <po...@mercadu.de> wrote:

> Hi Hans,
>
> I finally got around to testing this and can confirm that it works fine!
> For me the "launch" job remains in the "Queued" state and then moves to
> "Failed". It looks like no resources are being billed for this job, so in
> my view this issue is mostly cosmetic. Thanks a lot for making this
> possible so quickly!
>
> cheers
>
> Fabian
>
> Am 18.08.2022 um 18:22 schrieb Fabian Peters <po...@mercadu.de>:
>
> Hello Hans,
>
> Just catching up on the day's mails. I'm really grateful to you for
> looking into this in depth and even coming up with a working setup! I've
> been swamped with other work but would never have gotten this far anyway.
> Nogmaals bedankt! ;)
>
> I'll try to do a test run tomorrow.
>
> cheers
>
> Fabian
>
> Am 18.08.2022 um 15:53 schrieb Hans Van Akelyen <
> hans.van.akelyen@gmail.com>:
>
> Hi Fabian,
>
> So I played around a bit more with the pipelines and I was able to launch
> dataflow jobs but it's not completely working as expected.
> The documentation around this is also a bit scattered everywhere so I'm
> not sure I'll be able to figure out the final solution in a short period of
> time.
>
> Steps taken to get this working:
> - Modified the code a bit, these changes will be merged soon [1]
> - Generate a hop-fatjar.jar
> - Upload a pipeline and the hop-metadata to Google Storage
>   - Modify the run configuration to take the fat-jar from following
> location /dataflow/template/hop-fatjar.jar (location in the docker image)
> - Modified the default docker to include the fat jar:
>
>   FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>
>   ARG WORKDIR=/dataflow/template
>   RUN mkdir -p ${WORKDIR}
>   WORKDIR ${WORKDIR}
>
>   COPY hop-fatjar.jar .
>
>   ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>   ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"
>
>   ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>
> - Save the image in the container registry (gcloud builds submit --tag
> <image_location>:latest .)
> - Create a new pipeline using following template:
>
> {
>     "defaultEnvironment": {},
>     "image": "<your image location>:latest",
>     "metadata": {
>         "description": "This template allows you to start Hop pipelines on dataflow",
>         "name": "Template to start a hop pipeline",
>         "parameters": [
>             {
>                 "helpText": "Google storage location pointing to the pipeline you wish to start",
>                 "label": "Google storage location pointing to the pipeline you wish to start",
>                 "name": "HopPipelinePath",
>                 "regexes": [
>                     ".*"
>                 ]
>             },
>             {
>                 "helpText": "Google storage location pointing to the Hop Metadata you wish to use",
>                 "label": "Google storage location pointing to the Hop Metadata you wish to use",
>                 "name": "HopMetadataPath",
>                 "regexes": [
>                     ".*"
>                 ]
>             },
>             {
>                 "helpText": "Run configuration used to launch the pipeline",
>                 "label": "Run configuration used to launch the pipeline",
>                 "name": "HopRunConfigurationName",
>                 "regexes": [
>                     ".*"
>                 ]
>             }
>         ]
>     },
>     "sdkInfo": {
>         "language": "JAVA"
>     }
> }
>
> - Fill in the parameters with the google storage location and run
> configuration name
> - Run the pipeline
>
> Now we enter the point where things get a bit strange, when you follow all
> these steps you will notice a dataflow job will be started.
> This Dataflow job will then spawn another Dataflow job that contains the
> actual pipeline, the original job started via the pipeline will fail but
> your other job will run fine.
> <image.png>
> The Pipeline job expects that a job file gets generated in a specific
> location and it will then pick up this file to execute the actual job.
> This is the part we would probably have to change our code a bit to save
> the job specification to that location and not start another job via the
> Beam API.
>
> Until we get that sorted out you will have 2 jobs where one will fail on
> every run, I hope this is acceptable for now.
>
> Cheers,
> Hans
>
> [1] https://github.com/apache/hop/pull/1644
>
>
> On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <ha...@gmail.com>
> wrote:
>
>> Hi Fabian,
>>
>> I've been digging into this a bit and it seems we will need some code
>> changes to make this work.
>> As far as I can tell you have to use one of the docker templates Google
>> provides to start a pipeline from a template.
>> The issue we have is that our MainBeam class requires 3 arguments to work
>> (filename/metadata/run configuration name).
>> These 3 arguments need to be the 3 first arguments passed to the class,
>> we have no named parameters implemented.
>>
>> When the template launches it calls java in the following way:
>>
>> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam
>> --pipelineLocation=test --runner=DataflowRunner --project=xxx
>> --templateLocation=
>> gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object
>> --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging
>> --labels={ "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257
>> --region=us-central1 --serviceAccount=
>> xxxx-compute@developer.gserviceaccount.com --tempLocation=
>> gs://dataflow-staging-us-central1-xxxx/tmp
>>
>> In this case it will see the first 3 arguments and select them.
>> <image.png>
>>
>> As I can not find a way to force those 3 arguments in there we will need
>> to implement named parameters in that class, I tried a bit of a hack but it
>> did not work, I changed the docker template to the following but the Google
>> script then throws an error:
>>
>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam
>> gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
>>
>> As I think this will have great added value, I will work on this ASAP.
>> When the work has been done we can even supply the image required from our
>> DockerHub Account and you should be able to run Hop pipelines in dataflow
>> by using a simple template.
>>
>> My idea will be to add support for the following 3 named parameters:
>>  - HopPipelinePath -> location of the pipeline (can be Google Storage)
>>  - HopMetadataPath -> location of the metadata file (can be Google
>> storage)
>>  - HopRunConfigurationName
>>
>> I'll post updates here on the progress.
>>
>> Cheers,
>> Hans
>>
>> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <po...@mercadu.de> wrote:
>>
>>> Hi Hans,
>>>
>>> No, I didn't yet have another go. The hints from Matt (didn't see that
>>> mail on the list?) do look quite useful in the context of Dataflow templates.
>>> I'll try to see whether I can get a bit further, but if you have time to
>>> have a look at it, I'd much appreciate!
>>>
>>> cheers
>>>
>>> Fabian
>>>
>>> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <
>>> hans.van.akelyen@gmail.com>:
>>>
>>> Hi Fabian,
>>>
>>> Did you get this working and are you willing to share the final results?
>>> If not I will see what I can do, and we can add it to our documentation.
>>>
>>> Cheers,
>>> Hans
>>>
>>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <ma...@neo4j.com>
>>> wrote:
>>>
>>>> When you run class org.apache.hop.beam.run.MainBeam you need to provide
>>>> 3 arguments to run:
>>>>
>>>> 1. The filename of the pipeline to run
>>>> 2. The filename which contains Hop metadata
>>>> 3. The name of the pipeline run configuration to use
>>>>
>>>> See also for example:
>>>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>>>>
>>>> Good luck,
>>>> Matt
>>>>
>>>>
>>>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <po...@mercadu.de> wrote:
>>>>
>>>>> Hello Hans,
>>>>>
>>>>> I went through the flex-template process yesterday but the generated
>>>>> template does not work. The main piece that's missing for me is how to pass
>>>>> the actual pipeline that should be run. My test boiled down to:
>>>>>
>>>>> gcloud dataflow flex-template build
>>>>> gs://foo_ag_dataflow/tmp/todays-directories.json \
>>>>>       --image-gcr-path "
>>>>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>>>>       --sdk-language "JAVA" \
>>>>>       --flex-template-base-image JAVA11 \
>>>>>       --metadata-file
>>>>> "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json"
>>>>> \
>>>>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>>>>       --env
>>>>> FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>>>
>>>>> gcloud dataflow flex-template run "todays-directories-`date
>>>>> +%Y%m%d-%H%M%S`" \
>>>>>     --template-file-gcs-location "
>>>>> gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>>>>     --region "europe-west1"
>>>>>
>>>>> With Dockerfile:
>>>>>
>>>>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>>>>
>>>>> ARG WORKDIR=/dataflow/template
>>>>> RUN mkdir -p ${WORKDIR}
>>>>> WORKDIR ${WORKDIR}
>>>>>
>>>>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>>>>
>>>>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>>>>
>>>>>
>>>>> And "todays-directories.json":
>>>>>
>>>>> {
>>>>>     "defaultEnvironment": {},
>>>>>     "image": "
>>>>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>>>>     "metadata": {
>>>>>         "description": "Test templates creation with Apache Hop",
>>>>>         "name": "Todays directories"
>>>>>     },
>>>>>     "sdkInfo": {
>>>>>         "language": "JAVA"
>>>>>     }
>>>>> }
>>>>>
>>>>> Thanks for having a look at it!
>>>>>
>>>>> cheers
>>>>>
>>>>> Fabian
>>>>>
>>>>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <
>>>>> hans.van.akelyen@gmail.com>:
>>>>>
>>>>> Hi Fabian,
>>>>>
>>>>> You have indeed found something we have not yet documented, mainly
>>>>> because we have not yet tried it out ourselves.
>>>>> The main class that gets called when running Beam pipelines is
>>>>> "org.apache.hop.beam.run.MainBeam".
>>>>>
>>>>> I was hoping the "Import as pipeline" button on a job would give you
>>>>> everything you need to execute this but it does not.
>>>>> I'll take a closer look the following days to see what is needed to
>>>>> use this functionality, could be that we need to export the template based
>>>>> on a pipeline.
>>>>>
>>>>> Kr,
>>>>> Hans
>>>>>
>>>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <po...@mercadu.de> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs
>>>>>> to Dataflow.
>>>>>>
>>>>>> Next, I'd like to schedule a batch job
>>>>>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>>>>>> but for this I need to create a template
>>>>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>>>>>> I've searched the Hop documentation but haven't found anything on this. I'm
>>>>>> guessing that flex-templates
>>>>>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are
>>>>>> the way to go, due to the fat-jar, but I'm wondering what to pass as
>>>>>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> Fabian
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Neo4j Chief Solutions Architect
>>>> *✉   *matt.casters@neo4j.com
>>>>
>>>>
>>>>
>>>>
>>>
>
>

Re: Dataflow template creation

Posted by Fabian Peters <po...@mercadu.de>.
Hi Hans,

I finally got around to testing this and can confirm that it works fine! For me the "launch" job remains in the "Queued" state and then moves to "Failed". It looks like no resources are being billed for this job, so in my view this issue is mostly cosmetic. Thanks a lot for making this possible so quickly!

cheers

Fabian

> Am 18.08.2022 um 18:22 schrieb Fabian Peters <po...@mercadu.de>:
> 
> Hello Hans,
> 
> Just catching up on the day's mails. I'm really grateful to you for looking into this in depth and even coming up with a working setup! I've been swamped with other work but would never have gotten this far anyway. Nogmaals bedankt! ;)
> 
> I'll try to do a test run tomorrow.
> 
> cheers
> 
> Fabian
> 
>> Am 18.08.2022 um 15:53 schrieb Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>>:
>> 
>> Hi Fabian,
>> 
>> So I played around a bit more with the pipelines and I was able to launch dataflow jobs but it's not completely working as expected.
>> The documentation around this is also a bit scattered everywhere so I'm not sure I'll be able to figure out the final solution in a short period of time.
>> 
>> Steps taken to get this working:
>> - Modified the code a bit, these changes will be merged soon [1]
>> - Generate a hop-fatjar.jar
>> - Upload a pipeline and the hop-metadata to Google Storage
>>   - Modify the run configuration to take the fat-jar from following location /dataflow/template/hop-fatjar.jar (location in the docker image)
>> - Modified the default docker to include the fat jar:
>>  
>>  FROM gcr.io/dataflow-templates-base/java11-template-launcher-base <http://gcr.io/dataflow-templates-base/java11-template-launcher-base>
>> 
>>   ARG WORKDIR=/dataflow/template
>>   RUN mkdir -p ${WORKDIR}
>>   WORKDIR ${WORKDIR}
>> 
>>   COPY hop-fatjar.jar .
>> 
>>   ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>   ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"
>> 
>>   ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>> 
>> - Save the image in the container registry (gcloud builds submit --tag <image_location>:latest .)
>> - Create a new pipeline using following template:
>> 
>> {
>>     "defaultEnvironment": {},
>>     "image": "<your image location>:latest",
>>     "metadata": {
>>         "description": "This template allows you to start Hop pipelines on dataflow",
>>         "name": "Template to start a hop pipeline",
>>         "parameters": [
>>             {
>>                 "helpText": "Google storage location pointing to the pipeline you wish to start",
>>                 "label": "Google storage location pointing to the pipeline you wish to start",
>>                 "name": "HopPipelinePath",
>>                 "regexes": [
>>                     ".*"
>>                 ]
>>             },
>>             {
>>                 "helpText": "Google storage location pointing to the Hop Metadata you wish to use",
>>                 "label": "Google storage location pointing to the Hop Metadata you wish to use",
>>                 "name": "HopMetadataPath",
>>                 "regexes": [
>>                     ".*"
>>                 ]
>>             },
>>             {
>>                 "helpText": "Run configuration used to launch the pipeline",
>>                 "label": "Run configuration used to launch the pipeline",
>>                 "name": "HopRunConfigurationName",
>>                 "regexes": [
>>                     ".*"
>>                 ]
>>             }
>>         ]
>>     },
>>     "sdkInfo": {
>>         "language": "JAVA"
>>     }
>> }
>> 
>> - Fill in the parameters with the google storage location and run configuration name
>> - Run the pipeline
>> 
>> Now we enter the point where things get a bit strange, when you follow all these steps you will notice a dataflow job will be started.
>> This Dataflow job will then spawn another Dataflow job that contains the actual pipeline, the original job started via the pipeline will fail but your other job will run fine.
>> <image.png>
>> The Pipeline job expects that a job file gets generated in a specific location and it will then pick up this file to execute the actual job.
>> This is the part we would probably have to change our code a bit to save the job specification to that location and not start another job via the Beam API.
>> 
>> Until we get that sorted out you will have 2 jobs where one will fail on every run, I hope this is acceptable for now.
>> 
>> Cheers,
>> Hans
>> 
>> [1] https://github.com/apache/hop/pull/1644 <https://github.com/apache/hop/pull/1644>
>> 
>> 
>> On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>> wrote:
>> Hi Fabian,
>> 
>> I've been digging into this a bit and it seems we will need some code changes to make this work.
>> As far as I can tell you have to use one of the docker templates Google provides to start a pipeline from a template.
>> The issue we have is that our MainBeam class requires 3 arguments to work (filename/metadata/run configuration name).
>> These 3 arguments need to be the 3 first arguments passed to the class, we have no named parameters implemented.
>> 
>> When the template launches it calls java in the following way:
>> 
>> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam --pipelineLocation=test --runner=DataflowRunner --project=xxx --templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object <gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object> --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging <gs://dataflow-staging-us-central1-xxxx/staging> --labels={ "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257 --region=us-central1 --serviceAccount=xxxx-compute@developer.gserviceaccount.com <ma...@developer.gserviceaccount.com> --tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp <gs://dataflow-staging-us-central1-xxxx/tmp>
>> 
>> In this case it will see the first 3 arguments and select them.
>> <image.png>
>> 
>> As I can not find a way to force those 3 arguments in there we will need to implement named parameters in that class, I tried a bit of a hack but it did not work, I changed the docker template to the following but the Google script then throws an error:
>> 
>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam gs://xxx/0004-rest-client-get.hpl <gs://xxx/0004-rest-client-get.hpl> gs://xxx/hop-metadata.json <gs://xxx/hop-metadata.json> Dataflow"
>> 
>> As I think this will have great added value, I will work on this ASAP. When the work has been done we can even supply the image required from our DockerHub Account and you should be able to run Hop pipelines in dataflow by using a simple template.
>> 
>> My idea will be to add support for the following 3 named parameters:
>>  - HopPipelinePath -> location of the pipeline (can be Google Storage)
>>  - HopMetadataPath -> location of the metadata file (can be Google storage)
>>  - HopRunConfigurationName 
>> 
>> I'll post updates here on the progress.
>> 
>> Cheers,
>> Hans
>> 
>> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
>> Hi Hans,
>> 
>> No, I didn't yet have another go. The hints from Matt (didn't see that mail on the list?) do look quite useful in the context of Dataflow templates. I'll try to see whether I can get a bit further, but if you have time to have a look at it, I'd much appreciate!
>> 
>> cheers
>> 
>> Fabian
>> 
>>> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>>:
>>> 
>>> Hi Fabian,
>>> 
>>> Did you get this working and are you willing to share the final results?
>>> If not I will see what I can do, and we can add it to our documentation.
>>> 
>>> Cheers,
>>> Hans
>>> 
>>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <matt.casters@neo4j.com <ma...@neo4j.com>> wrote:
>>> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3 arguments to run:
>>> 
>>> 1. The filename of the pipeline to run
>>> 2. The filename which contains Hop metadata
>>> 3. The name of the pipeline run configuration to use
>>> 
>>> See also for example: https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run <https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run>
>>> 
>>> Good luck,
>>> Matt
>>> 
>>> 
>>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
>>> Hello Hans,
>>> 
>>> I went through the flex-template process yesterday but the generated template does not work. The main piece that's missing for me is how to pass the actual pipeline that should be run. My test boiled down to:
>>> 
>>> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json <> \
>>>       --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest <http://europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest>" \
>>>       --sdk-language "JAVA" \
>>>       --flex-template-base-image JAVA11 \
>>>       --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>>       --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>> 
>>> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>>>     --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json <>" \
>>>     --region "europe-west1"
>>> 
>>> With Dockerfile:
>>> 
>>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base <http://gcr.io/dataflow-templates-base/java11-template-launcher-base>
>>> 
>>> ARG WORKDIR=/dataflow/template
>>> RUN mkdir -p ${WORKDIR}
>>> WORKDIR ${WORKDIR}
>>> 
>>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>> 
>>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>> 
>>> 
>>> And "todays-directories.json":
>>> 
>>> {
>>>     "defaultEnvironment": {},
>>>     "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest <http://europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest>",
>>>     "metadata": {
>>>         "description": "Test templates creation with Apache Hop",
>>>         "name": "Todays directories"
>>>     },
>>>     "sdkInfo": {
>>>         "language": "JAVA"
>>>     }
>>> }
>>> 
>>> Thanks for having a look at it!
>>> 
>>> cheers
>>> 
>>> Fabian
>>> 
>>>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>>:
>>>> 
>>>> Hi Fabian,
>>>> 
>>>> You have indeed found something we have not yet documented, mainly because we have not yet tried it out ourselves.
>>>> The main class that gets called when running Beam pipelines is "org.apache.hop.beam.run.MainBeam".
>>>> 
>>>> I was hoping the "Import as pipeline" button on a job would give you everything you need to execute this but it does not.
>>>> I'll take a closer look the following days to see what is needed to use this functionality, could be that we need to export the template based on a pipeline.
>>>> 
>>>> Kr,
>>>> Hans
>>>> 
>>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
>>>> Hi all!
>>>> 
>>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to Dataflow.
>>>> 
>>>> Next, I'd like to schedule a batch job <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>, but for this I need to create a template <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>. I've searched the Hop documentation but haven't found anything on this. I'm guessing that flex-templates <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are the way to go, due to the fat-jar, but I'm wondering what to pass as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>> 
>>>> cheers
>>>> 
>>>> Fabian
>>> 
>>> 
>>> 
>>> -- 
>>> Neo4j Chief Solutions Architect
>>> ✉   matt.casters@neo4j.com <ma...@neo4j.com>
>>> 
>>> 
>>> 
>> 
> 


Re: Dataflow template creation

Posted by Fabian Peters <po...@mercadu.de>.
Hello Hans,

Just catching up on the day's mails. I'm really grateful to you for looking into this in depth and even coming up with a working setup! I've been swamped with other work but would never have gotten this far anyway. Nogmaals bedankt! ;)

I'll try to do a test run tomorrow.

cheers

Fabian

> Am 18.08.2022 um 15:53 schrieb Hans Van Akelyen <ha...@gmail.com>:
> 
> Hi Fabian,
> 
> So I played around a bit more with the pipelines and I was able to launch dataflow jobs but it's not completely working as expected.
> The documentation around this is also a bit scattered everywhere so I'm not sure I'll be able to figure out the final solution in a short period of time.
> 
> Steps taken to get this working:
> - Modified the code a bit, these changes will be merged soon [1]
> - Generate a hop-fatjar.jar
> - Upload a pipeline and the hop-metadata to Google Storage
>   - Modify the run configuration to take the fat-jar from following location /dataflow/template/hop-fatjar.jar (location in the docker image)
> - Modified the default docker to include the fat jar:
>  
>  FROM gcr.io/dataflow-templates-base/java11-template-launcher-base <http://gcr.io/dataflow-templates-base/java11-template-launcher-base>
> 
>   ARG WORKDIR=/dataflow/template
>   RUN mkdir -p ${WORKDIR}
>   WORKDIR ${WORKDIR}
> 
>   COPY hop-fatjar.jar .
> 
>   ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>   ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"
> 
>   ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
> 
> - Save the image in the container registry (gcloud builds submit --tag <image_location>:latest .)
> - Create a new pipeline using following template:
> 
> {
>     "defaultEnvironment": {},
>     "image": "<your image location>:latest",
>     "metadata": {
>         "description": "This template allows you to start Hop pipelines on dataflow",
>         "name": "Template to start a hop pipeline",
>         "parameters": [
>             {
>                 "helpText": "Google storage location pointing to the pipeline you wish to start",
>                 "label": "Google storage location pointing to the pipeline you wish to start",
>                 "name": "HopPipelinePath",
>                 "regexes": [
>                     ".*"
>                 ]
>             },
>             {
>                 "helpText": "Google storage location pointing to the Hop Metadata you wish to use",
>                 "label": "Google storage location pointing to the Hop Metadata you wish to use",
>                 "name": "HopMetadataPath",
>                 "regexes": [
>                     ".*"
>                 ]
>             },
>             {
>                 "helpText": "Run configuration used to launch the pipeline",
>                 "label": "Run configuration used to launch the pipeline",
>                 "name": "HopRunConfigurationName",
>                 "regexes": [
>                     ".*"
>                 ]
>             }
>         ]
>     },
>     "sdkInfo": {
>         "language": "JAVA"
>     }
> }
> 
> - Fill in the parameters with the google storage location and run configuration name
> - Run the pipeline
> 
> Now we enter the point where things get a bit strange, when you follow all these steps you will notice a dataflow job will be started.
> This Dataflow job will then spawn another Dataflow job that contains the actual pipeline, the original job started via the pipeline will fail but your other job will run fine.
> <image.png>
> The Pipeline job expects that a job file gets generated in a specific location and it will then pick up this file to execute the actual job.
> This is the part we would probably have to change our code a bit to save the job specification to that location and not start another job via the Beam API.
> 
> Until we get that sorted out you will have 2 jobs where one will fail on every run, I hope this is acceptable for now.
> 
> Cheers,
> Hans
> 
> [1] https://github.com/apache/hop/pull/1644 <https://github.com/apache/hop/pull/1644>
> 
> 
> On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>> wrote:
> Hi Fabian,
> 
> I've been digging into this a bit and it seems we will need some code changes to make this work.
> As far as I can tell you have to use one of the docker templates Google provides to start a pipeline from a template.
> The issue we have is that our MainBeam class requires 3 arguments to work (filename/metadata/run configuration name).
> These 3 arguments need to be the 3 first arguments passed to the class, we have no named parameters implemented.
> 
> When the template launches it calls java in the following way:
> 
> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam --pipelineLocation=test --runner=DataflowRunner --project=xxx --templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={ "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257 --region=us-central1 --serviceAccount=xxxx-compute@developer.gserviceaccount.com <ma...@developer.gserviceaccount.com> --tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp
> 
> In this case it will see the first 3 arguments and select them.
> <image.png>
> 
> As I can not find a way to force those 3 arguments in there we will need to implement named parameters in that class, I tried a bit of a hack but it did not work, I changed the docker template to the following but the Google script then throws an error:
> 
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
> 
> As I think this will have great added value, I will work on this ASAP. When the work has been done we can even supply the image required from our DockerHub Account and you should be able to run Hop pipelines in dataflow by using a simple template.
> 
> My idea will be to add support for the following 3 named parameters:
>  - HopPipelinePath -> location of the pipeline (can be Google Storage)
>  - HopMetadataPath -> location of the metadata file (can be Google storage)
>  - HopRunConfigurationName 
> 
> I'll post updates here on the progress.
> 
> Cheers,
> Hans
> 
> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
> Hi Hans,
> 
> No, I didn't yet have another go. The hints from Matt (didn't see that mail on the list?) do look quite useful in the context of Dataflow templates. I'll try to see whether I can get a bit further, but if you have time to have a look at it, I'd much appreciate!
> 
> cheers
> 
> Fabian
> 
>> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>>:
>> 
>> Hi Fabian,
>> 
>> Did you get this working and are you willing to share the final results?
>> If not I will see what I can do, and we can add it to our documentation.
>> 
>> Cheers,
>> Hans
>> 
>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <matt.casters@neo4j.com <ma...@neo4j.com>> wrote:
>> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3 arguments to run:
>> 
>> 1. The filename of the pipeline to run
>> 2. The filename which contains Hop metadata
>> 3. The name of the pipeline run configuration to use
>> 
>> See also for example: https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run <https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run>
>> 
>> Good luck,
>> Matt
>> 
>> 
>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
>> Hello Hans,
>> 
>> I went through the flex-template process yesterday but the generated template does not work. The main piece that's missing for me is how to pass the actual pipeline that should be run. My test boiled down to:
>> 
>> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json <> \
>>       --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest <http://europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest>" \
>>       --sdk-language "JAVA" \
>>       --flex-template-base-image JAVA11 \
>>       --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>       --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>> 
>> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>>     --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json <>" \
>>     --region "europe-west1"
>> 
>> With Dockerfile:
>> 
>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base <http://gcr.io/dataflow-templates-base/java11-template-launcher-base>
>> 
>> ARG WORKDIR=/dataflow/template
>> RUN mkdir -p ${WORKDIR}
>> WORKDIR ${WORKDIR}
>> 
>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>> 
>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>> 
>> 
>> And "todays-directories.json":
>> 
>> {
>>     "defaultEnvironment": {},
>>     "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest <http://europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest>",
>>     "metadata": {
>>         "description": "Test templates creation with Apache Hop",
>>         "name": "Todays directories"
>>     },
>>     "sdkInfo": {
>>         "language": "JAVA"
>>     }
>> }
>> 
>> Thanks for having a look at it!
>> 
>> cheers
>> 
>> Fabian
>> 
>>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>>:
>>> 
>>> Hi Fabian,
>>> 
>>> You have indeed found something we have not yet documented, mainly because we have not yet tried it out ourselves.
>>> The main class that gets called when running Beam pipelines is "org.apache.hop.beam.run.MainBeam".
>>> 
>>> I was hoping the "Import as pipeline" button on a job would give you everything you need to execute this but it does not.
>>> I'll take a closer look the following days to see what is needed to use this functionality, could be that we need to export the template based on a pipeline.
>>> 
>>> Kr,
>>> Hans
>>> 
>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
>>> Hi all!
>>> 
>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to Dataflow.
>>> 
>>> Next, I'd like to schedule a batch job <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>, but for this I need to create a template <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>. I've searched the Hop documentation but haven't found anything on this. I'm guessing that flex-templates <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are the way to go, due to the fat-jar, but I'm wondering what to pass as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>> 
>>> cheers
>>> 
>>> Fabian
>> 
>> 
>> 
>> -- 
>> Neo4j Chief Solutions Architect
>> ✉   matt.casters@neo4j.com <ma...@neo4j.com>
>> 
>> 
>> 
> 


Re: Dataflow template creation

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Fabian,

So I played around a bit more with the pipelines and I was able to launch
dataflow jobs but it's not completely working as expected.
The documentation around this is also a bit scattered everywhere so I'm not
sure I'll be able to figure out the final solution in a short period of
time.

Steps taken to get this working:
- Modified the code a bit, these changes will be merged soon [1]
- Generate a hop-fatjar.jar
- Upload a pipeline and the hop-metadata to Google Storage
  - Modify the run configuration to take the fat-jar from following
location /dataflow/template/hop-fatjar.jar (location in the docker image)
- Modified the default docker to include the fat jar:


  FROM gcr.io/dataflow-templates-base/java11-template-launcher-base

  ARG WORKDIR=/dataflow/template
  RUN mkdir -p ${WORKDIR}
  WORKDIR ${WORKDIR}

  COPY hop-fatjar.jar .

  ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
  ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"

  ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]

- Save the image in the container registry (gcloud builds submit --tag
<image_location>:latest .)
- Create a new pipeline using following template:

{
    "defaultEnvironment": {},
    "image": "<your image location>:latest",
    "metadata": {
        "description": "This template allows you to start Hop pipelines on dataflow",
        "name": "Template to start a hop pipeline",
        "parameters": [
            {
                "helpText": "Google storage location pointing to the pipeline you wish to start",
                "label": "Google storage location pointing to the pipeline you wish to start",
                "name": "HopPipelinePath",
                "regexes": [
                    ".*"
                ]
            },
            {
                "helpText": "Google storage location pointing to the Hop Metadata you wish to use",
                "label": "Google storage location pointing to the Hop Metadata you wish to use",
                "name": "HopMetadataPath",
                "regexes": [
                    ".*"
                ]
            },
            {
                "helpText": "Run configuration used to launch the pipeline",
                "label": "Run configuration used to launch the pipeline",
                "name": "HopRunConfigurationName",
                "regexes": [
                    ".*"
                ]
            }
        ]
    },
    "sdkInfo": {
        "language": "JAVA"
    }
}

- Fill in the parameters with the google storage location and run
configuration name
- Run the pipeline
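
For reference, once the template spec above has been uploaded to Cloud Storage, the
same launch can also be triggered from the gcloud CLI instead of creating a Data
Pipeline in the console. This is only a sketch: the bucket, paths and region below
are placeholders rather than values from this thread, and it simply fills in the
three parameters declared in the template metadata.

# Placeholders: replace the bucket, paths and region with your own values.
gcloud dataflow flex-template run "hop-pipeline-`date +%Y%m%d-%H%M%S`" \
    --template-file-gcs-location "gs://<your-bucket>/templates/hop-template.json" \
    --region "europe-west1" \
    --parameters HopPipelinePath=gs://<your-bucket>/pipelines/your-pipeline.hpl,HopMetadataPath=gs://<your-bucket>/hop-metadata.json,HopRunConfigurationName=Dataflow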

Now we get to the point where things become a bit strange: when you follow all
these steps you will notice that a Dataflow job is started. This Dataflow job then
spawns a second Dataflow job that contains the actual pipeline; the original job
started via the pipeline will fail, but the other job will run fine.
[image: image.png]
The Pipeline job expects a job file to be generated in a specific location, which
it then picks up to execute the actual job. This is the part where we would
probably have to change our code a bit: save the job specification to that
location instead of starting another job via the Beam API.
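
To make that hand-off concrete: the launcher passes a --templateLocation such as the
one visible in the "Executing: java ..." line quoted further down
(gs://dataflow-staging-.../staging/template_launches/<launch-id>/job_object) and then
waits for the serialized job to appear at that path. A quick, hypothetical way to check
whether anything was written there (bucket and launch id are placeholders):

gsutil ls "gs://<staging-bucket>/staging/template_launches/<launch-id>/job_object"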

Until we get that sorted out you will have 2 jobs where one will fail on
every run, I hope this is acceptable for now.

Cheers,
Hans

[1] https://github.com/apache/hop/pull/1644


On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <ha...@gmail.com>
wrote:

> Hi Fabian,
>
> I've been digging into this a bit and it seems we will need some code
> changes to make this work.
> As far as I can tell you have to use one of the docker templates Google
> provides to start a pipeline from a template.
> The issue we have is that our MainBeam class requires 3 arguments to work
> (filename/metadata/run configuration name).
> These 3 arguments need to be the 3 first arguments passed to the class, we
> have no named parameters implemented.
>
> When the template launches it calls java in the following way:
>
> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam
> --pipelineLocation=test --runner=DataflowRunner --project=xxx
> --templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object
> --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={
> "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257
> --region=us-central1 --serviceAccount=
> xxxx-compute@developer.gserviceaccount.com
> --tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp
>
> In this case it will see the first 3 arguments and select them.
> [image: image.png]
>
> As I can not find a way to force those 3 arguments in there we will need
> to implement named parameters in that class, I tried a bit of a hack but it
> did not work, I changed the docker template to the following but the Google
> script then throws an error:
>
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam
> gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
>
> As I think this will have great added value, I will work on this ASAP.
> When the work has been done we can even supply the image required from our
> DockerHub Account and you should be able to run Hop pipelines in dataflow
> by using a simple template.
>
> My idea will be to add support for the following 3 named parameters:
>  - HopPipelinePath -> location of the pipeline (can be Google Storage)
>  - HopMetadataPath -> location of the metadata file (can be Google storage)
>  - HopRunConfigurationName
>
> I'll post updates here on the progress.
>
> Cheers,
> Hans
>
> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <po...@mercadu.de> wrote:
>
>> Hi Hans,
>>
>> No, I didn't yet have another go. The hints from Matt (didn't see that
>> mail on the list?) do look quite useful in the context of Dataflow templates.
>> I'll try to see whether I can get a bit further, but if you have time to
>> have a look at it, I'd much appreciate!
>>
>> cheers
>>
>> Fabian
>>
>> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <
>> hans.van.akelyen@gmail.com>:
>>
>> Hi Fabian,
>>
>> Did you get this working and are you willing to share the final results?
>> If not I will see what I can do, and we can add it to our documentation.
>>
>> Cheers,
>> Hans
>>
>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <ma...@neo4j.com>
>> wrote:
>>
>>> When you run class org.apache.hop.beam.run.MainBeam you need to provide
>>> 3 arguments to run:
>>>
>>> 1. The filename of the pipeline to run
>>> 2. The filename which contains Hop metadata
>>> 3. The name of the pipeline run configuration to use
>>>
>>> See also for example:
>>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>>>
>>> Good luck,
>>> Matt
>>>
>>>
>>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <po...@mercadu.de> wrote:
>>>
>>>> Hello Hans,
>>>>
>>>> I went through the flex-template process yesterday but the generated
>>>> template does not work. The main piece that's missing for me is how to pass
>>>> the actual pipeline that should be run. My test boiled down to:
>>>>
>>>> gcloud dataflow flex-template build
>>>> gs://foo_ag_dataflow/tmp/todays-directories.json \
>>>>       --image-gcr-path "
>>>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>>>       --sdk-language "JAVA" \
>>>>       --flex-template-base-image JAVA11 \
>>>>       --metadata-file
>>>> "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json"
>>>> \
>>>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>>>       --env
>>>> FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>>
>>>> gcloud dataflow flex-template run "todays-directories-`date
>>>> +%Y%m%d-%H%M%S`" \
>>>>     --template-file-gcs-location "
>>>> gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>>>     --region "europe-west1"
>>>>
>>>> With Dockerfile:
>>>>
>>>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>>>
>>>> ARG WORKDIR=/dataflow/template
>>>> RUN mkdir -p ${WORKDIR}
>>>> WORKDIR ${WORKDIR}
>>>>
>>>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>>>
>>>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>>>
>>>>
>>>> And "todays-directories.json":
>>>>
>>>> {
>>>>     "defaultEnvironment": {},
>>>>     "image": "
>>>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>>>     "metadata": {
>>>>         "description": "Test templates creation with Apache Hop",
>>>>         "name": "Todays directories"
>>>>     },
>>>>     "sdkInfo": {
>>>>         "language": "JAVA"
>>>>     }
>>>> }
>>>>
>>>> Thanks for having a look at it!
>>>>
>>>> cheers
>>>>
>>>> Fabian
>>>>
>>>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <
>>>> hans.van.akelyen@gmail.com>:
>>>>
>>>> Hi Fabian,
>>>>
>>>> You have indeed found something we have not yet documented, mainly
>>>> because we have not yet tried it out ourselves.
>>>> The main class that gets called when running Beam pipelines is
>>>> "org.apache.hop.beam.run.MainBeam".
>>>>
>>>> I was hoping the "Import as pipeline" button on a job would give you
>>>> everything you need to execute this but it does not.
>>>> I'll take a closer look the following days to see what is needed to use
>>>> this functionality, could be that we need to export the template based on a
>>>> pipeline.
>>>>
>>>> Kr,
>>>> Hans
>>>>
>>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <po...@mercadu.de> wrote:
>>>>
>>>>> Hi all!
>>>>>
>>>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs
>>>>> to Dataflow.
>>>>>
>>>>> Next, I'd like to schedule a batch job
>>>>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>>>>> but for this I need to create a template
>>>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>>>>> I've searched the Hop documentation but haven't found anything on this. I'm
>>>>> guessing that flex-templates
>>>>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are
>>>>> the way to go, due to the fat-jar, but I'm wondering what to pass as
>>>>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>>>
>>>>> cheers
>>>>>
>>>>> Fabian
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Neo4j Chief Solutions Architect
>>> *✉   *matt.casters@neo4j.com
>>>
>>>
>>>
>>>
>>

Re: Dataflow template creation

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Fabian,

I've been digging into this a bit and it seems we will need some code
changes to make this work.
As far as I can tell, you have to use one of the Docker templates Google
provides to start a pipeline from a template.
The issue we have is that our MainBeam class requires 3 arguments to work
(filename/metadata/run configuration name).
These need to be the first 3 arguments passed to the class; we have no
named parameters implemented.

When the template launches it calls java in the following way:

Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam
--pipelineLocation=test --runner=DataflowRunner --project=xxx
--templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object
--stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={
"goog-data-pipelines" : "test" } --jobName=test-mp--1660815257
--region=us-central1 --serviceAccount=
xxxx-compute@developer.gserviceaccount.com
--tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp

In this case it will see the first 3 arguments (--pipelineLocation,
--runner and --project) and select them as the pipeline, the metadata file
and the run configuration name.

As I cannot find a way to force those 3 arguments in there, we will need
to implement named parameters in that class. I tried a bit of a hack, but
it did not work: I changed the Docker template to the following, and the
Google script then throws an error:

ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam
gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"

As I think this will have great added value, I will work on this ASAP.
Once the work is done we can even supply the required image from our
Docker Hub account, and you should be able to run Hop pipelines on
Dataflow using a simple template (a sketch of what I have in mind is
below).
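
To be clear, nothing below exists today. It is only a sketch of what such
a "simple template" flow could look like; the image name, the metadata
file and the bucket path are all made up:

gcloud dataflow flex-template build gs://<your-bucket>/templates/hop-template.json \
      --image "docker.io/apache/hop-dataflow-template:latest" \
      --sdk-language "JAVA" \
      --metadata-file "hop-template-metadata.json"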

My idea is to add support for the following 3 named parameters:
 - HopPipelinePath -> location of the pipeline (can be Google Storage)
 - HopMetadataPath -> location of the metadata file (can be Google storage)
 - HopRunConfigurationName
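
Launching a job would then look roughly like this (again only a sketch:
the parameter names are the ones proposed above and the paths are
placeholders, nothing of this exists yet):

gcloud dataflow flex-template run "hop-pipeline-`date +%Y%m%d-%H%M%S`" \
    --template-file-gcs-location "gs://<your-bucket>/templates/hop-template.json" \
    --region "europe-west1" \
    --parameters HopPipelinePath="gs://<your-bucket>/pipelines/my-pipeline.hpl" \
    --parameters HopMetadataPath="gs://<your-bucket>/hop-metadata.json" \
    --parameters HopRunConfigurationName="Dataflow"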

I'll post updates here on the progress.

Cheers,
Hans

On Tue, 16 Aug 2022 at 11:36, Fabian Peters <po...@mercadu.de> wrote:

> Hi Hans,
>
> No, I didn't yet have another go. The hints from Matt (didn't see that
> mail on the list?) do look quite useful in the context of Dataflow templates.
> I'll try to see whether I can get a bit further, but if you have time to
> have a look at it, I'd much appreciate!
>
> cheers
>
> Fabian
>
> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <
> hans.van.akelyen@gmail.com>:
>
> Hi Fabian,
>
> Did you get this working and are you willing to share the final results?
> If not I will see what I can do, and we can add it to our documentation.
>
> Cheers,
> Hans
>
> On Thu, 11 Aug 2022 at 13:14, Matt Casters <ma...@neo4j.com> wrote:
>
>> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3
>> arguments to run:
>>
>> 1. The filename of the pipeline to run
>> 2. The filename which contains Hop metadata
>> 3. The name of the pipeline run configuration to use
>>
>> See also for example:
>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>>
>> Good luck,
>> Matt
>>
>>
>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <po...@mercadu.de> wrote:
>>
>>> Hello Hans,
>>>
>>> I went through the flex-template process yesterday but the generated
>>> template does not work. The main piece that's missing for me is how to pass
>>> the actual pipeline that should be run. My test boiled down to:
>>>
>>> gcloud dataflow flex-template build
>>> gs://foo_ag_dataflow/tmp/todays-directories.json \
>>>       --image-gcr-path "
>>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>>       --sdk-language "JAVA" \
>>>       --flex-template-base-image JAVA11 \
>>>       --metadata-file
>>> "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json"
>>> \
>>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>>       --env
>>> FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>
>>> gcloud dataflow flex-template run "todays-directories-`date
>>> +%Y%m%d-%H%M%S`" \
>>>     --template-file-gcs-location "
>>> gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>>     --region "europe-west1"
>>>
>>> With Dockerfile:
>>>
>>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>>
>>> ARG WORKDIR=/dataflow/template
>>> RUN mkdir -p ${WORKDIR}
>>> WORKDIR ${WORKDIR}
>>>
>>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>>
>>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>>
>>>
>>> And "todays-directories.json":
>>>
>>> {
>>>     "defaultEnvironment": {},
>>>     "image": "
>>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>>     "metadata": {
>>>         "description": "Test templates creation with Apache Hop",
>>>         "name": "Todays directories"
>>>     },
>>>     "sdkInfo": {
>>>         "language": "JAVA"
>>>     }
>>> }
>>>
>>> Thanks for having a look at it!
>>>
>>> cheers
>>>
>>> Fabian
>>>
>>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <
>>> hans.van.akelyen@gmail.com>:
>>>
>>> Hi Fabian,
>>>
>>> You have indeed found something we have not yet documented, mainly
>>> because we have not yet tried it out ourselves.
>>> The main class that gets called when running Beam pipelines is
>>> "org.apache.hop.beam.run.MainBeam".
>>>
>>> I was hoping the "Import as pipeline" button on a job would give you
>>> everything you need to execute this but it does not.
>>> I'll take a closer look the following days to see what is needed to use
>>> this functionality, could be that we need to export the template based on a
>>> pipeline.
>>>
>>> Kr,
>>> Hans
>>>
>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <po...@mercadu.de> wrote:
>>>
>>>> Hi all!
>>>>
>>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to
>>>> Dataflow.
>>>>
>>>> Next, I'd like to schedule a batch job
>>>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>>>> but for this I need to create a
>>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>
>>>> template
>>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>>>> I've searched the Hop documentation but haven't found anything on this. I'm
>>>> guessing that flex-templates
>>>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are
>>>> the way to go, due to the fat-jar, but I'm wondering what to pass as
>>>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>>
>>>> cheers
>>>>
>>>> Fabian
>>>>
>>>
>>>
>>
>> --
>> Neo4j Chief Solutions Architect
>> *✉   *matt.casters@neo4j.com
>>
>>
>>
>>
>

Re: Dataflow template creation

Posted by Fabian Peters <po...@mercadu.de>.
Hi Hans,

No, I didn't yet have another go. The hints from Matt (didn't see that mail on the list?) do look quite useful in the context of Dataflow templates. I'll try to see whether I can get a bit further, but if you have time to have a look at it, I'd much appreciate it!

cheers

Fabian

> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <ha...@gmail.com>:
> 
> Hi Fabian,
> 
> Did you get this working and are you willing to share the final results?
> If not I will see what I can do, and we can add it to our documentation.
> 
> Cheers,
> Hans
> 
> On Thu, 11 Aug 2022 at 13:14, Matt Casters <matt.casters@neo4j.com <ma...@neo4j.com>> wrote:
> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3 arguments to run:
> 
> 1. The filename of the pipeline to run
> 2. The filename which contains Hop metadata
> 3. The name of the pipeline run configuration to use
> 
> See also for example: https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run <https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run>
> 
> Good luck,
> Matt
> 
> 
> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
> Hello Hans,
> 
> I went through the flex-template process yesterday but the generated template does not work. The main piece that's missing for me is how to pass the actual pipeline that should be run. My test boiled down to:
> 
> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json <> \
>       --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest <http://europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest>" \
>       --sdk-language "JAVA" \
>       --flex-template-base-image JAVA11 \
>       --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>       --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
> 
> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>     --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json <>" \
>     --region "europe-west1"
> 
> With Dockerfile:
> 
> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base <http://gcr.io/dataflow-templates-base/java11-template-launcher-base>
> 
> ARG WORKDIR=/dataflow/template
> RUN mkdir -p ${WORKDIR}
> WORKDIR ${WORKDIR}
> 
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
> 
> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
> 
> 
> And "todays-directories.json":
> 
> {
>     "defaultEnvironment": {},
>     "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest <http://europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest>",
>     "metadata": {
>         "description": "Test templates creation with Apache Hop",
>         "name": "Todays directories"
>     },
>     "sdkInfo": {
>         "language": "JAVA"
>     }
> }
> 
> Thanks for having a look at it!
> 
> cheers
> 
> Fabian
> 
>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <hans.van.akelyen@gmail.com <ma...@gmail.com>>:
>> 
>> Hi Fabian,
>> 
>> You have indeed found something we have not yet documented, mainly because we have not yet tried it out ourselves.
>> The main class that gets called when running Beam pipelines is "org.apache.hop.beam.run.MainBeam".
>> 
>> I was hoping the "Import as pipeline" button on a job would give you everything you need to execute this but it does not.
>> I'll take a closer look the following days to see what is needed to use this functionality, could be that we need to export the template based on a pipeline.
>> 
>> Kr,
>> Hans
>> 
>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
>> Hi all!
>> 
>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to Dataflow.
>> 
>> Next, I'd like to schedule a batch job <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>, but for this I need to create a  <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>template <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>. I've searched the Hop documentation but haven't found anything on this. I'm guessing that flex-templates <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are the way to go, due to the fat-jar, but I'm wondering what to pass as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>> 
>> cheers
>> 
>> Fabian
> 
> 
> 
> -- 
> Neo4j Chief Solutions Architect
> ✉   matt.casters@neo4j.com <ma...@neo4j.com>
> 
> 
> 


Re: Dataflow template creation

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Fabian,

Did you get this working and are you willing to share the final results?
If not I will see what I can do, and we can add it to our documentation.

Cheers,
Hans

On Thu, 11 Aug 2022 at 13:14, Matt Casters <ma...@neo4j.com> wrote:

> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3
> arguments to run:
>
> 1. The filename of the pipeline to run
> 2. The filename which contains Hop metadata
> 3. The name of the pipeline run configuration to use
>
> See also for example:
> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>
> Good luck,
> Matt
>
>
> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <po...@mercadu.de> wrote:
>
>> Hello Hans,
>>
>> I went through the flex-template process yesterday but the generated
>> template does not work. The main piece that's missing for me is how to pass
>> the actual pipeline that should be run. My test boiled down to:
>>
>> gcloud dataflow flex-template build
>> gs://foo_ag_dataflow/tmp/todays-directories.json \
>>       --image-gcr-path "
>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>       --sdk-language "JAVA" \
>>       --flex-template-base-image JAVA11 \
>>       --metadata-file
>> "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json"
>> \
>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>       --env
>> FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>
>> gcloud dataflow flex-template run "todays-directories-`date
>> +%Y%m%d-%H%M%S`" \
>>     --template-file-gcs-location "
>> gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>     --region "europe-west1"
>>
>> With Dockerfile:
>>
>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>
>> ARG WORKDIR=/dataflow/template
>> RUN mkdir -p ${WORKDIR}
>> WORKDIR ${WORKDIR}
>>
>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>
>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>
>>
>> And "todays-directories.json":
>>
>> {
>>     "defaultEnvironment": {},
>>     "image": "
>> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>     "metadata": {
>>         "description": "Test templates creation with Apache Hop",
>>         "name": "Todays directories"
>>     },
>>     "sdkInfo": {
>>         "language": "JAVA"
>>     }
>> }
>>
>> Thanks for having a look at it!
>>
>> cheers
>>
>> Fabian
>>
>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <
>> hans.van.akelyen@gmail.com>:
>>
>> Hi Fabian,
>>
>> You have indeed found something we have not yet documented, mainly
>> because we have not yet tried it out ourselves.
>> The main class that gets called when running Beam pipelines is
>> "org.apache.hop.beam.run.MainBeam".
>>
>> I was hoping the "Import as pipeline" button on a job would give you
>> everything you need to execute this but it does not.
>> I'll take a closer look the following days to see what is needed to use
>> this functionality, could be that we need to export the template based on a
>> pipeline.
>>
>> Kr,
>> Hans
>>
>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <po...@mercadu.de> wrote:
>>
>>> Hi all!
>>>
>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to
>>> Dataflow.
>>>
>>> Next, I'd like to schedule a batch job
>>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>>> but for this I need to create a
>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>
>>> template
>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>>> I've searched the Hop documentation but haven't found anything on this. I'm
>>> guessing that flex-templates
>>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are
>>> the way to go, due to the fat-jar, but I'm wondering what to pass as
>>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>
>>> cheers
>>>
>>> Fabian
>>>
>>
>>
>
> --
> Neo4j Chief Solutions Architect
> *✉   *matt.casters@neo4j.com
>
>
>
>

Re: Dataflow template creation

Posted by Matt Casters <ma...@neo4j.com>.
When you run the class org.apache.hop.beam.run.MainBeam, you need to
provide 3 arguments:

1. The filename of the pipeline to run
2. The filename which contains Hop metadata
3. The name of the pipeline run configuration to use

See also for example:
https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
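
With the Hop fat jar that boils down to something along these lines (the
jar location, the file names and the "Dataflow" run configuration name are
placeholders for your own):

java -cp hop-fatjar.jar org.apache.hop.beam.run.MainBeam \
    /path/to/pipeline.hpl \
    /path/to/hop-metadata.json \
    Dataflow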

Good luck,
Matt


On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <po...@mercadu.de> wrote:

> Hello Hans,
>
> I went through the flex-template process yesterday but the generated
> template does not work. The main piece that's missing for me is how to pass
> the actual pipeline that should be run. My test boiled down to:
>
> gcloud dataflow flex-template build
> gs://foo_ag_dataflow/tmp/todays-directories.json \
>       --image-gcr-path "
> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>       --sdk-language "JAVA" \
>       --flex-template-base-image JAVA11 \
>       --metadata-file
> "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json"
> \
>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>       --env
> FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>
> gcloud dataflow flex-template run "todays-directories-`date
> +%Y%m%d-%H%M%S`" \
>     --template-file-gcs-location "
> gs://foo_ag_dataflow/tmp/todays-directories.json" \
>     --region "europe-west1"
>
> With Dockerfile:
>
> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>
> ARG WORKDIR=/dataflow/template
> RUN mkdir -p ${WORKDIR}
> WORKDIR ${WORKDIR}
>
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>
> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>
>
> And "todays-directories.json":
>
> {
>     "defaultEnvironment": {},
>     "image": "
> europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>     "metadata": {
>         "description": "Test templates creation with Apache Hop",
>         "name": "Todays directories"
>     },
>     "sdkInfo": {
>         "language": "JAVA"
>     }
> }
>
> Thanks for having a look at it!
>
> cheers
>
> Fabian
>
> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <
> hans.van.akelyen@gmail.com>:
>
> Hi Fabian,
>
> You have indeed found something we have not yet documented, mainly because
> we have not yet tried it out ourselves.
> The main class that gets called when running Beam pipelines is
> "org.apache.hop.beam.run.MainBeam".
>
> I was hoping the "Import as pipeline" button on a job would give you
> everything you need to execute this but it does not.
> I'll take a closer look the following days to see what is needed to use
> this functionality, could be that we need to export the template based on a
> pipeline.
>
> Kr,
> Hans
>
> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <po...@mercadu.de> wrote:
>
>> Hi all!
>>
>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to
>> Dataflow.
>>
>> Next, I'd like to schedule a batch job
>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>> but for this I need to create a
>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>
>> template
>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>> I've searched the Hop documentation but haven't found anything on this. I'm
>> guessing that flex-templates
>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are
>> the way to go, due to the fat-jar, but I'm wondering what to pass as
>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>
>> cheers
>>
>> Fabian
>>
>
>

-- 
Neo4j Chief Solutions Architect
✉  matt.casters@neo4j.com

Re: Dataflow template creation

Posted by Fabian Peters <po...@mercadu.de>.
Hello Hans,

I went through the flex-template process yesterday, but the generated template does not work. The main piece that's missing for me is how to pass the actual pipeline that should be run. My test boiled down to:

gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json \
      --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
      --sdk-language "JAVA" \
      --flex-template-base-image JAVA11 \
      --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
      --jar "/Users/fabian/tmp/fat-hop.jar" \
      --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"

gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
    --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json" \
    --region "europe-west1"

With Dockerfile:

FROM gcr.io/dataflow-templates-base/java11-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"

ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]


And "todays-directories.json":

{
    "defaultEnvironment": {},
    "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
    "metadata": {
        "description": "Test templates creation with Apache Hop",
        "name": "Todays directories"
    },
    "sdkInfo": {
        "language": "JAVA"
    }
}

Thanks for having a look at it!

cheers

Fabian

> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <ha...@gmail.com>:
> 
> Hi Fabian,
> 
> You have indeed found something we have not yet documented, mainly because we have not yet tried it out ourselves.
> The main class that gets called when running Beam pipelines is "org.apache.hop.beam.run.MainBeam".
> 
> I was hoping the "Import as pipeline" button on a job would give you everything you need to execute this but it does not.
> I'll take a closer look the following days to see what is needed to use this functionality, could be that we need to export the template based on a pipeline.
> 
> Kr,
> Hans
> 
> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <post@mercadu.de <ma...@mercadu.de>> wrote:
> Hi all!
> 
> Thanks to Hans' work on the REST transform, I can now deploy my jobs to Dataflow.
> 
> Next, I'd like to schedule a batch job <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>, but for this I need to create a  <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>template <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>. I've searched the Hop documentation but haven't found anything on this. I'm guessing that flex-templates <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are the way to go, due to the fat-jar, but I'm wondering what to pass as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
> 
> cheers
> 
> Fabian


Re: Dataflow template creation

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Fabian,

You have indeed found something we have not yet documented, mainly because
we have not yet tried it out ourselves.
The main class that gets called when running Beam pipelines is
"org.apache.hop.beam.run.MainBeam".

I was hoping the "Import as pipeline" button on a job would give you
everything you need to execute this but it does not.
I'll take a closer look the following days to see what is needed to use
this functionality, could be that we need to export the template based on a
pipeline.

Kr,
Hans

On Wed, 10 Aug 2022 at 15:46, Fabian Peters <po...@mercadu.de> wrote:

> Hi all!
>
> Thanks to Hans' work on the REST transform, I can now deploy my jobs to
> Dataflow.
>
> Next, I'd like to schedule a batch job
> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
> but for this I need to create a
> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>
> template
> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
> I've searched the Hop documentation but haven't found anything on this. I'm
> guessing that flex-templates
> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template> are
> the way to go, due to the fat-jar, but I'm wondering what to pass as
> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>
> cheers
>
> Fabian
>