You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2023/01/06 12:01:34 UTC

[GitHub] [airflow] johannaojeling opened a new pull request, #28764: Add support for running a Beam Go pipeline with an executable binary

johannaojeling opened a new pull request, #28764:
URL: https://github.com/apache/airflow/pull/28764

   The `BeamRunGoPipelineOperator` currently has a `go_file` parameter, which represents the path to a Go source file with the pipeline code. The operator starts the pipeline with `go run`, i.e. compiles the code into a temporary binary and executes.  
   
   This PR adds support for the operator to start the pipeline with an already compiled binary, as an alternative to the source file approach. It introduces two new parameters:
   1. `go_binary` path to a binary compiled for the launching platform, i.e. the platform where Airflow is deployed
   2. `worker_binary` (optional) path to a binary compiled for the worker platform if using a remote runner
   
   Some motivations to introduce this feature:
   - It does not require a Go installation on the system where Airflow is run (which is more similar to how the `BeamRunJavaPipelineOperator` works, running a jar)
   - It does not involve the extra steps of initializing a Go module, installing dependences and compiling the code every task run, which is what currently happens when the Go source file is downloaded from GCS
   - In the current implementation only a single Go source file can downloaded from GCS. This can be limiting if the project comprises multiple files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1064353800


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -563,8 +582,13 @@ def __init__(
             )
         self.dataflow_support_impersonation = False
 
+        if not bool(go_file) ^ bool(go_binary):

Review Comment:
   But this check likely also needs to include `worker_binary`? Right now it’s possible to pass in `go_binary` without `worker_binary`, which would cause a weird error in the worker.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] johannaojeling commented on pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
johannaojeling commented on PR #28764:
URL: https://github.com/apache/airflow/pull/28764#issuecomment-1384389915

   @uranusjr can this be merged?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] kaxil merged pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
kaxil merged PR #28764:
URL: https://github.com/apache/airflow/pull/28764


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1068967457


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -520,12 +524,26 @@ class BeamRunGoPipelineOperator(BeamBasePipelineOperator):
         For more detail on Apache Beam have a look at the reference:
         https://beam.apache.org/documentation/
 
-    :param go_file: Reference to the Go Apache Beam pipeline e.g.,
-        /some/local/file/path/to/your/go/pipeline/file.go
+    :param go_file: Reference to the Apache Beam pipeline Go source file,
+        e.g. /local/path/to/main.go or gs://bucket/path/to/main.go.
+        Exactly one of go_file and go_binary must be provided.
+
+    :param go_binary: Reference to the Apache Beam pipeline Go binary compiled for the launching platform,
+        e.g. /local/path/to/launcher-main or gs://bucket/path/to/launcher-main.
+        Exactly one of go_file and go_binary must be provided.
+
+    :param worker_binary: Reference to the Apache Beam pipeline Go binary compiled for the worker platform,
+        e.g. /local/path/to/worker-main or gs://bucket/path/to/worker-main.
+        Needed if the OS or architecture of the workers running the pipeline is different from that
+        of the platform launching the pipeline. If not set, will default to the value of go_binary.
+        For more information, see the Apache Beam documentation for Go cross compilation:
+        https://beam.apache.org/documentation/sdks/go-cross-compilation/

Review Comment:
   Need to mention this has no effect is `go_binary` is not set. Maybe an alternative would to make `go_binary` accept a `str | tuple[str, str]` and get rid of this parameter altogether?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] johannaojeling commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
johannaojeling commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1069027057


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -520,12 +524,26 @@ class BeamRunGoPipelineOperator(BeamBasePipelineOperator):
         For more detail on Apache Beam have a look at the reference:
         https://beam.apache.org/documentation/
 
-    :param go_file: Reference to the Go Apache Beam pipeline e.g.,
-        /some/local/file/path/to/your/go/pipeline/file.go
+    :param go_file: Reference to the Apache Beam pipeline Go source file,
+        e.g. /local/path/to/main.go or gs://bucket/path/to/main.go.
+        Exactly one of go_file and go_binary must be provided.
+
+    :param go_binary: Reference to the Apache Beam pipeline Go binary compiled for the launching platform,
+        e.g. /local/path/to/launcher-main or gs://bucket/path/to/launcher-main.
+        Exactly one of go_file and go_binary must be provided.
+
+    :param worker_binary: Reference to the Apache Beam pipeline Go binary compiled for the worker platform,
+        e.g. /local/path/to/worker-main or gs://bucket/path/to/worker-main.
+        Needed if the OS or architecture of the workers running the pipeline is different from that
+        of the platform launching the pipeline. If not set, will default to the value of go_binary.
+        For more information, see the Apache Beam documentation for Go cross compilation:
+        https://beam.apache.org/documentation/sdks/go-cross-compilation/

Review Comment:
   Thanks, I think it is clearer to have two different parameters, but agree the effect of specifying the launcher vs worker binary needs clarification. Updated in the latest commit. I also renamed the `go_binary` param to `launcher_binary` to be even more explicit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1068967457


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -520,12 +524,26 @@ class BeamRunGoPipelineOperator(BeamBasePipelineOperator):
         For more detail on Apache Beam have a look at the reference:
         https://beam.apache.org/documentation/
 
-    :param go_file: Reference to the Go Apache Beam pipeline e.g.,
-        /some/local/file/path/to/your/go/pipeline/file.go
+    :param go_file: Reference to the Apache Beam pipeline Go source file,
+        e.g. /local/path/to/main.go or gs://bucket/path/to/main.go.
+        Exactly one of go_file and go_binary must be provided.
+
+    :param go_binary: Reference to the Apache Beam pipeline Go binary compiled for the launching platform,
+        e.g. /local/path/to/launcher-main or gs://bucket/path/to/launcher-main.
+        Exactly one of go_file and go_binary must be provided.
+
+    :param worker_binary: Reference to the Apache Beam pipeline Go binary compiled for the worker platform,
+        e.g. /local/path/to/worker-main or gs://bucket/path/to/worker-main.
+        Needed if the OS or architecture of the workers running the pipeline is different from that
+        of the platform launching the pipeline. If not set, will default to the value of go_binary.
+        For more information, see the Apache Beam documentation for Go cross compilation:
+        https://beam.apache.org/documentation/sdks/go-cross-compilation/

Review Comment:
   Need to mention this has no affect is `go_binary` is not set. Maybe an alternative would to make `go_binary` accept a `str | tuple[str, str]` and get rid of this parameter altogether?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] johannaojeling commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
johannaojeling commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1065071532


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -563,8 +582,13 @@ def __init__(
             )
         self.dataflow_support_impersonation = False
 
+        if not bool(go_file) ^ bool(go_binary):

Review Comment:
   Ah nice, will update to use the helper. Thanks!
   
   Regarding the worker binary, I think in production settings in many cases the launcher and worker binaries will be the same artifact, compiled for the same platform. So it might make sense to still have the `worker_binary` parameter as optional, but if not set, default it to the same path as the `go_binary`? That way we can ensure the `--worker_binary` is always passed when launching the pipeline and won't run into that error, but we don't require the user to explicitly set the worker_binary if it has the same value as the go_binary or is not needed at all (e.g. if running with DirectRunner the worker_binary is not needed so it might be strange to require it). Thoughts on this?
   
   I will make a suggestion and also change the GCS download method to only download one object if the launcher and worker binaries have the same URI.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1064352909


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -563,8 +582,13 @@ def __init__(
             )
         self.dataflow_support_impersonation = False
 
+        if not bool(go_file) ^ bool(go_binary):

Review Comment:
   There is an `exactly_one` helper for exactly this purpose



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] johannaojeling commented on a diff in pull request #28764: Add support for running a Beam Go pipeline with an executable binary

Posted by GitBox <gi...@apache.org>.
johannaojeling commented on code in PR #28764:
URL: https://github.com/apache/airflow/pull/28764#discussion_r1068951137


##########
airflow/providers/apache/beam/operators/beam.py:
##########
@@ -563,8 +582,13 @@ def __init__(
             )
         self.dataflow_support_impersonation = False
 
+        if not bool(go_file) ^ bool(go_binary):

Review Comment:
   @uranusjr are your happy with the changes or is there anything I should modify?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org