You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/03/23 01:01:36 UTC

[GitHub] [airflow] mik-laj opened a new pull request #7809: Run Dataflow for ML Engine summary in venv

mik-laj opened a new pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809
 
 
   ---
   Issue link: WILL BE INSERTED BY [boring-cyborg](https://github.com/kaxil/boring-cyborg)
   
   Make sure to mark the boxes below before creating PR: [x]
   
   - [X] Description above provides context of the change
   - [X] Unit tests coverage for changes (not needed for documentation changes)
   - [X] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
   - [X] Relevant documentation is updated including usage instructions.
   - [X] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#discussion_r396172233
 
 

 ##########
 File path: airflow/providers/google/cloud/utils/mlengine_operator_utils.py
 ##########
 @@ -225,20 +225,22 @@ def validate_err_and_count(summary):
     metric_fn_encoded = base64.b64encode(dill.dumps(metric_fn, recurse=True)).decode()
     evaluate_summary = DataflowCreatePythonJobOperator(
         task_id=(task_prefix + "-summary"),
-        py_options=["-m"],
-        py_file="airflow.providers.google.cloud.utils.mlengine_prediction_summary",
+        py_file=os.path.join(os.path.dirname(__file__), 'mlengine_prediction_summary.py'),
         dataflow_default_options=dataflow_options,
         options={
             "prediction_path": prediction_path,
             "metric_fn_encoded": metric_fn_encoded,
             "metric_keys": ','.join(metric_keys)
         },
         py_interpreter=py_interpreter,
+        py_requirements=[
+            'apache-beam[gcp]>=2.14.0'
 
 Review comment:
   We install these libraries in an isolated environment outside of user code. In this environment, we only run Airflow code. At this point, Airflow is not a framework, but an application. We do not need an extension point here, i.e. change versions. Version pinning is a good solution here to maintain stability.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on issue #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#issuecomment-603192586
 
 
   @potiuk Can I have your review again? Travis is happy.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#discussion_r396172462
 
 

 ##########
 File path: airflow/providers/google/cloud/utils/mlengine_prediction_summary.py
 ##########
 @@ -156,23 +157,22 @@ def run(argv=None):
         raise ValueError("--metric_fn_encoded must be an encoded callable.")
     metric_keys = known_args.metric_keys.split(",")
 
-    with beam.Pipeline(
-        options=beam.pipeline.PipelineOptions(pipeline_args)) as pipe:
-        # This is apache-beam ptransform's convention
+    with beam.Pipeline(options=beam.pipeline.PipelineOptions(pipeline_args)) as p:
         # pylint: disable=no-value-for-parameter
-        _ = (pipe
-             | "ReadPredictionResult" >> beam.io.ReadFromText(
-                 os.path.join(known_args.prediction_path,
-                              "prediction.results-*-of-*"),
-                 coder=JsonCoder())
-             | "Summary" >> MakeSummary(metric_fn, metric_keys)
-             | "Write" >> beam.io.WriteToText(
-                 os.path.join(known_args.prediction_path,
-                              "prediction.summary.json"),
-                 shard_name_template='',  # without trailing -NNNNN-of-NNNNN.
-                 coder=JsonCoder()))
-        # pylint: enable=no-value-for-parameter
+        prediction_result_pattern = os.path.join(known_args.prediction_path, "prediction.results-*-of-*")
+        prediction_summary_path = os.path.join(known_args.prediction_path, "prediction.summary.json")
+        # This is apache-beam ptransform's convention
+        _ = (
+            p | "ReadPredictionResult" >> beam.io.ReadFromText(prediction_result_pattern, coder=JsonCoder())
+              | "Summary" >> MakeSummary(metric_fn, metric_keys)
+              | "Write" >> beam.io.WriteToText(
+                    prediction_summary_path,
+                    shard_name_template='',  # without trailing -NNNNN-of-NNNNN.
+                    coder=JsonCoder()
+            )
+        )
 
 
 if __name__ == "__main__":
+    logging.getLogger().setLevel(logging.INFO)
 
 Review comment:
   Dataflow does not print anything on the screen by default. Good practice says to configure the logger to be able to track the progress. This code is run in a separate process, so it's safe.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj merged pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj merged pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#discussion_r396339382
 
 

 ##########
 File path: airflow/providers/google/cloud/utils/mlengine_operator_utils.py
 ##########
 @@ -225,20 +225,22 @@ def validate_err_and_count(summary):
     metric_fn_encoded = base64.b64encode(dill.dumps(metric_fn, recurse=True)).decode()
     evaluate_summary = DataflowCreatePythonJobOperator(
         task_id=(task_prefix + "-summary"),
-        py_options=["-m"],
-        py_file="airflow.providers.google.cloud.utils.mlengine_prediction_summary",
+        py_file=os.path.join(os.path.dirname(__file__), 'mlengine_prediction_summary.py'),
         dataflow_default_options=dataflow_options,
         options={
             "prediction_path": prediction_path,
             "metric_fn_encoded": metric_fn_encoded,
             "metric_keys": ','.join(metric_keys)
         },
         py_interpreter=py_interpreter,
+        py_requirements=[
+            'apache-beam[gcp]>=2.14.0'
 
 Review comment:
   They can also prepare this virtual environment in a Docker image and provide it with the installation. In the next step, they can modify this call to use a pre-baked environment.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] potiuk commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#discussion_r396195123
 
 

 ##########
 File path: airflow/providers/google/cloud/utils/mlengine_operator_utils.py
 ##########
 @@ -225,20 +225,22 @@ def validate_err_and_count(summary):
     metric_fn_encoded = base64.b64encode(dill.dumps(metric_fn, recurse=True)).decode()
     evaluate_summary = DataflowCreatePythonJobOperator(
         task_id=(task_prefix + "-summary"),
-        py_options=["-m"],
-        py_file="airflow.providers.google.cloud.utils.mlengine_prediction_summary",
+        py_file=os.path.join(os.path.dirname(__file__), 'mlengine_prediction_summary.py'),
         dataflow_default_options=dataflow_options,
         options={
             "prediction_path": prediction_path,
             "metric_fn_encoded": metric_fn_encoded,
             "metric_keys": ','.join(metric_keys)
         },
         py_interpreter=py_interpreter,
+        py_requirements=[
+            'apache-beam[gcp]>=2.14.0'
 
 Review comment:
   I am not sure if this is possible for example in Composer environment. But I like it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#discussion_r396172233
 
 

 ##########
 File path: airflow/providers/google/cloud/utils/mlengine_operator_utils.py
 ##########
 @@ -225,20 +225,22 @@ def validate_err_and_count(summary):
     metric_fn_encoded = base64.b64encode(dill.dumps(metric_fn, recurse=True)).decode()
     evaluate_summary = DataflowCreatePythonJobOperator(
         task_id=(task_prefix + "-summary"),
-        py_options=["-m"],
-        py_file="airflow.providers.google.cloud.utils.mlengine_prediction_summary",
+        py_file=os.path.join(os.path.dirname(__file__), 'mlengine_prediction_summary.py'),
         dataflow_default_options=dataflow_options,
         options={
             "prediction_path": prediction_path,
             "metric_fn_encoded": metric_fn_encoded,
             "metric_keys": ','.join(metric_keys)
         },
         py_interpreter=py_interpreter,
+        py_requirements=[
+            'apache-beam[gcp]>=2.14.0'
 
 Review comment:
   We install these libraries in an isolated environment outside of user code. In this environment, we only run Airflow code. At this point, Airflow is not a framework, but an ordinary application. We do not need an extension point here, i.e. change versions.v Version pinning is a good solution here to maintain stability.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7809: Run Dataflow for ML Engine summary in venv
URL: https://github.com/apache/airflow/pull/7809#discussion_r396337172
 
 

 ##########
 File path: airflow/providers/google/cloud/utils/mlengine_operator_utils.py
 ##########
 @@ -225,20 +225,22 @@ def validate_err_and_count(summary):
     metric_fn_encoded = base64.b64encode(dill.dumps(metric_fn, recurse=True)).decode()
     evaluate_summary = DataflowCreatePythonJobOperator(
         task_id=(task_prefix + "-summary"),
-        py_options=["-m"],
-        py_file="airflow.providers.google.cloud.utils.mlengine_prediction_summary",
+        py_file=os.path.join(os.path.dirname(__file__), 'mlengine_prediction_summary.py'),
         dataflow_default_options=dataflow_options,
         options={
             "prediction_path": prediction_path,
             "metric_fn_encoded": metric_fn_encoded,
             "metric_keys": ','.join(metric_keys)
         },
         py_interpreter=py_interpreter,
+        py_requirements=[
+            'apache-beam[gcp]>=2.14.0'
 
 Review comment:
   The Cloud Composer team must think of a way to access pip if they want this operator to work.  This team is following changes in MLEngine and is aware of new requirements.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services