Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/03/25 09:37:21 UTC

[GitHub] [airflow] roelhogervorst opened a new pull request #7864: GCP SparkR operator

roelhogervorst opened a new pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864
 
 
   This PR adds a SparkR operator that lets you schedule R and SparkR jobs on a Dataproc cluster. The functionality to run this kind of job already exists in Dataproc, but there is no corresponding operator in Airflow.
   
   ---
   Issue link: WILL BE INSERTED BY [boring-cyborg](https://github.com/kaxil/boring-cyborg)
   
   Make sure to mark the boxes below before creating PR: [x]
   
   - [x] Description above provides context of the change
   - [x] Unit tests coverage for changes (not needed for documentation changes)
   - [ ] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
   - [x] Relevant documentation is updated including usage instructions.
   - [x] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] roelhogervorst commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
roelhogervorst commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r406219150
 
 

 ##########
 File path: airflow/providers/google/cloud/operators/dataproc.py
 ##########
 @@ -1359,6 +1359,128 @@ def execute(self, context):
         super().execute(context)
 
 
+class DataprocSubmitSparkRJobOperator(DataprocJobBaseOperator):
 
 Review comment:
   Is there an example of how to use the generic DataprocJobBaseOperator for all the jobs yet, @turbaszek?


[GitHub] [airflow] mik-laj commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r406233189
 
 

 ##########
 File path: airflow/providers/google/cloud/operators/dataproc.py
 ##########
 @@ -1359,6 +1359,128 @@ def execute(self, context):
         super().execute(context)
 
 
+class DataprocSubmitSparkRJobOperator(DataprocJobBaseOperator):
 
 Review comment:
   https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/example_dags/example_dataproc.py#L141-L166


[GitHub] [airflow] turbaszek commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r398088165
 
 

 ##########
 File path: airflow/providers/google/cloud/operators/dataproc.py
 ##########
 @@ -1359,6 +1359,128 @@ def execute(self, context):
         super().execute(context)
 
 
+class DataprocSubmitSparkRJobOperator(DataprocJobBaseOperator):
 
 Review comment:
   The `DataprocJobBaseOperator`, as well as the other XJobOperators, will be deleted in the future:
   ```python
           # TODO: Remove one day
           warnings.warn(
               "The `{cls}` operator is deprecated, please use `DataprocSubmitJobOperator` instead. You can use"
               " `generate_job` method of `{cls}` to generate dictionary representing your job"
               " and use it with the new operator.".format(cls=type(self).__name__),
               DeprecationWarning,
               stacklevel=1
           )
   ```
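
For reference, a minimal sketch of how a SparkR job might be described as a plain dictionary for `DataprocSubmitJobOperator`. The job type is selected by the key used in the dictionary (`spark_r_job` here), following the Dataproc Job resource. The helper name `build_sparkr_job`, the project, cluster, and bucket names, and the region are all hypothetical placeholders, and the operator call is left as a comment because its keyword arguments may differ between provider versions.

```python
from typing import Any, Dict, List, Optional


def build_sparkr_job(
    project_id: str,
    cluster_name: str,
    main_r_file_uri: str,
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a Dataproc job dictionary whose type is SparkR (hypothetical helper)."""
    # The "spark_r_job" key is what marks this job as a SparkR job.
    spark_r_job: Dict[str, Any] = {"main_r_file_uri": main_r_file_uri}
    if args:
        spark_r_job["args"] = list(args)
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "spark_r_job": spark_r_job,
    }


# Placeholder values; replace with your own project, cluster, and script.
SPARKR_JOB = build_sparkr_job(
    project_id="my-project",
    cluster_name="my-cluster",
    main_r_file_uri="gs://my-bucket/hello_world.R",
)

# Inside a DAG (needs Airflow with the Google provider; not executed here).
# The region keyword is assumed to be `location`; it may be named differently
# in other provider versions.
#
# from airflow.providers.google.cloud.operators.dataproc import (
#     DataprocSubmitJobOperator,
# )
# sparkr_task = DataprocSubmitJobOperator(
#     task_id="sparkr_task",
#     job=SPARKR_JOB,
#     location="europe-west1",
#     project_id="my-project",
# )
```

With this approach no dedicated SparkR operator class is needed: switching to another job type would only mean changing the job-specific key and its contents.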


[GitHub] [airflow] roelhogervorst commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
roelhogervorst commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r406710965
 
 

 ##########
 File path: airflow/providers/google/cloud/operators/dataproc.py
 ##########
 @@ -1359,6 +1359,128 @@ def execute(self, context):
         super().execute(context)
 
 
+class DataprocSubmitSparkRJobOperator(DataprocJobBaseOperator):
 
 Review comment:
   Oh, that is way easier. I'll create a new PR.



[GitHub] [airflow] roelhogervorst commented on issue #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
roelhogervorst commented on issue #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#issuecomment-611474436
 
 
   I'm so sorry @turbaszek, I completely missed your message. Are you saying we can submit jobs without specifying the type of job?


[GitHub] [airflow] mik-laj commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r397782413
 
 

 ##########
 File path: airflow/contrib/operators/dataproc_operator.py
 ##########
 @@ -157,6 +157,22 @@ def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
 
 
+class DataProcSparkROperator(DataprocSubmitSparkRJobOperator):
 
 Review comment:
   This is not necessary. It would only be needed if this operator had been available in Airflow 1.10.x.


[GitHub] [airflow] turbaszek commented on issue #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#issuecomment-610354260
 
 
   Hi @roelhogervorst, any progress? I am happy to help with running system tests for the new SparkR job :)


[GitHub] [airflow] boring-cyborg[bot] commented on issue #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#issuecomment-603739681
 
 
   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, pylint and type annotations). Our [pre-commits](https://github.com/apache/airflow/blob/master/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/master/docs/howto/custom-operator.rst). Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/master/BREEZE.rst) for testing locally, it's a heavy Docker image but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Be sure to read the [Airflow Coding style](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://apache-airflow-slack.herokuapp.com/
   


[GitHub] [airflow] turbaszek commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r398086582
 
 

 ##########
 File path: airflow/providers/google/cloud/operators/dataproc.py
 ##########
 @@ -1359,6 +1359,128 @@ def execute(self, context):
         super().execute(context)
 
 
+class DataprocSubmitSparkRJobOperator(DataprocJobBaseOperator):
 
 Review comment:
   Do we need a custom operator for R jobs? Can't we use the `DataprocSubmitJobOperator`?


[GitHub] [airflow] roelhogervorst commented on a change in pull request #7864: GCP SparkR operator

Posted by GitBox <gi...@apache.org>.
roelhogervorst commented on a change in pull request #7864: GCP SparkR operator
URL: https://github.com/apache/airflow/pull/7864#discussion_r397887221
 
 

 ##########
 File path: airflow/contrib/operators/dataproc_operator.py
 ##########
 @@ -157,6 +157,22 @@ def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
 
 
+class DataProcSparkROperator(DataprocSubmitSparkRJobOperator):
 
 Review comment:
   Thanks, will remove.
