Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/01/06 11:06:34 UTC

[GitHub] [airflow] albertusk95 opened a new pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

albertusk95 opened a new pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075
 
 
   **Problem**
   
   I first tried `SparkSubmitOperator` against a standalone cluster. Unfortunately, the `spark-submit` task failed with the following exception.
   ```
   airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
   ```
   
   My first question was why the master address lacked the `spark://` prefix; the argument should be `--master spark://host:port`. A quick check of the source code showed that adding the scheme isn't handled. See this code snippet: [source](https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py#L171).
   
   After reviewing the subsequent method calls, it turned out that the driver status tracking feature is never used because of the above bug. Look at the following code snippet.
   
   ```python
   def _resolve_should_track_driver_status(self):
       """
       Determines whether or not this hook should poll the spark driver status through
       subsequent spark-submit status requests after the initial spark-submit request
       :return: if the driver status should be tracked
       """
       return ('spark://' in self._connection['master'] and
               self._connection['deploy_mode'] == 'cluster')
   ```
   
   The above method will always return `False` because the Spark master address no longer contains the `spark://` scheme.
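
To make the failure concrete, here is a minimal sketch of that check as a free function. The function name and the `host:7077` values are illustrative assumptions, not taken from the hook.

```python
def should_track_driver_status(connection):
    """Standalone copy of the check quoted above, for illustration only."""
    return ('spark://' in connection['master'] and
            connection['deploy_mode'] == 'cluster')


# Master as it ends up once the scheme has been dropped (example value).
without_scheme = {'master': 'host:7077', 'deploy_mode': 'cluster'}
# Master as spark-submit expects it for a standalone cluster (example value).
with_scheme = {'master': 'spark://host:7077', 'deploy_mode': 'cluster'}

print(should_track_driver_status(without_scheme))  # False
print(should_track_driver_status(with_scheme))     # True
```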
   
   I then looked into the `Connection` module (_airflow.models.connection_) and found that when we provide a URI (e.g. _spark://host:port_), the attributes of the `Connection` object are derived by parsing that URI.
   
   The parsed host is only the hostname, without the scheme, which I consider a fairly critical bug as well.
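
The parsing behaviour described above can be reproduced with the standard library alone. This sketch assumes the `Connection` parsing behaves like `urllib.parse.urlparse`; the URI is an example value.

```python
from urllib.parse import urlparse

uri = 'spark://host:7077'  # example value, not a real cluster
parts = urlparse(uri)

# The scheme is split off from the host during parsing.
print(parts.scheme)    # spark
print(parts.hostname)  # host
print(parts.port)      # 7077

# Rebuilding the master from host and port alone loses the scheme.
master = '{}:{}'.format(parts.hostname, parts.port)
print(master)          # host:7077
```

The last line is the value that ends up after `--master` in the failing command.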
   
   **Proposed Solution**
   
   I think we don't really need the whole URI. When we store the connection data as an environment variable, we could specify the URI parts as JSON instead. This approach mainly sidesteps the URI parsing problem.
   
   In this case, the `conn_id` will still be preserved.
   
   Take a look at the following example (`conn_id` = "spark_default"). For simplicity, let's presume that `extra` is in JSON form.
   
   ```
   AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": <conn_type>, "host":<host>, "port":<port>, "schema":<schema>, "extra":<extra>}'
   ```
   
   Even though this solution could reduce the incorrect results returned by URI parsing, one needs to ensure strictly that each attribute (host, port, scheme, etc.) stores the relevant value. I think that's much easier than writing a correct URI parser. Moreover, this technique gives the connection data builder the same pattern in both database and environment variable modes (both use a structured data specification).
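
As a rough sketch of the JSON form proposed above, the environment variable could be read like this. The helper `connection_fields_from_env` is hypothetical and not part of Airflow, and the field values are example assumptions.

```python
import json
import os


def connection_fields_from_env(conn_id):
    """Hypothetical helper: read AIRFLOW_CONN_<CONN_ID> as a JSON object."""
    raw = os.environ['AIRFLOW_CONN_' + conn_id.upper()]
    fields = json.loads(raw)
    # Keep only the attributes named in the proposal; missing ones become None.
    names = ('conn_type', 'host', 'port', 'schema', 'extra')
    return {name: fields.get(name) for name in names}


# Example values only. Here the host keeps its scheme, so nothing needs parsing.
os.environ['AIRFLOW_CONN_SPARK_DEFAULT'] = json.dumps(
    {'conn_type': 'spark', 'host': 'spark://host', 'port': 7077})

fields = connection_fields_from_env('spark_default')
print(fields['host'], fields['port'])
```

Because each attribute is read as-is, the scheme survives only if the caller stores it in the right field, which is the trade-off noted above.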
   
   ---
   Link to JIRA issue: https://issues.apache.org/jira/browse/AIRFLOW-6212
   
   - [X] Description above provides context of the change
   - [X] Commit message starts with `[AIRFLOW-NNNN]`, where AIRFLOW-NNNN = JIRA ID*
   - [ ] Unit tests coverage for changes (not needed for documentation changes)
   - [X] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
   - [ ] Relevant documentation is updated including usage instructions.
   - [ ] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   (*) For document-only changes, no JIRA issue is needed. Commit message starts `[AIRFLOW-XXXX]`.
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571429121
 
 
   @tooptoop4 the tests pass only when the connection information (host, port, conn_type, etc.) is stored in the database. I tried this hook with the connection info stored as an environment variable. This failed because the URI parser returned incorrect results for all cluster deployment types (yarn, standalone, mesos, k8s).


[GitHub] [airflow] stale[bot] closed pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
stale[bot] closed pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075
 
 
   


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571424197
 
 
   @tooptoop4 if we don't remove the spark check on line 177, how can this hook track the status of drivers deployed on yarn, mesos, or k8s? I think `spark://` is only for standalone mode.


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571876662
 
 
   > existing tests for connection added via db/cli needs to work
   
   well, I guess the current tests don't support connection added via cli, right?


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571429121
 
 
   @tooptoop4 the tests pass only when the connection information (host, port, conn_type, etc.) is stored in the **database**. I tried this hook with the connection info stored as an **environment variable**.
   
   This failed because the URI parser returned incorrect results for all cluster deployment types. For instance, `URI=spark://host:port` is parsed into `host:port` without the `spark://`. As a result, it raises this exception:
   
   ```
   airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
   ```


[GitHub] [airflow] tooptoop4 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571462616
 
 
   
   
   > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > 
   > Or this hook is created only for standalone mode?
   
   yes. there is no concept of async driver status polling for other modes, read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous. i think u can cancel this


[GitHub] [airflow] tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571462616
 
 
   spark hook works for me and tests pass without this pr. can u send the add connection cli command u create to hit the issue?
   
   > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > 
   > Or this hook is created only for standalone mode?
   
   yes. there is no concept of async driver status polling for other modes, read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous. i think u can cancel this


[GitHub] [airflow] tooptoop4 commented on a change in pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on a change in pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#discussion_r363944283
 
 

 ##########
 File path: airflow/contrib/hooks/spark_submit_hook.py
 ##########
 @@ -190,17 +189,22 @@ def _resolve_connection(self):
             # Master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT and
             # k8s://https://<HOST>:<PORT>
             conn = self.get_connection(self._conn_id)
-            if conn.port:
-                conn_data['master'] = "{}:{}".format(conn.host, conn.port)
+            if conn.conn_type in ['spark', 'mesos']:
 
 Review comment:
   revert this section 192-200


[GitHub] [airflow] tooptoop4 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571462616
 
 
   
   
   > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > 
   > Or this hook is created only for standalone mode?
   
   yes. there is no concept of async driver status polling for other modes, read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous. i think u can cancel this @albertusk95


[GitHub] [airflow] tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571762855
 
 
   > > > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > > > Or this hook is created only for standalone mode?
   > > 
   > > 
   > > yes.there is no concept of async driver status poll for other modes , read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous . i think u can cancel this @albertusk95
   > 
   > I couldn't find any info stating that there's no async driver polling for YARN anyway from the provided link.
   
   There isn't async driver polling in YARN, I know Spark on YARN.


[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571468228
 
 
   > > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > > Or this hook is created only for standalone mode?
   > 
   > yes.there is no concept of async driver status poll for other modes , read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous . i think u can cancel this @albertusk95
   
   I couldn't find any info stating that there's no async driver polling for YARN anyway from the provided link.


[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571876662
 
 
   > existing tests for connection added via db/cli needs to work
   
   well, I think the current tests don't support connection added via cli, right?


[GitHub] [airflow] tooptoop4 commented on a change in pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on a change in pull request #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#discussion_r363943123
 
 

 ##########
 File path: airflow/contrib/hooks/spark_submit_hook.py
 ##########
 @@ -174,8 +174,7 @@ def _resolve_should_track_driver_status(self):
         subsequent spark-submit status requests after the initial spark-submit request
         :return: if the driver status should be tracked
         """
-        return ('spark://' in self._connection['master'] and
-                self._connection['deploy_mode'] == 'cluster')
+        return self._connection['deploy_mode'] == 'cluster'
 
 Review comment:
   return (('spark://' in self._connection['master'] or conn_type starts with spark) and self._connection['deploy_mode'] == 'cluster')
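
The condition suggested in this review comment is written as pseudocode. One possible runnable reading of it is sketched below, assuming `conn_type` is available as a plain string. The function name and arguments are illustrative, not the hook's actual signature.

```python
def should_track_driver_status(master, deploy_mode, conn_type=None):
    """Sketch of the suggested check: standalone master AND cluster deploy mode."""
    is_standalone = ('spark://' in master or
                     (conn_type or '').startswith('spark'))
    return is_standalone and deploy_mode == 'cluster'


# Scheme missing from the master, but the connection type still identifies it.
print(should_track_driver_status('host:7077', 'cluster', conn_type='spark'))  # True
# YARN is not a standalone master, so no driver status polling.
print(should_track_driver_status('yarn', 'cluster', conn_type='yarn'))        # False
```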


[GitHub] [airflow] tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-572250534
 
 
   > > > > > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > > > > > Or this hook is created only for standalone mode?
   > > > > 
   > > > > 
   > > > > yes.there is no concept of async driver status poll for other modes , read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous . i think u can cancel this @albertusk95
   > > > 
   > > > 
   > > > I couldn't find any info stating that there's no async driver polling for YARN anyway from the provided link.
   > > 
   > > 
   > > There isn't async driver polling in YARN, I know Spark on YARN.
   > 
   > How about using Livy to interact with the YARN cluster? I guess it supports sync & async results retrieval.
   
   there is another active pr on livy i saw


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571424197
 
 
   @tooptoop4 if we don't remove the spark check on line 177, how can this hook track the status of drivers deployed on yarn, mesos, or k8s? I think `spark://` is only for standalone mode.
   
   Or this hook is created only for standalone mode?


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571989590
 
 
   > > > > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > > > > Or this hook is created only for standalone mode?
   > > > 
   > > > 
   > > > yes.there is no concept of async driver status poll for other modes , read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous . i think u can cancel this @albertusk95
   > > 
   > > 
   > > I couldn't find any info stating that there's no async driver polling for YARN anyway from the provided link.
   > 
   > There isn't async driver polling in YARN, I know Spark on YARN.
   
   How about using Livy to interact with the YARN cluster? I guess it supports sync & async results retrieval.


[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571462963
 
 
   > @tooptoop4 the tests pass only for the case when the connection information (host, port, conn_type, etc.) are stored in **database**. I tried this hook by storing the connection info as an **environment variable**.
   > 
   > This failed because the URI parser returned irrelevant results for all types of cluster mode deployment. For instance, `URI=spark://host:port` will be parsed into `host:port` without the `spark://`. Obviously it returns this exception:
   > 
   > ```
   > airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
   > ```
   
   @tooptoop4 


[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571989590
 
 
   > > > > @tooptoop4 if we don't remove the spark check on line 177, how to use this hook to track driver status deployed on yarn, mesos, or k8s? Since I think `spark://` is only for standalone mode.
   > > > > Or this hook is created only for standalone mode?
   > > > 
   > > > 
   > > > yes.there is no concept of async driver status poll for other modes , read https://spark.apache.org/docs/latest/running-on-yarn.html ! in other modes the submit to launch is synchronous . i think u can cancel this @albertusk95
   > > 
   > > 
   > > I couldn't find any info stating that there's no async driver polling for YARN anyway from the provided link.
   > 
   > There isn't async driver polling in YARN, I know Spark on YARN.
   
   How about using Livy to interact with the YARN cluster?


[GitHub] [airflow] tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571765352
 
 
   existing tests for connections added via db/cli need to work


[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571424197
 
 
   @tooptoop4 if we don't remove the spark check on line 177, how can this hook track the status of drivers deployed on yarn, mesos, or k8s?


[GitHub] [airflow] tooptoop4 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
tooptoop4 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571765352
 
 
   existing tests for connections added via db/cli need to work


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571429121
 
 
   @tooptoop4 the tests pass only when the connection information (host, port, conn_type, etc.) is stored in the **database**. I tried this hook with the connection info stored as an **environment variable**.
   
   This failed because the URI parser returned incorrect results for all cluster deployment types. For instance, `URI=spark://master-address:port` is parsed into `master-address:port` without the `spark://`. As a result, it raises this exception:
   
   ```
   airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
   ```


[GitHub] [airflow] stale[bot] commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-590000164
 
 
   This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
   


[GitHub] [airflow] albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571470394
 
 
   @tooptoop4 I think you might want to try this sample DAG to reproduce the issue.
   
   a) create an environment var for spark connection.
   ```
   export AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": "spark", "host":<host>, "port":<port>}'
   ```
   
   b) create a DAG file to run
   
   ```python
   from airflow import DAG
   from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
   from datetime import datetime, timedelta
   import os
   
   job_file = 'path/to/job/file'
   
   default_args = {
       'depends_on_past': False,
       'start_date': <fill_start_date>,
       'retries': <fill_retries>,
       'retry_delay': <fill_retry_delay>
   }
   dag = DAG('spark-submit-hook', default_args=default_args, schedule_interval=<fill_interval>)
   
   avg = SparkSubmitOperator(task_id=<fill_task_id>, dag=dag,
                             application=job_file,
                             spark_binary='path/to/spark-submit')
   ```


[GitHub] [airflow] boring-cyborg[bot] commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571099167
 
 
   Congratulations on your first Pull Request and welcome to the Apache Airflow community!
   If you have any issues or are unsure about anything please check our
   Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
   
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://apache-airflow-slack.herokuapp.com/
   


[GitHub] [airflow] albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection

Posted by GitBox <gi...@apache.org>.
albertusk95 edited a comment on issue #7075: [AIRFLOW-6212] SparkSubmitHook resolve connection
URL: https://github.com/apache/airflow/pull/7075#issuecomment-571429121
 
 
   @tooptoop4 the tests pass only when the connection information (host, port, conn_type, etc.) is stored in the **database**. I tried this hook with the connection info stored as an **environment variable**. This failed because the URI parser returned incorrect results for all cluster deployment types (yarn, standalone, mesos, k8s).
