You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/05/31 15:56:52 UTC

[GitHub] [airflow] ywan2017 opened a new pull request #9081: Get spark driver pod status if log stream interrupted accidentally

ywan2017 opened a new pull request #9081:
URL: https://github.com/apache/airflow/pull/9081


   #8963 
   
   ## Description
   
   I am using airflow SparkSubmitOperator to schedule my spark jobs on kubernetes cluster. 
   
   But for some reason, kubernetes often throw 'too old resource version' exception which will interrupt spark watcher, then airflow will lost the log stream and could not get 'Exit Code' eventually. So airflow will mark job failed once log stream lost but the job is still running.
   
   This is  a solution about a simple retry mechanism which is when the log stream is interrupted, then call  method  'read_namespaced_pod()', which is provided by kubernetes client api,  to get spark driver pod status.
   
   ## Target Github ISSUE
   
   https://github.com/apache/airflow/issues/8963
   
   ---
   Make sure to mark the boxes below before creating PR: [x]
   
   - [ ] Description above provides context of the change
   - [ ] Unit tests coverage for changes (not needed for documentation changes)
   - [ ] Target Github ISSUE in description if exists
   - [ ] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
   - [ ] Relevant documentation is updated including usage instructions.
   - [ ] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ywan2017 commented on pull request #9081: Get spark driver pod status if log stream interrupted accidentally

Posted by GitBox <gi...@apache.org>.
ywan2017 commented on pull request #9081:
URL: https://github.com/apache/airflow/pull/9081#issuecomment-648621875


   @ashb would you mind review the new code change? thank you!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] stale[bot] commented on pull request #9081: Get spark driver pod status if log stream interrupted accidentally

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #9081:
URL: https://github.com/apache/airflow/pull/9081#issuecomment-670836627


   This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] stale[bot] closed pull request #9081: Get spark driver pod status if log stream interrupted accidentally

Posted by GitBox <gi...@apache.org>.
stale[bot] closed pull request #9081:
URL: https://github.com/apache/airflow/pull/9081


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ywan2017 commented on a change in pull request #9081: Get spark driver pod status if log stream interrupted accidentally

Posted by GitBox <gi...@apache.org>.
ywan2017 commented on a change in pull request #9081:
URL: https://github.com/apache/airflow/pull/9081#discussion_r441388924



##########
File path: airflow/providers/apache/spark/hooks/spark_submit.py
##########
@@ -404,11 +404,16 @@ def submit(self, application="", **kwargs):
         # Check spark-submit return code. In Kubernetes mode, also check the value
         # of exit code in the log, as it may differ.
         if returncode or (self._is_kubernetes and self._spark_exit_code != 0):
-            raise AirflowException(
-                "Cannot execute: {}. Error code is: {}.".format(
-                    self._mask_cmd(spark_submit_cmd), returncode
+            # double check by spark driver pod status (blocking function)
+            spark_driver_pod_status = self._start_k8s_pod_status_tracking()

Review comment:
       Yes, thanks for that, I've split the 'if' conditions to remove the influence when not in k8s mode.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #9081: Get spark driver pod status if log stream interrupted accidentally

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #9081:
URL: https://github.com/apache/airflow/pull/9081#discussion_r440094842



##########
File path: airflow/providers/apache/spark/hooks/spark_submit.py
##########
@@ -404,11 +404,16 @@ def submit(self, application="", **kwargs):
         # Check spark-submit return code. In Kubernetes mode, also check the value
         # of exit code in the log, as it may differ.
         if returncode or (self._is_kubernetes and self._spark_exit_code != 0):
-            raise AirflowException(
-                "Cannot execute: {}. Error code is: {}.".format(
-                    self._mask_cmd(spark_submit_cmd), returncode
+            # double check by spark driver pod status (blocking function)
+            spark_driver_pod_status = self._start_k8s_pod_status_tracking()

Review comment:
       This is going to fail hard when not in kubenetes mode.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org