Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/11/22 18:08:16 UTC

[GitHub] [airflow] cb149 opened a new issue #19752: Infinite wait in SparkSubmitOperator kill

cb149 opened a new issue #19752:
URL: https://github.com/apache/airflow/issues/19752


   ### Apache Airflow Provider(s)
   
   apache-spark
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-apache-spark 2.0.1
   
   ### Apache Airflow version
   
   2.2.0
   
   ### Operating System
   
   Debian buster
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   Today, my Spark on YARN job ran longer than the `execution_timeout=timedelta(minutes=30)`.
   While Airflow was trying to send the kill signal, kinit failed:
   
   > [2021-11-22, 17:37:57 UTC] {kerberos.py:103} ERROR - Couldn't reinit from keytab! `kinit' exited with 1.
   kinit: Failed to store credentials: Credentials cache permissions incorrect (filename: /var/airflow_krb5_ccache) while getting initial credentials
   
   After that, the task just keeps running forever: no more log output, no success, no failure.
   
   My guess is that the code that runs the YARN kill command waits forever, because no timeout is passed to `wait()`:
   ```python
   with subprocess.Popen(
       kill_cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
   ) as yarn_kill:
       self.log.info("YARN app killed with return code: %s", yarn_kill.wait())
   ```
   and there should be a timeout passed to `yarn_kill.wait()`.
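
   A minimal sketch of what a bounded wait could look like, reusing `kill_cmd`, `env` and `self.log` from the hook's `on_kill` context; the 60-second value is only an illustrative assumption, not something taken from the provider:
   ```python
   import subprocess

   with subprocess.Popen(
       kill_cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
   ) as yarn_kill:
       try:
           # Wait a bounded amount of time for the YARN kill command instead of blocking forever.
           returncode = yarn_kill.wait(timeout=60)
           self.log.info("YARN app killed with return code: %s", returncode)
       except subprocess.TimeoutExpired:
           # Don't hang the task: stop the kill subprocess and log that it timed out.
           yarn_kill.kill()
           self.log.warning("Timed out waiting for the YARN kill command to finish")
   ```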
   
   ### What you expected to happen
   
   If sending the kill signal to spark-submit fails, the task should fail or time out at some point instead of running indefinitely.
   
   ### How to reproduce
   
   Use a SparkSubmitOperator in deploy-mode cluster with master yarn, set an `execution_timeout` shorter than the Spark job needs, and point the Kerberos ccache at a path that is not writable. A sketch of such a task is shown below.
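
   For reference, a minimal DAG sketch of such a setup (the dag id, paths, connection id and principal are all placeholders; the connection is assumed to carry `master=yarn` and `deploy-mode=cluster`):
   ```python
   from datetime import datetime, timedelta

   from airflow import DAG
   from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

   with DAG(
       dag_id="spark_kill_timeout_repro",
       start_date=datetime(2021, 11, 1),
       schedule_interval=None,
   ) as dag:
       SparkSubmitOperator(
           task_id="long_running_spark_job",
           application="/path/to/app.py",       # placeholder Spark application
           conn_id="spark_yarn_cluster",        # placeholder connection with master=yarn, deploy-mode=cluster
           keytab="/path/to/user.keytab",       # placeholder keytab
           principal="user@EXAMPLE.COM",        # placeholder principal
           # shorter than the job actually needs, so Airflow calls on_kill()
           execution_timeout=timedelta(minutes=30),
       )
   ```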
   
   ### Anything else
   
   The log line:
   > [2021-11-22, 17:37:57 UTC] {spark_submit.py:499} INFO - Identified spark driver id:
   
   is written far too often; maybe it would make sense to change:
   
   ```python
   if match:
       self._yarn_application_id = match.groups()[0]
       self.log.info("Identified spark driver id: %s", self._yarn_application_id)
   ```
   
   to something like
   
   ```python
   if match and (
       not self._yarn_application_id
       or self._yarn_application_id != match.groups()[0]
   ):
       self._yarn_application_id = match.groups()[0]
       self.log.info("Identified spark driver id: %s", self._yarn_application_id)
   ```
   so that the log line is only written the first time the application id is identified, or when it changes.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


[GitHub] [airflow] cb149 commented on issue #19752: Infinite wait in SparkSubmitOperator kill

Posted by GitBox <gi...@apache.org>.
cb149 commented on issue #19752:
URL: https://github.com/apache/airflow/issues/19752#issuecomment-987055637


   @potiuk I'll give it a try, though I am also wondering about other parts of the SparkSubmitHook code, e.g. this part in [def on_kill(self)](https://github.com/apache/airflow/blob/866a601b76e219b3c043e1dbbc8fb22300866351/airflow/providers/apache/spark/hooks/spark_submit.py#L637)
   ```python
   if self._keytab is not None and self._principal is not None:
       # we are ignoring renewal failures from renew_from_kt
       # here as the failure could just be due to a non-renewable ticket,
       # we still attempt to kill the yarn application
       renew_from_kt(self._principal, self._keytab, exit_on_fail=False)
       env = os.environ.copy()
       env["KRB5CCNAME"] = airflow_conf.get('kerberos', 'ccache')
   ```
   
   - **renew_from_kt** will fail if there is no `airflow kerberos` ticket renewer running, or if the user hasn't otherwise initialized a credentials cache at the ccache path from the config
   - Since this uses the single ccache from the config, what happens if two or more SparkSubmitOperators with different keytabs time out/get killed at the exact same time? Would the following be possible, or would the scheduler not run the two kill commands in parallel?
   1. SparkSubmitOperator A with keytab/principal for dev_user and SparkSubmitOperator B with keytab/principal for ops_user are killed/timeout at the same time
   2. renew_from_kt for A is called
   3. renew_from_kt for B is called (at the same time as 2., or shortly after it but before 4.)
   4. subprocess with kill_cmd for A is opened (fails because the ccache now holds the ticket for ops_user, who is not allowed to modify YARN jobs of dev_user)
   
   What's the design decision here to use **renew_from_kt** instead of creating/using a temporary ccache location for each YARN kill?
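
   For illustration only, the alternative being asked about could look roughly like this; `_kinit_to_temp_ccache` is a hypothetical helper, not existing provider code:
   ```python
   import os
   import subprocess
   import tempfile

   def _kinit_to_temp_ccache(principal: str, keytab: str) -> dict:
       """Hypothetical helper: obtain a throw-away credentials cache for a single YARN kill.

       This is only a sketch of the per-kill ccache idea, not existing provider code.
       It runs kinit directly with a dedicated KRB5CCNAME instead of reusing the shared
       ccache configured under [kerberos] ccache.
       """
       ccache = tempfile.NamedTemporaryFile(prefix="airflow_krb5_", delete=False).name
       env = os.environ.copy()
       env["KRB5CCNAME"] = ccache
       # kinit -kt <keytab> <principal> writes the ticket into the cache pointed to by KRB5CCNAME.
       subprocess.run(["kinit", "-kt", keytab, principal], env=env, check=True)
       return env
   ```
   The returned `env` could then be passed to the `subprocess.Popen(kill_cmd, env=env, ...)` call, so concurrent kills of jobs submitted with different keytabs would not overwrite each other's tickets.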


[GitHub] [airflow] potiuk commented on issue #19752: Infinite wait in SparkSubmitOperator kill

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #19752:
URL: https://github.com/apache/airflow/issues/19752#issuecomment-986053341


   Maybe you would like to take a stab at fixing both in a PR?


[GitHub] [airflow] eladkal commented on issue #19752: Infinite wait in SparkSubmitOperator kill

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #19752:
URL: https://github.com/apache/airflow/issues/19752#issuecomment-1045973944


   @cb149 you can check the PR that added this functionality: https://github.com/apache/airflow/pull/9044


[GitHub] [airflow] potiuk commented on issue #19752: Infinite wait in SparkSubmitOperator kill

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #19752:
URL: https://github.com/apache/airflow/issues/19752#issuecomment-987088311


   > What's the design decision here to use **renew_from_kt** instead of creating/using a temporary ccache location for each YARN kill?
   
   No idea. I wasn't around when that decision was made :) 

