Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/02/05 13:39:14 UTC

[GitHub] [airflow] nicor88 opened a new issue #14101: Tasks not retried on failures when retries>=1

nicor88 opened a new issue #14101:
URL: https://github.com/apache/airflow/issues/14101


   
   **Apache Airflow version**: 2.0.0
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):  v1.18.15
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AWS (running in k8s cluster provisioned by kops)
   - **OS** (e.g. from /etc/os-release): using Docker image **apache/airflow:2.0.0-python3.8**
   <pre>
   PRETTY_NAME="Debian GNU/Linux 10 (buster)"
   NAME="Debian GNU/Linux"
   VERSION_ID="10"
   VERSION="10 (buster)"
   VERSION_CODENAME=buster
   ID=debian
   HOME_URL="https://www.debian.org/"
   SUPPORT_URL="https://www.debian.org/support"
   BUG_REPORT_URL="https://bugs.debian.org/"
   </pre>
   
   - **Kernel** (e.g. `uname -a`): Linux  5.4.92-flatcar #1 SMP Wed Jan 27 16:53:10 -00 2021 x86_64 GNU/Linux
   - **Install tools**:
   - **Others**:
      * installed snowflake-connector-python==2.3.9
   
   **What happened**:
   From time to time there appears to be a network glitch between the K8s pods and the RDS database (Postgres 11.6). The worker pod fails with the error `Failed to log action with (psycopg2.OperationalError) SSL SYSCALL error: Connection timed out`, but the task (retries=1) is marked as Failed and no retry happens. Since our DAG consists of serial tasks, the whole DAG is marked as FAILED.
   
   
   
   
   **What you expected to happen**:
   
   I expected the task to be retried. We experienced the same issue with Airflow 1.10.14, and there the task was marked as successful on the 2nd retry.
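   
   For illustration, a minimal sketch (hypothetical DAG and task names, not taken from our actual setup) of the kind of configuration involved: a serial task declared with `retries=1`, which we expected Airflow to reschedule after a transient metastore error rather than mark as failed straight away.
   
   ```python
   # Minimal sketch (hypothetical names): a task with retries=1 that we expected
   # to be rescheduled after a transient failure instead of being marked failed.
   from datetime import datetime, timedelta

   from airflow import DAG
   from airflow.operators.python import PythonOperator


   def do_work():
       print("serial task doing work")


   with DAG(
       dag_id="example_serial_dag",  # hypothetical name
       start_date=datetime(2021, 1, 1),
       schedule_interval="@daily",
       default_args={
           "retries": 1,                        # expect one retry on failure
           "retry_delay": timedelta(minutes=5),
       },
   ) as dag:
       PythonOperator(task_id="serial_task", python_callable=do_work)
   ```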
   
   **How to reproduce it**:
   I didn't manage to reproduce it.
   
   
   **Anything else we need to know**:
   The issue happens from time to time; depending on the day, it can occur 2-3 times per day.
   
   





[GitHub] [airflow] waleedsamy edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
waleedsamy edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-846566825


   I'm experiencing a somewhat similar issue. In Airflow 1.10.12, when a task failed, was killed, or the executor (Celery) running it was killed (zombie task?), it would be put up for retry if retries > 1. But this no longer seems to be the case with v2.1.0: when a task is killed, it is marked as failed and is not retried at all.





[GitHub] [airflow] nicor88 edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-933543353


   @ephraimbuddy I can't provide them anymore, as I'm no longer working in the setup where I experienced that error.
   What I remember is that when using the KubernetesExecutor, there were sometimes DNS issues while a task was running and the hostname of the Postgres DB could not be resolved; the task was then immediately marked as failed without being retried. I believe https://github.com/apache/airflow/pull/17819 should actually fix it.





[GitHub] [airflow] waleedsamy edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
waleedsamy edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-846566825


   I'm experiencing a somewhat similar issue. In Airflow 1.10.12, when a task failed, was killed, or the executor (Celery) running it was killed (zombie task?), it would be put up for retry if retries > 1. But this no longer seems to be the case with v2.1.0: when a task is killed, it is marked as failed and is not retried at all.





[GitHub] [airflow] nicor88 commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-774966161


   @tooptoop4 Yes, this seems similar to AIRFLOW-6518. The point is that with version 1.10.14, when the metastore connection was lost (and the task had retries >= 1), the task was able to connect to the metastore DB on the 2nd or 3rd retry. Now, in such a case the task is immediately marked as failed, without retrying at all. So for me this was fixed in previous Airflow versions.





[GitHub] [airflow] waleedsamy commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
waleedsamy commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-846566825


   I'm experiencing a somewhat similar issue. In Airflow 1.10.12, when a task failed, was killed, or the executor running it was killed (zombie task?), it would be put up for retry if retries > 1. But this no longer seems to be the case with v2.1.0: when a task is killed, it is marked as failed and is not retried at all.





[GitHub] [airflow] boring-cyborg[bot] commented on issue #14101: Tasks not retried on failures when retries>=1

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-774038390


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] waleedsamy removed a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
waleedsamy removed a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-846566825


   I'm experiencing a somewhat similar issue. In Airflow 1.10.12, when a task failed, was killed, or the executor (Celery) running it was killed (zombie task?), it would be put up for retry if retries > 1. But this no longer seems to be the case with v2.1.0: when a task is killed, it is marked as failed and is not retried at all.





[GitHub] [airflow] nicor88 edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-774966161


   @tooptoop4 Yes, this seems similar to AIRFLOW-6518. The point is that with version 1.10.14, when the metastore connection was lost (and the task had retries >= 1), the task was able to connect to the metastore DB on the 2nd or 3rd retry. Now, in such a case the task is immediately marked as failed, without retrying at all. So for me this was fixed in previous Airflow versions, and it seems that it no longer works in 2.0.0.
   
   Also, I noticed that the value of `max_db_retries` in our airflow.cfg is set to 3, so I would expect at least 3 attempts to connect to the DB before giving up.
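   
   As a hedged sketch of how one might confirm the effective value (assuming `max_db_retries` sits under `[core]` in the 2.0.x configuration; adjust the section name if your airflow.cfg differs):
   
   ```python
   # Sketch, not from the original report: print the retry setting the
   # scheduler/worker actually sees. Assumption: max_db_retries lives in [core].
   from airflow.configuration import conf

   max_db_retries = conf.getint("core", "max_db_retries", fallback=3)
   print(f"effective max_db_retries = {max_db_retries}")
   ```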





[GitHub] [airflow] ephraimbuddy commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-933457739


   @d3centr your case seems different and is resolved by https://github.com/apache/airflow/pull/16301, released in 2.1.3.
   
   Also, this ticket has been resolved by https://github.com/apache/airflow/pull/17819, which will be released in 2.2.
   
   To confirm, @nicor88, when you get this operational error, do you also see a message like this in the scheduler log:
   `<TaskInstance: somedag.taskid [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?`
   Could you share the scheduler log? Thanks.





[GitHub] [airflow] d3centr commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
d3centr commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-933323630


   This can be reproduced on Airflow 2.1.2 with the K8s executor by setting task retries >= 1. Call `kubectl delete pod` on a task pod while it is running: the task will fail but won't be retried.
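   
   A minimal repro sketch under those assumptions (hypothetical DAG/task names, not from the original report): a long-running task with `retries=1`; while its pod is running, delete the pod with `kubectl delete pod` and check whether the task gets rescheduled.
   
   ```python
   # Repro sketch (hypothetical names) for the scenario described above:
   # run on the Kubernetes executor, then `kubectl delete pod <task-pod>`
   # while sleepy_task is running and watch whether a retry is scheduled.
   import time
   from datetime import datetime, timedelta

   from airflow import DAG
   from airflow.operators.python import PythonOperator


   def sleep_long():
       # Keep the pod alive long enough to delete it by hand.
       time.sleep(600)


   with DAG(
       dag_id="retry_on_pod_kill_repro",  # hypothetical name
       start_date=datetime(2021, 1, 1),
       schedule_interval=None,
       default_args={"retries": 1, "retry_delay": timedelta(minutes=1)},
   ) as dag:
       PythonOperator(task_id="sleepy_task", python_callable=sleep_long)
   ```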





[GitHub] [airflow] nicor88 edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-774966161


   @tooptoop4 Yes, this seems similar to AIRFLOW-6518. The point is that with version 1.10.14, when the metastore connection was lost (and the task had retries >= 1), the task was able to connect to the metastore DB on the 2nd or 3rd retry. Now, in such a case the task is immediately marked as failed, without retrying at all. So for me this was fixed in previous Airflow versions, and it seems that it no longer works in 2.0.0.
   
   Also, I noticed that the value of `max_db_retries` in our airflow.cfg is set to 3, so I would expect at least 3 attempts to connect to the DB before giving up.





[GitHub] [airflow] ashb closed issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
ashb closed issue #14101:
URL: https://github.com/apache/airflow/issues/14101


   





[GitHub] [airflow] ephraimbuddy commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-919081306


   > @ephraimbuddy Can you look at this one too -- it is similar to the process executor events where we should retry instead of fail
   
   Yes, I will take a look.





[GitHub] [airflow] vikramkoka commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
vikramkoka commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-807236423


   @nicor88 In the interest of clarity, which executor are you using?
   I am not sure it makes a difference, but I'm trying to figure out how to reproduce this, and knowing the executor would definitely help.





[GitHub] [airflow] waleedsamy removed a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
waleedsamy removed a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-854966727


   This issue may be a side effect of [d24c1e8a84f72e88b0b9031e2862b866be83831b](https://github.com/apache/airflow/pull/15537/files/d24c1e8a84f72e88b0b9031e2862b866be83831b#diff-d80fa918cc75c4d6aa582d5e29eeb812ba21371d6977fde45a4749668b79a515R159).





[GitHub] [airflow] nicor88 commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-933543353


   @ephraimbuddy I can't provide them anymore, as I'm no longer working in the setup where I experienced that error.





[GitHub] [airflow] kaxil commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-919077937


   @ephraimbuddy Can you look at this one too -- it is similar to the process executor events where we should retry instead of fail





[GitHub] [airflow] vikramkoka commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
vikramkoka commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-807297596


   > @vikramkoka We are using Kubernetes Executor, totally forgot to mention that in the issue.
   
   Thank you @nicor88 





[GitHub] [airflow] nicor88 edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-774966161


   @tooptoop4 Yes, this seems similar to AIRFLOW-6518. The point is that with version 1.10.14, when the metastore connection was lost (and the task had retries >= 1), the task was able to connect to the metastore DB on the 2nd or 3rd retry. Now, in such a case the task is immediately marked as failed, without retrying at all. So for me this was fixed in previous Airflow versions, and it seems that it no longer works in 2.0.0.





[GitHub] [airflow] tooptoop4 commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-774563381


   similar to https://issues.apache.org/jira/browse/AIRFLOW-6518





[GitHub] [airflow] ephraimbuddy edited a comment on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
ephraimbuddy edited a comment on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-933457739


   @d3centr your case seems different and is resolved by https://github.com/apache/airflow/pull/16301, released in 2.1.3.
   
   Also, this ticket has been resolved by https://github.com/apache/airflow/pull/17819, which will be released in 2.2.
   
   To confirm, @nicor88, when you get this operational error, do you also see a message like this in the scheduler log:
   `<TaskInstance: somedag.taskid [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?`
   Could you share the scheduler log and task logs? Maybe I'm mixing things up.





[GitHub] [airflow] ashb commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-934355269


   Yeah, from what you describe, that PR would mean the task is retried correctly.
   
   I'm going to close this issue then, as the fix is included in 2.2.0 (in beta right now).





[GitHub] [airflow] waleedsamy commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
waleedsamy commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-854966727


   This issue may be a side effect of [d24c1e8a84f72e88b0b9031e2862b866be83831b](https://github.com/apache/airflow/pull/15537/files/d24c1e8a84f72e88b0b9031e2862b866be83831b#diff-d80fa918cc75c4d6aa582d5e29eeb812ba21371d6977fde45a4749668b79a515R159).





[GitHub] [airflow] nicor88 commented on issue #14101: Airflow 2.0.0 Tasks not retried on failure when retries>=1 in case of OperationalError

Posted by GitBox <gi...@apache.org>.
nicor88 commented on issue #14101:
URL: https://github.com/apache/airflow/issues/14101#issuecomment-807285766


   @vikramkoka We are using the Kubernetes Executor; I totally forgot to mention that in the issue.

