You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/06/07 20:03:34 UTC

[GitHub] [airflow] john-jac opened a new issue #14592: Unreachable Secrets Backend Causes Web Server Crash

john-jac opened a new issue #14592:
URL: https://github.com/apache/airflow/issues/14592


   **Apache Airflow version**:
   
   1.10.12
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   
   n/a
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**:
   Amazon MWAA
   
   - **OS** (e.g. from /etc/os-release):
   Amazon Linux (latest)
   
   - **Kernel** (e.g. `uname -a`):
   n/a
   
   - **Install tools**:
   n/a
   
   **What happened**:
   
   If an unreachable secrets.backend is specified in airflow.cfg the web server crashes
   
   **What you expected to happen**:
   
   An invalid secrets backend should be ignored with a warning, and the system should default back to the metadatabase secrets
   
   **How to reproduce it**:
   
   In an environment without access to AWS Secrets Manager, add the following to your airflow.cfg:
   
   ```
   [secrets]
   backend = airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
   ```
   **or** an environment without access to SSM specifiy:
   
   ```
   [secrets]
   backend = airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend
   ```
   
   Reference: https://airflow.apache.org/docs/apache-airflow/1.10.12/howto/use-alternative-secrets-backend.html#aws-ssm-parameter-store-secrets-backend


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853743827


   > If we treat the secrets backend as meta-DB like quintessential, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?
   
   Yep. They should fail. I was quite sure they are failing as well in this case. I think it should be fixed if they don't. I think it is 
   
   
   
   
   > In theory this is more of a safety trouble on the scheduler and worker as the DAGs and tasks are failing not being able to use the connection.
   
   Agree. They should fail if they are not currently.
   
   
   > The webserver at the least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent on the stance on all 3, right?
   > 
   
   Not if the configuration renders the whole airflow potentially unusable we can see failures of single tasks. But otherwise it is working. For me, misconfiguration of secrets backend (because of it's scope - it potentially impacts all configuration INCLUDING the metadata db configuration itself!!!!) Is straight equivalent of what happens if you misconfigure the metadata URL - which makes webserver crash 
   
   > The problem is the webserver is intended to be stateless, but it hosts the CLI commands which makes it stateful in this ecosystem. Until we get to the future of either fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.
   
   Agree that webserver should be stateless, but I think it does not matter if it is stateless or not . The UI/webserver accesses /can change the metadata DB. So what is more important is whether they are rad-only or not and whether the secret backend presence or not has an impact on actions triggered and changes done by the UI. 
   
   So more of a question is (and something that should be answered)
   
   * Is there any case where any actions triggered by webserver behave differently wit/without presence of the secret backend? Does any of the variables/connections/configurations impact any of the actions of the user?
   * Does the user see  we see the status of secret backend and understand the consequences of that ?
   * Should the user be able to do any of the edit actions and modify the database if the secret backend is not available ? 
   * What happens if you have Metadata db configurred in secret backend ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] subashcanapathy commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
subashcanapathy commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853369327


   If we treat the secrets backend as meta-DB like quintessential, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?
   In theory this is more of a safety trouble on the scheduler and worker as the DAGs and tasks are failing not being able to use the connection. 
   
   The webserver at the least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent on the stance on all 3, right?
   
   The problem is the webserver is intended to be stateless, but it hosts the CLI commands which makes it stateful in this ecosystem. Until we get to the future of either fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] lidalei commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
lidalei commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-852464627


   It is likely a desired (though not nice) behavior in Airflow 2.0.2. We observed a similar behavior when using GCP Secret Manger and the service account key file couldn't be found. Any Airflow command would fail because of it.
   ```
   Traceback (most recent call last): 
   File "/home/airflow/.local/bin/airflow", line 5, in <module> from airflow.__main__ import main 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__init__.py", line 34, in <module> from airflow import settings 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/settings.py", line 37, in <module> from airflow.configuration import AIRFLOW_HOME, WEBSERVER_CONFIG, conf # NOQA F401 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 1099, in <module> secrets_backend_list = initialize_secrets_backends() 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 1020, in initialize_secrets_backends custom_secret_backend = get_custom_secret_backend() 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 1009, in get_custom_secret_backend return secrets_backend_cls(**alternative_secrets_config_dict) 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/secrets/secret_manager.py", line 101, in __init__ self.credentials, self.project_id = get_credentials_and_project_id( 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 309, in get_credentials_and_project_id return _CredentialProvider(*args, **kwargs).get_credentials_and_project() 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 242, in get_credentials_and_project credentials, project_id = self._get_credentials_using_adc() 
   File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 295, in _get_credentials_using_adc credentials, project_id = google.auth.default(scopes=self.scopes) 
   File "/home/airflow/.local/lib/python3.8/site-packages/google/auth/_default.py", line 346, in default credentials, project_id = checker() 
   File "/home/airflow/.local/lib/python3.8/site-packages/google/auth/_default.py", line 189, in _get_explicit_environ_credentials credentials, project_id = load_credentials_from_file( 
   File "/home/airflow/.local/lib/python3.8/site-packages/google/auth/_default.py", line 100, in load_credentials_from_file raise exceptions.DefaultCredentialsError( google.auth.exceptions.DefaultCredentialsError: 
   File /secrets/xxx.json was not found.
   ```
   
   How about catching the exception and logging the error when a secrets backend cannot be initialized, instead of exiting?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-790691181


   @fhoda I know you talked about something similar, would you want to add your description too.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] lidalei commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
lidalei commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853446846


   > If we treat the secrets backend as meta-DB like quintessential, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?
   > 
   > In theory this is more of a safety trouble on the scheduler and worker as the DAGs and tasks are failing not being able to use the connection. 
   > 
   > 
   > 
   > The webserver at the least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent on the stance on all 3, right?
   > 
   > 
   > 
   > The problem is the webserver is intended to be stateless, but it hosts the CLI commands which makes it stateful in this ecosystem. Until we get to the future of either fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.
   
   I just got to know secrets backend can hold connections, variables and configuration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] fhoda commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
fhoda commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-790817766


   I have not experienced the web server crashing (probably because of DAG serialization being enabled), but do have task failures that don't check the downstream Secrets Backends if the Alternative Secrets Backend fails to connect.
   
   **Airflow Version**
   - 1.10.12
   - 2.0.0
   
   **Alternative Backend**
   Hashicorp Vault
   
   I think a warning and some sort and fail over checking of the Environment Variables then the  Metastore DB would make sense. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] fhoda edited a comment on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
fhoda edited a comment on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-856125981


   As of now it seems the expected behavior is not what is happening and is inconsistent across different secret backends.
   
   I have tried to reproduce this issue with Airflow 2.0 (main branch) and am not able to do so for any AWS secrets backends. I was only able to reproduce a crashing webserver for GCP Secret Manager and not any other secrets backend.
   
   The GCP Secret Manager error seems more to do with the function to get the credentials and not the actual connection.
   ```
   google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authenticati│google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentica
   on/getting-started    
   ```
   
   I used the `airflow.providers.*` secrets packages for each. I noticed that the original post on the issue uses the `contrib` package and Airflow `1.10.12`.
   
   ```
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_path": "airflow/connections", "variables_path": "airflow/variables", "config_path": "airflow/config", "url": "http://127.0.0.1:8200", "token": "$VAULT_TOKEN"}'
   
   
   export AIRFLOW__SECRETS__BACKEND=airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
   export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": "airflow-variables", "gcp_keyfile_dict": $GCP_SECRET_MANAGER_SA_KEY}'
   
   
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.microsoft.azure.secrets.azure_key_vault.AzureKeyVaultBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": null, "vault_url": "https://example-akv-resource-name.vault.azure.net/"}'
   
   
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "profile_name": "default"}'
   
   
   
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.systems_manager.SystemsManagerParameterStoreBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "/airflow/connections", "variables_prefix": "/airflow/variables", "profile_name": "default"}'
   ```
   
   Here are my findings:
   
   - AWS Secret Manager - No Crash
   - AWS SSM - No Crash
   - Vault - No Crash
   - Azure Key Vault - No Crash
   - **GCP Secret Manager - Crash**
   
   I believe we should evaluate what the expected behavior should be as compared to what is actually happening.
   
   Also after discussing with @kaxil there may be a middle ground for fail over implementation that could make sense here.
   1. If configs are being retrieved through the secrets backend then a failure makes sense.
   2. If connections and/or variables are not able to be retrieved, then fail over could be a strategy used by users to ensure DAG/task success and predictable execution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-856216348


   Agree we have consistency issue here - Interestingly, the AWS secret manager crashed originally for @subashcanapathy and @john-jac but did not crash for you @fhoda. Not sure what the reason is for that (maybe the 1.10 vs 2.* behavioral difference)? 
   
   I really like the idea of different behavior for different type of access. I think it answers my concerns perfectly and what it really boils down to is "who" is the "client" - whether it is "airflow" or the "DAG/task writer".
   
   I think the main difference of configuration vs. variables and connections is that Airflow has default values for most of the configurations and when they are not found, they will fall-back to the default values - which might alter behavior of airflow. So lack of secrets backend when it is configured and configuration is retrieved is very dangerous. And since it is accessed under-the hood  by Airflow, without the "dag" or "task" using it, it's airflow that is the "client" and it's airflow that should handle it (and crashing is the only reasonable behavior IMHO). Simply "dag writer" is not in a position to make any decision here.
   
   This is (as you rightfully noticed), far less of a concern for connections and variables - "clients" for those are "dag writers". Whoever uses them should be prepared for what happens when the secret backend is missing. Either the "writers" will prepare fallback values for those in the DB or they will have to handle "missing" value somehow (and this is up to the 'user' what to do in this case). But they are in full control, there is no need to crash Airflow (yet! - until configuration is not accessed by Airflow itself).
   
   Reopening it as it might actually be an actionable item to do :)
   
   @subashcanapathy , @john-jac - would that be a reasonable approach for you as well ? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] fhoda commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
fhoda commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-856965584


   Since we are in agreement about the fail over being appropriate for Connections and Variables, I have started to work on those. I will address the inconsistency in behavior and config use case after we determine what the final state should be for those.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] subashcanapathy commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
subashcanapathy commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-831120616


   DAG serialization is turned on for John's use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil closed issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
kaxil closed issue #14592:
URL: https://github.com/apache/airflow/issues/14592


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #14592:
URL: https://github.com/apache/airflow/issues/14592


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] fhoda commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
fhoda commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-856125981


   As of now it seems the expected behavior is not what is happening and is inconsistent across different secret backends.
   
   I have tried to reproduce this issue with Airflow 2.0 (main branch) and am not able to do so for any AWS secrets backends. I was only able to reproduce a crashing webserver for GCP Secret Manager and not any other secrets backend.
   
   I used the `airflow.providers.*` secrets packages for each. I noticed that the original post on the issue uses the `contrib` package and Airflow `1.10.12`.
   
   ```
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_path": "airflow/connections", "variables_path": "airflow/variables", "config_path": "airflow/config", "url": "http://127.0.0.1:8200", "token": "$VAULT_TOKEN"}'
   
   
   export AIRFLOW__SECRETS__BACKEND=airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
   export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": "airflow-variables", "gcp_keyfile_dict": $GCP_SECRET_MANAGER_SA_KEY}'
   
   
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.microsoft.azure.secrets.azure_key_vault.AzureKeyVaultBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": null, "vault_url": "https://example-akv-resource-name.vault.azure.net/"}'
   
   
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "profile_name": "default"}'
   
   
   
   #export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.systems_manager.SystemsManagerParameterStoreBackend
   #export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "/airflow/connections", "variables_prefix": "/airflow/variables", "profile_name": "default"}'
   ```
   
   Here are my findings:
   
   - AWS Secret Manager - No Crash
   - AWS SSM - No Crash
   - Vault - No Crash
   - Azure Key Vault - No Crash
   - **GCP Secret Manager - Crash**
   
   I believe we should evaluate what the expected behavior should be as compared to what is actually happening.
   
   Also after discussing with @kaxil there may be a middle ground for fail over implementation that could make sense here.
   1. If configs are being retrieved through the secrets backend then a failure makes sense.
   2. If connections and/or variables are not able to be retrieved, then fail over could be a strategy used by users to ensure DAG/task success and predictable execution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-852858448


   Closing it for now until there is a gods reasoning why not to crash :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853329159


   I am not at all convinced. 
   
   The Secrets Backend - when configured - is expected to work and it can hold crucial part of configuration that spans across all functionality of Airflow. Airflow is not able to asses if the secret backend is wrongly configured or this is temporary problem in communication so letting Airflow work when expected, crucial configuration service is not reachable is a very bad idea IMHO.
   
   I think usually if services like Airflow have that big of a  configuration problem - they should crash quickly at restart and even if that configuration is result of application in a managed service - it should be a signal to roll-back to previous configuration.
   
   But i would love to hear what others think?  Happy to reopen if the consensus will be different than my assessment .
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-852516068


   If I understand correctly - this is only when you cannot reach secret backend when it is configured ? Not about just a value missing in the backend?
   
   If my assumption is correct, then I have no doubts whatsoever that  his is desired, nice and "good" behavior for airflow to crash hard in this case. 
   
   It's far worse to continue running, when you rely on the backend with variables to be available and they cannot be read where you expec them (and the assumption is that when you configure the backend, you want it to be reachable). 
   
   This might lead to a number of problems - bad processing of data, wrong databases updated, wrong calculations, you name it. Totally unpredictable, because we have no idea what the retrieved value will be used for, and we have no idea whether a "fallback default' is good or not.
   
   It's NEVER a good idea to continue when you have unpredictable and unexpected configuration. Airflow as a system is not able to assess if the "default" value is "OK" to continue if the "expected" backend does not provide the right values (either because it is missing or wrongly configured). Actually it's even worse - Airflow is not even able to determine which kind of problem it is - is it wrongly configured? or just temporarily missing-in-action ?  It will look exactly the same from Airlfow's point of view and this means that by not-crashing we accept that sometimes Airflow will run with configuration A and sometimes (when the secret backend is not reachable) with configuration B. This is no-go.
   
   You either expect the secret backend to be available or not  - there is no middle ground.
   
   Crashing in this case is the only reasonable approach. 
   
   And I really like the output @lidalei  copied. It is very straightworward and easy to find what the problem is. Again - we have no  idea whether we can continue with default value or not. The only sensible approach is to crash hard.
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] fhoda commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
fhoda commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-805853124


   @john-jac can you tell me if you had DAG serialization turned on or not when you were seeing this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] lidalei commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
lidalei commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853446846


   > If we treat the secrets backend as meta-DB like quintessential, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?
   > 
   > In theory this is more of a safety trouble on the scheduler and worker as the DAGs and tasks are failing not being able to use the connection. 
   > 
   > 
   > 
   > The webserver at the least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent on the stance on all 3, right?
   > 
   > 
   > 
   > The problem is the webserver is intended to be stateless, but it hosts the CLI commands which makes it stateful in this ecosystem. Until we get to the future of either fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.
   
   I just got to know secrets backend can hold connections, variables and configuration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] subashcanapathy commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
subashcanapathy commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853211929


   A web application that is stateless (or intended to be that in long term) should fail gracefully on such situations. A secrets provider is at best a plugin and not a core feature. I would understand if the meta-DB connection failed and or the IDP provider connection failed - then it makes sense to prevent startup. If the customer made a mistake on configuring the secrets backend, booting up the web UI will make it obvious to them as the task failures and logs are viewable. Without this we are just assuming the user has access to logs on a box to be able to even understand where this went wrong.
   
   Airflow webserver should be stateless representation of current state of things in the environment. I request to consider re-opening this so that we can make a configuration control like
   `webserver.failsafe_secrets_backend = true`
   
   @potiuk @kaxil 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] lidalei commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
lidalei commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-852812326


   @potiuk Thank you very much for the detailed reasoning process! The concern is that if the an operation doesn't even need secrets backend (no need to use any connection or variable), should it crash or continue working? But generally it is a good idea to crash due to a bad configuration when starting an APP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-852858121


   > @potiuk Thank you very much for the detailed reasoning process! The concern is that if the an operation doesn't even need secrets backend (no need to use any connection or variable), should it crash or continue working? But generally it is a good idea to crash due to a bad configuration when starting an APP.
   
   Airflow is a distributed system there is never 'single' operation. Even if your operation does not need it, there can be many more tasks and jobs that might. They share co figuration and processes sometimes.
   
   Really trying to pass on the bad configuration problem form the operator who should configure Airflow well to Airflow itself is very bad idea. Really crashing is the best signal Airflow can give to the human operating it 'please fix it'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-856216348


   Agree we have consistency issue here - Interestingly, the AWS secret manager crashed originally for @subashcanapathy and @john-jac but did not crash for you @fhoda. Not sure what the reason is for that (maybe the 1.10 vs 2.* behavioral difference)? 
   
   I really like the idea of different behavior for different type of access. I think it answers my concerns perfectly and what it really boils down to is "who" is the "client" - whether it is "airflow" or the "DAG/task writer".
   
   I think the main difference of configuration vs. variables and connections is that Airflow has default values for most of the configurations and when they are not found, they will fall-back to the default values - which might alter behavior of airflow. So lack of secrets backend when it is configured and configuration is retrieved is very dangerous. And since it is accessed under-the hood  by Airflow, without the "dag" or "task" using it, it's airflow that is the "client" and it's airflow that should handle it (and crashing is the only reasonable behavior IMHO). Simply "dag writer" is not in a control to make any decision here.
   
   This is (as you rightfully noticed), far less of a concern for connections and variables - "clients" for those are "dag writers". Whoever uses them should be prepared for what happens when the secret backend is missing. Either the "writers" will prepare fallback values for those in the DB or they will have to handle "missing" value somehow (and this is up to the 'user' what to do in this case). But they are in full control, there is no need to crash Airflow (yet! - until configuration is not accessed by Airflow itself).
   
   Reopening it as it might actually be an actionable item to do :)
   
   @subashcanapathy , @john-jac - would that be a reasonable approach for you as well ? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14592: Unreachable Secrets Backend Causes Web Server Crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14592:
URL: https://github.com/apache/airflow/issues/14592#issuecomment-853743827


   > If we treat the secrets backend as meta-DB like quintessential, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?
   
   Yep. They should fail. I was quite sure they are failing as well in this case. I think it should be fixed if they don't. I think it is 
   
   
   
   
   > In theory this is more of a safety trouble on the scheduler and worker as the DAGs and tasks are failing not being able to use the connection.
   
   Agree. They should fail if they are not currently.
   
   
   > The webserver at the least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent on the stance on all 3, right?
   > 
   
   Not if the configuration renders the whole airflow potentially unusable we can see failures of single tasks. But otherwise it is working. For me, misconfiguration of secrets backend (because of it's scope - it potentially impacts all configuration INCLUDING the metadata db configuration itself!!!!) Is straight equivalent of what happens if you misconfigure the metadata URL - which makes webserver crash 
   
   > The problem is the webserver is intended to be stateless, but it hosts the CLI commands which makes it stateful in this ecosystem. Until we get to the future of either fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.
   
   Agree that webserver should be stateless, but I think it does not matter if it is stateless or not . The UI/webserver accesses /can change the metadata DB. So what is more important is whether they are rad-only or not and whether the secret backend presence or not has an impact on actions triggered and changes done by the UI. 
   
   So more of a question is (and something that should be answered)
   
   * Is there any case where any actions triggered by webserver behave differently wit/without presence of the secret backend? Does any of the variables/connections/configurations impact any of the actions of the user?
   * Does the user see  we see the status of secret backend and understand the consequences of that ?
   * Should the user be able to do any of the edit actions and modify the database if the secret backend is not available ? 
   * What happens if you have Metadata db configurred in secret backend ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org