Posted to commits@airflow.apache.org by "James Meickle (JIRA)" <ji...@apache.org> on 2019/08/16 15:10:00 UTC

[jira] [Updated] (AIRFLOW-5238) Outage caused by interaction between SFTPSensor, SSHHook, and paramiko ProxyCommand

     [ https://issues.apache.org/jira/browse/AIRFLOW-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Meickle updated AIRFLOW-5238:
-----------------------------------
    Description: 
We had a partial outage last night due to an extremely subtle bug that requires a bit of explanation. Please bear with me...

We have an extensive DAG for moving financial data files between vendors. One of these vendor SFTPs is only accessible via certain subnets, so we use an SSH ProxyCommand to pass through a bastion first. The existence of this bastion is a difference from staging, where we directly connect to a mock SFTP (AWS Transfer).

After a recent code deployment, our production Airflow cluster began exhibiting intermittent connectivity issues to a specific vendor SFTP. This connectivity issue affected _all_ SFTP connections to the vendor, regardless of specific operator or sensor class. Some connections would work, but most would fail with:
{code}
paramiko.ssh_exception.SSHException: No existing session
{code}
We first looked at the diff of changes to the DAG code. There was no apparent "shared" or "global" configuration that would affect all connections to the vendor; for example, we didn't change any hook classes.

One of our engineers built a minimal test case that instantiated an SFTP operator and immediately executed it. This was run on the Airflow production server, and therefore had the same user, virtualenv, metadata db, connection, and so on. This test case did not reproduce the original issue.
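
That test case isn't attached to this ticket; a minimal sketch of what it might have looked like, assuming the stock contrib {{SFTPOperator}} and an illustrative connection ID {{vendor_sftp}} (all names and paths here are placeholders, not our production configuration):
{code}
# Hedged sketch of a standalone repro script, run on a production worker in the
# same virtualenv and against the same metadata DB. All names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.sftp_operator import SFTPOperation, SFTPOperator

dag = DAG("sftp_repro", start_date=datetime(2019, 1, 1), schedule_interval=None)

fetch = SFTPOperator(
    task_id="fetch_test_file",
    ssh_conn_id="vendor_sftp",
    remote_filepath="/outgoing/test_file.csv",
    local_filepath="/tmp/test_file.csv",
    operation=SFTPOperation.GET,
    dag=dag,
)

# Executing the operator directly, outside the scheduler, did not reproduce the
# intermittent "No existing session" failures.
fetch.execute(context={})
{code}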

We attached a remote debugger to the failing operators. The intermittent failure still occurred while stepping through the debugger, though the underlying error turned out to be a Paramiko socket timeout. That still didn't give us much clarity on what specifically was failing.

We tried going a layer lower and executing the SSH {{ProxyCommand}} by hand in a terminal. This also worked as expected: it connected to the vendor's SSH server through the bastion and received an SSH banner, waiting for the switch to the SFTP subsystem.

However, after running the manual command _repeatedly_, we noticed that in some cases we immediately received a server EOF just before the SSH banner. These failures came in batches: the command would work from all instances, then fail several times in a row (from a single instance only), and then resume working. We began to suspect that we were being throttled.

We began to investigate resource use, wondering whether there was a process leak. We did notice several old SSH processes running on the workers. Since they were marked as defunct, we would not expect them to be holding connections open; just in case, we cycled all of the worker servers. This did not resolve the situation.

At this point we took a break to manually run our ETL commands for the evening, and to preemptively ask our vendor whether they had recently instituted throttling on their SFTP server. In the morning we were still impacted by this issue and resumed debugging.

One of our engineers tried inspecting the state of our Airflow _web_ server and noticed that it had many open connections to the remote SFTP. This turned out to be the critical clue that narrowed it down to two likely causes: a hook that was initiating a connection outside of {{get_conn}}, or an operator that was misusing a hook.
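
The ticket doesn't record how the connections were inspected; one hedged way to make the same observation from Python, using {{psutil}} (an {{lsof}} or {{ss -tnp}} one-liner works just as well), would be:
{code}
# List established outbound TCP connections held by Airflow processes on this
# host (e.g. the webserver). Purely illustrative; not part of the original repro.
import psutil

for proc in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "airflow" not in cmdline:
        continue
    try:
        for conn in proc.connections(kind="tcp"):
            if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
                print(proc.info["pid"], cmdline[:40], "->", conn.raddr)
    except psutil.AccessDenied:
        # Skip processes we are not allowed to inspect.
        continue
{code}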

I began tracing through our custom code and did not find either of those problems. However, in examining Airflow core I was able to find the code path that led to this outage:
 - We deploy new DAG tasks, including an {{SFTPSensor}}.
 - The SFTPSensor instantiates an SFTPHook in {{__init__}} instead of {{execute}}: [https://github.com/apache/airflow/blob/1.10.4/airflow/contrib/sensors/sftp_sensor.py#L41]
 - The SFTPHook is a {{SSHHook}} subclass and uses its {{__init__}}: [https://github.com/apache/airflow/blob/a18283911750017a01d538bcdd48283b6a265bfb/airflow/contrib/hooks/sftp_hook.py#L48]
 - In the {{__init__}} method of the {{SSHHook}}, if an SSH config file is present and there's a matching {{ProxyCommand}}, a new {{paramiko.ProxyCommand}} will be created: [https://github.com/apache/airflow/blob/a18283911750017a01d538bcdd48283b6a265bfb/airflow/contrib/hooks/ssh_hook.py#L137]
 - In the {{__init__}} method of {{paramiko.ProxyCommand}}, the SSH {{ProxyCommand}} gets _executed immediately_ rather than deferred until use: [https://github.com/paramiko/paramiko/blob/master/paramiko/proxy.py#L56]

Taken together, every DAG parsing attempt, including those on the scheduler and web instances, was creating multiple SSH connections, which resulted in our being throttled (by either the bastion or the vendor SFTP). We didn't see this on staging because we don't use an SSH {{ProxyCommand}} there.
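
To make the failure mode concrete, here is a minimal sketch of that code path, under the assumption that the Airflow user's {{~/.ssh/config}} has a {{ProxyCommand}} stanza matching the vendor host; every name in it is illustrative rather than taken from our DAGs:
{code}
# Pared-down illustration of the code path above; the DAG, connection ID, host
# names, and file path are all placeholders.
#
# Assumed ~/.ssh/config stanza on the Airflow hosts (illustrative):
#
#   Host sftp.vendor.example.com
#       ProxyCommand ssh -W %h:%p bastion.example.com
#
# Every process that parses this file (scheduler, webserver, workers) executes
# the module body below.
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.sftp_sensor import SFTPSensor

dag = DAG("vendor_ingest", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

# Merely constructing the sensor is enough to open a connection: SFTPSensor
# builds an SFTPHook in __init__, SSHHook.__init__ reads ~/.ssh/config, and a
# matching ProxyCommand entry causes paramiko.ProxyCommand to spawn the proxy
# subprocess immediately. No poke() or execute() call is required.
wait_for_file = SFTPSensor(
    task_id="wait_for_vendor_file",
    sftp_conn_id="vendor_sftp",
    path="/outgoing/daily_positions.csv",
    dag=dag,
)
{code}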

This outage was the result of multiple intersecting coincidences, but it's not the first time we've seen hook issues in Airflow. To mitigate the blast radius of similar issues, I think Airflow should:

- Standardize and document that operators should only instantiate hooks at execute time (a sketch of that pattern follows this list).
- Fix the SFTPSensor's violation of the above.
- Standardize and document that hooks must not make any connections outside of {{get_conn}}.
- Fix the SSHHook's use of {{ProxyCommand}}.
- File an upstream bug against Paramiko to make its {{ProxyCommand}} behavior less surprising, and to improve logging of it. (I will do this.)
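
To illustrate the first and third points, a hedged sketch of the hook-at-poke-time pattern (the class name is hypothetical and this is not a proposed patch; a real fix would also keep the errno handling the current sensor has):
{code}
# Sketch only: keep __init__ side-effect free and defer hook construction (and
# therefore any ProxyCommand side effects) until the sensor runs on a worker.
from airflow.contrib.hooks.sftp_hook import SFTPHook
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class DeferredSFTPSensor(BaseSensorOperator):
    """Illustrative variant of SFTPSensor that only touches SFTPHook in poke()."""

    @apply_defaults
    def __init__(self, path, sftp_conn_id="sftp_default", *args, **kwargs):
        super(DeferredSFTPSensor, self).__init__(*args, **kwargs)
        self.path = path
        self.sftp_conn_id = sftp_conn_id  # store the connection ID, not a hook

    def poke(self, context):
        hook = SFTPHook(self.sftp_conn_id)  # built at run time, not at parse time
        try:
            hook.get_mod_time(self.path)
            return True
        except IOError:
            # Simplified: the real sensor re-raises anything other than ENOENT.
            return False
        finally:
            hook.close_conn()
{code}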


> Outage caused by interaction between SFTPSensor, SSHHook, and paramiko ProxyCommand
> -----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5238
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5238
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: hooks
>    Affects Versions: 1.10.4
>            Reporter: James Meickle
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)