Posted to users@httpd.apache.org by Franck Fallateuf <fr...@plansource.com> on 2019/10/09 19:33:33 UTC

[users@httpd] Apache random traffic outage for specific customer

Hello everyone,

We upgraded from Apache 2.4.12 to 2.4.18 on a public-facing web server that proxies requests to backend servers. When we first cut over to the web server running the newer version (2.4.18), all traffic seemed to flow normally. A few days later, however, one of our customers reported that they were experiencing random outages. The outage shows up in the browser as a "This site can't be reached" page with "ERR_CONNECTION_TIMED_OUT". As far as we know, this is the only customer experiencing or reporting the issue. After looking through all available logs, for Apache and otherwise, we could not identify what was causing this or where it was occurring. So we decided to set up packet captures (tcpdump) at both ends of the path between us and this customer. What we observed was the following:

Packet captures on the border firewall showed the SSL handshake failing during ECDH negotiation, after the client had received the ServerHello message. The return packet was a ‘bad_record_mac’ alert (alert code 20).
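
For readers decoding a similar capture by hand, here is a minimal Python sketch of how a plaintext TLS alert record is laid out, assuming the alert was sent unencrypted (that is, before ChangeCipherSpec). The function name decode_tls_alert and the sample bytes in the usage note are illustrative and are not taken from the actual capture; the alert numbers are a subset of those listed in RFC 5246.

```python
import struct

# Alert levels and a subset of alert descriptions from RFC 5246, section 7.2.
ALERT_LEVELS = {1: "warning", 2: "fatal"}
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    10: "unexpected_message",
    20: "bad_record_mac",
    40: "handshake_failure",
    47: "illegal_parameter",
    70: "protocol_version",
}

TLS_CONTENT_TYPE_ALERT = 21  # record-layer content type for alerts


def decode_tls_alert(record: bytes):
    """Decode one plaintext TLS alert record into (level, description).

    Record layout: type (1 byte), version (2 bytes), length (2 bytes),
    followed by the alert body: level (1 byte), description (1 byte).
    """
    if len(record) < 7:
        raise ValueError("record too short to hold a TLS alert")
    content_type, _major, _minor, length = struct.unpack("!BBBH", record[:5])
    if content_type != TLS_CONTENT_TYPE_ALERT:
        raise ValueError("not an alert record")
    if length != 2:
        # An encrypted alert carries a MAC/padding and is longer than 2 bytes.
        raise ValueError("alert body is not 2 bytes (encrypted or malformed)")
    level, description = record[5], record[6]
    return (
        ALERT_LEVELS.get(level, f"unknown({level})"),
        ALERT_DESCRIPTIONS.get(description, f"unknown({description})"),
    )
```

As a usage example, decode_tls_alert(bytes.fromhex("15030300020214")) decodes a hypothetical TLS 1.2 record whose last byte is 0x14, which is decimal 20, the bad_record_mac code mentioned above.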

Because of this, we decided to make the following changes:

During troubleshooting, the TIME_WAIT value on the firewall was increased to allow enough time for a response; this did not resolve the issue. The firewall was then configured for TCP bypass for the IP addresses having the communication issues; this did not resolve the issue either. The firewall is a Cisco ASA 5545 running v9.8(3)29.

While comparing the Apache setups running 2.4.12 and 2.4.18, we found that we were running the "event" MPM on 2.4.18 versus the "worker" MPM on 2.4.12. After reading up on the differences between these two MPMs, we immediately suspected this could be a factor because of how they handle sockets. We switched the MPM back to "worker" on the newer Apache version and tested again, but this customer still experienced the same random issues.
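
One way to confirm which MPM a given build is actually running is to read the "Server MPM:" line printed by `httpd -V` (or `apachectl -V`). The Python sketch below parses that output; the sample text is an illustrative, trimmed example rather than output from the servers discussed here, and parse_server_mpm is a hypothetical helper name.

```python
import re

# Illustrative, trimmed example of `httpd -V` output (not from the real servers).
SAMPLE_OUTPUT = """\
Server version: Apache/2.4.18 (Unix)
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
"""


def parse_server_mpm(httpd_v_output: str):
    """Return the MPM name from `httpd -V` output, or None if not found."""
    match = re.search(r"^Server MPM:\s*(\S+)", httpd_v_output, re.MULTILINE)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    # On a real host the text could be captured with, for example:
    #   subprocess.run(["httpd", "-V"], capture_output=True, text=True).stdout
    print(parse_server_mpm(SAMPLE_OUTPUT))
```

Running the same check on both the old and the new server makes an unintended MPM change visible before any traffic is cut over.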

Additional information:
- The customer uses a single source IP address, from which all of their employees' requests to our application originate.
- There seems to be a correlation between this customer's peak traffic times and the occurrence of these events. As stated, all traffic comes from a single source IP address, and there could be 200+ users on our system at a given time.
- The customer reports fewer occurrences of this issue outside of their peak traffic times.
- We've set ListenBacklog to 99999 with no noticeable impact on this issue, although we believe it could have played a part in a separate issue outside this scope.
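
A note on the ListenBacklog setting above: the value is passed to listen(), and on Linux the kernel silently caps it at net.core.somaxconn, so a request of 99999 may not take effect as written. The Python sketch below shows that arithmetic, assuming a Linux host (the thread does not state the operating system); effective_backlog and read_somaxconn are hypothetical helper names.

```python
from pathlib import Path

# Linux-only location of the kernel's cap on the listen() backlog.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Backlog the kernel actually applies: the request, capped at somaxconn."""
    return min(requested, somaxconn)


def read_somaxconn(path: Path = SOMAXCONN_PATH):
    """Return the current somaxconn value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    limit = read_somaxconn()
    if limit is None:
        print("somaxconn not readable (probably not a Linux host)")
    else:
        print("ListenBacklog 99999 is effectively", effective_backlog(99999, limit))
```

With a somaxconn of 128, for instance, effective_backlog(99999, 128) gives 128, which would be consistent with the setting showing no noticeable impact.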

Any help would be greatly appreciated, as we are out of ideas and this customer has not been very cooperative in helping us help them with this issue. We've had to revert to Apache 2.4.12, which we would like to upgrade from.

Thank you,
Franck

This email may contain confidential or protected material for the sole use of the intended recipient(s). Any review, use, distribution or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive for the recipient), please contact the sender by reply email and delete all copies of this message.

Re: [users@httpd] Apache random traffic outage for specific customer

Posted by Daniel Ferradal <df...@apache.org>.

Unfortunately no. SSL handling is really done by the OpenSSL libraries,
so meaningful SSL changes come from those libraries. In any case, you can
check the full changelog between versions to see if anything rings a bell:
http://www.apache.org/dist/httpd/CHANGES_2.4 . IMO changing the version of
httpd should not change anything, but I have had issues similar to yours
with very "old clients" that would refuse to do proper SSL handshakes
after OpenSSL upgrades on the server side.

IMO a timeout should be something else.
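
To separate TLS handshake failures from plain connection timeouts, one option is to repeat the handshake from a test client and tally how each attempt ends. Below is a minimal Python sketch using the standard ssl and socket modules; the host name in the usage note is a placeholder, classify and probe are hypothetical helper names, and a modern Python client will not reproduce the behavior of an old client stack.

```python
import socket
import ssl
from collections import Counter


def classify(exc: BaseException) -> str:
    """Map an exception from a connection attempt to a coarse outcome label."""
    if isinstance(exc, socket.timeout):
        return "timeout"      # connect or handshake did not finish in time
    if isinstance(exc, ssl.SSLError):
        return "tls_error"    # handshake was rejected, e.g. by a TLS alert
    if isinstance(exc, OSError):
        return "tcp_error"    # refused, reset, unreachable, DNS failure, ...
    return "other"


def probe(host: str, port: int = 443, attempts: int = 20, timeout: float = 5.0) -> Counter:
    """Attempt `attempts` fresh TLS handshakes and count the outcomes."""
    context = ssl.create_default_context()
    outcomes = Counter()
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout) as raw:
                # The handshake runs inside wrap_socket(); nothing is sent after it.
                with context.wrap_socket(raw, server_hostname=host):
                    pass
            outcomes["ok"] += 1
        except Exception as exc:
            outcomes[classify(exc)] += 1
    return outcomes
```

For example, print(probe("www.example.com", attempts=50)) run during and outside peak hours would show whether failures are mostly "timeout" or mostly "tls_error".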



On Thu, Oct 10, 2019 at 0:16, Franck Fallateuf (<
franck.fallateuf@plansource.com>) wrote:

> Thank you, Daniel, for your response.
>
> The OpenSSL version on the web server running Apache 2.4.12 is 'OpenSSL
> 1.0.1e-fips 11 Feb 2013' and the version on the one running 2.4.18 is
> 'OpenSSL 1.0.2g  1 Mar 2016'.
>
> I'm not sure whether we have any tcpdumps from the client side, but I'll
> look into that. Getting new captures from the customer's end is going to
> be tricky, though, given their reluctance to help out.
>
> Would there otherwise be any obvious changes between 2.4.12 and 2.4.18
> that would or potentially could have introduced a scenario like the one
> I'm describing?
>
> Thanks again.


-- 
Daniel Ferradal
HTTPD Project
#httpd help at Freenode

Re: [users@httpd] Apache random traffic outage for specific customer

Posted by Franck Fallateuf <fr...@plansource.com>.

Thank you, Daniel, for your response.

The OpenSSL version on the web server running Apache 2.4.12 is 'OpenSSL 1.0.1e-fips 11 Feb 2013' and the version on the one running 2.4.18 is 'OpenSSL 1.0.2g  1 Mar 2016'.

I'm not sure whether we have any tcpdumps from the client side, but I'll look into that. Getting new captures from the customer's end is going to be tricky, though, given their reluctance to help out.

Would there otherwise be any obvious changes between 2.4.12 and 2.4.18 that would or potentially could have introduced a scenario like the one I'm describing?

Thanks again.


-----Original Message-----
From: Daniel Ferradal <dferradal@apache.org>
Reply-To: users@httpd.apache.org
To: users@httpd.apache.org
Subject: Re: [users@httpd] Apache random traffic outage for specific customer
Date: Wed, 09 Oct 2019 22:08:30 +0200

Perhaps you can add the OpenSSL version to the puzzle, given those SSL errors you caught. Did it change with the upgrade? Although, without looking, I would tend not to associate a timeout with SSL issues at all.

I'd also try tcpdump on the client side instead of the server.


Re: [users@httpd] Apache random traffic outage for specific customer

Posted by Daniel Ferradal <df...@apache.org>.

Perhaps you can add the OpenSSL version to the puzzle, given those SSL
errors you caught. Did it change with the upgrade? Although, without
looking, I would tend not to associate a timeout with SSL issues at all.

I'd also try tcpdump on the client side instead of the server.
