Posted to issues@trafficserver.apache.org by GitBox <gi...@apache.org> on 2020/10/22 17:06:37 UTC

[GitHub] [trafficserver] shinrich opened a new issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

shinrich opened a new issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290

While working with @djcarlin and @bryancall to understand an issue from deploying the dead server no retry feature added in PR #7142, we were surprised by when exactly servers are marked dead.

In the case where transactions are retryable, things pretty much worked as expected. If the handshake fails, or the request is sent but the origin fails to return data and the request is retryable, the address is tried the number of times specified in proxy.config.http.connect_attempts_rr_retries (at least once PR #7288 is applied) before the IP address is marked down and the next IP address is tried.

However, if the transaction fails after the header has been sent and it is not retryable (e.g. a POST request), the IP address is marked down immediately (the retry count in proxy.config.http.connect_attempts_rr_retries is ignored). If the origin only times out now and again due to larger requests, taking it down immediately seems bad, particularly with the new feature that avoids retries against the down server during the dead period. However, if the server is consistently failing to respond to POST requests, it should be marked down.
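
The two cases above can be summarized with a small toy model. This is only a sketch of the behavior as described in this issue, not Traffic Server code; `OriginState`, `record_failure`, and the `rr_retries` parameter (standing in for proxy.config.http.connect_attempts_rr_retries) are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OriginState:
    """Toy per-IP failure tracking (hypothetical, not an ATS structure)."""
    failures: int = 0
    down: bool = False


def record_failure(state: OriginState, retryable: bool, rr_retries: int) -> bool:
    """Apply one failed transaction and return whether the IP is now down.

    rr_retries stands in for proxy.config.http.connect_attempts_rr_retries.
    """
    if not retryable:
        # Observed behavior: a non-retryable failure (e.g. a POST whose
        # header was already sent) marks the address down immediately.
        state.down = True
        return state.down
    state.failures += 1
    if state.failures >= rr_retries:
        # Retryable failures only mark the address down once the
        # configured number of attempts has been used up.
        state.down = True
    return state.down
```

With `rr_retries` set to 3, three retryable failures are needed before the address goes down, while a single non-retryable failure is enough.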

The criteria for this down decision probably need to be configurable. Some origins need different criteria than others. Some should only be marked down if the initial handshake fails. Others should be marked down, but only if no data was returned. Or maybe you want to mark things down only for specific origin connection failures.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [trafficserver] shinrich commented on issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

Posted by GitBox <gi...@apache.org>.

shinrich commented on issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-819884085


   @djcarlin updated our configs to set proxy.config.http.down_server.abort_threshold higher than any of our other timeouts (basically making sure this config never did anything), and our down server failures were greatly reduced.



[GitHub] [trafficserver] shinrich edited a comment on issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

Posted by GitBox <gi...@apache.org>.

shinrich edited a comment on issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-819885851

Did some more work today on down servers in our environment.

One thing I hadn't noticed before was that an origin failure only contributes to the down server count if the t_state->current.server->connect_result is non-zero. That is, a real error happened while setting up the TCP/TLS connection. There are many messages generated in error.log where the connect_result is 0 and a failure happened between connect open and first byte from server. These transactions are available for retries, but they don't contribute to the counts toward marking a server down.

Locally, we are trying a build that only adds a log to error.log if it really is a connect failure. It really cut down the noise in our logs.

Once we remove the noise, we see the following cases for origin connection failure in our environment:
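
A minimal sketch of that distinction, assuming failures are represented only by their connect_result value. The function names are hypothetical and this is not the actual state machine code.

```python
def counts_toward_down(connect_result: int) -> bool:
    """Return True if a failed transaction should bump the down-server count.

    connect_result mirrors t_state->current.server->connect_result as
    described above: 0 means the connect itself did not report an error
    and the failure came later, before the first byte from the server.
    """
    return connect_result != 0


def tally(connect_results):
    """Split a list of failure results into (counted, retry_only) totals."""
    counted = sum(1 for r in connect_results if counts_toward_down(r))
    return counted, len(connect_results) - counted
```

For example, `tally([0, 110, 0, 113])` separates the two failures with a real connect error from the two that are only eligible for retries.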

ENET_SSL_CONNECT_FAILED - I added this in the case of ERROR_SSL_ERROR in the TLS handshake negotiation. It seems for us this is mostly due to server cert verification failure.

Connection timed out [110] - A timeout during the handshake

No route to host [113] - The DNS entry for the origin is still there, but the machine has been decommissioned.

Connection refused [111] - Presumably the service is down


[GitHub] [trafficserver] shinrich commented on issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

Posted by GitBox <gi...@apache.org>.

shinrich commented on issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-819895762


   Based on the type of connection errors we are seeing in our environment, I'm adding a setting to adjust which connection failures count towards the down server count.
   
   For our environment, a TCP connection failure (no route, timeout, or refused) is very different from an SSL cert verification failure.   For some of our next level machines, quite a few connections will be routed to the same machine with different SNI and host names.  A failure for one host name/SNI (due to cert verification failure) should not take down the origin for all the other requests for other host names.
   
   So we will experiment with a setting that has the following options
   - All connection failures (TCP and TLS) count towards the down server count.  Existing function.
   - Only TCP connection failures count towards the down server count
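
   The two options could be modeled roughly as follows. This is a sketch only: `DownServerPolicy`, its member names, and `failure_counts` are hypothetical and do not correspond to an existing Traffic Server setting.

```python
from enum import Enum


class DownServerPolicy(Enum):
    """Hypothetical names for the two proposed options."""
    ALL_CONNECT_FAILURES = 0  # existing behavior: TCP and TLS failures count
    TCP_ONLY = 1              # only TCP-level failures count


def failure_counts(policy: DownServerPolicy, is_tls_failure: bool) -> bool:
    """Decide whether a connection failure bumps the down-server count.

    is_tls_failure is True for a TLS handshake error (such as a cert
    verification failure) and False for a TCP error (no route, timeout,
    or refused).
    """
    if policy is DownServerPolicy.ALL_CONNECT_FAILURES:
        return True
    return not is_tls_failure
```

   Under `TCP_ONLY`, a cert verification failure for one SNI would not count against the shared origin address, while a refused connection still would.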




[GitHub] [trafficserver] shinrich commented on issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

Posted by GitBox <gi...@apache.org>.

shinrich commented on issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-815922190


   I ran across the proxy.config.http.down_server.abort_threshold setting.  I think the comparison is backwards from the description.  But if the client times out after more than this many seconds and the origin hasn't sent anything back yet, it will count towards marking that origin down.   The default value for this setting is 10.
   
   The intent is to include "slow servers" in the set of dead servers to be avoided.  But if your client-side inactivity timeout is less than your origin-side inactivity timeout, the bad client scenario that @djcarlin identifies will trigger this and increment the counts for the origin connection to be down.
   
   It is easy enough to take proxy.config.http.down_server.abort_threshold out of the equation by setting it higher than any of your inactivity timeouts.  But I think we may want to reconsider this setting.  The intent of marking "slow" servers as down is admirable, but identifying the slowness as an issue on the origin side vs the client side is sketchy at best.
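
   A rough model of the behavior described in this comment, and of the workaround. It is a sketch based only on the description here, not verified against the source; the function names and the example timeout values are hypothetical.

```python
def abort_counts_against_origin(client_wait_secs: float,
                                origin_sent_bytes: bool,
                                abort_threshold_secs: float = 10) -> bool:
    """Model the described behavior of down_server.abort_threshold.

    A client abort counts toward marking the origin down when the client
    gave up after more than the threshold and the origin had sent nothing.
    The default of 10 seconds is the value quoted in this comment.
    """
    return (not origin_sent_bytes) and client_wait_secs > abort_threshold_secs


def threshold_is_neutralized(abort_threshold_secs: float,
                             inactivity_timeouts_secs) -> bool:
    """True when the threshold exceeds every inactivity timeout.

    In that case no timeout can fire late enough to trigger the rule.
    """
    return all(abort_threshold_secs > t for t in inactivity_timeouts_secs)
```

   For instance, `threshold_is_neutralized(120, [30, 60])` holds, whereas the default of 10 against the same illustrative timeouts does not.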



[GitHub] [trafficserver] shinrich commented on issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

Posted by GitBox <gi...@apache.org>.

shinrich commented on issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-819885851


   Did some more work today on down servers in our environment.  
   
   One thing I hadn't noticed before was that an origin failure only contributes to the down server count if the t_state->current.server->connect_result is non-zero.   That is, a real error happened while setting up the TCP/TLS connection.  There are many messages generated in error.log where the connect_result is 0 and a failure happened between connect open and first byte from server.  These transactions are available for retries, but they don't contribute to the counts toward marking a server down.
   
   Locally, we are trying a build that only adds a log to error.log if it really is a connect failure.  It really cut down the noise in our logs.



[GitHub] [trafficserver] djcarlin commented on issue #7290: Clarification and/or configurability of what it means for a server to be dead/down

Posted by GitBox <gi...@apache.org>.

djcarlin commented on issue #7290:
URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-809439293


   I noticed squid.log entries with "ERR_CONNECT_FAIL 502" when a POST was happening, and the Content-Length sent in the client request differed greatly from the %<cqql> log field - "Client request header and content length combined, in bytes."
   
   In our case, Content-Length was tens of megabytes and cqql was only hundreds of bytes. This was marking origin down and breaking other valid requests.
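
   One way to spot such requests is to compare the two values once they have been extracted from the log. This is a sketch only; the function name and the `ratio` cutoff are arbitrary illustrative choices, and extracting the fields from a particular log format is left out.

```python
def looks_like_truncated_upload(content_length: int, cqql: int,
                                ratio: float = 0.5) -> bool:
    """Flag a request where far fewer bytes arrived than were declared.

    content_length is the Content-Length header value from the client.
    cqql is the logged client request header plus content length in
    bytes, so a complete request should log at least content_length.
    ratio is an arbitrary cutoff chosen for illustration.
    """
    if content_length <= 0:
        # Nothing was declared, so there is nothing to compare against.
        return False
    return cqql < content_length * ratio
```

   A request declaring tens of megabytes but logging only a few hundred bytes in cqql would be flagged, while a fully received request would not.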

