You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by "John Rushford (JIRA)" <ji...@apache.org> on 2016/08/31 17:23:21 UTC

[jira] [Comment Edited] (TS-4801) Are we marking parents down too aggressively?

    [ https://issues.apache.org/jira/browse/TS-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452771#comment-15452771 ] 

John Rushford edited comment on TS-4801 at 8/31/16 5:22 PM:
------------------------------------------------------------

Even with that setup, a connection error is a connection error.  markParentDown() increments the failure count for that parent ond only really makes it unavailable once the fail threshold is met.


was (Author: jrushford):
Even with that setup, a connection error is a connection error.  marParentDown() increments the failure count for that parent ond only really makes it unavailable once the fail threshold is met.

> Are we marking parents down too aggressively?
> ---------------------------------------------
>
>                 Key: TS-4801
>                 URL: https://issues.apache.org/jira/browse/TS-4801
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Parent Proxy
>            Reporter: Leif Hedstrom
>            Assignee: Leif Hedstrom
>             Fix For: 7.0.0
>
>
> [~jrushford] and I were looking at some code, and found this piece which looks suspicious:
> {code}
>     } else if (s->current.attempts < s->txn_conf->parent_connect_attempts) {
>       s->current.attempts++;
>       // Are we done with this particular parent?
>       if ((s->current.attempts - 1) % s->http_config_param->per_parent_connect_attempts != 0) {
>         // No we are not done with this parent so retry
>         s->next_action = how_to_open_connection(s);
>         DebugTxn("http_trans", "%s Retrying parent for attempt %d, max %" PRId64, "[handle_response_from_parent]",
>                  s->current.attempts, s->http_config_param->per_parent_connect_attempts);
>         return;
>       } else {
>         DebugTxn("http_trans", "%s %d per parent attempts exhausted", "[handle_response_from_parent]", s->current.attempts);
>         // Only mark the parent down if we failed to connect
>         //  to the parent otherwise slow origin servers cause
>         //  us to mark the parent down
>         if (s->current.state == CONNECTION_ERROR) {
>           s->parent_params->markParentDown(&s->parent_result);
>         }
>         // We are done so look for another parent if any
>         next_lookup = find_server_and_update_current_info(s);
>       }
>     } else {
>       // Done trying parents... fail over to origin server if that is
>       //   appropriate
>       DebugTxn("http_trans", "[handle_response_from_parent] Error. No more retries.");
>       s->parent_params->markParentDown(&s->parent_result);
>       s->parent_result.result = PARENT_FAIL;
>       next_lookup             = find_server_and_update_current_info(s);
>     }
> {code}
> Here's my thinking: It seems that if we exhaust parent_connect_attempts (which is proxy.config.http.parent_proxy.total_connect_attempts, default 4), we end up always marking whatever the current parent is as potentially down.
> This seems wrong, because it might do this even if the number of tries against that particular parent has not been exhausted (that is the proxy.config.http.parent_proxy.per_parent_connect_attempts setting, tested inside the loop).
> Looking at this, it feels that we should either remove that markParentDown() entirely, or at a minimum add the same checks as above, i.e.
> {code}
> -      s->parent_params->markParentDown(&s->parent_result);
> +      if (s->current.state == CONNECTION_ERROR) {
> +        s->parent_params->markParentDown(&s->parent_result);
> +      }
> {code}
> I'm still doing some more tests, one concern is that it's unclear if the earlier comment is correct, and that we can detect the difference between a connection failure to parent, vs an origin server response to the parent is just being slow (resulting in a timeout and no response to child).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)