You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tomcat.apache.org by Rainer Jung <ra...@kippdata.de> on 2008/04/18 14:41:46 UTC
Re: svn commit: r647085 - in /tomcat/connectors/trunk/jk: native/common/jk_lb_worker.c
native/common/jk_lb_worker.h native/common/jk_shm.h xdocs/miscellaneous/changelog.xml
Hi Mladen,
I went through the new states and busy lines. Comments see below.
mturk@apache.org schrieb:
> @@ -1030,11 +1038,22 @@
>
> /* Set returned error to OK */
> *is_error = JK_HTTP_OK;
> + if (!(states = (int *)malloc(num_of_workers * sizeof(int)))) {
> + *is_error = JK_HTTP_SERVER_ERROR;
> + jk_log(l, JK_LOG_ERROR,
> + "Failed allocating private worker state memory");
> + JK_TRACE_EXIT(l);
> + return JK_SERVER_ERROR;
> + }
We don't free, if a few lines further down allocating the AJP
message/message buffer fails, because then we use an early return.
Maybe we should also add
@@ -1231,7 +1231,11 @@
if (rec->s->busy)
rec->s->busy--;
if (service_stat == JK_TRUE) {
- rec->s->state = JK_LB_STATE_OK;
+ /*
+ * Successful request.
+ */
+ rec->s->state = JK_LB_STATE_OK;
+ states[rec->i] = JK_LB_STATE_OK;
rec->s->error_time = 0;
rc = JK_TRUE;
recoverable = JK_UNSET;
Since we will use the final value of states[rec->i] for logging. At the
moment you only check against ERROR there, but one never knows for the
future. It sounds safer to set state[] always when we set ->state.
> @@ -1146,6 +1169,7 @@
>
> /* Increment the number of workers serving request */
> p->worker->s->busy++;
> + rec->s->busy++;
> if (p->worker->s->busy > p->worker->s->max_busy)
> p->worker->s->max_busy = p->worker->s->busy;
> if ( (p->worker->lbmethod == JK_LB_METHOD_REQUESTS) ||
So we actually have two different uses for the local states copy:
A) in case of a problem with an lb member, which is not big enough to
put it into global error, we can make sure, that we don't use the same
node again during failover of the single request (by setting the local
state to error). This behaviour is easy to understand.
B) we want to differentiate the decision of putting an lb member into
global error depending on the observation, if there is still some
request from the same process running on the "failed" node. In some
error situations only the last returning request will trigger global
error state. This could mean, that we might never trigger it, because we
never stop sending new requests there and none of the returning requests
might be the last ones, although all of them return constantly with the
same problem. This case seems to make sense only for few situations.
So when do we use cases A) and B)?
A) service_stat is one of:
- JK_CLIENT_ERROR: The case of client read or write errors. We don't
recover in this case, so if we set states to OK or to ERROR only changes
the logging behaviour at the end of the main loop. Do you really want to
log the additional message? Since we do no recovery, we could just as
well stick to OK here (also for states).
- JK_SERVER_ERROR: the case, when we ran into memory allocation problems
in jk_ajp_common. Good example for A).
- JK_STATUS_ERROR: Fail on status soft error. Good example for A)
- JK_REPLY_TIMEOUT (below max reply timeouts): Good example for A)
B) service_stat is one of:
- JK_STATUS_FATAL_ERROR: Fail on status hard error. Since the user
configured, that he wants the node to go into error, I don't understand,
why we should respect the busy counter.
- JK_REPLY_TIMEOUT (above max reply timeouts). Here I'm afraid, that
this has a high potential for actually making the timeout useless. If a
node throws a high number of reply timeouts, but we wait for the "last"
request to return before putting it into error, I assume it will never
be put into error, sonce there will be no "last" request. We simply
proceed forwarding requests and busy will stay above 0 all the time. So
I think, in this case we should keep it the way it was before, i.e.
state=ERROR.
- all other error return codes (JK_FALSE, JK_FATAL_ERROR): Lots of
different reasons, from invalid struct params in service call, to all
sorts of low-level communication problems (timeouts, connection refused
etc.), protocol errors (corrupt AJP packets), ...
Again here I'm not able to understand the behaviour after we switch to
respecting busyness. I would assume, that again the node gets traffic
all the time and in effect we no longer put it into error, but keep it
OK all the time.
In summary, I see no clear situations, where B) really is a good idea.
Would you mind to provide more insight?
I somehow think, that if you want to do more clever error reactions, we
need to check conditions inside jk_ajp_common and then return the info,
if we only want to have local or global error back to the lb (like the
recoverable flag in the jk_endpoint; we tried to guess recoverability
from the ajp service return code earlier, but the info wasn't rich
enough, so the jk_ajp_common can now return recoverability info directly
via endpoint.recoverable).
A couple of additional thoughts:
- error_time: in the cases, where you set the global state depending on
the busy value, you always set error_time, even if the node stays OK.
- force_recovery inside service loop does not reset the states array and
thus is broken
- in case B), if busy>=0, should we really set the global state to OK?
We might override an ERROR state set by some other process. I've got no
good solution for this.
- states inside the service loop do no longer learn from other
processes, because they are initialized before the loop. Moving
initialization to the beginning of the loop doesn't help either, because
then we'll override possible local error states learned during the
previous loop iterations. One could fix it with "used state in this
iteration = worse of global state and local state".
Regards,
Rainer
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org