You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tomcat.apache.org by Rainer Jung <ra...@kippdata.de> on 2008/04/18 14:41:46 UTC

Re: svn commit: r647085 - in /tomcat/connectors/trunk/jk: native/common/jk_lb_worker.c native/common/jk_lb_worker.h native/common/jk_shm.h xdocs/miscellaneous/changelog.xml

Hi Mladen,

I went through the new states and busy lines. Comments see below.

mturk@apache.org schrieb:
> @@ -1030,11 +1038,22 @@
>  
>      /* Set returned error to OK */
>      *is_error = JK_HTTP_OK;
> +    if (!(states = (int *)malloc(num_of_workers * sizeof(int)))) {
> +        *is_error = JK_HTTP_SERVER_ERROR;
> +        jk_log(l, JK_LOG_ERROR,
> +               "Failed allocating private worker state memory");
> +        JK_TRACE_EXIT(l);
> +        return JK_SERVER_ERROR;        
> +    }

We don't free, if a few lines further down allocating the AJP 
message/message buffer fails, because then we use an early return.

Maybe we should also add

@@ -1231,7 +1231,11 @@
                  if (rec->s->busy)
                      rec->s->busy--;
                  if (service_stat == JK_TRUE) {
-                    rec->s->state = JK_LB_STATE_OK;
+                    /*
+                     * Successful request.
+                     */
+                    rec->s->state  = JK_LB_STATE_OK;
+                    states[rec->i] = JK_LB_STATE_OK;
                      rec->s->error_time = 0;
                      rc = JK_TRUE;
                      recoverable = JK_UNSET;

Since we will use the final value of states[rec->i] for logging. At the 
moment you only check against ERROR there, but one never knows for the 
future. It sounds safer to set state[] always when we set ->state.

> @@ -1146,6 +1169,7 @@
>  
>                  /* Increment the number of workers serving request */
>                  p->worker->s->busy++;
> +                rec->s->busy++;
>                  if (p->worker->s->busy > p->worker->s->max_busy)
>                      p->worker->s->max_busy = p->worker->s->busy;
>                  if ( (p->worker->lbmethod == JK_LB_METHOD_REQUESTS) ||

So we actually have two different uses for the local states copy:

A) in case of a problem with an lb member, which is not big enough to 
put it into global error, we can make sure, that we don't use the same 
node again during failover of the single request (by setting the local 
state to error). This behaviour is easy to understand.

B) we want to differentiate the decision of putting an lb member into 
global error depending on the observation, if there is still some 
request from the same process running on the "failed" node. In some 
error situations only the last returning request will trigger global 
error state. This could mean, that we might never trigger it, because we 
never stop sending new requests there and none of the returning requests 
might be the last ones, although all of them return constantly with the 
same problem. This case seems to make sense only for few situations.

So when do we use cases A) and B)?

A) service_stat is one of:

- JK_CLIENT_ERROR: The case of client read or write errors. We don't 
recover in this case, so if we set states to OK or to ERROR only changes 
the logging behaviour at the end of the main loop. Do you really want to 
log the additional message? Since we do no recovery, we could just as 
well stick to OK here (also for states).

- JK_SERVER_ERROR: the case, when we ran into memory allocation problems 
in jk_ajp_common. Good example for A).

- JK_STATUS_ERROR: Fail on status soft error. Good example for A)

- JK_REPLY_TIMEOUT (below max reply timeouts): Good example for A)

B) service_stat is one of:

- JK_STATUS_FATAL_ERROR: Fail on status hard error. Since the user 
configured, that he wants the node to go into error, I don't understand, 
why we should respect the busy counter.

- JK_REPLY_TIMEOUT (above max reply timeouts). Here I'm afraid, that 
this has a high potential for actually making the timeout useless. If a 
node throws a high number of reply timeouts, but we wait for the "last" 
request to return before putting it into error, I assume it will never 
be put into error, sonce there will be no "last" request. We simply 
proceed forwarding requests and busy will stay above 0 all the time. So 
I think, in this case we should keep it the way it was before, i.e. 
state=ERROR.

- all other error return codes (JK_FALSE, JK_FATAL_ERROR): Lots of 
different reasons, from invalid struct params in service call, to all 
sorts of low-level communication problems (timeouts, connection refused 
etc.), protocol errors (corrupt AJP packets), ...

Again here I'm not able to understand the behaviour after we switch to 
respecting busyness. I would assume, that again the node gets traffic 
all the time and in effect we no longer put it into error, but keep it 
OK all the time.

In summary, I see no clear situations, where B) really is a good idea. 
Would you mind to provide more insight?

I somehow think, that if you want to do more clever error reactions, we 
need to check conditions inside jk_ajp_common and then return the info, 
if we only want to have local or global error back to the lb (like the 
recoverable flag in the jk_endpoint; we tried to guess recoverability 
from the ajp service return code earlier, but the info wasn't rich 
enough, so the jk_ajp_common can now return recoverability info directly 
via endpoint.recoverable).

A couple of additional thoughts:

- error_time: in the cases, where you set the global state depending on 
the busy value, you always set error_time, even if the node stays OK.

- force_recovery inside service loop does not reset the states array and 
thus is broken

- in case B), if busy>=0, should we really set the global state to OK? 
We might override an ERROR state set by some other process. I've got no 
good solution for this.

- states inside the service loop do no longer learn from other 
processes, because they are initialized before the loop. Moving 
initialization to the beginning of the loop doesn't help either, because 
then we'll override possible local error states learned during the 
previous loop iterations. One could fix it with "used state in this 
iteration = worse of global state and local state".

Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org