Posted to users@tomcat.apache.org by Scott McClanahan <sc...@trnswrks.com> on 2007/07/25 15:47:18 UTC

mod_jk error detection

I am installing mod_jk 1.2.23 in a load balancing configuration between
apache 2.0.52 and tomcat 5.0.28.  I am trying to understand how the
mod_jk error detection actually works.  In the documentation, the
"socket_timeout" directive defaults to zero (infinite waiting) but the
"retries" directive defaults to two.

With these settings how could I expect the connector to behave if:

1.  Tomcat dies and the port is no longer listening resulting in an
immediate icmp response.

2.  The box hosting tomcat dies or the tcp stack for whatever reason
tanks resulting in no immediate icmp response.

3.  The connector does make a successful connection to the backend
tomcat worker only to have that worker become slow and almost
unresponsive.

Are there more directives I should be concerned with?  Currently, I have
no intention of monitoring the HTTP response status codes to detect
errors.


---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: mod_jk error detection

Posted by Scott McClanahan <sc...@trnswrks.com>.

Thanks, I'll be reading up this afternoon and posting comments.




Re: mod_jk error detection

Posted by Rainer Jung <ra...@kippdata.de>.
> One obvious thing that confuses me and could be changed is the "Advanced
> worker directives" table.  It includes directives that are applicable to
> both load balancer workers and real workers and only distinguishes which
> directives are used for which worker when it is to be used for a load
> balancer worker.  Does that mean the others are usable directives for
> both real workers and load balancer workers or just real workers or in
> some cases both.
> 
> I believe I know the answer to that but it is somewhat misleading.

You are right, this needs a little foreword before the table. At the 
moment we note "Only used for load balancer workers.", "Only used for a 
member worker of a load balancer." or "This attribute can be used for 
normal workers and for load balancer workers.". In fact we would need to 
mark whether an attribute is useful

LB) for an lb worker
N) for a non lb worker (normal worker, mostly synonymous with ajp13)
M) for an lb member worker

X) for a worker in the worker list (worker.list)

An attribute might apply to any "or" combination of LB), N) and M), with an 
optional "and" with X).

I also think that our "advanced" category is mostly historically 
motivated, so we should check whether some attributes should change their group.

Have fun using mod_jk!

Regards,

Rainer



Re: mod_jk error detection

Posted by Scott McClanahan <sc...@trnswrks.com>.
On Wed, 2007-07-25 at 22:40 +0200, Rainer Jung wrote:
> Scott McClanahan wrote:
> > Thanks so much! I'd like to continue this thread a bit more because of
> > how helpful I think it will be for everyone using mod_jk.
> > 
> >> That one, reply_timeout, is not really meant for high speed detection. 
> >> Usually you've got an app that every now and then needs 10 or 20 seconds 
> >> for an answer and you don't like to disable a worker automatically 
> >> because of those rare events. So normally one sets reply_timeout to 1, 2 
> >> or 3 minutes.
> > 
> > I don't understand what besides a timed out CPING/CPONG message would
> > render a backend tomcat disabled, especially in a default config since
> > reply_timeout is 0.
> 
> Default config: no CPing/CPong. But: after some time the TCP stack will 
> give up, when there is a network problem, or the backend is no longer 
> listening. So this case will even be handled in a default config, but 
> depending on the exact network situation, the error detection might take 
> a long time.
> 
> In case your backend simply eats your requests, but doesn't produce 
> answers, you will very fast eat up all connections and threads and the 
> whole system will hang - without configured timeouts.

I see your point.  I was thinking only within the context of mod_jk.
Meaning what in mod_jk other than CPING/CPONG message failures would
cause a worker to go into error state.  You answered that.

> 
> BTW: there is also a non-default config to make a worker fail on several 
> received HTTP status codes, "fail_on_status".
> 
> >> We have to strongly make a difference between retries of a non-lb worker 
> >> and of a load balancer worker. A normal worker has a simple retry 
> >> procedure, independent of whether it is used directly or as part of 
> >> an lb. If it detects an error it uses another pool connection and by 
> >> default tries once more.
> > 
> > If that happens does the real worker officially change to an error state
> > which would subsequently kick off the retry logic of the load balancer
> > worker?
> 
> Without an lb a worker does not have an error state. It will be 
> continuously reused. Only an lb uses error states and temporarily 
> disables a failed worker. Even an lb will continuously reuse a worker, 
> if there is no other worker to failover.

I understand this bit now finally too.  It was a really good idea to
have the CPING/CPONG message timeout checks before individual requests
get forwarded to avoid several different problem scenarios here.  Good
thinking.

> 
> >> The maintenance uses a real request and handles it as if the backend 
> >> wouldn't have failed. If you enabled CPing/CPong this means, that it 
> >> would detect a still broken backend early and transparently send the 
> >> request to another member. Because no part of the request (the CPing 
> >> doesn't count) has been sent yet, the failover to another member 
> >> happens independently of recovery_options (i.e. even with 
> >> recovery_options 3).
> > 
> > Is the request used to test the health of the backend tomcat whichever
> > one comes first after a global maintenance run even if it has been
> > previously serviced by another healthy tomcat?  Is this request attempt
> > to a once errant worker only to test its healthiness and not to actually
> > have it fulfill the request?  I would hope it is only to test the health
> > of the backend tomcat and even if it is now willing to accept
> > connections, the request goes to whatever tomcat has been previously and
> > successfully responding to the session.
> 
> No, the first new request accepted by the web server and mapped to the 
> lb will be used (at least if it is free to be routed to any worker. If 
> the request belongs to a session located on another backend and the 
> default config with sticky sessions is active, it will of course be sent 
> to its correct backend). It is a real user request. If the backend 
> works, OK. If it doesn't accept the request, we can still send it to 
> some other worker. If the backend accepts the requests, but processing 
> fails, depending on recovery_options the user gets an error.

Sounds great too.

> 
> >> If you like to improve the page about load balancing or the timeouts 
> >> page, or you want to add some parts about retries and recovery: 
> >> contributions are welcome.
> > 
> > After we are done discussing, I might have some recommendations.  Again,
> > you've been great.
> 
> Thanks. At least we improve the knowledge inside the mailing list archive.

One obvious thing that confuses me and could be changed is the "Advanced
worker directives" table.  It includes directives that are applicable to
both load balancer workers and real workers and only distinguishes which
directives are used for which worker when it is to be used for a load
balancer worker.  Does that mean the others are usable directives for
both real workers and load balancer workers or just real workers or in
some cases both.

I believe I know the answer to that but it is somewhat misleading.



Re: mod_jk error detection

Posted by Rainer Jung <ra...@kippdata.de>.
Scott McClanahan wrote:
> Thanks so much! I'd like to continue this thread a bit more because of
> how helpful I think it will be for everyone using mod_jk.
> 
>> That one, reply_timeout, is not really meant for high speed detection. 
>> Usually you've got an app that every now and then needs 10 or 20 seconds 
>> for an answer and you don't like to disable a worker automatically 
>> because of those rare events. So normally one sets reply_timeout to 1, 2 
>> or 3 minutes.
> 
> I don't understand what besides a timed out CPING/CPONG message would
> render a backend tomcat disabled, especially in a default config since
> reply_timeout is 0.

Default config: no CPing/CPong. But: after some time the TCP stack will 
give up, when there is a network problem, or the backend is no longer 
listening. So this case will even be handled in a default config, but 
depending on the exact network situation, the error detection might take 
a long time.

In case your backend simply eats your requests, but doesn't produce 
answers, you will very fast eat up all connections and threads and the 
whole system will hang - without configured timeouts.

BTW: there is also a non-default config to make a worker fail on several 
received HTTP status codes, "fail_on_status".

>> We have to strongly make a difference between retries of a non-lb worker 
>> and of a load balancer worker. A normal worker has a simple retry 
>> procedure, independent of whether it is used directly or as part of 
>> an lb. If it detects an error it uses another pool connection and by 
>> default tries once more.
> 
> If that happens does the real worker officially change to an error state
> which would subsequently kick off the retry logic of the load balancer
> worker?

Without an lb a worker does not have an error state. It will be 
continuously reused. Only an lb uses error states and temporarily 
disables a failed worker. Even an lb will continuously reuse a worker, 
if there is no other worker to failover.

>> The maintenance uses a real request and handles it as if the backend 
>> wouldn't have failed. If you enabled CPing/CPong this means, that it 
>> would detect a still broken backend early and transparently send the 
>> request to another member. Because no part of the request (the CPing 
>> doesn't count) has been sent yet, the failover to another member 
>> happens independently of recovery_options (i.e. even with 
>> recovery_options 3).
> 
> Is the request used to test the health of the backend tomcat whichever
> one comes first after a global maintenance run even if it has been
> previously serviced by another healthy tomcat?  Is this request attempt
> to a once errant worker only to test its healthiness and not to actually
> have it fulfill the request?  I would hope it is only to test the health
> of the backend tomcat and even if it is now willing to accept
> connections, the request goes to whatever tomcat has been previously and
> successfully responding to the session.

No, the first new request accepted by the web server and mapped to the 
lb will be used (at least if it is free to be routed to any worker. If 
the request belongs to a session located on another backend and the 
default config with sticky sessions is active, it will of course be sent 
to its correct backend). It is a real user request. If the backend 
works, OK. If it doesn't accept the request, we can still send it to 
some other worker. If the backend accepts the requests, but processing 
fails, depending on recovery_options the user gets an error.

>> If you like to improve the page about load balancing or the timeouts 
>> page, or you want to add some parts about retries and recovery: 
>> contributions are welcome.
> 
> After we are done discussing, I might have some recommendations.  Again,
> you've been great.

Thanks. At least we improve the knowledge inside the mailing list archive.

Regards,

Rainer



Re: mod_jk error detection

Posted by Scott McClanahan <sc...@trnswrks.com>.
Thanks so much! I'd like to continue this thread a bit more because of
how helpful I think it will be for everyone using mod_jk.

On Wed, 2007-07-25 at 22:00 +0200, Rainer Jung wrote:
> Hi Scott,
> 
> > I thoroughly enjoyed the updated docs.  It is just what I needed.  I
> > just want to mention a few inferences I have now from reading it.
> 
> Thanks.
> 
> > In a load balanced setup using connect_timeout and prepost_timeout, this
> > will protect me from sending either newly established connections (rare
> > event due to persistence) as well as each and every individual request
> > from being sent to a failed tomcat node based on CPING/CPONG messages.
> > These messages only detect whether or not the container (I'm using
> > tomcat) is healthy enough to respond to such a message but not
> > necessarily anything more, correct?  Basically, its ajp listener is
> > responsive.  Plus, if I need more high speed error detection I can use
> 
> That's correct.
> 
> > reply_timeout.  Sound correct?
> 
> That one, reply_timeout, is not really meant for high speed detection. 
> Usually you've got an app that every now and then needs 10 or 20 seconds 
> for an answer and you don't like to disable a worker automatically 
> because of those rare events. So normally one sets reply_timeout to 1, 2 
> or 3 minutes.

I don't understand what besides a timed out CPING/CPONG message would
render a backend tomcat disabled, especially in a default config since
reply_timeout is 0.

> 
> Now with the new max_reply_timeouts one can experiment with lower 
> values. It's new, so not enough experience for good suggestions.
> 
> > I get confused on the recovery_options section.  How does it work in a
> > load balanced environment?  If tomcat receives a request and processes
> > some of it followed by a catastrophic failure before completing the
> > response, what exactly does a repeated request from the client do?
> > Assuming recovery_options is set to 0.
> 
> Value "0" means, if you don't get any part of the answer and an error 
> occurs (network, reply_timeout, ...) then send the same request again to 
> another member of the load balancer (if a working member is remaining).
> 
> That's why you usually do not want to use value "0" in case your app 
> has data changing use cases. Most apps have.
> 
> If you use REST principles and HEAD and GET are always idempotent for 
> your app, the new (version 1.2.24) bits 8 and 16 are your friend!
> 
> > Also, I get confused with the section describing the retries directive.
> > In a load balanced environment, would the connector retry no matter the
> > state (tcp state here) of the connection whether it be established
> > already?  Would it retry against the same backend tomcat server?  The
> > reason I ask is because the docs say "If the load balancer can not get a
> > free connection for a member worker from the pool, it will try again a
> > number of times given by retries." I highlighted the words that confuse
> > me.
> 
> We have to strongly make a difference between retries of a non-lb worker 
> and of a load balancer worker. A normal worker has a simple retry 
> procedure, independent of whether it is used directly or as part of 
> an lb. If it detects an error it uses another pool connection and by 
> default tries once more.

If that happens does the real worker officially change to an error state
which would subsequently kick off the retry logic of the load balancer
worker?

> 
> An lb has another idea of retries. It uses retries if all connections to 
> a backend are busy. For Apache with default config, this should never 
> happen, because we allow as many connections as threads per process. So 
> any request should be able to get a connection without waiting (maybe it 
> needs to start a new one). For the other web servers we don't have a 
> good way to detect the "correct" pool size. In some cases even for 
> Apache it might be interesting to use a smaller pool size, in case the 
> backend is only used occasionally and/or you want to prevent it from 
> getting flooded in case of congestion. Then you might run out of 
> available connections and requests will have to wait. LB retries 
> configure this waiting.
> 
> > Every 60 seconds would we expect the connector to attempt to send a
> > valid request to a backend tomcat and fail or once a worker goes into
> > error state do we only check with CPING/CPONG requests during the
> > maintenance cycle?
> 
> The maintenance uses a real request and handles it as if the backend 
> wouldn't have failed. If you enabled CPing/CPong this means, that it 
> would detect a still broken backend early and transparently send the 
> request to another member. Because no part of the request (the CPing 
> doesn't count) has been sent yet, the failover to another member 
> happens independently of recovery_options (i.e. even with 
> recovery_options 3).

Is the request used to test the health of the backend tomcat whichever
one comes first after a global maintenance run even if it has been
previously serviced by another healthy tomcat?  Is this request attempt
to a once errant worker only to test its healthiness and not to actually
have it fulfill the request?  I would hope it is only to test the health
of the backend tomcat and even if it is now willing to accept
connections, the request goes to whatever tomcat has been previously and
successfully responding to the session.

> 
> If you like to improve the page about load balancing or the timeouts 
> page, or you want to add some parts about retries and recovery: 
> contributions are welcome.

After we are done discussing, I might have some recommendations.  Again,
you've been great.



Re: mod_jk error detection

Posted by Rainer Jung <ra...@kippdata.de>.
Hi Scott,

> I thoroughly enjoyed the updated docs.  It is just what I needed.  I
> just want to mention a few inferences I have now from reading it.

Thanks.

> In a load balanced setup using connect_timeout and prepost_timeout, this
> will protect me from sending either newly established connections (rare
> event due to persistence) as well as each and every individual request
> from being sent to a failed tomcat node based on CPING/CPONG messages.
> These messages only detect whether or not the container (I'm using
> tomcat) is healthy enough to respond to such a message but not
> necessarily anything more, correct?  Basically, its ajp listener is
> responsive.  Plus, if I need more high speed error detection I can use

That's correct.

> reply_timeout.  Sound correct?

That one, reply_timeout, is not really meant for high speed detection. 
Usually you've got an app that every now and then needs 10 or 20 seconds 
for an answer and you don't like to disable a worker automatically 
because of those rare events. So normally one sets reply_timeout to 1, 2 
or 3 minutes.

Now with the new max_reply_timeouts one can experiment with lower 
values. It's new, so not enough experience for good suggestions.

> I get confused on the recovery_options section.  How does it work in a
> load balanced environment?  If tomcat receives a request and processes
> some of it followed by a catastrophic failure before completing the
> response, what exactly does a repeated request from the client do?
> Assuming recovery_options is set to 0.

Value "0" means, if you don't get any part of the answer and an error 
occurs (network, reply_timeout, ...) then send the same request again to 
another member of the load balancer (if a working member is remaining).

That's why you usually do not want to use value "0" in case your app 
has data changing use cases. Most apps have.

If you use REST principles and HEAD and GET are always idempotent for 
your app, the new (version 1.2.24) bits 8 and 16 are your friend!
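
To make the bit values concrete, here is a minimal sketch in Python (not
mod_jk code) that decodes a recovery_options value. The meanings given to
bits 1, 2 and 4 follow the workers.properties reference as far as it is
understood here; they are not spelled out in this thread. That bits 8 and
16 override bit 1 for HEAD and GET is likewise an assumption to check
against the docs for your version. The names RecoveryOptions, decode and
may_resend are made up for this illustration.

```python
from enum import IntFlag


class RecoveryOptions(IntFlag):
    """Bit mask values of recovery_options (member names are illustrative)."""
    NO_RECOVER_AFTER_REQUEST_SENT = 1  # assumption: request already reached Tomcat
    NO_RECOVER_AFTER_HEADERS_SENT = 2  # assumption: headers already sent to client
    CLOSE_SOCKET_ON_CLIENT_ABORT = 4   # assumption: unrelated to failover
    ALWAYS_RECOVER_HEAD = 8            # new in 1.2.24 according to the thread
    ALWAYS_RECOVER_GET = 16            # new in 1.2.24 according to the thread


def decode(value: int) -> list:
    """Return the names of all bits set in a recovery_options value."""
    return [flag.name for flag in RecoveryOptions if value & flag]


def may_resend(value: int, method: str) -> bool:
    """Model one case only: the request went to Tomcat, no part of the
    answer came back, and then an error occurred."""
    method = method.upper()
    if method == "HEAD" and value & RecoveryOptions.ALWAYS_RECOVER_HEAD:
        return True
    if method == "GET" and value & RecoveryOptions.ALWAYS_RECOVER_GET:
        return True
    return not (value & RecoveryOptions.NO_RECOVER_AFTER_REQUEST_SENT)


print(decode(3))
print(may_resend(0, "POST"), may_resend(3, "POST"), may_resend(3 | 16, "GET"))
```

In this model value 0 resends a POST to another member, value 3 does not,
and adding bit 16 lets a GET be resent again, which is the trade-off
described above for apps with data changing requests.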

> Also, I get confused with the section describing the retries directive.
> In a load balanced environment, would the connector retry no matter the
> state (tcp state here) of the connection whether it be established
> already?  Would it retry against the same backend tomcat server?  The
> reason I ask is because the docs say "If the load balancer can not get a
> free connection for a member worker from the pool, it will try again a
> number of times given by retries." I highlighted the words that confuse
> me.

We have to strongly make a difference between retries of a non-lb worker 
and of a load balancer worker. A normal worker has a simple retry 
procedure, independent of whether it is used directly or as part of 
an lb. If it detects an error it uses another pool connection and by 
default tries once more.

An lb has another idea of retries. It uses retries if all connections to 
a backend are busy. For Apache with default config, this should never 
happen, because we allow as many connections as threads per process. So 
any request should be able to get a connection without waiting (maybe it 
needs to start a new one). For the other web servers we don't have a 
good way to detect the "correct" pool size. In some cases even for 
Apache it might be interesting to use a smaller pool size, in case the 
backend is only used occasionally and/or you want to prevent it from 
getting flooded in case of congestion. Then you might run out of 
available connections and requests will have to wait. LB retries 
configure this waiting.
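
The waiting described above can be pictured with a small Python sketch. It
is a toy model, not mod_jk's implementation: try_acquire, pause_seconds
and the 0.1 second default are assumptions for illustration, and the loop
reads the quoted doc sentence as one first attempt followed by up to
"retries" further attempts.

```python
import time
from typing import Callable, Optional


def acquire_with_retries(
    try_acquire: Callable[[], Optional[object]],
    retries: int,
    pause_seconds: float = 0.1,          # hypothetical pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[object]:
    """Ask the pool for a free connection, waiting and retrying while busy.

    try_acquire returns a connection, or None if every pooled connection
    to the member is currently in use.
    """
    conn = try_acquire()
    tries_left = retries
    while conn is None and tries_left > 0:
        sleep(pause_seconds)             # wait for another request to finish
        conn = try_acquire()
        tries_left -= 1
    return conn                          # None means the pool stayed busy


# Usage: a pool that is busy twice and then has a free connection.
answers = iter([None, None, "connection"])
print(acquire_with_retries(lambda: next(answers), retries=2, sleep=lambda s: None))
```

The sleep function is passed in only so that the example runs without
actually pausing.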

> Every 60 seconds would we expect the connector to attempt to send a
> valid request to a backend tomcat and fail or once a worker goes into
> error state do we only check with CPING/CPONG requests during the
> maintenance cycle?

The maintenance uses a real request and handles it as if the backend 
wouldn't have failed. If you enabled CPing/CPong this means, that it 
would detect a still broken backend early and transparently send the 
request to another member. Because no part of the request (the CPing 
doesn't count) has been sent yet, the failover to another member 
happens independently of recovery_options (i.e. even with 
recovery_options 3).
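
As a rough illustration of that flow, here is a toy model in Python. It is
a sketch under assumptions, not the balancer's real algorithm: Member,
route and cping_ok are hypothetical names, and the members are simply
tried in list order as a stand-in for whatever order the balancer would
pick.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Member:
    name: str
    in_error: bool = False


def route(members: List[Member], cping_ok: Callable[[Member], bool]) -> Optional[str]:
    """Pick the member that gets a real user request.

    The CPing probe happens before any part of the request is sent, so a
    failed probe can always fail over to the next member.
    """
    for member in members:
        if cping_ok(member):
            member.in_error = False      # member answered, it is usable again
            return member.name
        member.in_error = True           # still broken, try the next member
    return None                          # no member answered the probe


members = [Member("tomcat1", in_error=True), Member("tomcat2")]
alive = {"tomcat2"}
print(route(members, lambda m: m.name in alive))   # tomcat1 is still down
alive.add("tomcat1")
print(route(members, lambda m: m.name in alive))   # tomcat1 has recovered
```

The first call lands on tomcat2 because the probe to tomcat1 fails before
anything is sent; the second call is served by tomcat1 and clears its
error state.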

If you like to improve the page about load balancing or the timeouts 
page, or you want to add some parts about retries and recovery: 
contributions are welcome.

Regards,

Rainer



Re: mod_jk error detection

Posted by Scott McClanahan <sc...@trnswrks.com>.
I thoroughly enjoyed the updated docs.  It is just what I needed.  I
just want to mention a few inferences I have now from reading it.

In a load balanced setup using connect_timeout and prepost_timeout, this
will protect me from sending either newly established connections (rare
event due to persistence) as well as each and every individual request
from being sent to a failed tomcat node based on CPING/CPONG messages.
These messages only detect whether or not the container (I'm using
tomcat) is healthy enough to respond to such a message but not
necessarily anything more, correct?  Basically, its ajp listener is
responsive.  Plus, if I need more high speed error detection I can use
reply_timeout.  Sound correct?

I get confused on the recovery_options section.  How does it work in a
load balanced environment?  If tomcat receives a request and processes
some of it followed by a catastrophic failure before completing the
response, what exactly does a repeated request from the client do?
Assuming recovery_options is set to 0.

Also, I get confused with the section describing the retries directive.
In a load balanced environment, would the connector retry no matter the
state (tcp state here) of the connection whether it be established
already?  Would it retry against the same backend tomcat server?  The
reason I ask is because the docs say "If the load balancer can not get a
free connection for a member worker from the pool, it will try again a
number of times given by retries." I highlighted the words that confuse
me.

Every 60 seconds would we expect the connector to attempt to send a
valid request to a backend tomcat and fail or once a worker goes into
error state do we only check with CPING/CPONG requests during the
maintenance cycle?

Re: mod_jk error detection

Posted by Rainer Jung <ra...@kippdata.de>.
Hi,

good questions. First of all: I just today wrote a new docs page about 
timeouts. We are soon releasing 1.2.24 which contains this page. You can 
already look at it under

http://people.apache.org/~rjung/mod_jk-dev/docs/

(The new page is named "Timeouts" and is part of the group Generic Howtos.)

Also the new docs contain a better explanation of what retries means, 
especially the huge difference between retries for an lb worker and a 
usual worker. This info is on the updated workers.properties page in the 
reference guide.

> With these settings how could I expect the connector to behave if:
> 
> 1.  Tomcat dies and the port is no longer listening resulting in an
> immediate icmp response.

I would expect that any attempt to use an existing connection or to 
open a new one immediately returns with an error, because the remote 
machine rejects the communication. Further JK behaviour then depends on 
whether you are using a load balancer or not. See retries etc. in the 
updated docs.

> 2.  The box hosting tomcat dies or the tcp stack for whatever reason
> tanks resulting in no immediate icmp response.

As long as your local system or the last router still has an arp entry 
for the dead machine, you will run into very long TCP timeouts. We 
recommend CPing/CPong, see the new Timeouts page.

> 3.  The connector does make a successful connection to the backend
> tomcat worker only to have that worker become slow and almost
> unresponsive.

You should use CPing/CPong and reply timeouts. See again the new 
Timeouts page. If you don't use an lb, the best you can do is to throw 
an error early, so that the rest of the infrastructure doesn't get 
congested.

> Are there more directives I should be concerned with?  Currently, I have
> no intentions on monitoring the http response status codes to detect
> errors.

Look at the new page and look at the workers.properties page of the 
reference guide. Use a load balancing worker, set recovery_options etc.
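
As a starting point, the directives named in this thread could be combined
roughly as follows. This is a minimal sketch in Python that prints a
workers.properties fragment; the worker names (balancer, tomcat1,
tomcat2), the host names, port 8009 and all values are assumptions for
illustration, not recommendations. The three timeouts are given in
milliseconds.

```python
def member(name: str, host: str) -> dict:
    """Settings for one ajp13 member worker (values are illustrative)."""
    prefix = f"worker.{name}."
    return {
        prefix + "type": "ajp13",
        prefix + "host": host,
        prefix + "port": "8009",
        prefix + "connect_timeout": "10000",   # CPing/CPong after connect
        prefix + "prepost_timeout": "10000",   # CPing/CPong before each request
        prefix + "reply_timeout": "120000",    # give up waiting after 2 minutes
        prefix + "recovery_options": "3",      # do not resend once Tomcat got the request
    }


def render(lb_name: str, members: dict) -> str:
    """Build the text of a workers.properties fragment for one lb worker."""
    settings = {
        "worker.list": lb_name,
        f"worker.{lb_name}.type": "lb",
        f"worker.{lb_name}.balance_workers": ",".join(members),
    }
    for name, host in members.items():
        settings.update(member(name, host))
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(render("balancer", {"tomcat1": "app1.example.com", "tomcat2": "app2.example.com"}))
```

Only the load balancer worker appears in worker.list; the two ajp13
workers are reached through it as members.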

HTH.

Regards,

Rainer

P.S.: If you have suggestions how to improve the new page: it's not 
public yet. If you are fast enough, we can include those changes.
