Posted to users@tomcat.apache.org by sa...@twinix.com on 2008/02/20 15:23:35 UTC

mod_jk Problems - - worker went to error state and dont recover

See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on behalf of a User

Hello all,
After long and unsuccessful research, I hope someone can give me a hint about the following problems.

Our Apache/mod_jk/Tomcat infrastructure ran without problems for about one year; then, two months ago, mod_jk errors started to occur.
We upgraded mod_jk and made improvements to worker.properties. The problems changed and became less frequent, but they still appear from time to time.

It seems that the mod_jk workers lose the connection to their Tomcat backend servers; messages in the mod_jk log files point in this direction. Normally this does not seem to be a big problem, but under certain conditions (which ones?) a worker goes into an error state and cannot recover by itself; recovery has to be done manually.

Problem 1: The Tomcats are reachable, so it is unknown why the workers think the server is dead.
Problem 2: I have no idea why the worker goes into an error state and cannot recover.
Problem 3: I am missing explanations of the logged messages. I read the messages but cannot match them to the situation. When does a worker log these messages?

[Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi 
[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
[Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.

-> Which timeout? How does mod_jk decide that Tomcat is down? Where can I find details on errno=110?
-> "receiving reply from tomcat failed with out recovery in send loop attempt=0": what does "with out recovery in send loop" mean?
-> "unrecoverable error 504": are there details on this error?

OK, I turned the logging level to debug. The course of events becomes clearer, but more questions appear. There are socket numbers: which sockets are these, and what do the numbers mean, e.g. "will be shutting down socket 35 for worker INETP1021"? What are the sockets used for? How many are there per worker? Can I configure them?

=> In general, how can I solve such problems? I tried to look into the mod_jk code, searching for error codes and error messages, but could not find any relevant information. I am studying the log files, but cannot find out what really happens.

So maybe someone has an idea why the worker thinks that the corresponding Tomcat is dead, and why it does not recover by itself.

I am also looking for tips on how I can help myself, and on where to find something about the error codes and messages in mod_jk.

thanks for your attention
Best
ahmed musa (writing from vienna)
 
Current infrastructure
We have 3 Apache web servers (2.2.6) running on CentOS release 4.3, kernel version 2.6.9-34.
In front of the web servers there are two hardware load balancers (at two locations), but they play no role in this story.
The web servers are hosted at our ISP.

The web servers balance the requests via mod_jk (version 1.2.25) for approx. 10 webapps across 18 backend Tomcat servers (blade servers; because of underlying application parts the OS is Windows 2003 Server, a long story not worth explaining :-) ). The Tomcat servers get their data via requests against DB2 servers/DB2 databases on the mainframe. The Tomcat servers are in-house and are rebooted nightly because of automated deployment processes.

Between the web servers and the Tomcat servers there is a Checkpoint firewall.
All webapps are deployed on all Tomcats; only mod_jk directs the requests to particular Tomcat instances.
(On one blade server there are two identical Tomcat instances running.)

Versions: Tomcat 5.5.17_11, JDK 1.5.0_11-b03. The requests against the public website(s) are normal short-lived requests, and there are not many of them. Most webapps (portals) need a login and have a strong focus on business logic, so the instances are big (many MBs in RAM), the sessions are sticky and the session timeout is 20 minutes. But there are also few requests. In addition to the user requests, there are monitoring requests from our ISP.
The problems appear on servers/portals that receive very few user requests.

worker.properties
worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus

worker.template.type=ajp13
worker.template.lbfactor=5
worker.template.socket_keepalive=1
worker.template.connect_timeout=7000
worker.template.prepost_timeout=5000
worker.template.reply_timeout=120000
worker.template.retries=6
worker.template.activation=Active
worker.template.recovery_options=7

worker.lbtemplate.type=lb
worker.lbtemplate.max_reply_timeouts=6
worker.lbtemplate.method=Session

# Production workers
# AS-INETP101 - 106 - 6/6 GGI
worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
worker.INETP1011.port=65001
worker.INETP1011.reference=worker.template

....many more of the same

then

worker.ajp_ad.reference=worker.lbtemplate
worker.ajp_ad.balance_workers=INETP1032,INETP1062

.... many more portals

and finally jkstatus

The JkMount is very simple
JkMount /* ajp_ad    --- for the other portals mostly the same

The Portals are Virtual Hosts on the Apache.

Tomcat - server.xml
example
<Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
    <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
......
      <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
            autoDeploy="false" deployOnStartup="false" xmlValidation="false"
            xmlNamespaceAware="false">
        <Alias>www.slfinsol.com</Alias>
        <Alias>web1.slfinsol.com</Alias>
        ...
        <Alias>testweb.slfinsol.com</Alias>
        .....
        <Valve className="org.apache.catalina.valves.AccessLogValve"
               directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
               resolveHosts="false" />
        <Valve className="at.allianz.tomcat.valve.RequestTimeValve"/>
        <Valve className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
        <Context path="" docBase="swl" />
        <Context path="/monitor5" docBase="monitor" />
        <Context path="/swl" docBase="swl" />
      </Host>



---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


RE: mod_jk Problems - - worker went to error state and dont recover

Posted by lu...@bt.com.
Thanks, I did try to unsubscribe but I kept getting them. Will try the
address below. 

Luke Walshe
BT Operate, HGIPCC Technical Specialist
Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 


-----Original Message-----
From: Rainer Jung [mailto:rainer.jung@kippdata.de] 
Sent: 21 February 2008 09:30
To: Tomcat Users List
Subject: Re: mod_jk Problems - - worker went to error state and dont
recover

See the footer of any mail on the list:

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


luke.walshe@bt.com wrote:
> All
> 
> Apologies, this is unrelated. How do I unsubscribe from this mailing
> list, I thought it would be useful and small but its overwhelming my
> inbox?
> 
> Thanks in Advance.
> 
> Luke Walshe
> BT Operate, HGIPCC Technical Specialist
> Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 
> 
> -----Original Message-----
> From: Ahmed Musa [mailto:donald1090@gmx.at] 
> Sent: 21 February 2008 09:25
> To: Tomcat Users List
> Subject: Re: mod_jk Problems - - worker went to error state and dont
> recover
> 
> Hello Rainer,
> Thanks for your information - the situation is getting clearer now.
> I will read some of the docs again, following your links, and will run
> further tests, also with the improved logging.
> Thanks a lot for your time
> with best regards
> ahmed
> 
> -------- Original Message --------
>> Date: Wed, 20 Feb 2008 18:59:01 +0100
>> From: Rainer Jung <ra...@kippdata.de>
>> To: Tomcat Users List <us...@tomcat.apache.org>
>> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> 
>> Ahmed Musa wrote:
>>> Hello,
>>> Wow - thank you very much, Rainer, for your very quick and informative
>>> answer.
>>> I will go to 1.2.26 and think about some "smoother" values for
>>> reply_timeout and max_reply_timeouts.
>>> I will search for the requests which cause the problems - because I
>>> already log the response time in the way you mentioned - but I am not
>>> sure that the user requests are responsible for the situation.
>>
>> One note: for Apache httpd 2.x %D is microseconds (there is no format
>> for milliseconds), for Tomcat %D is milliseconds. As long as you are
>> searching for the root cause, it might make sense to have both access
>> logs active to check for duration differences.
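
Because the two access logs use different units, the durations have to be normalized before they can be compared. A minimal sketch in Python, assuming the httpd %D value (microseconds) and the Tomcat %D value (milliseconds) for the same request have already been extracted from the logs; the function name and the sample numbers are made up for illustration:

```python
def proxy_overhead_ms(httpd_us: int, tomcat_ms: int) -> float:
    """Return the time in milliseconds spent outside Tomcat for one request.

    httpd_us  -- duration from the httpd access log (%D, microseconds)
    tomcat_ms -- duration from the Tomcat access log (%D, milliseconds)
    """
    return httpd_us / 1000 - tomcat_ms


# Hypothetical request: httpd measured 120.45 s, Tomcat measured 120 s.
print(proxy_overhead_ms(120_450_000, 120_000))  # 450.0
```

A large difference would hint at time lost between httpd and Tomcat (network, firewall, connection setup) rather than inside the webapp.
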
>>
>>> One further question - does mod_jk itself check whether the backend is
>>> reachable, without user requests?
>>
>> No. Everything only works on top of user requests.
>>
>>> When there are connections to the backend - are they closed after the
>>> response, or are they held open for further requests?
>>
>> In general they are held open. There are parameters for how long they are
>> held open without more requests before they get shut down, and also for
>> how many might be kept open even when no requests are coming in. Those are
>> the connection pool parameters, which you will find on
>>
>> http://tomcat.apache.org/connectors-doc/reference/workers.html
>>
>> Tomcat also has a connectionTimeout on the connector, which will shut
>> down a connection from the Tomcat side if it is idle for too long.
>>
>> If you don't want to reuse connections at all, there's also a setting (a
>> JkOption in Apache).
>>
>>> Is it possible that the Checkpoint firewall in between is responsible
>>> for the connectivity problem?
>>
>> It can cut a connection that's idle for too long. Since you have
>> cping/cpong active via connect_timeout and prepost_timeout, you should
>> get a cping error message if the connection was dropped by the firewall
>> during idle times and mod_jk tries to use it again. The reply timeout in
>> the error log indicates that the backend isn't answering. Of course, if
>> it takes *very* long to answer, it might be that the firewall dropped
>> the connection in between, but then the root cause would still be the
>> long response time of the backend.
>>
>>> Another point is the "not recovering" of the worker. Yes, you are right
>>> - in this situation I have many reply timeouts - but these happen within
>>> a period of time, for example 30 minutes, and the worker is still dead
>>> even when there are no more reply timeouts. It remains dead.
>>> It was necessary to restart it manually via jkstatus.
>>
>> I assume you are using stickiness, so when a session started on a node,
>> it will stay there. So when a worker is in error for a long time, all
>> new sessions will start on other nodes. If the worker is ready for
>> recovery, it needs a request that doesn't carry a session, so that it
>> gets probed with this request.
>>
>> In jkstatus, the status of an error worker should switch to REC when
>> mod_jk decides that it could send a non-sticky request there (to probe),
>> to PRB during the time this request is on the node, and finally
>> either to OK or back to ERR depending on the result of the request.
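
The status sequence described above can be pictured as a small state machine. The following Python sketch is only a simplified model of that description, not mod_jk's actual implementation; the state names come from jkstatus as quoted above, while the event names are hypothetical:

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    ERR = "ERR"
    REC = "REC"
    PRB = "PRB"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.ERR, "ready_for_recovery"): State.REC,
    (State.REC, "non_sticky_request_sent"): State.PRB,
    (State.PRB, "request_succeeded"): State.OK,
    (State.PRB, "request_failed"): State.ERR,
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = State.ERR
for event in ("ready_for_recovery", "non_sticky_request_sent", "request_succeeded"):
    state = step(state, event)
print(state.value)  # OK
```

In this model a worker that only ever sees sticky requests never leaves REC, which matches the point that recovery needs a request without a session.
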
>>
>> You can log the number of errors (and accesses) that happened on the
>> node in the httpd access log. If you think that the node simply stays in
>> error for a long time, then the error count (and access count) should
>> stay constant. I would expect that they do not.
>>
>> Have a look at how LogFormat in Apache httpd works, and then add some of
>> those documented in
>>
>> http://tomcat.apache.org/connectors-doc/reference/apache.html
>>
>> like:
>>
>> JK_LB_LAST_NAME
>> JK_LB_LAST_ACCESSED
>> JK_LB_LAST_ERRORS
>> JK_LB_LAST_BUSY
>> JK_LB_LAST_STATE
>>
>> using the syntax %{JK_LB_LAST_STATE}n etc.
>>
>>> Another point is the learning - I read the docs, the information on the
>>> Apache website; I don't find other ones - are there other ones? - and
>>> they do not go into depth. If you read the spec and watch the logs it is
>>> - for me - very hard to match things up. I understand the many ways
>>> mod_jk has to check whether there is a connection to the backend, but
>>> checking the reality in an error situation is very hard. By matching I
>>> mean "which part of the communication sequence failed, why, and which
>>> error message does it cause".
>>> But I will try - and also study the mailing list.
>>
>> It's hard for us too (sometimes).
>>
>>> Thank you for your time - tomorrow we will have the new version and will
>>> see what happens.
>>> best
>>> ahmed
>>
>> Regards,
>>
>> Rainer
>>
>>> -------- Original Message --------
>>>> Date: Wed, 20 Feb 2008 15:56:42 +0100
>>>> From: Rainer Jung <ra...@kippdata.de>
>>>> To: Tomcat Users List <us...@tomcat.apache.org>
>>>> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
>>>>
>>>> samk@twinix.com wrote:
>>>>> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on behalf of a User
>>>>>
>>>>> Hello all, After long and unsuccessful research, I hope someone can
>>>>> give me a hint about the following problems.
>>>>>
>>>>> Our Apache/mod_jk/Tomcat infrastructure ran without problems
>>>>> for about one year; then, two months ago, mod_jk errors started to
>>>>> occur. We upgraded mod_jk and made improvements to
>>>>> worker.properties. The problems changed and became less frequent, but
>>>>> they still appear from time to time.
>>>>>
>>>>> It seems that the mod_jk workers lose the connection to their
>>>>> Tomcat backend servers; messages in the mod_jk log files
>>>>> point in this direction. Normally this does not seem to be a big
>>>>> problem, but under certain conditions (which ones?) a worker goes into
>>>>> an error state and cannot recover by itself; recovery has to be done
>>>>> manually.
>>>>>
>>>>> Problem 1: The Tomcats are reachable, so it is unknown why the workers
>>>>> think the server is dead.
>>>>> Problem 2: I have no idea why the worker goes into an error state and
>>>>> cannot recover.
>>>>
>>>> 2 is a consequence of 1
>>>>
>>>>> Problem 3: I am missing explanations of the logged messages. I read the
>>>>> messages but cannot match them to the situation. When does a worker log
>>>>> these messages?
>>>>
>>>> 1 is a consequence of these messages
>>>>
>>>>> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
>>>>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
>>>>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
>>>>> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.
>>>>
>>>> The second line tells us that your configured reply_timeout fired.
>>>> You set it to 120000 (2 minutes), so there are requests taking longer
>>>> than 2 minutes on the backend before the first response packet comes
>>>> back from the backend.
>>>>
>>>> With your configuration mod_jk then doesn't wait any longer for the
>>>> reply *and puts the backend into error mode*.
>>>>
>>>> Up until version 1.2.25, if you use a reply_timeout, you need to set it
>>>> to a high number which justifies the reasoning "if it takes that long,
>>>> then something is wrong with the backend".
>>>>
>>>> Reality shows: there is no such number. Often there are a few requests
>>>> that take unacceptably long on the backend *although* the backend is
>>>> still working.
>>>>
>>>> So in 1.2.25 we added max_reply_timeouts. With this set in addition to
>>>> reply_timeout, mod_jk will abort waiting for a reply after
>>>> reply_timeout, but allow some timeouts before actually deciding to put
>>>> the backend into error.
>>>>
>>>> Unfortunately the implementation of max_reply_timeouts in 1.2.25 was
>>>> wrong, so you need to go to 1.2.26 to get it working right.
>>>>
>>>> See:
>>>>
>>>> http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
>>>>
>>>> Caution: this does *not* explain why the backends are not automatically
>>>> recovered after a minute of error condition. Maybe you have times where
>>>> you get too many of those reply timeouts (see log file), and although we
>>>> recover after a minute the backend almost immediately goes back into
>>>> error status.
>>>>
>>>>> -> Which timeout? How does mod_jk decide that Tomcat is down? Where can
>>>>> I find details on errno=110?
>>>>
>>>> reply_timeout, see above and also
>>>>
>>>> http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
>>>>
>>>> errno: a standard Unix feature. The numbers are platform dependent. I
>>>> would assume in your case
>>>>
>>>> ETIMEDOUT       110     /* Connection timed out */
>>>>
>>>> so no wonder, that's exactly what we expect (and it doesn't tell us the
>>>> reason, i.e. what's wrong on the *backend* that makes it take that long
>>>> for a response).
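
Since errno numbers are platform dependent, a quick way to look one up is to ask the platform itself. A minimal Python sketch using only the standard `errno` and `os` modules; it has to run on the same operating system as the web server (here Linux) to give a meaningful answer, and the helper name is made up for illustration:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# The value from the mod_jk log line. The result depends on the platform
# the script runs on, so no fixed output is shown here.
print(describe_errno(110))
```

On Linux the same mapping can also be found in the system's errno header files.
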
>>>>
>>>>> -> "receiving reply from tomcat failed with out recovery in send loop
>>>>> attempt=0": what does "with out recovery in send loop" mean?
>>>>
>>>> That your configuration doesn't allow us to send the request to another
>>>> backend. recovery_options 7 includes: if mod_jk was able to send the
>>>> request to a backend, do not try to send it to another backend in case
>>>> of an error during the response handling. Even if you allowed
>>>> sending to another backend, it would not help with *not* putting the
>>>> worker into error state. More likely you would put all
>>>> workers into error state, because all of them might run into the same
>>>> timeout, one after the other.
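
recovery_options is a bit mask, so the value 7 combines several flags. The sketch below decodes such a value in Python. The flag descriptions are paraphrased from the workers.properties reference as far as it is remembered here and are an assumption; they should be checked against the documentation of the mod_jk version in use:

```python
# Bit -> meaning. Descriptions are paraphrased assumptions, not quoted
# from mod_jk; verify them against the workers.properties reference.
RECOVERY_BITS = {
    1: "do not recover if Tomcat failed after the request was sent to it",
    2: "do not recover if Tomcat failed after headers were sent to the client",
    4: "close the backend connection on an error while writing to the client",
}


def decode_recovery_options(value: int) -> list:
    """Return the descriptions of all known bits set in value."""
    return [text for bit, text in RECOVERY_BITS.items() if value & bit]


for text in decode_recovery_options(7):
    print("-", text)
```

With the value 7 from the posted worker.properties, all three flags listed here are set, including the one that prevents a retry on another backend.
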
>>>>
>>>>> -> "unrecoverable error 504": are there details on this error?
>>>>
>>>> That's simply how we return the situation back to the client (browser).
>>>>
>>>>> OK, I turned the logging level to debug. The course of events becomes
>>>>> clearer, but more questions appear. There are socket numbers: which
>>>>> sockets are these, and what do the numbers mean, e.g. "will be shutting
>>>>> down socket 35 for worker INETP1021"? What are the sockets used for?
>>>>> How many are there per worker? Can I configure them?
>>>>
>>>> Should not be the problem here. For Apache httpd, if you do *not*
>>>> configure anything, we automatically choose the number of httpd threads
>>>> as the maximum number of connections. No need to change anything here.
>>>>
>>>>> => In general, how can I solve such problems? I tried to look into
>>>>> the mod_jk code, searching for error codes and error messages, but
>>>>> could not find any relevant information. I am studying the log files,
>>>>> but cannot find out what really happens.
>>>>
>>>> Post to the list. Improve our docs.
>>>>
>>>> The error message contains the words "timeout" and "reply" and you have
>>>> a "reply_timeout".
>>>>
>>>> Long running requests are a frequent problem. If you want to get rid of
>>>> them, start by adding response times to your httpd and your Tomcat
>>>> access log format (%D). Then have a look at which URLs are producing
>>>> long running requests, at what time of day they are happening etc. This
>>>> might give you a clue about the reasons.
>>>>
>>>> And if they are very frequent: do Java thread dumps of your backends and
>>>> analyze them.
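
To see how often and on which workers the reply timeout fires, the mod_jk log itself can be summarized. A rough Python sketch, assuming the log line layout shown in the original post; the regular expressions and the function name are illustrative and may need adjusting for other mod_jk versions or log settings:

```python
import re
from collections import Counter

# Layout assumed from the posted log lines:
# [timestamp] [pid:tid] [level] function::file (line): message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<pid>\d+):(?P<tid>\d+)\] \[(?P<level>\w+)\] "
    r"(?P<func>\w+)::(?P<file>\S+) \((?P<lineno>\d+)\): (?P<msg>.*)$"
)
# Messages about a member worker start with "(WORKERNAME) ".
WORKER_RE = re.compile(r"^\((?P<worker>[^)]+)\) ")


def count_reply_timeouts(lines):
    """Count 'Timeout with waiting reply' errors per worker name."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m["level"] != "error":
            continue
        w = WORKER_RE.match(m["msg"])
        if w and "Timeout with waiting reply" in m["msg"]:
            counts[w["worker"]] += 1
    return counts


SAMPLE = [
    "[Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] "
    "jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi",
    "[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] "
    "ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting "
    "reply from tomcat. Tomcat is down, stopped or network problems (errno=110)",
    "[Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] "
    "service::jk_lb_worker.c (1105): unrecoverable error 504, request failed.",
]
print(dict(count_reply_timeouts(SAMPLE)))  # {'INETP1011': 1}
```

Grouping the same matches by hour instead of by worker would show whether the timeouts cluster at particular times of day.
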
>>>>
>>>>> So maybe someone has an idea why the worker thinks that the
>>>>> corresponding Tomcat is dead, and why it does not recover by itself.
>>>>
>>>> Tomcat is dead: from the point of view of mod_jk it simply means: we
>>>> didn't get an answer when we expected one. Details depend on the
>>>> additional log lines (could not connect, reply timeout etc.).
>>>>
>>>>> I am also looking for tips on how I can help myself, and on where to
>>>>> find something about the error codes and messages in mod_jk.
>>>>>
>>>>> thanks for your attention
>>>>> Best
>>>>> ahmed musa (writing from vienna)
>>>>>
>>>> Regards,
>>>>
>>>> Rainer
>>>>


Re: mod_jk Problems - - worker went to error state and dont recover

Posted by Rainer Jung <ra...@kippdata.de>.
See the footer of any mail on the list:

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


luke.walshe@bt.com wrote:
> All
> 
> Apologies, this is unrelated. How do I unsubscribe from this mailing
> list, I thought it would be useful and small but its overwhelming my
> inbox?
> 
> Thanks in Advance.
> 
> Luke Walshe
> BT Operate, HGIPCC Technical Specialist
> Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 
> 
> -----Original Message-----
> From: Ahmed Musa [mailto:donald1090@gmx.at] 
> Sent: 21 February 2008 09:25
> To: Tomcat Users List
> Subject: Re: mod_jk Problems - - worker went to error state and dont
> recover
> 
> Hello Rainer,
> Thanks for your informations - the Situation gets more clear now.
> I will read again some dics - following your links and will make further
> tests also with the improved logging.
> Thanks a lot for your time
> with best regards 
> ahmed
> 
> -------- Original-Nachricht --------
>> Datum: Wed, 20 Feb 2008 18:59:01 +0100
>> Von: Rainer Jung <ra...@kippdata.de>
>> An: Tomcat Users List <us...@tomcat.apache.org>
>> Betreff: Re: mod_jk Problems - - worker went to error state and dont
> recover
> 
>> Ahmed Musa wrote:
>>> Hello,
>>> Wow -thank you very much Rainer for your very quick and informative
>> answer.
>>> I will go to 1.2.26 and think about some "smoother" Values for
>> reply_timeout and max_reply_timeouts.
>>> I will search for the requests which causes the Problems - becasue i
>> still log the response time in your mentioned way - but I am not sure
> that the
>> Userrequests are responsible for the Situation. 
>>
>> One note: for Apache httpd 2.x %d is microseconds (there is no format 
>> for milliseconds), for Tomcat %D is milliseconds. As long as you are 
>> searching for the root cause, it might make sense to have both access 
>> logs active to check about duration differences.
>>
>>> So one further question - does mod_jk itself checks if the Backend
> is
>> reachable - without userrequests? 
>>
>> No. Everything only works on top of user requests.
>>
>>> When there are connections to the Backend - are they closed after
> the
>> respone or are the hold open for further requests.
>>
>> In general hold open. There are parameters on how long they are held 
>> open without more requests before they get shut down, and also how
> many 
>> might be kept open even when no requests are coming in. Those are the 
>> connection pool parameters, which you will find on
>>
>> http://tomcat.apache.org/connectors-doc/reference/workers.html
>>
>> Tomcat also has a connectionTimeout on the connector, which will shut 
>> down a connection from the Tomcat side if it is idle for to long.
>>
>> If you don't want to reuse connections at all, there's also a setting
> (a 
>> JkOption in Apache).
>>
>>> Is it possible that the Checkpoint Firewall in Between can be
>> responsible for the connectivity problem?
>>
>> It can cut a connection that's idle for too long. Since you have 
>> cping/cpong active via connect_timeout and prepost_timeout, you should
> 
>> get a cping error message, if the connection was dropped by the
> firewall 
>> during idle times and mod_jk tries to use it again. The reply timeout
> in 
>> the error log indicates, that the backend isn't answering. Of course
> if 
>> it takes *very* long to answer, it might be that the firewall dropped 
>> the connection in between, but then the root cause would still be the 
>> long response time of the backend.
>>
>>> Another point is the "not recovering" of the worker. Yes, you are
> right
>> - in this situation i have many reply_timeouts - but these happens in
> a
>> period of time - for example 30 minutes - but the worker is still dead
> even
>> then when there are no more reply_timeouts. It remains dead.
>>> It was necessary to restart it manually via jkstatus.
>> I assume you are using stickyness, so when a session started on a
> node, 
>> it will stay there. So when a worker is in error for a long time, all 
>> new sessions will start on other nodes. If the worker is ready for 
>> recovery, it needs a request, that doesn't carry a session to get
> probed 
>> with this request.
>>
>> In jkstatus, the status of an error worker should switch to REC, when 
>> mod_jk decides that it could send a non-sticky request there (to
> probe) 
>> and to PRB, during the time this request is on the node, and finally 
>> either to OK or back to ERR depending on the result of the request.
>>
>> You can log the number of errors (and accesses) that happened on the 
>> node in the httpd access log. If you think that the node simply stays
> in 
>> error for a long time, then the error count (and access count) should 
>> stay constant. I would expect, that they do not.
>>
>> Have a look at how LogFormat in Apache httpd works, and then add some
> of 
>> those documented in
>>
>> http://tomcat.apache.org/connectors-doc/reference/apache.html
>>
>> like:
>>
>> JK_LB_LAST_NAME
>> JK_LB_LAST_ACCESSED
>> JK_LB_LAST_ERRORS
>> JK_LB_LAST_BUSY
>> JK_LB_LAST_STATE
>>
>> using the syntax %{JK_LB_LAST_STATE}n etc.
>>
>>> Another point is the learning - i read the dics - the infos on the
>> apache Website i dont't find other ones - are there other ones ? - and
> they are
>> not going in depth - if you read the spec and watch the logs it is -
> for me
>> - very hard to match the things. Also the many possibilities that
> mod_jk
>> has to prove if there is a connection to the Backend,... - i
> understand them
>> but check the reality in an error situation is very hard. Under
> matching i
>> mean "Which Part of the Communication sequence failed - why - and
> causes
>> which error message".
>>> But i will try - and study also the mailing list..
>> It's hard for us too (sometimes).
>>
>>> Thank you for your time - tomorrow we will have the new version and
> will
>> see what happens.
>>> best
>>> ahmed
>>
>> Regards,
>>
>> Rainer
>>
>>> -------- Original Message --------
>>>> Date: Wed, 20 Feb 2008 15:56:42 +0100
>>>> From: Rainer Jung <ra...@kippdata.de>
>>>> To: Tomcat Users List <us...@tomcat.apache.org>
>>>> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
>>>> samk@twinix.com wrote:
>>>>> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on
>>>>> behalf of a User
>>>>> Hello all, after long and unsuccessful research I hope someone can
>>>>> give me a hint about the following problems.
>>>>>
>>>>> Our Apache/mod_jk/Tomcat infrastructure ran without problems for about
>>>>> one year; for the past two months mod_jk errors have been occurring.
>>>>> We upgraded mod_jk and improved worker.properties. The problems changed
>>>>> and became less frequent, but they still appear from time to time.
>>>>>
>>>>> It seems that the mod_jk workers lose the connection to their Tomcat
>>>>> backend servers; there are messages in the mod_jk log files which point
>>>>> in this direction. Normally this does not seem to be a big problem, but
>>>>> under certain conditions (which?) a worker goes into an error state and
>>>>> cannot recover by itself; it has to be recovered manually.
>>>>>
>>>>> Problem 1: The Tomcats are reachable, so why do the workers think the
>>>>> server is dead?
>>>>> Problem 2: I have no idea why the worker goes into an error state and
>>>>> cannot recover.
>>>>
>>>> 2 is a consequence of 1
>>>>
>>>>> Problem 3: I am missing explanations of the logged messages. I read
>>>>> the messages but cannot match them to the situation: when does a worker
>>>>> log these messages?
>>>>
>>>> 1 is a consequence of these messages
>>>>
>>>>> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
>>>>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
>>>>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
>>>>> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.
>>>>
>>>> The second line tells us that your configured reply_timeout fired.
>>>> You set it to 120000 (2 minutes), so there are requests taking longer
>>>> than 2 minutes on the backend before the first response packet comes
>>>> back from the backend.
>>>>
>>>> With your configuration, mod_jk then doesn't wait any longer for the
>>>> reply *and puts the backend into error mode*.
>>>>
>>>> Up until version 1.2.25, if you use a reply_timeout, you need to set it
>>>> to a high number which justifies the reasoning "if it takes that long,
>>>> then something is wrong with the backend".
>>>>
>>>> Reality shows: there is no such number. Often there are a few requests
>>>> that take unacceptably long on the backend *although* the backend is
>>>> still working.
>>>>
>>>> So in 1.2.25 we added max_reply_timeouts. With this set in addition to
>>>> reply_timeout, mod_jk will abort waiting for a reply after
>>>> reply_timeout, but allow some timeouts before actually deciding to put
>>>> the backend into error.
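
The difference between the two settings can be illustrated with a small Python model. It is a deliberately simplified sketch, not mod_jk's implementation: the class and method names are made up, the counter is never aged or reset, and whether the limit is inclusive is an assumption.

```python
class ReplyTimeoutTracker:
    """Simplified illustration of tolerating some reply timeouts."""

    def __init__(self, max_reply_timeouts=0):
        self.max_reply_timeouts = max_reply_timeouts
        self.timeouts = 0
        self.in_error = False

    def on_reply_timeout(self):
        # Every timeout still aborts the request that was waiting ...
        self.timeouts += 1
        # ... but the backend is only marked as failed once the
        # tolerated number of timeouts has been exceeded.
        if self.timeouts > self.max_reply_timeouts:
            self.in_error = True
        return self.in_error
```

With max_reply_timeouts=0 the first timeout already puts the backend into error, which corresponds to the behaviour described above for a plain reply_timeout. With a value such as 6, isolated slow requests no longer take the whole backend out of rotation.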
>>>>
>>>> Unfortunately the implementation of max_reply_timeouts in 1.2.25 was
>>>> wrong, so you need to go to 1.2.26 to get it working right.
>>>>
>>>> See:
>>>>
>>>> http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
>>>>
>>>> Caution: this does *not* explain why the backends are not automatically
>>>> recovered after a minute in the error condition. Maybe there are times
>>>> when you get too many of those reply timeouts (see log file), and
>>>> although we recover after a minute, the backend almost immediately goes
>>>> back into error status.
>>>>
>>>>> -> Which timeout? How does mod_jk decide that Tomcat is down? Where can
>>>>> I find details about errno=110?
>>>>
>>>> reply_timeout, see above and also
>>>>
>>>> http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
>>>>
>>>> errno: a standard Unix feature. The numbers are platform dependent. I
>>>> would assume in your case
>>>>
>>>> ETIMEDOUT       110     /* Connection timed out */
>>>>
>>>> so no wonder, that's exactly what we expect (and it doesn't tell us the
>>>> reason, i.e. what's wrong on the *backend* that makes it take so long to
>>>> respond).
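
Because the numbers are platform dependent, one way to look up an errno value is to ask the system it was logged on. The following Python sketch uses only the standard errno and os modules; the helper name describe_errno is made up for this example.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 110 is the value from the log line above. Run this on the web server
# itself, since the mapping differs between operating systems.
print(describe_errno(110))
```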
>>>>
>>>>> -> "receiving reply from tomcat failed with out recovery in send loop
>>>>> attempt=0": what does "with out recovery in send loop" mean?
>>>>
>>>> That your configuration doesn't allow us to send the request to another
>>>> backend. recovery_options=7 includes: if mod_jk was able to send the
>>>> request to a backend, do not try to send it to another backend in case
>>>> of an error during the response handling. Even if you allowed sending
>>>> to another backend, it would not help with *not* putting the worker
>>>> into error state. More likely you would put all workers into error
>>>> state, because all of them might run into the same timeout, one after
>>>> the other.
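
recovery_options is a bit mask, so 7 is the sum 1 + 2 + 4. A short Python sketch can spell out which flags a value contains. The flag descriptions below are paraphrased from memory of the workers.properties reference and should be checked against the documentation for the installed version; the function name is hypothetical.

```python
# Bit values and their meaning (paraphrased, verify against the mod_jk docs).
RECOVERY_FLAGS = {
    1: "do not recover if Tomcat failed after receiving the request",
    2: "do not recover if Tomcat failed after sending headers to the client",
    4: "close the connection to Tomcat on an error while writing the answer back to the client",
    8: "always recover HEAD requests",
    16: "always recover GET requests",
}


def decode_recovery_options(value):
    """List the descriptions of all flags set in a recovery_options value."""
    return [text for bit, text in RECOVERY_FLAGS.items() if value & bit]


for line in decode_recovery_options(7):
    print(line)
```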
>>>>
>>>>> -> unrecoverable error 504 - details about this error?
>>>> That's simply how we report the situation back to the client (browser).
>>>>> OK, I turned the logging level to debug. The course of events gets
>>>>> clearer, but more questions appear. There are socket numbers: which
>>>>> sockets are these, and what do the numbers mean, e.g. "will be shutting
>>>>> down socket 35 for worker INETP1021"? What are the sockets used for, how
>>>>> many are there per worker, and can I configure them?
>>>> Should not be the problem here. For Apache httpd, if you do *not*
>>>> configure anything, we automatically choose the number of httpd threads
>>>> as the maximum number of connections. No need to change anything here.
>>>>> => Generally: how can I solve such problems? I tried to look into the
>>>>> mod_jk code, searching for error codes and error messages, but could
>>>>> not find any relevant information. I am studying the log files but
>>>>> can't find out what really happens.
>>>> Post to the list. Improve our docs.
>>>>
>>>> The error message contains the words "timeout" and "reply", and you have
>>>> a "reply_timeout".
>>>>
>>>> Long-running requests are a frequent problem. If you want to get rid of
>>>> them, start by adding response times to your httpd and your Tomcat
>>>> access log formats (%D). Then have a look at which URLs are producing
>>>> long-running requests, at what time of day they are happening, etc. This
>>>> might give you a clue about the reasons.
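
Once %D is in the access log, a small script can pull out the slow requests. The sketch below assumes a hypothetical LogFormat of %h %l %u %t "%r" %>s %b %D, i.e. the common format with the duration appended as the last field; the function name and the 10-second threshold are illustrative choices, not recommendations.

```python
import re

# Assumed LogFormat: %h %l %u %t "%r" %>s %b %D  (duration is the last field)
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" \d{3} \S+ (?P<duration>\d+)$'
)


def slow_requests(lines, threshold_seconds=10.0, units_per_second=1_000_000):
    """Yield (url, seconds) for requests slower than the threshold.

    units_per_second is 1_000_000 for httpd's %D (microseconds); use 1000
    for a Tomcat access log that records %D in milliseconds.
    """
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        seconds = int(match.group("duration")) / units_per_second
        if seconds > threshold_seconds:
            yield match.group("url"), seconds
```

It could be used as `for url, secs in slow_requests(open("access_log")): print(url, secs)`, where the file name is a placeholder for the real log path.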
>>>>
>>>> And if they are very frequent: take Java thread dumps of your backends
>>>> and analyze them.
>>>>
>>>>> So maybe someone has an idea why the worker thinks that the
>>>>> corresponding Tomcat is dead, and why it will not recover by itself!
>>>> Tomcat is dead: from the point of view of mod_jk it simply means that we
>>>> didn't get an answer when we expected one. Details depend on the
>>>> additional log lines (could not connect, reply timeout etc.).
>>>>
>>>>> I am also looking for tips on how I can help myself, and where to
>>>>> find something about the error codes and messages in mod_jk.
>>>>>
>>>>> thanks for your attention
>>>>> Best
>>>>> ahmed musa (writing from vienna)
>>>>>
>>>> Regards,
>>>>
>>>> Rainer
>>>>
>>>>> Current Infrastructure
>>>>> We have 3 Apache web servers (2.2.6) based on CentOS release 4.3, kernel
>>>>> version 2.6.9-34.
>>>>> In front of the web servers there are two hardware load balancers (at
>>>>> two locations), but they play no role in this story.
>>>>> The web servers are hosted at our ISP.
>>>>>
>>>>> The web servers balance the requests via mod_jk (version 1.2.25) for
>>>>> approx. 10 webapps to 18 backend Tomcat servers (blade servers; because
>>>>> of underlying application parts the OS is Windows 2003 Server, a long
>>>>> story not worth explaining :-) ). The Tomcat servers get their data via
>>>>> requests against DB2 servers/DB2 databases on the mainframe. The Tomcat
>>>>> servers are in-house and were rebooted nightly because of automated
>>>>> deployment processes.
>>>>>
>>>>> Between the web servers and the Tomcat servers is a Checkpoint firewall.
>>>>> All webapps are deployed on all Tomcats; only mod_jk routes the
>>>>> requests to certain Tomcat instances.
>>>>> (On one blade server there are two identical Tomcat instances
>>>>> running.)
>>>>>
>>>>> Versions: Tomcat 5.5.17_11, JDK 1.5.0_11-b03. The requests against
>>>>> the public website(s) are normal short-lived requests, and not many of
>>>>> them. Most webapps (portals) need a login and have a strong focus on
>>>>> business logic, so the instances are big (many MBs in RAM), the sessions
>>>>> are sticky and the session timeout is 20 minutes. But these also get
>>>>> fewer requests. In addition to the user requests there are monitoring
>>>>> requests from our ISP.
>>>>> The problems appear on servers/portals with very few user requests.
>>>>> worker.properties
>>>>> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
>>>>>
>>>>> worker.template.type=ajp13
>>>>> worker.template.lbfactor=5
>>>>> worker.template.socket_keepalive=1
>>>>> worker.template.connect_timeout=7000
>>>>> worker.template.prepost_timeout=5000
>>>>> worker.template.reply_timeout=120000
>>>>> worker.template.retries=6
>>>>> worker.template.activation=Active
>>>>> worker.template.recovery_options=7
>>>>>
>>>>> worker.lbtemplate.type=lb
>>>>> worker.lbtemplate.max_reply_timeouts=6
>>>>> worker.lbtemplate.method=Session
>>>>>
>>>>> #Produktions Worker
>>>>> # AS-INETP101 - 106 - 6/6 GGI
>>>>> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
>>>>> worker.INETP1011.port=65001
>>>>> worker.INETP1011.reference=worker.template
>>>>>
>>>>> ....many more of the same
>>>>>
>>>>> then
>>>>>
>>>>> worker.ajp_ad.reference=worker.lbtemplate
>>>>> worker.ajp_ad.balance_workers=INETP1032,INETP1062
>>>>>
>>>>> .... many more portals
>>>>>
>>>>> and finally jkstatus
>>>>>
>>>>> The JKMount is very simple
>>>>> JkMount /* ajp_ad    --- for the other portals mostly the same
>>>>>
>>>>> The Portals are Virtual Hosts on the Apache.
>>>>>
>>>>> Tomcat - server.xml
>>>>> example
>>>>> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
>>>>>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
>>>>> ......
>>>>> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
>>>>>       autoDeploy="false" deployOnStartup="false" xmlValidation="false"
>>>>>       xmlNamespaceAware="false">
>>>>>         <Alias>www.slfinsol.com</Alias>
>>>>>         <Alias>web1.slfinsol.com</Alias>
>>>>>         ...
>>>>>         <Alias>testweb.slfinsol.com</Alias>
>>>>>         .....
>>>>>         <Valve className="org.apache.catalina.valves.AccessLogValve"
>>>>>                directory="logs" prefix="swl_access_log." suffix=".txt"
>>>>>                pattern="common" resolveHosts="false" />
>>>>>         <Valve className="at.allianz.tomcat.valve.RequestTimeValve"/>
>>>>>         <Valve className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
>>>>>         <Context path="" docBase="swl" />
>>>>>         <Context path="/monitor5" docBase="monitor" />
>>>>>         <Context path="/swl" docBase="swl" />
>>>>>       </Host>

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: RE: mod_jk Problems - - worker went to error state and dont recover

Posted by Ahmed Musa <do...@gmx.at>.
Hello Luke,

Here is the information from tomcat.apache.org:

Unsubscription: Send a blank email to users-unsubscribe@tomcat.apache.org
Digest unsubscription: Send a blank email to users-digest-unsubscribe@tomcat.apache.org

best ahmed

-------- Original Message --------
> Date: Thu, 21 Feb 2008 09:27:31 -0000
> From: luke.walshe@bt.com
> To: users@tomcat.apache.org
> Subject: RE: mod_jk Problems - - worker went to error state and dont recover

> All
> 
> Apologies, this is unrelated. How do I unsubscribe from this mailing
> list? I thought it would be useful and small, but it's overwhelming my
> inbox.
> 
> Thanks in Advance.
> 
> Luke Walshe
> BT Operate, HGIPCC Technical Specialist
> Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 
> 
> -----Original Message-----
> From: Ahmed Musa [mailto:donald1090@gmx.at] 
> Sent: 21 February 2008 09:25
> To: Tomcat Users List
> Subject: Re: mod_jk Problems - - worker went to error state and dont
> recover
> 
> Hello Rainer,
> Thanks for your information - the situation is getting clearer now.
> I will read some of the docs again, following your links, and will run
> further tests, also with the improved logging.
> Thanks a lot for your time
> with best regards 
> ahmed
> 
> -------- Original Message --------
> > Date: Wed, 20 Feb 2008 18:59:01 +0100
> > From: Rainer Jung <ra...@kippdata.de>
> > To: Tomcat Users List <us...@tomcat.apache.org>
> > Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> 
> > Ahmed Musa wrote:
> > > Hello,
> > > Wow - thank you very much, Rainer, for your very quick and informative
> > > answer.
> > > I will go to 1.2.26 and think about some "smoother" values for
> > > reply_timeout and max_reply_timeouts.
> > > I will search for the requests which cause the problems, because I
> > > already log the response time in the way you mentioned, but I am not
> > > sure that the user requests are responsible for the situation.
> > 
> > One note: for Apache httpd 2.x %D is microseconds (there is no format
> > for milliseconds), for Tomcat %D is milliseconds. As long as you are
> > searching for the root cause, it might make sense to have both access
> > logs active to check for differences in duration.
> > 
> > > So one further question: does mod_jk itself check whether the backend
> > > is reachable, without user requests?
> > 
> > No. Everything only works on top of user requests.
> > 
> > > When there are connections to the backend, are they closed after the
> > > response, or are they held open for further requests?
> > 
> > In general they are held open. There are parameters for how long they
> > are held open without more requests before they get shut down, and also
> > how many might be kept open even when no requests are coming in. Those
> > are the connection pool parameters, which you will find on
> > 
> > http://tomcat.apache.org/connectors-doc/reference/workers.html
> > 
> > Tomcat also has a connectionTimeout on the connector, which will shut
> > down a connection from the Tomcat side if it is idle for too long.
> > 
> > If you don't want to reuse connections at all, there's also a setting (a
> > JkOption in Apache).
> > 
> > > Is it possible that the Checkpoint firewall in between is
> > > responsible for the connectivity problem?
> > 
> > It can cut a connection that's idle for too long. Since you have
> > cping/cpong active via connect_timeout and prepost_timeout, you should
> > get a cping error message if the connection was dropped by the firewall
> > during idle times and mod_jk tries to use it again. The reply timeout in
> > the error log indicates that the backend isn't answering. Of course, if
> > it takes *very* long to answer, it might be that the firewall dropped
> > the connection in between, but then the root cause would still be the
> > long response time of the backend.
> > 
> > > Another point is the "not recovering" of the worker. Yes, you are
> > > right: in this situation I have many reply timeouts, but these happen
> > > within a period of time, for example 30 minutes, and the worker is
> > > still dead even when there are no more reply timeouts. It remains dead.
> > > It was necessary to restart it manually via jkstatus.
> > 
> > 



RE: mod_jk Problems - - worker went to error state and dont recover

Posted by lu...@bt.com.
All

Apologies, this is unrelated. How do I unsubscribe from this mailing
list? I thought it would be useful and small, but it's overwhelming my
inbox.

Thanks in Advance.

Luke Walshe
BT Operate, HGIPCC Technical Specialist
Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 


Re: mod_jk Problems - - worker went to error state and dont recover

Posted by Ahmed Musa <do...@gmx.at>.
Hello Rainer,
Thanks for the information; the situation is clearer now.
I will reread some of the docs, follow your links, and run further tests with the improved logging.
Thanks a lot for your time
with best regards 
ahmed

-------- Original Message --------
> Date: Wed, 20 Feb 2008 18:59:01 +0100
> From: Rainer Jung <ra...@kippdata.de>
> To: Tomcat Users List <us...@tomcat.apache.org>
> Subject: Re: mod_jk Problems - - worker went to error state and dont recover

> Ahmed Musa wrote:
> > Hello,
> > Wow, thank you very much, Rainer, for your very quick and informative
> > answer.
> > I will go to 1.2.26 and think about some "smoother" values for
> > reply_timeout and max_reply_timeouts.
> > I will search for the requests that cause the problems, because I
> > already log the response time in the way you mentioned, but I am not
> > sure that the user requests are responsible for the situation.
> 
> One note: for Apache httpd 2.x %D is microseconds (there is no format 
> for milliseconds), for Tomcat %D is milliseconds. As long as you are 
> searching for the root cause, it might make sense to have both access 
> logs active to check for duration differences.
> 
> > So one further question: does mod_jk itself check whether the backend
> > is reachable, without user requests?
> 
> No. Everything only works on top of user requests.
> 
> > When there are connections to the backend, are they closed after the
> > response or are they held open for further requests?
> 
> In general they are held open. There are parameters on how long they are held 
> open without more requests before they get shut down, and also how many 
> might be kept open even when no requests are coming in. Those are the 
> connection pool parameters, which you will find on
> 
> http://tomcat.apache.org/connectors-doc/reference/workers.html
> 
> Tomcat also has a connectionTimeout on the connector, which will shut 
> down a connection from the Tomcat side if it is idle for too long.
> 
> If you don't want to reuse connections at all, there's also a setting (a 
> JkOption in Apache).
> 
> > Is it possible that the Checkpoint firewall in between is
> > responsible for the connectivity problem?
> 
> It can cut a connection that's idle for too long. Since you have 
> cping/cpong active via connect_timeout and prepost_timeout, you should 
> get a cping error message, if the connection was dropped by the firewall 
> during idle times and mod_jk tries to use it again. The reply timeout in 
> the error log indicates, that the backend isn't answering. Of course if 
> it takes *very* long to answer, it might be that the firewall dropped 
> the connection in between, but then the root cause would still be the 
> long response time of the backend.
> 
> > Another point is the "not recovering" of the worker. Yes, you are right:
> > in this situation I have many reply_timeouts, but these happen within a
> > period of time, for example 30 minutes, and the worker is still dead
> > even when there are no more reply_timeouts. It remains dead.
> > It was necessary to restart it manually via jkstatus.
> 
> I assume you are using stickiness, so when a session is started on a node, 
> it will stay there. So when a worker is in error for a long time, all 
> new sessions will start on other nodes. Once the worker is ready for 
> recovery, it needs a request that doesn't carry a session, so that it
> can be probed with this request.
> 
> In jkstatus, the status of an error worker should switch to REC, when 
> mod_jk decides that it could send a non-sticky request there (to probe) 
> and to PRB, during the time this request is on the node, and finally 
> either to OK or back to ERR depending on the result of the request.
> 
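The state sequence described above can be sketched as a small state machine. This is only an illustration of the behaviour Rainer describes, not mod_jk's actual code; the event names (`error`, `recover_wait_elapsed`, `probe_sent`, `probe_ok`, `probe_failed`, `sticky_request`) are hypothetical labels chosen for this sketch.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    ERR = "ERR"
    REC = "REC"   # ready for recovery, waiting for a non-sticky request
    PRB = "PRB"   # a probe request is currently running on the node


# Transitions as described in the message above. Event names are
# hypothetical; they are not taken from mod_jk.
TRANSITIONS = {
    (State.OK, "error"): State.ERR,
    (State.ERR, "recover_wait_elapsed"): State.REC,
    (State.REC, "probe_sent"): State.PRB,
    (State.PRB, "probe_ok"): State.OK,
    (State.PRB, "probe_failed"): State.ERR,
}


def step(state, event):
    """Return the next state; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.OK):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    # A sticky request bound to another node never probes the REC worker.
    print(run(["error", "recover_wait_elapsed", "sticky_request"]).value)
    print(run(["error", "recover_wait_elapsed", "probe_sent", "probe_ok"]).value)
```

The point of the sketch is the middle case: a worker in REC stays there until a request without a session arrives, which can take a long time on a portal with very few new logins.
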
> You can log the number of errors (and accesses) that happened on the 
> node in the httpd access log. If you think that the node simply stays in 
> error for a long time, then the error count (and access count) should 
> stay constant. I would expect, that they do not.
> 
> Have a look at how LogFormat in Apache httpd works, and then add some of 
> those documented in
> 
> http://tomcat.apache.org/connectors-doc/reference/apache.html
> 
> like:
> 
> JK_LB_LAST_NAME
> JK_LB_LAST_ACCESSED
> JK_LB_LAST_ERRORS
> JK_LB_LAST_BUSY
> JK_LB_LAST_STATE
> 
> using the syntax %{JK_LB_LAST_STATE}n etc.
> 
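Once those notes are in the access log, a short script can check whether the error count of a worker really stays constant. The sketch below assumes a hypothetical LogFormat that ends with `%{JK_LB_LAST_NAME}n %{JK_LB_LAST_ACCESSED}n %{JK_LB_LAST_ERRORS}n %{JK_LB_LAST_STATE}n`; the sample lines and the state strings in them are made up for illustration.

```python
# Sample lines are invented; the last four fields are assumed to be
# name, accessed, errors and state, in that order.
SAMPLE = [
    '10.0.0.1 - - [20/Feb/2008:10:04:01 +0100] "GET /swl/ HTTP/1.1" 200 512 INETP1011 100 3 OK',
    '10.0.0.2 - - [20/Feb/2008:10:04:41 +0100] "GET /swl/ HTTP/1.1" 504 0 INETP1011 101 4 ERR',
    '10.0.0.3 - - [20/Feb/2008:10:05:02 +0100] "GET /swl/ HTTP/1.1" 200 512 INETP1032 50 0 OK',
    '10.0.0.4 - - [20/Feb/2008:10:06:10 +0100] "GET /swl/ HTTP/1.1" 200 512 INETP1032 51 0 OK',
]


def error_counts_by_worker(lines):
    """Collect the sequence of logged error counts per worker name."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _accessed, errors, _state = fields[-4:]
        if not errors.isdigit():
            continue  # e.g. "-" when the note was not set for this request
        counts.setdefault(name, []).append(int(errors))
    return counts


def workers_with_new_errors(lines):
    """Return the workers whose error count changed within the log."""
    return sorted(
        name
        for name, seq in error_counts_by_worker(lines).items()
        if len(set(seq)) > 1
    )


if __name__ == "__main__":
    print(workers_with_new_errors(SAMPLE))
```

A worker that shows up in the result is still being accessed and still collecting errors, which would contradict the idea that it simply sits in error state untouched.
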
> > 
> > Another point is the learning. I read the docs, the information on the
> > Apache website. I don't find other ones (are there any?), and they do
> > not go into depth. If you read the spec and watch the logs it is, for
> > me, very hard to match things up. mod_jk also has many ways to check
> > whether there is a connection to the backend; I understand them, but
> > checking the reality in an error situation is very hard. By matching I
> > mean "which part of the communication sequence failed, why, and which
> > error message does it cause".
> > But I will try, and also study the mailing list.
> 
> It's hard for us too (sometimes).
> 
> > Thank you for your time. Tomorrow we will have the new version and will
> > see what happens.
> > 
> > best
> > ahmed
> 
> 
> Regards,
> 
> Rainer
> 
> > -------- Original Message --------
> >> Date: Wed, 20 Feb 2008 15:56:42 +0100
> >> From: Rainer Jung <ra...@kippdata.de>
> >> To: Tomcat Users List <us...@tomcat.apache.org>
> >> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> > 
> >> samk@twinix.com wrote:
> >>> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on
> >>> behalf of a User
> >>> Hello to all. After long unsuccessful research I hope someone can
> >>> give me a hint on the following problems.
> >>>
> >>> Our Apache-mod_jk-Tomcat infrastructure was running without problems
> >>> for about one year; then, two months ago, mod_jk errors started to occur.
> >>> We upgraded the mod_jk version and made improvements in the
> >>> worker.properties. The problems changed and became fewer, but sometimes
> >>> they still appear.
> >>>
> >>> It seems that the mod_jk workers lose the connection to their
> >>> Tomcat backend server; there are messages in the mod_jk log files
> >>> which point in this direction. Normally this seems not to be a big
> >>> problem, but under certain conditions (which?) the worker goes to an
> >>> error state and cannot recover by itself; this must be done manually.
> >>>
> >>> Problem 1: The Tomcats are reachable. It is unknown why the workers
> >>> think the server is dead.
> >>> Problem 2: I have no idea why the worker goes to an error state and
> >>> cannot recover.
> >>
> >> 2 is a consequence of 1
> >>
> >>> Problem 3: I miss explanations of the logged messages. I read the
> >>> messages but cannot match them to the situation: when does a worker
> >>> post these messages?
> >>
> >> 1 is a consequence of these messages
> >>
> >>> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
> >>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
> >>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
> >>> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.
> >>
> >> The second line tells us, that your configured reply_timeout fired.
> >> You set it to 120000 (2 minutes), so there are requests taking longer 
> >> than 2 minutes on the backend, before the first response packet comes 
> >> back from the backend.
> >>
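To see how often the reply timeout fires and on which workers, the mod_jk log can be scanned for that message. This is a minimal sketch: the line layout is inferred only from the four log lines quoted in this thread, and other mod_jk versions or log settings may format lines differently.

```python
import re
from collections import Counter

# Layout inferred from the quoted lines:
# [timestamp] [pid:tid] [level] function::file (line): message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<pid>\d+):(?P<tid>\d+)\] \[(?P<level>\w+)\] "
    r"(?P<func>\S+)::(?P<file>\S+) \((?P<line>\d+)\): (?P<msg>.*)$"
)
WORKER_RE = re.compile(r"^\((?P<worker>[^)]+)\) ")
REPLY_TIMEOUT_TEXT = "Timeout with waiting reply from tomcat"

SAMPLE = [
    "[Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi",
    "[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)",
    "[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0",
    "[Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.",
]


def count_reply_timeouts(lines):
    """Count reply-timeout messages per worker name."""
    counts = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match or REPLY_TIMEOUT_TEXT not in match.group("msg"):
            continue
        worker = WORKER_RE.match(match.group("msg"))
        counts[worker.group("worker") if worker else "unknown"] += 1
    return counts


if __name__ == "__main__":
    for worker, count in count_reply_timeouts(SAMPLE).most_common():
        print(worker, count)
```

Run over a whole day of logs, the counts per worker and the timestamps of the matching lines show whether the timeouts cluster at certain times.
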
> >> With your configuration mod_jk then doesn't wait any longer for the
> >> reply *and puts the backend into error mode*.
> >>
> >> Up until version 1.2.25, if you use a reply_timeout, you need to set it
> >> to a high number which justifies the reasoning "if it takes that long, 
> >> then something is wrong with the backend".
> >>
> >> Reality shows: there is no such number. Often there are a few requests 
> >> that take unacceptably long on the backend *although* the backend is 
> >> still working.
> >>
> >> So in 1.2.25 we added max_reply_timeouts. With this set in addition to 
> >> reply_timeout, mod_jk will abort waiting for a reply after 
> >> reply_timeout, but allow some timeouts before actually deciding to put 
> >> the backend into error.
> >>
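The idea behind max_reply_timeouts can be sketched as a tolerance counter. This is a simplified model, not the mod_jk implementation: the class and method names are hypothetical, and the decay step (halving the counter during periodic maintenance) is an assumption based on my reading of the workers reference that should be checked against the documentation for your version.

```python
class ReplyTimeoutTracker:
    """Toy model of tolerating a few reply timeouts before marking a worker as failed."""

    def __init__(self, max_reply_timeouts):
        self.max_reply_timeouts = max_reply_timeouts
        self.timeouts = 0
        self.in_error = False

    def on_reply_timeout(self):
        """Record one timed-out request; return True if the worker is now in error."""
        self.timeouts += 1
        if self.timeouts > self.max_reply_timeouts:
            self.in_error = True
        return self.in_error

    def on_maintenance(self):
        """Assumed decay: halve the counter once per maintenance interval."""
        self.timeouts //= 2


if __name__ == "__main__":
    tracker = ReplyTimeoutTracker(max_reply_timeouts=6)
    results = [tracker.on_reply_timeout() for _ in range(7)]
    print(results)  # only the last timeout pushes the worker into error
```

With `max_reply_timeouts=0` the model reduces to the behaviour described for plain reply_timeout: the first timeout already puts the worker into error.
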
> >> Unfortunately the implementation of max_reply_timeouts in 1.2.25 was 
> >> wrong, so you need to go to 1.2.26 to get it working right.
> >>
> >> See:
> >>
> >> http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
> >>
> >> Caution: this does *not* explain why the backends are not automatically 
> >> recovered after a minute of error condition. Maybe there are times when 
> >> you get too many of those reply_timeouts (see log file), and although we 
> >> recover after a minute, the backend almost immediately goes back into 
> >> error status.
> >>
> >>> -> Which timeout? How does mod_jk decide that Tomcat is down? Where can
> >>> I find details on errno=110?
> >>
> >> reply_timeout, see above and also
> >>
> >> http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
> >>
> >> errno: a standard unix feature. The numbers are platform dependent. I 
> >> would assume in your case
> >>
> >> ETIMEDOUT       110     /* Connection timed out */
> >>
> >> so no wonder, that's exactly what we expect (and doesn't tell us the 
> >> reason, i.e. what's wrong on the *backend* taking that long for a
> >> response).
> >>
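Because errno numbers are platform dependent, the quickest way to look one up is on the machine that wrote the log. A minimal sketch using Python's standard `errno` and `os` modules; the helper name `describe_errno` is made up for this example.

```python
import errno
import os


def describe_errno(number):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(number, "unknown")
    return name, os.strerror(number)


if __name__ == "__main__":
    # 110 is the value from the mod_jk log; its meaning depends on the
    # platform this script runs on.
    print(describe_errno(110))
    # The numeric value of ETIMEDOUT on this platform:
    print(errno.ETIMEDOUT)
```

Run it on the web server host itself, since the same number can mean something else on a different operating system.
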
> >>> -> "receiving reply from tomcat failed with out recovery in send loop
> >>> attempt=0": what does "with out recovery in send loop" mean?
> >>
> >> That your configuration doesn't allow us to send the request to another
> >> backend. recovery_options 7 includes: if mod_jk was able to send the 
> >> request to a backend, do not try to send it to another backend in case 
> >> of an error during the response handling. Even if you would allow 
> >> sending to another backend, it would not help with *not* putting the 
> >> worker into error state. More likely would be, that you would put all 
> >> workers into error state, because all of them might run into the same 
> >> timeout, one after the other.
> >>
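recovery_options is a bit mask, so a value like 7 is easier to read once decoded. The flag descriptions below are paraphrased from my reading of the workers.properties reference and should be verified against the documentation for the installed mod_jk version; the function name is hypothetical.

```python
# Bit meanings paraphrased from the workers reference (an assumption to
# verify for your mod_jk version).
RECOVERY_FLAGS = {
    1: "do not recover if Tomcat failed after receiving the request",
    2: "do not recover if Tomcat failed after sending headers to the client",
    4: "close the connection to Tomcat on an error while writing back to the client",
    8: "always recover HEAD requests",
    16: "always recover GET requests",
}


def decode_recovery_options(value):
    """List the descriptions of all bits set in a recovery_options value."""
    return [text for bit, text in sorted(RECOVERY_FLAGS.items()) if value & bit]


if __name__ == "__main__":
    for line in decode_recovery_options(7):
        print("-", line)
```

For the value 7 from the posted worker.properties this lists the first three entries, which matches the "no retry on another backend once the request was sent" behaviour described above.
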
> >>> -> unrecoverable error 504 - details to this error ?
> >> That's simply how we return the situation back to the client (browser).
> >>
> >>> OK, I turned the logging level to debug. The course of events gets
> >>> clearer, but more questions also appear. There are socket numbers:
> >>> which sockets, and what are these numbers, e.g. "will be shutting down
> >>> socket 35 for worker INETP1021"? What are the sockets good for? How
> >>> many are there per worker? Can I configure them?
> >> Should not be the problem here. For apache httpd if you do *not* 
> >> configure anything, we automatically choose the number of httpd threads
> >> as the maximum number of connections. No need to change anything here.
> >>> => Generally: how can I solve such problems? I tried to look into
> >>> the mod_jk code, searching for error codes and error messages, but
> >>> cannot find any relevant information. I am studying the log files but
> >>> can't find out what really happens.
> >> Post to the list. Improve our docs.
> >>
> >> The error message contains the words "timeout" and "reply" and you have
> >> a "reply_timeout".
> >>
> >> Long running requests are a frequent problem. If you want to get rid of
> >> them, start by adding response times to your httpd and your tomcat 
> >> access log format (%D). Then have a look, which URLs are producing long
> >> running requests, during what time of day are they happening etc. This 
> >> might give you a clue about the reasons.
> >>
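Finding the URLs behind long-running requests can be sketched as follows. The script assumes a hypothetical access log format in which %D is appended as the last field after the common format (`%h %l %u %t "%r" %>s %b %D`), and the sample lines are invented. The unit differs by server (microseconds for httpd, milliseconds for Tomcat), so it is passed in explicitly.

```python
import re
from collections import Counter

# Matches: "METHOD /url HTTP/x.y" status bytes duration  (duration is %D)
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ (?P<duration>\d+)$'
)

# Invented sample lines in httpd style (duration in microseconds).
SAMPLE = [
    '10.0.0.1 - - [20/Feb/2008:10:02:39 +0100] "GET /swl/report?id=7 HTTP/1.1" 504 312 121000000',
    '10.0.0.2 - - [20/Feb/2008:10:03:10 +0100] "GET /swl/ HTTP/1.1" 200 5120 35000',
]


def slow_urls(lines, threshold_seconds, units_per_second):
    """Count requests per URL (query string stripped) slower than the threshold."""
    slow = Counter()
    for raw in lines:
        match = REQUEST_RE.search(raw.strip())
        if not match:
            continue
        seconds = int(match.group("duration")) / units_per_second
        if seconds > threshold_seconds:
            slow[match.group("url").split("?", 1)[0]] += 1
    return slow


if __name__ == "__main__":
    # httpd: %D is in microseconds; for a Tomcat log pass 1000 instead.
    for url, count in slow_urls(SAMPLE, 120.0, 1_000_000).most_common():
        print(url, count)
```

Using the configured reply_timeout (120 seconds here) as the threshold shows which URLs would have triggered the timeout; a lower threshold shows the requests that come close.
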
> >> And if they are very frequent: do Java thread dumps of your backends
> >> and analyze them.
> >>
> >>> So maybe someone has an idea why the worker thinks that the
> >>> corresponding Tomcat is dead, and why it will not recover by itself.
> >> Tomcat is dead: from the point of view of mod_jk it simply means: we 
> >> didn't get an answer, when we expected one. Details depend on the 
> >> additional log lines (could not connect, reply timeout etc.).
> >>
> >>> And I am also looking for tips on how I can help myself, and where to
> >>> find something about the error codes and messages in mod_jk.
> >>>
> >>> thanks for your attention
> >>> Best
> >>> ahmed musa (writing from vienna)
> >>>
> >> Regards,
> >>
> >> Rainer
> >>
> >>> Current Infrastructur
> >>> We have 3 Apache Webserver (2.2.6) -based on CentOS release 4.3
> >> /Kernelversion 2.6.9-34
> >>> In front of the Webserver there are two (two Locations)
> HW-Loadbalancer
> >> (but they have no role in this story)
> >>> The Webservers are hosted at our ISP.
> >>>  
> >>> The Webserver balance the requests via mod_jk (Version 1.2.25) for
> >>> approx. 10 Webapps to 18 Backend-Tomcatserver (Bladeserver - because
> of
> >>> underlying Application-Parts the OS is Windows 2003 Server - a long
> >>> story not worth to explain :-) ). The Tomcatserver gain Data via
> >>> Requests against DB2 Server/DB2-Databases on the Mainframe. The
> >>> Tomcatserver are Inhouse -and were rebooted nightly because of
> automated
> >>> Deployment processes.
> >>>
> >>> Between the Webserver and the Tomcatserver is a Checkpoint Firewall. 
> >>> All webapps are deployed on all Tomcats - only mod_jk manages the
> >>> requests to certain Tomcat- instances.
> >>> (on one Bladeserver there are two identically Tomcat Instances
> >>> running).
> >>>
> >>> Versions: Tomcat - 5.5.17_11, JDK 1.5.0_11-b03. The requests against
> >>> the public Website(s) are normal short living requests - not many - The
> >>> most Webapps (Portals) need a login, have a strong focus on business
> >>> logic - so the instances are big (many MBs in RAM), the sessions are
> >>> sticky and the session timeout is 20 minutes. But there are also less
> >>> requests. To the User requests - Monitoring requests from our ISP are added.
> >>> The Problems appears at Servers/Portals which very less Userrequests.
> >>>
> >>> worker.properties
> >>> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
> >>>
> >>> worker.template.type=ajp13
> >>> worker.template.lbfactor=5
> >>> worker.template.socket_keepalive=1
> >>> worker.template.connect_timeout=7000
> >>> worker.template.prepost_timeout=5000
> >>> worker.template.reply_timeout=120000
> >>> worker.template.retries=6
> >>> worker.template.activation=Active
> >>> worker.template.recovery_options=7
> >>>
> >>> worker.lbtemplate.type=lb
> >>> worker.lbtemplate.max_reply_timeouts=6
> >>> worker.lbtemplate.method=Session
> >>>
> >>> #Produktions Worker
> >>> # AS-INETP101 - 106 - 6/6 GGI
> >>> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
> >>> worker.INETP1011.port=65001
> >>> worker.INETP1011.reference=worker.template
> >>>
> >>> ....many more of the same
> >>>
> >>> then
> >>>
> >>> worker.ajp_ad.reference=worker.lbtemplate
> >>> worker.ajp_ad.balance_workers=INETP1032,INETP1062
> >>>
> >>> .... many more portals
> >>>
> >>> at least jkstatus
> >>>
> >>> The JKMount is very simple
> >>> JkMount /* ajp_ad    --- for the other portals mostly the same
> >>>
> >>> The Portals are Virtual Hosts on the Apache.
> >>>
> >>> Tomcat - server.xml
> >>> example
> >>> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
> >>>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
> >>> ......
> >>> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
> >>> autoDeploy="false" deployOnStartup="false" xmlValidation="false"
> >>> xmlNamespaceAware="false">
> >>>         <Alias>www.slfinsol.com</Alias>
> >>>         <Alias>web1.slfinsol.com</Alias>
> >>>         ...
> >>>         <Alias>testweb.slfinsol.com</Alias>
> >>>         .....
> >>>         <Valve className="org.apache.catalina.valves.AccessLogValve"
> >>> directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
> >>> resolveHosts="false" />
> >>>         <Valve
> >>> className="at.allianz.tomcat.valve.RequestTimeValve"/>
> >>>         <Valve
> >>> className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
> >>>         <Context path="" docBase="swl" />
> >>>         <Context path="/monitor5" docBase="monitor" />
> >>>         <Context path="/swl" docBase="swl" />
> >>>       </Host>    
> 

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: mod_jk Problems - - worker went to error state and dont recover

Posted by Rainer Jung <ra...@kippdata.de>.
Ahmed Musa wrote:
> Hello,
> Wow - thank you very much, Rainer, for your very quick and informative answer.
> I will go to 1.2.26 and think about some "smoother" values for reply_timeout and max_reply_timeouts.
> I will search for the requests which cause the problems - because I still log the response time in the way you mentioned - but I am not sure that the user requests are responsible for the situation.

One note: for Apache httpd 2.x %D is microseconds (there is no format 
for milliseconds), for Tomcat %D is milliseconds. As long as you are 
searching for the root cause, it might make sense to have both access 
logs active to check for differences in duration.
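
As a rough sketch of the httpd side (the format nickname and the log file 
path are made up for illustration), %D can simply be appended to a 
common-style format. The Tomcat side would need %D added to the 
AccessLogValve pattern in the same way:

```apache
# Hypothetical nickname "common_duration" and log path, chosen for this example.
# On httpd 2.x, %D is the time taken to serve the request in microseconds.
LogFormat "%h %l %u %t \"%r\" %>s %b %D" common_duration
CustomLog logs/access_log common_duration
```

When comparing the two logs, divide the httpd value by 1000 to get 
milliseconds.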

> So one further question - does mod_jk itself check whether the backend is reachable - without user requests?

No. Everything only works on top of user requests.

> When there are connections to the backend - are they closed after the response, or are they held open for further requests?

In general they are held open. There are parameters for how long they are 
held open without further requests before they get shut down, and also how 
many might be kept open even when no requests are coming in. Those are the 
connection pool parameters (connection_pool_timeout and 
connection_pool_minsize), which you will find on

http://tomcat.apache.org/connectors-doc/reference/workers.html

Tomcat also has a connectionTimeout on the connector, which will shut 
down a connection from the Tomcat side if it is idle for too long.

If you don't want to reuse connections at all, there's also a setting (a 
JkOption in Apache).
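
A minimal sketch of that setting, assuming it is placed in the httpd 
configuration next to the other Jk directives (globally or in the 
relevant virtual host):

```apache
# Close the backend connection after each request instead of reusing it.
JkOptions +DisableReuse
```

This costs a new connection per request, so it is probably only worth 
trying while hunting for dropped idle connections.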

> Is it possible that the Checkpoint firewall in between is responsible for the connectivity problem?

It can cut a connection that's idle for too long. Since you have 
cping/cpong active via connect_timeout and prepost_timeout, you should 
get a cping error message if the connection was dropped by the firewall 
during idle times and mod_jk tries to use it again. The reply timeout in 
the error log indicates that the backend isn't answering. Of course, if 
it takes *very* long to answer, it might be that the firewall dropped 
the connection in between, but then the root cause would still be the 
long response time of the backend.

> Another point is the "not recovering" of the worker. Yes, you are right - in this situation I have many reply_timeouts - but these happen within a period of time - for example 30 minutes - and the worker is still dead even when there are no more reply_timeouts. It remains dead.
> It was necessary to restart it manually via jkstatus.

I assume you are using stickiness, so when a session has started on a node, 
it will stay there. So when a worker is in error for a long time, all 
new sessions will start on other nodes. Once the worker is ready for 
recovery, it needs a request that doesn't carry a session, so that it can 
be probed with this request.

In jkstatus, the status of a worker in error should switch to REC when 
mod_jk decides that it could send a non-sticky request there (to probe), 
to PRB while this request is on the node, and finally either to OK or 
back to ERR, depending on the result of the request.

You can log the number of errors (and accesses) that happened on the 
node in the httpd access log. If you think that the node simply stays in 
error for a long time, then the error count (and access count) should 
stay constant. I would expect that they do not.

Have a look at how LogFormat in Apache httpd works, and then add some of 
the notes documented in

http://tomcat.apache.org/connectors-doc/reference/apache.html

like:

JK_LB_LAST_NAME
JK_LB_LAST_ACCESSED
JK_LB_LAST_ERRORS
JK_LB_LAST_BUSY
JK_LB_LAST_STATE

using the syntax %{JK_LB_LAST_STATE}n etc.
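
Put together, this could look roughly like the following (the nickname 
"jk_lb" and the log file name are invented for this example; only the 
note names come from the list above):

```apache
# Hypothetical nickname and log file name; the %{...}n notes are set by mod_jk.
LogFormat "%h %l %u %t \"%r\" %>s %b %{JK_LB_LAST_NAME}n %{JK_LB_LAST_STATE}n %{JK_LB_LAST_ACCESSED}n %{JK_LB_LAST_ERRORS}n %{JK_LB_LAST_BUSY}n" jk_lb
CustomLog logs/jk_access_log jk_lb
```

If the error and access counts for a node keep changing between log 
lines, the node is not simply sitting in error the whole time.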

> 
> Another point is the learning - I read the docs - the info on the Apache website; I don't find other ones - are there other ones? - and they do not go into depth - if you read the spec and watch the logs it is - for me - very hard to match things up. Also the many possibilities that mod_jk has to check whether there is a connection to the backend,... - I understand them, but checking the reality in an error situation is very hard. By matching I mean "Which part of the communication sequence failed - why - and which error message does it cause".
> But I will try - and also study the mailing list..

It's hard for us too (sometimes).

> Thank you for your time - tomorrow we will have the new version and will see what happens.
> 
> best
> ahmed


Regards,

Rainer




Re: mod_jk Problems - - worker went to error state and dont recover

Posted by Ahmed Musa <do...@gmx.at>.
Hello,
Wow - thank you very much, Rainer, for your very quick and informative answer.
I will go to 1.2.26 and think about some "smoother" values for reply_timeout and max_reply_timeouts.
I will search for the requests which cause the problems - because I still log the response time in the way you mentioned - but I am not sure that the user requests are responsible for the situation.

So one further question - does mod_jk itself check whether the backend is reachable - without user requests?
When there are connections to the backend - are they closed after the response, or are they held open for further requests?
Is it possible that the Checkpoint firewall in between is responsible for the connectivity problem?

Another point is the "not recovering" of the worker. Yes, you are right - in this situation I have many reply_timeouts - but these happen within a period of time - for example 30 minutes - and the worker is still dead even when there are no more reply_timeouts. It remains dead.
It was necessary to restart it manually via jkstatus.

Another point is the learning - I read the docs - the info on the Apache website; I don't find other ones - are there other ones? - and they do not go into depth - if you read the spec and watch the logs it is - for me - very hard to match things up. Also the many possibilities that mod_jk has to check whether there is a connection to the backend,... - I understand them, but checking the reality in an error situation is very hard. By matching I mean "Which part of the communication sequence failed - why - and which error message does it cause".
But I will try - and also study the mailing list..

Thank you for your time - tomorrow we will have the new version and will see what happens.

best
ahmed




Re: mod_jk Problems - - worker went to error state and dont recover

Posted by Rainer Jung <ra...@kippdata.de>.
samk@twinix.com wrote:
> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on behalf of a User
> 
> Hallo to all, After long unsuccessful research i hope someone can
> give me a hint to the following problems.
> 
> Our Apache-mod_jk-Tomcat Infrastructur was running without Problems
> for about one year-than since two month mod_jk errors occurs.
> We upgraded the mod_jk Version, made improvements in the
> worker.properties - the problems changed and get less but sometimes they
> appear further on.
> 
> It seems that the mod_jk worker loose the connection to their
> Tomcat-Backendserver - there are messages in the mod_jk log Files which
> points in this direction. Normally this seems not to be a big problem -
> but under certain conditions (which ?) the worker goes to an error state
> and cannot recover itself- must be done manually.
> 
> Problem 1: The Tomcats are reachable - unknown why the workers think the server is dead ?
> Problem 2: I have no idea why the worker goes to an error state and cannot recover.

2 is a consequence of 1

> Problem3: I miss explanations of logged messages - i read the messages - but cannot match them to the situation - when does a worker post this messages

1 is a consequence of these messages

> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi 
> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.

The second line tells us that your configured reply_timeout fired.
You set it to 120000 (2 minutes), so some requests take longer than 
2 minutes on the backend before the first response packet comes back.

With your configuration mod_jk then doesn't wait any longer on the reply 
*and puts the backend into error mode*.

Up until version 1.2.25, if you use a reply_timeout, you need to set it 
to a high number which justifies the reasoning "if it takes that long, 
then something is wrong with the backend".

Reality shows: there is no such number. Often there are a few requests 
that take unacceptably long on the backend *although* the backend is 
still working.

So in 1.2.25 we added max_reply_timeouts. With this set in addition to 
reply_timeout, mod_jk will abort waiting for a reply after 
reply_timeout, but allow some timeouts before actually deciding to put 
the backend into error.

Unfortunately the implementation of max_reply_timeouts in 1.2.25 was 
wrong, so you need to go to 1.2.26 to get it working right.

See:

http://issues.apache.org/bugzilla/show_bug.cgi?id=43229

Caution: this does *not* explain why the backends are not automatically 
recovered after a minute of error condition. Maybe there are times when 
you get too many of those reply_timeouts (see log file), and although we 
recover after a minute, the backend almost immediately goes back into 
error status.

> -> Which Timeout - how does mod_jk think Tomcat is down ? Where can i found details to errno=110 ?...

reply_timeout, see above and also

http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html

errno: a standard unix feature. The numbers are platform dependent. I 
would assume in your case

ETIMEDOUT       110     /* Connection timed out */

So no wonder, that's exactly what we expect. It doesn't tell us the 
reason, i.e. why the *backend* is taking that long to send a response.
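
To check what an errno number means on the machine that runs mod_jk, a small lookup like the following can help. This is only a sketch: `describe_errno` is a hypothetical helper name, and the result depends on the platform the script runs on (110 is ETIMEDOUT on Linux, but may map to something else elsewhere).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno number on this platform."""
    # errno.errorcode maps numbers to names such as "ETIMEDOUT".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the human-readable text the C library uses.
    message = os.strerror(code)
    return name, message


if __name__ == "__main__":
    # 110 is the value from the mod_jk log line above.
    print(describe_errno(110))
```

Run it on the web server itself, since the same number can have a different meaning on another operating system.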

> -> receiving reply from tomcat failed with out recovery in send loop attempt=0  - ? with out recovery in send loop - means?

It means that your configuration doesn't allow us to send the request to 
another backend. recovery_options=7 includes: if mod_jk was able to send 
the request to a backend, do not try to send it to another backend in 
case of an error during the response handling. Even if you allowed 
sending to another backend, it would not prevent the worker from being 
put into error state. More likely you would put all workers into error 
state, because all of them might run into the same timeout, one after 
the other.
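
Since recovery_options is a bit mask, it can help to spell out which bits a value such as 7 contains. The sketch below does that. The bit descriptions are my paraphrase of the workers.properties reference and should be treated as an assumption (bits 8 and 16 may not exist in older mod_jk versions); `decode_recovery_options` is a hypothetical helper name, not part of mod_jk.

```python
# Bit meanings paraphrased from the mod_jk workers.properties reference.
# Treat them as an assumption and check the docs for your mod_jk version.
RECOVERY_BITS = {
    1: "do not recover if Tomcat failed after receiving the request",
    2: "do not recover if Tomcat failed after sending headers to the client",
    4: "close the connection to Tomcat on an error while writing to the client",
    8: "always recover HEAD requests",
    16: "always recover GET requests",
}


def decode_recovery_options(value):
    """Return the descriptions of all bits set in a recovery_options value."""
    return [text for bit, text in sorted(RECOVERY_BITS.items()) if value & bit]


if __name__ == "__main__":
    # 7 = 1 + 2 + 4, the value used in the configuration quoted below.
    for line in decode_recovery_options(7):
        print(line)
```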

> -> unrecoverable error 504 - details to this error ?

That's simply how we return the situation back to the client (browser).

> 
> Ok - I turned the logging level to debug - the course of events gets
> clearer - but more questions appear - there are socket numbers -
> which sockets - what are these numbers, e.g. "will be shutting down
> socket 35 for worker INETP1021" - what are the sockets good for? - how
> many are there per worker? Can I configure them?

This should not be the problem here. For Apache httpd, if you do *not* 
configure anything, we automatically choose the number of httpd threads 
as the maximum number of connections. No need to change anything here.
> 
> => Generally - how can I solve such problems? I tried to look into the
> mod_jk code, searching for error codes and error messages, but could
> not find any relevant information. I am studying the log files, but
> cannot find out what really happens.

Post to the list. Improve our docs.

The error message contains the word "timeout" and "reply" and you have a 
"reply_timeout".

Long-running requests are a frequent problem. If you want to get rid of 
them, start by adding response times to your httpd and your Tomcat 
access log format (%D). Then look at which URLs produce long-running 
requests, at what time of day they happen, etc. This might give you a 
clue about the reasons.
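
Once %D is in the access log, a short script can list where the slow requests cluster. The following is a minimal sketch, assuming a Common Log Format line with %D appended as the last field; the regular expression, the function names and the sample log lines are illustrative assumptions, not something taken from this setup. Note the unit: httpd logs %D in microseconds, while the Tomcat AccessLogValve of that generation logs it in milliseconds as far as I know, so choose the threshold accordingly.

```python
import re
from collections import defaultdict

# Assumed httpd format: LogFormat "%h %l %u %t \"%r\" %>s %b %D"
LINE_RE = re.compile(
    r'\[(?P<day>[^:]+):(?P<hour>\d{2}):[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" \d{3} \S+ (?P<duration>\d+)$'
)


def slow_requests(lines, threshold):
    """Yield (hour, path, duration) for requests with %D >= threshold."""
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # line does not follow the assumed format
        duration = int(match.group("duration"))
        if duration >= threshold:
            path = match.group("url").split("?")[0]  # drop the query string
            yield match.group("hour"), path, duration


def summarize(lines, threshold):
    """Count slow requests per (hour of day, URL path)."""
    counts = defaultdict(int)
    for hour, path, _ in slow_requests(lines, threshold):
        counts[(hour, path)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample lines; the threshold is 2 minutes in microseconds.
    sample = [
        '10.0.0.1 - - [20/Feb/2008:10:04:01 +0100] '
        '"GET /swl/report?x=1 HTTP/1.1" 200 5120 125000000',
        '10.0.0.2 - - [20/Feb/2008:10:05:10 +0100] '
        '"GET /swl/index HTTP/1.1" 200 812 35000',
    ]
    print(summarize(sample, 120_000_000))
```

In practice you would pass an open log file instead of the sample list and lower the threshold until a pattern shows up.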

And if they are very frequent: do Java Thread Dumps of your backends and 
analyze them.

> So - maybe someone has an idea why the worker think that the
> corresponding Tomcat is dead, and why he will not recover by itself. !

Tomcat is dead: from the point of view of mod_jk it simply means that we 
didn't get an answer when we expected one. Details depend on the 
additional log lines (could not connect, reply timeout etc.).

> And i am also searching for tips how i can help myself - and where to
> find something about the error codes, messages,..in mod_jk
> 
> thanks for your attention
> Best
> ahmed musa (writing from vienna)
>

Regards,

Rainer

> Current Infrastructur
> We have 3 Apache Webserver (2.2.6) -based on CentOS release 4.3 /Kernelversion 2.6.9-34
> In front of the Webserver there are two (two Locations) HW-Loadbalancer (but they have no role in this story)
> The Webservers are hosted at our ISP.
>  
> The Webserver balance the requests via mod_jk (Version 1.2.25) for
> approx. 10 Webapps to 18 Backend-Tomcatserver (Bladeserver - because of
> underlying Application-Parts the OS is Windows 2003 Server - a long
> story not worth to explain :-) ). The Tomcatserver gain Data via
> Requests against DB2 Server/DB2-Databases on the Mainframe. The
> Tomcatserver are Inhouse -and were rebooted nightly because of automated
> Deployment processes.
> 
> Between the Webserver and the Tomcatserver is a Checkpoint Firewall. 
> All webapps are deployed on all Tomcats - only mod_jk manages the
> requests to certain Tomcat- instances.
> (on one Bladeserver there are two identically Tomcat Instances
> running).
> 
> Versions: Tomcat - 5.5.17_11, JDK 1.5.0_11-b03. The requests against
> the public Website(s) are normal short living requests - not many - The
> most Webapps (Portals) need a login, have a strong focus on business
> logic - so the instances are big (many MBs in RAM), the sessions are
> sticky and the session timeout is 20 minutes. But there are also less
> requests. To the User requests - Monitoring requests from our ISP are added.
> The Problems appears at Servers/Portals which very less Userrequests.
> 
> worker.properties
> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
> 
> worker.template.type=ajp13
> worker.template.lbfactor=5
> worker.template.socket_keepalive=1
> worker.template.connect_timeout=7000
> worker.template.prepost_timeout=5000
> worker.template.reply_timeout=120000
> worker.template.retries=6
> worker.template.activation=Active
> worker.template.recovery_options=7
> 
> worker.lbtemplate.type=lb
> worker.lbtemplate.max_reply_timeouts=6
> worker.lbtemplate.method=Session
> 
> #Produktions Worker
> # AS-INETP101 - 106 - 6/6 GGI
> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
> worker.INETP1011.port=65001
> worker.INETP1011.reference=worker.template
> 
> ....many more of the same
> 
> then
> 
> worker.ajp_ad.reference=worker.lbtemplate
> worker.ajp_ad.balance_workers=INETP1032,INETP1062
> 
> .... many more portals
> 
> at least jkstatus
> 
> The JKMount is very simple
> JkMount /* ajp_ad    --- for the other portals mostly the same
> 
> The Portals are Virtual Hosts on the Apache.
> 
> Tomcat - server.xml
> example
> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
> ......
> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
> autoDeploy="false" deployOnStartup="false" xmlValidation="false"
> xmlNamespaceAware="false">
>         <Alias>www.slfinsol.com</Alias>
>         <Alias>web1.slfinsol.com</Alias>
>         ...
>         <Alias>testweb.slfinsol.com</Alias>
>         .....
>         <Valve className="org.apache.catalina.valves.AccessLogValve"
> directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
> resolveHosts="false" />
>         <Valve
> className="at.allianz.tomcat.valve.RequestTimeValve"/>
>         <Valve
> className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
>         <Context path="" docBase="swl" />
>         <Context path="/monitor5" docBase="monitor" />
>         <Context path="/swl" docBase="swl" />
>       </Host>    
