Posted to dev@tomcat.apache.org by Shinta Tjio <st...@broadjump.com> on 2001/03/08 23:31:36 UTC

RE: Design Review for ajp13's changes: WAS problem w/ ajp13 - if Tomcat is shutdown

Dan,
thanks for reviewing.... See my inline comments.

> First off, it's GREAT that you are working on this -- it's a 
> very heavily requested improvement.
> 
> I haven't had time to do a full review yet, but I've taken a 
> quick look and I have a few questions/suggestions:
> 
>  1) This work will end up being committed in the 3.3 branch, 
> rather than the 3.2 branch.  3.2 is only bug fixes now, not 
> new features. The mod_jk C code is very similar in 3.3 and 
> 3.2, but not identical.  I can adapt your patches, but if 
> you wanted to work against 3.3, that would make things
> easier.

I can do that. I don't think that would be too hard, unless
3.3 is widely different from 3.2. I haven't seen the code for
3.3 yet.

Btw, when is 3.3 supposed to be released?

>  2) You seem to have managed to puzzle out some of the 
> workings of mod_jk already, but, FYI, I just committed 
> some internal documentation a few days ago. It's in the 
> 3.3 branch, in src/native/mod_jk/common/jk_service.h --
> may be worth taking a look at.

I will check this out... Thanks!

>  3) For option (1), I have a few questions
> 
>   - Is there a way in which data could be lost?  
> Specifically, as you state, the send() will return 
> without error, and then it will only get the error on
> the following read().  Is all the data always preserved 
> so that simply retrying will work correctly?  I think 
> most of that state is in the jk_ws_service_t object -- 
> is it possible a read pointer will be advanced and
> data will be lost?  This may be acceptable, but I'd like to 
> understand it...

Well, since I'm not that familiar with the code, I didn't know
what to look for. I will check the jk_ws_service_t object.
As far as I can see, in my tests, my requests got handled 
properly in the retry. But I will double check and let you
know.

>  - You only retry once.  If there are a number of connections 
> open (from a single Apache process), isn't it possible that 
> Tomcat has come back up, and that the next connection 
> obtained (from the endpoint cache), will also be stale?  
> Would it make sense in this case to trigger a shutdown of 
> all the connections currently in the cache (and then retry 
> once)? That would make sense if there were no other ways to 
> get a ECONNRESET error.  

I retry only once because I want to avoid an infinite loop.
I don't see the point of trying 2 or 3 more times if I keep
getting the same error. Someone has to fix the real problem,
and looping like that may hog the CPU.

You have a good point about the multiple cached connections
per Apache process. I hadn't thought about that before.
But I think it will work fine. My assumption is that each
request uses only one cache endpoint (an entry in the
ep_cache[] array). If my first try at servicing the request
gives me ECONNRESET, then I only close the connection for that
endpoint, and reopen it.

The next request may get another cache endpoint that's
dead. But in that case it will also be closed and
retried properly.

I think this keeps the code simpler. It may take several
requests to clean up those dead sockets. But they will
eventually be cleaned up.
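
To make the retry-once behavior concrete, here is a minimal C sketch
of the idea. It is an illustration only, not the attached patch:
demo_endpoint_t, demo_service() and demo_handle_request() are
hypothetical names, and the socket calls are simulated with flags so
that the control flow is easy to follow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one cached endpoint; NOT the real mod_jk type. */
typedef struct {
    int connected;   /* 1 if we hold a socket to Tomcat (possibly stale) */
    int stale;       /* 1 if Tomcat has closed its end of that socket */
    int tomcat_up;   /* 1 if a fresh connect() would succeed */
    int calls;       /* number of demo_service() calls, for illustration */
} demo_endpoint_t;

/* Simulated service(): returns 1 on success, 0 on failure.
 * Sets *is_recoverable_error to 1 only for the simulated ECONNRESET case. */
static int demo_service(demo_endpoint_t *ep, int *is_recoverable_error)
{
    assert(ep != NULL && is_recoverable_error != NULL);

    ep->calls++;
    *is_recoverable_error = 0;

    if (!ep->connected) {
        if (!ep->tomcat_up) {
            return 0;            /* simulated connect() failure: not retried */
        }
        ep->connected = 1;       /* fresh connection, cannot be stale */
        ep->stale = 0;
    }

    if (ep->stale) {
        /* Simulates send() succeeding and recv() reporting ECONNRESET:
         * drop the dead socket and tell the caller a retry may help. */
        ep->connected = 0;
        ep->stale = 0;
        *is_recoverable_error = 1;
        return 0;
    }
    return 1;
}

/* Caller side: retry at most once, so this can never loop forever. */
static int demo_handle_request(demo_endpoint_t *ep)
{
    int is_recoverable_error = 0;
    int ok = demo_service(ep, &is_recoverable_error);

    if (!ok && is_recoverable_error) {
        ok = demo_service(ep, &is_recoverable_error);
    }
    return ok;
}
```

With a stale endpoint and Tomcat back up, the first call fails as
recoverable and the single retry reconnects. With Tomcat still down,
the retry fails in the connect step and is not repeated.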

>  - Or, more generally, just so I (and everyone) can 
> understand, how does this new code deal with the 
> following stages:
> 
>   1) TC and Apache both up and running

If TC and Apache are both up and running, then recv()
should never get ECONNRESET, so the is_recoverable_error
flag will never be set to TRUE. Everything should work
like before.

>   2) TC is shutdown
> If mod_jk is in the middle of handling a request, what 
> happens?  There was an infinite loop in the 3.2.1 code, 
> but that's been fixed in 3.2.2 and 3.3.

Do you know what the exact cause of the infinite loop was?

If TC is shut down before the send(), then we will handle it
when we get the ECONNRESET. If TC is shut down after the
send() but before the recv(), the same thing (ECONNRESET) will
happen. If TC is shut down after the recv(), we won't consider
this an error and no retry will be performed.

I would hope, because I retry only once, this should prevent
the looping. 
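
On the worker side, the check described above could look roughly
like this C sketch for POSIX sockets. It is not the patch itself:
demo_recv_full() and demo_is_recoverable_errno() are hypothetical
names, and the sketch assumes ECONNRESET is the only errno treated
as recoverable, as proposed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: only ECONNRESET is considered worth a retry. */
static int demo_is_recoverable_errno(int err)
{
    return err == ECONNRESET;
}

/* Read exactly len bytes from fd. Returns 1 on success, 0 on failure.
 * On failure, *is_recoverable_error is 1 only if recv() failed with
 * ECONNRESET; an orderly close (recv() == 0) is a plain failure here. */
static int demo_recv_full(int fd, void *buf, size_t len,
                          int *is_recoverable_error)
{
    size_t got = 0;

    assert(buf != NULL && is_recoverable_error != NULL);
    *is_recoverable_error = 0;

    while (got < len) {
        ssize_t n = recv(fd, (char *)buf + got, len - got, 0);

        if (n > 0) {
            got += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR) {
            continue;            /* interrupted by a signal: try again */
        }
        if (n < 0 && demo_is_recoverable_errno(errno)) {
            *is_recoverable_error = 1;
        }
        return 0;
    }
    return 1;
}
```

The caller would close the socket on any failure and retry only when
the flag is set.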

>   3) TC is shutdown, Apache is still up.  While TC is down, 
> requests come in. How are they handled?  Are there any loops 
> Apache gets stuck in?

Apache won't get stuck. If TC is still down when a request comes
in, the recv() will get ECONNRESET and a retry will be performed.
This time, however, connect() will fail and take a different
error path, so an Internal Server Error will be returned.
I accidentally tested this scenario when I forgot to restart
Tomcat and I reloaded the page. :-)
 
>   4) TC starts back up.  Now requests get handled smoothly again? 

And yes. 

>  4) For option (2):
> 
>  - If the user has Win32, you're just punting, correct?  Why 
> is that?  I know nothing about Win32 socket programming, but I'm 
> curious...  You say you're testing on Win2k -- does Win2k support 
> select(), but win32 doesn't? Does anyone know about how widely 
> select() is supported?

The reason I didn't do one for Win32 is that I haven't been able
to reproduce the problem on Win32, at least on Windows 2000. It must
handle sockets differently. I hate to put in unnecessary code,
especially when I can't reproduce the problem, can't test the fix,
and can't make sure it actually fixes anything.

But I think select() is supported on Windows.
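
For reference, the select()-based check from option (2) can be
sketched for POSIX systems as below. This is a rough illustration,
not the attached diff: demo_peer_closed() is a hypothetical name, and
the sketch peeks with MSG_PEEK instead of doing a real 1-byte read,
which is my own variation so that no data is consumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns 1 if the peer appears to have closed the socket, 0 otherwise.
 * Hypothetical helper, not part of mod_jk. */
static int demo_peer_closed(int fd)
{
    fd_set readable;
    struct timeval no_wait = {0, 0};   /* poll, do not block */
    char byte;
    ssize_t n;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&readable);
    FD_SET(fd, &readable);

    /* Nothing readable, or select() itself failed: treat the socket
     * as not closed and let the normal send/recv path decide. */
    if (select(fd + 1, &readable, NULL, NULL, &no_wait) <= 0) {
        return 0;
    }

    /* Readable before we sent anything: either data or end-of-file.
     * MSG_PEEK leaves any real data in place for the normal read path. */
    n = recv(fd, &byte, 1, MSG_PEEK);
    if (n == 0) {
        return 1;                      /* orderly close by the peer */
    }
    if (n < 0 && errno == ECONNRESET) {
        return 1;                      /* connection reset by the peer */
    }
    return 0;
}
```

The caller would run this check just before send() and, if it
returns 1, close the socket so the existing reconnect logic kicks in.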

> I'll try to take a more thorough look (and do some testing). 

Do you have any comments as to which solution is better suited?

Just to recap, my action items are:
1. Merge fixes to 3.3 branch
2. Test & make sure no data is lost during retry.
3. Test some more on Windows.
4. If there're any changes because of the above action items,
   I will repost the changes.

Thanks, y'all! 
shinta

> Thanks again,
> -Dan 
> 
> > Shinta Tjio wrote:
> > 
> > Attached are the unified diffs for the proposed changes.
> > They are diffs against the 3.2.1 release code. I hope this
> > is sufficient. I haven't got to use Solaris patch tool yet.
> > These are tested on Solaris 2.8. Changes #1 is the one
> > that's less platform specific, since I don't call any
> > socket APIs.
> > 
> > I will test these on Windows 2000 tomorrow. As for other
> > UNIXes, we don't have those in house. So if someone
> > can volunteer to test it on other UNIX flavors, that
> > would be great!
> > 
> > Unified diffs for the proposed changes #1:
> >   jk_ajp13_worker.c.1.diff
> >   mod_jk.c.1.diff
> > 
> > Unified diffs for the proposed changes #2:
> >   jk_ajp13_worker.c.2.diff
> >   jk_connect.c.2.diff
> > 
> > thanks so much!
> > shinta
> > 
> > > -----Original Message-----
> > > From: cmanolache@yahoo.com [mailto:cmanolache@yahoo.com]
> > > Sent: Wednesday, March 07, 2001 6:57 PM
> > > To: 'tomcat-dev@jakarta.apache.org '
> > > > Subject: Re: Design Review for ajp13's changes: WAS problem w/ ajp13 -
> > > > if Tomcat is shutdown
> > >
> > >
> > > Hi Shinta,
> > >
> > > > It sounds like a solution to a real problem, please send a patch,
> > > > I'm sure someone will read it. Dan and Henri are the best
> > > > people to ask about this, I can also help a bit (I've been
> > > > using RPMs lately, it's too easy to get them and not worry
> > > > about compiling :-)
> > > >
> > > > My only suggestion/concern is that the code should work on
> > > > both Windows and unix - or at least compile :-)
> > >
> > > Costin
> > >
> > >
> > > > I would like to propose some changes to eliminate the
> > > > requirement to restart Apache, when you restart Tomcat.
> > > > I'm willing to give the code to anyone who needs it,
> > > > when I'm done testing.
> > > >
> > > > > But I need some help/suggestions so that I can put in
> > > > > the right code. If any of the proposed changes below
> > > > > should not exist at all, I'm open to other suggestions.
> > > > This is my first time looking at mod_jk's ajp13 code.
> > > > So any clue to make these better would be appreciated.
> > > >
> > > > Right now, if you use ajp13 and you restart Tomcat, you
> > > > have to also restart Apache. See details in previous
> > > > postings. For us, having to restart Apache is not a
> > > > feasible solution in our customers' environment.
> > > >
> > > > After looking at the code, I have two possible solutions:
> > > >
> > > > 1. From mod_jk, I can detect that the socket has been
> > > >    closed by Tomcat. This is normally indicated by the
> > > >    recv() returning ECONNRESET. The recv() is called
> > > >    after the request has been sent to the socket. The
> > > >    send() unfortunately, doesn't give you an error.
> > > >
> > > > >    The proposed fix is to check for errno ECONNRESET,
> > > > >    then set the is_recoverable_error flag to TRUE, in
> > > > >    the service() function in jk_ajp13_worker.c. I also
> > > > >    add code in mod_jk.c to check for this flag, and
> > > > >    call the service() method again if the flag is set
> > > > >    to TRUE. The 2nd time the service() method is called,
> > > > >    it will reconnect to Tomcat like normal.
> > > >
> > > > > 2. Another solution would be to put in a select() on the
> > > > >    socket prior to send(), checking whether the socket is
> > > > >    read-ready. Under normal conditions, this select()
> > > > >    should return nothing. But if Tomcat shuts down
> > > > >    the socket, this select() should report the socket
> > > > >    as read-ready. When this happens, I issue a read()
> > > > >    of 1 byte. If the read() comes back with return code
> > > > >    0, this should be an indication that the socket was
> > > > >    closed on the remote end. Then I will proceed to close
> > > > >    the socket. The remaining logic already handles the
> > > > >    reconnect, etc.
> > > >
> > > > > I have both of these solutions prototyped and minimally
> > > > > tested. They both Anyone care to comment which solution
> > > > > fits better with the overall code? Anyone volunteer to
> > > > > review the code?
> > > >
> > > > thanks,
> > > > shinta
> > > >
> > > > > -----Original Message-----
> > > > > From: Shinta Tjio
> > > > > To: tomcat-dev@jakarta.apache.org
> > > > > Cc: 'Dan Milstein'
> > > > > Sent: 3/6/01 7:01 PM
> > > > > Subject: RE: FW: problem w/ ajp13 - if Tomcat is shutdown
> > > > >
> > > > > I am using Tomcat 3.2.1, Apache 1.3.14, running on
> > > > > Solaris 2.8, Sun machines.
> > > > >
> > > > > After various attempts of debugging this, I have
> > > > > more information.
> > > > >
> > > > > 1. Even though I'm setting the worker's property
> > > > > cache_size to default (1), I'm finding there
> > > > > are up to 6 connections opened from Apache to
> > > > > Tomcat. I deduce this by looking at the mod_jk.conf
> > > > > and by doing netstat.
> > > > >
> > > > > > I finally found out that this is because my Apache
> > > > > > is set to spawn a minimum of 6 children and each
> > > > > > of those children is making a separate connection
> > > > > > to Tomcat.
> > > > > >
> > > > > > This is very bad because I end up having to
> > > > > > reload 6 times before Tomcat starts serving me
> > > > > > the page again. Each time it uses a different
> > > > > > Apache child that has a defunct socket. So the
> > > > > > more Apache children I have, the longer it takes
> > > > > > me to recover from this problem.
> > > > >
> > > > > 2. It seems when Tomcat dies & restarts, the send()
> > > > > called by ajp13's jk_tcp_socket_sendfull() does not
> > > > > get an error. But the recv() does get an error, with
> > > > > errno ECONNRESET. After which, the socket is properly
> > > > > closed.
> > > > >
> > > > > 3. When I shutdown Tomcat, those sockets that were
> > > > > opened between Apache/Tomcat showed up in state
> > > > > CLOSE_WAIT, and FIN_WAIT2. I think this is normally
> > > > > solved by calling the shutdown() API after closing
> > > > > the socket. However, this would have to be done from
> > > > > the Tomcat side in Ajp13ConnectionHandler.java.
> > > > > I can't find the corresponding method of Socket
> > > > > in Java.
> > > > >
> > > > > So.. based on all of these, the only fix I can think
> > > > > of putting is to make mod_jk retry the send() if
> > > > > recv() comes back with an error ECONNRESET. The retry
> > > > > should happen after the old socket is properly closed.
> > > > >
> > > > > Anyone wants to comment?
> > > > >
> > > > > shinta
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > > From: Dan Milstein [mailto:danmil@shore.net]
> > > > > > Sent: Tuesday, March 06, 2001 12:00 PM
> > > > > > To: tomcat-dev@jakarta.apache.org
> > > > > > Subject: Re: FW: problem w/ ajp13 - if Tomcat is shutdown
> > > > > >
> > > > > >
> > > > > > What version of TC are you using?  What version of Apache?
> > > > > >
> > > > > > > I would look into the mod_jk docs -- I think this is the
> > > > > > > spec'd behavior (which, admittedly, is not great, but that
> > > > > > > makes it more of a feature request than a bug ;-).  With
> > > > > > > ajp13, Apache opens up a persistent TCP/IP connection to
> > > > > > > TC -- if TC restarts, I think that connection may just hang
> > > > > > > up and then timeout (since Apache doesn't know that TC has
> > > > > > > restarted).
> > > > > > >
> > > > > > > If anyone wants to work on this, you would have the undying
> > > > > > > thanks of the rest of the TC community -- having to restart
> > > > > > > Apache all the time bugs a *lot* of people.
> > > > > >
> > > > > > -Dan
> > > > > >
> > > > > > > Shinta Tjio wrote:
> > > > > > >
> > > > > > > > I'm having a problem with mod_jk if ajp13 is used.
> > > > > > > >
> > > > > > > > The problem is often reproduced when Tomcat is shut
> > > > > > > > down without Apache being shut down. When a request
> > > > > > > > is fired through Apache as soon as Tomcat starts,
> > > > > > > > I often get an Internal Server Error. The mod_jk.log
> > > > > > > > will have the following:
> > > > > > >
> > > > > > > > > [jk_uri_worker_map.c (344)]: Into jk_uri_worker_map_t::map_uri_to_worker
> > > > > > > > > [jk_uri_worker_map.c (406)]: jk_uri_worker_map_t::map_uri_to_worker, Found a match ajp13
> > > > > > > > > [jk_worker.c (123)]: Into wc_get_worker_for_name ajp13
> > > > > > > > > [jk_worker.c (127)]: wc_get_worker_for_name, done found a worker
> > > > > > > > > [jk_ajp13_worker.c (651)]: Into jk_worker_t::get_endpoint
> > > > > > > > > [jk_ajp13_worker.c (536)]: Into jk_endpoint_t::service
> > > > > > > > > [jk_ajp13.c (346)]: Into ajp13_marshal_into_msgb
> > > > > > > > > [jk_ajp13.c (480)]: ajp13_marshal_into_msgb - Done
> > > > > > > > > [jk_ajp13_worker.c (203)]: connection_tcp_get_message: Error - jk_tcp_socket_recvfull failed
> > > > > > > > > [jk_ajp13_worker.c (619)]: Error reading request
> > > > > > > > > [jk_ajp13_worker.c (489)]: Into jk_endpoint_t::done
> > > > > > >
> > > > > > > If I hit reload multiple times, eventually Tomcat will
> > > > > > > serve the servlet fine.
> > > > > > >
> > > > > > > > Has anyone seen this problem before? Is there any way
> > > > > > > > around this?
> > > > > > >
> > > > > > > shinta
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Dan Milstein // danmil@shore.net
> > > > > >
> 

Re: Design Review for ajp13's changes: WAS problem w/ ajp13 - if Tomcat is shutdown

Posted by Nick Holloway <Ni...@pyrites.org.uk>.
stjio@broadjump.com (Shinta Tjio) writes:
> The reason I didn't do one for Win32 is because I hadn't been able
> to reproduce the problem on Win32, at least Windows 2000. They must 
> handle socket differently. I hate to put unnecessary code, especially
> when I can't reproduce it, can't test it and can't make sure I do 
> fix a problem.

I can confirm that with Apache 1.3.12, mod_jk, ajp13 and Tomcat 3.2.1
running on Windows 2000 (SP1), I do not need to restart Apache when
Tomcat is restarted.

-- 
 `O O'  | Nick.Holloway@pyrites.org.uk
// ^ \\ | http://www.pyrites.org.uk/
