Posted to dev@httpd.apache.org by Marc Slemko <ma...@go2net.com> on 1998/07/17 21:33:34 UTC

Re: RFC 1337: Time-wait assassination (fwd)

An example of how you can do low-overhead TIME_WAITs and why Microsoft and
Netscape are foolish.

---------- Forwarded message ----------
Date: Fri, 17 Jul 1998 09:04:51 -0700
From: David S. Miller <da...@dm.cobaltmicro.com>
To: stevea@shastanets.com
Cc: dab@BSDI.COM,
    end2end-interest@ISI.EDU
Subject: Re: RFC 1337: Time-wait assassination

   From: "Steve Alexander" <st...@shastanets.com>
   Date: Thu, 16 Jul 1998 18:21:40 -0700

Hey Steve,

   From a performance point of view, handling large numbers of
   TIME-WAIT connections is pretty easy.  From a memory point of view,
   I agree that it is wasteful, unless you compress the state.  Under
   IRIX 6.4, for example, a socket, PCB, and TCB takes almost 1K of
   memory; lots of 8-byte pointers add up fast.  If I hadn't left SGI
   to "pursue other interests" I would probably have had to kludge up
   a scheme that junked the unneeded parts; ugly but doable.

The following was the best I could come up with that handles both IPv4
and IPv6 in the same scheme under Linux to solve this memory
consumption problem, with some helpful annotations interspersed:

struct tcp_tw_bucket {

The header is made to match a real TCB exactly; this makes all of the
processing up to the actual TCP state-specific handling dispatch a
snap.  It covers the generic stuff: global TCP socket lists, local
port binding hash chains, etc.  This roughly fits into a cache line on
most machines.

	struct sock		*sklist_next;
	struct sock		*sklist_prev;
	struct sock		*bind_next;
	struct sock		**bind_pprev;

What follows is more of the header of a real TCB: the TCP socket demux
keys and the actual hash chain linkage for all sockets with a full
identity.  Again, this is a cache line on most CPUs.

	__u32			daddr;
	__u32			rcv_saddr;
	__u16			dport;
	unsigned short		num;
	int			bound_dev_if;
	struct sock		*next;
	struct sock		**pprev;

After a match, the incoming demux is always guaranteed to look at
bucket->state first; if it is TCP_TIME_WAIT, this TCB gets passed as a
"struct tcp_tw_bucket *" to a special function which handles just this
state (a rough sketch of that dispatch follows the struct below).  The
'reuse' flag being replicated here deserves a brief mention: we use it
for fast port allocations (useful on large-scale non-passive FTP
servers).  The local port TCB hashes keep track of whether "everyone
using this local port" has reuse set, and if so the local port bucket
has a flag set.  So if the new TCB wants this port, and it has reuse
set, and the local port bucket has this special reuse flag set as
well, we can fast-path the whole verification.  Anyway, the end result
is that we need to keep track of this state bit in the TIME_WAIT'ers
as well.

	unsigned char		state,
				zapped;
	__u16			sport;
	unsigned short		family;
	unsigned char		reuse,
				nonagle;

The private state.  You need to know the next expected receive
sequence number, mostly for regenerating a new TCB during "BSD
TIME_WAIT" processing.  The bind_bucket is just the local port hash
header I just spoke of.  The next two members are for the efficient
reaping of all these time-waiters.  Finally, the IPv6 addresses are
stuck at the end.

	/* And these are ours. */
	__u32			rcv_nxt;
	struct tcp_func		*af_specific;
	struct tcp_bind_bucket	*tb;
	struct tcp_tw_bucket	*next_death;
	int			death_slot;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
	struct in6_addr		v6_daddr;
	struct in6_addr		v6_rcv_saddr;
#endif
};
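
A rough sketch of that state-based dispatch (the lookup and handler
names here are invented for illustration; only the check-state-then-
cast idea comes from the description above):

	/* Find the matching entry in the established/TIME_WAIT hash,
	 * then branch on its state before touching anything
	 * TCB-specific.  tcp_lookup() and both handlers are
	 * placeholder names.
	 */
	struct sock *sk = tcp_lookup(saddr, sport, daddr, dport);

	if (sk && sk->state == TCP_TIME_WAIT) {
		/* Really one of the compact buckets above; hand it to
		 * the TIME_WAIT-only code.
		 */
		tcp_timewait_state_process((struct tcp_tw_bucket *)sk, skb);
	} else if (sk) {
		/* Full TCB, normal receive path. */
		tcp_rcv_state_process(sk, skb);
	}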

This amounts to around 100 bytes on a 32-bit machine, and around
144 bytes on a 64-bit machine.
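(Rough arithmetic, assuming 4-byte pointers and ints and 2-byte
shorts: 16 bytes of list pointers + 24 bytes of demux keys and hash
linkage + 8 bytes of state/port/flag fields + 20 bytes of private
state + 32 bytes for the two in6_addr's comes to exactly 100 bytes;
with 8-byte pointers, the same fields plus alignment padding come to
about 144.)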

Because I use a separate structure, I can completely discard the file
object etc. references, since I totally destroy the real TCB from
which this TIME_WAIT'er came to be.  So each TIME_WAIT connection
literally costs 100 or 144 bytes, depending upon the word size of the
machine.

So about 65,000 TIME_WAIT connections consume around 10MB of RAM.
And I allocate them using a SLAB-style allocator to get good cache
coloring, which decreases demux time for new incoming connections
(since I have to walk at least one TIME_WAIT chain to make completely
sure I am creating a unique identity, and can then validly look for an
appropriate listener).  Before I implemented all of this, 40,000
connections consumed roughly 64MB of RAM.
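(That works out: 65,000 x 144 bytes is a bit over 9MB, while the old
cost was roughly 64MB / 40,000, i.e. about 1.6K per connection for a
full socket plus TCB.)  Purely as an illustration of the allocation
strategy, and not the actual code (the kmem_cache_create() signature
has changed across kernel versions), a dedicated, cache-line-aligned
slab for these buckets could be set up along these lines:

	static struct kmem_cache *tcp_tw_cachep;

	tcp_tw_cachep = kmem_cache_create("tcp_tw_bucket",
					  sizeof(struct tcp_tw_bucket),
					  0, SLAB_HWCACHE_ALIGN, NULL);

	/* ...each TIME_WAIT'er then comes from
	 * kmem_cache_alloc(tcp_tw_cachep, GFP_ATOMIC) instead of a
	 * full socket allocation. */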

It seems to be very effective from my tests.  On a 64-bit machine with
512k of L2 cache, things begin to break down at around 80,000
TIME_WAIT connections, but up until that point it scales linearly.

As I've shown, it is "doable".  Here it is only slightly "ugly", and I
could have made it much less so if I had taken the time to define a
generic "TCB header" which real TCBs and these "pseudo-TCBs" would
declare at the head of their structures.  I'm just too lazy...
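
For what it's worth, a self-contained sketch of what that shared
header idea could look like (all of the names here are invented for
illustration and are not the real Linux structures):

	#include <stdio.h>

	#define TCP_TIME_WAIT 6			/* placeholder state value */

	/* The common prefix both structures would declare first. */
	struct tcb_header {
		unsigned char	state;
		unsigned short	sport, dport;
	};

	struct full_tcb {
		struct tcb_header hdr;		/* must be the first member */
		unsigned int	snd_nxt, rcv_nxt;
		/* ...windows, queues, socket back-pointer, etc... */
	};

	struct tw_tcb {
		struct tcb_header hdr;		/* must be the first member */
		unsigned int	rcv_nxt;	/* only what TIME_WAIT needs */
	};

	/* Demux only ever sees the header: check state, then cast. */
	static void demux(struct tcb_header *hdr)
	{
		if (hdr->state == TCP_TIME_WAIT) {
			struct tw_tcb *tw = (struct tw_tcb *)hdr;
			printf("TIME_WAIT pseudo-TCB, rcv_nxt=%u\n", tw->rcv_nxt);
		} else {
			struct full_tcb *tp = (struct full_tcb *)hdr;
			printf("full TCB, rcv_nxt=%u\n", tp->rcv_nxt);
		}
	}

	int main(void)
	{
		struct tw_tcb   tw = { { TCP_TIME_WAIT, 80, 1025 }, 12345 };
		struct full_tcb tp = { { 1, 80, 1026 }, 100, 200 };

		demux(&tw.hdr);
		demux(&tp.hdr);
		return 0;
	}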

Later,
David S. Miller
davem@dm.cobaltmicro.com