Posted to dev@httpd.apache.org by Dean Gaudet <dg...@arctic.org> on 1999/06/20 20:01:46 UTC

zero-copy and mux

On Sun, 20 Jun 1999, Ben Laurie wrote:

> What about HTTPng? Muxing? Is using multiple buffers (temporarily,
> presumably) really that much of a problem?
> 
> It seems a shame to lose this, because the ability to do it proves the
> abstraction is good.

Let's run through it a bit... our mux protocol will look something like
this:

    struct packet {
	int connection_id;
	unsigned num_bytes;
        char data[];	/* num_bytes bytes of payload follow */
    };

oversimplified of course.

Suppose we have, say, 4 requests in progress and 4 threads generating
responses.  Those will all be writing to individual BUFFs.  Eventually one
or more of them will have to flush their BUFFs.

The flush calls down into iol_mux, which will have a mutex to keep the
4 threads from all entering at once.  Can iol_mux decide to buffer the
response at that point?  I think not -- the upper layer really wanted a
flush, or it would not have flushed (assume we got that right, because
we'll get it right for the non-mux case, and the code is the same).
So iol_mux has to send the packet.

Maybe I'm wrong, maybe iol_mux has the option of buffering the packet.
The heuristic we might use is "this is a small packet, and we know there
are other requests in progress, and/or there is data to be read on this
or other connections" (i.e. an improved "saferead"/halfduplex heuristic).
The corresponding code in apache 1.x at this point would copy the packet
into a buffer... it is small after all.  Similarly, if the packet were
large and there was already buffered stuff in the iol_mux, we could
use a writev() combining the existing mux buffer and the new packet,
much as we choose to do with large_write() in apache 1.x.  That isn't
zero-copy, but it is only partial-copy, just like apache 1.x... and I'm
pretty sure it's good enough.

That last case is the same even if there's another thread trying to send
data over the mux -- the mux may have an existing buffer (of previous
small responses) and choose to combine it with the writev().

Notice the mux layer can put packet headers on with a writev() as well,
just like we did with chunking large packets.
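
To make that concrete, here's a rough sketch of the combining trick.
Nothing here is existing Apache code -- the structure, the names
(mux_state, mux_send_large) and the 8-byte header layout are made up for
illustration, and short-write handling is omitted.  The pending buffer of
earlier small packets, the packet header, and the new payload all go out
in one writev(), so the large payload is never copied in user space:

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <pthread.h>
    #include <unistd.h>

    struct mux_state {
        int fd;                    /* the underlying (muxed) connection */
        pthread_mutex_t lock;      /* serializes the writer threads */
        char pending[8192];        /* previously buffered small packets */
        size_t pending_len;
    };

    static ssize_t mux_send_large(struct mux_state *m, int conn_id,
                                  const void *data, size_t len)
    {
        /* 8-byte packet header: connection id + length, as in the
         * oversimplified struct packet above */
        unsigned char hdr[8];
        struct iovec iov[3];
        int n = 0;
        ssize_t rv;

        hdr[0] = conn_id >> 24; hdr[1] = conn_id >> 16;
        hdr[2] = conn_id >> 8;  hdr[3] = conn_id;
        hdr[4] = len >> 24;     hdr[5] = len >> 16;
        hdr[6] = len >> 8;      hdr[7] = len;

        pthread_mutex_lock(&m->lock);
        if (m->pending_len) {           /* flush earlier small packets too */
            iov[n].iov_base = m->pending;
            iov[n].iov_len  = m->pending_len;
            n++;
        }
        iov[n].iov_base = hdr;          iov[n].iov_len = sizeof(hdr); n++;
        iov[n].iov_base = (void *)data; iov[n].iov_len = len;         n++;

        rv = writev(m->fd, iov, n);     /* one syscall, no copy of data */
        if (rv >= 0)
            m->pending_len = 0;         /* (short writes not handled here) */
        pthread_mutex_unlock(&m->lock);
        return rv;
    }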

And if encryption is sitting after the mux it is going to take all these
writev() fragments and combine them into one (or more) larger buffers
and write() those... a copy we can't avoid anyhow.

My argument is essentially this: partial-copy, like we have already,
is about as expensive as the overhead of zero-copy.

The mux layer sees small packets only when total responses are small --
the BUFF above the mux ensures that.  The "tail" of a response is always a
"small" packet, but there we have a similar saferead/halfduplex heuristic
by which we may or may not buffer it.

Dunno really.  I don't have any numbers to back up my claim.  All I have
is one implementation to compare with, a TCP-TCP proxy (think socksgw
on steroids) which was initially one-copy, and which I rewrote with a
zero-copy implementation.  The zero-copy implementation has the full
generality I posted first -- buffer_heads, buffer_lists, and buffers.
It has a lot of nice optimizations in it, but the API is a little more
general than it needs to be.  At any rate, the zero-copy version only
breaks even compared to the one-copy version.

Maybe something to remember which might help convince folks.  With present
100baseT hardware, your kernel is going to make one-copy of all your data
regardless -- because it has to assemble TCP packets to send off to the
network card.  If you've already done one-copy just before entering the
kernel there's a high chance that the entire 4k packet is still sitting in
your L1 data cache when the kernel needs it.  Optimistically it'll take
the kernel, say 200 32-bit operations to copy that 4k data into network
packets... that's 200 cycles, or 0.5us on a 400MHz processor.  Worst case
scenario is that all the data is in the L2, and the L2 is say 10 cycles
away.  Then your cost is 5.5us... which is above the one-copy cost you had
to pay anyhow.

OK ok, so there is gigabit ethernet and ATM hardware which can do TCP
packet assembly.  And suppose we care about it in the apache 2.0 timeframe
(as opposed to a 2.1 or later timeframe).  In Solaris 7 they implemented
true zero-copy, but it only worked on page-aligned data going from disk
to the ATM card; the rest was one-copy (for assembly).  We support
this -- this is what large_write() with its writev() usage is intended
to support (actually it doesn't work with Solaris 7 on the first 32k of
the file, but the Sun engineer told me he was thinking about how to
support the writev we use).  I suspect that other folks doing true
zero-copy are going to have similar restrictions -- disk -> net optimized,
memory -> net unoptimized... and we're back to that 5.5us cost. 

Let's just say I remain unconvinced.  I think our profile will have bigger
fish to fry than this. 

Dean



Re: zero-copy and mux

Posted by Dean Gaudet <dg...@arctic.org>.
On Sun, 20 Jun 1999, Dean Gaudet wrote:

> and we're back to that 5.5us cost. 

I just wanted to stress that this is like an absolute worst case cost
too... and it really only shows up when you have a lot of *small*
responses in a pipelined or mux situation.  The large_write() heuristic
essentially guarantees us a write pattern like this:

    copy of headers into our buffer
    writev(fd, [buffer, first_page_of_file])
    write(fd, second_page_of_file)
    ...
    write(fd, last_full_page_of_file)
    copy of last page of file into our buffer
    write(fd, buffer)

We're probably not arguing about getting rid of the copy of header
strings into the buffer -- the simple fact that portable writev() is
limited to 16 vectors makes this point moot.  And it's only on the order
of 300 bytes typically; negligible cost.

For the full pages of the file we're giving the kernel all it needs to
do zero-copy -- we've handed it a page-aligned (using mmap) chunk of
the file.  We can't really help it any more than that.
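
As a minimal sketch of that pattern -- assuming a plain socket fd and a
small response; error recovery, short writes and the real BUFF machinery
are all omitted, so this is illustration, not the actual Apache code:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static int send_file_response(int sock, int filefd,
                                  const char *hdrs, size_t hdrlen)
    {
        struct stat st;
        struct iovec iov[2];
        size_t pagesz, filelen, off;
        char *base;

        if (fstat(filefd, &st) < 0)
            return -1;
        filelen = (size_t)st.st_size;
        pagesz  = (size_t)sysconf(_SC_PAGESIZE);

        base = mmap(NULL, filelen, PROT_READ, MAP_SHARED, filefd, 0);
        if (base == MAP_FAILED)
            return -1;

        /* headers plus the first (page-aligned) page in one writev() */
        iov[0].iov_base = (void *)hdrs;
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = base;
        iov[1].iov_len  = filelen < pagesz ? filelen : pagesz;
        writev(sock, iov, 2);

        /* remaining full pages straight out of the mmap()ed file */
        for (off = pagesz; off + pagesz <= filelen; off += pagesz)
            write(sock, base + off, pagesz);

        /* the partial last page -- the piece we may choose to buffer
         * instead, per the discussion below */
        if (off < filelen)
            write(sock, base + off, filelen - off);

        munmap(base, filelen);
        return 0;
    }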

The last page of the file is an interesting chance for more optimization.
We don't have to copy if we know we're about to flush anyhow -- the whole
saferead/halfduplex trick.  That's another optimization that isn't too
hard to perform, and doesn't require full zero-copy semantics.

We frequently have to copy the partial last page if we're pipelining or
if we're doing mux.  But there we can still reduce the impact.  Right now
we use 4k buffer sizes; we really should have something closer to the
1460-byte TCP packet payload as our large_write heuristic.  A bunch of chances
for someone with a good benchmark setup to tune and tweak.

On Linux, the partial last page has another possible solution -- TCP_CORK.
When you setsockopt(TCP_CORK) it tells the kernel it can send any
full-payload packets it can assemble, but it has to hang onto the last
non-full packet until you remove the cork.  That is to say, there's an
explicit flush operation... this is way better than the Nagle/no-Nagle
choice the standard socket API provides.  So what we can do on Linux is
set the cork, and write the last partial page regardless.
Then we pull the cork later when we've figured out we're really done
with all the mux pieces.  This lets us skip the extra copying.

The cork was put there to deal with the sendfile() initial page problem.
You'll notice that most other sendfile implementations include an iovec,
intended for the headers, so that the kernel can copy the headers into the
first packet and avoid an extra packet on the net.  The Linux folks were
loath to make combination syscalls like that; they put the cork in at
my suggestion because it lets us use a write() followed by a sendfile()
without causing a short packet to go out.
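
A hedged, Linux-only sketch of how that can look (helper names are mine,
error handling is omitted, and sendfile() here is the Linux flavor):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void cork(int sock, int on)
    {
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    }

    static void send_corked(int sock, int filefd, off_t filelen,
                            const char *hdrs, size_t hdrlen)
    {
        off_t off = 0;

        cork(sock, 1);                  /* hold back partial packets   */
        write(sock, hdrs, hdrlen);      /* short write stays corked    */
        sendfile(sock, filefd, &off, (size_t)filelen);
        /* ... other mux pieces / the partial last page could go here ... */
        cork(sock, 0);                  /* the explicit flush: the last
                                           non-full packet goes out now */
    }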

Dean


Re: zero-copy and mux

Posted by Zach Brown <za...@zabbo.net>.
[greetings, guys, just joined the list.. ]

On Sun, 20 Jun 1999, Dean Gaudet wrote:

> Maybe something to remember which might help convince folks.  With present
> 100baseT hardware, your kernel is going to make one-copy of all your data
> regardless -- because it has to assemble TCP packets to send off to the
> network card.  If you've already done one-copy just before entering the
> kernel there's a high chance that the entire 4k packet is still sitting in
> your L1 data cache when the kernel needs it.  Optimistically it'll take
> the kernel, say 200 32-bit operations to copy that 4k data into network
> packets... that's 200 cycles, or 0.5us on a 400MHz processor.  Worst case
> scenario is that all the data is in the L2, and the L2 is say 10 cycles
> away.  Then your cost is 5.5us... which is above the one-copy cost you had
> to pay anyhow.

the large-ish zero copy tx case from the page cache is certainly something
to keep in mind for 2.0.  the 3com 905b, adaptec 'starfire' 9615 and
sun's hme are all pci and can all do byte-grained dma from memory into
their fifo and tack in ip checksums.  the 3com especially is affordable
and in widespread use.

In linux the current near term plan is to use this mechanism for largeish
writes that come from the page cache (read: sendfile() and sunrpc for
kernel nfs work).  This will let us use an internal data structure
(kiobuf) to pass the references around and such.  the heuristic for 'big'
will probably be the cost of messing around with the kiobufs + the latency
incurred in having the full packet in the fifo before tx VS the cost of
building/copying a 'flat' network buffer before sending it out.  I imagine
128/256ish will be the cutoff, but I just pulled that out of thin air :)  
This stuff should be done in the next few to 6 months, I hope.

I guess all this really means is that we should have hooks for using
sendfile() whenever we're sending unmodified data from the fs.  This lets
us avoid the mmap()/munmap() gunk but also would have to be stepped around
for layers of the mux that want to modify data, etc, etc..
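
something along these lines, say -- everything here is hypothetical (the
names, the flag, the buffer size), and the sendfile() is the Linux flavor:

    #include <sys/sendfile.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* needs_byte_access would be set by the layered I/O code whenever some
     * layer (mux framing, chunking, encryption) must see or rewrite the
     * bytes; otherwise let the kernel move them straight from the page
     * cache, with none of the mmap()/munmap() gunk in userland. */
    static ssize_t send_body(int sock, int filefd, off_t offset, size_t len,
                             int needs_byte_access)
    {
        char buf[8192];
        size_t done = 0;

        if (!needs_byte_access)
            return sendfile(sock, filefd, &offset, len);

        /* slow path: pull the bytes into userland so a layer can get at
         * them (a transforming layer would rewrite buf here) */
        while (done < len) {
            size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
            ssize_t n = pread(filefd, buf, want, offset + (off_t)done);
            if (n <= 0)
                return -1;
            if (write(sock, buf, (size_t)n) != n)
                return -1;
            done += (size_t)n;
        }
        return (ssize_t)done;
    }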

> OK ok, so there is gigabit ethernet and ATM hardware which can do TCP
> packet assembly.  And suppose we care about it in the apache 2.0 timeframe

don't forget hippi! :)

> support the writev we use).  I suspect that other folks doing true
> zero-copy are going to have similar restrictions -- disk -> net optimized,
> memory -> net unoptimized... and we're back to that 5.5us cost. 

*nod* don't expect linux to have user address space -> socket zero copy
any time soon.  the mm/api implications are yucky.

> The cork was put there to deal with the sendfile() initial page problem.
> You'll notice that most other sendfile implementations include an iovec,
> intended for the headers, so that the kernel can copy the headers into
> the first packet and avoid an extra packet on the net.  The linux folks
> were loathe to make combination syscalls like that, they put the cork in
> at my suggestion because it lets us use a write() followed by a
> sendfile() without causing a short packet to go out.

* lots of nodding *

there has, however, been some noise as of late to really have some sort of
sendfile + head/tail iovecs call.  I dunno how far that will go.  The cork
thing works well; we use it in hftpd to make stupid SITE EXEC programs
spit out nice packets after we hand them the socket on stdout :)

-- zach

- - - - - -
007 373 5963