Posted to cvs@httpd.apache.org by fi...@locus.apache.org on 2000/07/13 07:18:44 UTC

cvs commit: apache-2.0/src/lib/apr/buckets doc_page_io.txt

fielding    00/07/12 22:18:44

  Added:       src/lib/apr/buckets doc_page_io.txt
  Log:
  Some notes from Dean on zero-copy IO and its ilk
  
  Submitted by:	Dean Gaudet
  
  Revision  Changes    Path
  1.1                  apache-2.0/src/lib/apr/buckets/doc_page_io.txt
  
  Index: doc_page_io.txt
  ===================================================================
  
  From dgaudet@arctic.org Fri Feb 20 00:36:52 1998
  Date: Fri, 20 Feb 1998 00:35:37 -0800 (PST)
  From: Dean Gaudet <dg...@arctic.org>
  To: new-httpd@apache.org
  Subject: page-based i/o
  X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information regarding copyright and disclaimer.
  Reply-To: new-httpd@apache.org
  
  Ed asked me for more details on what I mean when I talk about "page-based
  zero-copy i/o".
  
  While writing mod_mmap_static I was thinking about the primitives that the
  core requires of the filesystem.  What exactly is it that ties us into the
  filesystem, and how would we abstract it?  The metadata (last modified
  time, file length) is actually pretty easy to abstract.  It's also easy to
  define an "index" function so that MultiViews and such can be implemented. 
  And with layered I/O we can hide the actual details of how you access
  these "virtual" files. 
  
  But therein lies an inefficiency.  If we had only bread() for reading
  virtual files, then we would enforce at least one copy of the data. 
  bread() supplies the place that the caller wants to see the data, and so
  the bread() code has to copy it.  But there's very little reason that
  bread() callers have to supply the buffer... bread() itself could supply
  the buffer.  Call this new interface page_read().  It looks something like
  this:
  
      typedef struct {
          const void *data;
          size_t data_len;   /* amount of data on the page which is valid */
          ... other stuff necessary for managing the page pool ...
      } a_page_head;

      /* returns NULL if an error or EOF occurs; on EOF errno will be
       * set to 0
       */
      a_page_head *page_read(BUFF *fb);

      /* queues the entire page for writing; returns 0 on success, -1 on
       * error
       */
      int page_write(BUFF *fb, a_page_head *);
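
  As a usage sketch (using only the interface above; page ownership and
  error handling are glossed over, and send_virtual_file() is just an
  illustrative name), a handler that forwards an entire virtual file
  could be little more than:

      #include <errno.h>

      /* Forward every page of a "virtual file" to the client without an
       * intermediate copy; page_read() hands back pages that may point
       * straight into mmap()'d memory, and page_write() queues them as-is.
       */
      static int send_virtual_file(BUFF *in, BUFF *out)
      {
          a_page_head *p;

          while ((p = page_read(in)) != NULL) {
              if (page_write(out, p) == -1)
                  return -1;
          }
          return (errno == 0) ? 0 : -1;   /* errno == 0 means clean EOF */
      }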
  
  It's very important that a_page_head structures point to the data page
  rather than be part of the data page.  This way we can build a_page_head
  structures which refer to parts of mmap()d memory.
  
  This stuff is a little more tricky to do, but is a big win for performance.
  With this integrated into our layered I/O it means that we can have
  zero-copy performance while still getting the advantages of layering.
  
  But note I'm glossing over a bunch of details... like the fact that we
  have to decide if a_page_heads are shared data, and hence need reference
  counting (i.e. I said "queues for writing" up there, which means some
  bit of the a_page_head data has to be kept until it's actually written).
  Similarly for the page data.
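
  A minimal sketch of what that reference counting might look like (the
  refcount field and the free_page() pool routine are hypothetical, not
  part of the interface above):

      /* Assume the "other stuff" in a_page_head includes an int refcount.
       * page_write() would take a reference when it queues the page and
       * drop it once the bytes have actually gone out; the reading layer
       * drops its own reference when it is finished with the page.
       */
      static void page_ref(a_page_head *p)
      {
          p->refcount++;              /* single-threaded sketch; no locking */
      }

      static void page_unref(a_page_head *p)
      {
          if (--p->refcount == 0)
              free_page(p);           /* hypothetical: return page to the pool */
      }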
  
  There are other tricks in this area that we can take advantage of --
  like interprocess communication on architectures that do page flipping.
  On these boxes if you write() something that's page-aligned and page-sized
  to a pipe or unix socket, and the other end read()s into a page-aligned
  page-sized buffer then the kernel can get away without copying any data.
  It just marks the two pages as shared copy-on-write, and only when
  they're written to will the copy be made.  So to make this work, your
  writer uses a ring of 2+ page-aligned/sized buffers so that it's not
  writing on something the reader is still reading.
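
  A rough sketch of such a ring (posix_memalign() is used here for the
  alignment, the ring size of 4 is arbitrary, and the names are made up):

      #include <stdlib.h>
      #include <unistd.h>

      #define RING_SLOTS 4    /* 2+ slots; 4 picked arbitrarily */

      /* A ring of page-aligned, page-sized buffers for writing to a pipe
       * or unix socket, so the writer never scribbles on a page the
       * reader may still be sharing copy-on-write.
       */
      struct page_ring {
          void *slot[RING_SLOTS];
          int next;
          size_t pagesize;
      };

      static int page_ring_init(struct page_ring *r)
      {
          int i;

          r->pagesize = (size_t)sysconf(_SC_PAGESIZE);
          r->next = 0;
          for (i = 0; i < RING_SLOTS; i++) {
              if (posix_memalign(&r->slot[i], r->pagesize, r->pagesize) != 0)
                  return -1;
          }
          return 0;
      }

      /* Hand back the next buffer to fill and write() to the pipe. */
      static void *page_ring_next(struct page_ring *r)
      {
          void *p = r->slot[r->next];
          r->next = (r->next + 1) % RING_SLOTS;
          return p;
      }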
  
  Dean
  
  ----
  
  For details on HPUX and avoiding extra data copies, see
  <ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf>.
  
  (Note that if you get the PostScript version instead, you have to
  manually edit it to remove the front page before any version of
  ghostscript that I have used will read it.)
  
  ----
  
  I've been told by an engineer in Sun's TCP/IP group that zero-copy TCP
  in Solaris 2.6 occurs when:
  
      - you've got the right interface card (OC-12 ATM card I think)
      - you use write()
      - your write buffer is 16k aligned and a multiple of 16k in size
  
  We currently get the 16k alignment for free by using mmap().  But Sun's
  current code isn't smart enough to deal with our initial writev()
  of the headers and the first part of the response.
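
  In other words, something along these lines should qualify for the file
  data itself (a sketch only, not how the core actually sends a response;
  send_mmapped() is a made-up name):

      #include <sys/types.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* write() an mmap()'d file to the socket.  mmap() returns
       * page-aligned memory, which is where the alignment comes for
       * free; only whole 16k chunks would be zero-copy candidates.
       */
      static int send_mmapped(int sock, int fd, size_t file_len)
      {
          void *base = mmap(NULL, file_len, PROT_READ, MAP_PRIVATE, fd, 0);
          ssize_t n;

          if (base == MAP_FAILED)
              return -1;
          n = write(sock, base, file_len);  /* headers go out separately via writev() */
          munmap(base, file_len);
          return (n == (ssize_t)file_len) ? 0 : -1;
      }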
  
  ----
  
  Some systems have a system call to efficiently send the contents of a
  descriptor across the network.  This is probably the single best way
  to serve static content on systems that support it.
  
  HPUX: (10.30 and on)
  
        ssize_t sendfile(int s, int fd, off_t offset, size_t nbytes,
                const struct iovec *hdtrl, int flags);
  
        (allows you to add headers and trailers in the form of iovec
        structs)  Marc has a man page; ask if you want a copy.  It is not
        included here due to copyright issues.  The man page is also
        available from http://docs.hp.com/ (in particular,
        http://docs.hp.com:80/dynaweb/hpux11/hpuxen1a/rvl3en1a/@Generic__BookTextView/59894;td=3 )
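
        A hedged usage sketch against the signature above (the header
        handling is illustrative, send_static() is a made-up name, and
        whether nbytes == 0 means "to end of file" should be checked
        against the man page):

            #include <sys/socket.h>
            #include <sys/uio.h>

            /* Send an HTTP header plus a file in one call, with the
             * header carried in hdtrl[0] and no trailer in hdtrl[1].
             */
            static ssize_t send_static(int s, int fd, char *hdr, size_t hdr_len)
            {
                struct iovec hdtrl[2];

                hdtrl[0].iov_base = hdr;        /* headers */
                hdtrl[0].iov_len  = hdr_len;
                hdtrl[1].iov_base = NULL;       /* no trailer */
                hdtrl[1].iov_len  = 0;

                return sendfile(s, fd, (off_t)0, (size_t)0, hdtrl, 0);
            }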
  
  Windows NT:
  
  	BOOL TransmitFile(     SOCKET hSocket, 
  	    HANDLE hFile, 
  	    DWORD nNumberOfBytesToWrite, 
  	    DWORD nNumberOfBytesPerSend, 
  	    LPOVERLAPPED lpOverlapped, 
  	    LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers, 
  	    DWORD dwFlags 
  	   ); 
  
  	(does it start from the current position in the handle?  I would
  	hope so, or else it is pretty dumb.)
  
  	lpTransmitBuffers allows for headers and trailers.
  
  	Documentation at:
  
  	http://premium.microsoft.com/msdn/library/sdkdoc/wsapiref_3pwy.htm
  	http://premium.microsoft.com/msdn/library/conf/html/sa8ff.htm
  
  	Even less related to page based IO: just context switching:
  	AcceptEx does an accept(), and returns the start of the
  	input data.  see:
  
  	http://premium.microsoft.com/msdn/library/sdkdoc/pdnds/sock2/wsapiref_17jm.htm
  
  	What this means is that you need one less syscall to do a
  	typical request, especially if you have a cache of handles
  	so you don't have to do an open or close.  Hmm.  Interesting
  	question: if TransmitFile starts from the current position,
  	then you need a mutex around the seek and the TransmitFile.
  	If not, you are just limited in what you can use it for
  	(e.g. byte ranges).
  
  	Also note that TransmitFile can specify TF_REUSE_SOCKET, so that
  	after use the same socket handle can be passed to AcceptEx.  
  	Obviously only good where we don't have a persistent connection
  	to worry about.
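
  	For reference, a sketch of how a static response might go out
  	with the headers attached (error handling trimmed; send_static()
  	is a made-up name):

  	    #include <winsock2.h>
  	    #include <mswsock.h>

  	    /* TransmitFile with the response headers carried in the
  	     * TRANSMIT_FILE_BUFFERS structure.  Passing 0 for
  	     * nNumberOfBytesToWrite asks for the whole file.
  	     */
  	    static BOOL send_static(SOCKET s, HANDLE hFile,
  	                            void *hdr, DWORD hdr_len)
  	    {
  	        TRANSMIT_FILE_BUFFERS tfb;

  	        tfb.Head = hdr;             /* response headers */
  	        tfb.HeadLength = hdr_len;
  	        tfb.Tail = NULL;            /* no trailer */
  	        tfb.TailLength = 0;

  	        /* TF_REUSE_SOCKET would let the socket go back to
  	         * AcceptEx afterwards, but only when there is no
  	         * persistent connection to worry about.
  	         */
  	        return TransmitFile(s, hFile, 0, 0, NULL, &tfb, 0);
  	    }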
  
  ----
  
  Note that all this is shot to bloody hell by HTTP-NG's multiplexing.
  If fragment sizes are big enough, it could still be worthwhile to
  do copy avoidance.  It also causes performance issues because of
  its credit system that limits how much you can write in a single
  chunk.
  
  Don't tell me that if HTTP-NG becomes popular we will see vendors 
  embedding SMUX (or whatever multiplexing is used) in the kernel to
  get around this stuff.  There we go, Apache with a loadable kernel
  module.  
  
  ----
  
  Larry McVoy's document for SGI regarding sendfile/TransmitFile:
  ftp://ftp.bitmover.com/pub/splice.ps.gz