Posted to dev@httpd.apache.org by Joe Orton <jo...@redhat.com> on 2009/02/16 11:07:47 UTC

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

On Sat, Feb 14, 2009 at 10:25:08AM +1100, Graham Dumpleton wrote:
...
> The end result of the code is that if you have a file bucket
> getting this far where the length of the file is less than 8000 and an
> EOS follows it, then the actual file bucket is held over rather than
> the data being read and buffered. As commented, this is to avoid doing
> an mmap+memcpy. What it means, though, is that the file descriptor
> within the file bucket must be kept open and cannot be closed as soon
> as ap_pass_brigade() has been called.

The call to:

            ap_save_brigade(f, &ctx->b, &b, ctx->deferred_write_pool);

in that code path should result in the FILE bucket and the contained fd 
being dup()ed.  (Though if that is failing, you wouldn't know because of 
the lack of error checking)

You say:

> For me this is an issue as the file descriptor has been supplied from
> a special object returned by a higher level application and it would
> be hard to maintain the file as open beyond the life of the request,
> up till end of keep alive or a subsequent request over same
> connection. Doing a dup on the file descriptor is also not necessarily
> an option.

can you explain why a dup shouldn't work?

Regards, Joe

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Graham Dumpleton <gr...@gmail.com>.
2009/2/17 Mladen Turk <mt...@apache.org>:
> Graham Dumpleton wrote:
>>
>> 2009/2/17 Mladen Turk <mt...@apache.org>:
>>>
>>> Graham Dumpleton wrote:
>>>>
>>>> 2009/2/17 Joe Orton <jo...@redhat.com>:
>>>>>>
>>>>>> I did originally perform a dup, but was told that this would cause
>>>>>> problems with file locking. Specifically, I was told:
>>>>>
>>>>> I'm getting lost here.  What has file locking got to do with it?  Does
>>>>> mod_wsgi rely on file locking somehow?
>>>
>>> I'm lost as well :)
>>
>> Consider:
>>
>>  fd1 = ....
>>
>>  lock(fd1)
>>
>>  fd2 = dup(fd1)
>>
>>  close(fd2) # under some lock APIs this releases the lock, even though
>> it is not the last reference to the underlying file object
>>
>>  write(fd1) # lock already released, so not guaranteed to be the only
>> writer
>>
>>  close(fd1)
>>
>> At least that is how I understand it from what was explained to me
>> and pointed out in various documentation.
>>
>> So, if fd2 is the file descriptor created for the file bucket in
>> Apache, and it gets closed before the application later wants to
>> write to the file through fd1, then the application has lost the
>> exclusive ownership acquired by way of the lock, and something else
>> could have acquired the lock and started modifying the file on the
>> basis that it had exclusive ownership at that time.
>>
>
> Well, like I said, that won't work, nor is it portable
> (e.g. apr_os_file_t is a HANDLE on win32)

I already said I only support the optimisation on UNIX. I don't care
about Windows.

> What you will need is code that will take the Python
> object and invoke the Python file API to feed the apr_bucket
> (basically writing an apr_bucket_python_file).

As I already tried to explain, even for the case of the bucket being
used to hold a reference to the Python object, that will not work
because of the guarantees that WSGI applications require regarding
data needing to be flushed.

> However the simplest thing might be an intermediate temp file, in
> which case httpd could reference the file name, not the file
> object itself.

Which would likely be slower than using the existing fallback
streaming mechanism, which reads the file into memory in blocks and
pushes them through as transient buckets.
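
As a rough sketch of that fallback (illustrative only, not mod_wsgi's
actual code; it assumes a request_rec *r, an open apr_file_t *file, and
the usual httpd/APR headers): each block is sent as a transient bucket
followed by a FLUSH, since a transient bucket references the buffer
without copying it and so must be written out before the buffer is
reused.

  static apr_status_t stream_file_blocks(request_rec *r, apr_file_t *file)
  {
      apr_bucket_brigade *bb;
      char buf[8192];
      apr_status_t rv;

      bb = apr_brigade_create(r->pool, r->connection->bucket_alloc);

      for (;;) {
          apr_size_t n = 0;

          rv = apr_file_read_full(file, buf, sizeof(buf), &n);
          if (n == 0) {
              break;
          }
          /* Transient bucket: points at 'buf' without copying it. */
          APR_BRIGADE_INSERT_TAIL(bb,
              apr_bucket_transient_create(buf, n, bb->bucket_alloc));
          /* Flush so the data is on the wire before 'buf' is reused. */
          APR_BRIGADE_INSERT_TAIL(bb,
              apr_bucket_flush_create(bb->bucket_alloc));
          if (ap_pass_brigade(r->output_filters, bb) != APR_SUCCESS) {
              return APR_EGENERAL;
          }
          apr_brigade_cleanup(bb);
          if (rv != APR_SUCCESS) {  /* APR_EOF after a short final read */
              break;
          }
      }
      return APR_SUCCESS;
  }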

> Not sure how that would work with a dynamic file,
> since APR and Python might use different platform locking
> mechanisms.

Python uses the operating system's locking mechanisms, just as the APR library would.

Graham

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Mladen Turk <mt...@apache.org>.
Graham Dumpleton wrote:
> 2009/2/17 Mladen Turk <mt...@apache.org>:
>> Graham Dumpleton wrote:
>>> 2009/2/17 Joe Orton <jo...@redhat.com>:
>>>>> I did originally perform a dup, but was told that this would cause
>>>>> problems with file locking. Specifically, I was told:
>>>> I'm getting lost here.  What has file locking got to do with it?  Does
>>>> mod_wsgi rely on file locking somehow?
>> I'm lost as well :)
> 
> Consider:
> 
>   fd1 = ....
> 
>   lock(fd1)
> 
>   fd2 = dup(fd1)
> 
>   close(fd2) # under some lock APIs this releases the lock, even though
> it is not the last reference to the underlying file object
> 
>   write(fd1) # lock already released, so not guaranteed to be the only writer
> 
>   close(fd1)
> 
> At least that is how I understand it from what was explained to me and
> pointed out in various documentation.
> 
> So, if fd2 is the file descriptor created for the file bucket in
> Apache, and it gets closed before the application later wants to write
> to the file through fd1, then the application has lost the exclusive
> ownership acquired by way of the lock, and something else could have
> acquired the lock and started modifying the file on the basis that it
> had exclusive ownership at that time.
> 

Well, like I said, that won't work, nor is it portable
(e.g. apr_os_file_t is a HANDLE on win32)

What you will need is code that will take the Python
object and invoke the Python file API to feed the apr_bucket
(basically writing an apr_bucket_python_file).

However the simplest thing might be an intermediate temp file, in
which case httpd could reference the file name, not the file
object itself. Not sure how that would work with a dynamic file,
since APR and Python might use different platform locking
mechanisms.

Regards
-- 
^(TM)

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Graham Dumpleton <gr...@gmail.com>.
2009/2/17 Mladen Turk <mt...@apache.org>:
> Graham Dumpleton wrote:
>>
>> 2009/2/17 Joe Orton <jo...@redhat.com>:
>>>>
>>>> I did originally perform a dup, but was told that this would cause
>>>> problems with file locking. Specifically, I was told:
>
>>> I'm getting lost here.  What has file locking got to do with it?  Does
>>> mod_wsgi rely on file locking somehow?
>>
>
> I'm lost as well :)

Consider:

  fd1 = ....

  lock(fd1)

  fd2 = dup(fd1)

  close(fd2) # under some lock APIs this releases the lock, even though
it is not the last reference to the underlying file object

  write(fd1) # lock already released, so not guaranteed to be the only writer

  close(fd1)

At least that is how I understand it from what was explained to me and
pointed out in various documentation.

So, if fd2 is the file descriptor created for the file bucket in
Apache, and it gets closed before the application later wants to write
to the file through fd1, then the application has lost the exclusive
ownership acquired by way of the lock, and something else could have
acquired the lock and started modifying the file on the basis that it
had exclusive ownership at that time.
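
To make that concrete, here is a minimal self-contained C sketch of the
fcntl(2) case being described (the file name is made up; POSIX record
locks are owned by the process, so closing *any* descriptor for the
file drops the lock):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      struct flock fl = { 0 };
      int fd1, fd2;

      fd1 = open("/tmp/example.dat", O_RDWR | O_CREAT, 0600);

      fl.l_type = F_WRLCK;      /* exclusive lock over the whole file */
      fl.l_whence = SEEK_SET;
      fcntl(fd1, F_SETLK, &fl);

      fd2 = dup(fd1);           /* e.g. the fd handed to the file bucket */
      close(fd2);               /* the fcntl lock is released here, even
                                 * though fd1 is still open */

      (void)write(fd1, "data", 4);  /* no longer protected by the lock */
      close(fd1);
      return 0;
  }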

>> In WSGI applications, it is possible for the higher level Python web
>> application to pass back a file object reference for the response, with
>> the intent that the WSGI adapter use any optimised methods available
>> for sending it back as the response. This is where file buckets come
>> into the picture to begin with.
>
> Now it looks like you are trying to intermix third-party
> maintained native OS file descriptors and file buckets.
> You can create the apr_file_t from apr_os_file_t

Which is what it does. Simplified code below:

  apr_os_file_t fd = -1;
  apr_file_t *tmpfile = NULL;

  /* Borrow the underlying fd from the Python file-like object;
   * PyObject_AsFileDescriptor() returns -1 on failure. */
  fd = PyObject_AsFileDescriptor(filelike);

  /* Wrap the native fd in an apr_file_t allocated from the request
   * pool, flagged as usable with sendfile(). */
  apr_os_file_put(&tmpfile, &fd, APR_SENDFILE_ENABLED, self->r->pool);

> (Think you'll have platform portability issues there)

The optimisation is only supported on UNIX systems.

> but the major problem would be to ensure the life cycle
> of the object, since Python has its own GC and httpd has
> its pool.
> IMHO you will need a new apr_bucket provider written in
> Python and C for something like that.

CPython uses reference counting. What is referred to as GC in Python
is actually just a mechanism that kicks in under certain circumstances
to break cycles between reference-counted objects.

Having a special bucket type which holds a reference to the Python
file object will not help anyway. This is because the close() method
of the Python file object can be called prior to the file bucket being
destroyed. This closing of the Python file object would occur before
the delayed write of the file bucket resulting from the EOS
optimisation. So, it is the same problem as when using a naked file
descriptor.

Also, using a special bucket type opens another can of worms. This is
because multiple interpreters are supported, as well as multithreading.
Thus it would be necessary to track the named interpreter in use
within the bucket, reacquire the lock on the interpreter being used,
and ensure the thread state is correctly reinstated. Although possible
to do, it gets a bit messy.

Holding onto the file descriptor to allow the optimisation isn't
really desirable for other reasons as well. This is because the WSGI
specification effectively requires the response content to have been
flushed out to the client before the final call back into the
application to clean things up. In that final call back, where the
application performs cleanup and closes things like files, it could
technically rewrite the content of the file. If Apache had not
finished writing out the contents of the file, presuming the Python
file object hadn't been closed, then Apache would end up writing
different content from what was expected, and possibly truncated
content if the file were resized.

In summary, you need a way of knowing that when you flush something it
really has been flushed and that Apache is done with it.
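
One way to get that guarantee at the adapter level (a sketch, not
mod_wsgi's actual code; r, tmpfile and length as in the earlier
snippet, bb being the output brigade and rv an apr_status_t) is to
follow the file bucket with an explicit FLUSH before the EOS, so the
core output filter writes the file out during ap_pass_brigade()
instead of holding the bucket over:

  apr_bucket *b;

  b = apr_bucket_file_create(tmpfile, 0, (apr_size_t)length, r->pool,
                             r->connection->bucket_alloc);
  APR_BRIGADE_INSERT_TAIL(bb, b);

  /* A FLUSH ahead of the EOS defeats the EOS optimisation: the core
   * output filter should write everything out rather than setting the
   * file bucket aside, after which the fd can be closed safely. */
  APR_BRIGADE_INSERT_TAIL(bb,
      apr_bucket_flush_create(r->connection->bucket_alloc));
  APR_BRIGADE_INSERT_TAIL(bb,
      apr_bucket_eos_create(r->connection->bucket_alloc));

  rv = ap_pass_brigade(r->output_filters, bb);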

Graham

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Mladen Turk <mt...@apache.org>.
Graham Dumpleton wrote:
> 2009/2/17 Joe Orton <jo...@redhat.com>:
>>> I did originally perform a dup, but was told that this would cause
>>> problems with file locking. Specifically, I was told:

>> I'm getting lost here.  What has file locking got to do with it?  Does
>> mod_wsgi rely on file locking somehow?
> 

I'm lost as well :)

> 
> In WSGI applications, it is possible for the higher level Python web
> application to pass back a file object reference for the response, with
> the intent that the WSGI adapter use any optimised methods available
> for sending it back as the response. This is where file buckets come
> into the picture to begin with.
> 

Now it looks like you are trying to intermix third-party
maintained native OS file descriptors and file buckets.
You can create the apr_file_t from apr_os_file_t
(Think you'll have platform portability issues there)
but the major problem would be to ensure the life cycle
of the object, since Python has its own GC and httpd has
its pool.
IMHO you will need a new apr_bucket provider written in
Python and C for something like that.


Regards
-- 
^(TM)

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Graham Dumpleton <gr...@gmail.com>.
2009/2/17 Joe Orton <jo...@redhat.com>:
> On Mon, Feb 16, 2009 at 10:52:15PM +1100, Graham Dumpleton wrote:
>> 2009/2/16 Joe Orton <jo...@redhat.com>:
>> > You say:
>> >
>> >> For me this is an issue as the file descriptor has been supplied from
>> >> a special object returned by a higher level application and it would
>> >> be hard to maintain the file as open beyond the life of the request,
>> >> up till end of keep alive or a subsequent request over same
>> >> connection. Doing a dup on the file descriptor is also not necessarily
>> >> an option.
>> >
>> > can you explain why a dup shouldn't work?
>>
>> I did originally perform a dup, but was told that this would cause
>> problems with file locking. Specifically, I was told:
>
> I'm getting lost here.  What has file locking got to do with it?  Does
> mod_wsgi rely on file locking somehow?

The mod_wsgi package is a gateway for hosting Python web applications.
It is a modern take on mod_python, but implements the generic/portable
Python WSGI interface rather than being Apache specific. It is what all
the Python people ditching mod_python are moving to.

In WSGI applications, it is possible for the higher level Python web
application to pass back a file object reference for the response, with
the intent that the WSGI adapter use any optimised methods available
for sending it back as the response. This is where file buckets come
into the picture to begin with.

Now, this file object was created by the Python application, and the
application is still the owner of it. If it is a file the application
had first been modifying, and it needed exclusivity for that, it could
well have used file locks on it. Because locks are involved, the order
in which the file's contents are used for the response and the file is
closed and unlocked is important.

It appears that fcntl locking on some platforms behaves such that, if
a file descriptor is dup'd, closing the first reference to the file
releases the lock. That is, the lock is not held until the last
reference to the file is closed. Problems can therefore arise if you
have to dup the file descriptor: if the dup'd file descriptor gets
closed before the Python application has finished with the file
object, possibly involving modifying the file after the contents have
been sent, something else could lock the file and start modifying it
before the application was done.

In simple terms, the mod_wsgi module doesn't internally open the file
in the first place. Instead, it comes from a higher level application,
and one can't do things at the lower level that would change the state
of something like the locks associated with the file.

Graham

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Joe Orton <jo...@redhat.com>.
On Mon, Feb 16, 2009 at 10:52:15PM +1100, Graham Dumpleton wrote:
> 2009/2/16 Joe Orton <jo...@redhat.com>:
> > You say:
> >
> >> For me this is an issue as the file descriptor has been supplied from
> >> a special object returned by a higher level application and it would
> >> be hard to maintain the file as open beyond the life of the request,
> >> up till end of keep alive or a subsequent request over same
> >> connection. Doing a dup on the file descriptor is also not necessarily
> >> an option.
> >
> > can you explain why a dup shouldn't work?
> 
> I did originally perform a dup, but was told that this would cause
> problems with file locking. Specifically, I was told:

I'm getting lost here.  What has file locking got to do with it?  Does 
mod_wsgi rely on file locking somehow?

joe

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Graham Dumpleton <gr...@gmail.com>.
2009/2/16 Joe Orton <jo...@redhat.com>:
> On Sat, Feb 14, 2009 at 10:25:08AM +1100, Graham Dumpleton wrote:
> ...
>> The end result of the code is that if you have a file bucket
>> getting this far where the length of the file is less than 8000 and
>> an EOS follows it, then the actual file bucket is held over rather
>> than the data being read and buffered. As commented, this is to avoid
>> doing an mmap+memcpy. What it means, though, is that the file
>> descriptor within the file bucket must be kept open and cannot be
>> closed as soon as ap_pass_brigade() has been called.
>
> The call to:
>
>            ap_save_brigade(f, &ctx->b, &b, ctx->deferred_write_pool);
>
> in that code path should result in the FILE bucket and the contained fd
> being dup()ed.  (Though if that is failing, you wouldn't know because of
> the lack of error checking)
>
> You say:
>
>> For me this is an issue as the file descriptor has been supplied from
>> a special object returned by a higher level application and it would
>> be hard to maintain the file as open beyond the life of the request,
>> up till end of keep alive or a subsequent request over same
>> connection. Doing a dup on the file descriptor is also not necessarily
>> an option.
>
> can you explain why a dup shouldn't work?

I did originally perform a dup, but was told that this would cause
problems with file locking. Specifically, I was told:

"""I am not sure, but it looks like mod_wsgi duplicates the original file
descriptor, sends the file's data, closes the duplicated file
descriptor, and then calls the iterable's close method. That may be
problematic for applications that are using fcntl-based locking; the
closing of the duplicate file descriptor will release the lock before
the iterable's close method gets a chance to execute, so the close
method can no longer depend on the file remaining locked. Ideally, the
duplicate file descriptor shouldn't be closed until *after* the
iterable's close() method has been called."""

I never researched it properly as I didn't have time at that point, so
I simply believed what I was being told. A quick check of the flock
manual page says:

"""Locks created by flock() are associated with an open file table
entry. This means that duplicate file descriptors (created by, for
example, fork(2) or dup(2)) refer to the same lock, and this lock may
be modified or released using any of these descriptors. Furthermore,
the lock is released either by an explicit LOCK_UN operation on any of
these duplicate descriptors, or when all such descriptors have been
closed."""

So for flock() there is no problem. As to the fcntl locking which the
person mentioned, I haven't yet been able to find any description of
the behaviour of locks in the context of dup'd file descriptors.

If you know of a source of information which explains what happens
with fcntl locking and dup'd file descriptors, that would help clear
things up. In the meantime I'll keep looking for any information about
it.

Graham

Re: FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by "Plüm, Rüdiger, VF-Group" <ru...@vodafone.com>.
 

> -----Original Message-----
> From: Joe Orton 
> Sent: Friday, 20 February 2009 10:15
> To: dev@httpd.apache.org
> Subject: Re: FLUSH, filtering, setaside, etc (was Re: 
> Problems with EOS optimisation in ap_core_output_filter() and 
> file buckets.)
> 
> On Thu, Feb 19, 2009 at 10:00:50PM +0100, Ruediger Pluem wrote:
> > On 02/19/2009 12:32 PM, Joe Orton wrote:
> ...
> > > @@ -497,13 +500,17 @@
> > >                  next = APR_BUCKET_NEXT(bucket);
> > >              }
> > >              bytes_in_brigade += bucket->length;
> > > -            if (!APR_BUCKET_IS_FILE(bucket)) {
> > > +            if (APR_BUCKET_IS_FILE(bucket)) {
> > > +                num_files_in_brigade++;
> > > +            } 
> > > +            else {                
> > >                  non_file_bytes_in_brigade += bucket->length;
> > >              }
> > >          }
> > >      }
> > >  
> > > -    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER) {
> > > +    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER
> > > +        || num_files_in_brigade >= THRESHOLD_MAX_FILES) {
> > 
> > If the 16 FD's were split over more than one brigade and the
> > brigades before us were set aside, the FD's already belong to the
> > wrong pool (the connection pool). Deleting a file bucket doesn't
> > close the FD it uses.
> 
> Not sure what the concern is there - this loop is iterating over the
> concatenation of the buffered brigade and the "new" brigade (right?),
> so it will count the total number of buckets which are potentially
> left buffered after this c_o_f invocation terminates.

The problem is that the apr_file objects are originally created in the
request pool. If you now send two brigades down the chain, let's say
the first one with 15 files and the second one with 1 file, and the
first one gets set aside for later writing for whatever reason, then
the 15 apr_file objects from the first brigade move over to the
connection pool and thus do not get closed until the connection gets
closed (which may take a long time). This is where the leak happens.
So I guess for a final solution we must find a way to avoid setasides
into the connection pool.
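
A rough sketch of that direction (illustrative only; the real filter
has more state to manage): save into a dedicated subpool of the
connection pool, and clear it as soon as the deferred brigade has
actually been written, so the set-aside apr_file objects do not live
as long as the connection:

  /* At setaside time: use a dedicated subpool, not c->pool itself. */
  if (ctx->deferred_write_pool == NULL) {
      apr_pool_create(&ctx->deferred_write_pool, c->pool);
  }
  ap_save_brigade(f, &ctx->b, &bb, ctx->deferred_write_pool);

  /* ... later, once the saved brigade has been written to the network: */
  apr_pool_clear(ctx->deferred_write_pool); /* destroys the saved brigade
                                             * and closes set-aside fds */
  ctx->b = NULL;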

Regards

Rüdiger


Re: FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by Joe Orton <jo...@redhat.com>.
On Thu, Feb 19, 2009 at 10:00:50PM +0100, Ruediger Pluem wrote:
> On 02/19/2009 12:32 PM, Joe Orton wrote:
...
> > @@ -497,13 +500,17 @@
> >                  next = APR_BUCKET_NEXT(bucket);
> >              }
> >              bytes_in_brigade += bucket->length;
> > -            if (!APR_BUCKET_IS_FILE(bucket)) {
> > +            if (APR_BUCKET_IS_FILE(bucket)) {
> > +                num_files_in_brigade++;
> > +            } 
> > +            else {                
> >                  non_file_bytes_in_brigade += bucket->length;
> >              }
> >          }
> >      }
> >  
> > -    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER) {
> > +    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER
> > +        || num_files_in_brigade >= THRESHOLD_MAX_FILES) {
> 
> If the 16 FD's were split over more than one brigade and the
> brigades before us were set aside, the FD's already belong to the wrong pool
> (the connection pool). Deleting a file bucket doesn't close the FD it uses.

Not sure what the concern is there - this loop is iterating over the 
concatenation of the buffered brigade and the "new" brigade (right?), so 
it will count the total number of buckets which are potentially left 
buffered after this c_o_f invocation terminates.

w.r.t. MMAP buckets: the 64K bytes limit will apply here, since FILE 
only morphs into MMAP if the file size is > 8K.  (And no, theoretically, 
the fd from which an MMAP bucket was derived is not needed after the 
mmap() call, but I don't think the fds actually get closed earlier than 
pool closure, normally)

w.r.t. multiple FILE buckets for a single fd: failing to buffer as much 
as possible will not be the end of the world, but I suppose that case is 
common, so we could cope with it with something like this (completely 
untested):

Index: server/core_filters.c
===================================================================
--- server/core_filters.c	(revision 734690)
+++ server/core_filters.c	(working copy)
@@ -367,6 +367,7 @@
 
 #define THRESHOLD_MIN_WRITE 4096
 #define THRESHOLD_MAX_BUFFER 65536
+#define THRESHOLD_MAX_FILES 16
 
 /* Optional function coming from mod_logio, used for logging of output
  * traffic
@@ -380,7 +381,8 @@
     core_output_filter_ctx_t *ctx = net->out_ctx;
     apr_bucket_brigade *bb;
     apr_bucket *bucket, *next;
-    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade;
+    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade, num_files_in_brigade;
+    apr_file_t *last_file_seen;
 
     /* Fail quickly if the connection has already been aborted. */
     if (c->aborted) {
@@ -466,6 +468,9 @@
 
     bytes_in_brigade = 0;
     non_file_bytes_in_brigade = 0;
+    num_files_in_brigade = 0;
+    last_file_seen = NULL;
+
     for (bucket = APR_BRIGADE_FIRST(bb); bucket != APR_BRIGADE_SENTINEL(bb);
          bucket = next) {
         next = APR_BUCKET_NEXT(bucket);
@@ -497,13 +502,20 @@
                 next = APR_BUCKET_NEXT(bucket);
             }
             bytes_in_brigade += bucket->length;
-            if (!APR_BUCKET_IS_FILE(bucket)) {
+            if (APR_BUCKET_IS_FILE(bucket) 
+                && (last_file_seen == NULL
+                    || last_file_seen != ((apr_bucket_file *)bucket->data)->fd)) {
+                num_files_in_brigade++;
+                last_file_seen = ((apr_bucket_file *)bucket->data)->fd;
+            } 
+            else {                
                 non_file_bytes_in_brigade += bucket->length;
             }
         }
     }
 
-    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER) {
+    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER
+        || num_files_in_brigade >= THRESHOLD_MAX_FILES) {
         /* ### Writing the entire brigade may be excessive; we really just
          * ### need to send enough data to be under THRESHOLD_MAX_BUFFER.
          */

Re: FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by Ruediger Pluem <rp...@apache.org>.

On 02/19/2009 12:32 PM, Joe Orton wrote:
> On Wed, Feb 18, 2009 at 09:39:31PM +0100, Ruediger Pluem wrote:
>> On 02/18/2009 11:16 AM, Joe Orton wrote:
>> There is still a nasty issue with the trunk code that can cause you to
>> run out of FD's, as the new non-blocking core output filter has some
>> trouble setting aside the file buckets to the correct pool (it puts them
>> in the connection pool, which is too long-lived and bad).
>>
>> See thread starting at
>>
>> http://mail-archives.apache.org/mod_mbox/httpd-dev/200901.mbox/%3c495FF1B9.9070607@kippdata.de%3e
>>
>> I admit that I lost momentum on this, but this definitely needs to be fixed.
> 
> Yes, that looks broken.
> 
> Something like the patch below should fix it, though I'm not sure it's 
> really a good idea to be allowing a pathological case of 16 fds consumed 
> per client.  It might present some interesting tuning challenges; 
> hitting fd limits before hitting num(client) limits.
> 
> Index: core_filters.c
> ===================================================================
> --- core_filters.c	(revision 734690)
> +++ core_filters.c	(working copy)
> @@ -367,6 +367,7 @@
>  
>  #define THRESHOLD_MIN_WRITE 4096
>  #define THRESHOLD_MAX_BUFFER 65536
> +#define THRESHOLD_MAX_FILES 16
>  
>  /* Optional function coming from mod_logio, used for logging of output
>   * traffic
> @@ -380,7 +381,7 @@
>      core_output_filter_ctx_t *ctx = net->out_ctx;
>      apr_bucket_brigade *bb;
>      apr_bucket *bucket, *next;
> -    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade;
> +    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade, num_files_in_brigade;
>  
>      /* Fail quickly if the connection has already been aborted. */
>      if (c->aborted) {
> @@ -466,6 +467,8 @@
>  
>      bytes_in_brigade = 0;
>      non_file_bytes_in_brigade = 0;
> +    num_files_in_brigade = 0;
> +
>      for (bucket = APR_BRIGADE_FIRST(bb); bucket != APR_BRIGADE_SENTINEL(bb);
>           bucket = next) {
>          next = APR_BUCKET_NEXT(bucket);
> @@ -497,13 +500,17 @@
>                  next = APR_BUCKET_NEXT(bucket);
>              }
>              bytes_in_brigade += bucket->length;
> -            if (!APR_BUCKET_IS_FILE(bucket)) {
> +            if (APR_BUCKET_IS_FILE(bucket)) {
> +                num_files_in_brigade++;
> +            } 
> +            else {                
>                  non_file_bytes_in_brigade += bucket->length;
>              }
>          }
>      }
>  
> -    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER) {
> +    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER
> +        || num_files_in_brigade >= THRESHOLD_MAX_FILES) {

If the 16 FD's were split over more than one brigade and the
brigades before us were set aside, the FD's already belong to the wrong pool
(the connection pool). Deleting a file bucket doesn't close the FD it uses.

Regards

Rüdiger



Re: FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by Ruediger Pluem <rp...@apache.org>.

On 02/19/2009 12:32 PM, Joe Orton wrote:
> On Wed, Feb 18, 2009 at 09:39:31PM +0100, Ruediger Pluem wrote:
>> On 02/18/2009 11:16 AM, Joe Orton wrote:
>> There is still a nasty issue with the trunk code that can cause you to
>> run out of FD's, as the new non-blocking core output filter has some
>> trouble setting aside the file buckets to the correct pool (it puts them
>> in the connection pool, which is too long-lived and bad).
>>
>> See thread starting at
>>
>> http://mail-archives.apache.org/mod_mbox/httpd-dev/200901.mbox/%3c495FF1B9.9070607@kippdata.de%3e
>>
>> I admit that I lost momentum on this, but this definitely needs to be fixed.
> 
> Yes, that looks broken.
> 
> Something like the patch below should fix it, though I'm not sure it's 
> really a good idea to be allowing a pathological case of 16 fds consumed 
> per client.  It might present some interesting tuning challenges; 
> hitting fd limits before hitting num(client) limits.

Thanks for the patch and sorry for picking on the details :-)

> 
> Index: core_filters.c
> ===================================================================
> --- core_filters.c	(revision 734690)
> +++ core_filters.c	(working copy)
> @@ -367,6 +367,7 @@
>  
>  #define THRESHOLD_MIN_WRITE 4096
>  #define THRESHOLD_MAX_BUFFER 65536
> +#define THRESHOLD_MAX_FILES 16
>  
>  /* Optional function coming from mod_logio, used for logging of output
>   * traffic
> @@ -380,7 +381,7 @@
>      core_output_filter_ctx_t *ctx = net->out_ctx;
>      apr_bucket_brigade *bb;
>      apr_bucket *bucket, *next;
> -    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade;
> +    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade, num_files_in_brigade;
>  
>      /* Fail quickly if the connection has already been aborted. */
>      if (c->aborted) {
> @@ -466,6 +467,8 @@
>  
>      bytes_in_brigade = 0;
>      non_file_bytes_in_brigade = 0;
> +    num_files_in_brigade = 0;
> +
>      for (bucket = APR_BRIGADE_FIRST(bb); bucket != APR_BRIGADE_SENTINEL(bb);
>           bucket = next) {
>          next = APR_BUCKET_NEXT(bucket);
> @@ -497,13 +500,17 @@
>                  next = APR_BUCKET_NEXT(bucket);
>              }
>              bytes_in_brigade += bucket->length;
> -            if (!APR_BUCKET_IS_FILE(bucket)) {
> +            if (APR_BUCKET_IS_FILE(bucket)) {
> +                num_files_in_brigade++;

Hm, how do you know that the FD in this bucket is different from the one
in the last file bucket(s)? It may be possible that we have file buckets
in the brigade that represent different parts of the same file, as a filter
might have put something in between them in the form of heap buckets or
something like that.
Furthermore, what about MMAP buckets? Don't they have an open FD as well?

Regards

Rüdiger


Re: FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by Joe Orton <jo...@redhat.com>.
On Wed, Feb 18, 2009 at 09:39:31PM +0100, Ruediger Pluem wrote:
> On 02/18/2009 11:16 AM, Joe Orton wrote:
> There is still a nasty issue with the trunk code that can cause you to
> run out of FD's, as the new non-blocking core output filter has some
> trouble setting aside the file buckets to the correct pool (it puts them
> in the connection pool, which is too long-lived and bad).
> 
> See thread starting at
> 
> http://mail-archives.apache.org/mod_mbox/httpd-dev/200901.mbox/%3c495FF1B9.9070607@kippdata.de%3e
> 
> I admit that I lost momentum on this, but this definitely needs to be fixed.

Yes, that looks broken.

Something like the patch below should fix it, though I'm not sure it's 
really a good idea to be allowing a pathological case of 16 fds consumed 
per client.  It might present some interesting tuning challenges; 
hitting fd limits before hitting num(client) limits.

Index: core_filters.c
===================================================================
--- core_filters.c	(revision 734690)
+++ core_filters.c	(working copy)
@@ -367,6 +367,7 @@
 
 #define THRESHOLD_MIN_WRITE 4096
 #define THRESHOLD_MAX_BUFFER 65536
+#define THRESHOLD_MAX_FILES 16
 
 /* Optional function coming from mod_logio, used for logging of output
  * traffic
@@ -380,7 +381,7 @@
     core_output_filter_ctx_t *ctx = net->out_ctx;
     apr_bucket_brigade *bb;
     apr_bucket *bucket, *next;
-    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade;
+    apr_size_t bytes_in_brigade, non_file_bytes_in_brigade, num_files_in_brigade;
 
     /* Fail quickly if the connection has already been aborted. */
     if (c->aborted) {
@@ -466,6 +467,8 @@
 
     bytes_in_brigade = 0;
     non_file_bytes_in_brigade = 0;
+    num_files_in_brigade = 0;
+
     for (bucket = APR_BRIGADE_FIRST(bb); bucket != APR_BRIGADE_SENTINEL(bb);
          bucket = next) {
         next = APR_BUCKET_NEXT(bucket);
@@ -497,13 +500,17 @@
                 next = APR_BUCKET_NEXT(bucket);
             }
             bytes_in_brigade += bucket->length;
-            if (!APR_BUCKET_IS_FILE(bucket)) {
+            if (APR_BUCKET_IS_FILE(bucket)) {
+                num_files_in_brigade++;
+            } 
+            else {                
                 non_file_bytes_in_brigade += bucket->length;
             }
         }
     }
 
-    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER) {
+    if (non_file_bytes_in_brigade >= THRESHOLD_MAX_BUFFER
+        || num_files_in_brigade >= THRESHOLD_MAX_FILES) {
         /* ### Writing the entire brigade may be excessive; we really just
          * ### need to send enough data to be under THRESHOLD_MAX_BUFFER.
          */


Re: FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by Ruediger Pluem <rp...@apache.org>.

On 02/18/2009 11:16 AM, Joe Orton wrote:
> On Mon, Feb 16, 2009 at 03:12:11PM +0100, Ruediger Pluem wrote:
>> On 02/16/2009 02:13 PM, Joe Orton wrote:
>>> Why is it invalid use of the filtering/buckets API to close the file 
>>> after sending the FILE-containing brigade up the filter stack?  
>>>
>>> It seems counter-intuitive to me that *anything* you do after the 
>>> ap_pass_brigade() call should make a difference to what was passed up 
>>> the filter stack.  But I suppose this is the case; even most of the 
>>> memory-backed bucket types don't duplicate referenced memory in the 
>>> setaside method.
>> I guess this is not needed in most cases and is not done, for
>> performance reasons. But maybe we can change the setaside for file
>> buckets to do an additional dup on the fd.
> 
> The choices seem to be:
> 
> 1) declare the FLUSH to be necessary in this case as you said 
> originally ;) - and document appropriately
> 
> 2) rewrite/reinvent all the setaside methods to dup/copy as appropriate
> 
> 3) declare the core output filter broken and fix it
> 
> Of these (2) and (3) will presumably have some negative performance 
> impact though I'm not sure to what degree.  Not sure (2) is even going 
> to be possible, e.g. with MMAP buckets, scary stuff.

Yes, I don't know how MMAP would work with this, so dup might be a bad
approach in that regard. In any case it looks like it would cause a lot
of trouble before it worked stably again :-)

> 
> Looking at the cray-zee non-blocking core output filter on the trunk, it 
> will setaside with wild abandon unless it hits a FLUSH.  I think all 

There is still a nasty issue with the trunk code that can cause you to
run out of FD's, as the new non-blocking core output filter has some
trouble setting aside the file buckets to the correct pool (it puts them
in the connection pool, which is too long-lived and bad).

See thread starting at

http://mail-archives.apache.org/mod_mbox/httpd-dev/200901.mbox/%3c495FF1B9.9070607@kippdata.de%3e

I admit that I lost momentum on this, but this definitely needs to be fixed.

> these options are pretty distasteful but (1) seems like the only 
> feasible route.  What do you think?  

I guess at least for the moment (1) is the way to go exactly for the
reasons you stated.

Regards

Rüdiger

FLUSH, filtering, setaside, etc (was Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.)

Posted by Joe Orton <jo...@redhat.com>.
On Mon, Feb 16, 2009 at 03:12:11PM +0100, Ruediger Pluem wrote:
> On 02/16/2009 02:13 PM, Joe Orton wrote:
> > Why is it invalid use of the filtering/buckets API to close the file 
> > after sending the FILE-containing brigade up the filter stack?  
> > 
> > It seems counter-intuitive to me that *anything* you do after the 
> > ap_pass_brigade() call should make a difference to what was passed up 
> > the filter stack.  But I suppose this is the case; even most of the 
> > memory-backed bucket types don't duplicate referenced memory in the 
> > setaside method.
> 
> I guess this is not needed in most cases and is not done, for
> performance reasons. But maybe we can change the setaside for file
> buckets to do an additional dup on the fd.

The choices seem to be:

1) declare the FLUSH to be necessary in this case as you said 
originally ;) - and document appropriately

2) rewrite/reinvent all the setaside methods to dup/copy as appropriate

3) declare the core output filter broken and fix it

Of these (2) and (3) will presumably have some negative performance 
impact though I'm not sure to what degree.  Not sure (2) is even going 
to be possible, e.g. with MMAP buckets, scary stuff.

Looking at the cray-zee non-blocking core output filter on the trunk, it 
will setaside with wild abandon unless it hits a FLUSH.  I think all 
these options are pretty distasteful but (1) seems like the only 
feasible route.  What do you think?  

Would be good to get some more opinions on this.

Regards, Joe

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Ruediger Pluem <rp...@apache.org>.

On 02/16/2009 02:13 PM, Joe Orton wrote:

> 
> Why is it invalid use of the filtering/buckets API to close the file 
> after sending the FILE-containing brigade up the filter stack?  
> 
> It seems counter-intuitive to me that *anything* you do after the 
> ap_pass_brigade() call should make a difference to what was passed up 
> the filter stack.  But I suppose this is the case; even most of the 
> memory-backed bucket types don't duplicate referenced memory in the 
> setaside method.

I guess this is not needed in most cases and is not done, for
performance reasons. But maybe we can change the setaside for file
buckets to do an additional dup on the fd.
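
As a sketch of what that could look like, modelled on apr-util's
file_bucket_setaside() but with apr_file_dup() swapped in for
apr_file_setaside() (simplified; error handling and mmap details
omitted):

  static apr_status_t file_bucket_setaside(apr_bucket *b, apr_pool_t *reqpool)
  {
      apr_bucket_file *a = b->data;
      apr_file_t *fd = NULL;
      apr_status_t rv;

      if (apr_pool_is_ancestor(apr_file_pool_get(a->fd), reqpool)) {
          return APR_SUCCESS;   /* file already lives long enough */
      }
      if (!apr_pool_is_ancestor(a->readpool, reqpool)) {
          a->readpool = reqpool;
      }

      /* dup(2) the descriptor into the longer-lived pool, so the caller
       * remains free to close its own fd after ap_pass_brigade(). */
      rv = apr_file_dup(&fd, a->fd, reqpool);
      if (rv != APR_SUCCESS) {
          return rv;
      }
      a->fd = fd;
      return APR_SUCCESS;
  }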

> So EOS is a better flush than FLUSH, right?

IMHO no. EOS can cause data to remain in the chain (for optimizing the
pipeline case), whereas FLUSH in almost every case should really get
the buckets processed. Yeah, I know the words "almost every case" are bad,
but I know that some filters are an exception to the "almost every case" :-(.

Regards

Rüdiger




Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Joe Orton <jo...@redhat.com>.
On Mon, Feb 16, 2009 at 12:34:26PM +0100, Ruediger Pluem wrote:
> On 02/16/2009 11:07 AM, Joe Orton wrote:
> > The call to:
> > 
> >             ap_save_brigade(f, &ctx->b, &b, ctx->deferred_write_pool);
> > 
> > in that code path should result in the FILE bucket and the contained fd 
> > being dup()ed.  (Though if that is failing, you wouldn't know because of 
> > the lack of error checking)
> 
> Are you sure?
> 
> apr_file_setaside does not dup the fd.

Oh, good point.  This feels like one of those cases where some core 
assumptions about the filtering/bucket API start melting away underneath 
me.

Why is it invalid use of the filtering/buckets API to close the file 
after sending the FILE-containing brigade up the filter stack?  

It seems counter-intuitive to me that *anything* you do after the 
ap_pass_brigade() call should make a difference to what was passed up 
the filter stack.  But I suppose this is the case; even most of the 
memory-backed bucket types don't duplicate referenced memory in the 
setaside method.

I've always thought of FLUSH as a *hint* towards the desired network 
behaviour.  To say a FLUSH is required here is to model FLUSH as a 
*mandate* on the bucket/object lifetimes.  That's a huge difference.

Indeed, if I go read some API docs in the headers, which is always good 
for a laugh/cry moment:

apr_bucket_eos_make => "All filters should flush at this point."

apr_bucket_flush_make => "This indicates that filters should flush their
   data.  There is **no guarantee** that they will flush it, but this is 
   the best we can do."  (my emphasis)

So EOS is a better flush than FLUSH, right?

Regards, Joe

Re: Problems with EOS optimisation in ap_core_output_filter() and file buckets.

Posted by Ruediger Pluem <rp...@apache.org>.

On 02/16/2009 11:07 AM, Joe Orton wrote:
> On Sat, Feb 14, 2009 at 10:25:08AM +1100, Graham Dumpleton wrote:
> ...
>> The end result of the code is that if you have a file bucket
>> getting this far where the length of the file is less than 8000 and
>> an EOS follows it, then the actual file bucket is held over rather
>> than the data being read and buffered. As commented, this is to avoid
>> doing an mmap+memcpy. What it means, though, is that the file
>> descriptor within the file bucket must be kept open and cannot be
>> closed as soon as ap_pass_brigade() has been called.
> 
> The call to:
> 
>             ap_save_brigade(f, &ctx->b, &b, ctx->deferred_write_pool);
> 
> in that code path should result in the FILE bucket and the contained fd 
> being dup()ed.  (Though if that is failing, you wouldn't know because of 
> the lack of error checking)

Are you sure?

apr_file_setaside does not dup the fd.
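
For reference, the distinction in APR terms (a simplified sketch;
new_f, old_f and pool are placeholders):

  apr_file_setaside(&new_f, old_f, pool); /* re-parents the apr_file_t to
                                           * 'pool'; same underlying OS fd */
  apr_file_dup(&new_f, old_f, pool);      /* dup(2)s the OS descriptor too */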

Regards

Rüdiger