You are viewing a plain text version of this content. The canonical link for it is here.
Posted to modules-dev@httpd.apache.org by Joshua Marantz <jm...@google.com> on 2011/01/05 14:45:57 UTC

Help trying to figure out why an output_filter is not called.

One of the improvements mod_pagespeed is supposed to do to sites is extend
the cache lifetime of their resources indefinitely by including a content
hash in the URL.  This is working well for a large number of sites, but I
encountered one today where it does not work.

To accomplish the cache extension, overriding any wildcarded or
directory-based expire settings a site admin has set for their resources,
mod_pagespeed inserts two output filters.

The first one does the HTML rewriting:

  ap_register_output_filter(
      kModPagespeedFilterName, instaweb_out_filter, NULL,
AP_FTYPE_RESOURCE);

When instaweb_out_filter runs, it makes this transformation:

  before:  <script src="foo.js"></script>
  after:   <script src="foo.js.pagespeed.ce.HASH.js"></script>

The rewritten resource, foo.js.pagespeed.ce.HASH.js, is served by a hook:

  ap_hook_handler(instaweb_handler, NULL, NULL, APR_HOOK_FIRST - 1);

Knowing that mod_headers will later override Cache-Control, which we don't
want, our hook serves the .js file with our own header:

   X-Mod-Pagespeed-Repair: max-age=31536000

We a second output filter, to repair it:

  // We need our repair headers filter to run after mod_headers. The
  // mod_headers, which is the filter that is used to add the cache
settings, is
  // AP_FTYPE_CONTENT_SET. Using (AP_FTYPE_CONTENT_SET + 2) to make sure
that we
  // run after mod_headers.
  ap_register_output_filter(
      InstawebContext::kRepairHeadersFilterName, repair_caching_header,
NULL,
      static_cast<ap_filter_type>(AP_FTYPE_CONTENT_SET + 2));

This is added into the filter chain whenever we want to extend cache:

  apr_table_add(request->headers_out, "X-Mod-Pagespeed-Repair",
cache_control);
  ap_add_output_filter("X-Mod-Pagespeed-Repair",
                       NULL, request, request->connection);

When working properly, this header is removed from request->headers_out by
repair_caching_header():

  const char* cache_control = apr_table_get(request->headers_out,
                                            "X-Mod-Pagespeed-Repair");
  if (cache_control != NULL) {
    SetCacheControl(cache_control, request);
    apr_table_unset(request->headers_out, kRepairCachingHeader);
  }

Where SetCacheControl also makes the Expires header consistent, etc.

While this approach is complex, I've never seen it fail until today, on the
site http://law.thu.edu.tw/main.php . On that site, the
"X-Mod-Pagespeed-Repair"
header is visible (it should have been removed) and the Cache-Control header
has the value set from the conf files (public, max-age=600).   So on this
server, the repair_caching_header filter is not being run, despite having
been programatically inserted by our code in the same place where we add "
X-Mod-Pagespeed-Repair" header.

What might be going wrong in his server to cause this to fail?  Could some
other filter be somehow finding our filter and killing it?  Or sending the
bytes directly to the network before our filter has a chance to run?

Thanks!
-Josh

Re: Help trying to figure out why an output_filter is not called.

Posted by Joshua Marantz <jm...@google.com>.
Thanks for all the responses.  I haven't tried to reproduce this at all --
this shows up on a site with a back-end infrastructure that I don't have
access to (e.g. PHP scripts).

But I'm inclined to go with Ben's suggestion of removing the mod_headers
filter (and also I think mod_expires) whenever I don't want my headers
messed with.  It has the virtue of leaving the headers in the correct state
in case a filter downstream to mine goes directly to the net, compared with
my current dependence on a late-running repair filter.

-Josh

On Wed, Jan 5, 2011 at 11:14 AM, Nick Kew <ni...@apache.org> wrote:

> On Wed, 5 Jan 2011 08:45:57 -0500
> Joshua Marantz <jm...@google.com> wrote:
>
>
> > What might be going wrong in his server to cause this to fail?  Could
> some
> > other filter be somehow finding our filter and killing it?  Or sending
> the
> > bytes directly to the network before our filter has a chance to run?
>
> Yes.  Any of the above.  Someone may be breaking modularity.
> PHP has a track record of that, and from your recent posts here
> I'd guess it's not the only suspect.
>
> Can you reproduce the problem?  If so, run it under a debugger or trace
> and look for when your filter is inserted and whether it's removed.
> Does mod_diagnostics tell you anything if you run it at the same level
> as yours?
>
> --
> Nick Kew
>

Re: Help trying to figure out why an output_filter is not called.

Posted by Nick Kew <ni...@apache.org>.
On Wed, 5 Jan 2011 08:45:57 -0500
Joshua Marantz <jm...@google.com> wrote:


> What might be going wrong in his server to cause this to fail?  Could some
> other filter be somehow finding our filter and killing it?  Or sending the
> bytes directly to the network before our filter has a chance to run?

Yes.  Any of the above.  Someone may be breaking modularity.
PHP has a track record of that, and from your recent posts here
I'd guess it's not the only suspect.

Can you reproduce the problem?  If so, run it under a debugger or trace
and look for when your filter is inserted and whether it's removed.
Does mod_diagnostics tell you anything if you run it at the same level
as yours?

-- 
Nick Kew

Re: Help trying to figure out why an output_filter is not called.

Posted by Rainer Jung <ra...@kippdata.de>.
On 05.01.2011 15:54, Joshua Marantz wrote:
> On Wed, Jan 5, 2011 at 8:57 AM, Ben Noordhuis<in...@bnoordhuis.nl>  wrote:
>> On Wed, Jan 5, 2011 at 14:45, Joshua Marantz<jm...@google.com>  wrote:
>>> other filter be somehow finding our filter and killing it?  Or sending
>> the
>>> bytes directly to the network before our filter has a chance to run?
>>
>> Possibly, yes.
>>
>
> Can you elaborate?   Is this a common practice, to write bytes directly to
> the network from an output filter?  What should I look for?  The owner of
> the site where this is breaking sent me a few conf files and it enumerates
> some of the modules inserted:
>
> LoadModule perl_module modules/mod_perl.so
> LoadModule mono_module modules/mod_mono.so
> LoadModule bwlimited_module modules/mod_bwlimited.so
> LoadModule bw_module modules/mod_bw.so
> LoadModule jk_module modules/mod_jk.so
>
> Does anyone know anything about these?  Could one of these have inserted an
> output filter that spews bytes directly to the network?  I'll try to find
> sources for those but if someone knows off-hand that would be helpful.

mod_jk does not use filters. It writes through other possibly existing 
filters using ap_rwrite().

Code available at:

http://svn.apache.org/repos/asf/tomcat/jk/trunk/

or

http://tomcat.apache.org/download-connectors.cgi

Regards,

Rainer

Re: Help trying to figure out why an output_filter is not called.

Posted by Joshua Marantz <jm...@google.com>.
This has certainly gotten off topic :)

http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=nginx+vs+apache has
lots of interesting opinions on the subject.

Having said that, mod_pagespeed's initial target is Apache because that's
the dominant server stack driving the web.  I'm optimistic that nginx will
also be a compelling opportunity at some point and I'm anxious to learn
more.



Now -- back on topic -- this issue is tracked as
http://code.google.com/p/modpagespeed/issues/detail?id=179 for those
following along at home, and hopefully will be resolved shortly based on the
advise of the contributors to this thread.

Thanks again, everyone,
-Josh

On Wed, Jan 5, 2011 at 8:40 PM, Ray Morris <su...@bettercgi.com> wrote:

>  Just a quick note since you mentioned nginx.  Nginx is of
> course normally used by people wanting higher performance than
> they are getting from Apache, because certain tests seemed to
> suggest that nginx can significantly outperform Apache in some
> cases. If that's the case for you, we learned something very
> interesting.
>
>  We wondered how nginx could possibly be much faster since the
> speed of the disk itself is normally the limiting factor.  Was
> there something to be learned from nginx which could be applied
> to Apache?  At the end of all of the testing, we learned what
> caused the large apparent difference.  noatime.  Nginx effectively
> skips atime updates, which can make a huge difference.  By simply
> mounting the directory with the "noatime" option, any reasonable
> Apache configuration will have about the same performance as nginx,
> which is basically the performance of the underlying storage.
>
>   People build complex systems with nginx as a proxy to Apache,
> but the same or better performance, with better standards compliance
> and better reliability, can be obtained by just setting noatime
> directly rather than using getting noatime accidentally as a
> side effect of nginx.
>
>   With noatime set, one server or another might be 1% faster,
> but using TWO servers, with one as a proxy, will be slower than
> just simply using Apache, and in no case will nginx be SIGNIFICANTLY
> faster, when using noatime.
> --
> Ray Morris
> support@bettercgi.com
>
> Strongbox - The next generation in site security:
> http://www.bettercgi.com/strongbox/
>
> Throttlebox - Intelligent Bandwidth Control
> http://www.bettercgi.com/throttlebox/
>
> Strongbox / Throttlebox affiliate program:
> http://www.bettercgi.com/affiliates/user/register.php
>
>
>
> On 01/05/2011 03:16:21 PM, Ben Noordhuis wrote:
>
>> On Wed, Jan 5, 2011 at 22:03, Joshua Marantz <jm...@google.com> wrote:
>> > Right you are.  That's much simpler then.  Thanks!
>>
>> My pleasure, Joshua.
>>
>> Two quick questions, hope you don't mind: Is mod_pagespeed an official
>> Google project? Or is it something you guys do on your day off? And
>> are there plans for a nginx port?
>>
>>
>>
>

Re: Help trying to figure out why an output_filter is not called.

Posted by Joshua Marantz <jm...@google.com>.
This is my day job at google.  We would love to do an nginx port but have
not scheduled it yet.

sent from my android
On Jan 5, 2011 4:16 PM, "Ben Noordhuis" <in...@bnoordhuis.nl> wrote:
> On Wed, Jan 5, 2011 at 22:03, Joshua Marantz <jm...@google.com> wrote:
>> Right you are.  That's much simpler then.  Thanks!
>
> My pleasure, Joshua.
>
> Two quick questions, hope you don't mind: Is mod_pagespeed an official
> Google project? Or is it something you guys do on your day off? And
> are there plans for a nginx port?

Re: Help trying to figure out why an output_filter is not called.

Posted by Ben Noordhuis <in...@bnoordhuis.nl>.
On Wed, Jan 5, 2011 at 22:03, Joshua Marantz <jm...@google.com> wrote:
> Right you are.  That's much simpler then.  Thanks!

My pleasure, Joshua.

Two quick questions, hope you don't mind: Is mod_pagespeed an official
Google project? Or is it something you guys do on your day off? And
are there plans for a nginx port?

Re: Help trying to figure out why an output_filter is not called.

Posted by Joshua Marantz <jm...@google.com>.
Right you are.  That's much simpler then.  Thanks!

-Josh

On Wed, Jan 5, 2011 at 3:46 PM, Ben Noordhuis <in...@bnoordhuis.nl> wrote:

> On Wed, Jan 5, 2011 at 20:40, Joshua Marantz <jm...@google.com> wrote:
> > So if I try to remove the 'expires' filter from my handler (which runs
> > early) then mod_expires will have a handler that runs later that inserts
> it
> > after my module has completed.
>
> No, it's the other way around. mod_expires uses the insert_filter hook
> to insert its filter before your handler is run (and how could it be
> otherwise? Output filters are there to post-process the content your
> handler generates).
>
> Have a look at ap_invoke_handler() in config.c, that should give you a
> handle on how the filter chain works. But don't hesitate to post your
> questions if you have them, of course. :)
>

Re: Help trying to figure out why an output_filter is not called.

Posted by Ben Noordhuis <in...@bnoordhuis.nl>.
On Wed, Jan 5, 2011 at 20:40, Joshua Marantz <jm...@google.com> wrote:
> So if I try to remove the 'expires' filter from my handler (which runs
> early) then mod_expires will have a handler that runs later that inserts it
> after my module has completed.

No, it's the other way around. mod_expires uses the insert_filter hook
to insert its filter before your handler is run (and how could it be
otherwise? Output filters are there to post-process the content your
handler generates).

Have a look at ap_invoke_handler() in config.c, that should give you a
handle on how the filter chain works. But don't hesitate to post your
questions if you have them, of course. :)

Re: Help trying to figure out why an output_filter is not called.

Posted by Joshua Marantz <jm...@google.com>.
On Wed, Jan 5, 2011 at 10:43 AM, Ben Noordhuis <in...@bnoordhuis.nl> wrote:

>  > I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and
> > MOD_EXPIRES.
>

 Are there any other similar header-mucking-filters I need to kill?

 > Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it
> runs
> > after my content-generator, which means that it won't have been inserted
> by
> > the time when I want to set my caching headers.
>
> You can remove it from your handler, scan
> r->output_filters->frec->name to find the filter.
>

I'm not following you.  mod_expires.c has this:

static void expires_insert_filter(request_rec *r) {
  ...ap_add_output_filter("MOD_EXPIRES", NULL, r, r->connection);...
}

static void register_hooks(apr_pool_t *p) {
    /* mod_expires needs to run *before* the cache save filter which is
     * AP_FTYPE_CONTENT_SET-1.  Otherwise, our expires won't be honored.
     */
    ap_register_output_filter("MOD_EXPIRES", expires_filter, NULL,
                              AP_FTYPE_CONTENT_SET-2);
    ap_hook_insert_error_filter(expires_insert_filter, NULL, NULL,
APR_HOOK_MIDDLE);
    ap_hook_insert_filter(expires_insert_filter, NULL, NULL,
APR_HOOK_MIDDLE);
}

So if I try to remove the 'expires' filter from my handler (which runs
early) then mod_expires will have a handler that runs later that inserts it
after my module has completed.  Hence:

> I guess that means I have to insert a new late-running hook that kills
> > undesirable output filters.  Does that wind up being simpler?
>
> The above is probably easier but whatever ends up being the most
> readable / maintainable, right?
>

And also functional :) which, evidently, my current solution is not, at
least in the presence of mod_perl and mod_php.

Your solution has the advantage of being more robust when upstream filters
write directly to the network.  I just wish I didn't have to depend on my
copies of string literals from .c files I don't control.  But as you said
core-filters will hopefully not change internal string constants often.   I
just coded it up and it seems to work :)

-Josh

Re: Help trying to figure out why an output_filter is not called.

Posted by Ben Noordhuis <in...@bnoordhuis.nl>.
On Wed, Jan 5, 2011 at 15:54, Joshua Marantz <jm...@google.com> wrote:
> Can you elaborate?   Is this a common practice, to write bytes directly to
> the network from an output filter?  What should I look for?  The owner of

Not common but sometimes you can't avoid it (search the mailing list,
there are a few examples).

> LoadModule perl_module modules/mod_perl.so

mod_perl allows scripts to write directly to the socket.

> I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and
> MOD_EXPIRES.  Are there any other similar header-mucking-filters I need to
> kill?  I don't mind squirreling through the source code to find these names
> (all are string literals in .c files) but I'm nervous they could change
> without warning in a future version.

That's very unlikely to happen with Apache core modules.

> Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it runs
> after my content-generator, which means that it won't have been inserted by
> the time when I want to set my caching headers.

You can remove it from your handler, scan
r->output_filters->frec->name to find the filter.

> I guess that means I have to insert a new late-running hook that kills
> undesirable output filters.  Does that wind up being simpler?

The above is probably easier but whatever ends up being the most
readable / maintainable, right?

Re: Help trying to figure out why an output_filter is not called.

Posted by Joshua Marantz <jm...@google.com>.
Thanks again for the fast response, Ben!

On Wed, Jan 5, 2011 at 8:57 AM, Ben Noordhuis <in...@bnoordhuis.nl> wrote:

> On Wed, Jan 5, 2011 at 14:45, Joshua Marantz <jm...@google.com> wrote:
> > other filter be somehow finding our filter and killing it?  Or sending
> the
> > bytes directly to the network before our filter has a chance to run?
>
> Possibly, yes.
>

Can you elaborate?   Is this a common practice, to write bytes directly to
the network from an output filter?  What should I look for?  The owner of
the site where this is breaking sent me a few conf files and it enumerates
some of the modules inserted:

LoadModule perl_module modules/mod_perl.so
LoadModule mono_module modules/mod_mono.so
LoadModule bwlimited_module modules/mod_bwlimited.so
LoadModule bw_module modules/mod_bw.so
LoadModule jk_module modules/mod_jk.so

Does anyone know anything about these?  Could one of these have inserted an
output filter that spews bytes directly to the network?  I'll try to find
sources for those but if someone knows off-hand that would be helpful.


> By the way, why the complex setup? If you don't want the mod_headers
> filter to run, insert your filter before it, then remove it for each
> request that you handle.
>

This is an interesting idea.

I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and
MOD_EXPIRES.  Are there any other similar header-mucking-filters I need to
kill?  I don't mind squirreling through the source code to find these names
(all are string literals in .c files) but I'm nervous they could change
without warning in a future version.

Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it runs
after my content-generator, which means that it won't have been inserted by
the time when I want to set my caching headers.

I guess that means I have to insert a new late-running hook that kills
undesirable output filters.  Does that wind up being simpler?

-Josh

Re: Help trying to figure out why an output_filter is not called.

Posted by Ben Noordhuis <in...@bnoordhuis.nl>.
On Wed, Jan 5, 2011 at 14:45, Joshua Marantz <jm...@google.com> wrote:
> other filter be somehow finding our filter and killing it?  Or sending the
> bytes directly to the network before our filter has a chance to run?

Possibly, yes.

By the way, why the complex setup? If you don't want the mod_headers
filter to run, insert your filter before it, then remove it for each
request that you handle.