Posted to modperl@perl.apache.org by Tim Watts <tw...@dionic.net> on 2011/07/14 12:00:53 UTC

mod_perl output filter and mod_proxy, mod_cache

Hi,

Is it in theory possible to insert a perl output filter between 
mod_proxy and mod_cache?

Or at least between mod_proxy and the client?



The problem I'm trying to solve is this:

We have 100+ web servers where apache fronts a separate tomcat server 
using mod_proxy.

Sadly, the tomcat devs forgot to set any caching headers in the HTTP 
response (Expires, Last-Modified or Cache-Control), so the sites 
are largely uncacheable by browsers and the various tomcats are becoming 
overloaded.

1/3 of our sites are typically invariant (the production sites have 
stable and unchanging data and most queries are via GET requests).

Therefore, the idea of forcing in some cache control headers en-route 
and also enabling some apache caching has a good chance of working well 
without affecting anything.

mod_headers and mod_proxy don't seem to play well together, and mod_cache 
doesn't either (probably due to the lack of cache control headers in the 
tomcat response, though I haven't proved this is actually the case).

So the thought of doing a perl based filter to insert cache-control 
headers occurred.

Is it likely that I can insert such a filter on Apache 2.2 *between* 
mod_proxy and mod_cache?

Or am I going to have to implement a filter that includes proxying 
and/or caching?

Many thanks for any advice,

Cheers,

Tim

-- 
Tim Watts
Personal Blog: http://www.dionic.net/tim/
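
For reference, hooking a mod_perl 2 output filter into the proxied responses is mostly a matter of configuration. A minimal sketch, assuming a filter module called MyApache2::AddCacheControl has been written and can be found via @INC (the name is hypothetical; no such module appears in this thread):

```apache
# Hypothetical module name, assumed to be findable via @INC
PerlModule MyApache2::AddCacheControl

<Location />
    # Attach the filter to every response under /, proxied or local
    PerlOutputFilterHandler MyApache2::AddCacheControl
</Location>
```

Apache orders output filters by their registered type rather than by their position in the configuration, so whether such a filter really runs ahead of mod_cache's save step on Apache 2.2 would need to be checked on the actual build.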

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by André Warnier <aw...@ice-sa.com>.
And here is another link which might be interesting.
It is a message on the Tomcat list (where I re-posted your original request, hem), from
Rainer Jung, who is one of the Apache/Tomcat mod_jk connector developers :

"
Yes, go for TC 7:

http://tomcat.apache.org/tomcat-7.0-doc/config/filter.html#Expires_Filter

Regards,

Rainer
"

Now that Tomcat page, apart from its own interest, also points to the Apache "mod_expires" 
module (which I had never heard of before), which in your case may be exactly what you're 
looking for.

It seems that it can add headers to a response proxied from Tomcat, without 
overwriting such headers if they already exist.


Here is what I would do :

1) identify some "usual suspects" among the URLs proxied to Tomcat
    They would have to match the following criteria :
    - they happen on an overloaded Tomcat
    - they happen often
    - I am reasonably sure that the information delivered by that URL
      is stable over a period of time
    - I am reasonably sure that if it happened that the browser would,
      once in a while, get stale information, it would not be dramatic

2) carefully configure the front-end Apache to, for these particular URLs,
add an Expires header specifying "now + N", where N is initially not too large.
This way, a browser would not get a result that is more than N out of date, but any duplicate 
request within a period N would get the cached version.

3) look at the impact and repeat as needed, increasing or decreasing N

YMMV.
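
As a rough sketch of step 2, assuming mod_expires is loaded and that /catalogue is one of the "usual suspects" (a made-up path, as is the 10 minute value for N):

```apache
# Hypothetical URL prefix chosen in step 1
<Location /catalogue>
    ExpiresActive On
    # Expires = time of the request + N, with N = 10 minutes to start with
    ExpiresDefault "access plus 10 minutes"
</Location>
```

mod_expires also emits a matching Cache-Control max-age when it sets Expires, and its documentation says it leaves a response alone if an Expires header is already present.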



Re: mod_perl output filter and mod_proxy, mod_cache

Posted by Tim Watts <tw...@dionic.net>.
Hi Andre,

Thanks for such a detailed reply:

On 14/07/11 21:07, André Warnier wrote:

>
> Back to the main issue.
>
> See this as just a bit more generic information, as to what/how you
> could think of solving your problem, apart from the other suggestions
> already submitted.
>
> 1) I am not sure about mod_perl I/O filters, because I never used them. (*)
> But in order to (conditionally/unconditionally) insert/delete/modify
> request/response headers, you can also write your own perl handler, and
> by choosing the appropriate type of PerlHandler, you can have it run at
> just about any point in the request/response cycle.
>
> The real power of mod_perl (if you haven't yet discovered that aspect),
> is that it allows you to insert your own code at just about any point of
> the Apache request processing cycle, and to do just about anything you
> want with any aspect of the request/response.
> That includes "interfering" with anything that other, non-perl, Apache
> modules do.

I've written auth handlers in mod_perl before - I did get the impression 
then that the possibilities for doing other things were extensive.

> See the following page for a good overview of the Apache request
> processing cycle, and what you can do with such PerlHandlers :
> http://perl.apache.org/docs/2.0/user/handlers/intro.html#mod_perl_Handlers_Categories
>
> You are probably more interested in the "HTTP Protocol" section. By
> clicking on each item in that list, you get an explanation of /when/
> that type of handler runs.
> (It's also indirectly a very good introduction to how Apache itself works).
>
> Such handlers are usually easy to write and configure, and the code to
> play with HTTP headers is also quite simple, if you know what to put in
> the header(s).

ah - that is very useful - I shall read that.

> 2) about mod_headers and mod_proxy playing together :
> The trouble is that (contrary to the mod_perl documentation above) the
> Apache modules' documentation usually does not make it clear during
> which exact phase of the Apache request processing each module
> runs.
>
> But I seem to remember something in mod_headers about an "early"
> attribute or parameter.
> Maybe that tells you more of when it runs (or can run), compared to
> mod_proxy.

Hmm - I did read the web page several times, must have missed that - I 
was nearly at the point of reading the source.

> 3) In the documentation of mod_proxy, there should be a possibility to
> configure it inside of a <Location(Match)> section, instead of
> "globally" (outside of any section).
> That forces you to decide more finely which URLs should or should not be
> proxied/forwarded to Tomcat, but it also (in my view) makes it more
> evident to combine the proxying instruction with other modules, like
> perl filters or handlers.
>
> In effect, from Apache's point of view, mod_proxy must be the equivalent
> of a "content-generating handler" (like a PerlResponseHandler), because
> for Apache, passing a request to mod_proxy for processing is not much
> different than passing it to any other internal response-generating
> handler.
> Apache in fact knows nothing of Tomcat. It passes a request to
> mod_proxy, and expects the response (or an error status) back from
> mod_proxy. It has no idea that behind mod_proxy is another server.

It is an interesting possibility that is also worth playing with.

Most of our servers are: redirect all to the proxy *except* a couple of 
URLs which are either locally handled or sent to a different proxy.

This is quite typical:

RewriteEngine on
# Handled locally (Apache does not allow a comment on the same line as a directive)
RewriteRule "^/media"  - [L]
RewriteRule "^/django" - [L]
# Otherwise proxy
RewriteRule "^/(.*)$" "http://tomcat.server:8180/webapp/$1" [P,L]
ProxyPassReverse   / http://tomcat.server:8180/webapp
ProxyPassReverseCookiePath /webapp /


Previously, this had been done with ProxyPass directives, including 
negative ones. This did not work well with some Rewrite rules that were 
also needed in some cases. So I tend to handle the whole thing with an 
ordered list of rewrite rules like above, using the proxy flag to those 
where required. It makes the ordering more obvious.

I have not yet tried a system of building the website with sets of 
Location directives, which might be interesting - though I do use 
Location sections to enforce redirects to SSL and to require 
authentication. Apache is like perl, more than one way to do it.
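
A Location-based layout of the kind discussed here might look roughly like this. It is only a sketch: /reports/ is a hypothetical path, the backend URL is copied from the rewrite rules above, mod_expires is assumed to be loaded, and the interaction with the existing catch-all RewriteRule has not been tested.

```apache
# Hypothetical path; inside <Location> the first ProxyPass argument is omitted
<Location /reports/>
    ProxyPass        http://tomcat.server:8180/webapp/reports/
    ProxyPassReverse http://tomcat.server:8180/webapp/reports/
    # Per-URL directives can sit right next to the proxying instruction
    ExpiresActive On
    ExpiresDefault "access plus 10 minutes"
</Location>
```

The attraction is that everything that applies to one group of URLs lives in one section, at the cost of having to enumerate those groups explicitly.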

>
> 4) strictly according to the HTTP protocol, a "GET" request should be
> "idempotent", which means (roughly) that running it twice or more should
> always give the same answer.
> Which in theory means that even if the GET request goes to a database,
> the response should be cacheable under most circumstances.
> Unfortunately, the practice is such that the GET request is much
> overused, and it is not always that way.
> But if caching the response creates problems, you can always tell your
> application developers that it is their fault because they are misusing
> the protocol..
>
> (In really strict terms, a GET /could/ provide a different response; but
> it should not modify the state of the server).

I do recall that.

> 5) despite what I am saying in (4), a GET response can very validly be
> different from a previous GET response with the same URL (for example,
> if in-between the data has been modified by a POST). So if you are
> forcing headers on the responses, you should at least be a bit careful
> not to do this indiscriminately.
>
> That is also why I personally have a doubt about the effectiveness of
> another caching proxy front-end like a couple were mentioned earlier. If
> the Tomcat web applications themselves do not provide headers to
> indicate whether their response can be cached or not, how is the
> front-end going to determine that this response /is/ the same as a
> previous one ?
> It seems to me that such a determination would require elements that
> such a proxy does not have, no ?

I agree - the tomcat apps *should* be declaring what is the correct 
caching scenario. But they don't. So this is very much a work around. 
However, for any given case, the dev folk usually remember enough about 
a project to say "the content of the database does not change, and GETs 
will be invariant as a result" (or not). It's on that basis I'm happy to 
proceed with a kludge, just to save my poor servers from melting(!). 
Well, the servers are all VMs, so it is more to stop old projects stealing 
resources that could be better used on new projects.

I feel I understand Cache-Control (vs Expires) a lot better since I 
optimised my own website with mod_cache on top of HTML::Mason/mod_perl 
(which do play nice) - and my Mason bits do send sensible Cache-Control 
lines. So I plan to give a small lunchtime seminar on that topic with 
some demos of using Google's pagespeed firebug plugin (very useful for 
this stuff).
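
For the proxied Tomcat sites, an Apache 2.2 mod_cache setup along these lines may be a starting point. This is a sketch under assumptions: mod_cache and mod_disk_cache are loaded, the cache directory is a made-up path, and the values are placeholders.

```apache
# Cache everything under / on disk (mod_disk_cache in Apache 2.2)
CacheEnable disk /
# Hypothetical directory; it must be writable by the Apache user
CacheRoot /var/cache/apache2/proxy
# Lifetime in seconds for responses that carry no expiry information
CacheDefaultExpire 600
# Consider responses for caching even if they have no Last-Modified header
CacheIgnoreNoLastMod On
```

According to the mod_cache documentation, documents without a Last-Modified date are ordinarily not cached, which may be what is happening with the header-less Tomcat responses. CacheIgnoreNoLastMod lifts that restriction, and CacheDefaultExpire supplies the lifetime when neither a Last-Modified nor an expiry date is present.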

The stupid thing is, it is probably trivial at design time to wedge 
extra HTTP headers in (maybe JSP has a framework level TTL/expires 
control - I don't know) but one has to know one *should* be doing it...

>
> Now if you are still there, one more question :
> Are we talking here of a configuration where one front-end Apache
> front-ends for several Tomcats possibly on different machines ?
> or does each Tomcat have its own personal Apache front-end on the same
> machine ?
> or something in-between ?

Mix. Older projects sent 3 different VHOSTS to 3 different remote tomcat 
servers, each of which was handling a dozen+ webapps for a dozen+ 
different apache servers.

This was a disaster as one bad webapp could take out the tomcat farm and 
the bloody logs are so useless it was impossible to find out which one.

These days, we have 3 different tomcat instances on the front machine 
(dev, staging, live/production) and one apache with 3 VHOSTs mapping to 
each tomcat. We may also blend in some django on the same machine. 
Apache may mix in static content itself for efficiency (CSS/JS).

At least then, the development tomcat can be killed and restarted 
without breaking the live one (and no, "touching" the web.xml file to 
trigger a single webapp reload is about as reliable as asking a robber to 
drop your cash off at the bank!).

They used to use a lot of perl - but I think perl lost it a bit with 
forms handling and Ajax (until recently perhaps) which is why everyone 
went off playing with jsp and now django.

I must admit django does seem well designed and I object to python a lot 
less than java. Disadvantage - django likes to write your SQL for you 
leading to a lack of thinking there - eg, one I caught the other day:

5 JOINs with a SELECT DISTINCT over all. Bloke wondered why the MySQL 
server took 40 seconds to compute the result!

>
> (*) considering the name of "filter" however, I would think that
> - an "input filter" should always run /before/ any module which
> generates content (of which mod_proxy is one)
> - an "output filter" should always run /after/ any modules which
> generate content.
> So, it is probably difficult to have a filter which runs /in-between/
> other Apache modules.

I'm still going to have a look at mod_perl filters - I have a feeling 
they could be useful here and there.

Thanks :)

Tim

-- 
Tim Watts
Personal Blog: http://www.dionic.net/tim/

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by André Warnier <aw...@ice-sa.com>.
Tim Watts wrote:
> Hi,
> 
> Is it in theory possible to insert a perl output filter between 
> mod_proxy and mod_cache?
> 
> Or at least between mod_proxy and the client?
> 
...

> 
> mod_headers and mod_proxy don't seem to play well together and mod-cache 
> doesn't either (probably due to lack of cache control headers in the 
> tomcat response, though I haven't proved this is actually the case).
> 
...

Back to the main issue.

See this as just a bit more generic information, as to what/how you could think of solving 
your problem, apart from the other suggestions already submitted.

1) I am not sure about mod_perl I/O filters, because I never used them. (*)
But in order to (conditionally/unconditionally) insert/delete/modify request/response 
headers, you can also write your own perl handler, and by choosing the appropriate type of 
  PerlHandler, you can have it run at just about any point in the request/response cycle.

The real power of mod_perl (if you haven't yet discovered that aspect), is that it allows 
you to insert your own code at just about any point of the Apache request processing 
cycle, and to do just about anything you want with any aspect of the request/response.
That includes "interfering" with anything that other, non-perl, Apache modules do.

See the following page for a good overview of the Apache request processing cycle, and 
what you can do with such PerlHandlers :
http://perl.apache.org/docs/2.0/user/handlers/intro.html#mod_perl_Handlers_Categories
You are probably more interested in the "HTTP Protocol" section.  By clicking on each item 
in that list, you get an explanation of /when/ that type of handler runs.
(It's also indirectly a very good introduction to how Apache itself works).

Such handlers are usually easy to write and configure, and the code to play with HTTP 
headers is also quite simple, if you know what to put in the header(s).

2) about mod_headers and mod_proxy playing together :
The trouble is that (contrary to the mod_perl documentation above) the Apache modules' 
documentation usually does not make it clear during which exact phase of 
the Apache request processing each module runs.

But I seem to remember something in mod_headers about an "early" attribute or parameter.
Maybe that tells you more of when it runs (or can run), compared to mod_proxy.

3) In the documentation of mod_proxy, there should be a possibility to configure it inside 
of a <Location(Match)> section, instead of "globally" (outside of any section).
That forces you to decide more finely which URLs should or should not be proxied/forwarded 
to Tomcat, but it also (in my view) makes it more evident to combine the proxying 
instruction with other modules, like perl filters or handlers.

In effect, from Apache's point of view, mod_proxy must be the equivalent of a 
"content-generating handler" (like a PerlResponseHandler), because for Apache, passing a 
request to mod_proxy for processing is not much different than passing it to any other 
internal response-generating handler.
Apache in fact knows nothing of Tomcat.  It passes a request to mod_proxy, and expects the 
response (or an error status) back from mod_proxy.  It has no idea that behind mod_proxy 
is another server.


4) strictly according to the HTTP protocol, a "GET" request should be "idempotent", which 
means (roughly) that running it twice or more should always give the same answer.
Which in theory means that even if the GET request goes to a database, the response should 
be cacheable under most circumstances.
Unfortunately, the practice is such that the GET request is much overused, and it is not 
always that way.
But if caching the response creates problems, you can always tell your application 
developers that it is their fault because they are misusing the protocol..

(In really strict terms, a GET /could/ provide a different response; but it should not 
modify the state of the server).

5) despite what I am saying in (4), a GET response can very validly be different from a 
previous GET response with the same URL (for example, if in-between the data has been 
modified by a POST).  So if you are forcing headers on the responses, you should at least 
be a bit careful not to do this indiscriminately.

That is also why I personally have a doubt about the effectiveness of another caching 
proxy front-end like a couple were mentioned earlier.  If the Tomcat web applications 
themselves do not provide headers to indicate whether their response can be cached or not, 
how is the front-end going to determine that this response /is/ the same as a previous one ?
It seems to me that such a determination would require elements that such a proxy does not 
have, no ?
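
To make point 5 concrete, header forcing can be limited to GET requests on selected URLs. A sketch with mod_setenvif and mod_headers, in which the paths, the variable name cacheable_get and the max-age value are all invented for illustration:

```apache
# Mark GET requests with a (hypothetically named) environment variable
SetEnvIf Request_Method "^GET$" cacheable_get

# Hypothetical URL prefixes believed to serve stable data
<LocationMatch "^/(catalogue|reports)/">
    # "set" replaces any Cache-Control header already in the response
    Header set Cache-Control "max-age=600" env=cacheable_get
</LocationMatch>
```

Whether mod_headers applies this to a proxied response as hoped is the open question from earlier in the thread, so it needs testing. The "early" keyword mentioned in point 2 does exist in mod_headers, but its documentation describes early mode as a test and debugging aid.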


Now if you are still there, one more question :
Are we talking here of a configuration where one front-end Apache front-ends for several 
Tomcats possibly on different machines ?
or does each Tomcat have its own personal Apache front-end on the same machine ?
or something in-between ?


(*) considering the name of "filter" however, I would think that
- an "input filter" should always run /before/ any module which generates content (of 
which mod_proxy is one)
- an "output filter" should always run /after/ any modules which generate content.
So, it is probably difficult to have a filter which runs /in-between/ other Apache modules.

Re: Re [OT]: mod_perl output filter and mod_proxy, mod_cache

Posted by Clinton Gormley <cl...@traveljury.com>.
Hi Niels

On Thu, 2011-07-14 at 20:09 +0200, Niels Larsen wrote:
> Yes, CPAN has very, very useful things. I consider its biggest problems
> 1) too difficult to find things when not knowing what one wants, 2) a
> huge undergrowth of modules that are either bad quality or unmaintained
> or duplicated with a later module. The number of lingering bugs are an
> obstacle, yet at the same time super-useful things are "hiding" in plain
> view. 

Check out http://metacpan.org - it's a GSOC 2011 project that aims to
improve cpan search.  Tagging and user ranking (plus integration of
those into the search results) are next on the feature list

clint



Re: Re [OT]: mod_perl output filter and mod_proxy, mod_cache

Posted by Niels Larsen <ni...@genomics.dk>.
Yes, CPAN has very, very useful things. I consider its biggest problems
1) too difficult to find things when not knowing what one wants, 2) a
huge undergrowth of modules that are either bad quality or unmaintained
or duplicated by a later module. The number of lingering bugs is an
obstacle, yet at the same time super-useful things are "hiding" in plain
view. 

Apropos, Perl Dancer was "hiding" for me because I didn't see it here,
http://search.cpan.org/modlist/World_Wide_Web .. but many more such 
discoveries in the past. A simple global ranking by popularity (the 
number of times downloaded) and/or by size and maturity (time located
on CPAN) would expose many "new" things to many, I think. If other 
modules depend on them, then that may speak to quality somewhat, and 
much better rating could be done. MongoDB would probably make managing
the collection easier. But, I am grateful for what exists of course.

While certainly still watching the language, I'm moving from Apache/mod_perl
to Dancer/Nginx for speed and memory reasons.

Ok, back to lurk-mode,

Niels Larsen


> [OT, ADVOCACY]
> 
> I am partial to perl and CPAN, because there are just so many things I have been able to 
> do with them over the years at little expense to solve real-world problems.
> And despite the fact that I also use a lot of OO modules in perl, I just cannot get in 
> sympathy with a language like *****, where it seems that you have to mobilise a couple of 
> dozen classes (and x MB of RAM) just to print a date or so.
> Never mind the time spent trying to find their documentations.
> 
> As a matter of fact, when I am confronted with a new kind of problem, in an area where I 
> know a-priori nothing, my first stop is usually not Google nor Wikipedia but CPAN, just to 
> read the documentation of the modules related to that area.  Whether you need to parse 
> text, to process some weird data format, to talk to Amazon, to make credit-card payments, 
> to dig out and generate system statistics, to understand how SOAP works, to drive an 
> MS-Office program through OLE (and know nothing of OLE to start with), create a TCP 
> server, convert images, read or create and send emails, or whatever, you always find an 
> answer there. Even if in the end it turns out that the answer is not something in perl, 
> there is so much knowledge stored in CPAN that it is a pity that it is only consulted by 
> perl-centric types.
> 
> [IDEA]
> Maybe creating a website named WikiPerl, containing just the CPAN documentation with a 
> decent search engine (KinoSearch/Lucy ?), would help restore perl's popularity ?
> 
> Or do we just keep that for ourselves, as the best job-preservation scheme ever designed ?
> 
> 
> Ooops. I was just about to send this to the wrong list...



Re [OT]: mod_perl output filter and mod_proxy, mod_cache

Posted by André Warnier <aw...@ice-sa.com>.
I'll have to watch my language here, as I might otherwise get ostracised on that other 
list of mine.

Tim Watts wrote:
> On 14/07/11 14:38, André Warnier wrote:
>> Tim Watts wrote:
>> ...
> 
>> "I think for this problem, I have to treat tomcat as a little, rather
>> inefficient, black box .."
>>
> 
> They liked that quote then? ;->>>>
> 
> <OT Rant>
> 
> I'm sure it's a lovely development environment (there must be some 
> reason people use it) - all I know is it's a resource hungry bitch 
> that's never happy unless it has GB's RAM and at least 2, preferably 4 
> fast cores. And if you p*ss it off, it will eat your swap and burn all 
> your cores at 100%. Bane of my sysadmin life...

We should start a club.

> 
> Don't get me started on the readability of its log files!!

Or worse, the logging configuration.

> 
> That's across a wide range of applications including commercial stuff 
> like Confluence.
> 
> Bah - give me mod_perl (or even mod_wsgi+python) anyday...

+1

> 
> I've got a lot done with HTML::Mason+mod_perl and very efficiently (for 
> such a simple templating system) and I've been considering Mojolicious for 
> fun. Learning django right now too for the cool forms+DB stuff.
>

We have been re-developing stuff that is based on ****, using mod_perl and TT2 for now.
It works faster, uses umpteen MB less memory, and may soon deliver us from the management 
of that ****-based stuff too.


> Thankfully, our guys are making a switch to django away from **** and 
> it is so much nicer to manage.
> 
Don't know it, but will have a look.

[OT, ADVOCACY]

I am partial to perl and CPAN, because there are just so many things I have been able to 
do with them over the years at little expense to solve real-world problems.
And despite the fact that I also use a lot of OO modules in perl, I just cannot get in 
sympathy with a language like *****, where it seems that you have to mobilise a couple of 
dozen classes (and x MB of RAM) just to print a date or so.
Never mind the time spent trying to find their documentations.

As a matter of fact, when I am confronted with a new kind of problem, in an area where I 
know a-priori nothing, my first stop is usually not Google nor Wikipedia but CPAN, just to 
read the documentation of the modules related to that area.  Whether you need to parse 
text, to process some weird data format, to talk to Amazon, to make credit-card payments, 
to dig out and generate system statistics, to understand how SOAP works, to drive an 
MS-Office program through OLE (and know nothing of OLE to start with), create a TCP 
server, convert images, read or create and send emails, or whatever, you always find an 
answer there. Even if in the end it turns out that the answer is not something in perl, 
there is so much knowledge stored in CPAN that it is a pity that it is only consulted by 
perl-centric types.

[IDEA]
Maybe creating a website named WikiPerl, containing just the CPAN documentation with a 
decent search engine (KinoSearch/Lucy ?), would help restore perl's popularity ?

Or do we just keep that for ourselves, as the best job-preservation scheme ever designed ?


Ooops. I was just about to send this to the wrong list...

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by Tim Watts <tw...@dionic.net>.
On 14/07/11 14:38, André Warnier wrote:
> Tim Watts wrote:
> ...

> "I think for this problem, I have to treat tomcat as a little, rather
> inefficient, black box .."
>

They liked that quote then? ;->>>>

<OT Rant>

I'm sure it's a lovely development environment (there must be some 
reason people use it) - all I know is it's a resource hungry bitch 
that's never happy unless it has GB's RAM and at least 2, preferably 4 
fast cores. And if you p*ss it off, it will eat your swap and burn all 
your cores at 100%. Bane of my sysadmin life...

Don't get me started on the readability of its log files!!

That's across a wide range of applications including commercial stuff 
like Confluence.

Bah - give me mod_perl (or even mod_wsgi+python) anyday...

I've got a lot done with HTML::Mason+mod_perl and very efficiently (for 
such a simple templating system) and I've been considering Mojolicious for 
fun. Learning django right now too for the cool forms+DB stuff.

Thankfully, our guys are making a switch to django away from tomcat and 
it is so much nicer to manage.

Cheers,

Tim

-- 
Tim Watts
Personal Blog: http://www.dionic.net/tim/

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by André Warnier <aw...@ice-sa.com>.
Tim Watts wrote:
...

> 
> LoL - I hate tomcat anyway (for its fatness) so I don't mind if they 
> hate me ;->
> 
> I should have clarified as "my Department's dev team" (ie the ones who 
> use tomcat here) rather than the Tomcat Developers themselves...
> 
Well, I said that too, and said I had misquoted you, but there was little I could do about 
  that next phrase of yours :

"I think for this problem, I have to treat tomcat as a little, rather inefficient, black 
box .."


RE: mod_perl output filter and mod_proxy, mod_cache

Posted by "James B. Muir" <Ja...@hitchcock.org>.
I had to bolt on an input servlet filter to tomcat once. To do this I had to write the servlet filter code and then add <filter> and <filter-mapping> tags to the application WEB-INF/web.xml file.
-James


-----Original Message-----
From: Tim Watts [mailto:tw@dionic.net]
Sent: Thursday, July 14, 2011 8:12 AM
To: mod_perl list
Subject: Re: mod_perl output filter and mod_proxy, mod_cache

On 14/07/11 12:43, André Warnier wrote:
> Hi.
>
> I have to apologise.
> I misunderstood your first post, and I wanted to verify on the Tomcat
> list, so I quoted the following passage of your first post in my message
> there :
>
> "Sadly, the tomcat dev's forgot to set any caching headers in the HTTP
> response (either Expires, Last-Modified or Cache-control) so the sites
> are largely uncacheable by browsers and the various tomcats are becoming
> overloaded."
>
> Unfortunately, the Tomcat Dev's there took it rather seriously, and as a
> consequence now your name is shit on the Tomcat list.
>
>
> .. just kidding, I did not quote your name.

LoL - I hate tomcat anyway (for its fatness) so I don't mind if they
hate me ;->

I should have clarified as "my Department's dev team" (ie the ones who
use tomcat here) rather than the Tomcat Developers themselves...

I have no doubts that jsp can be told to emit certain headers but for
some reason a lot of web developers IME often miss the finer points of
HTTP. This of course would be the correct place to do it as they can
choose different max-age times to suit the content.

I plan to run a 20 minute seminar on this specific point for my lot (and
more such seminars for other issues like security and SQL efficiency)
but that still leaves loads of old black-boxes to manage for a few years.

> Anyway, apart from a few huffed responses to my misquote (since then
> rectified), someone provided a suggestion that may not be the simplest,
> but might be helpful anyway in some cases :
>
> Have a look at : http://www.tuckey.org/urlrewrite/
>
> This is a "Java Servlet Filter", which can be added transparently
> "around" any Tomcat web application (by adding the required section in
> the web.xml config file of that web application).
> Java Servlet Filters are such that the Tomcat web application is not
> even aware that it is there, and continues to work as before. Much like
> Apache input and output filters in fact, except that a Java Servlet
> Filter is both at the same time (it "wraps" the webapp on both sides).

That could be interesting too - as long as it's something I can bolt in
without having to recompile the webapp code, I'm game. As a linux
sysadmin, I draw a clear line between the systems (my problem) and the
apps (dev team) - and not knowing java (much) I'm not qualified to mess
with their stuff... I'm happy to go as far as messing with server.xml
and web.xml though :)

> Anyway, this filter can do such things as conditionally or not adding
> response headers to anything the webapp produces. And it can do much
> more, as with time it has evolved into some kind of mish-mash of
> mod_rewrite, mod_headers and mod_proxy.
>
> It is more one-by-one work than doing something at the Apache front-end
> level or via a proxy, but it also provides better fine-tuning
> possibilities.
> So, if you can for instance easily identify the worst offenders, it
> might be an option.
>
> And it is certainly a good tool to have in one's toolcase.

I agree - I'll have a look at that after I play with Alex's suggestion
of Varnish :)

Thanks very much for your time :)

all the best,

Tim

--
Tim Watts
Personal Blog: http://www.dionic.net/tim/


Re: mod_perl output filter and mod_proxy, mod_cache

Posted by Tim Watts <tw...@dionic.net>.
On 14/07/11 12:43, André Warnier wrote:
> Hi.
>
> I have to apologise.
> I misunderstood your first post, and I wanted to verify on the Tomcat
> list, so I quoted the following passage of your first post in my message
> there :
>
> "Sadly, the tomcat dev's forgot to set any caching headers in the HTTP
> response (either Expires, Last-Modified or Cache-control) so the sites
> are largely uncacheable by browsers and the various tomcats are becoming
> overloaded."
>
> Unfortunately, the Tomcat Dev's there took it rather seriously, and as a
> consequence now your name is shit on the Tomcat list.
>
>
> .. just kidding, I did not quote your name.

LoL - I hate tomcat anyway (for its fatness) so I don't mind if they 
hate me ;->

I should have clarified as "my Department's dev team" (ie the ones who 
use tomcat here) rather than the Tomcat Developers themselves...

I have no doubts that jsp can be told to emit certain headers but for 
some reason a lot of web developers IME often miss the finer points of 
HTTP. This of course would be the correct place to do it as they can 
choose different max-age times to suit the content.

I plan to run a 20 minute seminar on this specific point for my lot (and 
more such seminars for other issues like security and SQL efficiency) 
but that still leaves loads of old black-boxes to manage for a few years.

> Anyway, apart from a few huffed responses to my misquote (since then
> rectified), someone provided a suggestion that may not be the simplest,
> but might be helpful anyway in some cases :
>
> Have a look at : http://www.tuckey.org/urlrewrite/
>
> This is a "Java Servlet Filter", which can be added transparently
> "around" any Tomcat web application (by adding the required section in
> the web.xml config file of that web application).
> Java Servlet Filters are such that the Tomcat web application is not
> even aware that it is there, and continues to work as before. Much like
> Apache input and output filters in fact, except that a Java Servlet
> Filter is both at the same time (it "wraps" the webapp on both sides).

That could be interesting too - as long as it's something I can bolt in 
without having to recompile the webapp code, I'm game. As a linux 
sysadmin, I draw a clear line between the systems (my problem) and the 
apps (dev team) - and not knowing java (much) I'm not qualified to mess 
with their stuff... I'm happy to go as far as messing with server.xml 
and web.xml though :)

> Anyway, this filter can do such things as conditionally or not adding
> response headers to anything the webapp produces. And it can do much
> more, as with time it has evolved into some kind of mish-mash of
> mod_rewrite, mod_headers and mod_proxy.
>
> It is more one-by-one work than doing something at the Apache front-end
> level or via a proxy, but it also provides better fine-tuning
> possibilities.
> So, if you can for instance easily identify the worst offenders, it
> might be an option.
>
> And it is certainly a good tool to have in one's toolcase.

I agree - I'll have a look at that after I play with Alex's suggestion 
of Varnish :)

Thanks very much for your time :)

all the best,

Tim

-- 
Tim Watts
Personal Blog: http://www.dionic.net/tim/

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by André Warnier <aw...@ice-sa.com>.
Hi.

I have to apologise.
I misunderstood your first post, and I wanted to verify on the Tomcat list, so I quoted 
the following passage of your first post in my message there :

"Sadly, the tomcat dev's forgot to set any caching headers in the HTTP response (either 
Expires, Last-Modified or Cache-control) so the sites are largely uncacheable by browsers 
and the various tomcats are becoming overloaded."

Unfortunately, the Tomcat Dev's there took it rather seriously, and as a consequence now 
your name is shit on the Tomcat list.


.. just kidding, I did not quote your name.

Anyway, apart from a few huffed responses to my misquote (since then rectified), someone 
provided a suggestion that may not be the simplest, but might be helpful anyway in some 
cases :

Have a look at : http://www.tuckey.org/urlrewrite/

This is a "Java Servlet Filter", which can be added transparently "around" any Tomcat web 
application (by adding the required section in the web.xml config file of that web 
application).
Java Servlet Filters are such that the Tomcat web application is not even aware that it is 
there, and continues to work as before.  Much like Apache input and output filters in 
fact, except that a Java Servlet Filter is both at the same time (it "wraps" the webapp on 
both sides).

Anyway, this filter can do such things as adding response headers, conditionally or not, to 
anything the webapp produces.  And it can do much more, as with time it has evolved into 
some kind of mish-mash of mod_rewrite, mod_headers and mod_proxy.

It is more one-by-one work than doing something at the Apache front-end level or via a 
proxy, but it also provides better fine-tuning possibilities.
So, if you can for instance easily identify the worst offenders, it might be an option.

And it is certainly a good tool to have in one's toolcase.




Re: mod_perl output filter and mod_proxy, mod_cache

Posted by Tim Watts <tw...@dionic.net>.
On 14/07/11 11:52, "Alex J. G. Burzyński" wrote:
> Hi Tim,
>
> If you are after caching the responses, maybe an easier solution would
> be to use a reverse proxy - like Varnish?
>
> You would be then in complete control over the incoming and outgoing
> headers and could cache responses based on the url / inject Expires
> headers so browsers could cache them too etc.
>
> Cheers,
> Alex
>

[Sorry Alex, hit reply instead of reply-list]

Hi Alex,

I was initially also thinking Squid - but it's rather heavy.

I have not come across Varnish but having a quick look (and noting it is 
available on Debian - good) it looks like a damn good option.

I think you are right - apache is great, but the order of execution of 
modules is not well documented and prone to changing (hence my original 
question here) and trying to splice effectively 3 filters together 
(proxy, header-fiddling and cache) is probably doomed to grief.

Thanks for the tip - I'm off to try that today!

All the best,

Tim

-- 
Tim Watts
Personal Blog: http://www.dionic.net/tim/

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by "Alex J. G. Burzyński" <aj...@ajgb.net>.
Hi Tim,

If you are after caching the responses, maybe an easier solution would
be to use a reverse proxy - like Varnish?

You would be then in complete control over the incoming and outgoing
headers and could cache responses based on the url / inject Expires
headers so browsers could cache them too etc.

Cheers,
Alex


On 14/07/11 11:39, Tim Watts wrote:
> On 14/07/11 11:16, André Warnier wrote:
>
> Hi Andre,
>
> Thanks for the quick reply :)
>
>> (That would probably be difficult, inefficient or both)
>>
>> Assuming that what you say about Tomcat is true (I don't know, and it
>> may be worth asking this on the Tomcat list), I can think of another way
>> to achieve what you seem to want :
>> if you can distinguish, from the request URL (or any other request
>> property), the requests that are for invariant things, then you could
>> arrange to /not/ proxy these requests to Tomcat, and serve them directly
>> from Apache httpd.
>
> Indeed that is a good idea. We are doing that for new projects for css
> and js files (apache does not proxy certain paths and picks these up
> from the local filesystem).
>
> We can't do that for the 100 odd legacy servers as no-one has time to
> delve into the java/JSP code. I need to do something "outside" of
> tomcat where possible. Just to explain, each web server is a paid-for
> project - and when it's done, it sits there for 5+ years.
>
> Only I have the time/inclination to fix this as it's killing my VMWare
> infrastructure. Because the sites are all fronted by apache in a
> similar way, one solution is likely to apply to most of the sites.
>
> I would also add that most of the sites are "dynamically" driven
> pages, even involving MySQL querying, but once launched, the data
> remains fairly static - eg GET X will always resolve to response Y.
>
> I'm planning a small seminar on the value of Cache-Control for my dev
> colleagues so they can stop making this mistake ;-> But that still
> leaves a lot of "done" projects to fix.
>
>> Which proxying method exactly are you using between Apache and Tomcat ?
>> (if you are using mod_proxy, then you are either using mod_proxy_http or
>> mod_proxy_ajp; you could also consider using mod_jk).
>
> mod_proxy_http specifically.
>
> mod_jk looks interesting for new projects (we have local tomcats for
> those now) - I think it may be a non-starter for old stuff as trying
> to retro fit it may not be so simple (our older tomcat servers are in
> a remote farm on their own machines hence the use of mod_proxy_http).
>
>> Also, what are the versions of Apache and Tomcat that you are using ?
>>
>
> Apache 2.2 (various sub versions) and both tomcat 5.5 and tomcat 6
> (but all on remote machines listening on TCP sockets).
>
> I think for this problem, I have to treat tomcat as a little, rather
> inefficient, black box and try to fixup on the apache front ends,
> hence the direction of my original idea...
>
> Cheers,
>
> Tim
>


Re: mod_perl output filter and mod_proxy, mod_cache

Posted by James Smith <js...@sanger.ac.uk>.
On 14/07/2011 11:39, Tim Watts wrote:
> On 14/07/11 11:16, André Warnier wrote:
>
> Hi Andre,
>
> Thanks for the quick reply :)
>
>> (That would probably be difficult, inefficient or both)
>>
>> Assuming that what you say about Tomcat is true (I don't know, and it
>> may be worth asking this on the Tomcat list), I can think of another way
>> to achieve what you seem to want :
>> if you can distinguish, from the request URL (or any other request
>> property), the requests that are for invariant things, then you could
>> arrange to /not/ proxy these requests to Tomcat, and serve them directly
>> from Apache httpd.
>
> Indeed that is a good idea. We are doing that for new projects for css 
> and js files (apache does not proxy certain paths and picks these up 
> from the local filesystem).
>
> We can't do that for the 100 odd legacy servers as no-one has time to 
> delve into the java/JSP code. I need to do something "outside" of 
> tomcat where possible. Just to explain, each web server is a paid-for 
> project - and when it's done, it sits there for 5+ years.
>
> Only I have the time/inclination to fix this as it's killing my VMWare 
> infrastructure. Because the sites are all fronted by apache in a 
> similar way, one solution is likely to apply to most of the sites.
>
> I would also add that most of the sites are "dynamically" driven 
> pages, even involving MySQL querying, but once launched, the data 
> remains fairly static - eg GET X will always resolve to response Y.
>
> I'm planning a small seminar on the value of Cache-Control for my dev 
> colleagues so they can stop making this mistake ;-> But that still 
> leaves a lot of "done" projects to fix.
>
>> Which proxying method exactly are you using between Apache and Tomcat ?
>> (if you are using mod_proxy, then you are either using mod_proxy_http or
>> mod_proxy_ajp; you could also consider using mod_jk).
>
> mod_proxy_http specifically.
>
> mod_jk looks interesting for new projects (we have local tomcats for 
> those now) - I think it may be a non-starter for old stuff as trying 
> to retro fit it may not be so simple (our older tomcat servers are in 
> a remote farm on their own machines hence the use of mod_proxy_http).
>
That shouldn't be an issue: you can point mod_jk at a remote machine. I 
do it a lot so that we can push the Tomcat application's output through 
our templating output filter. Tomcat produces a plain HTML page with 
none of the styling, and this is wrapped by our custom output filter. 
I'm guessing that at this stage you can do what you want with the script...

James

>> Also, what are the versions of Apache and Tomcat that you are using ?
>>
>
> Apache 2.2 (various sub versions) and both tomcat 5.5 and tomcat 6 
> (but all on remote machines listening on TCP sockets).
>
> I think for this problem, I have to treat tomcat as a little, rather 
> inefficient, black box and try to fixup on the apache front ends, 
> hence the direction of my original idea...
>
> Cheers,
>
> Tim
>




Re: mod_perl output filter and mod_proxy, mod_cache

Posted by Tim Watts <tw...@dionic.net>.
On 14/07/11 11:16, André Warnier wrote:

Hi Andre,

Thanks for the quick reply :)

> (That would probably be difficult, inefficient or both)
>
> Assuming that what you say about Tomcat is true (I don't know, and it
> may be worth asking this on the Tomcat list), I can think of another way
> to achieve what you seem to want :
> if you can distinguish, from the request URL (or any other request
> property), the requests that are for invariant things, then you could
> arrange to /not/ proxy these requests to Tomcat, and serve them directly
> from Apache httpd.

Indeed that is a good idea. We are doing that for new projects for css 
and js files (apache does not proxy certain paths and picks these up 
from the local filesystem).

We can't do that for the 100 odd legacy servers as no-one has time to 
delve into the java/JSP code. I need to do something "outside" of tomcat 
where possible. Just to explain, each web server is a paid-for project - 
and when it's done, it sits there for 5+ years.

Only I have the time/inclination to fix this as it's killing my VMWare 
infrastructure. Because the sites are all fronted by apache in a similar 
way, one solution is likely to apply to most of the sites.

I would also add that most of the sites are "dynamically" driven pages, 
even involving MySQL querying, but once launched, the data remains 
fairly static - eg GET X will always resolve to response Y.

I'm planning a small seminar on the value of Cache-Control for my dev 
colleagues so they can stop making this mistake ;-> But that still 
leaves a lot of "done" projects to fix.

> Which proxying method exactly are you using between Apache and Tomcat ?
> (if you are using mod_proxy, then you are either using mod_proxy_http or
> mod_proxy_ajp; you could also consider using mod_jk).

mod_proxy_http specifically.

mod_jk looks interesting for new projects (we have local tomcats for 
those now) - I think it may be a non-starter for old stuff as trying to 
retro fit it may not be so simple (our older tomcat servers are in a 
remote farm on their own machines hence the use of mod_proxy_http).

> Also, what are the versions of Apache and Tomcat that you are using ?
>

Apache 2.2 (various sub versions) and both tomcat 5.5 and tomcat 6 (but 
all on remote machines listening on TCP sockets).

I think for this problem, I have to treat tomcat as a little, rather 
inefficient, black box and try to fixup on the apache front ends, hence 
the direction of my original idea...
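
For illustration only, the "fix it up on the Apache front end" arrangement might be sketched with stock httpd 2.2 modules as below. This is an untested sketch, not something anyone in this thread confirmed to work: the backend hostname, port, cache directory and the one-hour lifetime are all placeholders, and whether the header and cache modules cooperate with proxied responses in the right order is precisely the open question here.

```
# Assumes mod_proxy, mod_proxy_http, mod_expires, mod_cache and
# mod_disk_cache are loaded. Hostname, port and paths are hypothetical.
ProxyPass        / http://tomcat.example.internal:8080/
ProxyPassReverse / http://tomcat.example.internal:8080/

# Add Expires / Cache-Control: max-age to responses that carry no Expires header.
ExpiresActive On
ExpiresDefault "access plus 1 hour"

# Cache responses on the front end as well.
CacheEnable disk /
CacheRoot /var/cache/apache2/mod_disk_cache
# Allow caching of responses that have no Last-Modified header.
CacheIgnoreNoLastMod On
# Lifetime in seconds when the response gives no expiry information.
CacheDefaultExpire 3600
```

mod_expires is documented to leave alone any response that already has an Expires header, so a webapp that does set its own expiry should not be overridden. Applying this to a whole site is only reasonable where GET X really does always return the same response.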

Cheers,

Tim

-- 
Tim Watts
Personal Blog: http://www.dionic.net/tim/

Re: mod_perl output filter and mod_proxy, mod_cache

Posted by André Warnier <aw...@ice-sa.com>.
Tim Watts wrote:
> Hi,
> 
> Is it in theory possible to insert a perl output filter between 
> mod_proxy and mod_cache?
> 
> Or at least between mod_proxy and the client?
> 
> 
> 
> The problem I'm trying to solve is this:
> 
> We have 100+ web servers where apache fronts a separate tomcat server 
> using mod_proxy.
> 
> Sadly, the tomcat dev's forgot to set any caching headers in the HTTP 
> response (either Expires, Last-Modified or Cache-control) so the sites 
> are largely uncacheable by browsers and the various tomcats are becoming 
> overloaded.
> 
> 1/3 of our sites are typically invariant (the production sites have 
> stable and unchanging data and most queries are via GET requests).
> 
> Therefore, the idea of forcing in some cache control headers en-route 
> and also enabling some apache caching has a good chance of working well 
> without affecting anything.
> 
> mod_headers and mod_proxy don't seem to play well together and mod-cache 
> doesn't either (probably due to lack of cache control headers in the 
> tomcat response, though I haven't proved this is actually the case).
> 
> So the thought of doing a perl based filter to insert cache-control 
> headers occurred.
> 
> It is likely I can insert such a filter on Apache 2.2 *between* 
> mod_proxy and mod_cache?
> 
> Or am I going to have to implement a filter that includes proxying 
> and/or caching?
>  
(That would probably be difficult, inefficient or both)

Assuming that what you say about Tomcat is true (I don't know, and it may be worth asking 
this on the Tomcat list), I can think of another way to achieve what you seem to want :
if you can distinguish, from the request URL (or any other request property), the requests 
that are for invariant things, then you could arrange to /not/ proxy these requests to 
Tomcat, and serve them directly from Apache httpd.
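
As a minimal sketch of that idea, assuming the invariant content lives under a URL prefix such as /static/ (the prefix, filesystem path, hostname and port below are placeholders, not taken from your setup):

```
# Hypothetical layout: /static/ is served from local disk, the rest is proxied.
Alias /static/ /var/www/site/static/

# The "!" exclusion has to come before the general ProxyPass line.
ProxyPass        /static/ !
ProxyPass        / http://tomcat.example.internal:8080/
ProxyPassReverse / http://tomcat.example.internal:8080/
```

This only helps if the files under that prefix also exist on the Apache host.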

Which proxying method exactly are you using between Apache and Tomcat ? (if you are using 
mod_proxy, then you are either using mod_proxy_http or mod_proxy_ajp; you could also 
consider using mod_jk).

Also, what are the versions of Apache and Tomcat that you are using ?