Posted to dev@httpd.apache.org by Aaron Bannert <aa...@clove.org> on 2002/05/10 02:18:42 UTC

How I Think Filters Should Work

> > That just sounds like the same thing with a blocking or non-blocking
> > flag. To be honest, I don't see how any input filters would need anything
> > except one bucket at a time. If the filter doesn't need it, it passes
> > it downstream, otherwise it chugs and spits out other buckets. What else
> > is there?
> 
> Yuck.  I think it'd be possible for input filters to buffer up or
> modify data and then pass them up with multiple buckets in a
> brigade rather than one bucket.  Think of a mod_deflate input
> filter.  -- justin

Let me be more precise. I'm not saying that we shouldn't use
brigades. What I'm saying is we shouldn't be dealing with specific types
of data at this level. Right now, by requiring a filter to request
"bytes" or "lines", we are seriously constraining the performance of
the filters. A filter should only inspect the types of the buckets it
retrieves and then move on. The bytes should only come in to play once
we have actually retrieved a bucket of a certain type that we are able
to process.

Furthermore, we should be using a dynamic type system, and liberally
creating new bucket types as we invent new implementations. Filters need
not know which filters are upstream or downstream from them, but they
should have been strategically placed to consume certain buckets from
upstream filters and to produce certain buckets required by downstream
filters.
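As a rough illustration of this idea (a self-contained toy model, not the real APR bucket API; every name below is invented), a filter would dispatch purely on a bucket's type tag and pass anything it does not recognize straight through:

```c
#include <stddef.h>

/* Hypothetical minimal model of a dynamically typed bucket: the type is a
 * unique identity pointer, so any module can mint a new bucket type without
 * central registration. */
typedef struct bucket {
    const void *type;   /* compared by pointer identity */
    const char *data;
    size_t      len;
} bucket;

static const char TYPE_HEADER[1];     /* unique addresses serve as type tags */
static const char TYPE_BODY_DATA[1];

/* A header-casing filter: it inspects only the type tag. Buckets it does
 * not understand are left for the next filter. Returns 1 if the bucket was
 * consumed and transformed, 0 if it should simply be passed along. */
static int filter_one(bucket *b, char *out) {
    if (b->type != TYPE_HEADER)
        return 0;                           /* not ours: pass through */
    for (size_t i = 0; i < b->len; i++) {   /* trivial stand-in "work" */
        char c = b->data[i];
        out[i] = (c >= 'a' && c <= 'z') ? (char)(c - ('a' - 'A')) : c;
    }
    return 1;
}
```

The point of the identity-pointer comparison is that the filter never parses bytes to decide whether a bucket is its business.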


[Warning: long-winded brainstorm follows:]


I want a typical filter chain to look like this:

input_source  --->  protocol filters  -->  sub-protocol filters  --> handlers

an input socket would produce this:

SOCKET
EOS

an http header parser filter would produce these:

HEADER
HEADER
HEADER
DATA (extra data read past headers)
SOCKET
EOS

an http request parser would work only at the request level, performing
dechunking, handling Content-Length, and dealing with pipelined
requests. It would produce these:

BEGIN_OF_REQUEST
HEADERS
BEGIN_OF_BODY_DATA
BODY_DATA
BODY_DATA
BODY_DATA
BODY_DATA
END_OF_BODY_DATA
TRAILERS...
END_OF_REQUEST
... and so on

a multipart input handler would then pass all types except BODY_DATA,
which it could use to produce:

...
MULTIPART_SECTION_BEGIN
BODY_DATA
MULTIPART_SECTION_END
...

or a magic mime filter could simply buffer enough BODY_DATA buckets until
it knew the type, prepending a MIME_TYPE to the front and sending
the whole thing downstream.

...
MIME_TYPE
BODY_DATA
BODY_DATA
...
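A sketch of how that buffering might work (self-contained toy code, not a real module; the single four-byte magic number is just an example):

```c
#include <string.h>
#include <stddef.h>

/* Toy model of the magic-MIME idea: hold body bytes aside until we have
 * enough to sniff, then report the type that would go into the prepended
 * MIME_TYPE bucket before the buffered data is sent downstream. */
typedef struct {
    char   held[64];
    size_t len;
} mime_sniffer;

/* Feed body data in; returns NULL while still undecided, or the sniffed
 * type once at least 4 bytes have been buffered. */
static const char *mime_feed(mime_sniffer *s, const char *data, size_t n) {
    size_t room = sizeof(s->held) - s->len;
    size_t take = n < room ? n : room;
    memcpy(s->held + s->len, data, take);
    s->len += take;
    if (s->len < 4)
        return NULL;                     /* keep buffering */
    return memcmp(s->held, "%PDF", 4) == 0 ? "application/pdf"
                                           : "text/plain";
}
```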


The basic pattern for any input filter (which is pull-based at the moment
in Apache) would be the following:

1. retrieve next "abstract data unit"
2. inspect "abstract data unit", can we operate on it?
3. if yes, operate_on(unit) and pass the result to the next filter.
4. if no, pass the current unit to the next filter.
5. go to #1

In this model, the operate_on() behavior has been separated from the
mechanics of passing data around. I believe this would improve filter
performance as well as simplify the implementation details that
module authors must understand. I also think this would dramatically
improve the extensibility of the Apache filter system.
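Under the same toy assumptions as before (invented bucket types, a doubling function standing in for the filter's real work), the five-step pattern might look like:

```c
#include <stddef.h>

/* Self-contained sketch of the five-step input-filter pattern; the bucket
 * model and the operate_on() stand-in are invented for illustration. */
typedef enum { B_HEADER, B_BODY_DATA, B_EOS } btype;

typedef struct bucket {
    btype type;
    int   value;                /* stand-in for the payload */
    struct bucket *next;
} bucket;

static int operate_on(int v) { return v * 2; }   /* the filter's real work */

/* Walk the chain: operate on units we understand, pass the rest as-is. */
static void run_filter(bucket *head) {
    for (bucket *b = head; b != NULL; b = b->next) { /* 1. retrieve next    */
        if (b->type == B_BODY_DATA)                  /* 2. can we operate?  */
            b->value = operate_on(b->value);         /* 3. operate and pass */
        /* 4. otherwise pass unchanged; 5. loop to the next unit */
    }
}
```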

[Sorry for the long brain dump. Some of these ideas have been floating
around in my head for a long time. When they become clear enough I will
write up a more formal and concise proposal on how I think the future
filter system should work (possibly for 2.1 or beyond). I think the
apr-serf project is a perfect place to play with some of these ideas. I
would appreciate any constructive comments to the above. ]

-aaron



Re: How I Think Filters Should Work

Posted by Greg Stein <gs...@lyra.org>.
On Thu, May 09, 2002 at 05:18:42PM -0700, Aaron Bannert wrote:
>...
> The basic pattern for any input filter (which is pull-based at the moment
> in Apache) would be the following:
> 
> 1. retrieve next "abstract data unit"
> 2. inspect "abstract data unit", can we operate on it?
> 3. if yes, operate_on(unit) and pass the result to the next filter.
> 4. if no, pass the current unit to the next filter.
> 5. go to #1
> 
> In this model, the operate_on() behavior has been separated from the
> mechanics of passing data around. I believe this would improve filter

That's fine, as long as you ensure that the retrieval can be bounded. When
the HTTP processor realizes that it can only read 100 more bytes from the
next-filter, then you're outside of "abstract data unit" and into "concrete
100 bytes."

Due to the presence of the Upgrade: header, an HTTP processing filter must
always be per-request, and must never read past the end of its request. That
enforces a number of limitations on your design.

[ unless you go for "pushback", which I believe is a poor design. ]

What would be neat is to have a connection-level filter that does HTTP
processing, but can be signalled to morph itself into a simple buffer. For
example, let's say that filter pulls 10k from next-filter ("pull" here,
remember). It parses up the data into some headers and a 500 byte body. It
has 9k leftover, which it holds to the side.

Now, the request processor sees an "Upgrade" and switches protocols to
something else entirely. The connection filter gets demoted to a simple
buffer, returning the 9k without processing. When the buffer is empty, it
removes itself from the filter stack.
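A minimal sketch of that demotion, with an entirely invented API (in real code this would live in the connection filter's read callback):

```c
#include <string.h>
#include <stddef.h>

/* Toy model of the morphing connection filter: once `demoted` is set (say,
 * by a protocol-change hook after an Upgrade), reads only drain the held
 * leftover; when it is empty the filter asks to be removed from the stack. */
typedef struct {
    const char *leftover;   /* bytes read past the end of the request */
    size_t      held, pos;
    int         demoted;
} conn_filter;

/* Returns bytes copied; sets *remove_me once the buffer is drained. */
static size_t conn_read(conn_filter *f, char *out, size_t want, int *remove_me) {
    *remove_me = 0;
    if (!f->demoted)
        return 0;           /* normal HTTP parsing path elided */
    size_t avail = f->held - f->pos;
    size_t n = want < avail ? want : avail;
    memcpy(out, f->leftover + f->pos, n);
    f->pos += n;
    if (f->pos == f->held)
        *remove_me = 1;     /* empty: remove self from the filter stack */
    return n;
}
```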

The implication here is that filters need to register with particular hooks
in the server. In particular, with a hook to state that a protocol change
has occurred on <this> connection (also implying an input and an output
filter stack). The protocol-related filters in the stack can then take
appropriate action (in the above example, to disable HTTP processing and
just be a buffer). Other subsystems may have also registered with the hook
and will *install* new protocol handler filters.

You could even use this protocol-change hook to set up the initial HTTP
processing filters. Go from "null" protocol to "http", and that installs
your chunking, http processing, etc. It could even be the mechanism which
tells the MPM to call ap_run_request (??) (the app-level thing which starts
sucking input from the filter stack and processing it).
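A hook of that shape might be registered and run roughly like this (all names invented; Apache's real hook machinery is macro-generated and more elaborate):

```c
#include <stddef.h>

/* Invented sketch of a protocol-change hook: subsystems register callbacks
 * that fire when a connection switches protocols, so each can install or
 * remove the filters appropriate to the new protocol. */
typedef void (*protocol_hook)(const char *from, const char *to, void *conn);

static protocol_hook hooks[8];
static size_t nhooks;

static void register_protocol_hook(protocol_hook h) {
    if (nhooks < sizeof(hooks) / sizeof(hooks[0]))
        hooks[nhooks++] = h;
}

static void run_protocol_change(const char *from, const char *to, void *conn) {
    for (size_t i = 0; i < nhooks; i++)
        hooks[i](from, to, conn);  /* e.g. "null" -> "http" installs HTTP filters */
}
```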

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: How I Think Filters Should Work

Posted by Greg Stein <gs...@lyra.org>.
On Thu, May 09, 2002 at 05:57:16PM -0700, Justin Erenkrantz wrote:
>...
> I really think you're talking about a push-based filter system.
> However, it seems that there was a conscious decision to use
> pull for input-filters.  I wasn't around when that discussion
> was made.  I'd like to hear the rationale for using pull for
> input filters.

Historical. Handlers "pull" input data. Thus, the input filter stack also
needed to be a pull mechanism.

In apr-serf, I've advocated providing both models to the application. The
app can push content at the network, or the network can pull content from
the app. Also, the app can pull input from the network, or the network can
push input at the app.

Note that a push-based filter stack can be used in a pull-fashion. When the
app wants to pull content, the subsystem tells the network endpoint to push
data into the filter stack. The data is then captured on the other side, and
an appropriate amount is returned to the app (and the rest is buffered off
to the side).
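One way to picture that capture step (invented toy code; the endpoint's "push" is simulated by a direct call):

```c
#include <string.h>
#include <stddef.h>

/* Toy model of using a push stack in pull fashion: the endpoint pushes
 * whatever it has into a capture buffer at the bottom of the stack; the
 * app's pull then takes what it asked for and leaves the rest buffered. */
typedef struct {
    char   buf[256];
    size_t len, pos;
} capture;

/* The network-facing side: the endpoint pushes data in. */
static void push_into(capture *c, const char *data, size_t n) {
    size_t room = sizeof(c->buf) - c->len;
    size_t take = n < room ? n : room;
    memcpy(c->buf + c->len, data, take);
    c->len += take;
}

/* The app-facing pull: return up to `want` bytes of captured data. */
static size_t pull_from(capture *c, char *out, size_t want) {
    size_t avail = c->len - c->pos;
    size_t n = want < avail ? want : avail;
    memcpy(out, c->buf + c->pos, n);
    c->pos += n;
    return n;
}
```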

>...
> Sending metadata down is a big change.  Again, I *think* this was
> discussed before, but was determined that this wasn't the right way.

No. We think it is right, but it was too big of a change for Apache. Too
much code simply likes to write to r->output_headers.

>...
> (If we do this for input filters, I think we need to do the
> same for output filters.)

The filter stack "should" transport all metadata. The request_rec is an
out-of-band data delivery that hurts us quite a bit in a filter-stack world.

> > around in my head for a long time. When they become clear enough I will
> > write up a more formal and concise proposal on how I think the future
> > filter system should work (possible for 2.1 or beyond). I think the
> > apr-serf project is a perfect place to play with some of these ideas. I
> > would appreciate any constructive comments to the above. ]

I would totally agree. My hope is that apr-serf can establish a new
substrate for the filter systems. It is only a client, though, so it would
be used by proxy, but not by the MPM/listener stuff in Apache (the filter
stack code would be; just not the standard HTTP client endpoints).

> I'm not sure I'm happy that so early in the 2.0 series that we're
> already concerned about the input filtering.  I don't think it's
> ever been "right" - probably because it was ignored for so long.
> It's showing now.  If this prevents people from writing good
> input filters, I think we need to fix this sooner rather than
> later.  -- justin

The input stuff works, but it could probably be better. At a minimum, it
probably makes some sense to have a mode that says "give me as much of the
request as you feel cozy giving me." That would allow the input filters to
return a SOCKET rather than a bunch of 8k buckets. However, to really make
it work (at all, and "best"), we would need a variant of the SOCKET bucket.
It would allow us to share the apr_socket_t and apply a read-limit on the
thing. Thus, you could say "here are 1000 bytes, read from a socket." That
would give you delayed read from the socket (and possible later optimization
of doing a sendfile() from the socket fd into a file fd), yet apply the
appropriate request-boundary limitations.
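The read-limit idea could be modeled like this (a plain counter stands in for the shared apr_socket_t; the names are invented):

```c
#include <stddef.h>

/* Sketch of the read-limited SOCKET bucket variant: it shares the
 * underlying descriptor but clamps every read to the bytes still owed to
 * this request, so a filter can never read past the request boundary. */
typedef struct {
    int    fd;       /* shared socket descriptor (unused stand-in here) */
    size_t limit;    /* bytes this request may still consume */
} limited_socket;

/* Returns how many bytes the caller may actually read this time. */
static size_t limited_budget(limited_socket *b, size_t want) {
    size_t n = want < b->limit ? want : b->limit;
    b->limit -= n;   /* real code would now recv() n bytes from b->fd */
    return n;
}
```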

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: How I Think Filters Should Work

Posted by Greg Ames <gr...@apache.org>.
sorry for the fat finger post.

Justin Erenkrantz wrote:

> (As Manoj kidded me last night, you and I seem to retrace old
> discussions coming to the same conclusions Ryan and he did.)
> So, I think we need some of the old people to tell us why we
> aren't doing this.

dang!  if Manoj & Ryan are old people, I'm a friggin' mummy.

Greg


Re: How I Think Filters Should Work

Posted by Justin Erenkrantz <je...@apache.org>.
On Thu, May 09, 2002 at 05:18:42PM -0700, Aaron Bannert wrote:
> Let me be more precise. I'm not saying that we shouldn't use
> brigades. What I'm saying is we shouldn't be dealing with specific types
> of data at this level. Right now, by requiring a filter to request
> "bytes" or "lines", we are seriously constraining the performance of
> the filters. A filter should only inspect the types of the buckets it
> retrieves and then move on. The bytes should only come in to play once
> we have actually retrieved a bucket of a certain type that we are able
> to process.

I really think you're talking about a push-based filter system.
However, it seems that there was a conscious decision to use
pull for input-filters.  I wasn't around when that discussion
was made.  I'd like to hear the rationale for using pull for
input filters.

> HEADER
> HEADER
> HEADER
> DATA (extra data read past headers)
> SOCKET
> EOS

Sending metadata down is a big change.  Again, I *think* this was
discussed before, but was determined that this wasn't the right way.
I think we're going down a path that was discussed before.
(As Manoj kidded me last night, you and I seem to retrace old
discussions coming to the same conclusions Ryan and he did.)
So, I think we need some of the old people to tell us why we
aren't doing this.

(If we do this for input filters, I think we need to do the
same for output filters.)

> around in my head for a long time. When they become clear enough I will
> write up a more formal and concise proposal on how I think the future
> filter system should work (possible for 2.1 or beyond). I think the
> apr-serf project is a perfect place to play with some of these ideas. I
> would appreciate any constructive comments to the above. ]

I'm not sure I'm happy that so early in the 2.0 series that we're
already concerned about the input filtering.  I don't think it's
ever been "right" - probably because it was ignored for so long.
It's showing now.  If this prevents people from writing good
input filters, I think we need to fix this sooner rather than
later.  -- justin