Posted to modules-dev@httpd.apache.org by Sindhi Sindhi <si...@gmail.com> on 2013/05/01 14:54:24 UTC

Apache Buckets and Brigade

Hello,

Thanks a lot for providing answers to my earlier emails with subject
"Apache C++ equivalent of javax.servlet.Filter". I really appreciate your
help.

I had another question. My requirement is something like this -

I have a huge html file that I have copied into the Apache htdocs folder.
In my C++ Apache module, I want to get this html file contents and
remove/replace some strings.

Say I have an HTML file in which the string "oldString" appears 3 times.
My requirement is to replace "oldString" with the new string
"newString". I have already written a C++ function with a signature
like this -

char* processHTML(char* inHTMLString) {
//
char* newHTMLWithNewString = <code to replace all occurrences of
"oldString" with "newString">
return newHTMLWithNewString;
}

The above function does a lot more than just string replacement; it implements
a lot of business logic and finally returns the new HTML string.

I want to call processHTML() inside my C++ Apache module. As far as I know,
Apache maintains internal data structures called buckets and brigades, which
actually contain the HTML file data. My question is: does the entire HTML
file content (in my case the HTML file is huge) reside in a single
bucket? That is, when I fetch one bucket at a time from a brigade, can I be
sure that the entire HTML file data from <html> to </html> can be found in
a single bucket? For example, if my HTML file looks like this -
<html>
..
..
oldString
... oldString...........oldString..
..
</html>

When I iterate through all the buckets of a brigade, will I find my entire HTML
file content in a single bucket, or can the HTML file content be spread across
multiple buckets, say like this -

case1:
bucket-1 contents =
"<html>
..
..
oldString
... oldString...........oldString..
..
</html>"

case2:
bucket-1 contents =
"<html>
..
..
oldStr"

bucket-2 contents =
"ing
... oldString...........oldString..
..
</html>"

If it's case2, then the function processHTML() I have written will not
work, because it searches for the entire string "oldString" and in case2
"oldString" is found only partially.
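
A minimal sketch of why case2 breaks a per-bucket search, assuming the buckets are modelled as plain std::string chunks rather than real APR buckets. The helper names countOccurrences, countPerChunk and countJoined are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts non-overlapping occurrences of `needle` in `text`.
std::size_t countOccurrences(const std::string& text, const std::string& needle) {
    assert(!needle.empty());  // an empty needle would never advance
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Searches every chunk on its own, the way a per-bucket call would.
std::size_t countPerChunk(const std::vector<std::string>& chunks,
                          const std::string& needle) {
    std::size_t count = 0;
    for (const std::string& chunk : chunks) {
        count += countOccurrences(chunk, needle);
    }
    return count;
}

// Joins all chunks first, then searches the complete text once.
std::size_t countJoined(const std::vector<std::string>& chunks,
                        const std::string& needle) {
    std::string all;
    for (const std::string& chunk : chunks) {
        all += chunk;
    }
    return countOccurrences(all, needle);
}
```

With the two chunks from case2, the per-chunk search misses the occurrence that straddles the boundary, while the joined search sees all three.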

Thanks a lot.

Re: Apache Buckets and Brigade

Posted by Joshua Marantz <jm...@google.com>.

On Wed, May 1, 2013 at 9:48 AM, Sindhi Sindhi <si...@gmail.com> wrote:

> I doubt if I can use ModPagespeedSubstitute, because
> our string replacement actually uses some business logic. For ex. in
> "oldString", if i find a "old" string at offset 0 i'll replace it with
>

I agree: that configuration-only solution I proposed wouldn't meet your
needs.  Before I give up on it completely, is your business logic going to
always be in C++?  I am just trying to avoid overconstraining the solution.
 For example, would Lua be a suitable vehicle for your business logic? See
https://httpd.apache.org/docs/trunk/mod/mod_lua.html.

I wonder if there's some way mod_lua & an enhanced mod_pagespeed could work
together to provide the substitution with rich business logic.  The reason
I ask is that building & hacking mod_pagespeed's source for your purposes
is a great way to get something working, but I don't have a great answer
for how you'd maintain that over time as mod_pagespeed updates roll out.

> The HTML-centric fetch of data as you mentioned suits the best for me. But
> I dont want mod_pagespeed to actually modify anything in my HTML file, if
> it can give me either the entire HTML file OR HTML-centric fetch of data
> that will solve my problem :)
>

I'm not sure I understand your concern.  mod_pagespeed will not change the
HTML file on disk.  It acts as an Apache output filter changing the bytes
of HTML as they stream out.  As the writer of a mod_pagespeed filter, you
get to interpose C++ handlers for HTML lexical tokens (including
Characters()) and mutate those tokens before they are serialized out to the
next filter.  Is that what you want?

-Josh

Re: Apache Buckets and Brigade

Posted by Sindhi Sindhi <si...@gmail.com>.

Thanks.
I'd definitely be interested in discussing further.

There's one more thing: I doubt I can use ModPagespeedSubstitute, because
our string replacement actually uses some business logic. For example, in
"oldString", if I find an "old" string at offset 0 I'll replace it with
"new"; otherwise I'll replace it with "temp". The one I mentioned in my
previous email was just a very simple and straightforward example. When
our business logic runs over our huge HTML file, it executes a lot
more rules to decide whether it should replace "oldString" with "newString",
with "tempString", or with some other string. So for me it's critical
that the HTML tags are read completely, not partially, when the string
replacement function is called.

The HTML-centric fetch of data you mentioned suits me best. But
I don't want mod_pagespeed to actually modify anything in my HTML file. If
it can give me either the entire HTML file or an HTML-centric fetch of data,
that will solve my problem :)


On Wed, May 1, 2013 at 6:52 PM, Joshua Marantz <jm...@google.com> wrote:

> I have a crazy idea for you.  Maybe this is overkill but this sounds like
> it'd be natural to add to mod_pagespeed <http://modpagespeed.com> as a new
> filter.
>
> Here's some code you might use as a template
>
>
> https://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/rewriter/collapse_whitespace_filter.cc
>
> one thing we've thought of doing is providing a generic text-substitution
> filter that would take strings in character-blocks and do arbitrary
> substitutions in them, that could be specified in the .conf file:
>   ModPagespeedSubstitute "oldString" "newString"
>
> You are right that text-blocks in Apache output filters can be split
> arbitrarily across buckets, but mod_pagespeed takes care of that in an
> HTML-centric way, breaking up blocks on html tokens. A block of free-format
> text would be treated as a single atomic token independent of the structure
> of the incoming bucket brigade.
>
> Let me know if you'd like to discuss this further.
>
> -Josh
>

Re: Apache Buckets and Brigade

Posted by Joshua Marantz <jm...@google.com>.

On Wed, May 1, 2013 at 12:14 PM, Sindhi Sindhi <si...@gmail.com> wrote:

> Thanks to all for the reply.
>
> Josh, the concern I mentioned was, we may not want mod_pagespeed to modify
> the in-memory HTML content. The only change we may want to see in our HTML
> will be that the old strings are replaced by the new strings after applying
> our business logic which is already done by the C++ filter module I have
> written. This C++ filter implements all our business logic and takes an
> input buffer that is expected to be the entire HTML file content. So the
>

Yeah the filter I gave you gives you only a block of HTML characters.  E.g.
if you have
  <div>a b c d <i>e</i> f</div>
then you'll get this as 3 calls to Characters: "a b c d ", "e", and " f".
 Other than that, the Characters method I pointed you to has exactly the
interface you asked for: it gets you an entire block of HTML text in one
modifiable std::string which you can mutate at will.  And it irons out all
the brigade stuff for you.
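
A rough sketch of the callback shape described above, assuming a deliberately naive tokenizer. This is not mod_pagespeed's API: CharactersHandler and rewriteCharacterBlocks are hypothetical names, and the code ignores comments, <script> bodies, and '>' inside attribute values.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for a Characters()-style callback: it receives one
// block of text between tags and may modify it in place.
using CharactersHandler = std::function<void(std::string&)>;

// Copies tags through unchanged and hands each text block to `handler`
// before writing it out.
std::string rewriteCharacterBlocks(const std::string& html,
                                   const CharactersHandler& handler) {
    std::string out;
    std::string text;
    bool inTag = false;

    // Emits the pending text block, if any, after the handler has seen it.
    auto flushText = [&]() {
        if (text.empty()) return;
        handler(text);
        out += text;
        text.clear();
    };

    for (char c : html) {
        if (!inTag && c == '<') {
            flushText();  // a tag starts, so the text block is complete
            inTag = true;
            out += c;
        } else if (inTag) {
            out += c;
            if (c == '>') inTag = false;
        } else {
            text += c;
        }
    }
    flushText();  // trailing text after the last tag
    return out;
}
```

For the <div> example, the handler is called three times, once per text block, and whatever it leaves in the string is what gets serialized.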

You can configure mod_pagespeed to run just one filter so no other
modifications are made.

But it looks like you probably need to read Nick's book.  You can also read
mod_deflate.c or one of the other content-modifying filters such as mod_sed
or mod_substitute.

-Josh

Re: Apache Buckets and Brigade

Posted by Sindhi Sindhi <si...@gmail.com>.

Thanks to all for the reply.

Josh, the concern I mentioned was that we may not want mod_pagespeed to modify
the in-memory HTML content. The only change we want to see in our HTML
is that the old strings are replaced by the new strings after applying
our business logic, which is already done by the C++ filter module I have
written. This C++ filter implements all our business logic and takes an
input buffer that is expected to hold the entire HTML file content. So the
only issue is that we are not sure whether the Apache APR brigades contain the
entire contents of the HTML file. And what you understood is right: we don't
define a substitution statically; it is very much dynamic and
depends on a lot of run-time logic and rules.

Nick, thanks for the reply. Could you please send some reference links
where I can see how some of the existing HTML filters have handled this
issue? I have searched for similar issues on the internet but unfortunately
haven't found an exact solution or a way to do it :(


On Wed, May 1, 2013 at 9:04 PM, Joshua Marantz <jm...@google.com> wrote:

> I didn't know about mod_substitute or mod_sed  :)  The
> ModPagespeedSubstitute command I proposed probably adds nothing to those.
>
> But in any case that was not sufficient for Sindhi's use-case where he
> needs to impose data-dependent business logic and not statically define a
> substitution in a conf file.
>
> -Josh

Re: Apache Buckets and Brigade

Posted by Joshua Marantz <jm...@google.com>.

I didn't know about mod_substitute or mod_sed  :)  The
ModPagespeedSubstitute command I proposed probably adds nothing to those.

But in any case that was not sufficient for Sindhi's use-case where he
needs to impose data-dependent business logic and not statically define a
substitution in a conf file.

-Josh


On Wed, May 1, 2013 at 11:19 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> How is that different from mod_substitute and/or mod_sed?

Re: Apache Buckets and Brigade

Posted by Jim Jagielski <ji...@jaguNET.com>.

How is that different from mod_substitute and/or mod_sed?

On May 1, 2013, at 9:22 AM, Joshua Marantz <jm...@google.com> wrote:

> I have a crazy idea for you.  Maybe this is overkill but this sounds like
> it'd be natural to add to mod_pagespeed <http://modpagespeed.com> as a new
> filter.
> 
> Here's some code you might use as a template
> 
> https://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/rewriter/collapse_whitespace_filter.cc
> 
> one thing we've thought of doing is providing a generic text-substitution
> filter that would take strings in character-blocks and do arbitrary
> substitutions in them, that could be specified in the .conf file:
>  ModPagespeedSubstitute "oldString" "newString"
> 
> You are right that text-blocks in Apache output filters can be split
> arbitrarily across buckets, but mod_pagespeed takes care of that in an
> HTML-centric way, breaking up blocks on html tokens. A block of free-format
> text would be treated as a single atomic token independent of the structure
> of the incoming bucket brigade.
> 
> Let me know if you'd like to discuss this further.
> 
> -Josh

Re: Apache Buckets and Brigade

Posted by Joshua Marantz <jm...@google.com>.

I have a crazy idea for you.  Maybe this is overkill but this sounds like
it'd be natural to add to mod_pagespeed <http://modpagespeed.com> as a new
filter.

Here's some code you might use as a template

https://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/rewriter/collapse_whitespace_filter.cc

One thing we've thought of doing is providing a generic text-substitution
filter that would take strings in character blocks and do arbitrary
substitutions in them, specified in the .conf file:
  ModPagespeedSubstitute "oldString" "newString"

You are right that text-blocks in Apache output filters can be split
arbitrarily across buckets, but mod_pagespeed takes care of that in an
HTML-centric way, breaking up blocks on html tokens. A block of free-format
text would be treated as a single atomic token independent of the structure
of the incoming bucket brigade.
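
The idea of a text block staying atomic regardless of how the input was chunked can be sketched as follows. This is an illustration under simplifying assumptions, not mod_pagespeed code: TextBlockCollector is a hypothetical class, chunks are plain strings, and the tag detection is naive (no comments, scripts, or quoted '>').

```cpp
#include <functional>
#include <string>
#include <utility>

// Collects text between tags across chunk boundaries and reports each
// block exactly once, when the next tag (or end of stream) is reached.
class TextBlockCollector {
public:
    explicit TextBlockCollector(std::function<void(const std::string&)> onText)
        : onText_(std::move(onText)) {}

    // Accepts the next chunk; the boundary may fall anywhere in the text.
    void feed(const std::string& chunk) {
        for (char c : chunk) {
            if (!inTag_ && c == '<') {
                flush();  // a tag starts, so the pending block is complete
                inTag_ = true;
            } else if (inTag_) {
                if (c == '>') inTag_ = false;
            } else {
                pending_ += c;
            }
        }
    }

    // Call once at end of stream to report any trailing text.
    void finish() { flush(); }

private:
    void flush() {
        if (pending_.empty()) return;
        onText_(pending_);
        pending_.clear();
    }

    std::function<void(const std::string&)> onText_;
    std::string pending_;
    bool inTag_ = false;
};
```

Because the pending text is only reported when a tag begins, feeding "<p>oldStr" and then "ing</p>" yields a single block containing "oldString".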

Let me know if you'd like to discuss this further.

-Josh



Re: Apache Buckets and Brigade

Posted by Nick Kew <ni...@apache.org>.

On 1 May 2013, at 14:41, Sorin Manolache wrote:

> In my experience the buckets that I've seen have about 8 kilobytes. So you will not consume too much memory if you "flatten" the bucket brigade into one buffer and then perform the replacement in the buffer. (see apr_brigade_flatten). However, you have to provide for the case in which oldString is split between two brigades.

Don't rely on that.  The size of buckets and brigades depends entirely
on what comes before your filter.  8k chunks are common from some
sources (e.g. mod_proxy), but even there another filter could change
all that - e.g. mod_deflate if the data arrive compressed.

Bottom line: don't make assumptions, as there are no guarantees.
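
One way to avoid depending on chunk sizes, sketched here for the simple fixed-string case only: keep the last (needle length - 1) bytes of each chunk back until more data arrives, since they could be the start of a match. StreamingReplacer is a hypothetical class operating on std::string chunks, not on APR buckets, and it does not cover the rule-based logic described earlier in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Replaces every occurrence of `from` with `to` in a stream that arrives
// in chunks of arbitrary size.
class StreamingReplacer {
public:
    StreamingReplacer(std::string from, std::string to)
        : from_(std::move(from)), to_(std::move(to)) {
        assert(!from_.empty());
    }

    // Consumes one chunk and returns the output that is safe to emit now.
    std::string feed(const std::string& chunk) {
        carry_ += chunk;
        std::string out;
        std::size_t pos = 0;
        // Replace every complete match currently in the buffer.
        for (std::size_t hit = carry_.find(from_, pos);
             hit != std::string::npos; hit = carry_.find(from_, pos)) {
            out.append(carry_, pos, hit - pos);
            out += to_;
            pos = hit + from_.size();
        }
        // Hold back a tail that could be the beginning of a split match.
        std::size_t keep = std::min(from_.size() - 1, carry_.size() - pos);
        std::size_t safeEnd = carry_.size() - keep;
        out.append(carry_, pos, safeEnd - pos);
        carry_.erase(0, safeEnd);
        return out;
    }

    // Call once at end of stream; the held-back tail cannot match any more.
    std::string finish() {
        std::string out;
        out.swap(carry_);
        return out;
    }

private:
    std::string from_;
    std::string to_;
    std::string carry_;
};
```

The output is the same wherever the input happens to be cut, which is the property a filter needs when nothing is guaranteed about what comes before it.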

You can of course look at existing filters that do similar things
to yours.  Or even read about it in my book :-)

-- 
Nick Kew

Re: Apache Buckets and Brigade

Posted by Sorin Manolache <so...@gmail.com>.

On 2013-05-01 14:54, Sindhi Sindhi wrote:
> If it's case2, then the function processHTML() I have written will not
> work, because it searches for the entire string "oldString" and in case2
> "oldString" is found only partially.


Unfortunately there is no guarantee that the whole file is in one brigade.

Even if it were in one brigade, there is no guarantee that it is in a
single bucket.

So you can have case2.

In my experience the buckets that I've seen are about 8 kilobytes each. So
you will not consume too much memory if you "flatten" the bucket brigade
into one buffer and then perform the replacement in the buffer (see
apr_brigade_flatten). However, you have to provide for the case in which
oldString is split between two brigades.
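
The simplest way to provide for a split between brigades is to keep buffering until the end of the stream and only then run the replacement. The sketch below simulates that with plain strings; Chunk, BufferingFilter and processWholeDocument are hypothetical stand-ins, and in a real filter the end of stream would be signalled by an EOS bucket (APR_BUCKET_IS_EOS) rather than a bool. Note that this holds the whole document in memory, which matters for a huge file.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bucket: some data plus an end-of-stream flag.
struct Chunk {
    std::string data;
    bool eos;
};

// Stand-in for the poster's processHTML(); here only a plain replace-all.
std::string processWholeDocument(std::string html) {
    const std::string from = "oldString";
    const std::string to = "newString";
    for (std::size_t pos = html.find(from); pos != std::string::npos;
         pos = html.find(from, pos + to.size())) {
        html.replace(pos, from.size(), to);
    }
    return html;
}

// Accumulates data across calls and processes it once the stream ends.
class BufferingFilter {
public:
    // Called once per simulated brigade. Returns the output to pass on,
    // which stays empty until the end-of-stream chunk has been seen.
    std::string onBrigade(const std::vector<Chunk>& brigade) {
        for (const Chunk& chunk : brigade) {
            buffered_ += chunk.data;
            if (chunk.eos) {
                std::string out = processWholeDocument(std::move(buffered_));
                buffered_.clear();
                return out;
            }
        }
        return std::string();
    }

private:
    std::string buffered_;
};
```

A filter built this way gives the replacement function the complete document, at the cost of delaying all output until the last brigade arrives.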

Sorin
