You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Justin Erenkrantz <je...@ebuilt.com> on 2001/09/05 10:42:46 UTC
[PATCH] Potential replacement for find_start_sequence...
I'm not totally sure I'm sold on this approach being better. But,
I'm not sure that it is any worse either. Don't have time to
benchmark this right now. I'm going to throw it to the wolves and
see what you think.
Basically, replace the inner search with a Rabin-Karp search (which
seemed to best describe whatever OtherBill posted - I read what you
posted and then I looked through my books and this seems to be the
closest one to whatever you were describing - it uses a hash value
- it's probably not even close to what you had in mind).
This patch also handles the "leftover" case independently (why there
is a new function as a portion of the code gets called twice). (Too
bad we don't have a "super-bucket" that would allow a portion of a
bucket to get added to another bucket. If this were there, a lot of
code could be simplified throughout the server, I think...)
You could, in theory, replace it with a KMP search (the search
algorithm doesn't matter too much in this patch). I think
Rabin-Karp gets you linear time. But, I think we already had
that. However, the inner loop gets dramatically reduced (two
mathematical formulas versus 3 ifs that may have been nullifying
branch predicition).
I'm wondering where we are spending our time in find_start_sequence.
BTW, I know this works in the limited testing I gave it. -- justin
Index: modules/filters/mod_include.c
===================================================================
RCS file: /home/cvs/httpd-2.0/modules/filters/mod_include.c,v
retrieving revision 1.143
diff -u -r1.143 mod_include.c
--- modules/filters/mod_include.c 2001/09/04 02:13:58 1.143
+++ modules/filters/mod_include.c 2001/09/05 08:23:31
@@ -206,6 +206,54 @@
/* --------------------------- Parser functions --------------------------- */
+/* Rabin-Karp search as described in Sedgewick 2nd Ed, */
+int rksearch(const char *p, int pLen, const char *a, int aLen)
+{
+ const int q = 33554393;
+ const int d = 32;
+ int i, dM = 1, h1 = 0, h2 = 0;
+ int M = pLen, N = aLen;
+ for (i = 1; i < M; i++) dM = (d*dM) % q;
+ for (i = 0; i < M; i++) {
+ h1 = (h1*d+p[i]) % q;
+ h2 = (h2*d+a[i]) % q;
+ }
+ for (i = 0; h1 != h2; i++) {
+ h2 = (h2+d*q-a[i]*dM) % q;
+ h2 = (h2*d+a[i+M]) % q;
+ if (i > N - M) return N;
+ }
+ return i;
+}
+
+/* We've now found a start sequence tag... */
+static apr_bucket* found_start_sequence(apr_bucket *dptr,
+ include_ctx_t *ctx,
+ int tagStart)
+{
+ /* We want to split the bucket at the '<'. */
+ ctx->state = PARSE_DIRECTIVE;
+ ctx->tag_length = 0;
+ ctx->parse_pos = 0;
+ ctx->tag_start_bucket = dptr;
+ ctx->tag_start_index = tagStart;
+ if (ctx->head_start_index > 0) {
+ apr_bucket *tmp_bkt;
+
+ /* Split the bucket with the start of the tag in it */
+ apr_bucket_split(ctx->head_start_bucket, ctx->head_start_index);
+ tmp_bkt = APR_BUCKET_NEXT(ctx->head_start_bucket);
+ /* If it was a one bucket match */
+ if (dptr == ctx->head_start_bucket) {
+ ctx->tag_start_bucket = tmp_bkt;
+ ctx->tag_start_index = tagStart - ctx->head_start_index;
+ }
+ ctx->head_start_bucket = tmp_bkt;
+ ctx->head_start_index = 0;
+ }
+ return ctx->head_start_bucket;
+}
+
/* This function returns either a pointer to the split bucket containing the
* first byte of the BEGINNING_SEQUENCE (after finding a complete match) or it
* returns NULL if no match found.
@@ -217,8 +265,8 @@
const char *c;
const char *buf;
const char *str = STARTING_SEQUENCE;
- apr_bucket *tmp_bkt;
- apr_size_t start_index;
+ int slen = strlen(str);
+ int pos;
*do_cleanup = 0;
@@ -269,8 +317,53 @@
if (len == 0) { /* end of pipe? */
break;
}
+
+ /* Set our buffer to use. */
+
c = buf;
- while (c < buf + len) {
+ /* The last bucket had a left over partial match that we need to
+ * complete.
+ */
+ if (ctx->state == PARSE_HEAD)
+ {
+ apr_size_t tmpLen;
+ tmpLen = (len > (slen - 1)) ? len : (slen - 1);
+
+ while (c < buf + tmpLen && *c == str[ctx->parse_pos])
+ {
+ c++;
+ ctx->parse_pos++;
+ }
+
+ if (str[ctx->parse_pos] == '\0')
+ {
+ ctx->bytes_parsed += c - buf;
+ return found_start_sequence(dptr, ctx, c - buf);
+ }
+
+ /* False alarm... */
+ ctx->state = PRE_HEAD;
+ }
+
+ if (len)
+ {
+ pos = rksearch(str, slen, c, len);
+ if (pos != len)
+ {
+ ctx->head_start_bucket = dptr;
+ ctx->head_start_index = pos;
+ ctx->bytes_parsed += pos + slen;
+ return found_start_sequence(dptr, ctx, pos + slen);
+ }
+ }
+
+ /* Consider the case where we have <!-- at the end of the bucket. */
+ ctx->bytes_parsed += len - slen;
+ ctx->parse_pos = 0;
+
+ c = buf + len - slen + 1;
+ while (c < buf + len)
+ {
if (*c == str[ctx->parse_pos]) {
if (ctx->state == PRE_HEAD) {
ctx->state = PARSE_HEAD;
@@ -279,48 +372,24 @@
}
ctx->parse_pos++;
}
- else {
- if (str[ctx->parse_pos] == '\0') {
- /* We want to split the bucket at the '<'. */
- ctx->bytes_parsed++;
- ctx->state = PARSE_DIRECTIVE;
- ctx->tag_length = 0;
- ctx->parse_pos = 0;
- ctx->tag_start_bucket = dptr;
- ctx->tag_start_index = c - buf;
- if (ctx->head_start_index > 0) {
- start_index = (c - buf) - ctx->head_start_index;
- apr_bucket_split(ctx->head_start_bucket,
- ctx->head_start_index);
- tmp_bkt = APR_BUCKET_NEXT(ctx->head_start_bucket);
- if (dptr == ctx->head_start_bucket) {
- ctx->tag_start_bucket = tmp_bkt;
- ctx->tag_start_index = start_index;
- }
- ctx->head_start_bucket = tmp_bkt;
- ctx->head_start_index = 0;
- }
- return ctx->head_start_bucket;
+ else if (ctx->parse_pos != 0)
+ {
+ /* The reason for this, is that we need to make sure
+ * that we catch cases like <<!--#. This makes the
+ * second check after the original check fails.
+ * If parse_pos was already 0 then we already checked this.
+ */
+ /* FIXME: Why? */
+ *do_cleanup = 1;
+ if (*c == str[0]) {
+ ctx->parse_pos = 1;
+ ctx->head_start_index = c - buf;
}
- else if (ctx->parse_pos != 0) {
- /* The reason for this, is that we need to make sure
- * that we catch cases like <<!--#. This makes the
- * second check after the original check fails.
- * If parse_pos was already 0 then we already checked this.
- */
- *do_cleanup = 1;
- if (*c == str[0]) {
- ctx->parse_pos = 1;
- ctx->state = PARSE_HEAD;
- ctx->head_start_bucket = dptr;
- ctx->head_start_index = c - buf;
- }
- else {
- ctx->parse_pos = 0;
- ctx->state = PRE_HEAD;
- ctx->head_start_bucket = NULL;
- ctx->head_start_index = 0;
- }
+ else {
+ ctx->parse_pos = 0;
+ ctx->state = PRE_HEAD;
+ ctx->head_start_bucket = NULL;
+ ctx->head_start_index = 0;
}
}
c++;
RE: [PATCH] Potential replacement for find_start_sequence...
Posted by Sascha Schumann <sa...@schumann.cx>.
> Btw, do we have any string matching code in apr or apr-util? I
> can see the applicability of decent string matchers in other parts
> than mod_include.
Note that BNDM inherits the limit of shift-and/or algos --
they only work well for patterns which are not longer than
the number of bits representable by a machine word
(sizeof(int)<<3). In an application like Apache, long search
patterns are uncommon, but a generic string-matching facility
would usually need to handle these.
- Sascha Experience IRCG
http://schumann.cx/ http://schumann.cx/ircg
RE: [PATCH] Potential replacement for find_start_sequence...
Posted by Sascha Schumann <sa...@schumann.cx>.
> Btw, do we have any string matching code in apr or apr-util? I
> can see the applicability of decent string matchers in other parts
> than mod_include.
Note that BNDM inherits the limit of shift-and/or algos --
they only work well for patterns which are not longer than
the number of bits representable by a machine word
(sizeof(int)<<3). In an application like Apache, long search
patterns are uncommon, but a generic string-matching facility
would usually need to handle these.
- Sascha Experience IRCG
http://schumann.cx/ http://schumann.cx/ircg
RE: [PATCH] Potential replacement for find_start_sequence...
Posted by Sander Striker <st...@apache.org>.
> From: Sascha Schumann [mailto:sascha@schumann.cx]
> On Wed, 5 Sep 2001, Sander Striker wrote:
>
>>> I'm not totally sure I'm sold on this approach being better. But,
>>> I'm not sure that it is any worse either. Don't have time to
>>> benchmark this right now. I'm going to throw it to the wolves and
>>> see what you think.
>>
>> Me neither. Rabin-Karp introduces a lot of * and %.
>> I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.
>
> Well, there are more advanced algorithms than BM available
> today which are even easier to implement that the original BM
> algo.
Easier to implement? BM is so simple, I can't imagine something
being simpler.
> I'd suggest looking at BNDM which combines the advantages of
> bit-parallelism (shift-and/-or algorithms) and suffix
> automata (BM-style).
>
> http://citeseer.nj.nec.com/navarro01fast.html
>
> To give you an idea on how a bndm implementation looks like,
> I'm appending an unpolished implementation I did some time
> ago which includes a test-case for locating '<!--#'.
Indeed the LOC is low. Very cool! Considering that we can
precal a bndm_t for '<!--#' and '--->'.
It is certainly more advanced.
> - Sascha Experience IRCG
> http://schumann.cx/ http://schumann.cx/ircg
Ian, since you were doing the mod_include.c optimizations anyway,
can you check this out?
Btw, do we have any string matching code in apr or apr-util? I
can see the applicability of decent string matchers in other parts
than mod_include.
Sander
RE: [PATCH] Potential replacement for find_start_sequence...
Posted by Sander Striker <st...@apache.org>.
> From: Sascha Schumann [mailto:sascha@schumann.cx]
> On Wed, 5 Sep 2001, Sander Striker wrote:
>
>>> I'm not totally sure I'm sold on this approach being better. But,
>>> I'm not sure that it is any worse either. Don't have time to
>>> benchmark this right now. I'm going to throw it to the wolves and
>>> see what you think.
>>
>> Me neither. Rabin-Karp introduces a lot of * and %.
>> I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.
>
> Well, there are more advanced algorithms than BM available
> today which are even easier to implement that the original BM
> algo.
Easier to implement? BM is so simple, I can't imagine something
being simpler.
> I'd suggest looking at BNDM which combines the advantages of
> bit-parallelism (shift-and/-or algorithms) and suffix
> automata (BM-style).
>
> http://citeseer.nj.nec.com/navarro01fast.html
>
> To give you an idea on how a bndm implementation looks like,
> I'm appending an unpolished implementation I did some time
> ago which includes a test-case for locating '<!--#'.
Indeed the LOC is low. Very cool! Considering that we can
precal a bndm_t for '<!--#' and '--->'.
It is certainly more advanced.
> - Sascha Experience IRCG
> http://schumann.cx/ http://schumann.cx/ircg
Ian, since you were doing the mod_include.c optimizations anyway,
can you check this out?
Btw, do we have any string matching code in apr or apr-util? I
can see the applicability of decent string matchers in other parts
than mod_include.
Sander
RE: [PATCH] Potential replacement for find_start_sequence...
Posted by Sascha Schumann <sa...@schumann.cx>.
On Wed, 5 Sep 2001, Sander Striker wrote:
> > I'm not totally sure I'm sold on this approach being better. But,
> > I'm not sure that it is any worse either. Don't have time to
> > benchmark this right now. I'm going to throw it to the wolves and
> > see what you think.
>
> Me neither. Rabin-Karp introduces a lot of * and %.
> I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.
Well, there are more advanced algorithms than BM available
today which are even easier to implement that the original BM
algo.
I'd suggest looking at BNDM which combines the advantages of
bit-parallelism (shift-and/-or algorithms) and suffix
automata (BM-style).
http://citeseer.nj.nec.com/navarro01fast.html
To give you an idea on how a bndm implementation looks like,
I'm appending an unpolished implementation I did some time
ago which includes a test-case for locating '<!--#'.
- Sascha Experience IRCG
http://schumann.cx/ http://schumann.cx/ircg
RE: [PATCH] Potential replacement for find_start_sequence...
Posted by Sander Striker <st...@apache.org>.
> I'm not totally sure I'm sold on this approach being better. But,
> I'm not sure that it is any worse either. Don't have time to
> benchmark this right now. I'm going to throw it to the wolves and
> see what you think.
Me neither. Rabin-Karp introduces a lot of * and %.
I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.
[...]
Sander