You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@httpd.apache.org by Justin Erenkrantz <je...@ebuilt.com> on 2001/09/05 10:42:46 UTC

[PATCH] Potential replacement for find_start_sequence...

I'm not totally sure I'm sold on this approach being better.  But,
I'm not sure that it is any worse either.  Don't have time to
benchmark this right now.  I'm going to throw it to the wolves and 
see what you think.  

Basically, replace the inner search with a Rabin-Karp search (which 
seemed to best describe whatever OtherBill posted - I read what you 
posted and then I looked through my books and this seems to be the 
closest one to whatever you were describing - it uses a hash value
- it's probably not even close to what you had in mind).

This patch also handles the "leftover" case independently (why there 
is a new function as a portion of the code gets called twice).  (Too
bad we don't have a "super-bucket" that would allow a portion of a
bucket to get added to another bucket.  If this were there, a lot of
code could be simplified throughout the server, I think...)

You could, in theory, replace it with a KMP search (the search 
algorithm doesn't matter too much in this patch).  I think 
Rabin-Karp gets you linear time.  But, I think we already had 
that.  However, the inner loop gets dramatically reduced (two 
mathematical formulas versus 3 ifs that may have been nullifying 
branch predicition).

I'm wondering where we are spending our time in find_start_sequence.
BTW, I know this works in the limited testing I gave it.  -- justin

Index: modules/filters/mod_include.c
===================================================================
RCS file: /home/cvs/httpd-2.0/modules/filters/mod_include.c,v
retrieving revision 1.143
diff -u -r1.143 mod_include.c
--- modules/filters/mod_include.c	2001/09/04 02:13:58	1.143
+++ modules/filters/mod_include.c	2001/09/05 08:23:31
@@ -206,6 +206,54 @@
 
 /* --------------------------- Parser functions --------------------------- */
 
+/* Rabin-Karp search as described in Sedgewick 2nd Ed, */
+int rksearch(const char *p, int pLen, const char *a, int aLen)
+{
+    const int q = 33554393;
+    const int d = 32;
+    int i, dM = 1, h1 = 0, h2 = 0;
+    int M = pLen, N = aLen;
+    for (i = 1; i < M; i++) dM = (d*dM) % q;
+    for (i = 0; i < M; i++) {
+        h1 = (h1*d+p[i]) % q;
+        h2 = (h2*d+a[i]) % q;
+    }
+    for (i = 0; h1 != h2; i++) {
+        h2 = (h2+d*q-a[i]*dM) % q;
+        h2 = (h2*d+a[i+M]) % q;
+        if (i > N - M) return N;
+    }
+    return i;
+}
+
+/* We've now found a start sequence tag... */
+static apr_bucket* found_start_sequence(apr_bucket *dptr,
+                                          include_ctx_t *ctx, 
+                                          int tagStart)
+{
+    /* We want to split the bucket at the '<'. */
+    ctx->state = PARSE_DIRECTIVE;
+    ctx->tag_length = 0;
+    ctx->parse_pos = 0;
+    ctx->tag_start_bucket = dptr;
+    ctx->tag_start_index = tagStart;
+    if (ctx->head_start_index > 0) {
+        apr_bucket *tmp_bkt;
+
+        /* Split the bucket with the start of the tag in it */
+        apr_bucket_split(ctx->head_start_bucket, ctx->head_start_index);
+        tmp_bkt = APR_BUCKET_NEXT(ctx->head_start_bucket);
+        /* If it was a one bucket match */
+        if (dptr == ctx->head_start_bucket) {
+            ctx->tag_start_bucket = tmp_bkt;
+            ctx->tag_start_index = tagStart - ctx->head_start_index;
+        }
+        ctx->head_start_bucket = tmp_bkt;
+        ctx->head_start_index = 0;
+    }
+    return ctx->head_start_bucket;
+}
+
 /* This function returns either a pointer to the split bucket containing the
  * first byte of the BEGINNING_SEQUENCE (after finding a complete match) or it
  * returns NULL if no match found.
@@ -217,8 +265,8 @@
     const char *c;
     const char *buf;
     const char *str = STARTING_SEQUENCE;
-    apr_bucket *tmp_bkt;
-    apr_size_t  start_index;
+    int slen = strlen(str);
+    int pos;
 
     *do_cleanup = 0;
 
@@ -269,8 +317,53 @@
         if (len == 0) { /* end of pipe? */
             break;
         }
+
+        /* Set our buffer to use. */
+
         c = buf;
-        while (c < buf + len) {
+        /* The last bucket had a left over partial match that we need to
+         * complete. 
+         */
+        if (ctx->state == PARSE_HEAD)
+        {
+            apr_size_t tmpLen;
+            tmpLen = (len > (slen - 1)) ? len : (slen - 1);
+
+            while (c < buf + tmpLen && *c == str[ctx->parse_pos])
+            {
+                c++; 
+                ctx->parse_pos++;
+            }
+
+            if (str[ctx->parse_pos] == '\0')
+            {
+                ctx->bytes_parsed += c - buf;
+                return found_start_sequence(dptr, ctx, c - buf);
+            }
+
+            /* False alarm... */
+            ctx->state = PRE_HEAD;
+        }
+
+        if (len)
+        {
+            pos = rksearch(str, slen, c, len);
+            if (pos != len)
+            {
+                ctx->head_start_bucket = dptr;
+                ctx->head_start_index = pos;
+                ctx->bytes_parsed += pos + slen;
+                return found_start_sequence(dptr, ctx, pos + slen);
+            }
+        }
+        
+        /* Consider the case where we have <!-- at the end of the bucket. */
+        ctx->bytes_parsed += len - slen;
+        ctx->parse_pos = 0;
+
+        c = buf + len - slen + 1;
+        while (c < buf + len)
+        {
             if (*c == str[ctx->parse_pos]) {
                 if (ctx->state == PRE_HEAD) {
                     ctx->state = PARSE_HEAD;
@@ -279,48 +372,24 @@
                 }
                 ctx->parse_pos++;
             }
-            else {
-                if (str[ctx->parse_pos] == '\0') {
-                    /* We want to split the bucket at the '<'. */
-                    ctx->bytes_parsed++;
-                    ctx->state = PARSE_DIRECTIVE;
-                    ctx->tag_length = 0;
-                    ctx->parse_pos = 0;
-                    ctx->tag_start_bucket = dptr;
-                    ctx->tag_start_index = c - buf;
-                    if (ctx->head_start_index > 0) {
-                        start_index = (c - buf) - ctx->head_start_index;
-                        apr_bucket_split(ctx->head_start_bucket, 
-                                         ctx->head_start_index);
-                        tmp_bkt = APR_BUCKET_NEXT(ctx->head_start_bucket);
-                        if (dptr == ctx->head_start_bucket) {
-                            ctx->tag_start_bucket = tmp_bkt;
-                            ctx->tag_start_index = start_index;
-                        }
-                        ctx->head_start_bucket = tmp_bkt;
-                        ctx->head_start_index = 0;
-                    }
-                    return ctx->head_start_bucket;
+            else if (ctx->parse_pos != 0) 
+            {
+                /* The reason for this, is that we need to make sure 
+                 * that we catch cases like <<!--#.  This makes the 
+                 * second check after the original check fails.
+                 * If parse_pos was already 0 then we already checked this.
+                 */
+                /* FIXME: Why? */
+                *do_cleanup = 1;
+                if (*c == str[0]) {
+                    ctx->parse_pos = 1;
+                    ctx->head_start_index = c - buf;
                 }
-                else if (ctx->parse_pos != 0) {
-                    /* The reason for this, is that we need to make sure 
-                     * that we catch cases like <<!--#.  This makes the 
-                     * second check after the original check fails.
-                     * If parse_pos was already 0 then we already checked this.
-                     */
-                    *do_cleanup = 1;
-                    if (*c == str[0]) {
-                        ctx->parse_pos = 1;
-                        ctx->state = PARSE_HEAD;
-                        ctx->head_start_bucket = dptr;
-                        ctx->head_start_index = c - buf;
-                    }
-                    else {
-                        ctx->parse_pos = 0;
-                        ctx->state = PRE_HEAD;
-                        ctx->head_start_bucket = NULL;
-                        ctx->head_start_index = 0;
-                    }
+                else {
+                    ctx->parse_pos = 0;
+                    ctx->state = PRE_HEAD;
+                    ctx->head_start_bucket = NULL;
+                    ctx->head_start_index = 0;
                 }
             }
             c++;

RE: [PATCH] Potential replacement for find_start_sequence...

Posted by Sascha Schumann <sa...@schumann.cx>.

> Btw, do we have any string matching code in apr or apr-util?  I
> can see the applicability of decent string matchers in other parts
> than mod_include.

    Note that BNDM inherits the limit of shift-and/or algos --
    they only work well for patterns which are not longer than
    the number of bits representable by a machine word
    (sizeof(int)<<3).  In an application like Apache, long search
    patterns are uncommon, but a generic string-matching facility
    would usually need to handle these.

    - Sascha                                     Experience IRCG
      http://schumann.cx/                http://schumann.cx/ircg

RE: [PATCH] Potential replacement for find_start_sequence...

Posted by Sascha Schumann <sa...@schumann.cx>.

> Btw, do we have any string matching code in apr or apr-util?  I
> can see the applicability of decent string matchers in other parts
> than mod_include.

    Note that BNDM inherits the limit of shift-and/or algos --
    they only work well for patterns which are not longer than
    the number of bits representable by a machine word
    (sizeof(int)<<3).  In an application like Apache, long search
    patterns are uncommon, but a generic string-matching facility
    would usually need to handle these.

    - Sascha                                     Experience IRCG
      http://schumann.cx/                http://schumann.cx/ircg

RE: [PATCH] Potential replacement for find_start_sequence...

Posted by Sander Striker <st...@apache.org>.

> From: Sascha Schumann [mailto:sascha@schumann.cx]
> On Wed, 5 Sep 2001, Sander Striker wrote:
> 
>>> I'm not totally sure I'm sold on this approach being better.  But,
>>> I'm not sure that it is any worse either.  Don't have time to
>>> benchmark this right now.  I'm going to throw it to the wolves and
>>> see what you think.
>>
>> Me neither.  Rabin-Karp introduces a lot of * and %.
>> I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.
> 
>     Well, there are more advanced algorithms than BM available
>     today which are even easier to implement that the original BM
>     algo.

Easier to implement?  BM is so simple, I can't imagine something
being simpler.
 
>     I'd suggest looking at BNDM which combines the advantages of
>     bit-parallelism (shift-and/-or algorithms) and suffix
>     automata (BM-style).
> 
>     http://citeseer.nj.nec.com/navarro01fast.html
> 
>     To give you an idea on how a bndm implementation looks like,
>     I'm appending an unpolished implementation I did some time
>     ago which includes a test-case for locating '<!--#'.

Indeed the LOC is low.  Very cool!  Considering that we can
precal a bndm_t for '<!--#' and '--->'.
It is certainly more advanced.

>     - Sascha                                     Experience IRCG
>       http://schumann.cx/                http://schumann.cx/ircg

Ian, since you were doing the mod_include.c optimizations anyway,
can you check this out?

Btw, do we have any string matching code in apr or apr-util?  I
can see the applicability of decent string matchers in other parts
than mod_include.

Sander

RE: [PATCH] Potential replacement for find_start_sequence...

Posted by Sander Striker <st...@apache.org>.

> From: Sascha Schumann [mailto:sascha@schumann.cx]
> On Wed, 5 Sep 2001, Sander Striker wrote:
> 
>>> I'm not totally sure I'm sold on this approach being better.  But,
>>> I'm not sure that it is any worse either.  Don't have time to
>>> benchmark this right now.  I'm going to throw it to the wolves and
>>> see what you think.
>>
>> Me neither.  Rabin-Karp introduces a lot of * and %.
>> I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.
> 
>     Well, there are more advanced algorithms than BM available
>     today which are even easier to implement that the original BM
>     algo.

Easier to implement?  BM is so simple, I can't imagine something
being simpler.
 
>     I'd suggest looking at BNDM which combines the advantages of
>     bit-parallelism (shift-and/-or algorithms) and suffix
>     automata (BM-style).
> 
>     http://citeseer.nj.nec.com/navarro01fast.html
> 
>     To give you an idea on how a bndm implementation looks like,
>     I'm appending an unpolished implementation I did some time
>     ago which includes a test-case for locating '<!--#'.

Indeed the LOC is low.  Very cool!  Considering that we can
precal a bndm_t for '<!--#' and '--->'.
It is certainly more advanced.

>     - Sascha                                     Experience IRCG
>       http://schumann.cx/                http://schumann.cx/ircg

Ian, since you were doing the mod_include.c optimizations anyway,
can you check this out?

Btw, do we have any string matching code in apr or apr-util?  I
can see the applicability of decent string matchers in other parts
than mod_include.

Sander

RE: [PATCH] Potential replacement for find_start_sequence...

Posted by Sascha Schumann <sa...@schumann.cx>.

On Wed, 5 Sep 2001, Sander Striker wrote:

> > I'm not totally sure I'm sold on this approach being better.  But,
> > I'm not sure that it is any worse either.  Don't have time to
> > benchmark this right now.  I'm going to throw it to the wolves and
> > see what you think.
>
> Me neither.  Rabin-Karp introduces a lot of * and %.
> I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.

    Well, there are more advanced algorithms than BM available
    today which are even easier to implement that the original BM
    algo.

    I'd suggest looking at BNDM which combines the advantages of
    bit-parallelism (shift-and/-or algorithms) and suffix
    automata (BM-style).

    http://citeseer.nj.nec.com/navarro01fast.html

    To give you an idea on how a bndm implementation looks like,
    I'm appending an unpolished implementation I did some time
    ago which includes a test-case for locating '<!--#'.

    - Sascha                                     Experience IRCG
      http://schumann.cx/                http://schumann.cx/ircg

RE: [PATCH] Potential replacement for find_start_sequence...

Posted by Sander Striker <st...@apache.org>.

> I'm not totally sure I'm sold on this approach being better.  But,
> I'm not sure that it is any worse either.  Don't have time to
> benchmark this right now.  I'm going to throw it to the wolves and 
> see what you think.  

Me neither.  Rabin-Karp introduces a lot of * and %.
I'll try Boyer-Moore with precalced tables for '<!--#' and '--->'.

[...]

Sander