Posted to docs@httpd.apache.org by Joshua Slive <jo...@slive.ca> on 2003/04/30 23:47:31 UTC

search.apache.org (fwd)

I haven't tried this out in detail, so I don't have any specific opinion.
But Bill has spent some time sorting out searching on apache.org, so we
may want to consider retiring our local search engine.

Joshua.


---------- Forwarded message ----------
Date: Wed, 30 Apr 2003 14:45:32 -0700
From: Bill Moseley <mo...@hank.org>
To: infrastructure@apache.org
Subject: search.apache.org

http://search.apache.org has been updated with the new index as was described here a week or
so ago.

The new index is a result of spidering instead of indexing the file system -- the index
files are about 11MB instead of 200MB.  Frees a little space on /x2, which is up at 95% full
again.

There may be things that don't need to be indexed (cvs update archives?), so let me know if
anything else should be excluded.  Or make use of robots.txt.

Site specific searches can be done by setting the "what" CGI parameter.  For example:

  http://search.apache.org/index.cgi?what=httpd&keyword=installation
or
  http://search.apache.org/index.cgi?what=docs2&keyword=installation

limit the search to just httpd.apache.org or to the 2.0 docs, respectively.
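As a minimal sketch of how such URLs could be built programmatically (Python here, purely for illustration): the "what" and "keyword" parameter names and the "httpd" and "docs2" values come from the examples above, while the helper name search_url is hypothetical, and omitting "what" to search the whole site is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://search.apache.org/index.cgi"

def search_url(keyword, what=None):
    """Build a search URL; `what` optionally limits the search to one site.

    `search_url` is a hypothetical helper, not part of the CGI script.
    Leaving out `what` to search everything is an assumption.
    """
    params = {}
    if what is not None:
        params["what"] = what
    params["keyword"] = keyword
    # urlencode escapes the values (spaces become "+").
    return BASE_URL + "?" + urlencode(params)

print(search_url("installation", what="httpd"))
print(search_url("installation", what="docs2"))
```

The two calls reproduce the example URLs above.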

The "advanced" form is just:

  http://search.apache.org/index.cgi

which allows searching over the entire site, searching by field, and setting the sort order.

There are two other features that are not shown by default.  One is to select "fuzzy"
searching, and the other is to limit searches by a date range.  I'm not sure they
need to be enabled at all.

Those features can be tested by:

  http://search.apache.org/index.cgi?full=1



And to ramble a bit...

This is all in Perl CGI, which is slow.  Plus, the highlighting code is set in the most
aggressive mode, and that's where most of the time is spent.  It's brute-force highlighting.

The CGI script runs the swish-e binary for searches, but only if there are not more than 4
swish-e binaries running as found by grepping the output from /bin/ps -Unobody -ocommand.
Still, hitting the CGI script hard will load the server, no doubt.
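The throttle described above could look roughly like the following. This is a Python sketch of the idea, not the actual Perl CGI code: the ps command line and the limit of 4 are taken from the description, while the function names and the simple substring match on "swish-e" are assumptions.

```python
import subprocess

MAX_SWISH = 4  # from the description: search only if no more than 4 are running

def count_swish(ps_output):
    """Count lines of ps output that mention swish-e (assumed substring match)."""
    return sum(1 for line in ps_output.splitlines() if "swish-e" in line)

def may_search(ps_output, limit=MAX_SWISH):
    """Return True if another swish-e search may be started."""
    return count_swish(ps_output) <= limit

def current_ps_output():
    """Run the ps command quoted above; its flags may differ on other systems."""
    result = subprocess.run(
        ["/bin/ps", "-Unobody", "-ocommand"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

# Example with canned output instead of a live ps call:
sample = "COMMAND\nhttpd\nswish-e -w installation\nswish-e -w proxy\n"
print(may_search(sample))
```

Note that a check like this is racy: several requests arriving at once can all pass it before any of them has started swish-e.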

Running under mod_perl would help, especially with highlighting turned off or down, and
using the swish-e C library (via SWISH::API module) instead of the swish-e binary.

Here are some rough requests-per-second figures, measured with ab on an Athlon XP 1800+
with 1/2GB RAM, Linux 2.4.20 and Apache/1.3.26 mod_perl/1.26.

                             Requests per Second

                               Highlighting Mode
                           Off     Phrase    Default    Simple
   ------------------------------------------------------------
   Using SWISH::API        45       1.5        2         12
   Using swish-e binary    12       1.3        1.8       7.5

As you can see, the highlighting code is the limiting factor.  I have search.apache.org set up
for the swish-e binary and "Phrase" highlighting.  The worst combination. ;)
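To put numbers on "limiting factor", here is a small Python sketch that computes how much each highlighting mode slows things down relative to "Off". The figures are the ones from the table; the helper name slowdown is just for illustration.

```python
# Requests per second, copied from the table above.
rates = {
    "SWISH::API":     {"Off": 45, "Phrase": 1.5, "Default": 2,   "Simple": 12},
    "swish-e binary": {"Off": 12, "Phrase": 1.3, "Default": 1.8, "Simple": 7.5},
}

def slowdown(backend, mode):
    """Factor by which `mode` is slower than highlighting turned off."""
    return rates[backend]["Off"] / rates[backend][mode]

for backend in rates:
    for mode in ("Phrase", "Default", "Simple"):
        print(f"{backend:15s} {mode:8s} {slowdown(backend, mode):5.1f}x slower than Off")
```

With "Phrase" highlighting the SWISH::API setup drops by a factor of 30 (45 vs. 1.5), and the binary by roughly 9 (12 vs. 1.3), so at that setting the choice of backend barely matters.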


-- 
Bill Moseley
moseley@hank.org



---------------------------------------------------------------------
To unsubscribe, e-mail: docs-unsubscribe@httpd.apache.org
For additional commands, e-mail: docs-help@httpd.apache.org


Re: search.apache.org

Posted by Bill Moseley <mo...@hank.org>.
On Thu May 01, 2003 at 12:09:23AM +0200, André Malo wrote:
> [CC infrastructure]
> 
> * Joshua wrote:
> 
> > I haven't tried this out in detail, so I don't have any specific
> > opinion. [...]
> 
> Didn't try it yet ...
> The main question is - how does it work with different
> languages/encodings, e.g. Japanese or Russian patterns?

It doesn't.  Swish only indexes 8-bit chars.  The parser converts everything to ISO-8859-1
before indexing.  The plan is to rewrite it to use UTF-8 internally.
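To illustrate why an 8859-1 conversion rules out Japanese or Russian search terms, here is a short Python sketch. It uses Python's own iso-8859-1 codec, not the swish-e parser, so what happens to unencodable characters (replaced with "?" here) is an assumption about the general effect, not a description of what swish-e does.

```python
def fits_latin1(text):
    """Return True if every character of `text` exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

def to_latin1_lossy(text):
    """Round-trip through ISO-8859-1, replacing unencodable characters with '?'."""
    return text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

for word in ("installation", "André", "インストール", "установка"):
    print(word, fits_latin1(word), to_latin1_lossy(word))
```

Western European text survives the conversion, but Japanese and Cyrillic words have no representation in the 8-bit character set, so nothing searchable is left of them.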


-- 
Bill Moseley
moseley@hank.org




Re: search.apache.org

Posted by André Malo <nd...@perlig.de>.
[CC infrastructure]

* Joshua wrote:

> I haven't tried this out in detail, so I don't have any specific
> opinion. [...]

Didn't try it yet ...
The main question is - how does it work with different
languages/encodings, e.g. Japanese or Russian patterns?

nd (still not subscribed)
-- 
"Das Verhalten von Gates hatte mir bewiesen, dass ich auf ihn und seine
beiden Gefährten nicht zu zählen brauchte" -- Karl May, "Winnetou III"

Im Westen was neues: <http://pub.perlig.de/books.html#apache2>
