Posted to modperl@perl.apache.org by Nick Tonkin <ni...@rlnt.net> on 2001/11/27 21:55:44 UTC
[OT] Re: search.cpan.org
Because it does a full text search of all the contents of the DB.
~~~~~~~~~~~
Nick Tonkin
On Tue, 27 Nov 2001, Robert Landrum wrote:
> Does anyone know why search.cpan.org is always the s-l-o-w-e-s-t site
> on the internet? I can't believe it's always busy. I've had trouble
> hitting it at 3 AM.
>
> Maybe it's just me...
>
>
> Rob
>
> --
> "Only two things are infinite: The universe, and human stupidity. And I'm not
> sure about the former." --Albert Einstein
>
Re: [OT] Re: search.cpan.org
Posted by Bill Moseley <mo...@hank.org>.
At 09:02 PM 11/27/01 +0000, Mark Maunder wrote:
>I'm using it on our site and searching fulltext
>indexes on three fields (including a large text field) in under 3 seconds
>on over 70,000 records on a p550 with 490 megs of ram.
>
>
Hi Mark,
<plug>
Some day if you are bored, try indexing with swish-e (the development
version).
http://swish-e.org
The big problem with it right now is it doesn't do incremental indexing.
One of the developers is trying to get that working within a few weeks.
But for most small sets of files it's not an issue since indexing is so fast.
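To make "incremental indexing" concrete, here is a minimal sketch in Python. It is only an illustration of the idea and says nothing about how swish-e stores its index; the class and method names are made up for the example. The point is that if the index remembers which words each document contributed, one changed file can be replaced without rebuilding everything.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that can add, replace, or drop one document at a time."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of doc ids
        self.doc_words = {}                # doc id -> set of words it contributed

    def remove(self, doc_id):
        # Forget a document by deleting only its own postings.
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def add(self, doc_id, text):
        # Re-adding an existing id replaces its old contents.
        self.remove(doc_id)
        words = set(text.lower().split())
        self.doc_words[doc_id] = words
        for word in words:
            self.postings[word].add(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


index = IncrementalIndex()
index.add("a.txt", "install the public license")
index.add("b.txt", "run the install script")
index.add("a.txt", "public license only")   # update one file, no full rebuild
```

A real indexer also has to handle word positions, on-disk layout, and concurrent readers, which is presumably where the hard part lies.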
My favorite feature is it can run an external program, such as a perl mbox
or html parser or perl spider, or DBI program or whatever to get the source
to index. Use it with Cache::Cache and mod_perl and it's nice and fast
from page to page of results.
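The page-to-page speedup comes from running the search once and serving later pages from a cache. Below is a rough Python sketch of that pattern, not the Cache::Cache API and not tied to mod_perl; `ResultCache`, `fake_search`, and the 300-second expiry are all assumptions chosen for the example.

```python
import time


class ResultCache:
    """Keep full hit lists per query for a short time so paging skips the search."""

    def __init__(self, search_fn, ttl_seconds=300):
        self.search_fn = search_fn      # hypothetical callable: query -> list of hits
        self.ttl = ttl_seconds
        self.store = {}                 # query -> (expiry time, hits)
        self.misses = 0                 # how many times the real search ran

    def page(self, query, page_no, per_page=10):
        now = time.time()
        entry = self.store.get(query)
        if entry is None or entry[0] < now:
            # Cache miss or expired entry: run the real search once.
            self.misses += 1
            entry = (now + self.ttl, self.search_fn(query))
            self.store[query] = entry
        start = (page_no - 1) * per_page
        return entry[1][start:start + per_page]


def fake_search(query):
    # Stand-in for a real search back end; returns 25 made-up hits.
    return [f"{query}-hit-{i}" for i in range(25)]


cache = ResultCache(fake_search)
first = cache.page("install", 1)
third = cache.page("install", 3)   # served from the cache, no second search
```

In a multi-process server the store would need to be shared between processes (a file or shared-memory cache) rather than a per-process dictionary.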
Here's indexing only 24,000 files:
> ./swish-e -c u -i /usr/doc
Indexing Data Source: "File-System"
Indexing "/usr/doc"
270279 unique words indexed.
4 properties sorted.
23840 files indexed. 177638538 total bytes.
Elapsed time: 00:03:50 CPU time: 00:03:16
Indexing done!
Here's searching:
> ./swish-e -w install -m 1
# SWISH format: 2.1-dev-24
# Search words: install
# Number of hits: 2202
# Search time: 0.006 seconds
# Run time: 0.011 seconds
A phrase:
> ./swish-e -w '"public license"' -m 1
# SWISH format: 2.1-dev-24
# Search words: "public license"
# Number of hits: 348
# Search time: 0.007 seconds
# Run time: 0.012 seconds
998 /usr/doc/packages/ijb/gpl.html "gpl.html" 26002
A wild card and boolean search:
> ./swish-e -w 'sa* or java' -m 1
# SWISH format: 2.1-dev-24
# Search words: sa* or java
# Number of hits: 7476
# Search time: 0.082 seconds
# Run time: 0.087 seconds
Or a good number of results:
> ./swish-e -w 'is or und or run' -m 1
# SWISH format: 2.1-dev-24
# Search words: is or und or run
# Number of hits: 14477
# Search time: 0.084 seconds
# Run time: 0.089 seconds
Or everything:
> ./swish-e -w 'not dksksks' -m 1
# SWISH format: 2.1-dev-24
# Search words: not dksksks
# Number of hits: 23840
# Search time: 0.069 seconds
# Run time: 0.074 seconds
This is pushing the limit for little old swish, but here's indexing a few
more very small xml files (~150 bytes each):
3830016 files indexed. 582898349 total bytes.
Elapsed time: 00:48:22 CPU time: 00:44:01
</plug>
Bill Moseley
mailto:moseley@hank.org
Re: [OT] Re: search.cpan.org
Posted by Mark Maunder <ma...@swiftcamel.com>.
Nick Tonkin wrote:
> Because it does a full text search of all the contents of the DB.
>
Not sure what he's using for a back end, but MySQL 4.0 (in alpha) has very fast and
feature-rich full text searching now, so perhaps he can migrate to that once it's
released in December sometime. I'm using it on our site and searching fulltext
indexes on three fields (including a large text field) in under 3 seconds on over
70,000 records on a p550 with 490 megs of ram.
Re: [OT] Re: search.cpan.org
Posted by Ask Bjoern Hansen <as...@valueclick.com>.
On Tue, 27 Nov 2001, Nick Tonkin wrote:
> Well, ask Ask if you want the whole truth. But when I asked him that's
> what he said. Maybe there's a problem with the architecture and some
> pre-indexing is done per session or something suboptimal like that. Ask?
No, Robert is right. It's just searches that are doing a full scan
of the database. I know Graham is working on a better search
system.
If Bill got swish-e to support incremental database updates I'm sure
it would help. ;-)
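For readers wondering why a full scan hurts, here is a small Python sketch contrasting it with an index lookup. It is an illustration only; the document names and texts are invented, and nothing here reflects how search.cpan.org is actually built.

```python
def full_scan(docs, word):
    # Reads every document on every query: cost grows with total text size.
    return sorted(doc_id for doc_id, text in docs.items()
                  if word in text.lower().split())


def build_index(docs):
    # One pass up front; afterwards a query is a single dictionary lookup.
    index = {}
    for doc_id, text in docs.items():
        for w in set(text.lower().split()):
            index.setdefault(w, set()).add(doc_id)
    return index


def indexed_search(index, word):
    return sorted(index.get(word, set()))


# Made-up sample data for the comparison.
docs = {
    "Foo::Bar": "parse and install modules",
    "Baz": "a spider for html pages",
    "Qux::Install": "install helper for perl modules",
}
index = build_index(docs)
```

Both functions return the same hits; the difference is that `full_scan` does work proportional to the whole collection per query, while `indexed_search` pays that cost once in `build_index`.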
- ask
--
ask bjoern hansen, http://ask.netcetera.dk/ !try; do();
more than a billion impressions per week, http://valueclick.com
[OT] Re: search.cpan.org
Posted by Nick Tonkin <ni...@rlnt.net>.
Well, ask Ask if you want the whole truth. But when I asked him that's
what he said. Maybe there's a problem with the architecture and some
pre-indexing is done per session or something suboptimal like that. Ask?
~~~~~~~~~~~
Nick Tonkin
On Tue, 27 Nov 2001, Robert Landrum wrote:
> Sure... When doing searches. But it takes me about 20 seconds just to
> hit any page... not just the search pages.
>
> And don't misunderstand, I think search.cpan.org is awesome. I just
> wondered if anyone knew why it was slow all the time.
>
> Rob
>
[OT] Re: search.cpan.org
Posted by Robert Landrum <rl...@capitoladvantage.com>.
Sure... When doing searches. But it takes me about 20 seconds just to
hit any page... not just the search pages.
And don't misunderstand, I think search.cpan.org is awesome. I just
wondered if anyone knew why it was slow all the time.
Rob
At 12:55 PM -0800 11/27/01, Nick Tonkin wrote:
>Because it does a full text search of all the contents of the DB.
>
>
>~~~~~~~~~~~
>Nick Tonkin
>
>On Tue, 27 Nov 2001, Robert Landrum wrote:
>
>> Does anyone know why search.cpan.org is always the s-l-o-w-e-s-t site
>> on the internet? I can't believe it's always busy. I've had trouble
>> hitting it at 3 AM.
>>
>> Maybe it's just me...
>>
>>
>> Rob
>>
>> --
>> "Only two things are infinite: The universe, and human stupidity. And I'm not
>> sure about the former." --Albert Einstein
>>
>
--
"Only two things are infinite: The universe, and human stupidity. And I'm not
sure about the former." --Albert Einstein