Posted to modperl@perl.apache.org by Robert Landrum <rl...@capitoladvantage.com> on 2001/11/27 21:46:04 UTC

[OT] search.cpan.org

Does anyone know why search.cpan.org is always the s-l-o-w-e-s-t site 
on the internet?  I can't believe it's always busy.  I've had trouble 
hitting it at 3 AM.

Maybe it's just me...


Rob

--
"Only two things are infinite: The universe, and human stupidity. And I'm not
sure about the former." --Albert Einstein

Re: [OT] search.cpan.org

Posted by Perrin Harkins <pe...@elem.com>.
There's a simple solution:
http://kobesearch.cpan.org/


Re: [OT] Re: search.cpan.org

Posted by Randy Kobes <ra...@theoryx5.uwinnipeg.ca>.
On Tue, 27 Nov 2001, Bill Moseley wrote:

> At 12:55 PM 11/27/01 -0800, Nick Tonkin wrote:
> >
> >Because it does a full text search of all the contents of the DB.
>
> Perhaps, but it's just overloaded.

I think the load, and network connection, is the main reason; the
search itself, if you were connected locally at a time when the
machine isn't so busy, is pretty quick.

> I'm sure he's working on it, but anyone want to offer Graham free hosting?
> A few mirrors would be nice, too.

They (Graham and Elaine) are aware that it can be slow at times,
and have set up at least one mirror site to help spread the load.

best regards,
randy kobes


Re: [OT] Re: search.cpan.org

Posted by Bill Moseley <mo...@hank.org>.
At 12:55 PM 11/27/01 -0800, Nick Tonkin wrote:
>
>Because it does a full text search of all the contents of the DB.

Perhaps, but it's just overloaded.

I'm sure he's working on it, but anyone want to offer Graham free hosting?
A few mirrors would be nice, too.

(Plus, all my CPAN.pm setups are now failing to work, too....)



Bill Moseley
mailto:moseley@hank.org

Re: [OT] Re: search.cpan.org

Posted by Bill Moseley <mo...@hank.org>.
At 09:02 PM 11/27/01 +0000, Mark Maunder wrote:
>I'm using it on our site and searching fulltext
>indexes on three fields (including a large text field) in under 3 seconds
>on over 70,000 records on a p550 with 490 megs of ram.
>
>
Hi Mark,

<plug>

Some day if you are bored, try indexing with swish-e (the development
version).
http://swish-e.org

The big problem with it right now is that it doesn't do incremental indexing.
One of the developers is trying to get that working within a few weeks.
But for most small sets of files it's not an issue, since indexing is so fast.

My favorite feature is it can run an external program, such as a perl mbox
or html parser or perl spider, or DBI program or whatever to get the source
to index.  Use it with Cache::Cache and mod_perl and it's nice and fast
from page to page of results.
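
A minimal sketch of the caching idea above, written in Python rather than
Perl and making no claim to match how Cache::Cache or swish-e actually
work: the full hit list for a query is stored once with an expiry time, and
each results page is then sliced from the cached list instead of re-running
the search. ResultCache, fetch_page, and run_search are hypothetical names
made up for this illustration.

```python
import time


class ResultCache:
    """Tiny in-memory cache with expiry (a stand-in, not the Cache::Cache API)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it so the next request re-runs the search.
            del self.store[key]
            return None
        return value

    def set(self, key, value):
        self.store[key] = (self.clock() + self.ttl, value)


def fetch_page(query, page, page_size, cache, run_search):
    """Return one page of hits, calling run_search only on a cache miss."""
    hits = cache.get(query)
    if hits is None:
        hits = run_search(query)  # hypothetical callable wrapping the search engine
        cache.set(query, hits)
    start = page * page_size
    return hits[start:start + page_size]
```

With this shape, moving from page 1 to page 2 of the same query costs a
dictionary lookup and a list slice; the search engine is only hit again once
the cached entry expires.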

Here's indexing only 24,000 files:

> ./swish-e -c u -i /usr/doc
Indexing Data Source: "File-System"
Indexing "/usr/doc"
270279 unique words indexed.
4 properties sorted.                                              
23840 files indexed.  177638538 total bytes.
Elapsed time: 00:03:50 CPU time: 00:03:16
Indexing done!

Here's searching:

> ./swish-e -w install -m 1
# SWISH format: 2.1-dev-24
# Search words: install
# Number of hits: 2202
# Search time: 0.006 seconds
# Run time: 0.011 seconds

A phrase:

> ./swish-e -w '"public license"' -m 1
# SWISH format: 2.1-dev-24
# Search words: "public license"
# Number of hits: 348
# Search time: 0.007 seconds
# Run time: 0.012 seconds
998 /usr/doc/packages/ijb/gpl.html "gpl.html" 26002


A wild card and boolean search:

> ./swish-e -w 'sa* or java' -m 1
# SWISH format: 2.1-dev-24
# Search words: sa* or java
# Number of hits: 7476
# Search time: 0.082 seconds
# Run time: 0.087 seconds

Or a good number of results:

> ./swish-e -w 'is or und or run' -m 1
# SWISH format: 2.1-dev-24
# Search words: is or und or run
# Number of hits: 14477
# Search time: 0.084 seconds
# Run time: 0.089 seconds

Or everything:

> ./swish-e -w 'not dksksks' -m 1
# SWISH format: 2.1-dev-24
# Search words: not dksksks
# Number of hits: 23840
# Search time: 0.069 seconds
# Run time: 0.074 seconds


This is pushing the limit for little old swish, but here's indexing a few
more very small xml files (~150 bytes each)

3830016 files indexed.  582898349 total bytes.
Elapsed time: 00:48:22 CPU time: 00:44:01

</plug>

Bill Moseley
mailto:moseley@hank.org

Re: [OT] Re: search.cpan.org

Posted by Mark Maunder <ma...@swiftcamel.com>.
Nick Tonkin wrote:

> Because it does a full text search of all the contents of the DB.
>

Not sure what he's using for a back end, but MySQL 4.0 (in alpha) has very fast and
feature-rich full text searching now, so perhaps he can migrate to that once it's
released in December sometime. I'm using it on our site and searching fulltext
indexes on three fields (including a large text field) in under 3 seconds on over
70,000 records on a p550 with 490 megs of RAM.


Re: [OT] Re: search.cpan.org

Posted by Ask Bjoern Hansen <as...@valueclick.com>.
On Tue, 27 Nov 2001, Nick Tonkin wrote:

> Well, ask Ask if you want the whole truth. But when I asked him that's
> what he said. Maybe there's a problem with the architecture and some
> pre-indexing is done per session or something suboptimal like that. Ask?

No, Robert is right. It's just searches that are doing a full scan
of the database.  I know Graham is working on a better search
system.
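
To illustrate the difference between a full scan and an indexed search, here
is a rough Python sketch. It is not based on how search.cpan.org or swish-e
is implemented; the sample documents and the function names full_scan,
build_index, and indexed_search are invented for the example.

```python
from collections import defaultdict


def full_scan(docs, word):
    """Look at every document on every query: cost grows with total text size."""
    word = word.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if word in text.lower().split())


def build_index(docs):
    """One pass at indexing time: map each word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in text.lower().split():
            index[w].add(doc_id)
    return index


def indexed_search(index, word):
    """At query time the work is a single dictionary lookup."""
    return sorted(index.get(word.lower(), ()))


# Hypothetical sample data, only for illustration.
docs = {
    "a.txt": "install the public license",
    "b.txt": "java install notes",
    "c.txt": "nothing relevant here",
}
index = build_index(docs)
same = full_scan(docs, "install") == indexed_search(index, "install")
```

Both functions return the same hits; the indexed version pays the scanning
cost once, up front, which is also why incremental index updates matter when
the underlying data keeps changing.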

If Bill got swish-e to support incremental database updates I'm sure
it would help. ;-) 


 - ask

-- 
ask bjoern hansen, http://ask.netcetera.dk/         !try; do();
more than a billion impressions per week, http://valueclick.com



[OT] Re: search.cpan.org

Posted by Nick Tonkin <ni...@rlnt.net>.
Well, ask Ask if you want the whole truth. But when I asked him that's
what he said. Maybe there's a problem with the architecture and some
pre-indexing is done per session or something suboptimal like that. Ask?

~~~~~~~~~~~
Nick Tonkin

On Tue, 27 Nov 2001, Robert Landrum wrote:

> Sure... When doing searches. But it takes me about 20 seconds just to 
> hit any page... not just the search pages.
> 
> And don't misunderstand, I think search.cpan.org is awesome.  I just 
> wondered if anyone knew why it was slow all the time.
> 
> Rob
> 
> 
> At 12:55 PM -0800 11/27/01, Nick Tonkin wrote:
> >Because it does a full text search of all the contents of the DB.
> >
> >
> >~~~~~~~~~~~
> >Nick Tonkin
> >
> >On Tue, 27 Nov 2001, Robert Landrum wrote:
> >
> >> Does anyone know why search.cpan.org is always the s-l-o-w-e-s-t site
> >> on the internet?  I can't believe it's always busy.  I've had trouble
> >> hitting it at 3 AM.
> >>
> >> Maybe it's just me...
> >>
> >>
> >> Rob
> >>
> >> --
> >> "Only two things are infinite: The universe, and human stupidity. And I'm not
> >> sure about the former." --Albert Einstein
> >>
> >
> 
> 
> --
> "Only two things are infinite: The universe, and human stupidity. And I'm not
> sure about the former." --Albert Einstein
> 


[OT] Re: search.cpan.org

Posted by Robert Landrum <rl...@capitoladvantage.com>.
Sure... When doing searches. But it takes me about 20 seconds just to 
hit any page... not just the search pages.

And don't misunderstand, I think search.cpan.org is awesome.  I just 
wondered if anyone knew why it was slow all the time.

Rob


At 12:55 PM -0800 11/27/01, Nick Tonkin wrote:
>Because it does a full text search of all the contents of the DB.
>
>
>~~~~~~~~~~~
>Nick Tonkin
>
>On Tue, 27 Nov 2001, Robert Landrum wrote:
>
>> Does anyone know why search.cpan.org is always the s-l-o-w-e-s-t site
>> on the internet?  I can't believe it's always busy.  I've had trouble
>> hitting it at 3 AM.
>>
>> Maybe it's just me...
>>
>>
>> Rob
>>
>> --
>> "Only two things are infinite: The universe, and human stupidity. And I'm not
>> sure about the former." --Albert Einstein
>>
>


--
"Only two things are infinite: The universe, and human stupidity. And I'm not
sure about the former." --Albert Einstein

[OT] Re: search.cpan.org

Posted by Nick Tonkin <ni...@rlnt.net>.
Because it does a full text search of all the contents of the DB.


~~~~~~~~~~~
Nick Tonkin

On Tue, 27 Nov 2001, Robert Landrum wrote:

> Does anyone know why search.cpan.org is always the s-l-o-w-e-s-t site 
> on the internet?  I can't believe it's always busy.  I've had trouble 
> hitting it at 3 AM.
> 
> Maybe it's just me...
> 
> 
> Rob
> 
> --
> "Only two things are infinite: The universe, and human stupidity. And I'm not
> sure about the former." --Albert Einstein
>