Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/02/06 21:33:02 UTC

Speeding up initial searches using cache

Hi,

Running Nutch 0.71 on Mandrake Linux 2006 (P4 with two SATA drives on 
RAID 0, 2 GB of RAM, about 4 million pages, but expecting to hit 10+ 
million), and finding that our initial queries take as long as 15-20 
seconds to return results.  I'd like to speed that up and am seeking 
thoughts on how to do so.

My initial thought is that I need to do something with caching.  
Byron M commented a while back on this list:
--
I would warm up your index by throwing
queries at it to get the blocks cached on an OS level
or work on implementing RAMDirectory instead of
FSDirectory to store your index in ram if you have the
resources to do so.
----
Does this seem the best way to ensure my users are getting fast search 
results?  If so, does someone have a list of queries that I might try 
using?  I suppose I could use the list of 'one day's worth of search 
terms' that I found here:
http://blog.sli-systems.com/2006/02/what_would_the_search_engines.html
However, I'm not sure how I would even measure the effectiveness of 
that. If I ran, say, the 500K search terms in that list through the 
software, perhaps there wouldn't be enough cache space for that many 
searches.

Any suggestions on the above, or comments on the best way to cut this 
long lead time on search results?

Thanks.




Re: Speeding up initial searches using cache

Posted by Stefan Groschupf <sg...@media-style.com>.
Normally only the first query takes that long.
Do you plan to reboot the search server often?
If you restart it with a script, you can add something like
wget ...?query=http to that script.
Caching makes some sense, but only if you have multiple search 
servers and many repeated identical queries.
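
The warm-up step in a restart script could look roughly like the Python sketch below, which does the same job as the wget call for a list of terms. `SEARCH_URL` is a hypothetical address, not something given in this thread; only the `query` parameter name comes from the wget example above. The `fetch` argument exists so the loop can be tried without a running server.

```python
import urllib.parse
import urllib.request

# Hypothetical address of the search front end; replace with your own.
SEARCH_URL = "http://localhost:8080/search.jsp"


def build_url(base_url, query):
    """Append the search term as a URL-encoded `query` parameter."""
    return base_url + "?" + urllib.parse.urlencode({"query": query})


def warm_up(queries, base_url=SEARCH_URL, fetch=None):
    """Request each query once, discard the response, and return how many succeeded."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=60) as response:
                response.read()
    succeeded = 0
    for query in queries:
        try:
            fetch(build_url(base_url, query))
            succeeded += 1
        except OSError:
            # A failed warm-up request should not abort the restart script.
            pass
    return succeeded
```

Calling `warm_up(["http"])` once the server is up would mirror the single wget request; a longer list of common terms touches more of the index.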


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: Speeding up initial searches using cache

Posted by Chris Lamprecht <cl...@gmail.com>.
Just out of curiosity, does anyone here know how well query caching
works in general with an extremely high-volume search engine?

It seems like as your search volume goes up, and the number of unique
queries goes up with it, the cache hit rate would go down, and caching
would help less and less.  Urs Hoelzle (Google) mentioned this in a
talk he gave at UW in 2002:

http://rakaposhi.eas.asu.edu/f02-cse494-mailarchive/msg00138.html
(link to video on this page)
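
One way to check this against your own traffic is to replay a query log through a fixed-size cache and count the hits. The Python sketch below assumes a least-recently-used eviction policy and a plain list of query strings; both are assumptions for illustration, not a description of how any particular search engine caches.

```python
from collections import OrderedDict


def lru_hit_rate(queries, capacity):
    """Replay queries through an LRU cache of the given size and return hits / total."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for query in queries:
        total += 1
        if query in cache:
            hits += 1
            cache.move_to_end(query)  # mark as most recently used
        else:
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0
```

Running it over the same log with several values of `capacity` shows how much cache a given hit rate would cost.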

-chris

On 2/7/06, Byron Miller <by...@yahoo.com> wrote:
> I use OSCache with great success.
>
> I would say an amazing number (more than I assumed) of the
> queries we get are duplicates in one fashion or another,
> so on top of warming things up as much as possible in
> the OS buffer cache, we use OSCache as well.
>
> You could also use Squid to cache pages for x amount
> of time to offload your hotspots and free up CPU time
> for those ad-hoc/random queries (as long as you
> aren't forcing content to expire in your headers).
>
> -byron

Re: Speeding up initial searches using cache

Posted by Byron Miller <by...@yahoo.com>.
I use OSCache with great success.  

I would say an amazing number (more than I assumed) of the
queries we get are duplicates in one fashion or another,
so on top of warming things up as much as possible in
the OS buffer cache, we use OSCache as well.

You could also use Squid to cache pages for x amount
of time to offload your hotspots and free up CPU time
for those ad-hoc/random queries (as long as you
aren't forcing content to expire in your headers).
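
To illustrate the idea of keeping results "for x amount of time", here is a minimal Python sketch of a time-limited result cache. It is not the OSCache or Squid API; the class, its method names, and the normalisation rule (case and spacing ignored) are all assumptions made for this example.

```python
import time


class ExpiringCache:
    """Keep search results for `ttl` seconds, keyed by a normalised query string."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # normalised query -> (time stored, result)

    @staticmethod
    def normalise(query):
        # Assumption: queries differing only in case or spacing count as duplicates.
        return " ".join(query.lower().split())

    def get_or_compute(self, query, search):
        """Return a cached result if it is still fresh, otherwise run `search`."""
        key = self.normalise(query)
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        result = search(query)
        self.entries[key] = (now, result)
        return result
```

This sketch never evicts old keys, so a real deployment would also need a size limit.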

-byron

