Posted to user@nutch.apache.org by Hilkiah Lavinier <hi...@yahoo.com> on 2008/01/19 22:45:53 UTC

distributed search servers

Hi all,

I have a distributed search issue I need some advice on.  The scenario is that I have Tomcat running on one server and two Nutch search servers running on two other machines (so 3 machines in total).  I've set up the Nutch war to call the search servers correctly, and they respond.  The problem is that I get duplicate results.  The crawl data is replicated, so both search machines hold the same data.

Questions:
1) How do I prevent the duplicate responses? If I start a third search server I still get only two duplicates, so the number doesn't seem to increase with the number of search servers.
2) Does Tomcat wait for ALL search servers to respond before displaying the query result, or does it display the result as soon as one server responds?
3) In terms of load sharing, what is the best approach for distributed search servers?

Any help would be greatly appreciated!

Thanks,

Hilkiah G. Lavinier MEng (Hons), ACGI 
6 Winston Lane, 
Goodwill, 
Roseau, Dominica 
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
 
Email: hilkiah@yahoo.com
Email: hilkiah.lavinier@gmail.com
IM: Yahoo hilkiah / MSN hilkiahlavinier@hotmail.com
IM: ICQ #8978201  / AOL hilkiah21






Re: distributed search servers

Posted by ianwong <yi...@hotmail.com>.
Dear Dennis,

About your answer:

> 2) does tomcat wait for ALL search servers to respond before displaying
> the query result or does it display the result as soon as one server
> responds?

Yes, to a timeout value.  If one goes down it will slow down the entire 
search cluster.

How can I change the timeout value, and what is the default?

Ian







Re: distributed search servers

Posted by Dennis Kubes <ku...@apache.org>.

Hilkiah Lavinier wrote:
> Hi all,
> 
> I have a distributed search issue I need some advice on.  The scenario is that I have Tomcat running on one server and two Nutch search servers running on two other machines (so 3 machines in total).  I've set up the Nutch war to call the search servers correctly, and they respond.  The problem is that I get duplicate results.  The crawl data is replicated, so both search machines hold the same data.
> 
> Questions:
> 1) How do I prevent the duplicate responses? If I start a third search server I still get only two duplicates, so the number doesn't seem to increase with the number of search servers.

Set hitsPerSite=1, either in your query or in NutchBean.  Here is an example:

Duplicates:
http://search.isc.swlabs.org/search.jsp?lang=en&query=java

No Duplicates:
http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1

The filtering is based on hostname, so for instance java.net and 
www.java.net will be treated as different sites even though they are the 
same.  That problem has not been corrected yet in Nutch, but we are 
working on it.
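
To make the hostname behaviour concrete, here is a minimal sketch of a per-host limit on a ranked result list.  It is an illustration only, not Nutch's actual code: the function name limit_hits_per_site and the sample URLs are made up for this example, and the real hitsPerSite handling lives inside Nutch.

```python
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site):
    """Keep at most `hits_per_site` results per hostname, preserving rank order.

    Hypothetical helper for illustration; it only mimics the idea behind
    the hitsPerSite query parameter.
    """
    seen = {}   # hostname -> number of hits already kept
    kept = []
    for url in urls:
        host = urlsplit(url).hostname
        count = seen.get(host, 0)
        if count < hits_per_site:
            kept.append(url)
            seen[host] = count + 1
    return kept


# Merged results from two search servers holding the same crawl data.
ranked = [
    "http://java.net/",          # from search server 1
    "http://java.net/",          # same page again, from search server 2
    "http://www.java.net/",      # different hostname, so counted separately
    "http://java.net/projects",
]
deduped = limit_hits_per_site(ranked, 1)
print(deduped)
```

With a limit of 1 the repeated java.net hit is dropped, but www.java.net survives because the comparison is on the literal hostname.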

> 2) Does Tomcat wait for ALL search servers to respond before displaying the query result, or does it display the result as soon as one server responds?

Yes, it waits for all of them, up to a timeout value.  If one goes down 
it will slow down the entire search cluster.
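
The wait-for-all-with-timeout behaviour can be sketched roughly as follows.  This is a toy model under stated assumptions, not how Nutch implements it: search_all, fast_backend and slow_backend are hypothetical names, the backends are plain Python functions rather than remote servers, and the timeout figures are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(backends, query, timeout_s):
    """Send `query` to every backend and wait for all of them, up to `timeout_s`.

    Returns the merged hits from the backends that answered in time and
    the number of backends that did not.
    """
    pool = ThreadPoolExecutor(max_workers=len(backends))
    futures = [pool.submit(backend, query) for backend in backends]
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on backends still running
    hits = []
    for future in futures:  # iterate in submission order for stable output
        if future in done and future.exception() is None:
            hits.extend(future.result())
    return hits, len(not_done)


def fast_backend(query):
    return [f"fast:{query}"]


def slow_backend(query):
    time.sleep(1.0)  # stands in for a server that is down or overloaded
    return [f"slow:{query}"]


start = time.monotonic()
hits, timed_out = search_all([fast_backend, slow_backend], "java", timeout_s=0.2)
elapsed = time.monotonic() - start
print(hits, timed_out, round(elapsed, 2))
```

The fast backend answers almost immediately, yet the caller still spends roughly the whole timeout waiting on the slow one, which is why a single dead search server slows every query.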

> 3) In terms of load sharing, what is the best approach for distributed search servers?

If you are looking at a round-robin sort of load balancing, I would 
suggest two Nutch servers, each hitting different search servers that 
hold replicated content, fronted by an Apache server or a hardware load 
balancer.  Remember that the search as a whole can still be up even if 
one or more search servers fail.  I would worry more about clustering 
the front-end search website than about load balancing the search 
servers, but it all depends on what your goal is.  For a www search we 
don't care if a few of the search servers are down as long as the search 
is functional.
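
As a rough illustration of the round-robin idea, the sketch below rotates requests across replicated front ends and skips any that are marked down.  In practice this job belongs to the Apache server or hardware load balancer; the class RoundRobinBalancer and the names nutch-a and nutch-b are invented for this example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out front ends in rotation, skipping those marked as down.

    Hypothetical class for illustration; a real deployment would rely on
    the load balancer's own health checks.
    """

    def __init__(self, front_ends):
        self._front_ends = list(front_ends)
        self._order = cycle(range(len(self._front_ends)))
        self.down = set()  # names of front ends currently considered failed

    def pick(self):
        # Try each front end at most once per call before giving up.
        for _ in range(len(self._front_ends)):
            name = self._front_ends[next(self._order)]
            if name not in self.down:
                return name
        raise RuntimeError("no front end available")


balancer = RoundRobinBalancer(["nutch-a", "nutch-b"])
healthy_picks = [balancer.pick() for _ in range(4)]

balancer.down.add("nutch-a")  # simulate one front end failing
degraded_picks = [balancer.pick() for _ in range(3)]
print(healthy_picks, degraded_picks)
```

With both front ends healthy the requests alternate; once nutch-a is marked down, every request goes to nutch-b and the search stays available.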

Dennis Kubes

