Posted to dev@nutch.apache.org by Andy Liu <an...@gmail.com> on 2005/04/15 16:58:07 UTC

Questions about distributed search servers

1. Right now if you use the distributed search servers, the
QueryFilter plugins are executed on each server machine.  Would it
make sense to have the client execute the plugins once, and then
dispatch the Lucene query object (instead of the Nutch query object) to
the server machines?

2. When using distributed search, I believe that docFreq() is handled
incorrectly.  Each server will perform its own docFreq() calculation
instead of taking the true docFreq() of the entire index across all
machines.  This throws off the scores.  I'm having trouble thinking of
a clean solution to this problem.  Any ideas?

Andy
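
To make question 2 concrete, here is a minimal sketch of why per-server
docFreq() values skew scores, and what a merged value would look like. It
is not Nutch or Lucene code: the class GlobalDocFreqSketch and its methods
are hypothetical, and the idf shape ln(numDocs / (docFreq + 1)) + 1 is
assumed purely for illustration. The idea shown is simply to sum docFreq
and numDocs over all servers before computing idf once.

```java
// Hypothetical illustration only; not part of Nutch or Lucene.
class GlobalDocFreqSketch {

    // Assumed idf shape: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Sums per-server docFreq and numDocs for one term, then computes a
    // single idf, so every server would weight the term identically.
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        if (docFreqs.length != numDocs.length) {
            throw new IllegalArgumentException("one entry per server expected");
        }
        long totalDocFreq = 0;
        long totalNumDocs = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            totalDocFreq += docFreqs[i];
            totalNumDocs += numDocs[i];
        }
        return idf(totalDocFreq, totalNumDocs);
    }

    public static void main(String[] args) {
        // Made-up numbers: the term is rare on server 0, common on server 1.
        long[] docFreqs = {10, 50_000};
        long[] numDocs = {1_000_000, 1_000_000};
        for (int i = 0; i < docFreqs.length; i++) {
            System.out.printf("server %d local idf = %.3f%n",
                    i, idf(docFreqs[i], numDocs[i]));
        }
        System.out.printf("merged idf = %.3f%n", globalIdf(docFreqs, numDocs));
    }
}
```

With these made-up numbers the local idf on server 0 is far higher than
the merged value, so its hits for that term would be over-weighted when
results from both servers are combined. A real fix would also need a way
to ship the merged statistics to each server before scoring, which this
sketch does not attempt.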

How to exclude content other than Script & Style from indexing

Posted by Sundaramoorthy Kannan <ka...@cognizant.com>.
Hi,
If I have to exclude some parts of a web page from getting indexed, how
can I do it? As I understand it, the DOMContentUtils class of the HTML
parser plugin currently ignores only SCRIPT, STYLE and comment text. Can
I configure it to exclude some other tags too?

Thanks,
Kannan
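
For illustration, the kind of behaviour being asked about could look like
the sketch below: a DOM walk that collects text but skips comments and any
element whose name is in a caller-supplied set. This is not the
DOMContentUtils API and says nothing about whether Nutch exposes such a
setting; TagSkippingTextExtractor and extractFrom are hypothetical names.
It also assumes well-formed input, because it uses the JDK XML parser
rather than a tolerant HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper; not part of Nutch.
class TagSkippingTextExtractor {
    private final Set<String> skipTags = new HashSet<>();

    TagSkippingTextExtractor(String... tagNames) {
        for (String t : tagNames) {
            skipTags.add(t.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the text under root with runs of whitespace collapsed.
    String extract(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    private void collect(Node node, StringBuilder sb) {
        switch (node.getNodeType()) {
            case Node.COMMENT_NODE:
                return; // comments are never indexed
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                // A space is appended after every text node, so words split
                // by inline markup come out as separate tokens.
                sb.append(node.getNodeValue()).append(' ');
                return;
            case Node.ELEMENT_NODE:
                if (skipTags.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
                    return; // drop the whole subtree of an excluded tag
                }
                break;
            default:
                break;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    // Convenience wrapper: parse a well-formed document and extract its text.
    static String extractFrom(String xhtml, String... skipTags) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return new TagSkippingTextExtractor(skipTags).extract(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div>Keep this</div><nav>Menu</nav>"
                + "<script>var x = 1;</script><!-- hidden --><p>and this</p></body></html>";
        // prints "Keep this and this"
        System.out.println(extractFrom(page, "script", "style", "nav"));
    }
}
```

Passing the tag names in, rather than hard-coding SCRIPT and STYLE, is the
part that would make the exclusion list configurable.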


Re: Questions about distributed search servers

Posted by Daniel Naber <da...@t-online.de>.
On Friday 15 April 2005 16:58, Andy Liu wrote:

> 2. When using distributed search, I believe that docFreq() is handled
> incorrectly.  Each server will perform its own docFreq() calculation
> instead of taking the true docFreq() of the entire index across all
> machines.  This throws off the scores.  I'm having trouble thinking of
> a clean solution to this problem.  Any ideas?

This issue is being addressed in Lucene:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Regards
 Daniel

-- 
http://www.danielnaber.de