Posted to user@nutch.apache.org by Martin Bayly <Ma...@taglocity.com> on 2007/07/18 19:55:14 UTC

Newbie question about Nutch query architecture - multiple indexes

I'm fairly new to Lucene/Nutch and search in general, so bear with me.

I'm using Lucene in an application and, although it is not a concern yet,
I want to understand the implications for scalability going forward.

I was reading the Nutch case study in the Lucene in Action book, about
how Nutch splits its indexes across many machines.

<snip>

"The Query Handler does some light processing of the query and forwards
the search terms to a large set of Index Searcher machines."

"There are now many streams of search results that come back to the
Query Handler. The Query Handler collates the results, finding the best
ranking across all of them."

"The Query Handler asks each Index Searcher for only a small number of
documents (usually 10)"

</snip>

What I don't follow is what splitting the indexes in this way implies
for relevancy. Let's say the first 20 docs on Index Searcher machine A
are highly relevant and the first 10 docs on Index Searcher machine B
are not very relevant. If I understand correctly, the user will see
only 10 docs from machine A and 10 docs from machine B, i.e. docs 11-20
in the search result will not be very relevant?
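
To make the scenario concrete, here is a minimal sketch of it in Python.
This is not Nutch's code. The names (top_k, merged_page, shard_a,
shard_b) and the scores are made up for illustration, and it assumes
that scores from different machines are directly comparable. It compares
asking each machine for only 10 hits with asking each machine for
page * page_size hits before merging.

```python
import heapq


def top_k(shard, k):
    """Return the k best (score, doc_id) pairs from one shard, best first."""
    return heapq.nlargest(k, shard)


def merged_page(shards, page, page_size=10, per_shard=None):
    """Merge per-shard hits and return one page (1-based) of the merged ranking.

    per_shard is the number of hits requested from each shard. If it is not
    given, page * page_size hits are requested, which is enough for the
    requested page to match the ranking over all shards combined.
    """
    k = per_shard if per_shard is not None else page * page_size
    candidates = [hit for shard in shards for hit in top_k(shard, k)]
    ranked = sorted(candidates, reverse=True)  # highest score first
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


# Hypothetical scores: machine A holds 20 strong hits, machine B 20 weak ones.
shard_a = [(0.90 - 0.01 * i, f"A{i}") for i in range(20)]
shard_b = [(0.30 - 0.01 * i, f"B{i}") for i in range(20)]

# Only 10 hits requested per machine: page 2 can only be filled from B.
naive_page2 = merged_page([shard_a, shard_b], page=2, per_shard=10)

# 20 hits requested per machine: page 2 is A's remaining strong hits.
full_page2 = merged_page([shard_a, shard_b], page=2)
```

In this sketch the first page is the same either way (A's 10 best docs).
The two approaches differ only on page 2, where the fixed limit of 10
per machine forces the weak docs from B into positions 11-20.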

 

I'm not sure I really see a way around this. I guess one of the critical
things is how you choose to split your indexes? My impression is that
Nutch does this based on the URL of the content being indexed?

Thanks for any insights

Martin