Posted to java-user@lucene.apache.org by Chun Wei Ho <cw...@gmail.com> on 2006/02/01 04:32:19 UTC

Distributed vs Merged Searching

I am deploying a web application that serves searches on a Lucene index,
and am deciding between distributing the search across several machines
and searching a single merged index. I was hoping that someone could tell
me from their experience:

+ Is there anything in particular to watch out for when using distributed
searching instead of searching one merged Lucene index?

+ How large should the index be before I need to (or should) turn to
distributed searching to reduce response/search time? I know it depends
a lot on hardware and request frequency, but I was wondering if anyone
could post their hardware info and index size as a reference for when/if
they had to use distributed search due to load issues.

Thanks :)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Distributed vs Merged Searching

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Lamprecht wrote:
> One issue is that if you split the index in half (for example) and
> get some results from index A and some from index B, then you need to
> merge the results somewhere.  But the scores coming from the two
> indexes are not directly comparable.  For example, if document 100
> from index A has score 0.85 and document 200 from index B has score
> 0.90, this doesn't necessarily mean that document 200 should be ranked
> before document 100.  This is one issue to deal with.
>
> I think this issue has been discussed on this mailing list before. 
> Has anyone else had to deal with this issue with a distributed index? 
> What does Nutch do?
>   

Please see http://issues.apache.org/jira/browse/NUTCH-92 . A solution is 
described there, but not implemented yet...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Distributed vs Merged Searching

Posted by Chris Lamprecht <cl...@gmail.com>.

One issue is that if you split the index in half (for example) and
get some results from index A and some from index B, then you need to
merge the results somewhere.  But the scores coming from the two
indexes are not directly comparable.  For example, if document 100
from index A has score 0.85 and document 200 from index B has score
0.90, this doesn't necessarily mean that document 200 should be ranked
before document 100.  This is one issue to deal with.
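
[Editor's sketch, not part of the original message.] The ranking problem
above can be illustrated with term statistics.  The sketch assumes the
classic idf formula of Lucene's default similarity,
1 + ln(numDocs / (docFreq + 1)), and uses made-up document counts; the
class and method names are hypothetical.  It shows that the same term
gets a different weight in each index half, and that summing the
statistics of all halves gives one weight that every half could share.

```java
public class ShardScoreDemo {
    // Classic Lucene-style inverse document frequency (assumed formula):
    // 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // idf computed from the summed statistics of all index parts, so that
    // every part would weight the term identically.
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        long df = 0, n = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            df += docFreqs[i];
            n += numDocs[i];
        }
        return idf(df, n);
    }

    public static void main(String[] args) {
        // Hypothetical statistics for one query term in two index halves.
        long numDocsA = 1_000_000, docFreqA = 100;      // term is rare in A
        long numDocsB = 1_000_000, docFreqB = 200_000;  // term is common in B

        System.out.printf("idf in index A:  %.3f%n", idf(docFreqA, numDocsA));
        System.out.printf("idf in index B:  %.3f%n", idf(docFreqB, numDocsB));
        System.out.printf("idf over A + B:  %.3f%n",
                globalIdf(new long[] {docFreqA, docFreqB},
                          new long[] {numDocsA, numDocsB}));
    }
}
```

Because each half weights the term differently, a 0.90 from B and a 0.85
from A are measured on different scales.  Whether sharing global
statistics is practical depends on the deployment; this is only a sketch
of the idea.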

I think this issue has been discussed on this mailing list before. 
Has anyone else had to deal with this issue with a distributed index? 
What does Nutch do?

-chris

On 1/31/06, Chun Wei Ho <cw...@gmail.com> wrote:
> I am deploying a web application that serves searches on a Lucene index,
> and am deciding between distributing the search across several machines
> and searching a single merged index. I was hoping that someone could tell
> me from their experience:
>
> + Is there anything in particular to watch out for when using distributed
> searching instead of searching one merged Lucene index?
>
> + How large should the index be before I need to (or should) turn to
> distributed searching to reduce response/search time? I know it depends
> a lot on hardware and request frequency, but I was wondering if anyone
> could post their hardware info and index size as a reference for when/if
> they had to use distributed search due to load issues.
>
> Thanks :)


Indexing and searching an item with attachments

Posted by Gwyn Carwardine <gw...@carwardine.net>.

I want to be able to store items and attachments such that they are treated
as a single document.

On the other hand, I want to be able to store them separately; there is no
point in reindexing an attachment if I've simply changed the description.

Here's an example: A system used to track job candidates and their CVs.

A database keeps the core stuff, such as name, address, next availability,
salary expectation, record of conversations etc. The candidate's CV(s) are
stored separately in the file system.

In this example I want to enable full text searching. I want to be able to
specify a query of "salary:[20000 TO 30000] c#". In this case "salary" will
have been taken from the database and the "c#" I would expect to be found in
the CV.

If I store the items and attachments as separate documents then Lucene will
probably return nothing, because no single document contains both the
"salary" field and the CV text. I need to effectively store them together,
but on the other hand I don't want to have to keep reindexing attachments
that haven't changed.
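
[Editor's sketch, not part of the original message and not an answer given
in this thread.] One possible approach is to keep items and attachments
in separate stores that share a candidate id, run each half of the query
against its own store, and intersect the matching ids.  The sketch below
uses plain Java maps as stand-ins for the two indexes; the class name,
method names and sample data are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ItemAttachmentJoinDemo {
    // Ids of candidates whose salary lies in [min, max] (stand-in for the item index).
    static Set<Integer> matchSalary(Map<Integer, Integer> salaryById, int min, int max) {
        Set<Integer> ids = new TreeSet<>();
        for (Map.Entry<Integer, Integer> e : salaryById.entrySet()) {
            if (e.getValue() >= min && e.getValue() <= max) ids.add(e.getKey());
        }
        return ids;
    }

    // Ids of candidates whose CV text contains the term (stand-in for the attachment index).
    static Set<Integer> matchCv(Map<Integer, String> cvById, String term) {
        Set<Integer> ids = new TreeSet<>();
        String needle = term.toLowerCase();
        for (Map.Entry<Integer, String> e : cvById.entrySet()) {
            if (e.getValue().toLowerCase().contains(needle)) ids.add(e.getKey());
        }
        return ids;
    }

    // Conjunction across the two stores: keep only ids present in both result sets.
    static Set<Integer> join(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> salaries = new HashMap<>();
        salaries.put(1, 25000);
        salaries.put(2, 45000);
        salaries.put(3, 28000);

        Map<Integer, String> cvs = new HashMap<>();
        cvs.put(1, "Five years of C# and SQL Server");
        cvs.put(2, "C# team lead");
        cvs.put(3, "Java and Lucene");

        // Equivalent of: salary:[20000 TO 30000] AND c#
        Set<Integer> hits = join(matchSalary(salaries, 20000, 30000), matchCv(cvs, "c#"));
        System.out.println(hits); // prints [1]
    }
}
```

With this layout, changing a description only touches the item store and
an unchanged CV is never reindexed.  The cost is that the two result sets
must be combined outside the index, and relevance scores from the two
stores are not combined at all in this sketch.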

Is there a standard method for dealing with this?

-Gwyn



Re: Distributed vs Merged Searching

Posted by Grant Ingersoll <gs...@syr.edu>.

You might find http://hellonline.com/blog/?p=55 helpful.  It discusses 
some issues with parallel distributed searches.

How many documents are you expecting to index?  And how many unique 
terms do you expect?

Chun Wei Ho wrote:
> I am deploying a web application that serves searches on a Lucene index,
> and am deciding between distributing the search across several machines
> and searching a single merged index. I was hoping that someone could tell
> me from their experience:
>
> + Is there anything in particular to watch out for when using distributed
> searching instead of searching one merged Lucene index?
>
> + How large should the index be before I need to (or should) turn to
> distributed searching to reduce response/search time? I know it depends
> a lot on hardware and request frequency, but I was wondering if anyone
> could post their hardware info and index size as a reference for when/if
> they had to use distributed search due to load issues.
>
> Thanks :)

-- 
------------------------------------------------------------------- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 

