Posted to solr-user@lucene.apache.org by sol myr <so...@yahoo.com> on 2011/09/13 21:58:41 UTC

Lucene Grid question

Hi,

I have a huge Lucene index, which I'd like to split between machines ("Grid").

E.g. say I have a chain of book-stores, in different countries, and I'm aiming for the following:
- Each country has its own index file, on its own machine (e.g. books from Japan are indexed on machine "japan1")
- Most users search only within their own country (e.g. search only the "japan1" index)
- But sometimes, they might ask to search the entire chain (all countries), meaning some sort of "map/reduce" (=collect data from all countries).

The main challenge is the "entire chain search", especially if I want reasonable ranking.

After some investigation (+great help from Hibernate Search forum), I've seen the following suggestions:

1) Implement a LuceneDirectory that transparently spreads across several machines.

I'm not sure how the Search would work - can I ask each index for *relevant* data only?
Or would I need to maintain one huge combined file, allowing "random access" for the Searcher?

2) Run an IndexReader on each machine.

They tell me each reader can report its relevant term-frequencies, and based on that I can fetch relevant results from each machine.
Apparently the ranking won't be perfect (for the overall result), but bearable.
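
To make approach 2 a bit more concrete, here is a minimal sketch of how per-machine statistics could be combined so that every machine scores against the same numbers. It is not taken from Lucene, Hibernate Search or Solandra: the function name `global_idf`, the shape of `shard_stats` and the sample counts are all made up for illustration, and the idf formula is only a textbook-style one, assumed here rather than copied from any library.

```python
import math
from collections import Counter

def global_idf(shard_stats):
    """Combine per-shard (num_docs, doc_freqs) pairs into one idf per term.

    shard_stats is a hypothetical structure: one (num_docs, {term: doc_freq})
    tuple per machine, as each machine's reader might report it.
    """
    total_docs = sum(num_docs for num_docs, _ in shard_stats)
    total_df = Counter()
    for _, doc_freqs in shard_stats:
        total_df.update(doc_freqs)  # adds the counts term by term
    # Assumed textbook-style idf: rarer terms across the whole chain weigh more.
    return {t: 1.0 + math.log(total_docs / (df + 1)) for t, df in total_df.items()}

# Made-up numbers: "lucene" is rare in the first index, common in the second.
stats = [(1000, {"lucene": 10}), (100, {"lucene": 50})]
idf = global_idf(stats)
```

With purely local statistics the two machines would weigh the term "lucene" very differently, so their scores would not be comparable when merged. Sharing the combined figures is one way to reduce that skew, at the cost of an extra round trip before the actual search.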

Now, I'm not familiar with Lucene internals, and would really appreciate your views on it.
- Any good articles on Lucene "Gridding"?
- Any idea whether approach #1 makes any sense? (IMHO it's not very sensible if I need to merge everything into a single huge file.)
- Any good implementations (of either approach)? So far I found Hibernate Search 4, and Solandra.

Thanks very much.

Re: Lucene Grid question

Posted by sol myr <so...@yahoo.com>.

Thank you very much (sorry for the delayed reply).

________________________________
From: Chris Hostetter <ho...@fucit.org>
To: solr-users <so...@lucene.apache.org>; sol myr <so...@yahoo.com>
Sent: Wednesday, September 21, 2011 4:15 AM
Subject: Re: Lucene Grid question

: E.g. say I have a chain of book-stores, in different countries, and I'm aiming for the following:
: - Each country has its own index file, on its own machine (e.g. books from Japan are indexed on machine "japan1")
: - Most users search only within their own country (e.g. search only the "japan1" index)
: - But sometimes, they might ask to search the entire chain (all countries), meaning some sort of "map/reduce" (=collect data from all countries).

what you're describing is one possible use case of "Distributed Search"

http://wiki.apache.org/solr/DistributedSearch

as long as the individual "country" indexes have schemas that
overlap (ie: share some common fields) and have the same uniqueKey field,
with an id space that does *not* overlap between countries (ie: document
"1" can only be in one index, not in any others) then you can do a
query that is distributed out to all of the individual
indexes, with the results merged together to generate aggregate results.

-Hoss

Re: Lucene Grid question

Posted by Chris Hostetter <ho...@fucit.org>.

: E.g. say I have a chain of book-stores, in different countries, and I'm aiming for the following:
: - Each country has its own index file, on its own machine (e.g. books from Japan are indexed on machine "japan1")
: - Most users search only within their own country (e.g. search only the "japan1" index)
: - But sometimes, they might ask to search the entire chain (all countries), meaning some sort of "map/reduce" (=collect data from all countries).

what you're describing is one possible use case of "Distributed Search"

http://wiki.apache.org/solr/DistributedSearch

as long as the individual "country" indexes have schemas that
overlap (ie: share some common fields) and have the same uniqueKey field,
with an id space that does *not* overlap between countries (ie: document
"1" can only be in one index, not in any others) then you can do a
query that is distributed out to all of the individual
indexes, with the results merged together to generate aggregate results.
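
a rough sketch of what the two kinds of request could look like, assuming each country runs a Solr instance on the default port and path. the `shards` request parameter is the one described on the wiki page above; everything else here is an assumption for illustration: only "japan1" comes from this thread, the other host names, the helper `build_query_url` and the `title` field are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses. Only "japan1" appears in the thread.
COUNTRY_SHARDS = {
    "japan": "japan1:8983/solr",
    "france": "france1:8983/solr",
    "brazil": "brazil1:8983/solr",
}

def build_query_url(query, home_country, whole_chain=False, rows=10):
    """Return a select URL for a single-country or a chain-wide search."""
    params = {"q": query, "rows": rows, "fl": "id,score"}
    if whole_chain:
        # List every index that should be queried and merged.
        params["shards"] = ",".join(
            COUNTRY_SHARDS[c] for c in sorted(COUNTRY_SHARDS)
        )
    # The request always goes to the user's own country; with "shards"
    # set, that instance fans the query out and merges the results.
    return "http://" + COUNTRY_SHARDS[home_country] + "/select?" + urlencode(params)

local_url = build_query_url("title:lucene", "japan")
chain_url = build_query_url("title:lucene", "japan", whole_chain=True)
```

the common case (search only "japan1") stays an ordinary single-index query, and the chain-wide search differs only by the extra parameter, so no separate combined index is needed.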


-Hoss