Posted to solr-user@lucene.apache.org by Thorsten Scherler <th...@juntadeandalucia.es> on 2010/01/20 13:55:10 UTC

big index vs. lots of small ones

Hi all,

I have to do an analysis of the following use case.

I am working as a consultant for a public company. We are discussing
offering each public institution its own search server in the future,
(probably) based on Apache Solr. However, the users of our portal should
be able to search all indexes.

The problematic part for our customer is that a meta search across several
indexes, which merges the responses afterwards, will change the scoring.

Imagine you have these two indexes:
- public health department (A)
- press relations department (B)

Now you have 300 documents in A and only one in B about "influenza A".
The B server will return the only document in its index with a very high
score, since being the only one it gets a very high "base" score,
correct?

On the other hand, A may have much more important documents, but they will
not get the same "base" score.

This means that after a merge the document from server B will most likely
be at the top of the list.
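
To make the effect concrete, here is a minimal sketch in Python. It assumes
the "base" score is the IDF factor of the classic Lucene TF-IDF similarity,
idf = 1 + ln(numDocs / (docFreq + 1)). The index sizes (1000 documents per
server) are made-up numbers for illustration; only the 300 and 1 matching
documents come from the example above.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as in the classic Lucene TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index sizes; doc_freq is the number of documents
# matching the query term on each server.
shards = {
    "A": {"num_docs": 1000, "doc_freq": 300},
    "B": {"num_docs": 1000, "doc_freq": 1},
}

# Each server computes IDF from its own statistics only.
local_idf = {
    name: classic_idf(s["doc_freq"], s["num_docs"])
    for name, s in shards.items()
}

# A single merged index would use the combined statistics instead.
total_docs = sum(s["num_docs"] for s in shards.values())
total_df = sum(s["doc_freq"] for s in shards.values())
global_idf = classic_idf(total_df, total_docs)

for name, value in local_idf.items():
    print(f"local idf on {name}: {value:.3f}")
print(f"idf in one merged index: {global_idf:.3f}")
```

Under these assumptions the term is weighted much more heavily on B than on
A, so scores coming from the two servers are not directly comparable.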

To prevent this we are looking into merging all the standalone indexes
into one big index, but that will lead to other problems because it will
become pretty big pretty fast.

So here are my questions:

- What are other people doing to solve this problem?
- What is the best way with Solr to solve the problem of the "base"
scoring?
- What is the best way to have multiple indexes in Solr?
- Is it possible to get rid of the "base" scoring in Solr?

TIA for any information.

Regards
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)

Re: big index vs. lots of small ones

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.

On Wed, 2010-01-20 at 08:38 -0800, Marc Sturlese wrote:
> Check out this patch, which solves the distributed IDF problem:
> https://issues.apache.org/jira/browse/SOLR-1632
> I think it fixes what you are describing. The price you pay is that there
> are 2 requests per shard. If I am not wrong, the first one gets the term
> frequencies and other needed info, and the second one is the actual search
> request. The patch also includes caching for terms in the first request.
> 

Nice!

Thank you very much, Marc.

How are things in Barcelona?

Regards


Re: big index vs. lots of small ones

Posted by Marc Sturlese <ma...@gmail.com>.

Check out this patch, which solves the distributed IDF problem:
https://issues.apache.org/jira/browse/SOLR-1632
I think it fixes what you are describing. The price you pay is that there
are 2 requests per shard. If I am not wrong, the first one gets the term
frequencies and other needed info, and the second one is the actual search
request. The patch also includes caching for terms in the first request.
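
The two-request flow described above can be sketched roughly as follows.
This is not the code from SOLR-1632: the Shard class, its methods and the
sample documents are hypothetical, and the scoring is a simplified
sqrt(tf) * idf using the classic Lucene IDF formula. The point is only that
every shard scores with the same, globally aggregated statistics.

```python
import math
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical in-memory stand-in for one search server."""
    name: str
    docs: dict  # doc id -> list of terms

    def term_stats(self, term):
        # Request 1: report local document frequency and index size.
        df = sum(1 for terms in self.docs.values() if term in terms)
        return df, len(self.docs)

    def search(self, term, idf):
        # Request 2: score local hits with the IDF supplied by the caller.
        hits = []
        for doc_id, terms in self.docs.items():
            tf = terms.count(term)
            if tf:
                hits.append((math.sqrt(tf) * idf, self.name, doc_id))
        return hits


def distributed_search(shards, term):
    # Phase 1: collect statistics from every shard and aggregate them.
    stats = [s.term_stats(term) for s in shards]
    total_df = sum(df for df, _ in stats)
    total_docs = sum(n for _, n in stats)
    idf = 1.0 + math.log(total_docs / (total_df + 1))
    # Phase 2: run the search on every shard with the shared IDF, then merge.
    hits = [h for s in shards for h in s.search(term, idf)]
    return sorted(hits, reverse=True)


shard_a = Shard("A", {
    "a1": ["influenza", "influenza", "vaccine"],
    "a2": ["influenza"],
    "a3": ["budget"],
})
shard_b = Shard("B", {
    "b1": ["influenza"],
    "b2": ["press"],
    "b3": ["press"],
})

results = distributed_search([shard_a, shard_b], "influenza")
for score, shard_name, doc_id in results:
    print(f"{score:.3f} {shard_name} {doc_id}")
```

With shared statistics, the single matching document on B no longer gets a
boost just because it is rare within its own index.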
