You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Toke Eskildsen <te...@statsbiblioteket.dk> on 2010/10/25 11:22:00 UTC

Re: a bug of solr distributed search

On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
> But itshows a problem of distrubted search without common idf.
> A doc will get different score in different shard.

Bingo.

I really don't understand why this fundamental problem with sharding
isn't mentioned more often. Every time the advice "use sharding" is
given, it should be followed with a "but be aware that it will make
relevance ranking unreliable".

Regards,
Toke Eskildsen


Re: a bug of solr distributed search

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-10-25 13:37, Toke Eskildsen wrote:
> On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
>> * there is an exact solution to this problem, namely to make two
>> distributed calls instead of one (first call to collect per-shard IDFs
>> for given query terms, second call to submit a query rewritten with the
>> global IDF-s). This solution is implemented in SOLR-1632, with some
>> caching to reduce the cost for common queries.
> 
> I must admit that I have not tried the patch myself. Looking at
> https://issues.apache.org/jira/browse/SOLR-1632
> i see that the last comment is from LiLi with a failed patch, but as
> there are no further comments it is unclear if the problem is general or
> just with LiLi's setup. I might be a bit harsh here, but the other
> comments for the JIRA issue also indicate that one would have to be
> somewhat adventurous to run this in production. 

Oh, definitely this is not production quality yet - there are known
bugs, for example, that I need to fix, and then it needs to be
forward-ported to trunk. It shouldn't be too much work to bring it back
into usable state.

>> * another reason is that in many many cases the difference between using
>> exact global IDF and per-shard IDFs is not that significant. If shards
>> are more or less homogenous (e.g. you assign documents to shards by
>> hash(docId)) then term distributions will be also similar.
> 
> While I agree on the validity of the solution, it does put some serious
> constraints on the shard-setup.

True. But this is the simplest setup that just may be enough.

> 
>> To summarize, I would qualify your statement with: "...if the
>> composition of your shards is drastically different". Otherwise the cost
>> of using global IDF is not worth it, IMHO.
> 
> Do you know of any studies of the differences in ranking with regard to
> indexing-distribution by hashing, logical grouping and distributed IDF?

Unfortunately, this information is surprisingly scarce - research
predating year 2000 is often not applicable, and most current research
concentrates on P2P systems, which are really a different ball of wax.
Here's a few papers that I found that are related to this issue:

* Global Term Weights in Distributed Environments, H. Witschel, 2007
(Elsevier)

* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel,
P. Triantafillou, G. Weikum, VLDB'05 (ACM)

* Exploring the Stability of IDF Term Weighting, Xin Fu and  Miao Chen,
2008 (Springer Verlag)

* A Comparison of Techniques for Estimating IDF Values to Generate
Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)

* Comparison of dierent Collection Fusion Models in Distributed
Information Retrieval, Alexander Steidinger - this paper gives a nice
comparison framework for different strategies for joining partial
results; apparently we use the most primitive strategy explained there,
based on raw scores...

These papers likely don't fully answer your question, but at least they
provide a broader picture of the issue...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: a bug of solr distributed search

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
> * there is an exact solution to this problem, namely to make two
> distributed calls instead of one (first call to collect per-shard IDFs
> for given query terms, second call to submit a query rewritten with the
> global IDF-s). This solution is implemented in SOLR-1632, with some
> caching to reduce the cost for common queries.

I must admit that I have not tried the patch myself. Looking at
https://issues.apache.org/jira/browse/SOLR-1632
i see that the last comment is from LiLi with a failed patch, but as
there are no further comments it is unclear if the problem is general or
just with LiLi's setup. I might be a bit harsh here, but the other
comments for the JIRA issue also indicate that one would have to be
somewhat adventurous to run this in production. 

> * another reason is that in many many cases the difference between using
> exact global IDF and per-shard IDFs is not that significant. If shards
> are more or less homogenous (e.g. you assign documents to shards by
> hash(docId)) then term distributions will be also similar.

While I agree on the validity of the solution, it does put some serious
constraints on the shard-setup.

> To summarize, I would qualify your statement with: "...if the
> composition of your shards is drastically different". Otherwise the cost
> of using global IDF is not worth it, IMHO.

Do you know of any studies of the differences in ranking with regard to
indexing-distribution by hashing, logical grouping and distributed IDF?

Regards,
Toke Eskildsen


Re: a bug of solr distributed search

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2010-10-26 at 15:48 +0200, Ron Mayer wrote:
> And a third potential reason - it's arguably a feature instead of a bug
> for some applications.  Depending on how I organize my shards, "give me
> the most relevant document from each shard for this search" seems like
> it could be useful.

You can get that even if the shards scored equally, so it is a
limitation, not a feature. I hope to find the time later this week to
read some of the papers Andrzej was kind enough to point out, but it
seems like I really need to do the heavy lifting of setting up
comparisons for our own material.

The problem is of course to judge the quality of the outputs, but
setting the single index as the norm and plotting the differences in
document positions in the result sets might provide some insight.

Regards,
Toke Eskildsen


Re: a bug of solr distributed search

Posted by Ron Mayer <rm...@0ape.com>.
Andrzej Bialecki wrote:
> On 2010-10-25 11:22, Toke Eskildsen wrote:
>> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
>>> But itshows a problem of distrubted search without common idf.
>>> A doc will get different score in different shard.
>> Bingo.
>>
>> I really don't understand why this fundamental problem with sharding
>> isn't mentioned more often. Every time the advice "use sharding" is
>> given, it should be followed with a "but be aware that it will make
>> relevance ranking unreliable".
> 
> The reason is twofold, I think:


And a third potential reason - it's arguably a feature instead of a bug
for some applications.  Depending on how I organize my shards, "give me
the most relevant document from each shard for this search" seems like
it could be useful.

> * there is an exact solution to this problem, namely to make two
> distributed calls instead of one (first call to collect per-shard IDFs
> for given query terms, second call to submit a query rewritten with the
> global IDF-s). This solution is implemented in SOLR-1632, with some
> caching to reduce the cost for common queries. However, this means that
> now for every query you need to make two calls instead of one, which
> potentially doubles the time to return results (for simple common
> queries - for rare complex queries the time will be still dominated by
> the query runtime on shard servers).
> 
> * another reason is that in many many cases the difference between using
> exact global IDF and per-shard IDFs is not that significant. If shards
> are more or less homogenous (e.g. you assign documents to shards by
> hash(docId)) then term distributions will be also similar. So then the
> question is whether you can accept an N% variance in scores across
> shards, or whether you want to bear the cost of an additional
> distributed RPC for every query...
> 
> To summarize, I would qualify your statement with: "...if the
> composition of your shards is drastically different". Otherwise the cost
> of using global IDF is not worth it, IMHO.
> 


Re: a bug of solr distributed search

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-10-25 11:22, Toke Eskildsen wrote:
> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
>> But itshows a problem of distrubted search without common idf.
>> A doc will get different score in different shard.
> 
> Bingo.
> 
> I really don't understand why this fundamental problem with sharding
> isn't mentioned more often. Every time the advice "use sharding" is
> given, it should be followed with a "but be aware that it will make
> relevance ranking unreliable".

The reason is twofold, I think:

* there is an exact solution to this problem, namely to make two
distributed calls instead of one (first call to collect per-shard IDFs
for given query terms, second call to submit a query rewritten with the
global IDF-s). This solution is implemented in SOLR-1632, with some
caching to reduce the cost for common queries. However, this means that
now for every query you need to make two calls instead of one, which
potentially doubles the time to return results (for simple common
queries - for rare complex queries the time will be still dominated by
the query runtime on shard servers).

* another reason is that in many many cases the difference between using
exact global IDF and per-shard IDFs is not that significant. If shards
are more or less homogenous (e.g. you assign documents to shards by
hash(docId)) then term distributions will be also similar. So then the
question is whether you can accept an N% variance in scores across
shards, or whether you want to bear the cost of an additional
distributed RPC for every query...

To summarize, I would qualify your statement with: "...if the
composition of your shards is drastically different". Otherwise the cost
of using global IDF is not worth it, IMHO.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com