Posted to solr-user@lucene.apache.org by dboychuck <db...@build.com> on 2013/10/23 04:26:14 UTC

Solr Cloud Distributed IDF

I recently moved an index from 3.6 non-distributed to Solr Cloud 4.4 with
three shards. My company uses a boosting function with a value assigned to
each document. This boosting function no longer works dependably and I
believe the cause is that IDF is not distributed.

This seems like it should be a high priority for Solr Cloud. Does anybody
know the status of this feature? I understand that the elevate component
works with Solr Cloud as of version 4.5, but unfortunately switching to it
would be a big change from how we currently use our index and our boosting
function for relevancy scoring.



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Distributed-IDF-tp4097127.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Cloud Distributed IDF

Posted by dboychuck <db...@build.com>.
I am indexing documents using the domain!id format, e.g. id = k-690kohler!670614.
This ensures that all k-690kohler documents are indexed to the same shard.
It does mean that numDocs is not evenly distributed across shards,
probably even less evenly than with the default sharding algorithm.

Here is the search on Solr Cloud
http://solrsolr/productindex/productQuery?q=categories_82_is:108996&bf=linear(popularity_82_i,1,2)^3&debugQuery=true

And on Solr 3.6
http://solr-2-build.sys.id.build.com:8080/solr-build/select?q.alt=categoryId:108996&qt=dismax&bf=linear(popularity,1,2)^3&debugQuery=true&fl=id,productID,manufacturer

Here is the debug output from Solr Cloud

<lst name="explain">
<str name="921rusticware!1210842">
48481.992 = (MATCH) sum of:
  4.7323933 = (MATCH) weight(categories_82_is:`#8;#0;#6;SD in 248779) [DefaultSimilarity], result of:
    4.7323933 = score(doc=248779,freq=1.0 = termFreq=1.0 ), product of:
      0.8745785 = queryWeight, product of:
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        0.16162805 = queryNorm
      5.411056 = fieldWeight in 248779, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        1.0 = fieldNorm(doc=248779)
  48477.26 = (MATCH) FunctionQuery(1.0*float(int(popularity_82_i))+2.0), product of:
    99977.0 = 1.0*float(int(popularity_82_i)=99975)+2.0
    3.0 = boost
    0.16162805 = queryNorm
</str>
<str name="4706baldwin!1223898">
48380.168 = (MATCH) sum of:
  4.7323933 = (MATCH) weight(categories_82_is:`#8;#0;#6;SD in 67238) [DefaultSimilarity], result of:
    4.7323933 = score(doc=67238,freq=1.0 = termFreq=1.0 ), product of:
      0.8745785 = queryWeight, product of:
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        0.16162805 = queryNorm
      5.411056 = fieldWeight in 67238, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        1.0 = fieldNorm(doc=67238)
  48375.438 = (MATCH) FunctionQuery(1.0*float(int(popularity_82_i))+2.0), product of:
    99767.0 = 1.0*float(int(popularity_82_i)=99765)+2.0
    3.0 = boost
    0.16162805 = queryNorm
</str>
<str name="yb5405moen!1748274">
48278.34 = (MATCH) sum of:
  4.7323933 = (MATCH) weight(categories_82_is:`#8;#0;#6;SD in 123982) [DefaultSimilarity], result of:
    4.7323933 = score(doc=123982,freq=1.0 = termFreq=1.0 ), product of:
      0.8745785 = queryWeight, product of:
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        0.16162805 = queryNorm
      5.411056 = fieldWeight in 123982, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        1.0 = fieldNorm(doc=123982)
  48273.61 = (MATCH) FunctionQuery(1.0*float(int(popularity_82_i))+2.0), product of:
    99557.0 = 1.0*float(int(popularity_82_i)=99555)+2.0
    3.0 = boost
    0.16162805 = queryNorm
</str>
<str name="bp53005amerock!1721790">
48262.008 = (MATCH) sum of:
  4.7675867 = (MATCH) weight(categories_82_is:`#8;#0;#6;SD in 108146) [DefaultSimilarity], result of:
    4.7675867 = score(doc=108146,freq=1.0 = termFreq=1.0 ), product of:
      0.8758082 = queryWeight, product of:
        5.4436426 = idf(docFreq=3131, maxDocs=266484)
        0.16088642 = queryNorm
      5.4436426 = fieldWeight in 108146, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.4436426 = idf(docFreq=3131, maxDocs=266484)
        1.0 = fieldNorm(doc=108146)
  48257.24 = (MATCH) FunctionQuery(1.0*float(int(popularity_82_i))+2.0), product of:
    99982.0 = 1.0*float(int(popularity_82_i)=99980)+2.0
    3.0 = boost
    0.16088642 = queryNorm
</str>
<str name="bp29340amerock!1721865">
48208.918 = (MATCH) sum of:
  4.7675867 = (MATCH) weight(categories_82_is:`#8;#0;#6;SD in 108031) [DefaultSimilarity], result of:
    4.7675867 = score(doc=108031,freq=1.0 = termFreq=1.0 ), product of:
      0.8758082 = queryWeight, product of:
        5.4436426 = idf(docFreq=3131, maxDocs=266484)
        0.16088642 = queryNorm
      5.4436426 = fieldWeight in 108031, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.4436426 = idf(docFreq=3131, maxDocs=266484)
        1.0 = fieldNorm(doc=108031)
  48204.15 = (MATCH) FunctionQuery(1.0*float(int(popularity_82_i))+2.0), product of:
    99872.0 = 1.0*float(int(popularity_82_i)=99870)+2.0
    3.0 = boost
    0.16088642 = queryNorm
</str>
<str name="bp53001amerock!1314101">
48176.516 = (MATCH) sum of:
  4.7323933 = (MATCH) weight(categories_82_is:`#8;#0;#6;SD in 47622) [DefaultSimilarity], result of:
    4.7323933 = score(doc=47622,freq=1.0 = termFreq=1.0 ), product of:
      0.8745785 = queryWeight, product of:
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        0.16162805 = queryNorm
      5.411056 = fieldWeight in 47622, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.411056 = idf(docFreq=3181, maxDocs=262058)
        1.0 = fieldNorm(doc=47622)
  48171.785 = (MATCH) FunctionQuery(1.0*float(int(popularity_82_i))+2.0), product of:
    99347.0 = 1.0*float(int(popularity_82_i)=99345)+2.0
    3.0 = boost
    0.16162805 = queryNorm
</str>


And here is the debug output from Solr 3.6
<lst name="explain">
<str name="bp53005amerock">
15421.395 = (MATCH) sum of:
  1.6594616 = (MATCH) weight(categoryId:`#8;#0;#6;SD in 45538), product of:
    0.29207912 = queryWeight(categoryId:`#8;#0;#6;SD), product of:
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      0.05140837 = queryNorm
    5.681548 = (MATCH) fieldWeight(categoryId:`#8;#0;#6;SD in 45538), product of:
      1.0 = tf(termFreq(categoryId:`#8;#0;#6;SD)=1)
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      1.0 = fieldNorm(field=categoryId, doc=45538)
  15419.735 = (MATCH) FunctionQuery(1.0*float(int(popularity))+2.0), product of:
    99982.0 = 1.0*float(int(popularity)=99980)+2.0
    3.0 = boost
    0.05140837 = queryNorm
</str>
<str name="921rusticware">
15420.623 = (MATCH) sum of:
  1.6594616 = (MATCH) weight(categoryId:`#8;#0;#6;SD in 2394), product of:
    0.29207912 = queryWeight(categoryId:`#8;#0;#6;SD), product of:
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      0.05140837 = queryNorm
    5.681548 = (MATCH) fieldWeight(categoryId:`#8;#0;#6;SD in 2394), product of:
      1.0 = tf(termFreq(categoryId:`#8;#0;#6;SD)=1)
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      1.0 = fieldNorm(field=categoryId, doc=2394)
  15418.964 = (MATCH) FunctionQuery(1.0*float(int(popularity))+2.0), product of:
    99977.0 = 1.0*float(int(popularity)=99975)+2.0
    3.0 = boost
    0.05140837 = queryNorm
</str>
<str name="bp29340amerock">
15404.43 = (MATCH) sum of:
  1.6594616 = (MATCH) weight(categoryId:`#8;#0;#6;SD in 154688), product of:
    0.29207912 = queryWeight(categoryId:`#8;#0;#6;SD), product of:
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      0.05140837 = queryNorm
    5.681548 = (MATCH) fieldWeight(categoryId:`#8;#0;#6;SD in 154688), product of:
      1.0 = tf(termFreq(categoryId:`#8;#0;#6;SD)=1)
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      1.0 = fieldNorm(field=categoryId, doc=154688)
  15402.7705 = (MATCH) FunctionQuery(1.0*float(int(popularity))+2.0), product of:
    99872.0 = 1.0*float(int(popularity)=99870)+2.0
    3.0 = boost
    0.05140837 = queryNorm
</str>
<str name="4706baldwin">
15388.235 = (MATCH) sum of:
  1.6594616 = (MATCH) weight(categoryId:`#8;#0;#6;SD in 38679), product of:
    0.29207912 = queryWeight(categoryId:`#8;#0;#6;SD), product of:
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      0.05140837 = queryNorm
    5.681548 = (MATCH) fieldWeight(categoryId:`#8;#0;#6;SD in 38679), product of:
      1.0 = tf(termFreq(categoryId:`#8;#0;#6;SD)=1)
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      1.0 = fieldNorm(field=categoryId, doc=38679)
  15386.576 = (MATCH) FunctionQuery(1.0*float(int(popularity))+2.0), product of:
    99767.0 = 1.0*float(int(popularity)=99765)+2.0
    3.0 = boost
    0.05140837 = queryNorm
</str>
<str name="bp1586amerock">
15372.042 = (MATCH) sum of:
  1.6594616 = (MATCH) weight(categoryId:`#8;#0;#6;SD in 112748), product of:
    0.29207912 = queryWeight(categoryId:`#8;#0;#6;SD), product of:
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      0.05140837 = queryNorm
    5.681548 = (MATCH) fieldWeight(categoryId:`#8;#0;#6;SD in 112748), product of:
      1.0 = tf(termFreq(categoryId:`#8;#0;#6;SD)=1)
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      1.0 = fieldNorm(field=categoryId, doc=112748)
  15370.383 = (MATCH) FunctionQuery(1.0*float(int(popularity))+2.0), product of:
    99662.0 = 1.0*float(int(popularity)=99660)+2.0
    3.0 = boost
    0.05140837 = queryNorm
</str>
<str name="yb5405moen">
15355.849 = (MATCH) sum of:
  1.6594616 = (MATCH) weight(categoryId:`#8;#0;#6;SD in 3515), product of:
    0.29207912 = queryWeight(categoryId:`#8;#0;#6;SD), product of:
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      0.05140837 = queryNorm
    5.681548 = (MATCH) fieldWeight(categoryId:`#8;#0;#6;SD in 3515), product of:
      1.0 = tf(termFreq(categoryId:`#8;#0;#6;SD)=1)
      5.681548 = idf(docFreq=4636, maxDocs=500504)
      1.0 = fieldNorm(field=categoryId, doc=3515)
  15354.189 = (MATCH) FunctionQuery(1.0*float(int(popularity))+2.0), product of:
    99557.0 = 1.0*float(int(popularity)=99555)+2.0
    3.0 = boost
    0.05140837 = queryNorm
</str>


The problem was noticed when bp53005amerock didn't show in the first
position in Solr Cloud. The popularity values are the same in both indexes,
and this is just a simple field search, so the TF should always be 1. The
only discrepancy I can see is in the IDF value: the maxDocs and docFreq
values differ per shard, which would account for the scoring differences
between the two indexes.
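
The Solr Cloud numbers above can be reproduced by hand, which makes the effect easier to see. What follows is a minimal Python sketch, not Solr code. It assumes the classic DefaultSimilarity formulas, idf = 1 + ln(maxDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of squared weights), and the function and variable names are made up for illustration. The docFreq, maxDocs and popularity values are copied from the debug output. The per-shard IDF feeds into queryNorm, and queryNorm scales the popularity boost, so a more popular document on one shard can end up below a less popular document on another shard.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed classic Lucene DefaultSimilarity idf formula.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def score(doc_freq, max_docs, popularity, bf_boost=3.0):
    """Approximate the explain output for a single-term match combined
    with bf=linear(popularity,1,2)^3 (hypothetical helper)."""
    term_idf = idf(doc_freq, max_docs)
    # Sum of squared weights: idf^2 for the term clause (boost 1)
    # plus boost^2 for the function query clause.
    query_norm = 1.0 / math.sqrt(term_idf ** 2 + bf_boost ** 2)
    query_weight = term_idf * query_norm
    field_weight = 1.0 * term_idf * 1.0  # tf * idf * fieldNorm
    term_score = query_weight * field_weight
    # linear(popularity,1,2) = 1*popularity + 2, then boost and queryNorm.
    func_score = (1.0 * popularity + 2.0) * bf_boost * query_norm
    return term_score + func_score

# Shard statistics and popularity values taken from the debug output.
score_921rusticware = score(doc_freq=3181, max_docs=262058, popularity=99975)
score_bp53005amerock = score(doc_freq=3131, max_docs=266484, popularity=99980)

print("921rusticware :", score_921rusticware)
print("bp53005amerock:", score_bp53005amerock)
```

Under these assumptions both results land close to the 48481.992 and 48262.008 in the explain output, with small differences expected because Lucene computes in 32-bit floats.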




Re: Solr Cloud Distributed IDF

Posted by Upayavira <uv...@odoko.co.uk>.
Can you say more about the problem? What did you see that led to that
conclusion? How did you distribute docs between shards, and how is that
different from your 3.6 setup?

It might be a distributed IDF thing, or it could be something simpler.

Upayavira


Re: Solr Cloud Distributed IDF

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2013-10-23 at 04:26 +0200, dboychuck wrote:
> I recently moved an index from 3.6 non-distributed to Solr Cloud 4.4 with
> three shards. My company uses a boosting function with a value assigned to
> each document. This boosting function no longer works dependably and I
> believe the cause is that IDF is not distributed.
> 
> This seems like it should be a high priority for Solr Cloud.

It has been relevant for several years, well before SolrCloud. We run a
mixed environment (Lucene/Solr/external index) and hacked a kinda-sorta
distributed IDF together by boosting the search terms, but it is a poor
man's solution.

Distributed IDF for Solr is a very old JIRA issue, dating back to 2009:
https://issues.apache.org/jira/browse/SOLR-1632
Activity has been on/off and I can see that it was last updated in June,
but I have no idea of how close it is to completion.
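
For readers new to the term: distributed IDF means computing IDF from collection-wide statistics instead of each shard's own docFreq and maxDocs. A minimal Python sketch of that idea follows. It is not the SOLR-1632 implementation, and it assumes the classic DefaultSimilarity formula idf = 1 + ln(maxDocs / (docFreq + 1)). The two (docFreq, maxDocs) pairs are the ones visible in the Solr Cloud debug output earlier in the thread; the third shard's statistics are not shown there, so it is left out. All names are illustrative.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed classic Lucene DefaultSimilarity idf formula.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def global_idf(shard_stats):
    """Combine per-shard (docFreq, maxDocs) pairs into one shared idf
    (hypothetical helper illustrating the concept only)."""
    total_doc_freq = sum(doc_freq for doc_freq, _ in shard_stats)
    total_max_docs = sum(max_docs for _, max_docs in shard_stats)
    return idf(total_doc_freq, total_max_docs)

# (docFreq, maxDocs) for the two shards that appear in the debug output.
shard_stats = [(3181, 262058), (3131, 266484)]

local_idfs = [idf(doc_freq, max_docs) for doc_freq, max_docs in shard_stats]
shared_idf = global_idf(shard_stats)

print("per-shard idf:", local_idfs)
print("shared idf   :", shared_idf)
```

If every shard scored with the shared value, the queryNorm would be identical across shards as well, and documents would be ordered by popularity alone for this query.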

If you want anything out-of-the-box at this time, you'll have to look at
Elasticsearch, which has this feature.

- Toke Eskildsen, State and University Library, Denmark