Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/06/01 11:13:44 UTC

RE: per-fieldtype similarity not working

Thanks, but I am clearly missing something. We declare the similarity in the fieldType just as in the example, and looking at the example again I don't see how it's being done differently. What am I missing, and where? :)

-----Original message-----
> From:Robert Muir <rc...@gmail.com>
> Sent: Thu 31-May-2012 17:47
> To: solr-user@lucene.apache.org
> Subject: Re: per-fieldtype similarity not working
> 
> On Thu, May 31, 2012 at 11:23 AM, Markus Jelsma
> <ma...@openindex.io> wrote:
> 
> > We simply declare the following in our fieldType:
> > <similarity class="FQCN"/>
> >
> 
> That's not enough; see the example:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml
> 
> 
> -- 
> lucidimagination.com
>

RE: per-fieldtype similarity not working

Posted by Markus Jelsma <ma...@openindex.io>.

Excellent!
Thanks

 
 
-----Original message-----
> From:Robert Muir <rc...@gmail.com>
> Sent: Fri 08-Jun-2012 13:06
> To: Markus Jelsma <ma...@openindex.io>
> Cc: solr-user@lucene.apache.org
> Subject: Re: per-fieldtype similarity not working
> 
> On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > Thanks Robert,
> >
> > The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking, but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse: BM25F?
> 
> I think it's easiest to compare the two TF normalization functions.
> DefaultSimilarity really needs something like this because its
> function (sqrt) grows very fast for a single term.
> On the other hand, consider BM25's tf/(tf+lengthNorm): it saturates
> rather quickly for a single term, so when multiple terms are being
> scored, huge numbers of occurrences of a single term won't dominate
> the overall score.
> 
> You can see this visually here (give it a second to load, and imagine
> documentLength = averageDocumentLength and k=1.2):
> http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
> 
> >
> > Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might cause further confusion down the line.
> 
> That's OK: I'd rather the very expert case (per-field scoring) be
> trickier than have a trap for people who try to use any algorithm
> other than TFIDFSimilarity.
> 
> -- 
> lucidimagination.com
>

Re: per-fieldtype similarity not working

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Thanks Robert,
>
> The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking, but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse: BM25F?

I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like this because its
function (sqrt) grows very fast for a single term.
On the other hand, consider BM25's tf/(tf+lengthNorm): it saturates
rather quickly for a single term, so when multiple terms are being
scored, huge numbers of occurrences of a single term won't dominate
the overall score.

You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
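
The same comparison can be made without the plot. The sketch below is a
minimal illustration, not Lucene's actual code: it evaluates both functions
at a few term frequencies. It assumes documentLength = averageDocumentLength,
so the length normalization collapses to the constant k = 1.2 used in the
plot, and it uses the simplified tf/(tf+k) form quoted above. The class and
method names are made up for illustration.

```java
// Compares the two term-frequency normalizations discussed in this message.
// Assumption: documentLength == averageDocumentLength, so the BM25-style
// length normalization reduces to the constant K = 1.2 from the plot.
final class TfSaturation {
    static final double K = 1.2;

    // DefaultSimilarity-style tf: square root of the raw term frequency.
    static double sqrtTf(double freq) {
        return Math.sqrt(freq);
    }

    // Simplified BM25-style tf: freq / (freq + K), always below 1.
    static double saturatingTf(double freq) {
        return freq / (freq + K);
    }

    public static void main(String[] args) {
        for (int freq : new int[] {1, 2, 10, 100}) {
            System.out.printf("freq=%3d  sqrt=%.4f  saturating=%.4f%n",
                    freq, sqrtTf(freq), saturatingTf(freq));
        }
    }
}
```

The sqrt column keeps growing (it reaches 10 at freq=100), while the
saturating column never reaches 1, which is why one very frequent term
cannot dominate a multi-term score under the second function.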

>
> Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might cause further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be
trickier than have a trap for people who try to use any algorithm
other than TFIDFSimilarity.

-- 
lucidimagination.com

RE: per-fieldtype similarity not working

Posted by Markus Jelsma <ma...@openindex.io>.

Thanks Robert,

The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking, but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse: BM25F?

Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might cause further confusion down the line.

I'll open an issue for the missing Similarity implementation name in the debug output when per-field similarity is enabled.

Cheers!

-----Original message-----
> From:Robert Muir <rc...@gmail.com>
> Sent: Fri 01-Jun-2012 18:16
> To: solr-user@lucene.apache.org
> Subject: Re: per-fieldtype similarity not working
> 
> On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > Hi!
> >
> >
> > Ah, it makes sense now! The globally configured similarity returns a fieldType-defined similarity if available and, if not, the standard Lucene similarity. This would, I assume, mean that the two similarities defined below, without per-fieldType similarities declared, would always yield the same results?
> 
> Not true: note that two methods (coord and queryNorm) are not per-field
> but global across the entire query tree.
> 
> By default these are disabled in the wrapper, as they only skew or
> confuse most modern scoring algorithms (e.g. all the new ranking
> algorithms in Lucene 4), respectively.
> 
> So if you want to do per-field scoring where *all* of your sims are
> vector-space, it could make sense to customize (e.g. subclass)
> SchemaSimilarityFactory and do something useful for these methods.
> 
> 
> -- 
> lucidimagination.com
>

Re: per-fieldtype similarity not working

Posted by "mike.vogel" <mi...@knowledgent.com>.

Any example or suggestion for how to patch the wrapper so that the coord
method is called for the field type with the custom similarity?




Re: per-fieldtype similarity not working

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi!
>
>
> Ah, it makes sense now! The globally configured similarity returns a fieldType-defined similarity if available and, if not, the standard Lucene similarity. This would, I assume, mean that the two similarities defined below, without per-fieldType similarities declared, would always yield the same results?

Not true: note that two methods (coord and queryNorm) are not per-field
but global across the entire query tree.

By default these are disabled in the wrapper, as they only skew or
confuse most modern scoring algorithms (e.g. all the new ranking
algorithms in Lucene 4), respectively.
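
The explain output posted elsewhere in this thread gives a quick check of
the queryNorm part. The following is a minimal sketch, not Lucene code: it
copies the three clause scores and the queryNorm value (0.08485084) from that
output, applies the "max plus 0.27 times others" combination shown there, and
compares the two totals. The class and method names are hypothetical.

```java
// Re-derives totals from the explain output posted in this thread.
// All constants are copied from that output; nothing is measured here.
final class ExplainCheck {
    // "max plus tie times others": best clause plus tie * (sum of the rest).
    // Assumes non-negative clause scores, as in the posted output.
    static double disMax(double tie, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        double tie = 0.27;
        double queryNorm = 0.08485084; // value reported with DefaultSimilarity

        // Clause scores (content, title, url) with SchemaSimilarityFactory,
        // where queryNorm is reported as 1.0.
        double[] factory = {5.434552, 4.300008, 35.937153};

        // Scale every clause by the DefaultSimilarity queryNorm.
        double[] scaled = new double[factory.length];
        for (int i = 0; i < factory.length; i++) {
            scaled[i] = factory[i] * queryNorm;
        }

        System.out.println(disMax(tie, factory)); // compare with 38.565483
        System.out.println(disMax(tie, scaled));  // compare with 3.2723136
    }
}
```

Since queryNorm is a single constant for the whole query, scaling by it
changes the absolute scores but not the order of the documents.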

So if you want to do per-field scoring where *all* of your sims are
vector-space, it could make sense to customize (e.g. subclass)
SchemaSimilarityFactory and do something useful for these methods.

-- 
lucidimagination.com

RE: per-fieldtype similarity not working

Posted by Markus Jelsma <ma...@openindex.io>.

Hi!


Ah, it makes sense now! The globally configured similarity returns a fieldType-defined similarity if available and, if not, the standard Lucene similarity. This would, I assume, mean that the two similarities defined below, without per-fieldType similarities declared, would always yield the same results?

<similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/>
<similarity class="solr.SchemaSimilarityFactory"/>

I would assume so because, without a per-fieldType declaration, the SchemaSimilarityFactory returns the default Lucene Similarity. However, when checking, it doesn't work for my url field but does work for the content and title fields. I have defined the same similarity for the url fieldType as I did for the title fieldType. This is the output for solr.SchemaSimilarityFactory without a per-field similarity declared:

  38.565483 = (MATCH) max plus 0.27 times others of:
    5.434552 = (MATCH) weight(content:groning^1.4 in 384) [], result of:
      5.434552 = score(doc=384,freq=10.0 = termFreq=10.0
), product of:
        1.5511217 = queryWeight, product of:
          1.4 = boost
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          1.0 = queryNorm
        3.503627 = fieldWeight in 384, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          1.0 = fieldNorm(doc=384)
    4.300008 = (MATCH) weight(title:groning^4.7 in 384) [], result of:
      4.300008 = score(doc=384,freq=2.0 = termFreq=2.0
), product of:
        5.346149 = queryWeight, product of:
          4.7 = boost
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          1.0 = queryNorm
        0.8043188 = fieldWeight in 384, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          0.5 = fieldNorm(doc=384)
    35.937153 = (MATCH) weight(url:groning^2.1 in 384) [], result of:
      35.937153 = score(doc=384,freq=1.0 = termFreq=1.0
), product of:
        10.988577 = queryWeight, product of:
          2.1 = boost
          5.232656 = idf(docFreq=19, maxDocs=1378)
          1.0 = queryNorm
        3.27041 = fieldWeight in 384, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.232656 = idf(docFreq=19, maxDocs=1378)
          0.625 = fieldNorm(doc=384)


Here's the output with DefaultSimilarity declared:

  3.2723136 = (MATCH) max plus 0.27 times others of:
    0.46112633 = (MATCH) weight(content:groning^1.4 in 327) [DefaultSimilarity], result of:
      0.46112633 = score(doc=327,freq=10.0 = termFreq=10.0
), product of:
        0.13161398 = queryWeight, product of:
          1.4 = boost
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          0.08485084 = queryNorm
        3.503627 = fieldWeight in 327, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          1.0 = fieldNorm(doc=327)
    0.36485928 = (MATCH) weight(title:groning^4.7 in 327) [DefaultSimilarity], result of:
      0.36485928 = score(doc=327,freq=2.0 = termFreq=2.0
), product of:
        0.45362523 = queryWeight, product of:
          4.7 = boost
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          0.08485084 = queryNorm
        0.8043188 = fieldWeight in 327, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          0.5 = fieldNorm(doc=327)
    3.0492976 = (MATCH) weight(url:groning^2.1 in 327) [DefaultSimilarity], result of:
      3.0492976 = score(doc=327,freq=1.0 = termFreq=1.0
), product of:
        0.93239 = queryWeight, product of:
          2.1 = boost
          5.232656 = idf(docFreq=19, maxDocs=1378)
          0.08485084 = queryNorm
        3.27041 = fieldWeight in 327, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.232656 = idf(docFreq=19, maxDocs=1378)
          0.625 = fieldNorm(doc=327)

How can I explain the difference? Also, with the factory declared, the score of the url field is still the same; it does not seem to honor the per-field declared similarity. The debug output also seems wrong: it does not write the similarity class name between [] and produces an empty [] for each match.

Many thanks and a nice weekend!
Markus
 
 
-----Original message-----
> From:Robert Muir <rc...@gmail.com>
> Sent: Fri 01-Jun-2012 17:00
> To: solr-user@lucene.apache.org
> Subject: Re: per-fieldtype similarity not working
> 
> On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > Thanks, but I am clearly missing something. We declare the similarity in the fieldType just as in the example, and looking at the example again I don't see how it's being done differently. What am I missing, and where? :)
> >
> 
> Hi Markus, check out the last line at the bottom:
>  <!-- default similarity, defers to the fieldType -->
>  <similarity class="solr.SchemaSimilarityFactory"/>
> 
> When this is set, it means IndexSearcher/IndexWriter use a
> PerFieldSimilarityWrapper that delegates based on the Solr schema
> fieldType.
> 
> Note this is just a simple ordinary similarity impl
> (http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/similarities/SchemaSimilarityFactory.java),
> you could also write your own that works differently.
> 
> -- 
> lucidimagination.com
>

Re: per-fieldtype similarity not working

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Thanks, but I am clearly missing something. We declare the similarity in the fieldType just as in the example, and looking at the example again I don't see how it's being done differently. What am I missing, and where? :)
>

Hi Markus, check out the last line at the bottom:
 <!-- default similarity, defers to the fieldType -->
 <similarity class="solr.SchemaSimilarityFactory"/>

When this is set, it means IndexSearcher/IndexWriter use a
PerFieldSimilarityWrapper that delegates based on the Solr schema
fieldType.

Note this is just a simple ordinary similarity impl
(http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/similarities/SchemaSimilarityFactory.java),
you could also write your own that works differently.
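
The delegation described here can be pictured with a toy model. The sketch
below is plain Java with hypothetical types (ToySimilarity, PerFieldSketch);
it does not use or reproduce the Lucene/Solr classes. It only shows a lookup
that returns the similarity declared for a field and falls back to a default
otherwise, with the query-wide coord and queryNorm factors fixed at 1, which
is how the wrapper's default behaviour is described elsewhere in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-field delegation. These types are hypothetical and are
// NOT Lucene or Solr classes; they only mirror the behaviour described here.
final class PerFieldSketch {
    interface ToySimilarity {
        double tf(double freq);
    }

    // Stand-ins for a default and a custom per-field scoring function.
    static final ToySimilarity DEFAULT = freq -> Math.sqrt(freq);
    static final ToySimilarity SATURATING = freq -> freq / (freq + 1.2);

    // Field name -> similarity declared for that field (if any).
    private final Map<String, ToySimilarity> declared = new HashMap<>();

    void declare(String field, ToySimilarity sim) {
        declared.put(field, sim);
    }

    // Delegate to the declared similarity, else fall back to the default.
    ToySimilarity get(String field) {
        return declared.getOrDefault(field, DEFAULT);
    }

    // Query-wide factors are not per-field; this sketch neutralizes them.
    double coord(int overlap, int maxOverlap) {
        return 1.0;
    }

    double queryNorm(double sumOfSquaredWeights) {
        return 1.0;
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch();
        wrapper.declare("url", SATURATING);

        // "url" uses its declared function; "title" falls back to the default.
        System.out.println(wrapper.get("url").tf(10.0));
        System.out.println(wrapper.get("title").tf(10.0));
    }
}
```

Without the global wrapper in place nothing performs this lookup, which
matches the symptom in this thread: a similarity declared only on the
fieldType is never consulted.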

-- 
lucidimagination.com