Posted to solr-user@lucene.apache.org by tstusr <ul...@gmail.com> on 2017/04/21 15:22:44 UTC

Modify solr score

Hi.

We are building an application that searches documents for specific topics:
the more topic words captured in a document, the higher its score should be.

We have two test scenarios. The first uses documents that users tagged as
relevant; the second uses documents from outside our domain.

In the first scenario, the captured terms make up 1-2% of all words in a
document. In the second scenario, the ratio is below 0.005%.

Nevertheless, the scores are almost equal: ~0.85 for the first scenario and
~0.8 for the second.

So what we want is to lower the score we report for the second scenario, in
some way that reflects the percentage of words captured.

Is there any way to store those values in a field so we can use them as a
query boost? Or any way to override the default score calculation to change
relevancy?
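
As a rough illustration of the ratio described above (captured topic terms
over all words in a document), here is a minimal sketch. The regex tokenizer,
the helper name hit_ratio and the sample vocabulary are assumptions made for
the example; they are not taken from the application discussed here.

```python
import re

def hit_ratio(text, vocabulary):
    """Fraction of words in `text` that belong to `vocabulary` (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty documents
    vocab = {term.lower() for term in vocabulary}
    hits = sum(1 for word in words if word in vocab)
    return hits / len(words)

# Illustrative data only.
topic = {"solr", "lucene", "index"}
on_topic = "Solr builds a Lucene index and Solr queries that index"
off_topic = "The weather was mild and the harvest came early this year"

print(hit_ratio(on_topic, topic))   # 5 of 10 words are topic terms
print(hit_ratio(off_topic, topic))  # no topic terms at all
```

A value like this, computed at index time, is the kind of number that could be
stored in a field and used later as a boost.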


Thanks in advance...



--
View this message in context: http://lucene.472066.n3.nabble.com/Modify-solr-score-tp4331300.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Modify solr score

Posted by Rick Leir <rl...@leirtech.com>.
Ulf: Maybe there is a way you could filter out the unrelated documents. Qf?
Rick


-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Modify solr score

Posted by tstusr <ul...@gmail.com>.
Well, I know they can change.

I think the main problem here is that, at this point, documents completely
unrelated to a topic are being ranked as high as related ones. So, in order
to penalize them, we are trying to use the ratio of term frequency to
document length in words.

Nevertheless, we haven't found a practical way to do it.

Greetings.




Re: Modify solr score

Posted by Walter Underwood <wu...@wunderwood.org>.
Using a minimum score cut off does not work. The score is not an absolute estimate of relevance.

The idf component of the score is a whole-corpus metric. When you add or delete documents, the scores for the exact same query can change.
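
To make the idf point concrete, here is a small sketch using one common idf
variant, 1 + ln(N / (df + 1)). The toy corpus and the function are assumptions
for illustration only; Solr's default similarity (BM25 since Solr 6) uses a
different formula, but it depends on corpus-wide statistics in the same way.

```python
import math

def idf(term, corpus):
    """One common idf variant: 1 + ln(N / (df + 1)). Illustrative only."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.split())
    return 1.0 + math.log(n_docs / (doc_freq + 1))

corpus = ["solr score boost", "solr query parser", "garden soil"]
before = idf("solr", corpus)

# Adding documents that never mention "solr" makes the term relatively rarer,
# so its idf (and any score built on it) rises for the exact same query.
corpus += ["tomato plants", "rose pruning", "compost bins"]
after = idf("solr", corpus)

print(before, after)
```

Nothing about the original three documents or the query changed between the
two calls, yet the weight of the term did.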

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Modify solr score

Posted by tstusr <ul...@gmail.com>.
Well, maybe I explained it wrong.

We have entry points, each of which is related to a topic. This means that
when we select the first topic, all information has to be related in some way
to that topic's vocabulary. So it can work, since we select documents that
are not related to the vocabulary of each entry point in order to establish a
minimum threshold. That is why we are trying to use the hit ratio to modify
the score.

Once we have ranked on those topics, the remaining work is about faceting,
word selection and so on.

Greetings




Re: Modify solr score

Posted by Walter Underwood <wu...@wunderwood.org>.
It isn’t going to work. The score is not an absolute relevance measurement. It only says that the first document is more relevant than the second, and so on.

Scores are not comparable between different queries. The score cannot be used to say that the first hit for query A is a better match than the first hit for query B.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Modify solr score

Posted by tstusr <ul...@gmail.com>.
We came up with a simple solution.

We use termfreq <https://wiki.apache.org/solr/FunctionQuery#termfreq> and
wrote a simple processor that counts words, so that a boost function can
calculate the ratio between the words that hit terms and the whole field
length.

We are still running tests; maybe it will solve the problem.
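
A minimal sketch of what such a request might look like, assuming the
processor stores the word count in a field. The field names content and
word_count_i, the alias hit_ratio and both helper functions are hypothetical;
termfreq, sum, div, the edismax boost parameter and function pseudo-fields in
fl are standard Solr features. The code only builds the query string and does
not contact a server.

```python
from urllib.parse import urlencode

def ratio_function(terms, text_field, count_field):
    """Build a Solr function: (sum of term frequencies) / (stored word count)."""
    if not terms:
        raise ValueError("at least one term is required")
    freqs = [f"termfreq({text_field},'{term}')" for term in terms]
    total = freqs[0] if len(freqs) == 1 else "sum(" + ",".join(freqs) + ")"
    return f"div({total},{count_field})"

def build_params(terms, text_field="content", count_field="word_count_i"):
    # Field names are assumptions; adjust them to the actual schema.
    ratio = ratio_function(terms, text_field, count_field)
    return {
        "defType": "edismax",
        "q": " ".join(terms),
        "qf": text_field,
        "boost": ratio,                       # multiplied into the relevance score
        "fl": f"id,score,hit_ratio:{ratio}",  # also return the ratio for inspection
    }

params = build_params(["solr", "lucene"])
print(urlencode(params))
```

Two caveats with this sketch: termfreq matches indexed tokens, so each term
must be given in its analyzed form, and a document whose stored word count is
zero would make the division misbehave, so such documents need separate
handling.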

Thanks for your help!




Re: Modify solr score

Posted by Erik Hatcher <er...@gmail.com>.
This may be suggesting a solution that is too experimental, or using the wrong hammer for the job, but to me it sounds like you could use “payloads” for this type of ranking of a term's relationship to a document.

See SOLR-1485 for the recent work I’ve been doing (and aim to get committed soon).   You could index documents in this way:

   id, weighted_terms_dpf
   1, A|5.0 B|95.0
   2, A|88.7 B|0.1

And then search for “A” and use the 88.7 value to factor into the score or sorting.  
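
To show the idea without depending on the SOLR-1485 work, here is a plain
Python sketch that parses the term|weight format above and sorts documents by
the weight attached to a search term. The function name and the sorting step
are illustrative assumptions; this mimics the data layout only and is not the
Solr payload feature itself.

```python
def parse_payloads(value, delimiter="|"):
    """Turn 'A|5.0 B|95.0' into {'A': 5.0, 'B': 95.0} (hypothetical helper)."""
    weights = {}
    for token in value.split():
        term, _, weight = token.partition(delimiter)
        weights[term] = float(weight)
    return weights

# The two example documents from the message above.
docs = {"1": "A|5.0 B|95.0", "2": "A|88.7 B|0.1"}

def rank_by_term(term):
    # Documents that lack the term get weight 0.0 and sort last.
    return sorted(docs, key=lambda d: parse_payloads(docs[d]).get(term, 0.0),
                  reverse=True)

print(rank_by_term("A"))
print(rank_by_term("B"))
```

Searching for "A" puts document 2 first because of its 88.7 weight, while
searching for "B" puts document 1 first.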

	Erik





Re: Modify solr score

Posted by tstusr <ul...@gmail.com>.
Since we report the score, we think there will be some relation between them.
As far as we know, scoring (and then ranking) is calculated based on tf-idf.

What we want to do is to make a qualitative ranking; that is, for a given
topic we will tag documents as "very related", "fairly related" or "poorly
related". To do so, we selected some documents completely unrelated to a
topic.

On a very related document we found that ~2% of the words are hits, with a
score of ~0.85 (which we think is related to ranking). On a test document we
found a ratio of less than 0.01%, and the score is higher than for the first
one. What we expect is that unrelated documents (those with a lower ratio)
report lower scores, so that we can use them as the minimum and create the
scale.

We came up with the idea of multiplying (or otherwise adjusting) the default
rank Solr provides by the ratio for each document, so that unrelated
documents are penalized while those with higher ratio values are boosted.
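
A small worked sketch of that multiplication, using the rough figures from
this message. The exact values are assumptions: 0.90 stands in for "a score
higher than 0.85" and 0.0001 for "a ratio of less than 0.01%".

```python
def rescore(score, ratio):
    """Multiply the engine's score by the hit ratio (the idea described above)."""
    return score * ratio

# Illustrative values only; see the assumptions stated in the lead-in.
docs = [
    {"id": "related", "score": 0.85, "ratio": 0.02},
    {"id": "unrelated", "score": 0.90, "ratio": 0.0001},
]

by_score = sorted(docs, key=lambda d: d["score"], reverse=True)
by_rescore = sorted(docs, key=lambda d: rescore(d["score"], d["ratio"]),
                    reverse=True)

print([d["id"] for d in by_score])    # raw score puts the unrelated document first
print([d["id"] for d in by_rescore])  # the ratio-weighted score reverses the order
```

With these numbers the ratio dominates the product, so the ordering follows
the ratio almost regardless of the original score.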

Greetings, and thanks for your help.





Re: Modify solr score

Posted by "alessandro.benedetti" <a....@sease.io>.
It has been discussed countless times: never rely on score values.
Rely on the ranking of your results.
It seems you model a <topic> as a list of keywords and then you just run a
query for each topic.
Essentially, for you, a <topic> is a query.

The ranking of your results will already be affected by how many times (term
frequency) such keywords appear in the results.
You can even play with different query parsers (such as dismax/edismax) and
with the mm percentage to establish how strict you want your results to be
in relation to the input query [1].
Can you elaborate on how you would like to customize the score?
Which factor would you like to modify?
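
As a sketch of the mm suggestion, the following builds request parameters for
a topic treated as a keyword query. The keywords, the field name content and
the helper topic_query are assumptions; defType, q, qf and mm are standard
edismax parameters. The code only builds the query string.

```python
from urllib.parse import urlencode

def topic_query(keywords, field, mm):
    """Parameters for running a topic (a list of keywords) as an edismax query."""
    return {
        "defType": "edismax",
        "q": " ".join(keywords),
        "qf": field,  # hypothetical field name, adjust to the schema
        "mm": mm,     # minimum share of the keywords a document must match
    }

params = topic_query(["solr", "lucene", "index", "query"], "content", "75%")
print(urlencode(params))
```

With four keywords and mm=75%, a document has to match at least three of them
to be returned at all, which filters out documents that merely mention one
topic word in passing.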

Cheers

[1]
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser#TheDisMaxQueryParser-Themm(MinimumShouldMatch)Parameter



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io