Posted to solr-user@lucene.apache.org by Arnold Bronley <ar...@gmail.com> on 2018/02/07 22:26:54 UTC

Judging the MoreLikeThis results for relevancy

Hi,

I am using the MoreLikeThis handler to get related documents for a given
document. To determine whether I am getting good results, here is what I
do:

The original document itself should be returned as the top match.

If it is not, then there is some problem with the relevancy.

Then, since the input document is a 100% match with itself, we can use its
absolute score as a baseline: comparing the scores of the other documents
(ranked 2nd, 3rd and so on) to the score of the top result, which is the
input document itself, tells us how they are doing in terms of relevancy.

Is this a good idea?

Do you see any flaw in this logic?

Re: Judging the MoreLikeThis results for relevancy

Posted by Alessandro Benedetti <a....@sease.io>.

So let me answer point by point:

1) Similarity is misleading here if you interpret it as a probabilistic
measure.
Given a query, there is no such thing as the "ideal document". Both with
TF-IDF and BM25 (which handles this better) you are scoring the document:
the higher the score, the higher the relevance of that document for the
given query. BM25 does a better job here because its relevance function
hits a saturation point, so it is closer to your expectation. This blog
post from Doug should help [1].

2) "if document vector A is at a
distance of 5 and 10 units from document vectors B and C respectively then
can't we say that B is twice as relevant to A as C is to A? Or in terms of
distance, C is twice as distant from A as B is from A?"

Not in Lucene, at least not strictly.
The current MLT uses TF-IDF as its scoring formula.
When the score of B is double the score of C, you can say that, for Lucene,
B is twice as relevant to A as C is.
From a user perspective this can be different (quoting Doug: "If an
article mentions “dog” six times is it twice as relevant as an article
mentioning “dog” 3 times? Most users say no").
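
To make the "dog" example concrete, here is a minimal Python sketch of the
term-frequency factor alone. It assumes the square-root factor used by
Lucene's classic TF-IDF similarity and the textbook BM25 factor with the
common defaults k1=1.2 and b=0.75; the function names and the document
lengths are made up for illustration, and IDF and the other score
components are left out.

```python
import math


def classic_tf(freq):
    """Term-frequency factor of classic TF-IDF: grows as sqrt(freq), unbounded."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    """Textbook BM25 term-frequency factor; it approaches (k1 + 1) as freq grows."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + length_norm)


if __name__ == "__main__":
    # Compare how much each factor rewards additional occurrences of a term.
    for freq in (3, 6, 12, 24):
        print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq), 3))
```

Under these assumptions, going from 3 to 6 occurrences multiplies the
classic factor by about 1.41 and the BM25 factor by about 1.17, and in
neither case by 2. The BM25 factor also never exceeds k1 + 1, which is the
saturation mentioned in point 1.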

3) Under the hood, MLT builds a Lucene query and retrieves documents from
the index.
When building the MLT query, to keep it simple, it extracts from the seed
document a subset of terms that are considered representative of the seed
document (let's call them relevant terms).
This is managed through a parameter, but usually, and by default, you
collect a limited set of relevant terms (not all the terms).
When retrieving similar documents you score them using TF-IDF (and in the
future BM25).
So first of all, you can have documents with higher scores than the original
(it doesn't make sense in a probabilistic world, but this is how Lucene
works).
Reversing the roles, i.e. applying MLT to document B as the seed, you could
build a slightly different query.
So:
score(b) given seed(a) != score(a) given seed(b)

I understand you think it doesn't make sense, but this is how Lucene works.
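
The mechanism in point 3 can be sketched in a few lines of Python. This is a
toy model, not Lucene's actual code: the three documents, the whitespace
tokenizer, the IDF formula, the absence of length normalization and the
max_query_terms=2 cut-off are all assumptions chosen to keep the example
small.

```python
import math
from collections import Counter

# Hypothetical three-document "index".
DOCS = {
    "a": "solr solr solr lucene search relevance",
    "b": "solr lucene lucene lucene index scoring scoring",
    "c": "solr solr solr solr solr solr search search relevance relevance",
}


def idf(term, docs):
    """Smoothed inverse document frequency of a term over the toy index."""
    df = sum(1 for text in docs.values() if term in text.split())
    return 1.0 + math.log(len(docs) / (df + 1.0))


def top_terms(seed_id, docs, max_query_terms=2):
    """Pick the seed's most representative terms by tf * idf (ties: alphabetical)."""
    tf = Counter(docs[seed_id].split())
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t, docs), t))
    return ranked[:max_query_terms]


def score(doc_id, query_terms, docs):
    """Simplified TF-IDF score of one document against the query terms."""
    tf = Counter(docs[doc_id].split())
    return sum(math.sqrt(tf[t]) * idf(t, docs) ** 2 for t in query_terms)


def more_like_this(seed_id, docs, max_query_terms=2):
    """Build a query from the seed's top terms and score every document with it."""
    query = top_terms(seed_id, docs, max_query_terms)
    return {doc_id: score(doc_id, query, docs) for doc_id in docs}


if __name__ == "__main__":
    for seed in ("a", "b"):
        print(seed, top_terms(seed, DOCS), more_like_this(seed, DOCS))
```

With these toy documents, seed a yields the query terms solr and lucene,
while seed b yields lucene and scoring. Document b scores about 2.24 against
the query built from a, which is higher than a's own score of about 1.88,
and a scores 1.0 against the query built from b. The two directions use
different queries, so there is no reason for the scores to match.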

I do also understand that users often want a percentage out of an
MLT query.
I will certainly work in that direction, step by step; first I need to
have the MLT refactor approved and patched :)

[1]
https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Judging the MoreLikeThis results for relevancy

Posted by Arnold Bronley <ar...@gmail.com>.

Thanks for the reply, Alessandro.

Can you please elaborate on the point "a document which has a score 50% of
the original doc score, it doesn't
mean it is 50% similar"? I did not understand this, for two reasons:

1. In the end, we are calculating a similarity score between documents when
we solve the search problem, where the search query is also treated as
a small document. Similarity inherently means how similar one thing
is to another.

2. If we think about the vector representations of documents in
multidimensional space, we are basically calculating the "distance" between
these documents. We interpret that distance as "similarity". The farther
apart the document vectors are in that space, the less similar those
documents are to each other. How we calculate the distance is one thing
(e.g. cosine distance, Euclidean distance, etc.), but once we agree upon a
distance/similarity calculation method, if document vector A is at a
distance of 5 and 10 units from document vectors B and C respectively, then
can't we say that B is twice as relevant to A as C is to A? Or in terms of
distance, C is twice as distant from A as B is from A?

I found this response from jlman in the following thread very similar to my
solution.

http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=print_post&node=561671

He also warns about the scores between two documents not being
bidirectional.

If all else remains constant (relevancy algorithm, number of documents in
the index, etc.), why is the relevancy between two documents, calculated
with the approach that I mentioned, not bidirectional? That is, why is it
possible that document A is more similar to B than B is similar to A?
When I think in terms of multidimensional vector space, this does not make
sense at all, because the distance between A and B in multidimensional space
is not going to change provided all else remains constant (relevancy
algorithm, number of documents in the index, etc.). If A is at a distance of
5 units from B, then B is also at a distance of 5 units from A, isn't it?
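
For reference, the symmetric measure described here can be sketched as a
plain cosine similarity over full term-count vectors. The two example
strings and the whitespace tokenization are assumptions for illustration;
this is the vector-space picture, not what the MoreLikeThis handler
computes.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the full term-count vectors of two texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    doc_a = "solr solr solr lucene search relevance"
    doc_b = "solr lucene lucene lucene index scoring scoring"
    # Same value in both directions, and 1.0 for a document against itself.
    print(cosine(doc_a, doc_b), cosine(doc_b, doc_a), cosine(doc_a, doc_a))
```

A measure of this kind uses every term of both documents, so swapping the
arguments cannot change the result, and a document compared with itself
always gets the maximum value.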

Thanks,
Arnold

On Thu, Feb 8, 2018 at 7:02 AM, Alessandro Benedetti <a....@sease.io>
wrote:

> Hi,
> I have personally been working a lot with MoreLikeThis and I am close
> to contributing a refactor of that module (mostly to break up the
> monolithic giant facade class).
>
> First of all, the MoreLikeThis handler returns the original document
> (not scored) plus the similar documents (scored).
> The original document is not considered by the MoreLikeThis query, so it is
> not returned as part of the results of the MLT Lucene query; it is just
> added at the beginning of the response.
>
> If I remember correctly (I am unable to check at the moment), you should be
> able to get the original document in the result set (with max score)
> using the More Like This query parser.
> Please double-check that.
>
> Generally speaking, at the moment TF-IDF is used under the hood, which means
> that the score is sometimes not probabilistic.
> So a document which has a score 50% of the original doc score, it doesn't
> mean it is 50% similar, but for your use case it may be a feasible
> approximation.

Re: Judging the MoreLikeThis results for relevancy

Posted by Alessandro Benedetti <a....@sease.io>.

Hi,
I have personally been working a lot with MoreLikeThis and I am close to
contributing a refactor of that module (mostly to break up the monolithic
giant facade class).

First of all, the MoreLikeThis handler returns the original document
(not scored) plus the similar documents (scored).
The original document is not considered by the MoreLikeThis query, so it is
not returned as part of the results of the MLT Lucene query; it is just
added at the beginning of the response.

If I remember correctly (I am unable to check at the moment), you should be
able to get the original document in the result set (with max score)
using the More Like This query parser.
Please double-check that.
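
To try both routes side by side, the two requests could be built roughly as
in the Python sketch below. The base URL, collection name, document id and
field names are hypothetical, and the sketch assumes a MoreLikeThis handler
registered at /mlt in solrconfig.xml. It only builds the URLs; whether the
seed document shows up in the query parser's result set still needs to be
checked against a real index.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and collection to your setup.
SOLR_BASE = "http://localhost:8983/solr/mycollection"


def mlt_handler_url(doc_id, fields, rows=10):
    """URL for the MoreLikeThis handler (assumed to be registered at /mlt)."""
    params = {
        "q": "id:" + doc_id,          # query that selects the seed document
        "mlt.fl": ",".join(fields),   # fields used to extract the relevant terms
        "fl": "id,score",
        "rows": rows,
    }
    return SOLR_BASE + "/mlt?" + urlencode(params)


def mlt_qparser_url(doc_id, field, rows=10):
    """URL for the MLT query parser, run through the standard /select handler."""
    params = {
        "q": "{!mlt qf=" + field + "}" + doc_id,  # seed given by its unique key
        "fl": "id,score",
        "rows": rows,
    }
    return SOLR_BASE + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(mlt_handler_url("42", ["title", "body"]))
    print(mlt_qparser_url("42", "body"))
```

Requesting fl=id,score in both cases makes it easy to see whether the seed
id appears among the scored results and what its score is relative to the
others.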

Generally speaking, at the moment TF-IDF is used under the hood, which means
that the score is sometimes not probabilistic.
So a document which has a score 50% of the original doc score, it doesn't
mean it is 50% similar, but for your use case it may be a feasible
approximation.


