Posted to java-user@lucene.apache.org by Steven White <sw...@gmail.com> on 2021/04/18 21:26:21 UTC

How to explain Lucene's ranking algorithm to someone who is not technical?

Hi everyone,

If you were asked to explain how Lucene's ranking algorithm works to someone
who is not technical and doesn't understand math, how would you go about it?

I'm going to list what I see as key points to use but please correct me
where correction is needed and do add where addition is needed.  Here are
the talking points I can think of.

The search terms are: "to be or not to be, that is the question".  The
examples below are simple term searches (no booleans, no phrases, no fields,
etc.).

1) Documents that contain all or most of the search terms are ranked
highest.
hit #1: ... ... to be or not to be, that is the question ... ...
hit #2: ... ... to be, that is the question ... ...
hit #3: ... ... is the question  ... ...

2) Documents that contain all or most of the search terms more often than
other documents are ranked higher.
hit #1: ... ... to be or not to be, that is the question and is still the
question ... ...
hit #2: ... ... to be or not to be, that is the question ... ...
hit #3: ... ... to be, that is the question ... ...

3) Documents that contain the search terms closer to each other are ranked
higher.
hit #1: ... ... to be or not to be, that is the question ... ...
hit #2: ... ... to be or not to be, is what is being asked, that is the
question ... ...
hit #3: ... ... is the question  ... ...

4) Among documents that contain the exact search terms the same number of
times, the smaller document is ranked higher.
hit #1: to be or not to be, that is the question
hit #2: ... ... to be or not to be, that is the question ... ...

5) Documents that contain more of the complex / longer terms (such as
"question") are ranked higher than those containing more of the lighter
terms (such as "to" or "be").
hit #1: ... ... to be or not to be, that is the question and is still the
question to question  ... ...
hit #2: ... ... to be or not to be and to be or not to be, and to be or not
to be, that is the question ... ...

6) Documents that contain the search terms in the same order as the query
are ranked higher:
hit #1: ... ... to be or not to be, that is the question ... ...
hit #2: ... ... question the that is be not to be or be ... ...

I think I got all of the above right (I'm not sure about #6).
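
To make the talking points easier to check, here is a minimal sketch of a toy
scorer. It is not Lucene's code. It assumes a BM25-style formula with the
commonly quoted defaults k1 = 1.2 and b = 0.75, and the names ToyRanker,
tokenize and score are made up for this example. It only models how often a
term occurs, how rare the term is across the documents, and how long the
document is. It ignores how close the terms are and what order they are in,
so it says nothing about points 3 and 6.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ToyRanker {
    // Assumed BM25-style tuning values, chosen for illustration only.
    static final double K1 = 1.2;
    static final double B = 0.75;

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Score one document against the query; corpus must not be empty.
    static double score(String query, String doc, List<String> corpus) {
        List<String> docTokens = tokenize(doc);

        // Average document length, used to reward shorter documents.
        double totalLength = 0;
        for (String d : corpus) totalLength += tokenize(d).size();
        double avgLength = totalLength / corpus.size();

        // Simplification: each distinct query term is counted once.
        Set<String> queryTerms = new LinkedHashSet<>(tokenize(query));
        double score = 0;
        for (String term : queryTerms) {
            // How often the term occurs in this document.
            int tf = 0;
            for (String t : docTokens) if (t.equals(term)) tf++;
            if (tf == 0) continue;

            // How many documents in the corpus contain the term at all.
            int df = 0;
            for (String d : corpus) if (tokenize(d).contains(term)) df++;

            // Rare terms get a larger weight than common ones.
            double idf = Math.log(1 + (corpus.size() - df + 0.5) / (df + 0.5));
            // Longer documents get a larger penalty in the denominator.
            double lengthNorm = K1 * (1 - B + B * docTokens.size() / avgLength);
            score += idf * tf / (tf + lengthNorm);
        }
        return score;
    }

    public static void main(String[] args) {
        String query = "to be or not to be, that is the question";
        List<String> corpus = List.of(
                "to be or not to be, that is the question",
                "to be, that is the question",
                "is the question",
                "he said to be or not to be, that is the question, and then left");
        for (String doc : corpus) {
            System.out.printf("%.3f  %s%n", score(query, doc, corpus), doc);
        }
    }
}
```

With this sketch, shuffling the words of a document does not change its
score, which is part of why I am unsure whether #6 holds for a simple term
search.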

Thanks

Steven