
Posted to java-user@lucene.apache.org by danield <da...@gmail.com> on 2015/01/13 20:24:12 UTC

Similarity formula documentation is misleading + how to make field-agnostic queries?

Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is very dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

Term Frequency (TF) counts are expected to be per-document in the IR
literature, and this documentation doesn’t say any differently. However, it
turns out that for Lucene, TF scores are in fact PER-FIELD.

This furthermore applies to the /coord/ component. I realise that /coord/ is
a ratio of query terms matched over total query terms, but I believe an
effort could be made to make clear that field1:term1 and field2:term1 count
as 2 different query terms. 

As an example, for 2 documents with fields field1 and field2, where 
query1=”field1:term1”
query2=”field1:term1 or field2:term1”

document1={field1:”term1 term1”, field2:””}
document2={field2:”term1”, field2:”term1”}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/2 = 0.5
Coord(query2,document2)= 2/2 = 1

Now, the TF scores will be normalized with the fieldNorm component which is
computed based on field length at indexing time and stored in a single byte,
with a significant loss of precision. These things together make it
impossible to run Lucene retrieval in such a way that 

*similarity(query2,document1) == similarity(query2,document2)*

which is precisely what I need in my use case.
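
To make the precision loss concrete, here is a rough sketch in plain Java (no Lucene dependency). It assumes the classic length norm of 1/sqrt(number of terms in the field) and models the one-byte storage by simply truncating the float to its sign, exponent and top two mantissa bits. The class name FieldNormSketch and the quantize helper are made up for illustration; the real encoder also handles range clamping and boosts, so treat this as an approximation and not as Lucene's code.

```java
final class FieldNormSketch {
    /** Assumed length norm: 1 / sqrt(number of terms in the field). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /**
     * Hypothetical stand-in for the one-byte encoding: keep only the sign,
     * the exponent and the top two mantissa bits of the float.
     */
    static float quantize(float norm) {
        int bits = Float.floatToIntBits(norm);
        return Float.intBitsToFloat(bits & 0xFFE00000);
    }

    public static void main(String[] args) {
        // Print the exact norm next to the truncated one for short fields.
        for (int numTerms = 1; numTerms <= 10; numTerms++) {
            float exact = lengthNorm(numTerms);
            System.out.printf("terms=%d exact=%.4f stored=%.4f%n",
                    numTerms, exact, quantize(exact));
        }
    }
}
```

Under these assumptions a one-term field keeps a norm of 1.0, a two-term field is stored as 0.625 instead of roughly 0.7071, and three-term and four-term fields both end up at 0.5, so the two-term field1 of document1 and the one-term fields of document2 are normalized differently.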

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do
this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel



--
View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by danield <da...@gmail.com>.

Corrections: 

document2={field1:”term1”, field2:”term1”} 
Coord(query1,document2)= 1/1 = 1

(Doesn't affect the problem/observation)
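
With the corrected document2, the four coord values can be reproduced by a small plain-Java sketch that treats every (field, term) pair as its own query clause. Nothing here calls Lucene; CoordSketch, Clause and the whitespace tokenization are assumptions made only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class CoordSketch {
    /** A query clause in the per-field sense: field name plus term text. */
    static final class Clause {
        final String field;
        final String text;

        Clause(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    static final Map<String, String> DOC1 = Map.of("field1", "term1 term1", "field2", "");
    static final Map<String, String> DOC2 = Map.of("field1", "term1", "field2", "term1");

    static final List<Clause> QUERY1 = List.of(new Clause("field1", "term1"));
    static final List<Clause> QUERY2 =
            List.of(new Clause("field1", "term1"), new Clause("field2", "term1"));

    /** Fraction of query clauses whose term occurs in the clause's own field. */
    static double coord(List<Clause> query, Map<String, String> doc) {
        int matched = 0;
        for (Clause clause : query) {
            String value = doc.getOrDefault(clause.field, "");
            // Assumption: tokens are separated by whitespace only.
            if (Arrays.asList(value.split("\\s+")).contains(clause.text)) {
                matched++;
            }
        }
        return (double) matched / query.size();
    }

    public static void main(String[] args) {
        System.out.println(coord(QUERY1, DOC1)); // 1.0
        System.out.println(coord(QUERY2, DOC1)); // 0.5
        System.out.println(coord(QUERY1, DOC2)); // 1.0
        System.out.println(coord(QUERY2, DOC2)); // 1.0
    }
}
```

The only pair that drops below 1 is query2 against document1, because field2:term1 counts as a separate clause that document1 cannot match.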




Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by danield <da...@gmail.com>.

Update: I have implemented my own subclasses of QueryParser, BooleanQuery,
BooleanScorer and Similarity to deal with this.

I have been successful in getting the exact behaviour I want... when
calling the .explain() method. However, the scores for some documents often
differ when calling IndexSearcher.search() vs IndexSearcher.explain().

I am a bit confused by this. coord() seems to be one of the things I still
need to change for .search(), but it is not the only element of the formula
where my changes clearly take effect in the .explain() pipeline but not in
.search().

The implementation of BulkScorer remains perplexing to me and I suspect it
is something in there I have missed. Any pointers?

Thanks!
Daniel



Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by Jack Krupansky <ja...@gmail.com>.

File a Jira for this particular doc fix since it is significant and not
just mere wordsmithing. Better yet, submit a patch since that's Javadoc,
although the exact form of the doc fix might be debatable, so a general
description of the problem should be sufficient, unless you feel motivated.

-- Jack Krupansky


Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by danield <da...@gmail.com>.

Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that
explanation more prominent, as I clearly missed it.

Never mind, I am working on my own solution for this, through subclassing
QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other
classes.

Cheers,
Daniel





Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

In Lucene a "Term" encodes the field and the term text, so the 
documentation is not incorrect.  In fact this is stated explicitly here:

Lucene is field based, hence each query term applies to a single field, 
document length normalization is by the length of the certain field, and 
in addition to document boost there are also document fields boosts.

You might consider indexing your sentences as multiple values of a 
single field.  If you need to label them you could possibly use payloads 
for that.
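
As a rough illustration of why a single multi-valued field changes the term frequency, the following plain-Java sketch compares the two views. It does not use Lucene; TermFreqSketch and its helpers are hypothetical names, tokens are assumed to be whitespace-separated, and it relies on the assumption that all values of one field contribute to the same per-field count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class TermFreqSketch {
    /** Counts whitespace-separated occurrences of term in one field value. */
    static int count(String value, String term) {
        return (int) Arrays.stream(value.split("\\s+")).filter(term::equals).count();
    }

    /** Per-field view: each (field, term) pair has its own frequency. */
    static int perFieldTf(Map<String, String> doc, String field, String term) {
        return count(doc.getOrDefault(field, ""), term);
    }

    /** Single-field view: every sentence is a value of one field, so counts add up. */
    static int singleFieldTf(List<String> values, String term) {
        return values.stream().mapToInt(v -> count(v, term)).sum();
    }

    public static void main(String[] args) {
        Map<String, String> document2 = Map.of("field1", "term1", "field2", "term1");
        // Two separate fields: term1 occurs once in each.
        System.out.println(perFieldTf(document2, "field1", "term1")); // 1
        System.out.println(perFieldTf(document2, "field2", "term1")); // 1
        // Same sentences as two values of one field: term1 occurs twice.
        System.out.println(singleFieldTf(List.of("term1", "term1"), "term1")); // 2
    }
}
```

In the single-field view, a document holding "term1 term1" in one sentence and a document holding "term1" in two sentences both reach a frequency of 2.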

-Mike

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by danield <da...@gmail.com>.

Hi Mike,

Thank you for your reply. Yes, I had thought of this, but it does not solve
my problem: the Term Frequency, and therefore the results, will still be
wrong, because prepending or appending a string to the term still makes it a
different term.

Similarly, I could use regex queries, but again that doesn't fix the TF
issue. I am not speaking hypothetically here: I have experimental evidence
that this doesn't work (i.e. the precision for my task goes down in my
experiments).

Also, I agree that when your fields are essentially different, as in /title/,
/author/ and /text/, normalizing by field length makes sense. In my case,
however, my fields are many and are all chunks of a larger text (extracted
sentences that have been labelled with a number of different classes), and
the experiments I am running try to establish whether weighting sentences in
different classes differently will lead to increased relevance of results.

This also doesn't change the fact that the documentation is wrong! Any ideas
how to fix it?
Daniel


Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

In practice, normalization by field length proves to be more useful than 
normalization by the sum of the lengths of all fields (document length), 
which I think is what you seem to be after.  Think of a book chapter 
document with two fields: title and full text.  It makes little sense to 
weight the terms in the title differently for longer and shorter texts.

To get the behavior (I think) you want, you could index your documents 
like this:

document1={field:"field1:term1 field1:term1"}
document2={field:"field1:term1 field2:term1"}

and form queries like:

query1="field:field1\:term1"
query2="field:(field1\:term1 OR field2\:term1)"
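
The rewriting itself can be sketched in plain Java as below. The helper names (PrefixedTokenSketch, flatten, clause) are hypothetical and nothing here calls Lucene. It also assumes that the combined field is analyzed by whitespace only, since an analyzer that splits on punctuation would break tokens such as field1:term1 apart.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

final class PrefixedTokenSketch {
    /** Turns {field1:"a b", field2:"c"} into "field1:a field1:b field2:c". */
    static String flatten(Map<String, String> doc) {
        StringJoiner out = new StringJoiner(" ");
        // TreeMap gives a stable field order regardless of the input map type.
        for (Map.Entry<String, String> entry : new TreeMap<>(doc).entrySet()) {
            for (String token : entry.getValue().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(entry.getKey() + ":" + token);
                }
            }
        }
        return out.toString();
    }

    /** Builds one query clause against the combined field, escaping the inner colon. */
    static String clause(String field, String term) {
        return "field:" + field + "\\:" + term;
    }

    public static void main(String[] args) {
        System.out.println(flatten(Map.of("field1", "term1 term1", "field2", "")));
        System.out.println(flatten(Map.of("field1", "term1", "field2", "term1")));
        System.out.println(clause("field1", "term1"));
    }
}
```

Each original field value becomes a run of prefixed tokens inside the one combined field, so length normalization applies to the whole document text instead of to each original field.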

-Mike
