You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/01 21:24:19 UTC

[GitHub] [lucene] pminkov opened a new pull request, #940: Use similarity.tf() in MoreLikeThis

pminkov opened a new pull request, #940:
URL: https://github.com/apache/lucene/pull/940

   ### Description
   
   MoreLikeThis picks terms by their TF-IDF score. The TF part of the score was used by taking the term frequency directly, without applying a square root through ClassicSimilarity.tf(). The result is that how common a term is in an input can have too much weight on whether it's selected as a search term. An example of a negative effect is that this can make more stop words make their way into the final query.
   
   ### Tests
   
   Ran MoreLikeThis tests with:
   ```commandline
   ./gradlew -p lucene/queries test --tests TestMoreLikeThis
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Use similarity.tf() in MoreLikeThis [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1790429476

   It looks like this PR is a nice improvement to MLQ quality, and we agree we should just enable it by default (`Similarity` can turn it off if the old way is really needed), and the PR is quite close?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] pminkov commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

pminkov commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1144930676

   Thanks for taking a look @mocobeta - your comment makes sense. I have a dataset and have noticed the problem on it. I'll create a short analysis with examples.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146613684

   I haven't closely looked at the `SimpleQQParser` yet, but I imagine this measures something like the performance of pseudo relevance feedback?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146600229

   As for backward compatibility, my thought on that was slightly changed - let's commit this into `main` without adding extra configuration parameters if there are no objections, it's a major release anyway. I'll add a flag to opt-in for `branch_9x` when backporting (we would need to preserve the current behavior as default in 9.x).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Use similarity.tf() in MoreLikeThis [lucene]

Posted by "MarcusSorealheis (via GitHub)" <gi...@apache.org>.

MarcusSorealheis commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1791704616

   I can reawaken it and get it to closure. I need to carve out time on Sunday unless someone else picks it up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146612939

   Thanks @rmuir for your suggestions.
   I have a deadline myself this month so can't help in trying extensive experiments right now, but I'll try to be responsive here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146616934

   It does not do this by default, but can be easily modified to do so... I have done it before. It is one way to measure how well the mlt is working relatively easily, albeit indirectly. At least it is a reasonable use of mlt, if numbers become very bad, we might want to investigate further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1144848214

   I'm using MLT somewhere else, I'll try to apply this change there and see the result with real data if I have a chance. Meanwhile, other people who are more confident with this change can merge it I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] pminkov commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

pminkov commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1145379881

I created a branch with some analysis of what happens, it's [here](https://github.com/pminkov/lucene/commit/25c5ea4c12d92b8f534d40e449509a327ab6eea9). The code is a bit hacky, sorry.

**Dataset**

I used one of the MongoDB Atlas datasets - [mflix](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/). This dataset has a collection with ~20k movies and I dumped their plot descriptions into the plots.txt file (it's in the branch).

```commandline
$ cat ./plots.txt | wc -l
23531
```

A sample of the file is [here](https://gist.github.com/pminkov/c040e96835501bb2bfa34d029c5fa0d9).

**Test**

I sorted the documents by length and cleaned up punctuation, then I indexed the documents. The documents with lower document ids are biggest.

Next step is I picked 15 documents and created a MLT query from each one.

Here are the terms that are selected for each document: https://gist.github.com/pminkov/1432b04f794b97d1fc042ffc1ac0dce2

As you can see, when we don't have the fix, the code selects a lot more stopword like words and that is more visible when you have longer documents. That I believe happens since the stop words appear many times and if the frequency is not damped down with a square root (`similarity.tf()`), they tend to bubble up to the top of the priority queue. On shorter documents there's not much visible difference.

Let me know if I should elaborate more on any of this or look into something additional.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146608421

   i can try to do some tests this weekend (using query-expansion approach). i've just been busy with work and it is a bit time-consuming (need to find some external hard drives and so on)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Use similarity.tf() in MoreLikeThis [lucene]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1880904681

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1181301533

   Personally, I'd love to commit this to the upstream branch.
   I think we'd need a reproducible quality check (or regression test?) in Lucene as Robert suggested; I just haven't been able to take enough time to look at it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] pminkov commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

pminkov commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1150241668

   I'm uploading the selected words file:
   [mlt-selected-words.txt](https://github.com/apache/lucene/files/8863975/mlt-selected-words.txt)
   
   And the input file:
   [plots.txt](https://github.com/apache/lucene/files/8863978/plots.txt)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1144943399

   > I think the problem is that we have no test corpus to measure the MLT search quality, so we can't directly know if taking square roots of raw term frequency improves the search quality. I'm not against the change at all, just can't estimate the possible effects of this change.
   
   One way to do this is to expand queries of a test collection with MLT (blind feedback approach) and look at usual measurements. Easiest way to do that (if using built-in QueryDriver to run relevance tests) is to modify SimpleQQParser to incorporate MLT: https://github.com/apache/lucene/blob/main/lucene/benchmark/src/java/org/apache/lucene/benchmark/quality/utils/SimpleQQParser.java#L64


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1144837827

   Thanks for taking look at this. The change makes sense and looks consistent in the usage of TFIDFSimilarity to me.
   I think the problem is that we have no test corpus to measure the MLT search quality, so we can't directly know if taking square roots of raw term frequency improves the search quality. I'm not against the change at all, just can't estimate the possible effects of this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146609887

   > I'll add a flag to opt-in for `branch_9x` when backporting (we would need to preserve the current behavior as default in 9.x).
   
   why add flag? The similarity is already pluggable. So I think there is already plenty of configuration knobs. If someone is unhappy they can pass one that doesn't saturate tf?
   ```
   mlt.setSimilarity(new DefaultSimilarity() { 
     @Override
     public float tf(float freq) { return freq; }
   });
   ```
   
   But first, i'd like to try to measure any improvement. I'm just not able to do this during the week.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mocobeta commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

mocobeta commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1146545273

@pminkov thank you for the thorough analysis!

Looking at the result, with this fix too common words do not appear, as expected, and too rare words still not be selected - so the result will be more balanced I think.

> Here are the terms that are selected for each document: https://gist.github.com/pminkov/1432b04f794b97d1fc042ffc1ac0dce2

Could you attach the .txt file to this PR? It's a good reference for others and I think it'd be great to directly have it here.

I think the improvement in quality is clear now. On the other hand, MLT query is a very widely used feature, I've been thinking about having a bool parameter (say `useRawTermFreq`, maybe) to switch back to the old behavior so that we give users an option to stick to the old results if it's needed. We can set the default to `false` to use `TFIDFSimilarity.tf(freq)` for main (opt-out), and `true` for 9x (opt-in).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] MarcusSorealheis commented on pull request #940: Use similarity.tf() in MoreLikeThis

Posted by GitBox <gi...@apache.org>.

MarcusSorealheis commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1180816282

   Is there anything else needed here? Is there something we can add to improve the robustness of the quality check? Please advise us @rmuir  and @mocobeta 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Use similarity.tf() in MoreLikeThis [lucene]

Posted by "MarcusSorealheis (via GitHub)" <gi...@apache.org>.

MarcusSorealheis commented on PR #940:
URL: https://github.com/apache/lucene/pull/940#issuecomment-1795346408

   Is the idea for using `SimpleQQParser` that we would add another parse method, maybe `parseWithMoreLikeThis` that combines mlt and quality query parset to build a query or two separate queries that are then compared?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org