Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/11/04 19:15:27 UTC

[jira] [Commented] (SOLR-8057) Change default Sim to BM25 (w/backcompat config handling)

    [ https://issues.apache.org/jira/browse/SOLR-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990072#comment-14990072 ] 

Hoss Man commented on SOLR-8057:
--------------------------------


The more I work on this and think about it, the more I think my current approach of putting luceneMatchVersion conditional logic in DefaultSimFactory is the wrong way to go (independent of the bugs that I seem to have uncovered in making SimFactories SolrCoreAware - which I'll confirm & file separately) ...

I'm starting to think that a better long term solution would be to split this up into 3 discrete tasks/ideas...

{panel:title=Task #1 - Deprecate/rename DefaultSimilarityFactory in 5.x}
* clone DefaultSimilarityFactory -> ClassicSimilarityFactory
* prune DefaultSimilarityFactory down to a trivial subclass of ClassicSimilarityFactory
** make it log a warning on init
* change default behavior of IndexSchema to use ClassicSimilarityFactory directly
* mark DefaultSimilarityFactory as deprecated in 5.x, remove from trunk/6.0
{panel}

Task #1 would put us in a better position moving forward: the factory names would map directly to the underlying implementation, leaving less ambiguity when an explicit factory is specified in the schema.xml (either as the main similarity, or as a per-field similarity).
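
A rough sketch of the deprecate-by-subclassing pattern Task #1 describes. The class names carry a "Sketch" suffix because they are hypothetical stand-ins, not Solr's actual factories, and the init()/similarityName() methods are assumptions made for illustration only:

```java
import java.util.logging.Logger;

// Hypothetical stand-in for the cloned factory; NOT Solr's real class or API.
class ClassicSimilarityFactorySketch {
    private boolean initialized = false;

    public void init() {
        initialized = true;
    }

    public boolean isInitialized() {
        return initialized;
    }

    // Name of the similarity this factory would produce.
    public String similarityName() {
        return "ClassicSimilarity";
    }
}

/**
 * Trivial subclass kept only so configs that name the old factory keep working.
 *
 * @deprecated use ClassicSimilarityFactorySketch directly
 */
@Deprecated
class DefaultSimilarityFactorySketch extends ClassicSimilarityFactorySketch {
    private static final Logger LOG =
        Logger.getLogger(DefaultSimilarityFactorySketch.class.getName());

    @Override
    public void init() {
        // Log the deprecation warning, then behave exactly like the parent.
        LOG.warning("DefaultSimilarityFactorySketch is deprecated; "
            + "use ClassicSimilarityFactorySketch instead");
        super.init();
    }
}
```

The point of the design is that the old name changes no behavior: it only adds the warning, so it can be deleted later without touching the implementation.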

{panel:title="Task #2 - Make the wrapped per-field default in SchemaSimilarityFactory conditional on luceneMatchVersion"}
* use ClassicSimilarity as per-field default when luceneMatchVersion < 6.0
* use BM25Similarity as per-field default when luceneMatchVersion >= 6.0
{panel}

Task #2 would give us better defaults (via BM25) for people using SchemaSimilarityFactory moving forward, while existing users would have no back compat change.
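
A minimal sketch of the Task #2 idea: field types with an explicit similarity keep it, and everything else falls back to a default chosen from luceneMatchVersion (ClassicSimilarity below 6.0, BM25Similarity from 6.0 on, so existing configs see no change). PerFieldDefaultSketch and its methods are hypothetical, similarities are represented by name only, and only the major version number is modelled:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-field wrapper with a version-dependent fallback.
final class PerFieldDefaultSketch {
    private final Map<String, String> explicitByFieldType = new HashMap<>();
    private final String fallback;

    // luceneMatchMajor is an assumption: just the major part of luceneMatchVersion.
    PerFieldDefaultSketch(int luceneMatchMajor) {
        this.fallback = luceneMatchMajor < 6 ? "ClassicSimilarity" : "BM25Similarity";
    }

    // Record a similarity declared explicitly on a field type.
    void declare(String fieldType, String similarityName) {
        explicitByFieldType.put(fieldType, similarityName);
    }

    // Explicit declaration wins; otherwise use the version-dependent fallback.
    String similarityFor(String fieldType) {
        return explicitByFieldType.getOrDefault(fieldType, fallback);
    }

    public static void main(String[] args) {
        PerFieldDefaultSketch existing = new PerFieldDefaultSketch(5);
        PerFieldDefaultSketch fresh = new PerFieldDefaultSketch(6);
        System.out.println(existing.similarityFor("text_general"));
        System.out.println(fresh.similarityFor("text_general"));
    }
}
```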

{panel:title=Task #3 - Change the implicit default Similarity on trunk}
* make the Similarity init logic in IndexSchema conditional on luceneMatchVersion
* use ClassicSimilarityFactory as default when luceneMatchVersion < 6.0
* *use SchemaSimilarityFactory as default when luceneMatchVersion >= 6.0*
** combined with Task #2, this would mean the wrapped per-field default would be BM25
{panel}

Task #3 is where things start to get noticeably different from the goals I outlined when I originally filed this jira...

As far as I can tell, the chief reason SchemaSimilarityFactory wasn't made the implicit default in IndexSchema when it was introduced is because of how it differed/differs from DefaultSimilarity/ClassicSimilarity with respect to multi-clause queries -- see SchemaSimilarityFactory's class javadoc notes relating to {{queryNorm}} and {{coord}}.  Users were expected to think about this trade-off when making a conscious choice to switch from DefaultSimilarity/ClassicSimilarity to SchemaSimilarityFactory.  But (again, AFAICT) these discrepancies don't exist between SchemaSimilarityFactory's PerFieldSimilarityWrapper and BM25Similarity.   So if we want to make BM25Similarity the default when luceneMatchVersion >= 6.0, there doesn't seem to be any downside to _actually_ making SchemaSimilarityFactory (wrapping BM25Similarity) the default instead.
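
To make the combined effect of Task #2 and Task #3 concrete, here is a small sketch of the selection when no similarity is configured at all. ImplicitDefaultSketch is a hypothetical name, factories and similarities are represented as strings, and only the major version of luceneMatchVersion is modelled; none of this is Solr's actual IndexSchema code:

```java
// Hypothetical model of the implicit-default selection; not Solr's IndexSchema.
final class ImplicitDefaultSketch {

    // Task #3: which factory is assumed when the schema names none.
    static String defaultFactory(int luceneMatchMajor) {
        return luceneMatchMajor >= 6
            ? "SchemaSimilarityFactory"
            : "ClassicSimilarityFactory";
    }

    // The similarity a field with no explicit declaration would end up with.
    static String effectiveDefaultSimilarity(int luceneMatchMajor) {
        if (defaultFactory(luceneMatchMajor).equals("SchemaSimilarityFactory")) {
            // Task #2: the wrapped per-field default also depends on the version.
            return luceneMatchMajor >= 6 ? "BM25Similarity" : "ClassicSimilarity";
        }
        return "ClassicSimilarity";
    }

    public static void main(String[] args) {
        for (int major : new int[] {5, 6}) {
            System.out.println(major + " -> " + defaultFactory(major)
                + " / " + effectiveDefaultSimilarity(major));
        }
    }
}
```

Either way a 6.0+ config with no explicit similarity ends up scoring with BM25; the difference under Task #3 is that it gets there through the per-field wrapper rather than through BM25SimilarityFactory directly.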

----

Task #1 seems like a no-brainer to me, and likewise Task #2 seems like a sensible change balancing new user experience vs backcompat -- so I'm going to go ahead and move forward with individual sub-tasks to tackle those (in that order).

If there are no concerns/objections to Task #3 by the time I get to that point, and if I haven't changed my mind that it's a good idea, I'll move forward with that as well -- the alternative is to stick with the original plan and make BM25SimilarityFactory (directly) the default when luceneMatchVersion >= 6.0.


> Change default Sim to BM25 (w/backcompat config handling)
> ---------------------------------------------------------
>
>                 Key: SOLR-8057
>                 URL: https://issues.apache.org/jira/browse/SOLR-8057
>             Project: Solr
>          Issue Type: Task
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Blocker
>             Fix For: Trunk
>
>         Attachments: SOLR-8057.patch, SOLR-8057.patch
>
>
> LUCENE-6789 changed the default similarity for IndexSearcher to BM25 and renamed "DefaultSimilarity" to "ClassicSimilarity"
> Solr needs to be updated accordingly:
> * a "ClassicSimilarityFactory" should exist w/expected behavior/javadocs
> * default behavior (in 6.0) when no similarity is specified in configs should (ultimately) use BM25 depending on luceneMatchVersion
> ** either by assuming BM25SimilarityFactory or by changing the internal behavior of DefaultSimilarityFactory
> * comments in sample configs need updated to reflect new default behavior
> * ref guide needs updated anywhere it mentions/implies that a particular similarity is used (or implies TF-IDF is used by default)



