Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/01/25 18:24:49 UTC
[jira] Updated: (LUCENE-2392) Enable flexible scoring

     [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2392:
--------------------------------

    Attachment: LUCENE-2392_take2.patch

Here's a really, really rough "take 2" at the problem.

The general idea is to take a smaller "baby step", as Mike calls it, toward the problem.
Really, we have been working our way towards this anyway: exposing additional statistics,
making Similarity per-field, fixing up inconsistencies... and this is the way I prefer, as we
get things actually committed and moving.

So whatever is in this patch (which is full of nocommits, but all tests pass and all queries work with it)
we could then split up into other issues and keep proceeding slowly, or maybe
create a branch, whatever.

My problem with the other patch is it requires a ton more work to make any progress on it...
and things don't even compile with it, forget about tests.

The basics here are to:
# Split the "matching" and "scoring calculations" of Scorer. All responsibility for calculations belongs
in the Similarity; the Scorer should be matching positions, working docsEnums, etc.
# Similarity as we know it now gets a more low-level API, and TFIDFSimilarity implements this API
but exposes its customizations via the tf(), idf(), etc. we know now.
# Things like score-caching and specialization of calculations are the responsibility of the Similarity,
as these depend upon the formula being used. For TFIDFSimilarity, I added some optimizations here;
for example, it specializes its norms == null case away to remove the per-doc "if".
# Since all Weights create PerReaderTermState (<-- this one needs a new name) to separate the
seeking/stats collection from the calculations, I also optimized PhraseQuery's Weight/Scorer construction
to be single-pass.
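
To make the split above concrete, here is a minimal sketch of the idea. It is not the code in the patch: every name in it (ScoringSplitSketch, DocScorer, docScorer(), scoreAll()) is hypothetical, and the tf()/idf() formulas are just illustrative choices. It only shows the shape: the matching loop never does any math itself, and the norms == null check happens once, when the scorer is built, instead of once per doc.

```java
import java.util.Arrays;

public class ScoringSplitSketch {

    /** Hypothetical per-document hook; all formula work happens behind it. */
    interface DocScorer {
        float score(int doc, int freq);
    }

    /** Hypothetical low-level Similarity API: builds one DocScorer per term. */
    static abstract class Similarity {
        abstract DocScorer docScorer(int docFreq, int maxDoc, float[] norms);
    }

    /** Implements the low-level API but still exposes tf()/idf() for customization. */
    static class TFIDFSimilarity extends Similarity {
        // Illustrative formulas only; subclasses could override them.
        float tf(int freq) {
            return (float) Math.sqrt(freq);
        }

        float idf(int docFreq, int maxDoc) {
            return (float) (Math.log(maxDoc / (double) (docFreq + 1)) + 1.0);
        }

        @Override
        DocScorer docScorer(int docFreq, int maxDoc, float[] norms) {
            final float weight = idf(docFreq, maxDoc);
            if (norms == null) {
                // Specialized case: decided once here, so there is no per-doc "if".
                return (doc, freq) -> tf(freq) * weight;
            }
            return (doc, freq) -> tf(freq) * weight * norms[doc];
        }
    }

    /** The "matching" side: walks the postings and delegates every calculation. */
    static float[] scoreAll(int[] docs, int[] freqs, DocScorer scorer) {
        float[] scores = new float[docs.length];
        for (int i = 0; i < docs.length; i++) {
            scores[i] = scorer.score(docs[i], freqs[i]);
        }
        return scores;
    }

    public static void main(String[] args) {
        Similarity sim = new TFIDFSimilarity();
        int[] docs = {0, 2};   // matching doc ids (made-up data)
        int[] freqs = {4, 1};  // term frequency in each matching doc
        float[] norms = {0.5f, 1.0f, 0.25f};

        // docFreq=1, maxDoc=2 gives idf = log(2/2) + 1 = 1.0
        System.out.println(Arrays.toString(scoreAll(docs, freqs, sim.docScorer(1, 2, null))));  // [2.0, 1.0]
        System.out.println(Arrays.toString(scoreAll(docs, freqs, sim.docScorer(1, 2, norms)))); // [1.0, 0.25]
    }
}
```

The point of returning a different DocScorer for each case is that the choice depends on the formula, so it lives in the Similarity rather than in the matching code.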

Also, I like to benchmark every step of the way so we don't end up with
a design that isn't performant. Here are the results for Lucene's default Sim with the patch:
||Query||QPS trunk||QPS patch||Pct diff||
|spanNear([unit, state], 10, true)|3.04|2.92|{color:red}-4.0%{color}|
|doctitle:.*[Uu]nited.*|4.00|3.99|{color:red}-0.1%{color}|
|+unit +state|8.11|8.12|{color:green}0.2%{color}|
|united~2.0|4.36|4.40|{color:green}1.0%{color}|
|united~1.0|18.70|18.93|{color:green}1.2%{color}|
|unit~2.0|8.54|8.71|{color:green}2.1%{color}|
|spanFirst(unit, 5)|11.35|11.59|{color:green}2.2%{color}|
|unit~1.0|8.69|8.91|{color:green}2.6%{color}|
|unit state|7.03|7.23|{color:green}2.8%{color}|
|"unit state"~3|3.74|3.86|{color:green}3.2%{color}|
|u*d|16.72|17.30|{color:green}3.5%{color}|
|state|19.24|20.04|{color:green}4.1%{color}|
|un*d|49.42|51.55|{color:green}4.3%{color}|
|"unit state"|5.99|6.31|{color:green}5.3%{color}|
|+nebraska +state|140.74|151.85|{color:green}7.9%{color}|
|uni*|10.66|11.55|{color:green}8.4%{color}|
|unit*|18.77|20.41|{color:green}8.7%{color}|
|doctimesecnum:[10000 TO 60000]|6.97|7.70|{color:green}10.4%{color}|

All Lucene/Solr tests pass, but there are lots of nocommits, especially:
# No Javadocs.
# Explains need to be fixed: in general the explanation of "matching" belongs where it is now,
but the explanation of "score calculations" belongs in the Similarity.
# Need to refactor more out of Weight. Currently we pass it to the docscorer, but
it's the wrong object, as it can only "hold" a single float.

Anyway, it's gonna take some time to rough all this out, I'm sure, but I wanted
to show some progress/invite ideas, and also show we can do this stuff
without losing performance.

I have separate patches that need to be integrated and relevance tested, e.g.
for average doc length... maybe I'll do that next so we can get some concrete
alternate sims in here before going any further.



> Enable flexible scoring
> -----------------------
>
>                 Key: LUCENE-2392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2392
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2392.patch, LUCENE-2392.patch, LUCENE-2392_take2.patch
>
>
> This is a first step (nowhere near committable!), implementing the
> design iterated to in the recent "Baby steps towards making Lucene's
> scoring more flexible" java-dev thread.
> The idea is (if you turn it on for your Field; it's off by default) to
> store full stats in the index, into a new _X.sts file, per doc (X
> field) in the index.
> And then have FieldSimilarityProvider impls that compute doc's boost
> bytes (norms) from these stats.
> The patch is able to index the stats, merge them when segments are
> merged, and provides an iterator-only API.  It also has starting point
> for per-field Sims that use the stats iterator API to compute boost
> bytes.  But it's not at all tied into actual searching!  There's still
> tons left to do, eg, how does one configure via Field/FieldType which
> stats one wants indexed.
> All tests pass, and I added one new TestStats unit test.
> The stats I record now are:
>   - field's boost
>   - field's unique term count (a b c a a b --> 3)
>   - field's total term count (a b c a a b --> 6)
>   - total term count per-term (sum of total term count for all docs
>     that have this term)
> Still need at least the total term count for each field.
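
As a small illustration of the two per-field counts in the issue description above, here is a sketch that reproduces the "a b c a a b" example. The class and method names are hypothetical and this says nothing about how the patch actually collects or stores the stats; it only shows what the two numbers mean.

```java
import java.util.Arrays;
import java.util.HashSet;

public class FieldStatsSketch {

    /** Number of distinct terms in the field (a b c a a b --> 3). */
    static int uniqueTermCount(String[] tokens) {
        return new HashSet<>(Arrays.asList(tokens)).size();
    }

    /** Number of term occurrences in the field (a b c a a b --> 6). */
    static int totalTermCount(String[] tokens) {
        return tokens.length;
    }

    public static void main(String[] args) {
        // Whitespace split stands in for a real analyzer here.
        String[] tokens = "a b c a a b".split(" ");
        System.out.println(uniqueTermCount(tokens)); // 3
        System.out.println(totalTermCount(tokens));  // 6
    }
}
```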
