Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/04/11 19:30:41 UTC

[jira] Created: (LUCENE-2392) Enable flexible scoring

Enable flexible scoring
-----------------------

                 Key: LUCENE-2392
                 URL: https://issues.apache.org/jira/browse/LUCENE-2392
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 3.1
         Attachments: LUCENE-2392.patch

This is a first step (nowhere near committable!), implementing the
design iterated to in the recent "Baby steps towards making Lucene's
scoring more flexible" java-dev thread.

The idea is (if you turn it on for your Field; it's off by default) to
store full stats in the index, into a new _X.sts file, per doc (X
field) in the index.

And then have FieldSimilarityProvider impls that compute doc's boost
bytes (norms) from these stats.

The patch is able to index the stats, merge them when segments are
merged, and provides an iterator-only API.  It also has a starting point
for per-field Sims that use the stats iterator API to compute boost
bytes.  But it's not at all tied into actual searching!  There's still
tons left to do, eg, how does one configure via Field/FieldType which
stats one wants indexed.

All tests pass, and I added one new TestStats unit test.

The stats I record now are:

  - field's boost

  - field's unique term count (a b c a a b --> 3)

  - field's total term count (a b c a a b --> 6)

  - total term count per-term (sum of total term count for all docs
    that have this term)

Still need at least the total term count for each field.
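
As a rough illustration of the counts listed above, here is a minimal
sketch in plain Java. It does not use the patch's code or any Lucene API;
the class and method names (FieldStatsSketch, uniqueTermCount,
totalTermCount, totalTermCountForTerm) are made up for this example, and a
field is assumed to be an already-tokenized list of strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the patch: computes the per-doc,
// per-field counts described above from an already-tokenized field.
final class FieldStatsSketch {

    // Number of distinct terms in the field (a b c a a b --> 3).
    static int uniqueTermCount(List<String> tokens) {
        Set<String> seen = new HashSet<>(tokens);
        return seen.size();
    }

    // Number of term occurrences in the field (a b c a a b --> 6).
    static int totalTermCount(List<String> tokens) {
        return tokens.size();
    }

    // Per-term aggregate: sum of the field's total term count over every
    // document that contains the term at least once.
    static long totalTermCountForTerm(String term, List<List<String>> docs) {
        long sum = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                sum += totalTermCount(doc);
            }
        }
        return sum;
    }
}
```

For the field "a b c a a b" this gives a unique term count of 3 and a
total term count of 6, matching the examples in the list.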


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by Shai Erera <se...@gmail.com>.

Robert, I'm not sure where I proposed shoving random statistics into the
index. Lucene calculates a doc length today, and some in academia/research
here disagree with how it's done. So instead of attempting to fix it for
everyone, I think it'd be great if one could define the doc length as one
perceives it. Why is that problematic?

What Mike opened is an issue titled "Enable flexible scoring" ... doesn't
what I'm asking for fall under that umbrella?

Also, maybe we should have that discussion on the issue?

Shai

On Mon, Apr 12, 2010 at 11:31 AM, Robert Muir <rc...@gmail.com> wrote:

> I disagree. I think what Mike has defined here is way beyond a baby-step:
> it's all the stats needed to support modern IR models in Lucene: BM25,
> additional vector space algorithms, divergence from randomness, and language
> modelling.
>
> I think the ability to calculate your own random statistics and shove them
> into the index (this would be messy: for example, how would you get access
> to the aggregates you need anyway?) is something different entirely, best
> left to research systems.
>
> You can't even do that with Terrier now.
>
> On Mon, Apr 12, 2010 at 3:35 AM, Shai Erera (JIRA) <ji...@apache.org> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875]
>>
>> Shai Erera commented on LUCENE-2392:
>> ------------------------------------
>>
>> Mike - it'll also be great if we can store the length of the document in a
>> custom way. I think what I'm saying is that if we can open up the norms
>> computation to custom code - that will do what I want, right? Maybe we can
>> have a class like DocLengthProvider which apps can plug in if they want to
>> customize how that length is computed. Wherever we write the norms, we'll
>> call that impl, which by default will do what Lucene does today?
>> I think though that it's not a field-level setting, but an IW one?
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: [jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by Robert Muir <rc...@gmail.com>.

I disagree. I think what Mike has defined here is way beyond a baby-step:
it's all the stats needed to support modern IR models in Lucene: BM25,
additional vector space algorithms, divergence from randomness, and language
modelling.

I think the ability to calculate your own random statistics and shove them
into the index (this would be messy: for example, how would you get access
to the aggregates you need anyway?) is something different entirely, best
left to research systems.

You can't even do that with Terrier now.

On Mon, Apr 12, 2010 at 3:35 AM, Shai Erera (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875]
>
> Shai Erera commented on LUCENE-2392:
> ------------------------------------
>
> Mike - it'll also be great if we can store the length of the document in a
> custom way. I think what I'm saying is that if we can open up the norms
> computation to custom code - that will do what I want, right? Maybe we can
> have a class like DocLengthProvider which apps can plug in if they want to
> customize how that length is computed. Wherever we write the norms, we'll
> call that impl, which by default will do what Lucene does today?
> I think though that it's not a field-level setting, but an IW one?


-- 
Robert Muir
rcmuir@gmail.com

[jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855906#action_12855906 ] 

Michael McCandless commented on LUCENE-2392:
--------------------------------------------

bq. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? 

I'm calling the norms "boost bytes" :)  This was Marvin's term... I
like it.

This patch makes boost byte computation completely private to the
sim (see the *FieldSimProvider).  Ie the sim providers walk the stats
and do whatever they want to "prepare" for real searching.  EG if you
have the RAM maybe you want to use a true float[] not boost bytes.  Or
if you really don't have the RAM maybe you use only 4 bits per-doc,
not 8.  The FieldSim just provides a "float boost(int docID)" so what
it does under the hood is private.
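
To make the "private under the hood" point concrete, here is a minimal
sketch in plain Java. Only the "float boost(int docID)" method comes from
the comment above; the interface name FieldSim is taken from the text, but
the two implementing classes (FloatArrayFieldSim, NibbleFieldSim) and the
linear 4-bit quantization are assumptions made up for this example, not
what the patch does.

```java
// Hypothetical sketch: everything except boost(int docID) is an assumption.
interface FieldSim {
    float boost(int docID);
}

// Plenty of RAM: keep one full-precision float per document.
final class FloatArrayFieldSim implements FieldSim {
    private final float[] boosts;

    FloatArrayFieldSim(float[] boosts) {
        this.boosts = boosts.clone();
    }

    public float boost(int docID) {
        return boosts[docID];
    }
}

// Tight on RAM: quantize each boost to 4 bits, two documents per byte.
final class NibbleFieldSim implements FieldSim {
    private final byte[] packed;
    private final float min;
    private final float step;

    NibbleFieldSim(float[] boosts) {
        float lo = Float.POSITIVE_INFINITY;
        float hi = Float.NEGATIVE_INFINITY;
        for (float b : boosts) {
            lo = Math.min(lo, b);
            hi = Math.max(hi, b);
        }
        this.min = boosts.length == 0 ? 0f : lo;
        // 16 evenly spaced levels between the smallest and largest boost.
        this.step = (boosts.length == 0 || hi == lo) ? 1f : (hi - lo) / 15f;
        this.packed = new byte[(boosts.length + 1) / 2];
        for (int doc = 0; doc < boosts.length; doc++) {
            int level = Math.round((boosts[doc] - min) / step);
            level = Math.max(0, Math.min(15, level));
            int shift = (doc & 1) * 4; // even docs: low nibble, odd: high
            packed[doc >> 1] |= (byte) (level << shift);
        }
    }

    public float boost(int docID) {
        int shift = (docID & 1) * 4;
        int level = (packed[docID >> 1] >> shift) & 0xF;
        return min + level * step;
    }
}
```

Callers only ever see boost(docID), so swapping one representation for the
other trades memory for precision without touching the scoring code.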

bq. Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed.

So... I'm actually trying to avoid extensibility on the first go, here
(this is the "baby steps" part of the original thread).

Ie, the IR world seems to have converged on a smallish set of "stats"
that are commonly required, so I'd like to make those initial stats
work well, for starters.  Commit that (it enables all sorts of state
of the art scoring models), and perhaps cutover to the default Robert
created in LUCENE-2187 (which needs stats to work correctly).  And
then (phase 2) work out plugability so you can put your own stats
in....

bq. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today?

Right, this is the DefaultSimProvider in my current patch -- it simply
computes the same thing Lucene does today, but uses the stats at IR
(IndexReader) open time (once it's hooked up) to do so, instead of
doing it during indexing.

bq. I think though that it's not a field-level setting, but an IW one?

It's field level now and I think we should keep it that way.  EG
Terrier was apparently document oriented in the past but has now
deprecated that and moved to per-field.

You can always make a catch-all field if you "really" want aggregated
stats across the entire doc?


[jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855934#action_12855934 ] 

Robert Muir commented on LUCENE-2392:
-------------------------------------

{quote}
Really, maybe somehow we should be using an attr about the token
itself? Instead of posIncr == 0? I mean a broken synonym injector
could conceivably provide the synonyms first (w/ first one having
posIncr 1), followed by the real term (w/ posIncr 0)?
{quote}

Right, it's been my opinion all along (others disagree with me) that since
we have this 'ordered (incrementToken) AttributeSource' / Token*Stream*,
this sort of broken filter isn't a valid equivalent... it's definitely a
different TokenStream, even if it's treated in some ways today as the
same... we gotta break away from this for reasons like this.

Otherwise why have it ordered at all?


[jira] Updated: (LUCENE-2392) Enable flexible scoring

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2392:
---------------------------------------

    Attachment: LUCENE-2392.patch

Rough first patch attached....

[jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855802#action_12855802 ] 

Robert Muir commented on LUCENE-2392:
-------------------------------------

Mike, I don't think overlapTermCount should really exist in the Stats.
Trying to implement some concrete FieldSimProviders, it starts getting messy.
When using term unique pivoted length norm, I don't want to count these positionIncrement=0 terms either...
so do we need a uniqueOverlapTermCount?
Even when using non-unique (BM25) pivoted length norm, we don't want to count these when summing to calculate
averages either.

So I know you probably see this as 'baking something into the index', but I think positionIncrement=0 means
"doesn't contribute to the document length" by definition, and discountOverlaps=false (no longer the default)
should be considered deprecated compatibility behavior :)
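
A minimal sketch of what "doesn't contribute to the document length" could
look like, in plain Java rather than against the patch or Lucene's
TokenStream API. The Tok class and the length/averageLength helpers are
hypothetical names invented for this example; the only rule taken from the
text is that tokens with positionIncrement=0 are skipped, both for a single
field's length and when summing to compute an average.

```java
import java.util.List;

// Hypothetical sketch, not Lucene code.
final class OverlapLengthSketch {

    // Minimal stand-in for a token: its text plus its position increment.
    static final class Tok {
        final String term;
        final int posIncr;

        Tok(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    // Field length where tokens with positionIncrement == 0 are not counted.
    static int length(List<Tok> field) {
        int n = 0;
        for (Tok t : field) {
            if (t.posIncr != 0) {
                n++;
            }
        }
        return n;
    }

    // Average field length over a set of documents, using the same rule.
    static double averageLength(List<List<Tok>> docs) {
        if (docs.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (List<Tok> doc : docs) {
            sum += length(doc);
        }
        return (double) sum / docs.size();
    }
}
```

With a synonym injected at the same position, e.g. fast(1) quick(0) car(1),
the field length under this rule is 2 rather than 3.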

[jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855905#action_12855905 ] 

Michael McCandless commented on LUCENE-2392:
--------------------------------------------

bq. Mike, I don't think overlapTermCount should really exist in the Stats.

OK I will remove it -- I was unsure whether it was overkill :)  So
it's purely an index time decision, whether the posIncr 0 tokens
"count".

Hmm, but... we have a problem, which is that these posIncr 0 tokens
are now counted in the unique token count.  Have to mull how to avoid
that...hmm... to do it correctly, I think means "count this token as
+1 on the unique tokens for this doc if ever it has non-zero posIncr"?
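
One way to read that rule, sketched in plain Java. This is an assumption
about the intended semantics, not the patch's implementation; Tok and
uniqueTermCount are hypothetical names, and a field is modeled as a list of
(term, posIncr) pairs rather than a real TokenStream.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Lucene code.
final class UniqueCountSketch {

    // Minimal stand-in for a token: its text plus its position increment.
    static final class Tok {
        final String term;
        final int posIncr;

        Tok(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    // A term adds +1 to the unique count if at least one of its
    // occurrences in this field has a non-zero position increment.
    static int uniqueTermCount(List<Tok> field) {
        Set<String> counted = new HashSet<>();
        for (Tok t : field) {
            if (t.posIncr != 0) {
                counted.add(t.term);
            }
        }
        return counted.size();
    }
}
```

For fast(1) quick(0) car(1) this counts 2 unique terms; if "quick" later
appears with posIncr 1 in the same field, it is counted and the result is 3.
A stream that emits the synonym first with posIncr 1 and the real term with
posIncr 0 would still count 1 here, but for the synonym rather than the
real term.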

Really, maybe somehow we should be using an attr about the token
itself?  Instead of posIncr == 0?  I mean a broken synonym injector
could conceivably provide the synonyms first (w/ first one having
posIncr 1), followed by the real term (w/ posIncr 0)?

BTW the cost of storing the stats isn't that bad -- it increases index
size by 1.5%, on a 10M wikipedia index where each doc is 1KB of text
(~171 words per doc on avg).


[jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875 ] 

Shai Erera commented on LUCENE-2392:
------------------------------------

Mike - it'll also be great if we can store the length of the document in a custom way. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today?
I think though that it's not a field-level setting, but an IW one?

[jira] Commented: (LUCENE-2392) Enable flexible scoring

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855913#action_12855913 ] 

Shai Erera commented on LUCENE-2392:
------------------------------------

I'd like to withdraw my request from above. I misunderstood that the stats I need are stored per-field per-doc. So that will allow me to compute the docLength as I want.
