Posted to dev@lucene.apache.org by "David Mark Nemeskey (JIRA)" <ji...@apache.org> on 2011/08/02 10:32:27 UTC

[jira] [Created] (LUCENE-3357) Unit and integration test cases for the new Similarities

Unit and integration test cases for the new Similarities
--------------------------------------------------------

                 Key: LUCENE-3357
                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
             Project: Lucene - Java
          Issue Type: Sub-task
          Components: core/query/scoring
    Affects Versions: flexscoring branch
            Reporter: David Mark Nemeskey
            Assignee: David Mark Nemeskey
            Priority: Minor
             Fix For: flexscoring branch


Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
 * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
 * integration tests, in which a small collection is indexed and then searched using the Similarities.
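
A minimal sketch of the unit-test idea described above: feed fixed mock statistics to a scoring function and compare the result with a value worked out by hand. Everything here is hypothetical and for illustration only. MockStats and toyScore are not Lucene classes, and the formula is a toy, not one of the Similarities from LUCENE-3220.

```java
// Hypothetical stand-ins for illustration; none of these are Lucene classes.
final class MockStatsExample {

    /** Mock collection statistics handed to the scorer under test. */
    static final class MockStats {
        final long numberOfDocuments;
        final long docFreq;

        MockStats(long numberOfDocuments, long docFreq) {
            this.numberOfDocuments = numberOfDocuments;
            this.docFreq = docFreq;
        }
    }

    /** Toy formula (an assumption): (freq / docLen) * log2(N / docFreq). */
    static double toyScore(MockStats stats, double freq, double docLen) {
        double ratio = (double) stats.numberOfDocuments / stats.docFreq;
        double idf = Math.log(ratio) / Math.log(2);
        return (freq / docLen) * idf;
    }

    public static void main(String[] args) {
        MockStats stats = new MockStats(8, 2);
        double actual = toyScore(stats, 3, 12);
        // Hand calculation: (3 / 12) * log2(8 / 2) = 0.25 * 2 = 0.5
        double expected = 0.5;
        if (Math.abs(actual - expected) > 1e-9) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
        System.out.println("score matches the hand calculation");
    }
}
```

Choosing statistics that make the logarithm come out as a whole number keeps the hand calculation easy to verify.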

Performance tests will be performed in a separate issue.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Added a spoof version for all search-related classes that are necessary to properly fill the EasyStats object in EasySimilarity subclasses.

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>         Attachments: LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch
>
>
> Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
>  * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
>  * integration tests, in which a small collection is indexed and then searched using the Similarities.
> Performance tests will be performed in a separate issue.


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Fixed {{LMDirichletSimilarity}} (see my last comment).


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

 * EasySimilarity subclasses return their names in toString()
 * The two test cases return the name of the Similarity that failed the test.


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Fixed NaN and infinite scores in DFR and IB; all that's left is to fix the negative scores as well.
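
A small sketch of the kind of check a test can apply to every score so that NaN and infinite values are caught. The thread does not say which statistics triggered the problem in DFR and IB, so the two bad values below are generic floating-point cases rather than the actual failures, and FiniteScoreCheck is a hypothetical helper name.

```java
// Hypothetical helper for illustration; not part of the patch.
final class FiniteScoreCheck {

    /** True when the score is neither NaN nor +/-Infinity. */
    static boolean isFinite(float score) {
        return !Float.isNaN(score) && !Float.isInfinite(score);
    }

    public static void main(String[] args) {
        float nan = 0.0f / 0.0f;              // 0 divided by 0 yields NaN
        float negInf = (float) Math.log(0.0); // log of 0 yields -Infinity
        float ok = (float) Math.log(2.0);     // an ordinary finite value

        System.out.println("NaN accepted?       " + isFinite(nan));
        System.out.println("-Infinity accepted? " + isFinite(negInf));
        System.out.println("log(2) accepted?    " + isFinite(ok));
    }
}
```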


[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083084#comment-13083084 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

heh, i fought that guy last night for quite some time... couldn't figure out a good solution.

if you make a patch I can do some sanity testing though to try to help.


[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088182#comment-13088182 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

Robert: with this, all EasySimilarity-based classes have been tested. Do you think we could close this issue?


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Robert: I modified the nocommits a bit to provide input to the Similarities that looks somewhat plausible. I think it's better to avoid situations where e.g. docLen < freq to minimize the chance of error.


[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082859#comment-13082859 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

For the negative in the IF model, one solution is this:

{noformat}
-    return tfn * (float)(log2((N + 1) / (F + 0.5)));
+    return tfn * (float)(log2(1 + (N + 1) / (F + 0.5)));
{noformat}

in quick relevance tests, this was slightly better (likely not significant either way).
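
To see why the change above removes the negative values, here is a self-contained sketch of both forms. It assumes, as is usual in the DFR literature, that N is the number of documents and F is the term's total frequency in the collection, so F can exceed N. The class and method names are hypothetical. The original argument (N + 1) / (F + 0.5) drops below 1 whenever F + 0.5 > N + 1, which makes log2 negative. Adding 1 keeps the argument above 1.

```java
// Hypothetical names; only the two formulas come from the diff above.
final class IfModelSign {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Original form: negative whenever F + 0.5 > N + 1 (for tfn > 0). */
    static float before(float tfn, long N, long F) {
        return tfn * (float) (log2((N + 1) / (F + 0.5)));
    }

    /** Patched form: the log2 argument is always > 1, so the result is >= 0 for tfn >= 0. */
    static float after(float tfn, long N, long F) {
        return tfn * (float) (log2(1 + (N + 1) / (F + 0.5)));
    }

    public static void main(String[] args) {
        long N = 10;   // small collection
        long F = 100;  // a frequent term: occurs many times per document
        float tfn = 1f;
        // (N + 1) / (F + 0.5) = 11 / 100.5, which is below 1
        System.out.println("before: " + before(tfn, N, F));
        System.out.println("after:  " + after(tfn, N, F));
    }
}
```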


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Rebased the changes on the current state of trunk.


[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084096#comment-13084096 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

D: good question, I think if F > tfn, then D > 0, but I guess I have to prove it (and fix it if it isn't).

Could you tell me which sims were affected negatively?

freq: I didn't know about that! Still, I want to provide not "plausible", but at least "safe" statistics in this case. You didn't touch docFreq and numberOfDocuments, so I assumed at least these two are filled with actual values, is that so?


[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082303#comment-13082303 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

Ok, here is what i did here for BM25:

In the case norms are omitted by the user, the formula behaves as if b=0 (no length normalization). So this is always a possibility sims should handle, though for EasySimilarity perhaps it should just supply doclen=1 or something of that nature?

In the case norms are available, but sumTotalTermFreq is not (e.g. frequencies are omitted by the user), I use a value of 1 for avg doc len... This is probably ok since in the case of omitTF all the TF values will be 1 anyway.
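
A sketch of the two fallbacks described above, using the textbook BM25 term-frequency component rather than the code in the branch. The class name, the method names, the k1 = 1.2 and b = 0.75 values, and the use of a non-positive sumTotalTermFreq as the "not available" marker are all assumptions made for illustration.

```java
// Textbook BM25 tf component; names and constants are illustrative assumptions.
final class Bm25LengthNorm {

    /** tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)) */
    static double tfPart(double tf, double docLen, double avgDocLen, double k1, double b) {
        double norm = 1 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1) / (tf + k1 * norm);
    }

    /** Falls back to 1 when the sum is unavailable (assumed to be signalled by a value <= 0). */
    static double avgDocLen(long sumTotalTermFreq, long numDocs) {
        return sumTotalTermFreq <= 0 ? 1.0 : (double) sumTotalTermFreq / numDocs;
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        // With b = 0 the document length drops out of the formula entirely.
        System.out.println("b=0, short doc:    " + tfPart(2, 10, 100, k1, 0.0));
        System.out.println("b=0, long doc:     " + tfPart(2, 1000, 100, k1, 0.0));
        // With b = 0.75 the longer document is penalised.
        System.out.println("b=0.75, short doc: " + tfPart(2, 10, 100, k1, 0.75));
        System.out.println("b=0.75, long doc:  " + tfPart(2, 1000, 100, k1, 0.75));
        System.out.println("avg doc len fallback: " + avgDocLen(-1, 50));
    }
}
```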


[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Fixed the omit norms case.


[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082011#comment-13082011 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

I wouldn't worry about the scores being negative necessarily myself: there is nothing wrong with this.

But we should fix the NaN/Inf score problems.

Also: some of the stats that are newer in Lucene will get stupid results with the PreFlex codec, since it doesn't support them.

In my opinion add the following to the test's setup:
{noformat}
    assumeFalse("test cannot run with PreFlex codec !", 
        "PreFlex".equals(CodecProvider.getDefault().getDefaultFieldCodec()));
{noformat}

and I can help in the places where EasySim collects these stats, for example I think we should add checks in case totalTermFreq = -1, and throw UOE here instead.
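
A sketch of the suggested guard, where UOE stands for UnsupportedOperationException. StatsGuard and requireTotalTermFreq are hypothetical names, and where exactly EasySim would call such a check is not stated in the thread. The only detail taken from the comment is that a totalTermFreq of -1 means the statistic is unavailable.

```java
// Hypothetical guard for illustration; not the code from the patch.
final class StatsGuard {

    /** Marker value from the comment above: the codec does not record the statistic. */
    static final long NOT_AVAILABLE = -1;

    /** Returns the value unchanged, or throws if the statistic is missing. */
    static long requireTotalTermFreq(long totalTermFreq) {
        if (totalTermFreq == NOT_AVAILABLE) {
            throw new UnsupportedOperationException(
                "totalTermFreq is not available with this codec");
        }
        return totalTermFreq;
    }

    public static void main(String[] args) {
        System.out.println("available: " + requireTotalTermFreq(42));
        try {
            requireTotalTermFreq(NOT_AVAILABLE);
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast here keeps a missing statistic from silently turning into a meaningless score further down.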



[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087578#comment-13087578 ] 

David Mark Nemeskey edited comment on LUCENE-3357 at 8/19/11 6:59 AM:
----------------------------------------------------------------------

bq. I would just shoot for 'breadth' as far as across the different sims?
What do you mean by 'breadth'? Unit and integration tests (well... the "heart" test) already cover all the sims, and this includes score vs explanation comparison. As for the correctness tests, both LM and IB sims are tested, as well as four DFR methods. I can write tests for the three missing DFR sims, but that is as much breadth as I can add. Or do you have something else in mind?

  



[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082865#comment-13082865 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

OK, I committed this and ran it on 3 test collections. I was definitely getting negative scores (and not just in crazy corner cases).
In one case, fixing this improved MAP by more than 10%, so I think it's important.





[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Ah, I forgot to modify the explain() methods to handle the omitted norms case in the same way as score(). Fixed it now.
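
One way to keep explain() and score() from diverging on the omitted-norms case is to route both through the same helper. The sketch below only illustrates that idea and is not the code from the patch: the class, the Explained holder, the toy scoring formula, and the choice of falling back to the average document length are all assumptions made for the example.

```java
final class ScoreExplainSketch {
    /** Hypothetical pairing of a score with a short description of how it was computed. */
    static final class Explained {
        final float value;
        final String description;

        Explained(float value, String description) {
            this.value = value;
            this.description = description;
        }
    }

    /**
     * The only place where omitted norms are handled. Assumption: when no
     * per-document lengths exist (docLens == null), use the average length,
     * which turns length normalization into a no-op.
     */
    static float effectiveDocLen(float[] docLens, int doc, float avgDocLen) {
        return docLens == null ? avgDocLen : docLens[doc];
    }

    /** Toy saturating score, not one of the Similarities discussed in the thread. */
    static float score(float freq, float[] docLens, int doc, float avgDocLen) {
        float len = effectiveDocLen(docLens, doc, avgDocLen);
        return freq / (freq + len / avgDocLen);
    }

    /** Reuses score() so the explained value cannot drift from the real one. */
    static Explained explain(float freq, float[] docLens, int doc, float avgDocLen) {
        float len = effectiveDocLen(docLens, doc, avgDocLen);
        String source = docLens == null ? "norms omitted, avgDocLen used" : "from norms";
        return new Explained(score(freq, docLens, doc, avgDocLen),
            "freq=" + freq + ", docLen=" + len + " (" + source + ")");
    }

    public static void main(String[] args) {
        float[] noNorms = null; // models a field indexed with omitNorms
        Explained e = explain(2f, noNorms, 0, 10f);
        System.out.println(score(2f, noNorms, 0, 10f) + " == " + e.value + " : " + e.description);
    }
}
```

Because explain() delegates to score(), a score vs explanation comparison test holds by construction, with or without norms.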




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Correctness tests added for the rest of the DFR sims.




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Integration tests added. There are two of them, but 'ant test' only seems to run one of them?




[jira] [Resolved] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3357.
---------------------------------

    Resolution: Fixed

Great work!

These tests were very effective at finding problems in these formulas :)




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087632#comment-13087632 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

{quote}
I can write tests for the three missing DFR sims, but that is as much breadth as I can add. Or do you have something else in mind?
{quote}

That sounds good! 




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084080#comment-13084080 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

I think the change to D is fine? What about the rest of the equation (especially the variable "D")?

I tested D and it's fine with this change; however, with some of the other sims the changes had some negative effect... let's figure out D for now.

Also, as far as the values if you omit stuff: I don't think we should provide fake values that seem plausible. Remember, if you omit term frequencies such that totalTermFreq is unavailable, then freq will always be 1 :)




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087308#comment-13087308 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

{quote}
Do you think that I should re-write the ones where the computation of the gold value is missing? Or the other way around?
{quote}

I don't think so, I think we will take whatever we can get as far as tests :) I would just shoot for 'breadth' as far as across the different sims?




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082301#comment-13082301 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

{quote}
Robert: I'm on the NaN/Inf problems. As for the negative score, I'll leave it there for the time being; these Similarities should always return positive scores. I don't feel very confident about this test myself, so I guess I'll remove it (or at least make it optional) once all tests are successful.
{quote}

Ahh, OK. I didn't know the sims should always return positive scores! If this is really the case, then it's good to test for it.
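
A check of the kind discussed here (every score must be finite and non-negative) could look roughly like the sketch below. The two weight functions are toy formulas, picked only because the first one visibly goes negative for frequent terms; they are not the DFR or IB formulas from the patch, and all class and method names are hypothetical.

```java
final class ScoreSanitySketch {
    /** True if the score is a finite, non-negative number. */
    static boolean isValidScore(float score) {
        return !Float.isNaN(score) && !Float.isInfinite(score) && score >= 0f;
    }

    /** Toy weight: negative as soon as docFreq exceeds half of numDocs. */
    static float naiveWeight(long docFreq, long numDocs) {
        return (float) Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Same weight with 1 added inside the log, so the argument stays above 1. */
    static float flooredWeight(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Counts invalid scores over every docFreq in [1, numDocs]. */
    static int countInvalid(long numDocs, boolean floored) {
        int invalid = 0;
        for (long docFreq = 1; docFreq <= numDocs; docFreq++) {
            float w = floored ? flooredWeight(docFreq, numDocs) : naiveWeight(docFreq, numDocs);
            if (!isValidScore(w)) {
                invalid++;
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        System.out.println("naive weight, invalid scores:   " + countInvalid(10, false));
        System.out.println("floored weight, invalid scores: " + countInvalid(10, true));
    }
}
```

Sweeping the mock statistics rather than testing a single hand-picked value is what makes this kind of test catch cases that are not corner cases.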

{quote}
As for the PreFlex codec, I must admit I am not familiar with it, so I would be grateful for a few pointers.
{quote}

The PreFlex codec emulates the Lucene 3.x index format, which does not support totalTermFreq, sumTotalTermFreq, sumDocFreq, etc. It will return -1 for these.
Though I just realized: in some situations any codec can return -1 for these values, for example if frequencies are omitted by the user (omitTFAP).
So currently, unfortunately, similarities have to deal with this case (and also the case where norms == null, because norms were omitted by the user (omitNorms)!).

I've been working on the BM25 sim with all of this in mind; I'll commit an update to it as an example.
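
To make the two cases above concrete, here is a minimal sketch of how a BM25-style computation might guard against a -1 statistic and against missing norms. It is not the BM25 update mentioned in this comment. The class and method names are hypothetical, the fallback average length of 1 is an assumption, and a nullable docLen stands in for norms == null.

```java
final class MissingStatsSketch {
    /** Sentinel described in the thread for statistics the codec cannot supply. */
    static final long UNAVAILABLE = -1;

    static boolean isAvailable(long stat) {
        return stat != UNAVAILABLE;
    }

    /** Average field length, or a neutral 1 when sumTotalTermFreq is unavailable (assumption). */
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (!isAvailable(sumTotalTermFreq) || maxDoc <= 0) {
            return 1f;
        }
        return (float) sumTotalTermFreq / maxDoc;
    }

    /**
     * BM25-style saturated term frequency. A null docLen models omitted norms:
     * the length factor is dropped, which equals treating every document as
     * having exactly the average length.
     */
    static float tfNorm(float freq, Float docLen, float avgLen, float k1, float b) {
        if (docLen == null) {
            return freq * (k1 + 1) / (freq + k1);
        }
        float lengthFactor = 1 - b + b * docLen / avgLen;
        return freq * (k1 + 1) / (freq + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        float avg = avgFieldLength(UNAVAILABLE, 100); // stat missing, falls back to 1
        System.out.println("avgFieldLength = " + avg);
        System.out.println("tfNorm without norms = " + tfNorm(3f, null, avg, 1.2f, 0.75f));
        System.out.println("tfNorm with norms    = " + tfNorm(3f, 40f, 20f, 1.2f, 0.75f));
    }
}
```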




[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082702#comment-13082702 ] 

David Mark Nemeskey edited comment on LUCENE-3357 at 8/10/11 9:51 PM:
----------------------------------------------------------------------

Fixed NaN and infinite scores in DFR and IB; all that's left is to fix the negative scores as well. ... and everything else discussed earlier.

      was (Author: david_nemeskey):
    Fixed NaN and infinite scores in DFR and IB; all that's left is to fix the negative scores as well.
  



[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084096#comment-13084096 ] 

David Mark Nemeskey edited comment on LUCENE-3357 at 8/12/11 1:11 PM:
----------------------------------------------------------------------

D: good question, I think if F > tfn, then D > 0, but I guess I have to prove it (and fix it if it isn't).

Could you tell me which sims were affected negatively?

freq: I didn't know about that! Still, I want to provide not "plausible", but at least "safe" statistics in this case. You didn't touch docFreq and numberOfDocuments, so I assumed at least these two are filled with the actual values, is that so?
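
On the D question: the thread does not spell out the formula, so the following is only a numeric sanity check under an assumption, namely that D is the divergence between two Bernoulli distributions with phi = tfn / F and a prior p. Under that assumption D is a KL divergence, which is never negative and is zero only when phi equals p, so F > tfn would give D >= 0 rather than strictly D > 0. All names below are hypothetical.

```java
final class DivergenceSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** KL divergence (in bits) between Bernoulli(phi) and Bernoulli(p), for 0 < phi, p < 1. */
    static double bernoulliKl(double phi, double p) {
        return phi * log2(phi / p) + (1 - phi) * log2((1 - phi) / (1 - p));
    }

    /** Smallest divergence seen while sweeping tfn over the open interval (0, F). */
    static double minOverSweep(double F, double p, int steps) {
        double min = Double.POSITIVE_INFINITY;
        for (int i = 1; i < steps; i++) {
            double tfn = F * i / steps; // 0 < tfn < F, so 0 < phi < 1
            min = Math.min(min, bernoulliKl(tfn / F, p));
        }
        return min;
    }

    public static void main(String[] args) {
        // Hypothetical mock statistics: F = 100 occurrences, prior p = 0.01.
        System.out.println("min D over sweep = " + minOverSweep(100, 0.01, 1000));
    }
}
```

A sweep like this does not replace the proof, but it is a cheap way to spot a counterexample before attempting one.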

      was (Author: david_nemeskey):
    D: good question, I think if F > tfn, then D > 0, but I guess I have to prove it (and fix it if it isn't).

Could you tell me which sims were affected negatively?

freq: I didn't know about that! Still, I want to provide not "plausible", but at least "safe" statistics in this case. You didn't touch docFreq and numberOfDocuments, so I assumed at least these two are filled with actual values, is that so?
  



[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

License added.




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

I've added the correctness tests (is there a better name for these?). Do you think that I should re-write the ones where the computation of the gold value is missing? Or the other way around? :)




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Labels: gsoc gsoc2011 test  (was: gsoc gsoc2011)




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084104#comment-13084104 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

{quote}
freq: I didn't know about that! Still, I want to provide not "plausible", but at least "safe" statistics in this case. You didn't touch docFreq and numberOfDocuments, so I assumed at least these two are filled with the actual values, is that so?
{quote}

But I don't think we should populate it with arbitrary values. I like 1 because it is consistent with what you asked for if you omit term frequency (I think it's confusing to put something other than 1 here; it's inconsistent with how omitTF works for Lucene's default scoring).

Right, docFreq is always populated. But if you omitTF, freq will be 1 (for exact scorers) or <= 1 (for sloppy scorers), as no frequency is available.

I ran a quick test and got decreases in MAP (probably slight, maybe not even significant) with PL2 and Dirichlet with the changes. I figure we can first fix D and then move on to P and such, and save LM for last as it's a major pain :)




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082841#comment-13082841 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

I committed this to the branch even though we still have the failing tests with negative scores; I think it will prevent the patch from becoming hellacious.




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Fixed a bug in TestEasySimilarity that prevented Similarities that use a subclass of EasyStats from being tested. Also modified EasyStats so that totalBoost is set to the value of queryBoost in the constructor.




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081942#comment-13081942 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

Some of the tests fail for certain Similarities, so those have to be fixed as well.




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082847#comment-13082847 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

OK, I added some things (marked nocommit for your review):

Basically, norms/totalTermFreq/sumTotalTermFreq can be unavailable because freqs or norms are omitted, and currently all sims have to deal with this problem :(

Ideally sims would not have to deal with this stuff, but for the time being it prevents NaN/Inf for the hearts test if the test gets preflexcodec (about 1/4 of the time), and it will prevent an NPE if norms are omitted.

In case these values are unavailable, I set them to "1"... if you can review that this is OK, we can nuke the nocommits.
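
A minimal sketch of the fallback idea described above, assuming (this is not stated in the thread) that an unavailable statistic is reported with the sentinel value -1. The class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical helper; not taken from the LUCENE-3357 patch. */
final class StatFallback {
  /** Assumed sentinel for "statistic not available" (e.g. freqs or norms omitted). */
  static final long UNAVAILABLE = -1L;

  private StatFallback() {}

  /** Returns the statistic itself, or 1 when it is unavailable. */
  static long orOne(long stat) {
    return stat == UNAVAILABLE ? 1L : stat;
  }

  /**
   * Logarithm of a collection statistic. Without the fallback,
   * Math.log(-1) would yield NaN and poison the final score.
   */
  static double logStat(long stat) {
    return Math.log(orOne(stat));
  }
}
```

With the fallback in place, `StatFallback.logStat(-1)` evaluates to `Math.log(1)`, i.e. 0.0, rather than NaN.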




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Unit tests added.




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081818#comment-13081818 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

Looks good as a start: can you add the Apache license header to the new test file?




[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083194#comment-13083194 ] 

David Mark Nemeskey edited comment on LUCENE-3357 at 8/11/11 4:08 PM:
----------------------------------------------------------------------

Robert: I modified the nocommits a bit to provide input to the Similarities that looks somewhat plausible. I think it's better to avoid situations where e.g. docLen < freq to minimize the chance of error.

Please let me know what you think of these modifications; if they're OK, I'll nuke the nocommits.

      was (Author: david_nemeskey):
    Robert: I modified the nocommits a bit to provide input to the Similarities that looks somewhat plausible. I think it's better to avoid situations where e.g. docLen < freq to minimize the chance of error.
  



[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Labels: gsoc gsoc2011  (was: )




[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080416#comment-13080416 ] 

David Mark Nemeskey edited comment on LUCENE-3357 at 8/6/11 3:52 PM:
---------------------------------------------------------------------

Integration tests added. There are two of them; however, ant test runs only one?

      was (Author: david_nemeskey):
    Integration tests added. There are two of them; however, ant test only runs one?
  



[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088209#comment-13088209 ] 

Robert Muir commented on LUCENE-3357:
-------------------------------------

I think we must be very close: I just need to review this patch, and then let's get it committed and close the issue.




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Did something so that D and P (the binomial models) return only positive scores, but it is neither theoretically sound nor something I like much.

Robert: could you test D please, to see how the results are affected?




[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Attachment: LUCENE-3357.patch

Fixed integer division bug in BasicModelG.
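
For readers unfamiliar with this class of bug, here is a minimal sketch of how integer division can silently distort a score in Java. It is purely illustrative: the class, method, and parameter names are hypothetical, and this is not the actual BasicModelG code or the expression that was fixed.

```java
/** Hypothetical illustration of an integer-division bug; not the BasicModelG code. */
final class IntegerDivisionDemo {
  private IntegerDivisionDemo() {}

  /** Buggy: both operands are long, so the quotient is truncated before widening to double. */
  static double ratioTruncated(long totalTermFreq, long numberOfDocuments) {
    return totalTermFreq / numberOfDocuments;
  }

  /** Fixed: casting one operand to double forces floating-point division. */
  static double ratio(long totalTermFreq, long numberOfDocuments) {
    return (double) totalTermFreq / numberOfDocuments;
  }
}
```

For inputs 3 and 4, the truncated version returns 0.0 while the corrected version returns 0.75, which is the kind of difference a hand-calculated unit test exposes.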




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083080#comment-13083080 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

Apparently the Dirichlet method returns a negative score if tf / docLen < corpusTf / corpusLen. Unfortunately the negative score can be arbitrarily large in magnitude, so it's not as easy as adding a constant to the score. This of course makes sense if all documents are scored, as the function is monotone and consequently documents whose tf is 0 will always be ranked lower than those that contain the word. But this is not how IR engines work.

Having said that, I believe that we could simulate such a system. I don't know exactly how the query architecture works, but I presume the clauses that don't match a document are assigned a zero value. Now instead of this zero, the Scorer (or whatever class does this) could ask for a default value from the Similarity. In this case LMDirichletSimilarity could return score(stats, 0, Integer.MAX_VALUE), which is somewhere around -12.

If we don't do this, we have three options:
1. add score(stats, 0, Integer.MAX_VALUE) to the score
2. if (score < 0) return 0
3. add corpusTf / corpusLen * docLen to tf

All three ensure a non-negative score, but each has its own disadvantage:
1. adds a pretty big constant to the score, which may not play well with the other parts of the query
2. some documents that contain the term get the same 0 score as documents that don't (though I cannot say this is not in line with the LM approach)
3. this introduces a transformation that is difficult to characterize

For the time being, I'll go with 2, but we have to discuss this.
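
The sign condition and option 2 can be sketched as follows. This assumes the usual Dirichlet-smoothed form, score = log(1 + tf / (mu * p)) + log(mu / (docLen + mu)) with p = corpusTf / corpusLen; the class name, method names, and the value mu = 2000 used in the note below are illustrative assumptions, not the code in the patch.

```java
/** Hypothetical sketch of Dirichlet-smoothed LM scoring; not the LMDirichletSimilarity code. */
final class DirichletSketch {
  private DirichletSketch() {}

  /**
   * Raw score for one term. collectionProb is corpusTf / corpusLen.
   * Algebraically this equals log((tf + mu*p) / (p * (docLen + mu))),
   * which is negative exactly when tf / docLen < p.
   */
  static double rawScore(double tf, double docLen, double collectionProb, double mu) {
    return Math.log(1 + tf / (mu * collectionProb)) + Math.log(mu / (docLen + mu));
  }

  /** Option 2 from the list above: clamp negative scores to 0. */
  static double clampedScore(double tf, double docLen, double collectionProb, double mu) {
    return Math.max(0.0, rawScore(tf, docLen, collectionProb, mu));
  }
}
```

For example, with docLen = 1000, p = 0.01 and mu = 2000, a document with tf = 1 (tf / docLen = 0.001 < p) gets a negative raw score and is clamped to 0, while tf = 50 (tf / docLen = 0.05 > p) gets a positive score that is left unchanged.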




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087578#comment-13087578 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

> I would just shoot for 'breadth' as far as across the different sims?
What do you mean by 'breadth'? Unit and integration tests (well... the "heart" test) already cover all the sims, and this includes score vs explanation comparison. As for the correctness tests, both LM and IB sims are tested, as well as four DFR methods. I can write tests for the three missing DFR sims, but that is as much breadth as I can add. Or do you have something else in mind?




[jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities

Posted by "David Mark Nemeskey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082261#comment-13082261 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

Robert: I'm working on the NaN/Inf problems. As for the negative-score test, I'll leave it in for the time being, since these Similarities should always return positive scores. I don't feel very confident about this test myself, so I guess I'll remove it (or at least make it optional) once all tests are successful.

As for the PreFlex codec, I must admit I am not familiar with it, so I would be grateful for a few pointers.

