Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 21:38:56 UTC

[jira] [Created] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Similarity.java javadocs and simplifications for 4.0
----------------------------------------------------

                 Key: LUCENE-3749
                 URL: https://issues.apache.org/jira/browse/LUCENE-3749
             Project: Lucene - Java
          Issue Type: Task
    Affects Versions: 4.0
            Reporter: Robert Muir
             Fix For: 4.0
         Attachments: LUCENE-3749.patch

As part of adding additional scoring systems to Lucene, we made a lower-level Similarity,
and the existing stuff became e.g. TFIDFSimilarity, which extends it.

However, I always feel bad about the complexity introduced here (though I do feel there
are some "excuses", in that it's a difficult challenge).

In order to try to mitigate this, we also exposed an easier API (SimilarityBase) on top of 
it that makes some assumptions (and trades off some performance) to try to provide something 
consumable for e.g. experiments.

Still, we can clean up a few things with the low-level API: fix outdated documentation and
shoot for better/clearer naming, etc.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199219#comment-13199219 ] 

Michael McCandless commented on LUCENE-3749:
--------------------------------------------

+1 looks great!
                


[jira] [Updated] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3749:
--------------------------------

    Attachment: LUCENE-3749_part2.patch

Here's part 2: nuking SimilarityProvider (use PerFieldSimilarityWrapper instead if you want special per-field stuff).

This really simplifies the APIs, especially for say a casual user who just wants to try out a new ranking model.
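
To illustrate the idea behind a per-field wrapper, here is a minimal, dependency-free sketch. It does not use Lucene's classes: ScoringModel, PerFieldModelWrapper, MapBackedWrapper and the toy scoring formulas are all hypothetical names invented for this example. The only thing taken from the discussion is the pattern itself: a single object is configured globally, and it chooses the real scoring model by field name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scoring model; NOT Lucene's Similarity class.
interface ScoringModel {
    float score(float termFreq, float docLength);
}

// Models the pattern only: one wrapper object that picks a model per field.
abstract class PerFieldModelWrapper {
    // Subclasses decide which model applies to a given field name.
    abstract ScoringModel get(String fieldName);

    // Scoring is delegated to whatever model the field maps to.
    final float score(String fieldName, float termFreq, float docLength) {
        return get(fieldName).score(termFreq, docLength);
    }
}

// Example subclass: explicit per-field overrides plus a fallback default.
class MapBackedWrapper extends PerFieldModelWrapper {
    private final Map<String, ScoringModel> perField = new HashMap<>();
    private final ScoringModel defaultModel;

    MapBackedWrapper(ScoringModel defaultModel) {
        this.defaultModel = defaultModel;
    }

    MapBackedWrapper with(String fieldName, ScoringModel model) {
        perField.put(fieldName, model);
        return this;
    }

    @Override
    ScoringModel get(String fieldName) {
        return perField.getOrDefault(fieldName, defaultModel);
    }
}

class PerFieldDemo {
    public static void main(String[] args) {
        ScoringModel rawTf = (tf, len) -> tf;                  // toy model: ignores length
        ScoringModel lengthNormalized = (tf, len) -> tf / len; // toy model: divides by length

        PerFieldModelWrapper wrapper =
            new MapBackedWrapper(rawTf).with("body", lengthNormalized);

        System.out.println(wrapper.score("title", 2f, 4f)); // prints 2.0 (default model)
        System.out.println(wrapper.score("body", 2f, 4f));  // prints 0.5 (per-field override)
    }
}
```

A casual user who only wants one ranking model never needs the wrapper at all; it only comes into play when different fields should be scored differently.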
                


[jira] [Commented] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Neil Hooey (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222521#comment-13222521 ] 

Neil Hooey commented on LUCENE-3749:
------------------------------------

Thanks Robert, I've got it working now. I just set my default similarity to {{SchemaSimilarityFactory}} and it works just as it did before.
                


[jira] [Resolved] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3749.
---------------------------------

    Resolution: Fixed
      Assignee: Robert Muir
    


[jira] [Commented] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199382#comment-13199382 ] 

Robert Muir commented on LUCENE-3749:
-------------------------------------

Thanks Mike: I will prematurely commit, just to try to make some incremental improvements.

I think it's especially confusing/horrible in trunk after LUCENE-3555, and the javadocs are 
out of date since norms are no longer required to be single bytes, etc.

If anyone objects, or has better ideas (ESPECIALLY NAMING: it's the worst!), don't hesitate...
this stuff is really important.

                


[jira] [Updated] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3749:
--------------------------------

    Attachment: LUCENE-3749.patch

patch
                


[jira] [Commented] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222059#comment-13222059 ] 

Robert Muir commented on LUCENE-3749:
-------------------------------------

This patch does not break anything; it changes the configuration mechanism for an unreleased feature.

If you want a per-field similarity, then configure a <similarity> in your schema.xml that
extends PerFieldSimilarityWrapper. If you want it to defer to the fieldType in the schema,
then make it SchemaAware so that it's initialized with an IndexSchema object.

An example one (SchemaSimilarityFactory) is provided that does just this, and here is its test: http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml
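
The "defer to the fieldType" behaviour described above can be sketched roughly as follows. This is a plain-Java model, not Solr code: FieldScorer, ToySchema, SchemaDeferringScorer and the inform method name are all assumptions made up for this illustration, and the real SchemaSimilarityFactory may be structured quite differently. The sketch only shows the lookup order: field name, then field type, then the type's own scorer if it declared one, otherwise a default.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical scoring model; stands in for a similarity implementation.
interface FieldScorer {
    String name();
}

// Hypothetical, heavily simplified stand-in for a schema object.
final class ToySchema {
    private final Map<String, String> fieldToType = new HashMap<>();
    private final Map<String, FieldScorer> typeToScorer = new HashMap<>();

    ToySchema field(String fieldName, String typeName) {
        fieldToType.put(fieldName, typeName);
        return this;
    }

    ToySchema typeScorer(String typeName, FieldScorer scorer) {
        typeToScorer.put(typeName, scorer);
        return this;
    }

    // Returns the scorer declared on the field's type, or null if there is none.
    FieldScorer scorerFor(String fieldName) {
        String type = fieldToType.get(fieldName);
        return type == null ? null : typeToScorer.get(type);
    }
}

// Models a global, schema-aware scorer that defers to per-type settings.
final class SchemaDeferringScorer {
    private final FieldScorer fallback;
    private ToySchema schema; // handed in after construction

    SchemaDeferringScorer(FieldScorer fallback) {
        this.fallback = fallback;
    }

    // Hypothetical hook through which the schema is supplied.
    void inform(ToySchema schema) {
        this.schema = schema;
    }

    FieldScorer get(String fieldName) {
        FieldScorer fromSchema = (schema == null) ? null : schema.scorerFor(fieldName);
        return fromSchema != null ? fromSchema : fallback;
    }
}

class SchemaDeferDemo {
    public static void main(String[] args) {
        ToySchema schema = new ToySchema()
            .field("tags", "payloads")
            .field("title", "text")
            .typeScorer("payloads", () -> "payload-aware");

        SchemaDeferringScorer global = new SchemaDeferringScorer(() -> "default");
        global.inform(schema);

        System.out.println(global.get("tags").name());  // prints payload-aware
        System.out.println(global.get("title").name()); // prints default
    }
}
```

In this model, a per-type scorer is consulted only when the globally configured object chooses to look it up, so a per-type declaration on its own has no effect unless the global object defers to it.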
                


[jira] [Commented] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Neil Hooey (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222055#comment-13222055 ] 

Neil Hooey commented on LUCENE-3749:
------------------------------------

This change breaks per-field similarity configuration in Solr. Specifically with this commit:

{code}
commit 5d371928263d8d78d0e52781340ae95506bd9bf6
Author: Robert Muir <rm...@apache.org>
Date:   Mon Feb 6 12:48:01 2012 +0000

    LUCENE-3749: replace SimilarityProvider with PerFieldSimilarityWrapper
    
    git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1241001 13f79535-47bb-0310-9956-ffa450edef68
{code}

I have the following configuration in my schema.xml:

{code}
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="com.foo.lucene.analysis.core.PayloadTermTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="com.foo.lucene.search.PayloadSimilarity" />
</fieldtype>
{code}

But when I build against and use a version of Solr that includes the commit mentioned above, my similarity class is no longer executed. I've confirmed this by putting print statements in the scorePayload(), tf() and idf() functions: they print before that commit is included and don't print after.

It seems this is intentional, based on Robert Muir's comments, but how can you get per-field similarity to work in Solr with this new code?
                


[jira] [Commented] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200771#comment-13200771 ] 

Michael McCandless commented on LUCENE-3749:
--------------------------------------------

+1
                