Posted to dev@lucene.apache.org by "Kaleem Ahmed (Created) (JIRA)" <ji...@apache.org> on 2011/12/07 10:54:40 UTC

[jira] [Created] (SOLR-2953) Introducing hit Count as an alternative to score

Introducing hit Count as an alternative to score 
-------------------------------------------------

                 Key: SOLR-2953
                 URL: https://issues.apache.org/jira/browse/SOLR-2953
             Project: Solr
          Issue Type: New Feature
          Components: search
    Affects Versions: 4.0
            Reporter: Kaleem Ahmed
             Fix For: 4.0


Today the score is the relevancy factor for a query against a document, and that score is relative to the number of documents in the index. Why not also offer another relevancy measure, say "hitCount", that is absolute for a given document and a given query and does not depend on the number of documents in the index? This would help a lot with frequently changing indexes, where search rules are predefined together with the relevancy factor a document needs in order to qualify for that query (search rule).
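To make the dependence on index size concrete, here is a minimal sketch. It assumes the classic TF-IDF inverse document frequency, idf = 1 + ln(numDocs / (docFreq + 1)), which is the form used by Lucene's DefaultSimilarity in the 3.x line. The class name, the document frequency, and the index sizes are made up for illustration.

```java
// Sketch only: shows how one term's idf factor moves as the index grows,
// even though the document and the query stay the same.
final class IdfDriftSketch {

    // Classic TF-IDF idf (assumed form): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long docFreq = 9; // hypothetical: the term occurs in 9 documents
        for (long numDocs : new long[] {1_000L, 10_000L, 100_000L}) {
            // Same term, same docFreq, larger index: the idf factor keeps growing.
            System.out.printf("numDocs=%d idf=%.4f%n", numDocs, idf(docFreq, numDocs));
        }
    }
}
```

Because idf enters the score of every matching term, a fixed per-query threshold on the score would need retuning whenever the index grows, which is the problem described above.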

Example: consider a use case where a list of queries is defined, each with a threshold, and these queries are run against a frequently updated index to find the documents that score above the threshold. That is, when a document's relevancy factor crosses the threshold for a query, the document is said to qualify for that query.
For this use case to work, the score must not change every time new documents are added to the index. So we introduce a new feature called "hitCount", which represents the relevancy of a document against a query and is absolute (it does not change with index size).

The hitCount is a positive integer and is calculated as follows.
Example: a document with the text "the quick fox jumped over the lazy dog, while the lazy dog was too lazy to care"
1. For the query "lazy AND dog", the hitCount is (number of occurrences of "lazy" in the document) + (number of occurrences of "dog" in the document) => 3 + 2 => 5
2. For the phrase query "lazy dog", the hitCount is (number of occurrences of the exact phrase "lazy dog" in the document) => 2
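The two rules above can be sketched in plain Java, without any Lucene or Solr classes. This is only an illustration of the arithmetic, not the implementation described in this issue. The class and method names are hypothetical, the tokenizer is a naive stand-in for a real analyzer, and returning 0 when an AND term is missing is an assumption (such a document would simply not match).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch only: computes the proposed hitCount for one document.
final class HitCountSketch {

    // Naive tokenizer (assumption): lowercase, split on non-letter/non-digit runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // AND query: sum of the occurrence counts of every term.
    // Assumption: if any term is absent the document does not match, so return 0.
    static int andHitCount(List<String> doc, String... terms) {
        int total = 0;
        for (String term : terms) {
            int count = 0;
            for (String token : doc) {
                if (token.equals(term)) {
                    count++;
                }
            }
            if (count == 0) {
                return 0;
            }
            total += count;
        }
        return total;
    }

    // Phrase query: number of positions where the terms occur consecutively, in order.
    static int phraseHitCount(List<String> doc, String... phrase) {
        if (phrase.length == 0) {
            return 0;
        }
        List<String> wanted = Arrays.asList(phrase);
        int count = 0;
        for (int i = 0; i + phrase.length <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.length).equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize(
            "the quick fox jumped over the lazy dog, while the lazy dog was too lazy to care");
        System.out.println(andHitCount(doc, "lazy", "dog"));    // prints 5
        System.out.println(phraseHitCount(doc, "lazy", "dog")); // prints 2
    }
}
```

Neither method looks at any other document, so the result stays the same no matter how many documents are added to the index.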

This would be very useful as an alternative scoring mechanism.

I have already implemented all of this in a downloaded copy of the Solr source code, and we are using it. So far it is working well.
It would be great if this feature were added to trunk (upstream Solr), so that we don't have to reapply the changes every time a new version is released and so that others can benefit from it too.







--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (SOLR-2953) Introducing hit Count as an alternative to score

Posted by "Kaleem Ahmed (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165050#comment-13165050 ] 

Kaleem Ahmed edited comment on SOLR-2953 at 12/8/11 6:58 AM:
-------------------------------------------------------------

I don't think changing the Similarity covers everything. See this link:
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/package-summary.html#changingSimilarity

It says:
{quote}

Changing Scoring — Expert Level

Changing scoring is an expert level task, so tread carefully and be prepared to share your code if you want help.

With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by three main classes:

    Query — The abstract object representation of the user's information need.
    Weight — The internal interface representation of the user's Query, so that Query objects may be reused.
    Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.{quote}
I mainly changed the Scorer and Query classes of the different query types to achieve it.

I will post the patch soon.
                
      was (Author: kaleemxy):
    I don't think changing the Similarity covers everything. See this link:
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/package-summary.html#changingSimilarity

It says:
{quote}

Changing Scoring — Expert Level

Changing scoring is an expert level task, so tread carefully and be prepared to share your code if you want help.

With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by three main classes:

    Query — The abstract object representation of the user's information need.
    Weight — The internal interface representation of the user's Query, so that Query objects may be reused.
    Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.{quote}
I mainly changed the Scorer and Query classes of the different query types to achieve it.
                  
> Introducing hit Count as an alternative to score 
> -------------------------------------------------
>
>                 Key: SOLR-2953
>                 URL: https://issues.apache.org/jira/browse/SOLR-2953
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 4.0
>            Reporter: Kaleem Ahmed
>              Labels: features
>             Fix For: 4.0
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h


[jira] [Commented] (SOLR-2953) Introducing hit Count as an alternative to score

Posted by "Erick Erickson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164386#comment-13164386 ] 

Erick Erickson commented on SOLR-2953:
--------------------------------------

Can you make a patch and upload it? See: http://wiki.apache.org/solr/HowToContribute#Generating_a_patch

Then people can take a look and see how you implemented it and discuss.
                

[jira] [Closed] (SOLR-2953) Introducing hit Count as an alternative to score

Posted by "Kaleem Ahmed (Closed) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaleem Ahmed closed SOLR-2953.
------------------------------

    Resolution: Not A Problem

Closing, as 4.0 already has this feature implemented through the similarity package classes.
                

[jira] [Commented] (SOLR-2953) Introducing hit Count as an alternative to score

Posted by "Kaleem Ahmed (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211798#comment-13211798 ] 

Kaleem Ahmed commented on SOLR-2953:
------------------------------------

It looks like the present trunk (4.0) allows implementing a custom score through a plugin, by overriding the similarity package's DefaultSimilarityProvider class, so I guess the change is not needed as a patch.

The changes I made were against version 3.5 and won't be compatible with the present trunk, so I am closing this issue.
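As a rough illustration of why a pluggable similarity can yield a hit count, the sketch below models a simplified TF-IDF score as the sum over query terms of tf(freq) * idf^2 * lengthNorm. The ToySimilarity interface and everything else in the block are hypothetical stand-ins, not the Lucene or Solr API. Coord, queryNorm and boosts are left out (assumed to be 1), so this does not settle whether a similarity alone covers every query type.

```java
// Sketch only: a toy model of TF-IDF scoring hooks, NOT the Lucene API.
final class TfIdfCollapseSketch {

    // Hypothetical hooks that a TF-IDF style similarity would let a plugin override.
    interface ToySimilarity {
        double tf(int freq);
        double idf(long docFreq, long numDocs);
        double lengthNorm(int fieldLength);
    }

    // Classic-style defaults (assumed): sqrt tf, log idf, 1/sqrt(length) norm.
    static final ToySimilarity CLASSIC = new ToySimilarity() {
        @Override public double tf(int freq) { return Math.sqrt(freq); }
        @Override public double idf(long docFreq, long numDocs) {
            return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
        }
        @Override public double lengthNorm(int fieldLength) { return 1.0 / Math.sqrt(fieldLength); }
    };

    // Hit-count variant: raw frequency, every index-dependent factor pinned to 1.
    static final ToySimilarity HIT_COUNT = new ToySimilarity() {
        @Override public double tf(int freq) { return freq; }
        @Override public double idf(long docFreq, long numDocs) { return 1.0; }
        @Override public double lengthNorm(int fieldLength) { return 1.0; }
    };

    // Simplified score: sum over query terms of tf * idf^2 * lengthNorm.
    static double score(ToySimilarity sim, int[] freqs, long[] docFreqs,
                        long numDocs, int fieldLength) {
        double sum = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            double idf = sim.idf(docFreqs[i], numDocs);
            sum += sim.tf(freqs[i]) * idf * idf * sim.lengthNorm(fieldLength);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] freqs = {3, 2};     // "lazy" x3, "dog" x2 in the example document
        long[] docFreqs = {9, 4}; // hypothetical document frequencies
        int fieldLength = 17;     // tokens in the example document
        for (long numDocs : new long[] {1_000L, 100_000L}) {
            System.out.printf("numDocs=%d classic=%.4f hitCount=%.1f%n", numDocs,
                score(CLASSIC, freqs, docFreqs, numDocs, fieldLength),
                score(HIT_COUNT, freqs, docFreqs, numDocs, fieldLength));
        }
    }
}
```

In this model the hit-count variant returns 5.0 for the "lazy AND dog" example at any index size, while the classic variant changes as numDocs changes.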
                

[jira] [Commented] (SOLR-2953) Introducing hit Count as an alternative to score

Posted by "Jan Høydahl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164764#comment-13164764 ] 

Jan Høydahl commented on SOLR-2953:
-----------------------------------

Can't you do this simply by plugging in your own Similarity class in Schema?
                

[jira] [Commented] (SOLR-2953) Introducing hit Count as an alternative to score

Posted by "Kaleem Ahmed (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165050#comment-13165050 ] 

Kaleem Ahmed commented on SOLR-2953:
------------------------------------

I don't think changing the Similarity covers everything. See this link:
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/package-summary.html#changingSimilarity

It says:
{quote}

Changing Scoring — Expert Level

Changing scoring is an expert level task, so tread carefully and be prepared to share your code if you want help.

With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by three main classes:

    Query — The abstract object representation of the user's information need.
    Weight — The internal interface representation of the user's Query, so that Query objects may be reused.
    Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.{quote}
I mainly changed the Scorer and Query classes of the different query types to achieve it.
                