
Posted to java-user@lucene.apache.org by sachin <sa...@noemacorp.com> on 2006/08/23 14:30:53 UTC

Scoring Technique based on Relevance Feedback & other Parameters

 

Hello Great/smart guys 

       This is my first question to this group, as I started working with
Lucene last month.

 

        Lucene scores documents using TF-IDF vector analysis. It also
provides the Scorer and Weight classes in the search package. By
implementing a new (Query, Weight, Scorer) tuple I can easily implement a
new scoring technique. Unfortunately, the Lucene index appears to store only
TF / position vectors for each term within a document.
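
For readers following the thread, the TF-IDF weighting mentioned above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Lucene's actual Scorer code: the class and method names are hypothetical, and the formulas (square root of the term frequency, logarithmic inverse document frequency) only approximate the general shape of classic TF-IDF scoring, leaving out length normalization and boosts.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names belong to Lucene's API.
final class TfIdfSketch {

    // Term-frequency component: dampen raw counts with a square root.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Sum of tf * idf over the query terms that occur in the document.
    static double score(List<String> queryTerms, List<String> doc,
                        List<List<String>> corpus) {
        double total = 0.0;
        for (String term : queryTerms) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) {
                continue; // term absent: contributes nothing
            }
            int docFreq = 0;
            for (List<String> other : corpus) {
                if (other.contains(term)) {
                    docFreq++;
                }
            }
            total += tf(freq) * idf(docFreq, corpus.size());
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "search", "index");
        List<String> d2 = Arrays.asList("lucene", "lucene", "feedback");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        System.out.println(score(Arrays.asList("feedback"), d2, corpus));
    }
}
```

Only the term frequency within the document and the document frequency across the collection feed into this score, which is why the question of storing extra per-term data comes up next.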

 

        I am interested in investigating a new scoring technique that uses
other term-related parameters to rank documents. For example, web page
ranking is assisted by parameters such as the number of links pointing to a
page and the number of links going out from it. This suggests that we need
to store more information about terms in the index. But how? That is what I
need to investigate.

 

        Another parameter is relevance feedback from the user. Ranking
should be affected by that feedback.
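
One well-known way to let user feedback influence ranking is Rocchio-style query refinement, sketched below. The thread does not name this technique, so treat it as an illustrative assumption: the query and documents are represented as hypothetical term-to-weight maps, and the alpha, beta, and gamma parameters are free choices, not values taken from Lucene.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Rocchio relevance feedback over term-weight maps.
final class RocchioSketch {

    // refined = alpha * query + beta * mean(relevant) - gamma * mean(nonRelevant)
    static Map<String, Double> refine(Map<String, Double> query,
                                      List<Map<String, Double>> relevant,
                                      List<Map<String, Double>> nonRelevant,
                                      double alpha, double beta, double gamma) {
        Map<String, Double> result = new HashMap<>();
        addScaled(result, query, alpha);
        for (Map<String, Double> doc : relevant) {
            addScaled(result, doc, beta / relevant.size());
        }
        for (Map<String, Double> doc : nonRelevant) {
            addScaled(result, doc, -gamma / nonRelevant.size());
        }
        // Drop terms whose weight fell to zero or below.
        result.values().removeIf(w -> w <= 0.0);
        return result;
    }

    // Adds factor * source to target, term by term.
    private static void addScaled(Map<String, Double> target,
                                  Map<String, Double> source, double factor) {
        for (Map.Entry<String, Double> e : source.entrySet()) {
            target.merge(e.getKey(), e.getValue() * factor, Double::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("lucene", 1.0);
        Map<String, Double> liked = new HashMap<>();
        liked.put("lucene", 1.0);
        liked.put("scoring", 1.0);
        System.out.println(refine(query, List.of(liked), List.of(), 1.0, 0.75, 0.15));
    }
}
```

The refined term weights could then be turned into a new, boosted query and run again, so that documents resembling the ones the user marked as relevant move up.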

 

Would anyone be interested in helping out, or is anyone already thinking about the same problem?

Re: Scoring Technique based on Relevance Feedback & other Parameters

Posted by Grant Ingersoll <gs...@syr.edu>.

On Aug 23, 2006, at 8:30 AM, sachin wrote:

> Hello Great/smart guys
>
>        This is my first question for this group as I started  
> working on the Lucene last month.
>
>
>
>         Lucene provide the scoring of documents based on TF-IDF  
> vector analysis. Lucene also provides the Scorer and Weight inside  
> the Search package. By implementing new type of  tuple  
> (Query,Weight,Scorer) I can easily implement new Scoring technique.  
> Unfortunatly Lucene index shows that it stores only TF / Position  
> vectors for each term within document.
>
>
>
>         I am interested in investigating new scoring technique  
> where I will use some other parameters relating to the Term to rank  
> the documents. For an example web page ranking is assisted by  
> parameters like number of links towards webpage and number of link  
> from web – page.  It indicates that we need to store relatively  
> more information about terms within the index. But HoW ? … I need  
> to investigate
>
>
People are working on this.  Search the java-dev archives for
Flexible Indexing or Payloads.  See
http://issues.apache.org/jira/browse/LUCENE-662 for a possible patch.
Note that the patch is not committed yet (you can be one of the first to
test it!).

>
>
>         Another parameter is relevance feedback from the User.  
> Ranking should get affected by relevance feedback from the user.
>
>
Take a look at Term Vectors.  Search the list.  Read about them at
http://www.cnlp.org/apachecon2005 or in "Lucene in Action".  There is
also a contribution called "More Like This" that you may find useful.


> Would someone interested in helping out or thinking about the same  
> problem.
>
>

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886

RE: Scoring Technique based on Relevance Feedback & other Parameters

Posted by sachin <sa...@noemacorp.com>.

Hello,
A short and sweet question:

Does Apache allow me to modify the final classes that it distributes for
scorers? Or may I copy some of the Lucene code into a commercial
application within my organization?

TermScorer and BooleanScorer are final classes, but all the other scorers
are non-final.

I ask because I am interested in changing the scoring strategy to one based
on relevance feedback.

One observation:
Lucene is designed around a rather inflexible scoring mechanism based on
TF-IDF. It would be really nice if much simpler scoring mechanisms could
also be plugged in.

The Query object could carry a "ScoringStrategy" object that is passed
to the scorer.

ScoringStrategy might look like this:

interface ScoringStrategy
{
    // Returns the score as a float.
    // The list of objects passed to the strategy holds whatever inputs
    // it needs to calculate the score.
    float score(java.util.List<Object> objects);
}

Are there other possible ways to implement flexible scoring?
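
To make the proposal above more concrete, here is a small sketch of how such a pluggable strategy might be used to rank candidates. Everything in it is hypothetical: the interface is redeclared so the example stands alone, and SumOfNumbersStrategy and StrategyRanker are illustrative names that do not exist in Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pluggable scoring interface, modelled on the proposal above.
interface ScoringStrategy {
    float score(List<Object> features);
}

// Example strategy: add up every numeric feature and ignore everything else.
final class SumOfNumbersStrategy implements ScoringStrategy {
    @Override
    public float score(List<Object> features) {
        float total = 0f;
        for (Object feature : features) {
            if (feature instanceof Number) {
                total += ((Number) feature).floatValue();
            }
        }
        return total;
    }
}

final class StrategyRanker {

    // Returns the indices of the candidates, highest score first.
    static List<Integer> rank(List<List<Object>> candidates, ScoringStrategy strategy) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            order.add(i);
        }
        order.sort((a, b) -> Float.compare(strategy.score(candidates.get(b)),
                                           strategy.score(candidates.get(a))));
        return order;
    }

    public static void main(String[] args) {
        List<List<Object>> candidates = Arrays.asList(
                Arrays.<Object>asList(1.0f, "title match"),
                Arrays.<Object>asList(2.0f, 3));
        System.out.println(rank(candidates, new SumOfNumbersStrategy()));
    }
}
```

The ranking code never needs to know how a score is computed, so a feedback-based strategy could be swapped in without touching it.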





-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Thursday, August 24, 2006 1:14 AM
To: java-user@lucene.apache.org
Subject: Re: Scoring Technique based on Relevance Feeback & other Parameters


: package. By implementing new type of  tuple (Query,Weight,Scorer) I can
: easily implement new Scoring technique. Unfortunatly Lucene index shows that
: it stores only TF / Position vectors for each term within document.

:         I am interested in investigating new scoring technique where I will
: use some other parameters relating to the Term to rank the documents. For an
: example web page ranking is assisted by parameters like number of links
: towards webpage and number of link from web - page.  It indicates that we
: need to store relatively more information about terms within the index. But
: HoW ? . I need to investigate

there is a distinction between storing more information about a term and
storing additional information about a document.

the flexible payload type approaches that have been discussed should make
storing info about a term easy (e.g.: the term is "wind", its type is
"noun", its usage in the sentence is as a "subject", its importance is
"88.3") but you can already store additional information about documents
(like the total popularity of a document) in Lucene -- either by using the
document boost (if you always want it to be part of the score calculations)
or as a separate field which you can factor into the score calculations
using something like FunctionQuery...

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html

...i use this all the time to make "recent" docs score better, or "more
popular docs" score better.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






Re: sharing of Design documents of Lucene

Posted by Michael McCandless <lu...@mikemccandless.com>.

> Its nice if someone shares design documents of Lucene with Me.

You could start with the javadocs here:

    http://lucene.apache.org/java/docs/api/index.html

Click on the "Document" class to see a description of Documents in
particular.

Or for a broader "get your feet wet" introduction, here:

   http://lucene.apache.org/java/docs/gettingstarted.html

And the FAQ is also helpful:

   http://wiki.apache.org/jakarta-lucene/LuceneFAQ

Mike


sharing of Design documents of Lucene

Posted by sachin <sa...@noemacorp.com>.

It would be nice if someone could share Lucene's design documents with me.




Re: Scoring Technique based on Relevance Feedback & other Parameters

Posted by Chris Hostetter <ho...@fucit.org>.

: package. By implementing new type of tuple (Query,Weight,Scorer) I can
: easily implement new Scoring technique. Unfortunatly Lucene index shows that
: it stores only TF / Position vectors for each term within document.

: I am interested in investigating new scoring technique where I will
: use some other parameters relating to the Term to rank the documents. For an
: example web page ranking is assisted by parameters like number of links
: towards webpage and number of link from web - page. It indicates that we
: need to store relatively more information about terms within the index. But
: HoW ? . I need to investigate

there is a distinction between storing more information about a term and
storing additional information about a document.

the flexible payload type approaches that have been discussed should make
storing info about a term easy (e.g.: the term is "wind", its type is
"noun", its usage in the sentence is as a "subject", its importance is
"88.3") but you can already store additional information about documents
(like the total popularity of a document) in Lucene -- either by using the
document boost (if you always want it to be part of the score calculations)
or as a separate field which you can factor into the score calculations
using something like FunctionQuery...

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html

...i use this all the time to make "recent" docs score better, or "more
popular docs" score better.
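
The idea of folding document-level values such as popularity or recency into a score can be sketched in isolation like this. It is a hedged illustration only: the class, the method names, the 30-day half-life, and the way the factors are combined are all assumptions made for the example, and none of it reproduces the FunctionQuery API.

```java
// Hypothetical sketch of combining a text score with per-document values.
final class DocumentBoostSketch {

    // Recency factor in (0, 1]: 1.0 for a new document, halving every halfLifeDays.
    static double recency(double ageInDays, double halfLifeDays) {
        return Math.pow(0.5, ageInDays / halfLifeDays);
    }

    // Popularity factor: log-dampened so very popular documents do not dominate.
    static double popularity(long viewCount) {
        return 1.0 + Math.log1p(viewCount);
    }

    // Final score: the text relevance scaled by both document-level factors.
    // The recency term is bounded to [0.5, 1.0] so old documents are not wiped out.
    static double combined(double textScore, double ageInDays, long viewCount) {
        double recencyWeight = 0.5 + 0.5 * recency(ageInDays, 30.0);
        return textScore * popularity(viewCount) * recencyWeight;
    }

    public static void main(String[] args) {
        System.out.println(combined(2.0, 0.0, 0L));    // new, unviewed document
        System.out.println(combined(2.0, 90.0, 500L)); // older but popular document
    }
}
```

Both inputs are properties of the whole document, so they fit in an ordinary stored or indexed field and need no per-term storage.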

-Hoss


RE: Scoring Technique based on Relevance Feedback & other Parameters

Posted by Dejan Nenov <de...@jollyobject.com>.

Indeed, you bring up interesting questions. You may want to take a look at
Nutch first, although I am not sure whether they have done any of the
Google-like ranking you mention.

That said, collaborative relevance enhancement based on user feedback would
be a nice Web-2.0-ish feature to bake into the Lucene core :-)

 

Dejan

 


RE: Scoring Technique based on Relevance Feedback & other Parameters

Posted by "Russell M. Allen" <Ru...@aebn.net>.

I have a similar interest in specifying a custom scoring strategy.  I
previously posted about it under the subject "Scoring a document
(count?)" on 7/27/06.  In brief, I want a document's score to be a count
of term matches, much like SQL's count() function.
 
If you are able to modify Lucene so that I can specify a weight and
scorer from the calling code then, yes, I am definitely interested.
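
A count-based score of the kind described here could look roughly like the sketch below. "Count of term matches" can be read in two ways, so the example shows both as an assumption: the number of distinct query terms found in the document, and the total number of matching token occurrences. The class and method names are hypothetical and independent of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: score a document by counting query-term matches.
final class MatchCountSketch {

    // Number of distinct query terms that occur at least once in the document.
    static int distinctMatches(List<String> queryTerms, List<String> docTokens) {
        Set<String> docSet = new HashSet<>(docTokens);
        Set<String> counted = new HashSet<>();
        int count = 0;
        for (String term : queryTerms) {
            // counted.add returns false for a query term we already handled
            if (docSet.contains(term) && counted.add(term)) {
                count++;
            }
        }
        return count;
    }

    // Total number of document tokens that match any query term.
    static int totalOccurrences(List<String> queryTerms, List<String> docTokens) {
        Set<String> querySet = new HashSet<>(queryTerms);
        int count = 0;
        for (String token : docTokens) {
            if (querySet.contains(token)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("lucene", "scoring");
        List<String> doc = Arrays.asList("lucene", "lucene", "index");
        System.out.println(distinctMatches(query, doc));
        System.out.println(totalOccurrences(query, doc));
    }
}
```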
 
 
