Posted to java-user@lucene.apache.org by Christoph Kiehl <ki...@subshell.com> on 2005/01/07 19:00:43 UTC

Use a date field for ranking

Hi,

we are currently implementing a search engine for a news site. Our goal 
is to have a search result that uses the publish date of the documents 
to boost the score of the documents.

I took a look at Nutch to see how it implements PageRank, and it seems
like this is done at index time by setting a document boost.

This approach won't work for us because ranking by date is optional. We 
have to use something that boosts the scores at _search_ time.

My idea is to implement it like the sort functionality built into Lucene
and use the FieldCache.

Does anyone have a better idea, or see an important downside to this approach?

Regards,
Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Use a date field for ranking

Posted by Chris Hostetter <ho...@fucit.org>.
: > : have to use something that boosts the scores at _search_ time.

: Yes, I know I can boost Query objects, but that is not the same as
: boosting the document score by a factor. By boosting query objects I
: _add_ values to the score. Let me show you an example:

well, sure it is ... you have to have some way (at search time) to
indicate that you want documents that meet certain criteria to
have their scores affected in a certain way -- that's exactly what a Query
is.  there may not be an existing Query subclass that meets your needs
exactly, but if you want the scores of documents to be influenced
conditionally at search time, a Query object is the way to indicate that.

: If I had used a boost of 3.0 per document and left the date part of the
: query out I would have:
:
: Query 1: 0.3
: Query 2: 0.03
:
: Which maintains the original proportion. Now if I want to specify a
: function (like 1/x) that calculates the boost factor of a specific
: publish date I can't emulate this by using Query boosts because the
: query boost must be adjusted to the first part of the query to achieve
: an equal distribution for any query.

Based on a recent thread about scores, I *think* you are making an
incorrect assumption about the relative scores of documents...

http://mail-archives.apache.org/eyebrowse/SearchList?listName=lucene-user%40jakarta.apache.org&searchText=%22A+question+about+scoring+function+in+Lucene%22&defaultField=subject

...but I'll be totally honest, I'm not sure exactly what your point is.
you're talking about comparing the final scores of two different queries,
but I'm not sure if you mean the score of a specific document against two
different queries, or the score of two documents against a single query in
which one document is more relevant to the term you search for.

: date but don't contain the first part of the query. So we might use a
: query like this:
:
: (a word) AND (date:20050108^3 OR date:20050107^1)
:
: But now I have to specify _all_ possible dates in the date part to reach
: all documents the index contains. This smells ;) Because it's all only
: an emulation of the real strategy.

well, this is why i proposed finding a feasible "granularity" and
"age" that you were comfortable with to use in picking your boosts.  If
you must have at least single day granularity, and you must provide a
gradually decreasing boost for every day back to the beginning of time,
then you are correct: my suggestion was not practical. but if you are
willing to go with "week" based granularity, and only boost items from the
last 6 weeks, then you can do something like...

(a word) AND (    date:[20050108 TO 20050114]^7
               OR date:[20050101 TO 20050107]^6
               OR date:[20041225 TO 20041231]^5
               OR date:[20041218 TO 20041224]^4
               OR date:[20041211 TO 20041217]^3
               OR date:[20041204 TO 20041210]^2
               OR date:[00000000 TO 20041203]^1 )

...except that i loathe doing DateRange queries (see my first post in the
archives for why i think they are a silly/inefficient way of doing things)
which is why i suggested just using special keywords to denote which week
an item was published

: > 3) I'm sure there is a very cool and efficient way to do this using a
: > custom Similarity implementation (which somehow causes the default score
: > to be divided by the age of the document) but i've never actually played
: > with the Similarity class, so i won't say for certain it can be done that
: > way (hopefully someone else can chime in)
:
: AFAIK, Similarity can only be used on term level. But as outlined above
: I need a boost factor on document level.

You're right ... I was thinking of the Scorer class ... there was a recent
discussion about creating your own Scorer to return an arbitrary value
as the Score of a (new class of) Query.  I don't know how much work
is involved, but take a look at this message...

http://mail-archives.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=2055565

...maybe it would be easy to crank out "RecentDocsScorer" and
"RecentQuery" classes which can do what you want (by returning the date
difference from a field and "now" as a score of the query)
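
As a rough illustration of that idea, here is a plain-Java sketch of what
such a scorer might compute. It does not extend Lucene's Scorer or Query
classes. The per-document date array stands in for what a FieldCache-style
lookup might supply, and the 1/(1 + age) decay, the multiplication with the
text score, and all names are assumptions made for the example.

```java
import java.time.LocalDate;

// Plain-Java sketch only; this is NOT a Lucene Scorer subclass.
// The class name, the array layout, and the decay function are assumptions.
final class RecentDocsScorer {
    // Publish date of each document as an epoch day, indexed by document number.
    private final long[] publishEpochDay;
    private final long todayEpochDay;

    RecentDocsScorer(long[] publishEpochDay, LocalDate today) {
        this.publishEpochDay = publishEpochDay.clone();
        this.todayEpochDay = today.toEpochDay();
    }

    // Recency score in (0, 1]: 1.0 for a document published today,
    // decaying as 1 / (1 + ageInDays). Future dates are treated as age 0.
    float score(int doc) {
        long ageDays = Math.max(0, todayEpochDay - publishEpochDay[doc]);
        return 1.0f / (1.0f + ageDays);
    }

    // Scales a text relevance score by the recency score, so the relative
    // text scores of two documents with the same date are preserved.
    float combined(int doc, float textScore) {
        return textScore * score(doc);
    }

    public static void main(String[] args) {
        long[] dates = {
            LocalDate.of(2005, 1, 8).toEpochDay(),
            LocalDate.of(2005, 1, 7).toEpochDay(),
        };
        RecentDocsScorer scorer = new RecentDocsScorer(dates, LocalDate.of(2005, 1, 8));
        System.out.println("doc 0 recency: " + scorer.score(0));
        System.out.println("doc 1 recency: " + scorer.score(1));
    }
}
```

Whether the recency value is multiplied into the text score or used in some
other way is a design choice; multiplication is used here only because it
keeps the proportions between text scores intact.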

-Hoss




Re: Use a date field for ranking

Posted by Christoph Kiehl <ck...@sulu3000.de>.
Chris Hostetter wrote:

> : we are currently implementing a search engine for a news site. Our goal
> : is to have a search result that uses the publish date of the documents
> : to boost the score of the documents.
> 
> : have to use something that boosts the scores at _search_ time.
>
> 1) There is a way to boost individual Query objects (which you may then
> compose into a Tree of BooleanQueries) see Query.setBoost(float)

Yes, I know I can boost Query objects, but that is not the same as 
boosting the document score by a factor. By boosting query objects I 
_add_ values to the score. Let me show you an example:

I may use queries like this:

Query 1:
(a word that gets a score of 0.1) OR (date:20050108^3 OR date:20050107^1)

Query 2:
(a word that gets a score of 0.01) OR (date:20050108^3 OR date:20050107^1)

The date part of the clause gets a constant score of 0.3. So the total 
score of the queries will be:

Query 1: 0.4
Query 2: 0.31

If I had used a boost of 3.0 per document and left the date part of the 
query out I would have:

Query 1: 0.3
Query 2: 0.03

Which maintains the original proportion. Now if I want to specify a 
function (like 1/x) that calculates the boost factor of a specific 
publish date I can't emulate this by using Query boosts because the 
query boost must be adjusted to the first part of the query to achieve 
an equal distribution for any query.
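
The arithmetic above can be checked with a small sketch. It uses no Lucene
classes and only reproduces the numbers from the example: the text scores
0.1 and 0.01, the constant date-clause score of 0.3, and the per-document
boost of 3.0. The class and method names are made up for illustration.

```java
// Illustrative arithmetic only; no Lucene classes are involved.
// Class and method names are hypothetical.
final class BoostArithmetic {

    // OR-ing in a boosted date clause: its score is added to the text score.
    static double additive(double textScore, double dateClauseScore) {
        return textScore + dateClauseScore;
    }

    // A per-document boost: the text score is scaled by a factor.
    static double multiplicative(double textScore, double boostFactor) {
        return textScore * boostFactor;
    }

    public static void main(String[] args) {
        double strong = 0.1;  // text score of the word in Query 1
        double weak = 0.01;   // text score of the word in Query 2

        double addStrong = additive(strong, 0.3);
        double addWeak = additive(weak, 0.3);
        double mulStrong = multiplicative(strong, 3.0);
        double mulWeak = multiplicative(weak, 3.0);

        // The additive form shrinks the gap between the two scores;
        // the multiplicative form keeps the original 10:1 proportion.
        System.out.printf("additive:       %.2f vs %.2f (ratio %.2f)%n",
                addStrong, addWeak, addStrong / addWeak);
        System.out.printf("multiplicative: %.2f vs %.2f (ratio %.2f)%n",
                mulStrong, mulWeak, mulStrong / mulWeak);
    }
}
```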

I'm sure there is a mathematical term which describes exactly this 
problem - but I'm no mathematician ;) So I hope you understand my issues.

Additionally, the construct above also finds documents that have the right
date but don't contain the first part of the query. So we might use a
query like this:

(a word) AND (date:20050108^3 OR date:20050107^1)

But now I have to specify _all_ possible dates in the date part to reach 
all documents the index contains. This smells ;) Because it's all only 
an emulation of the real strategy.


> 2) if you are planning to rebuild your index on a regular basis (ie:
> nightly) then you can easily apply boosts to your documents when you index
> them.

Unfortunately this is not an option because the index is updated incrementally.

> 3) I'm sure there is a very cool and efficient way to do this using a
> custom Similarity implementation (which somehow causes the default score
> to be divided by the age of the document) but i've never actually played
> with the Similarity class, so i won't say for certain it can be done that
> way (hopefully someone else can chime in)

AFAIK, Similarity can only be used on term level. But as outlined above 
I need a boost factor on document level.

Thanks for your input,
Christoph




Re: Use a date field for ranking

Posted by Chris Hostetter <ho...@fucit.org>.
: we are currently implementing a search engine for a news site. Our goal
: is to have a search result that uses the publish date of the documents
: to boost the score of the documents.

: have to use something that boosts the scores at _search_ time.

1) There is a way to boost individual Query objects (which you may then
compose into a Tree of BooleanQueries) see Query.setBoost(float)

2) if you are planning to rebuild your index on a regular basis (ie:
nightly) then you can easily apply boosts to your documents when you index
them.

If you want to be able to do only incremental additions...

3) I'm sure there is a very cool and efficient way to do this using a
custom Similarity implementation (which somehow causes the default score
to be divided by the age of the document) but i've never actually played
with the Similarity class, so i won't say for certain it can be done that
way (hopefully someone else can chime in)

4) I can tell you what i came up with when i was doing a proof of concept
for this a while back...

In my case, I'm willing to accept that there is some finite granularity of
time at which "newer" documents are no longer very much more "fresh" than
"older" documents (ie: articles from the same week are equally "fresh" to
me). I also have a practical cutoff for how old things can get before they
are just plain old: 52 weeks.

With those numbers in mind, I can add a special field to each document
that indicates which week the article was published (ie: 2004w1, 2004w2,
2004w3, etc...).  At search time, my query can include a BooleanQuery of
52 clauses ORed together, each one containing the magic token for one of
the last 52 weeks prior to when the search was executed, with the boost
decreasing slightly for each week further back.
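
A minimal sketch of how the week tokens and the ORed clause could be
generated follows. It builds a query string with the JDK only and does not
call Lucene. The field name "pubweek", the use of ISO week numbering, and
the linear boost schedule (the newest week gets the largest integer boost)
are assumptions for the example; the same token function would have to be
used at index time so the tokens match.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds a query string, it does not call Lucene.
// Field name, ISO week numbering, and boost schedule are assumptions.
final class WeekTokenQuery {

    // Token such as "2004w53" for the ISO week that contains the date.
    static String weekToken(LocalDate date) {
        int year = date.get(IsoFields.WEEK_BASED_YEAR);
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        return year + "w" + week;
    }

    // ORs together one boosted clause per week, newest week first.
    // The newest week gets boost `weeks`, the oldest gets boost 1.
    static String recencyClause(String field, LocalDate searchDate, int weeks) {
        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < weeks; i++) {
            String token = weekToken(searchDate.minusWeeks(i));
            int boost = weeks - i;
            clauses.add(field + ":" + token + "^" + boost);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    public static void main(String[] args) {
        // Four weeks instead of 52 to keep the printed clause short.
        System.out.println(recencyClause("pubweek", LocalDate.of(2005, 1, 7), 4));
    }
}
```

The resulting clause would then be combined with the text part of the
query, for example with AND if only recent documents should match. Documents
older than the covered weeks get no recency clause match at all.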





-Hoss
