Posted to user@nutch.apache.org by chris sleeman <ch...@gmail.com> on 2007/12/18 09:01:06 UTC

Nutch score based on document recency

Hi,

I am interested in writing a plugin in which the recency of a document
is also a factor in relevance scoring. I don't want to sort by date;
rather, I would like to boost the score of the pages that were indexed
most recently.

I have tried adding range queries using a custom query filter, which creates
queries of the form:

<query> AND +(date:[20071215 TO 20071218]^3.0 date:[20071201 TO 20071214]^2.0
date:[20071103 TO 20071201]^1.5 date:[00000000 TO 20071102])^1.0


But I am not sure whether this is a good way or whether including date range
clauses would have an adverse impact on performance.
Am I missing something? Is there a better way of doing this? Any help would
be much appreciated.

Regards,
Chris
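
For reference, a clause of the form shown in the message above could be
generated along these lines. This is only a minimal sketch of the string
form: a real Nutch query filter would build Lucene query objects instead,
and the function name, the tier list, and the "date" field default are
assumptions made for illustration. It also assumes two small changes of
intent: tiers do not overlap (the third range ends at 20071130, not
20071201), and the ^1.0 boost sits inside the parenthesis.

```python
from datetime import date, timedelta


def recency_clause(today, tiers, field="date"):
    """Build a required, boosted date-range clause in Lucene query syntax.

    tiers is a list of (max_age_days, boost) pairs, youngest tier first.
    Everything older than the last tier falls into a final range with
    boost 1.0 (hypothetical helper, not part of Nutch or Lucene).
    """
    fmt = "%Y%m%d"
    clauses = []
    upper = today
    for max_age_days, boost in tiers:
        lower = today - timedelta(days=max_age_days)
        clauses.append(
            f"{field}:[{lower.strftime(fmt)} TO {upper.strftime(fmt)}]^{boost}"
        )
        # The next (older) tier ends the day before this one starts,
        # so no document date matches two tiers.
        upper = lower - timedelta(days=1)
    clauses.append(f"{field}:[00000000 TO {upper.strftime(fmt)}]^1.0")
    return "+(" + " ".join(clauses) + ")"


if __name__ == "__main__":
    # Tiers chosen to mirror the ranges in the message: 3, 17 and 45 days back.
    print(recency_clause(date(2007, 12, 18), [(3, 3.0), (17, 2.0), (45, 1.5)]))
```

The result would be joined to the user's query with AND, as in the message.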

Re: Nutch score based on document recency

Posted by chris sleeman <ch...@gmail.com>.
That's a great suggestion, Ken. Thanks a lot.

--Chris

On Dec 19, 2007 2:53 AM, Ken Krugler <kk...@transpac.com> wrote:

> Another approach, which I've tried out using raw Lucene but not via
> Nutch, is to implement Andrzej's suggestion: add a new "dateBoost"
> field, and set the content of that field to "1 1 1 1 ...", where
> the number of "1" tokens equals the time elapsed between some
> arbitrary earliest date and the date of the page.
>
> For example, I set it to the number of weeks since 1997, which was my
> earliest known document date.
>
> Then, at query time, include "... AND dateBoost:1". Newer documents
> contain more "1" terms, so they get a higher score.
>
> You can fool around with specifying a run-time boost on the dateBoost
> field to tune the importance of the document's last modified time
> relative to other factors (static doc score, other query terms).
>
> -- Ken

Re: Nutch score based on document recency

Posted by Ken Krugler <kk...@transpac.com>.
>It seems an OK way of doing it to me.
>
>I don't know how expensive those range queries are, but if it turns 
>out they do eat a lot of performance and/or you want more control 
>over exactly how scoring is done, AFAIK you'll have to get into the 
>guts of Lucene and define a custom scorer as documented here:
>
>   http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/package-summary.html#scoring .
>
>However this is expert territory so I wouldn't go there lightly.

Another approach, which I've tried out using raw Lucene but not via
Nutch, is to implement Andrzej's suggestion: add a new "dateBoost"
field, and set the content of that field to "1 1 1 1 ...", where
the number of "1" tokens equals the time elapsed between some
arbitrary earliest date and the date of the page.

For example, I set it to the number of weeks since 1997, which was my
earliest known document date.

Then, at query time, include "... AND dateBoost:1". Newer documents
contain more "1" terms, so they get a higher score.

You can fool around with specifying a run-time boost on the dateBoost 
field to tune the importance of the document's last modified time 
relative to other factors (static doc score, other query terms).

-- Ken
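
The field content described above could be produced with something like the
following. It is a minimal sketch, not Ken's actual code: the helper names
and the exact epoch (1 January 1997) are assumptions, as is the choice to
always emit at least one token. The second function only mirrors the
square-root term-frequency factor of Lucene's classic default similarity, to
show why more "1" tokens would raise the score.

```python
import math
from datetime import date

# Assumption: weeks are counted from 1 January 1997; the thread only says "1997".
EPOCH = date(1997, 1, 1)


def date_boost_value(page_date, epoch=EPOCH):
    """Return the text of the dateBoost field: one "1" token per week since epoch."""
    weeks = (page_date - epoch).days // 7
    # Assumption: keep at least one token so that a required dateBoost:1
    # clause still matches the oldest documents.
    return " ".join(["1"] * max(weeks, 1))


def classic_tf(freq):
    """Term-frequency factor of Lucene's classic default similarity: sqrt(freq)."""
    return math.sqrt(freq)


if __name__ == "__main__":
    old = date_boost_value(date(2002, 6, 1))
    new = date_boost_value(date(2007, 12, 18))
    ratio = classic_tf(len(new.split())) / classic_tf(len(old.split()))
    print(f"{len(old.split())} vs {len(new.split())} tokens, tf ratio {ratio:.2f}")
```

As far as I know, the same similarity also normalizes by field length, which
would work against this effect unless norms are omitted for the dateBoost
field, so that is worth verifying before relying on the trick.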


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Nutch score based on document recency

Posted by Jasper Kamperman <ja...@openwaternet.com>.
It seems an OK way of doing it to me.

I don't know how expensive those range queries are, but if it turns  
out they do eat a lot of performance and/or you want more control  
over exactly how scoring is done, AFAIK you'll have to get into the  
guts of Lucene and define a custom scorer as documented here:

   http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/package-summary.html#scoring .

However this is expert territory so I wouldn't go there lightly.

Jasper
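
To make the "more control over scoring" option concrete: whatever form a
custom scorer takes inside Lucene, its effect would be to combine the text
relevance score with some function of document age. The sketch below shows
one such function in isolation. It does not use the Lucene API; the function
names, the exponential decay, the 30-day half-life, and the 0.1 floor are
all assumptions chosen for illustration.

```python
from datetime import date


def recency_weight(doc_date, today, half_life_days=30.0, floor=0.1):
    """Weight in (floor, 1.0]: 1.0 for a document dated today, halving
    toward the floor every half_life_days (hypothetical decay curve)."""
    age_days = max((today - doc_date).days, 0)
    decay = 0.5 ** (age_days / half_life_days)
    return floor + (1.0 - floor) * decay


def combined_score(text_score, doc_date, today, **kwargs):
    """Multiply a text relevance score by the recency weight."""
    return text_score * recency_weight(doc_date, today, **kwargs)


if __name__ == "__main__":
    today = date(2007, 12, 18)
    for d in (date(2007, 12, 18), date(2007, 11, 18), date(2007, 1, 1)):
        print(d, round(combined_score(2.0, d, today), 3))
```

The floor keeps old but highly relevant pages from dropping out entirely,
which matches the original goal of boosting recent pages rather than sorting
by date.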
