Posted to java-user@lucene.apache.org by Venkatraju <ve...@gmail.com> on 2004/12/01 17:54:17 UTC

Proximity in ranking, summary generation

Hi,

This is actually 2 somewhat related questions:
- In regular multi term queries, does the default ranking function of
Lucene take into account proximity of the search terms? As far as I
know, proximity data is used only in phrase searches. Is this correct?
If so, does someone have pointers/sample implementation of how
proximity data can be used to supplement tfidf in ranking documents?

- Is there a way to get the term offset or byte offset of the best
match(es) in the document? I am looking to use this information for
summary generation/highlighting.

Thanks,
Venkat

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Proximity in ranking, summary generation

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 1, 2004, at 11:54 AM, Venkatraju wrote:
> This is actually 2 somewhat related questions:
> - In regular multi term queries, does the default ranking function of
> Lucene take into account proximity of the search terms? As far as I
> know, proximity data is used only in phrase searches. Is this correct?

Correct.

> If so, does someone have pointers/sample implementation of how
> proximity data can be used to supplement tfidf in ranking documents?

Have a look at PhraseQuery and how it does its ranking.
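
As a rough illustration of that idea (not code from Lucene itself), the
sketch below turns the smallest gap between two terms' positions into a
factor and uses it to scale a tf-idf score.  The 1 / (distance + 1) shape
is what DefaultSimilarity.sloppyFreq uses for sloppy phrase matches; the
class name, the combine() method and the weight parameter are
hypothetical, and here "distance" is simply the raw difference between
positions.

```java
/** Illustrative only: a proximity factor computed from term positions. */
final class ProximitySketch {

    /**
     * Smallest absolute distance between any position of term A and any
     * position of term B. Both arrays must be sorted ascending and non-empty.
     */
    static int minDistance(int[] positionsA, int[] positionsB) {
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < positionsA.length && j < positionsB.length) {
            best = Math.min(best, Math.abs(positionsA[i] - positionsB[j]));
            // Advance whichever list is behind; that is the only move
            // that can shrink the gap.
            if (positionsA[i] < positionsB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return best;
    }

    /** Same shape as DefaultSimilarity.sloppyFreq: closer terms count more. */
    static float proximityFactor(int distance) {
        return 1.0f / (distance + 1);
    }

    /**
     * Hypothetical combination (an assumption, not Lucene's scoring):
     * boost the tf-idf score by at most a factor of (1 + weight).
     */
    static float combine(float tfidfScore, int[] posA, int[] posB, float weight) {
        if (posA.length == 0 || posB.length == 0) {
            return tfidfScore; // one term is missing, nothing to boost
        }
        return tfidfScore * (1.0f + weight * proximityFactor(minDistance(posA, posB)));
    }

    public static void main(String[] args) {
        int[] a = {2, 10, 40};
        int[] b = {13, 41};
        System.out.println(combine(2.0f, a, b, 0.5f));
    }
}
```

Within Lucene itself, a comparable effect can usually be had without custom
scoring by adding a PhraseQuery with a non-zero slop as an optional clause
next to the plain term clauses.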

> - Is there a way to get the term offset or byte offset of the best
> match(es) in the document? I am looking to use this information for
> summary generation/highlighting.

Term positions (built up from position increments) are stored in the index.
However, for highlighting you need start/end offsets into the original
text (character offsets rather than byte offsets).  The CVS version of
Lucene includes an option to capture these offsets in the index as well.
Before this CVS version, highlighting required re-analyzing the text to be
highlighted.  Token objects carry the start and end offsets that the
Highlighter package uses.

Have a look at the Highlighter code in the Lucene Sandbox.
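
To make the re-analysis approach concrete, here is a minimal plain-Java
sketch.  It is not the Highlighter API: the Tok class stands in for
Lucene's Token, the regex tokenizer stands in for an Analyzer, and all
names are hypothetical.  It only shows why start/end character offsets
are what a highlighter needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: highlighting by re-analyzing text and using offsets. */
final class OffsetHighlightSketch {

    /** Stand-in for a token: normalized text plus character offsets. */
    static final class Tok {
        final String text;
        final int start; // offset of the first character in the source text
        final int end;   // offset one past the last character

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Stand-in for an Analyzer: lowercased word tokens with offsets. */
    static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(new Tok(m.group().toLowerCase(Locale.ROOT), m.start(), m.end()));
        }
        return tokens;
    }

    /** Wraps every token whose normalized text is a query term in B tags. */
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end offset of the last span copied to the output
        for (Tok t : tokenize(text)) {
            if (queryTerms.contains(t.text)) {
                out.append(text, last, t.start)  // untouched text before the hit
                   .append("<B>")
                   .append(text, t.start, t.end) // original casing is preserved
                   .append("</B>");
                last = t.end;
            }
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

Calling highlight("Lucene ranks by tf-idf", terms) with terms containing
"lucene" and "idf" yields "<B>Lucene</B> ranks by tf-<B>idf</B>".  With
offsets stored in the index, the tokenize() step can be skipped for
stored documents.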

	Erik

