Posted to java-user@lucene.apache.org by Jong Kim <jk...@sitescape.com> on 2006/11/21 16:36:55 UTC

Use case for term vector's token position/offset?

Hi,
 
When I look at org.apache.lucene.document.Field.TermVector,
it defines the following five options for how much detail
is stored with term vectors:

1. NO
2. WITH_OFFSETS
3. WITH_POSITIONS
4. WITH_POSITIONS_OFFSETS
5. YES

It isn't difficult to understand where the basic term vector
information (i.e., terms and their number of occurrences, option 5)
might be useful. I believe it can be used to implement features
like "concept search" or "more like this".
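
To make the frequency-only case concrete, here is a minimal sketch of
a "more like this" style comparison. It is plain Java, not Lucene API:
the class TermFreqSimilarity and its methods are hypothetical names,
and a simple term-to-count map stands in for what a term vector
stored with option YES would hold.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a frequency-only term vector:
// a map from term to its number of occurrences in one document.
class TermFreqSimilarity {

    // Count occurrences of each whitespace-separated, lower-cased token.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Cosine similarity between two term-frequency maps.
    // Returns 0.0 when either map is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("term vectors store term counts");
        Map<String, Integer> d2 = termFrequencies("term vectors store positions");
        System.out.println(cosine(d1, d2));
    }
}
```

Note that only terms and counts are needed here; neither positions
nor offsets play any role in this kind of comparison.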

However, it isn't clear to me how the extra information (i.e.,
token positions and/or token offsets) might be used. Can anyone
help me understand what kinds of (advanced) search techniques
people use this extra information for or, even better, point me
to real-world examples?

Thanks
/Jong


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Use case for term vector's token position/offset?

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Jong,

I think these are useful for things like highlighting (I think
contrib/highlighter can use them) and for other post-processing
algorithms such as question answering or calculating co-occurrences
(find the 6 terms to the left and right of the term at position 16).
Perhaps you want to give higher scores to documents where your terms
occur in a certain part of the document (like the beginning).
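
As a rough sketch of the co-occurrence idea, assuming you have already
pulled each term's token positions out of the term vector into a plain
map (the class CoOccurrenceWindow and its methods are hypothetical
names, not Lucene API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: works on term -> token positions, which is the
// kind of data a term vector stored WITH_POSITIONS would provide.
class CoOccurrenceWindow {

    // Invert term -> positions into position -> term, ordered by position.
    static TreeMap<Integer, String> byPosition(Map<String, int[]> positions) {
        TreeMap<Integer, String> ordered = new TreeMap<>();
        for (Map.Entry<String, int[]> e : positions.entrySet()) {
            for (int pos : e.getValue()) {
                ordered.put(pos, e.getKey());
            }
        }
        return ordered;
    }

    // Terms within `window` positions to the left and right of `center`,
    // in document order, excluding the term at `center` itself.
    static List<String> neighbors(Map<String, int[]> positions, int center, int window) {
        List<String> result = new ArrayList<>();
        TreeMap<Integer, String> ordered = byPosition(positions);
        for (Map.Entry<Integer, String> e
                : ordered.subMap(center - window, true, center + window, true).entrySet()) {
            if (e.getKey() != center) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Positions for "the quick brown fox jumps over the lazy dog".
        Map<String, int[]> positions = new HashMap<>();
        positions.put("the", new int[] {0, 6});
        positions.put("quick", new int[] {1});
        positions.put("brown", new int[] {2});
        positions.put("fox", new int[] {3});
        positions.put("jumps", new int[] {4});
        positions.put("over", new int[] {5});
        positions.put("lazy", new int[] {7});
        positions.put("dog", new int[] {8});
        System.out.println(neighbors(positions, 3, 2));
    }
}
```

The same position data would also let you check how early in the
document a term first occurs, if you wanted to favor matches near the
beginning.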

Really, they help in any application where you need to know the
relationships between the terms in a document, or between the
indexed document and the original text.
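
For the link back to the original text, offsets are the useful part.
Here is a minimal highlighting sketch, assuming offsets are character
positions into the original string with an exclusive end. The class
OffsetHighlighter is a hypothetical name and this is not how
contrib/highlighter itself is implemented:

```java
import java.util.Arrays;

// Hypothetical helper: wraps spans of the original text in markers,
// given character offsets such as a term vector stored WITH_OFFSETS
// would record for each occurrence of a term.
class OffsetHighlighter {

    // Each span is {start, end}, with end exclusive.
    // Spans that overlap an earlier span are skipped.
    static String highlight(String original, int[][] spans, String open, String close) {
        int[][] sorted = spans.clone();
        Arrays.sort(sorted, (x, y) -> Integer.compare(x[0], y[0]));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : sorted) {
            if (span[0] < cursor) {
                continue;
            }
            out.append(original, cursor, span[0]);
            out.append(open).append(original, span[0], span[1]).append(close);
            cursor = span[1];
        }
        out.append(original, cursor, original.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a term vector has term offsets";
        int[][] termSpans = { {2, 6}, {18, 22} };
        System.out.println(highlight(text, termSpans, "[", "]"));
    }
}
```

Because the offsets point into the original text, no re-analysis of
the document is needed to find where each match sits.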

HTH,
Grant

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org



