Posted to java-user@lucene.apache.org by Jong Kim <jk...@sitescape.com> on 2006/11/21 16:36:55 UTC
Use case for term vector's token position/offset?
Hi,
When I look at org.apache.lucene.document.Field.TermVector,
it defines the following 5 options for how much detail
is stored with term vectors.
1. NO
2. WITH_OFFSETS
3. WITH_POSITIONS
4. WITH_POSITIONS_OFFSETS
5. YES
It isn't difficult to understand where the basic term vector
information (i.e., terms and their number of occurrences, option 5)
might be useful. I believe it can be used to implement features
like "concept search" or "more like this".
However, it isn't clear to me how the extra info (i.e.,
token position information and/or token offset information)
might be used. Can anyone help me understand what kind of
(advanced) search techniques people use this extra
information for, or even better, point me to real-world
examples?
Thanks
/Jong
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Use case for term vector's token position/offset?
Posted by Grant Ingersoll <gs...@apache.org>.
Hi Jong,
I think these are useful for things like highlighting (I think
contrib/highlighter can use them) and for other post-processing
algorithms, such as question answering or calculating co-occurrences
(e.g., find the 6 terms to the left and right of the term at
position 16). Perhaps you want to give higher scores to documents
where your terms occur in a certain part of the document (like the
beginning).
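To make the co-occurrence case concrete, here is a minimal sketch in plain Java. It does not use Lucene's API; it assumes you have already read a document's term vector into a map from each term to its token positions. The class name CooccurrenceWindow and the method neighbors are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: works on positions that were
// already extracted from a term vector stored with positions.
final class CooccurrenceWindow {

    /** Returns the terms within `radius` positions of `center`, in position order. */
    static List<String> neighbors(Map<String, int[]> termPositions, int center, int radius) {
        // A term vector maps term -> positions; invert it to position -> term.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                byPosition.put(p, e.getKey());
            }
        }
        // Collect every term in [center - radius, center + radius] except the center itself.
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, String> e
                : byPosition.subMap(center - radius, true, center + radius, true).entrySet()) {
            if (e.getKey() != center) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Positions for "the quick brown fox jumps over the lazy dog".
        Map<String, int[]> tv = new HashMap<>();
        tv.put("the", new int[] {0, 6});
        tv.put("quick", new int[] {1});
        tv.put("brown", new int[] {2});
        tv.put("fox", new int[] {3});
        tv.put("jumps", new int[] {4});
        tv.put("over", new int[] {5});
        tv.put("lazy", new int[] {7});
        tv.put("dog", new int[] {8});
        System.out.println(neighbors(tv, 4, 2)); // prints [brown, fox, over, the]
    }
}
```

Without stored positions you would have to re-analyze the original text to recover this ordering, which is why the positions option matters for this kind of post-processing.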
Really, they help in any application where you need to know the
relationships between the terms in a document, or between the
document and the original text.
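As a rough illustration of the "document and the original" relationship, the sketch below shows how character offsets can drive highlighting. It is plain Java, not the contrib/highlighter API; it assumes the start/end offsets of the matching terms have already been read from a term vector stored with offsets. OffsetHighlighter and highlight are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Lucene: wraps spans of the original text
// using {start, end} character offsets (end exclusive).
final class OffsetHighlighter {

    static String highlight(String original, int[][] offsets, String open, String close) {
        // Process spans left to right without modifying the caller's array.
        int[][] sorted = offsets.clone();
        Arrays.sort(sorted, Comparator.comparingInt(o -> o[0]));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] o : sorted) {
            if (o[0] < cursor) {
                continue; // skip a span that overlaps one already emitted
            }
            out.append(original, cursor, o[0]);                           // text before the hit
            out.append(open).append(original, o[0], o[1]).append(close);  // the hit itself
            cursor = o[1];
        }
        out.append(original, cursor, original.length()); // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        int[][] hits = { {16, 19}, {4, 9} }; // offsets of "fox" and "quick"
        System.out.println(highlight(text, hits, "<B>", "</B>"));
        // prints the <B>quick</B> brown <B>fox</B>
    }
}
```

Because the offsets point into the original text, no re-tokenizing is needed at display time; that is the main saving stored offsets buy you.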
HTH,
Grant
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org