Posted to java-user@lucene.apache.org by "Armbrust, Daniel C." <Ar...@mayo.edu> on 2003/04/24 19:09:12 UTC

Type information on Tokens?

If I wanted to build an index where all of the words were tagged with part-of-speech information, it seems that the type field of the Token would be the place to put this.

But, as I understand it, Lucene does not keep track of the type fields that are assigned during tokenizing, and therefore doesn't use them while searching.

How could I go about keeping track of part of speech information in my index?  

So far, I can think of only two ways to accomplish this. One is to build it into my tokens, i.e. my tokens would look something like "<noun>patient".  I'm afraid this approach may have some pitfalls that I haven't identified yet, since I haven't tried it out.

Alternatively, I could make Lucene store the type field in its index.  Would I be correct in assuming this would not be a trivial change?  I have looked over the source a bit, but I don't yet have a full grasp of how hits are found and scored.

Thanks, 

Dan


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Type information on Tokens?

Posted by Stephane Vaucher <va...@cirano.qc.ca>.

I've already posted a message concerning meta-info, and I'm wondering 
whether there might be interest in having (in the future) a standard way 
of holding meta-info, or of encoding it in the token. I have to store 
weights given to specific tokens (and perhaps even change the scoring to 
use them). I won't go into details here because of my previous post.

sv


On Tue, 29 Apr 2003, Doug Cutting wrote:

> Armbrust, Daniel C. wrote:
> > So far, I can only think of two ways to accomplish this, 1, is to 
> > build it into my tokens, i.e. my tokens would look something like 
> > "<noun>patient".  I'm afraid there may be some pit-falls with this 
> > approach that I haven't identified yet, however, since I haven't tried 
> > it out.
> 
> This should actually work fine, so long as you use the same analyzer on 
> your queries.  Another option would be to put each part of speech in a 
> different Lucene field.  I think the token-prefix option would be 
> preferable, since you probably don't need separate boost and 
> normalization factors for each part of speech.
> 
> Doug

Re: Type information on Tokens?

Posted by Doug Cutting <cu...@lucene.com>.

Armbrust, Daniel C. wrote:
> So far, I can only think of two ways to accomplish this, 1, is to build it into my tokens, i.e. my tokens would look something like "<noun>patient".  I'm afraid there may be some pit-falls with this approach that I haven't identified yet, however, since I haven't tried it out.

This should actually work fine, so long as you use the same analyzer on 
your queries.  Another option would be to put each part of speech in a 
different Lucene field.  I think the token-prefix option would be 
preferable, since you probably don't need separate boost and 
normalization factors for each part of speech.

Doug
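
The token-prefix approach described above can be sketched in plain Java.
This is a minimal illustration, not Lucene code: the class PosPrefixIndex,
its methods, and the "word/pos" input format are all hypothetical, and the
in-memory map only stands in for a real index. The point it shows is that
indexing and querying go through the same analysis step, so "<noun>patient"
and "<adj>patient" become two distinct terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory index. NOT Lucene API; every name here is hypothetical. */
final class PosPrefixIndex {
    // term text (with part-of-speech prefix) -> ids of documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Folds the part-of-speech tag into the term text, e.g. "<noun>patient". */
    static String tag(String pos, String word) {
        return "<" + pos.toLowerCase(Locale.ROOT) + ">" + word.toLowerCase(Locale.ROOT);
    }

    /**
     * Stand-in for the analyzer shared by indexing and querying.
     * Assumed input: whitespace-separated "word/pos" pairs.
     */
    static List<String> analyze(String taggedText) {
        List<String> terms = new ArrayList<>();
        for (String pair : taggedText.trim().split("\\s+")) {
            int slash = pair.lastIndexOf('/');
            if (slash <= 0 || slash == pair.length() - 1) {
                continue; // skip input without both a word and a tag
            }
            terms.add(tag(pair.substring(slash + 1), pair.substring(0, slash)));
        }
        return terms;
    }

    /** Indexes one document under the given id. */
    void add(int docId, String taggedText) {
        for (String term : analyze(taggedText)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of documents that contain every term of the query. */
    Set<Integer> search(String taggedQuery) {
        Set<Integer> result = null;
        for (String term : analyze(taggedQuery)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        PosPrefixIndex index = new PosPrefixIndex();
        index.add(1, "the/det patient/noun waited/verb");
        index.add(2, "a/det patient/adj nurse/noun");
        System.out.println(index.search("patient/noun")); // prints [1]
        System.out.println(index.search("patient/adj"));  // prints [2]
    }
}
```

If the query side skipped the tagging step and searched for plain
"patient", it would match nothing in this sketch, which is why the same
analyzer has to be used on both sides.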

Re: Type information on Tokens?

Posted by Erik Hatcher <li...@ehatchersolutions.com>.

Here's another vote for keeping the type information in the index.  I 
suspect where this would break down is if a field has two tokens that 
are textually the same, but are considered different types - so maybe 
it's not a good idea technically.  I'd love to hear more about the 
pros/cons to keeping this information, as it seems like something quite 
useful during searching.

	Erik
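
One way to read the concern about textually identical tokens with
different types is sketched below in plain Java. This is only an
illustration under that assumed reading, not a description of how Lucene
stores anything; the class TypeCollisionDemo and its methods are
hypothetical. If a dictionary keeps one type per distinct term text, the
second type overwrites the first, whereas folding the type into the key
keeps the two apart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical demo; each token is a {text, type} pair. */
final class TypeCollisionDemo {

    /** One type slot per distinct term text: a later token overwrites the earlier type. */
    static Map<String, String> typePerTerm(String[][] tokens) {
        Map<String, String> dict = new LinkedHashMap<>();
        for (String[] token : tokens) {
            dict.put(token[0], token[1]);
        }
        return dict;
    }

    /** Type folded into the key: the same text with different types stays separate. */
    static Map<String, Integer> countPerTypedTerm(String[][] tokens) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String[] token : tokens) {
            dict.merge("<" + token[1] + ">" + token[0], 1, Integer::sum);
        }
        return dict;
    }

    public static void main(String[] args) {
        String[][] tokens = { {"patient", "noun"}, {"patient", "adj"} };
        System.out.println(typePerTerm(tokens));       // prints {patient=adj}
        System.out.println(countPerTypedTerm(tokens)); // prints {<noun>patient=1, <adj>patient=1}
    }
}
```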


On Thursday, April 24, 2003, at 01:09  PM, Armbrust, Daniel C. wrote:
> If I wanted to build an index where all of the words were tagged with 
> part of speech information, it seems that the type field of the Token 
> would be the place to put this.
>
> But, as I understand it, lucene does not keep track of the type fields 
> that are assigned during tokenizing, and therefore doesn't use them 
> while searching.
>
> How could I go about keeping track of part of speech information in my 
> index?
>
> So far, I can only think of two ways to accomplish this, 1, is to 
> build it into my tokens, i.e. my tokens would look something like 
> "<noun>patient".  I'm afraid there may be some pit-falls with this 
> approach that I haven't identified yet, however, since I haven't tried 
> it out.
>
> Or, I could make lucene use the type field in its index.  But, would I 
> be correct in assuming this would not be a trivial change?  I have 
> looked over the source a bit, but I don't yet have a full grasp of how 
> hits are found and scored.
>
> Thanks,
>
> Dan