You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Shahan Khatchadourian <sh...@cs.toronto.edu> on 2007/07/13 18:43:59 UTC
Token offset values for custom Tokenizer
Hi,
I am storing custom values in the Tokens provided by a Tokenizer but
when retrieving them from the index the values don't match. I've looked
in the LIA book but it's not current since it mentioned term vectors
aren't stored. I'm using Lucene Nightly 146 but the same thing has
happened with older versions. Looking at the internals, DocumentWriter
seems to keep track of the end offset that was placed into the index and
modifies the token values (with +1) but I'm not sure whether I should be
concerned with it.
No existing analyzers are used when adding the document so all the
offsets are generated manually.
Any suggestions of how the token offsets should be stored?
Is this valid?
Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11
Thanks,
Shahan
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org