Posted to java-user@lucene.apache.org by ma...@yahoo.co.uk on 2004/07/22 21:19:28 UTC

Re: Can I retrieve token offsets from Hits?

> I wonder if the information in termPositions or termVector can be used
> to restore token position from indices?

TermFreqVector gives you term frequencies (not positions). This can be useful for computing document
similarities.
TermPositions gives you the sequence number. E.g. in that sentence the word "sequence" is
token number 5 (not character position 5). This is used by PhraseQueries to determine proximity.
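
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class TokenPositionDemo, the Tok type and the letters-and-digits tokenizer are all made up for illustration, and a real Analyzer may split and number tokens differently. It only shows that the sequence number of a token and its character offsets are two separate pieces of information.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names are part of Lucene.
class TokenPositionDemo {

    /** A token together with its sequence number and its character offsets. */
    static final class Tok {
        final String text;
        final int position;    // 1-based sequence number of the token
        final int startOffset; // index of the token's first character
        final int endOffset;   // index one past the token's last character

        Tok(String text, int position, int startOffset, int endOffset) {
            this.text = text;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on anything that is not a letter or digit and lowercases each token. */
    static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        int position = 0;
        while (i < n) {
            // Skip separators between tokens.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume one run of letters and digits.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            position++;
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            tokens.add(new Tok(term, position, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "TermPositions gives you the sequence number";
        for (Tok t : tokenize(sentence)) {
            System.out.println(t.text + ": token " + t.position
                    + ", chars " + t.startOffset + "-" + t.endOffset);
        }
    }
}
```

With this toy tokenizer, "sequence" comes out as token 5 but starts at character 28, which is the number a highlighter would need.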

Character position is what is required to do highlighting, and this isn't stored anywhere currently.
The requirements for such a store would be indexed access by doc number and a compact means
of storing term/character position info. This could add considerable size to the index.

Previously we concluded that highlighting is typically only done on the first 10 or so records in a result set
anyway, and that re-analyzing the text shouldn't add too much overhead. If you want to limit how much of
an individual document's text gets tokenized, use highlighter.setMaxDocBytesToAnalyze().
If you find tokenizing slow, check whether you are using StandardAnalyzer. I have found that to be slow
(see http://marc.theaimsgroup.com/?l=lucene-dev&m=108080820315779&w=2 )
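
As a rough sketch of what re-analyzing at display time amounts to, the plain Java below re-tokenizes the stored text, recovers character offsets on the fly and wraps matching terms in <B> tags. This is not the highlighter package's code or API: the class ReanalyzeHighlightSketch, the highlight method and the maxCharsToAnalyze parameter are hypothetical names, and the parameter only imitates the idea behind setMaxDocBytesToAnalyze() by capping how much text is tokenized.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not the Lucene highlighter.
class ReanalyzeHighlightSketch {

    /**
     * Re-tokenizes at most maxCharsToAnalyze characters of text and wraps every
     * token found in queryTerms (lowercase) in B tags. Text beyond the cap is
     * copied through without being analyzed.
     */
    static String highlight(String text, Set<String> queryTerms, int maxCharsToAnalyze) {
        StringBuilder out = new StringBuilder();
        int limit = Math.min(text.length(), Math.max(0, maxCharsToAnalyze));
        int copiedUpTo = 0; // everything before this index is already in out
        int i = 0;
        while (i < limit) {
            // Skip separators between tokens.
            while (i < limit && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i >= limit) {
                break;
            }
            int start = i;
            while (i < limit && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            // A token cut in half by the cap is not a real token, so ignore it.
            boolean cutByCap = i == limit && limit < text.length()
                    && Character.isLetterOrDigit(text.charAt(limit));
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (!cutByCap && queryTerms.contains(term)) {
                out.append(text, copiedUpTo, start);
                out.append("<B>").append(text, start, i).append("</B>");
                copiedUpTo = i;
            }
        }
        out.append(text, copiedUpTo, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("token"));
        String stored = "Lucene stores token positions, not character offsets.";
        System.out.println(highlight(stored, terms, 1000));
        System.out.println(highlight(stored, terms, 10));
    }
}
```

The cost is one tokenizing pass per displayed document, so calling something like this only for the 10 or so hits actually shown keeps the overhead bounded, and the cap bounds it further for very large documents.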

Cheers
Mark




 
