
Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2007/09/16 09:06:34 UTC

EdgeNGramTokenFilter, term position?

Should the EdgeNGramFilter use the same term position for the ngrams 
within a single token?

As is, the EdgeNGramTokenFilter increments the term position for each 
n-gram it emits.  In analysis.jsp, with the input "hello", I get:

term position 	1	2	3	4	5
term text 	h	he	hel	hell	hello
term type 	word	word	word	word	word
start,end 	0,1	0,2	0,3	0,4	0,5


I would expect something more like what is generated from SOLR-357:

term position 	1
term text 	hello
		hell
		hel
		he
		h
term type 	word
		prefix
		prefix
		prefix
		prefix
start,end 	0,5
		0,4
		0,3
		0,2
		0,1

This seems like it would affect slop queries, but I don't really 
understand them yet.
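
To make the slop concern concrete, here is a rough sketch in plain Python. It is not Lucene code: edge_ngrams and in_order_gap are hypothetical helpers, and the sketch assumes the query terms themselves are not n-grammed. It models the two position schemes above and counts how many positions end up between "hello" and "world", which is roughly the slop an in-order phrase query would need.

```python
def edge_ngrams(tokens, stacked):
    """Return (text, position) pairs for the front-edge n-grams of each token.

    stacked=False: every n-gram advances the position (the behaviour reported above).
    stacked=True:  all n-grams of one token share that token's position.
    """
    out = []
    position = 0
    for token in tokens:
        for n in range(1, len(token) + 1):
            # When stacking, only the first n-gram of a token advances the position.
            if not stacked or n == 1:
                position += 1
            out.append((token[:n], position))
    return out


def in_order_gap(grams, first, second):
    """Positions skipped between two terms; 0 means they are adjacent."""
    positions = dict(grams)  # terms are unique in this small example
    return positions[second] - positions[first] - 1


sequential = edge_ngrams(["hello", "world"], stacked=False)
stacked = edge_ngrams(["hello", "world"], stacked=True)

print(in_order_gap(sequential, "hello", "world"))  # 4
print(in_order_gap(stacked, "hello", "world"))     # 0
```

Under this simplified model, the sequential scheme pushes "world" four positions away from "hello", so an exact phrase match on "hello world" would fail without extra slop, while the stacked scheme keeps the two words adjacent.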

thanks
ryan

Re: EdgeNGramTokenFilter, term position?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/16/07, Ryan McKinley <ry...@gmail.com> wrote:
> Should the EdgeNGramFilter use the same term position for the ngrams
> within a single token?

It feels like that is the right approach.
I don't see value in having them sequential, and I can think of uses
for having them overlap.

-Yonik

Re: EdgeNGramTokenFilter, term position?

Posted by Chris Hostetter <ho...@fucit.org>.

: Should the EdgeNGramFilter use the same term position for the ngrams within a
: single token?

i can see the argument going both ways ... imagine a hypothetical 
CharSplitterTokenFilter that replaces each token in the stream with 
one token per character in the original token (ie: "hello" becomes 
h,e,l,l,o) ... should those tokens all have the same position?  they have a 
logical ordered flow to them, so in theory they are sequential ... but 
they did occupy the same "space" in the original token stream.

when in doubt: make it an option
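
a minimal sketch of what such an option could look like, in plain Python rather than against the real filter API. Lucene tokens carry a position increment relative to the previous token, and an increment of 0 stacks a token on the one before it. The function names and the same_position flag here are made up for illustration.

```python
from itertools import accumulate


def char_split_increments(token, same_position):
    """Split a token into characters, each paired with a position increment.

    The first character always gets increment 1 (it starts a new position).
    Later characters get 0 when same_position is set, otherwise 1.
    """
    return [
        (ch, 1 if i == 0 or not same_position else 0)
        for i, ch in enumerate(token)
    ]


def absolute_positions(pairs):
    """Turn position increments into absolute positions by a running sum."""
    return list(accumulate(inc for _, inc in pairs))


print(absolute_positions(char_split_increments("hello", same_position=False)))
# [1, 2, 3, 4, 5]
print(absolute_positions(char_split_increments("hello", same_position=True)))
# [1, 1, 1, 1, 1]
```

the only thing the option changes is whether the later tokens get an increment of 1 or 0, so supporting both behaviours should be cheap.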



-Hoss