Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2007/09/16 09:06:34 UTC
EdgeNGramTokenFilter, term position?
Should the EdgeNGramFilter use the same term position for the ngrams
within a single token?
As is, the EdgeNGramTokenFilter increments the term position for each
character. In analysis.jsp, with the input "hello", I get:
term position   1     2     3     4     5
term text       h     he    hel   hell  hello
term type       word  word  word  word  word
start,end       0,1   0,2   0,3   0,4   0,5
I would expect something more like what is generated from SOLR-357:
term position   1
term text       hello
                hell
                hel
                he
                h
term type       word
                prefix
                prefix
                prefix
                prefix
start,end       0,5
                0,4
                0,3
                0,2
                0,1
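To make the difference between the two layouts concrete, here is a minimal sketch in plain Python. It is not the Lucene or Solr implementation; the function name edge_ngrams and the stacked flag are made up for illustration. It only models how the position is assigned to each front-edge n-gram of a single token. The order of grams within one position is not meant to be significant.

```python
def edge_ngrams(token, start_position=1, stacked=False):
    """Return (position, text, start_offset, end_offset) per front-edge n-gram.

    Hypothetical helper for illustration only, not a Lucene API.
    """
    grams = []
    for size in range(1, len(token) + 1):
        # Sequential: every gram advances the position by one.
        # Stacked: every gram shares the position of the source token.
        position = start_position if stacked else start_position + size - 1
        grams.append((position, token[:size], 0, size))
    return grams


if __name__ == "__main__":
    print("sequential:")
    for row in edge_ngrams("hello"):
        print(" ", row)
    print("stacked:")
    for row in edge_ngrams("hello", stacked=True):
        print(" ", row)
```

In Lucene terms the stacked variant corresponds to emitting the extra grams with a position increment of 0, which is how overlapping tokens such as synonyms are usually represented.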
This seems like it would affect slop queries, but I don't really
understand them yet.
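A rough sketch of why positions matter for phrase and slop queries, again in plain Python rather than Lucene. The names index_positions and phrase_matches are hypothetical, and the slop rule here is deliberately simplified (the second term must follow the first with at most `slop` positions in between), which is only an approximation of how Lucene computes slop.

```python
def index_positions(words, stacked):
    """Map each front-edge n-gram to the set of positions where it occurs."""
    index = {}
    position = 0
    for word in words:
        if stacked:
            position += 1  # one position per source word
        for size in range(1, len(word) + 1):
            if not stacked:
                position += 1  # one position per gram
            index.setdefault(word[:size], set()).add(position)
    return index


def phrase_matches(index, first, second, slop=0):
    """Simplified sloppy phrase check (assumption, not Lucene's algorithm)."""
    return any(
        0 <= b - a - 1 <= slop
        for a in index.get(first, ())
        for b in index.get(second, ())
    )


if __name__ == "__main__":
    words = ["hello", "world"]
    sequential = index_positions(words, stacked=False)
    stacked = index_positions(words, stacked=True)
    # With sequential positions "hello" is at 5 and "world" at 10.
    print(phrase_matches(sequential, "hello", "world"))          # exact phrase
    print(phrase_matches(sequential, "hello", "world", slop=4))  # sloppy phrase
    # With stacked positions the grams of adjacent words are adjacent.
    print(phrase_matches(stacked, "hello", "world"))
    print(phrase_matches(stacked, "he", "wo"))
```

Under these assumptions, sequential positions push adjacent words apart by the length of the first word, so an exact phrase query on "hello world" only matches with extra slop, while stacked positions keep the original word adjacency.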
thanks
ryan
Re: EdgeNGramTokenFilter, term position?
Posted by Yonik Seeley <yo...@apache.org>.
On 9/16/07, Ryan McKinley <ry...@gmail.com> wrote:
> Should the EdgeNGramFilter use the same term position for the ngrams
> within a single token?
It feels like that is the right approach.
I don't see value in having them sequential, and I can think of uses
for having them overlap.
-Yonik
Re: EdgeNGramTokenFilter, term position?
Posted by Chris Hostetter <ho...@fucit.org>.
: Should the EdgeNGramFilter use the same term position for the ngrams within a
: single token?
i can see the argument going both ways ... imagine a hypothetical
CharSplitterTokenFilter that replaces each token in the stream with
one token per character in the original token (ie: "hello" becomes
h,e,l,l,o) ... should those tokens all have the same position? they have a
logical ordered flow to them, so in theory they are sequential ... but
they did occupy the same "space" in the original token stream.
when in doubt: make it an option
-Hoss
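The "make it an option" suggestion for the hypothetical CharSplitterTokenFilter could look roughly like the sketch below. It is plain Python, not a Lucene TokenFilter; char_split and its same_position parameter are invented names. Positions are expressed through an increment per emitted token, so the option only decides whether characters after the first one in a source token get an increment of 1 or 0.

```python
def char_split(tokens, same_position=False):
    """Split each token into one token per character.

    Returns (position, character) pairs. Hypothetical illustration only.
    """
    out = []
    position = 0
    for token in tokens:
        for i, ch in enumerate(token):
            # The first character of a token always moves to a new position.
            # Later characters either advance (sequential) or overlap (stacked).
            increment = 1 if (i == 0 or not same_position) else 0
            position += increment
            out.append((position, ch))
    return out


if __name__ == "__main__":
    print(char_split(["hi", "yo"]))
    print(char_split(["hi", "yo"], same_position=True))
```

Either way the characters of a later token never share a position with an earlier token; the option only controls whether characters within one source token overlap.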