Posted to dev@lucene.apache.org by Fernando Wasylyszyn <fe...@yahoo.com.ar> on 2011/01/24 15:42:04 UTC
StopTokenizer Proposal
Hi everybody. I am a developer and researcher at Snoop Consulting
S.R.L. in Argentina, specializing in projects related to information retrieval
and machine learning.
While working on a project for Yell Argentina (Yellow Pages), I developed what
I call a StopTokenizer.
Problem:
We developed a small "suggest engine" for the project. This engine should not
generate suggestions for a set of stopwords (for example, "for"), so we added a
StopFilter with a predefined set of stopwords (including "for"). The problem
arose when we tested the engine with prefixes that match a stopword: typing
"for", we expected "forsaken" to be returned as a suggestion, but it was not,
because the StopFilter removed the prefix before it could be matched.
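To illustrate the problem, here is a minimal, hypothetical sketch in plain Java (not the actual Yell engine or Lucene's StopFilter): a suggester that, like a query-time StopFilter, drops any prefix found in the stopword set before matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SuggestProblem {
    // Hypothetical, simplified suggester: the prefix is filtered against the
    // stopword set first, mimicking what a StopFilter does at query time.
    static List<String> suggest(String prefix, List<String> index,
                                Set<String> stopwords) {
        List<String> results = new ArrayList<>();
        if (stopwords.contains(prefix)) {
            return results; // prefix removed as a stopword: no suggestions
        }
        for (String word : index) {
            if (word.startsWith(prefix)) {
                results.add(word);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> index = List.of("forsaken", "forest", "sending");
        Set<String> stop = Set.of("for");
        // "for" is in the stopword set, so nothing comes back,
        // even though "forsaken" and "forest" match the prefix.
        System.out.println(suggest("for", index, stop)); // prints []
        System.out.println(suggest("sen", index, stop)); // prints [sending]
    }
}
```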
Solution:
We implemented the StopTokenizer. This tokenizer takes an input string from a
Reader, a set of characters to be used as delimiters, and a set of stopwords.
The text is tokenized using the delimiters. Then each token is analyzed, and
the tokenizer decides whether it is a stopword not only based on a predefined
set of stopwords (as StopFilter does) but also based on:
1) The position of the token: if the text is "for sending" and whitespace is a
delimiter, then "for" is recognized as a stopword. The same applies if the text
is "this is for sending".
2) The characters surrounding the token: if the text is "for " (note the
trailing whitespace), the token is recognized as a stopword, but if the text is
"for" (with no surrounding whitespace), the token is NOT recognized as a
stopword, so that "forsaken" can still be retrieved as a suggestion.
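The two rules above boil down to one decision: a token from the stopword set is dropped only when the input continues past it with a delimiter, i.e. the user has finished typing it. A minimal sketch of that decision in plain Java (the method name and signature are hypothetical, not the proposed class's API):

```java
import java.util.Set;

public class StopTokenSketch {
    // Hypothetical helper: given the raw input text and a token ending at
    // offset tokenEnd, decide whether the token should be dropped as a
    // stopword. A stopword-set token is dropped only when a delimiter
    // follows it; a stopword at the very end of the input is kept, since
    // it may be the prefix of a longer word still being typed.
    static boolean isStopword(String text, int tokenEnd, String token,
                              Set<String> stopwords, Set<Character> delimiters) {
        if (!stopwords.contains(token)) {
            return false;
        }
        return tokenEnd < text.length()
            && delimiters.contains(text.charAt(tokenEnd));
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("for");
        Set<Character> delims = Set.of(' ');
        // Trailing whitespace: "for" is a completed word, so drop it.
        System.out.println(isStopword("for ", 3, "for", stop, delims)); // true
        // No trailing delimiter: keep "for" as a prefix for "forsaken".
        System.out.println(isStopword("for", 3, "for", stop, delims)); // false
        // Mid-text stopword followed by a delimiter: drop it.
        System.out.println(isStopword("for sending", 3, "for", stop, delims)); // true
    }
}
```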
We think that using this tokenizer at query-analysis time, combined with a
StopFilter at indexing time, could be useful for the community.
Comments and ideas are welcome!
Thank you.
Cheers.
Fernando.