Posted to dev@lucene.apache.org by Fernando Wasylyszyn <fe...@yahoo.com.ar> on 2011/01/24 15:42:04 UTC

StopTokenizer Proposal

Hi everybody. I am a developer and researcher working for Snoop Consulting 
S.R.L. in Argentina, especially on projects related to information retrieval and 
machine learning.
While working on a project for Yell Argentina (Yellow Pages), I developed what I 
call a StopTokenizer.

Problem:

We developed a small "suggest engine" to be included in the project. This 
suggest engine should not generate suggestions for a set of stopwords (for 
example: "for"), so we added a StopFilter with a predefined set of stopwords 
(including "for"). The problem arose when we tested the engine with prefixes 
that match a stopword: for example, we typed "for" expecting "forsaken" to be 
returned as a suggestion, and it was not.

Solution:

We implemented the StopTokenizer. This tokenizer takes an input string from a 
Reader, a set of characters to be used as delimiters, and a set of stopwords. 
The text is tokenized using the delimiters. Then each token is analyzed, and the 
decision of whether it is a stopword is based not only on the predefined set of 
stopwords (as the StopFilter does) but also on:

1) The position of the token: if the text is "for sending" and whitespace is a 
delimiter, then "for" is recognized as a stopword. The same holds if the text is 
"this is for sending".
2) The characters surrounding the token: if the text is "for " (note the 
trailing whitespace), then the token is recognized as a stopword; but if the 
text is "for" (with no surrounding whitespace), then the token is NOT recognized 
as a stopword, so that "forsaken" can still be retrieved as a suggestion.
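To make the two rules concrete, here is a minimal sketch of the decision logic in plain Java. It is not the actual StopTokenizer code (which wraps a Reader and follows Lucene's Tokenizer API); the class and method names here are illustrative only. The key assumption, taken from rule 2, is that a token followed by a delimiter is "closed" and subject to the stopword test, while the final, unclosed token is always kept so it can still act as a prefix:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the stopword decision rules, not the real API.
public class StopTokenDemo {

    private final Set<String> stopwords;
    private final Set<Character> delimiters;

    public StopTokenDemo(Set<String> stopwords, Set<Character> delimiters) {
        this.stopwords = stopwords;
        this.delimiters = delimiters;
    }

    // Returns the tokens that survive stopword removal.
    public List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (delimiters.contains(c)) {
                // A delimiter "closes" the token, so the plain stopword
                // test applies (rules 1 and 2).
                emit(kept, current.toString(), true);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The final token has no trailing delimiter: keep it even if it
        // equals a stopword, so "for" can still match "forsaken" as a prefix.
        emit(kept, current.toString(), false);
        return kept;
    }

    private void emit(List<String> out, String token, boolean closed) {
        if (token.isEmpty()) return;
        if (closed && stopwords.contains(token)) return; // drop closed stopword
        out.add(token);
    }

    public static void main(String[] args) {
        StopTokenDemo t = new StopTokenDemo(
                new HashSet<>(Arrays.asList("for", "this", "is")),
                new HashSet<>(Arrays.asList(' ')));
        System.out.println(t.tokenize("for "));                // prints []
        System.out.println(t.tokenize("for"));                 // prints [for]
        System.out.println(t.tokenize("this is for sending")); // prints [sending]
    }
}
```

Note how "for " (closed) is dropped while "for" (open) survives, which is exactly the behavior the suggest engine needs at query time.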

We think that using this tokenizer for query analysis, combined with a 
StopFilter at indexing time, can be useful to the community.
Comments and ideas are welcome!

Thank you.

Cheers.
Fernando.