Posted to java-user@lucene.apache.org by Jim <jw...@sgi.com> on 2004/12/25 17:05:00 UTC
Need an analyzer that includes numbers.
I've seen some discussion on this and the answer seems to be "write your
own". Hasn't someone already done that by now that would share? I
really have to be able to include numeric and alphanumeric strings in my
searches. I don't understand analyzers well enough to roll my own.
Thanks,
Jim.
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Need an analyzer that includes numbers.
Posted by Jim Lynch <jw...@sgi.com>.
Hi, Erik,
Thank you very much for taking the time to do this. I may have
mentioned, I'm evaluating search engines and am implementing a subset of
the features that we'll need eventually. This will help greatly.
Thanks,
Jim.
Erik Hatcher wrote:
>
> On Dec 25, 2004, at 11:05 AM, Jim wrote:
>
>> I've seen some discussion on this and the answer seems to be "write
>> your own". Hasn't someone already done that by now that would
>> share? I really have to be able to include numeric and alphanumeric
>> strings in my searches. I don't understand analyzers well enough to
>> roll my own.
>
>
> This is more involved than just keeping numbers around... or at least
> there are more steps to consider. Do you want the alpha characters
> lower-cased, which is the typical behavior so that searches are
> case-insensitive? What about punctuation characters? Generally these
> get tossed; however, there are cases where that is not desired either.
> (Snip excellent response)
Re: Need an analyzer that includes numbers.
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 25, 2004, at 11:05 AM, Jim wrote:
> I've seen some discussion on this and the answer seems to be "write
> your own". Hasn't someone already done that by now that would share?
> I really have to be able to include numeric and alphanumeric strings
> in my searches. I don't understand analyzers well enough to roll my
> own.
This is more involved than just keeping numbers around... or at least
there are more steps to consider. Do you want the alpha characters
lower-cased, which is the typical behavior so that searches are
case-insensitive? What about punctuation characters? Generally these
get tossed; however, there are cases where that is not desired either.
The good news is that writing the Tokenizer and TokenFilter pieces of an
analyzer is generally fairly easy. There are a number of built-in
Lucene pieces that you can leverage. I whipped up a quick
AlphanumericAnalyzer for you demonstrating CharTokenizer. It treats
alphanumeric characters as part of tokens, and any other character as a
separator that gets thrown away. At the same time, it lowercases. The
output of the main() method is shown below also.
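import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class AlphanumericAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CharTokenizer(reader) {
      // Lowercase each character as it is added to a token.
      protected char normalize(char c) {
        return Character.toLowerCase(c);
      }

      // Letters and digits are part of a token; anything else separates tokens.
      protected boolean isTokenChar(char c) {
        return Character.isLetter(c) || Character.isDigit(c);
      }
    };
  }

  public static void main(String[] args) throws IOException {
    TokenStream ts =
        new AlphanumericAnalyzer().tokenStream("field",
            new StringReader("December 26, 2004"));
    // Pull the three expected tokens off the stream one at a time.
    String month = ts.next().termText();
    String day = ts.next().termText();
    String year = ts.next().termText();
    System.out.println(month + " " + day + " " + year);
  }
}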
Output:
december 26 2004
Calling .tokenStream and .next().termText() is not something your
production code would need to do - but it's what happens under the
covers of Lucene. If you are going to write a custom analyzer, you
*should* write unit tests that "analyze" the analyzer using these
lower-level methods.
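As a rough sketch of the kind of check meant here, the class below
applies the same two rules as the analyzer above (letters and digits
are token characters, everything is lowercased) in plain Java, so the
expected tokens can be verified without Lucene on the classpath. The
names PlainAlphanumericTokenizer, tokenize and check are hypothetical
and not part of Lucene; a real test would drive tokenStream() and
next() on the analyzer itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of Lucene: applies the same two rules as
// AlphanumericAnalyzer so the expected tokens can be checked in isolation.
final class PlainAlphanumericTokenizer {

    // Same rule as isTokenChar() above: letters and digits belong to a token.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c) || Character.isDigit(c);
    }

    // Splits on every non-alphanumeric character and lowercases what is kept.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Fails loudly if the produced tokens differ from the expected ones.
    static void check(String input, String... expected) {
        List<String> actual = tokenize(input);
        if (!actual.equals(Arrays.asList(expected))) {
            throw new AssertionError(input + " -> " + actual);
        }
    }

    public static void main(String[] args) {
        check("December 26, 2004", "december", "26", "2004");
        // Punctuation is a separator, so a hyphenated part number is split.
        check("Part XY-1234", "part", "xy", "1234");
        // Letters and digits with no separator between them stay one token.
        check("rev2b", "rev2b");
        System.out.println("all checks passed");
    }
}
```

The second check is the punctuation caveat in practice: if "XY-1234"
needs to stay searchable as a single term, isTokenChar would have to
accept the hyphen as well.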
Lucene in Action goes into the analysis topic deeply, but simply, and I
spent a great deal of time toying with different customizations to
analyzers to write about them. The sample code distribution includes
utility methods and unit test helpers to illustrate, test, and debug
the analysis process. And in retrospect, this very example I cobbled
together to reply to this e-mail would have been a great example to add
as well.
Erik
Re: Need an analyzer that includes numbers.
Posted by Otis Gospodnetic <ot...@yahoo.com>.
WhitespaceAnalyzer will let you keep the numbers. It just breaks the
input on whitespace and does nothing else: no lowercasing, and
punctuation stays attached to the tokens.
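To make the trade-off concrete, here is a small plain-Java
illustration of whitespace-only splitting. It only approximates the
behavior with String.split and is not Lucene code; the class and
method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not Lucene code: approximates whitespace-only
// tokenization with String.split to show which tokens come out.
final class WhitespaceSplitDemo {

    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        // One or more whitespace characters separate tokens; nothing is altered.
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // The capital D and the comma after 26 both survive in the tokens.
        System.out.println(splitOnWhitespace("December 26, 2004"));
    }
}
```

For "December 26, 2004" the tokens are "December", "26," and "2004",
so the numbers are kept, but so are the original case and the trailing
comma.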
Otis
--- Jim <jw...@sgi.com> wrote:
> I've seen some discussion on this and the answer seems to be "write
> your
> own". Hasn't someone already done that by now that would share? I
> really have to be able to include numeric and alphanumeric strings in
> my
> searches. I don't understand analyzers well enough to roll my own.
>
> Thanks,
> Jim.