Posted to java-user@lucene.apache.org by Jim <jw...@sgi.com> on 2004/12/25 17:05:00 UTC

Need an analyzer that includes numbers.

I've seen some discussion on this and the answer seems to be "write your 
own".  Hasn't someone already done that by now that would share?  I 
really have to be able to include numeric and alphanumeric strings in my 
searches.   I don't understand analyzers well enough to roll my own.

Thanks,
Jim.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Need an analyzer that includes numbers.

Posted by Jim Lynch <jw...@sgi.com>.
Hi, Erik,

Thank you very much for taking the time to do this.  I may have 
mentioned, I'm evaluating search engines and am implementing a subset of 
the features that we'll need eventually.  This will help greatly. 

Thanks,
Jim.

Erik Hatcher wrote:

>
> On Dec 25, 2004, at 11:05 AM, Jim wrote:
>
>> I've seen some discussion on this and the answer seems to be "write 
>> your own".  Hasn't someone already done that by now that would 
>> share?  I really have to be able to include numeric and alphanumeric 
>> strings in my searches.   I don't understand analyzers well enough to 
>> roll my own.
>
>
> This is more involved than just keeping numbers around... or at least 
> there are more steps to consider.  Do you want the alpha characters 
> lower-cased, which is the typical behavior so that searches are 
> case-insensitive.  What about punctuation characters?  Generally these 
> get tossed, however there are cases where that is not desired either.
> (Snip excellent response)




Re: Need an analyzer that includes numbers.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 25, 2004, at 11:05 AM, Jim wrote:

> I've seen some discussion on this and the answer seems to be "write 
> your own".  Hasn't someone already done that by now that would share?  
> I really have to be able to include numeric and alphanumeric strings 
> in my searches.   I don't understand analyzers well enough to roll my 
> own.

This is more involved than just keeping numbers around... or at least 
there are more steps to consider.  Do you want the alpha characters 
lower-cased?  That is the typical behavior, so that searches are 
case-insensitive.  What about punctuation characters?  Generally these 
get tossed; however, there are cases where that is not desired either.

The good news is that writing the Tokenizer and TokenFilter pieces of an 
analyzer is generally relatively easy.  There are a number of built-in 
Lucene pieces that you can leverage.  I whipped up a quick 
AlphanumericAnalyzer for you that subclasses CharTokenizer to treat 
alphanumeric characters as part of tokens, and any other character as a 
separator that gets thrown away.  At the same time, it lowercases.  The 
output of the main() method is also shown below.

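import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class AlphanumericAnalyzer extends Analyzer {
   public TokenStream tokenStream(String fieldName, Reader reader) {
     return new CharTokenizer(reader) {
       // Lowercase every character kept in a token so that
       // searches are case-insensitive.
       protected char normalize(char c) {
         return Character.toLowerCase(c);
       }

       // Letters and digits belong to tokens; any other character
       // ends the current token and is discarded.
       protected boolean isTokenChar(char c) {
         return Character.isLetter(c) || Character.isDigit(c);
       }
     };
   }


   // Pulls the first three tokens by hand to show what the analyzer emits.
   public static void main(String[] args) throws IOException {
     TokenStream ts =
         new AlphanumericAnalyzer().tokenStream("field",
             new StringReader("December 26, 2004"));

     String month = ts.next().termText();
     String day = ts.next().termText();
     String year = ts.next().termText();

     System.out.println(month + " " + day + " " + year);
   }

}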


Output:
december 26 2004

Calling .tokenStream and .next().termText() is not something your 
production code would need to do - but it's what happens under the 
covers of Lucene.  If you are going to write a custom analyzer, you 
*should* write unit tests that "analyze" the analyzer using these 
lower-level methods.
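
As a rough illustration of the kind of checks such a test might make, 
here is a minimal, dependency-free sketch.  It does not call Lucene at 
all: the class AlphanumericSketch and its tokenize() and check() helpers 
are hypothetical names that only mimic the isTokenChar/normalize rule of 
the analyzer above, so the expected tokens can be verified without 
Lucene on the classpath.  A real test would go through 
.tokenStream() and .next().termText() instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AlphanumericSketch {
    // Same rule as isTokenChar() in the analyzer: letters and digits only.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c) || Character.isDigit(c);
    }

    // Hypothetical stand-in for the token stream: collects lowercased
    // runs of token characters and drops everything in between.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Fails loudly if the tokens differ from what was expected.
    static void check(String input, List<String> expected) {
        List<String> actual = tokenize(input);
        if (!actual.equals(expected)) {
            throw new AssertionError(
                "input '" + input + "': expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        check("December 26, 2004", Arrays.asList("december", "26", "2004"));
        // Punctuation inside an identifier splits it into two tokens.
        check("Part XJ-900b", Arrays.asList("part", "xj", "900b"));
        // Input with no letters or digits produces no tokens.
        check("  ,, ", Collections.<String>emptyList());
        System.out.println("all checks passed");
    }
}
```

The second check shows the punctuation trade-off mentioned earlier: a 
hyphenated identifier is split in two, which may or may not be what a 
given application wants.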

Lucene in Action goes into the analysis topic deeply, but simply, and I 
spent a great deal of time toying with different customizations to 
analyzers to write about them.  The sample code distribution includes 
utility methods and unit test helpers to illustrate, test, and debug 
the analysis process.  And in retrospect, this very example I cobbled 
together to reply to this e-mail would have been a great example to add 
as well.

	Erik




Re: Need an analyzer that includes numbers.

Posted by Otis Gospodnetic <ot...@yahoo.com>.
WhitespaceAnalyzer will let you keep numbers.  It just breaks the input
on whitespace.
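
To make the trade-off concrete, here is a small dependency-free sketch 
of whitespace-only splitting.  It is an approximation written for 
illustration, not Lucene's own code: WhitespaceSketch and tokenize() are 
hypothetical names, and the only assumption carried over is that 
tokens are split on whitespace with nothing else changed.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {
    // Splits on whitespace only; case and punctuation are left untouched.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line, bracketed so punctuation is visible.
        for (String token : tokenize("December 26, 2004")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For "December 26, 2004" this yields the tokens "December", "26," and 
"2004": the numbers survive, but so do the capital letter and the 
trailing comma, so a search term would presumably need to match those 
exactly.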

Otis

--- Jim <jw...@sgi.com> wrote:

> I've seen some discussion on this and the answer seems to be "write
> your 
> own".  Hasn't someone already done that by now that would share?  I 
> really have to be able to include numeric and alphanumeric strings in
> my 
> searches.   I don't understand analyzers well enough to roll my own.
> 
> Thanks,
> Jim.

