Posted to java-user@lucene.apache.org by Max Metral <ma...@artsalliancelabs.com> on 2008/04/17 20:08:26 UTC

Word split problems

In our app, we search for businesses.  So here's an example:

Lululemon Athletica

I'd like any of these search terms to work for this:

Lulu lemon
Lu Lu Lemon
Lululemon

What strategy would be optimal for this kind of thing (keeping in mind,
of course, that false matches are also bad)?  Right now we're using
SnowballAnalyzer.  It's a wiki, so one answer so far has been "let the
user deal with it," and we have a way of specifying "also known as"
names, but for this case I feel like that shouldn't be required.

 

Thanks!

--Max

http://boston.povo.com

 


RE: Word split problems

Posted by Max Metral <ma...@artsalliancelabs.com>.
It's probably about 100,000 entries per "thing that it would care about
at once".

-----Original Message-----
From: Karl Wettin [mailto:karl.wettin@gmail.com] 
Sent: Thursday, April 17, 2008 3:17 PM
To: java-user@lucene.apache.org
Subject: Re: Word split problems

Max Metral wrote:
> Lululemon Athletica
> 
> I'd like any of these search terms to work for this:
> 
> Lulu lemon
> Lu Lu Lemon
> Lululemon
> 
> What strategy would be optimal for this kind of thing (of course keeping

How large is your corpus? I suggest you look at NGramTokenizer.


    karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: Word split problems

Posted by Karl Wettin <ka...@gmail.com>.
Max Metral wrote:
> Lululemon Athletica
> 
> I'd like any of these search terms to work for this:
> 
> Lulu lemon
> Lu Lu Lemon
> Lululemon
> 
> What strategy would be optimal for this kind of thing (of course keeping

How large is your corpus? I suggest you look at NGramTokenizer.


    karl
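
To make the n-gram suggestion concrete, here is a minimal plain-Java sketch of the underlying idea: break both the business name and the query into overlapping character n-grams and measure how many of the query's grams occur in the name. It does not use Lucene or NGramTokenizer itself. The class name NGramSketch, the coverage() score, the gram size of 3, and the step that strips spaces and punctuation before gramming are all assumptions made for illustration, not something taken from this thread or from Lucene's API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hypothetical helper, not part of Lucene.
final class NGramSketch {

    // Assumption: lowercase and drop everything that is not a letter or digit,
    // so "Lu Lu Lemon" and "Lululemon" normalize to the same string.
    public static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Distinct character n-grams of the normalized input.
    public static Set<String> grams(String s, int n) {
        String t = normalize(s);
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= t.length(); i++) {
            out.add(t.substring(i, i + n));
        }
        return out;
    }

    // Fraction of the query's distinct n-grams that also occur in the name.
    // Returns 0.0 when the query is too short to produce any n-gram.
    public static double coverage(String query, String name, int n) {
        Set<String> queryGrams = grams(query, n);
        if (queryGrams.isEmpty()) {
            return 0.0;
        }
        Set<String> nameGrams = grams(name, n);
        int hits = 0;
        for (String g : queryGrams) {
            if (nameGrams.contains(g)) {
                hits++;
            }
        }
        return (double) hits / queryGrams.size();
    }

    public static void main(String[] args) {
        String name = "Lululemon Athletica";
        String[] queries = {"Lulu lemon", "Lu Lu Lemon", "Lululemon", "Lemonade stand"};
        for (String q : queries) {
            System.out.println(q + " -> " + coverage(q, name, 3));
        }
    }
}
```

In this sketch the three spellings from the original question all get full coverage against "Lululemon Athletica", while an unrelated query such as "Lemonade stand" still shares a few grams and gets a partial score. That partial overlap is where false matches would come from, so any cutoff applied to the score would need tuning against real data.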
