Posted to java-user@lucene.apache.org by Max Metral <ma...@artsalliancelabs.com> on 2008/04/17 20:08:26 UTC
Word split problems
In our app, we search for businesses. So here's an example:
Lululemon Athletica
I'd like any of these search terms to work for this:
Lulu lemon
Lu Lu Lemon
Lululemon
What strategy would be optimal for this kind of thing (keeping in mind,
of course, that negative matches are also bad)? Right now we're using
SnowballAnalyzer. It's a wiki, so one answer so far has been "let the
user deal with it," and we have a way of specifying "also known as"
names, but for this case I feel like that shouldn't be required.
Thanks!
--Max
http://boston.povo.com
RE: Word split problems
Posted by Max Metral <ma...@artsalliancelabs.com>.
It's probably about 100,000 entries per "thing that it would care about
at once".
-----Original Message-----
From: Karl Wettin [mailto:karl.wettin@gmail.com]
Sent: Thursday, April 17, 2008 3:17 PM
To: java-user@lucene.apache.org
Subject: Re: Word split problems
Max Metral wrote:
>
> Lululemon Athletica
>
> I'd like any of these search terms to work for this:
>
> Lulu lemon
> Lu Lu Lemon
> Lululemon
>
> What strategy would be optimal for this kind of thing (of course
> keeping
How large is your corpus? I suggest you look at NGramTokenizer.
karl
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Word split problems
Posted by Karl Wettin <ka...@gmail.com>.
Max Metral wrote:
>
> Lululemon Athletica
>
> I'd like any of these search terms to work for this:
>
> Lulu lemon
> Lu Lu Lemon
> Lululemon
>
> What strategy would be optimal for this kind of thing (of course keeping
How large is your corpus? I suggest you look at NGramTokenizer.
karl
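
To illustrate why character n-grams could help here, below is a minimal plain-Java sketch. It is not Lucene's NGramTokenizer and does not use any Lucene API. The class name NGramSketch, the helper methods, the choice of n = 3, and the step that strips spaces and punctuation before building n-grams are all assumptions made for illustration. It scores a query by the fraction of its trigrams that also occur in the business name.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class NGramSketch {

    // Assumption: lowercase and drop everything except ASCII letters and
    // digits, so "Lu Lu Lemon" and "Lululemon" normalize to the same string.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Collect the distinct character n-grams of the normalized text.
    static Set<String> ngrams(String text, int n) {
        String s = normalize(text);
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Fraction of the query's n-grams that also occur in the indexed name.
    static double coverage(String query, String name, int n) {
        Set<String> queryGrams = ngrams(query, n);
        if (queryGrams.isEmpty()) {
            return 0.0;
        }
        Set<String> nameGrams = ngrams(name, n);
        int hits = 0;
        for (String gram : queryGrams) {
            if (nameGrams.contains(gram)) {
                hits++;
            }
        }
        return (double) hits / queryGrams.size();
    }

    public static void main(String[] args) {
        String name = "Lululemon Athletica";
        String[] queries = {"Lulu lemon", "Lu Lu Lemon", "Lululemon", "Lemonade"};
        for (String query : queries) {
            System.out.println(query + " -> " + coverage(query, name, 3));
        }
    }
}
```

With these assumptions, all three spellings from the original post are fully covered by the trigrams of "Lululemon Athletica". An unrelated query such as "Lemonade" still shares some trigrams with it, which is the negative-match risk mentioned in the question, so a real setup would likely need a score threshold or a combination with exact-term matching.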