Posted to java-user@lucene.apache.org by bhecht <bh...@ams-sys.com> on 2007/05/21 22:05:02 UTC

stop words, synonyms... what's in it for me?

Hi there,

I started using Lucene not long ago, with plans to replace the current
SQL queries in my application with it.
As I wasn't aware of Lucene before, I had already implemented some tools
(filters) similar to the ones Lucene includes.

For example, I have implemented a "stop word" tool.
In my case I have many more configuration options than Lucene, including the
option to remove substrings in addition to complete tokens.
I can configure the required location of the substring within the token,
or even the location of the token within the phrase.

I have implemented a synonym mechanism (substitution mechanism) that can
also be configured according to location within a phrase. It can also be
configured to find synonyms while tolerating spelling mistakes. It doesn't
expand a term, though; it only transforms it into one specific replacement. It
can find replacements for substrings as well, so I can use it to separate
words. For example, in German I have "strasse" => " strasse" (with a leading
space), so words like "mainstrasse" will be split into "main" and "strasse".
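
A minimal sketch of the substring substitution described above, in plain Java
with no Lucene dependency. The class and method names (SubstringNormalizer,
normalize, tokens) are hypothetical, and only the single "strasse" rule from
the example is modelled; the location-based and spelling-tolerant options are
left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: rewrites raw text before it is
// handed to the indexer (or to the query parser).
final class SubstringNormalizer {

    // Replace every occurrence of "strasse" with " strasse", then collapse
    // runs of whitespace, so "mainstrasse" becomes "main strasse".
    static String normalize(String text) {
        String replaced = text.toLowerCase(Locale.ROOT).replace("strasse", " strasse");
        return replaced.trim().replaceAll("\\s+", " ");
    }

    // Split the normalized text on spaces to get the tokens that a simple
    // whitespace-based analyzer would see afterwards.
    static List<String> tokens(String text) {
        return Arrays.asList(normalize(text).split(" "));
    }

    public static void main(String[] args) {
        // "sch\u00f6ne" is "schöne", written with an escape to stay ASCII-safe.
        System.out.println(tokens("sch\u00f6ne mainstrasse"));
        System.out.println(tokens("strasse"));
    }
}
```

With this rule, "schöne mainstrasse" yields the three tokens "schöne", "main"
and "strasse", while a standalone "strasse" stays a single token because the
leading space is trimmed away.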

I am wondering whether I can run my "standardization" tools before calling the
Lucene indexing, without implementing any custom analyzers, and achieve more
or less the same results.

What do I lose if I go this way? The stemming filters are really the one
thing I didn't have, and I will use them.
Is there any point in creating custom analyzers with filters for
stop words and synonyms, and implementing my own "substring" filter for
separating tokens into "sub words" (like "mainstrasse" => "main", "strasse")?

Thanks in advance

--
View this message in context: http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10725950
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: stop words, synonyms... what's in it for me?

Posted by bhecht <bh...@ams-sys.com>.

Thanks Erick, that's what I thought.
In my case no phrase queries are done, so it seems I am good to go.
Any additional thoughts on the issue are welcome.
Thanks



Erick Erickson wrote:
> 
> No, a phrase search will NOT match. Phrase semantics
> require that the query terms be adjacent (slop of 0). So, since
> "mainstrasse" was split into two tokens at index time, the test for
> "is schöne right next to strasse" will fail because of the intervening
> (introduced) term "main". Whether this is desired behavior or not is
> another question.
> 
> You're right that asking for a non-phrase search *will* work
> though.
> 
> Best
> Erick
> 


Re: stop words, synonyms... what's in it for me?

Posted by Erick Erickson <er...@gmail.com>.

No, a phrase search will NOT match. Phrase semantics
require that the query terms be adjacent (slop of 0). So, since
"mainstrasse" was split into two tokens at index time, the test for
"is schöne right next to strasse" will fail because of the intervening
(introduced) term "main". Whether this is desired behavior or not is
another question.
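
A small sketch of the adjacency test, as a simplified model in plain Java
rather than Lucene code. The class PhraseCheck and its methods are
hypothetical; the document is assumed to be the token sequence "schöne",
"main", "strasse" produced by the pre-indexing split.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of exact phrase matching (slop 0); this is not Lucene code.
final class PhraseCheck {

    // True if the phrase terms occur in the document at consecutive positions.
    static boolean matchesPhrase(List<String> docTerms, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= docTerms.size(); start++) {
            if (docTerms.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    // True if every query term occurs somewhere in the document
    // (a non-phrase search in which all terms are required).
    static boolean matchesAllTerms(List<String> docTerms, List<String> query) {
        return docTerms.containsAll(query);
    }

    public static void main(String[] args) {
        // "sch\u00f6ne" is "schöne", written with an escape to stay ASCII-safe.
        List<String> doc = Arrays.asList("sch\u00f6ne", "main", "strasse");
        List<String> query = Arrays.asList("sch\u00f6ne", "strasse");
        System.out.println("phrase match: " + matchesPhrase(doc, query));
        System.out.println("term match:   " + matchesAllTerms(doc, query));
    }
}
```

In this model the phrase test fails because "main" sits between the two query
terms, while the non-phrase test succeeds because both terms are present.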

You're right that asking for a non-phrase search *will* work
though.

Best
Erick

On 5/21/07, bhecht <bh...@ams-sys.com> wrote:
>
>
> I will never have "mainstrasse" in my Lucene index, since "strasse" is always
> replaced with " strasse", causing "mainstrasse" to be split into "main
> strasse".
> So the example you gave:
> "schöne strasse" will match "schöne mainstrasse", since in the Lucene index
> I have "schöne main strasse".

Re: stop words, synonyms... what's in it for me?

Posted by bhecht <bh...@ams-sys.com>.

I will never have "mainstrasse" in my Lucene index, since "strasse" is always
replaced with " strasse", causing "mainstrasse" to be split into "main
strasse".
So the example you gave:
"schöne strasse" will match "schöne mainstrasse", since in the Lucene index
I have "schöne main strasse".


Daniel Naber-5 wrote:
> 
> On Monday 21 May 2007 22:53, bhecht wrote:
> 
>> If someone searches for mainstrasse, my tools will split it again to
>> main and strasse, and then lucene will be able to find it.
> 
> "strasse" will match "mainstrasse", but the phrase query "schöne strasse"
> will not match "schöne mainstrasse". However, this could be considered a
> feature. Anyway, it will be difficult to use features that rely on the
> term list, e.g. the spellchecker. It will not be able to suggest
> "mainstrasse", as that's not in the index.
> 
> Regards
>  Daniel


Re: stop words, synonyms... what's in it for me?

Posted by Daniel Naber <lu...@danielnaber.de>.

On Monday 21 May 2007 22:53, bhecht wrote:

> If someone searches for mainstrasse, my tools will split it again to
> main and strasse, and then lucene will be able to find it.

"strasse" will match "mainstrasse", but the phrase query "schöne strasse"
will not match "schöne mainstrasse". However, this could be considered a
feature. Anyway, it will be difficult to use features that rely on the
term list, e.g. the spellchecker. It will not be able to suggest
"mainstrasse", as that's not in the index.
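
To illustrate the term-list point, here is a toy suggester in plain Java. It
is not Lucene's spellchecker; the class TermListSuggester and its methods are
hypothetical, and it simply picks the indexed term with the smallest
Levenshtein distance to the input. The term list is assumed to come from the
pre-split text "schöne main strasse".

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy suggester over an index's term list; not Lucene's spellchecker.
final class TermListSuggester {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Return the indexed term closest to the input, or null for an empty list.
    // Suggestions can only ever come from the term list itself.
    static String suggest(Set<String> terms, String input) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : terms) {
            int d = distance(term, input);
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Terms as indexed after the split; "sch\u00f6ne" is "schöne".
        Set<String> terms = new TreeSet<>(Arrays.asList("sch\u00f6ne", "main", "strasse"));
        System.out.println(terms.contains("mainstrasse"));
        System.out.println(suggest(terms, "mainstrase"));
    }
}
```

Whatever the misspelled input is, the suggestion is drawn from "schöne",
"main" and "strasse"; the compound "mainstrasse" can never be proposed because
it was never indexed as a term.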

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: stop words, synonyms... what's in it for me?

Posted by bhecht <bh...@ams-sys.com>.

Thanks Daniel,

But when searching, I will run my "standardization" tools again before
querying Lucene, so what you mentioned will not be a problem.
If someone searches for "mainstrasse", my tools will split it again into "main"
and "strasse", and then Lucene will be able to find it.


Daniel Naber-5 wrote:
> 
> On Monday 21 May 2007 22:05, bhecht wrote:
> 
>> Is there any point for me to start creating custom analyzers with filter
>> for stop words, synonyms, and implementing my own "sub string" filter,
>> for separating tokens into "sub words" (like "mainstrasse"=> "main",
>> "strasse")
> 
> Yes: I assume your document should be found both with "strasse" and with
> "mainstrasse". You will then need to put "main", "strasse", and "mainstrasse"
> at the same position (setPositionIncrement(0)). If you don't do that, phrase
> queries will no longer work as expected. Thus you need an analyzer;
> modifying the strings before they are put into Lucene is not enough.
> 
> Regards
>  Daniel


Re: stop words, synonyms... what's in it for me?

Posted by Daniel Naber <lu...@danielnaber.de>.

On Monday 21 May 2007 22:05, bhecht wrote:

> Is there any point for me to start creating custom analyzers with filter
> for stop words, synonyms, and implementing my own "sub string" filter,
> for separating tokens into "sub words" (like "mainstrasse"=> "main",
> "strasse")

Yes: I assume your document should be found both with "strasse" and with
"mainstrasse". You will then need to put "main", "strasse", and "mainstrasse"
at the same position (setPositionIncrement(0)). If you don't do that, phrase
queries will no longer work as expected. Thus you need an analyzer;
modifying the strings before they are put into Lucene is not enough.
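
A rough sketch of what stacking tokens at one position buys you, written as a
simplified positional-index model in plain Java rather than as a real Lucene
TokenFilter. The class StackedPositions, the Tok type and both methods are
hypothetical; the increment field plays the role of the position increment
that setPositionIncrement(0) sets on a token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of a positional index; this is not Lucene code.
final class StackedPositions {

    // A term plus its position increment relative to the previous token.
    static final class Tok {
        final String term;
        final int increment;

        Tok(String term, int increment) {
            this.term = term;
            this.increment = increment;
        }
    }

    // Turn a token stream into one set of terms per position.
    // An increment of 0 places the token at the same position as the previous one.
    static List<Set<String>> positions(List<Tok> stream) {
        List<Set<String>> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : stream) {
            pos = Math.max(0, pos + t.increment);
            while (result.size() <= pos) {
                result.add(new HashSet<String>());
            }
            result.get(pos).add(t.term);
        }
        return result;
    }

    // Exact phrase match: term i of the phrase must occur at position start + i.
    static boolean matchesPhrase(List<Set<String>> index, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= index.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                ok = index.get(start + i).contains(phrase.get(i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "schöne mainstrasse" with the compound and its parts at one position.
        List<Tok> stream = Arrays.asList(
                new Tok("sch\u00f6ne", 1),
                new Tok("mainstrasse", 1),
                new Tok("main", 0),
                new Tok("strasse", 0));
        List<Set<String>> index = positions(stream);
        System.out.println(matchesPhrase(index, Arrays.asList("sch\u00f6ne", "strasse")));
        System.out.println(matchesPhrase(index, Arrays.asList("sch\u00f6ne", "mainstrasse")));
    }
}
```

Because "main", "strasse" and "mainstrasse" share position 1 in this model,
both phrases are adjacent to "schöne" at position 0 and both match. Splitting
the string before indexing cannot express this, since plain text has no way to
say that two words occupy the same position.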

Regards
 Daniel

-- 
http://www.danielnaber.de
