Posted to solr-user@lucene.apache.org by Robert Brown <ro...@intelcompute.com> on 2011/11/29 18:33:27 UTC

Don't snowball depending on terms

Is it possible to search a field but not be affected by the snowball 
filter?

i.e., searching for "manage" also matches "management", but a user may 
want to restrict results to documents that contain exactly "manage".

I was hoping that simply quoting the term would do this, but it 
doesn't appear to make any difference.




--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

Re: Don't snowball depending on terms

Posted by Rob Brown <ro...@intelcompute.com>.

Yes, it looks like I'll have to do some pre-processing outside of Solr.

I don't mind giving users the option to query a differently indexed
field, i.e., same content but not stemmed, although this would apply to
all the keywords they enter, so they couldn't allow stemming on one
keyword but not another.

e.g., "manage" and exec = manage and (exec or executive)

My current config is using the example "text" fieldtype, so stemmed at
index time.

-----Original Message-----
From: François Schiettecatte <fs...@gmail.com>
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Don't snowball depending on terms
Date: Tue, 29 Nov 2011 13:53:44 -0500

It won't, and depending on how your analyzer is set up, the terms are most likely stemmed at index time.

You could create a separate field for unstemmed terms though, or use a less aggressive stemmer such as EnglishMinimalStemFilterFactory.

François

Re: Don't snowball depending on terms

Posted by Erick Erickson <er...@gmail.com>.

Ahhh, I hate making a new implementation match all of the old behavior, but
sometimes ya' just got no choice.

I *swear* that there's a JIRA with an approach to creating a filter for
this situation, but I can't find it....

Best
Erick

On Wed, Nov 30, 2011 at 9:19 AM, Robert Brown <ro...@intelcompute.com> wrote:
> Thanks Erick,
>
> This is a required feature since we're swapping out an existing search
> engine for Solr - users have saved searches that need to behave the
> same.
>
> I'll look into the edismax stuff, that's the handler we're using
> anyway.

Re: Don't snowball depending on terms

Posted by Robert Brown <ro...@intelcompute.com>.

Thanks Erick,

This is a required feature since we're swapping out an existing search
engine for Solr - users have saved searches that need to behave the
same.

I'll look into the edismax stuff, that's the handler we're using
anyway.



On Wed, 30 Nov 2011 09:12:11 -0500, Erick Erickson
<er...@gmail.com> wrote:
> First, watch the syntax <G>....
> 
> q=+(stemmed:perl^2 or stemmed:java^3) +unstemmed:"development manager"^5
> although it is a bit confusing to see the dismax stuff where the boost
> is put on the
> field name, but that's not how the queries are formed.
> 
> BTW, have you looked at edismax queries? You can distribute your terms
> across the fields, applying whatever boost you want and have the query
> input be pretty simple. It takes a bit to get your head around what
> edismax does,
> but it's worth it....
> 
> But before you go there.... You've presented no evidence that this is
> desirable.
> What is the use-case here? You say "users may want"... Well, why do the work
> unless they *do* want this capability? I'd strongly advise that you
> just forget about
> this feature unless and until there's a demonstrated need. Here's a
> blog I made at
> Lucid. Long-winded, but I'm like that sometimes....
> 
> http://www.lucidimagination.com/blog/2011/11/03/stop-being-so-agreeable/
> 
> Best
> Erick

Re: Don't snowball depending on terms

Posted by Erick Erickson <er...@gmail.com>.

First, watch the syntax <G>.... The boost goes on the term, not on the
field name:

q=+(stemmed:perl^2 OR stemmed:java^3) +unstemmed:"development manager"^5

It is a bit confusing, because the dismax stuff puts the boost on the
field name, but that's not how the queries themselves are formed.
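
A minimal sketch of building such a query string on the client side,
assuming the two field names used in this thread (stemmed, unstemmed).
The helper names boosted and build_query are hypothetical, not part of
any Solr client library. It writes the operator as uppercase OR, which is
the form the standard Lucene query parser recognizes as an operator, and
URL-encodes the result so that "+" and "^" survive the HTTP request.

```python
from urllib.parse import urlencode


def boosted(field, term, boost=None):
    """Return one Lucene-style clause of the form field:term^boost."""
    # Quote multi-word terms so they are searched as a phrase.
    text = f'"{term}"' if " " in term else term
    clause = f"{field}:{text}"
    # The boost is appended after the term, never after the field name.
    return f"{clause}^{boost}" if boost is not None else clause


def build_query():
    """Assemble the example query from this thread (hypothetical helper)."""
    any_skill = " OR ".join(
        [boosted("stemmed", "perl", 2), boosted("stemmed", "java", 3)]
    )
    required_phrase = boosted("unstemmed", "development manager", 5)
    return f"+({any_skill}) +{required_phrase}"


q = build_query()
# urlencode percent-encodes '+', '^', ':' and the quotes for the URL.
params = urlencode({"q": q})
print(q)
print(params)
```

Building the clause in one place keeps the "boost follows the term" rule
from being forgotten when queries are generated for several fields.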

BTW, have you looked at edismax queries? You can distribute your terms
across the fields, applying whatever boost you want and have the query
input be pretty simple. It takes a bit to get your head around what
edismax does, but it's worth it....
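
As a rough sketch of what "distribute your terms across the fields" can
look like from the client side: with edismax the user's input stays
simple and the per-field boosts move into the qf parameter. The
parameter names defType, q and qf are standard Solr request parameters;
the field names and boost values are taken from this thread as
assumptions, and edismax_params is a hypothetical helper.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts):
    """Build request parameters for an edismax query."""
    # qf lists the fields to search; here the boost DOES attach to the
    # field name, unlike in the hand-written Lucene query syntax.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}


params = edismax_params("perl java", {"stemmed": 2, "unstemmed": 5})
query_string = urlencode(params)
print(params["qf"])
print(query_string)
```

Note that this searches every term in every listed field, so on its own it
does not give per-term control over stemming.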

But before you go there.... You've presented no evidence that this is
desirable. What is the use-case here? You say "users may want"... Well,
why do the work unless they *do* want this capability? I'd strongly
advise that you just forget about this feature unless and until there's
a demonstrated need. Here's a blog post I wrote at Lucid. Long-winded,
but I'm like that sometimes....

http://www.lucidimagination.com/blog/2011/11/03/stop-being-so-agreeable/

Best
Erick


On Wed, Nov 30, 2011 at 8:50 AM, Robert Brown <ro...@intelcompute.com> wrote:
> Boosts can be included there too can't they?
>
> so this is valid?
>
> q=+(stemmed^2:perl or stemmed^3:java) +unstemmed^5:"development
> manager"
>
> is it possible to have different boosts on the same field btw?
>
> We currently search across 5 fields anyway, so my queries are gonna
> start getting messy.  :-/

Re: Don't snowball depending on terms

Posted by Robert Brown <ro...@intelcompute.com>.

Boosts can be included there too, can't they?

So is this valid?

q=+(stemmed^2:perl or stemmed^3:java) +unstemmed^5:"development manager"

Is it possible to have different boosts on the same field, btw?

We currently search across 5 fields anyway, so my queries are gonna
start getting messy.  :-/


On Wed, 30 Nov 2011 08:08:41 -0500, Erick Erickson
<er...@gmail.com> wrote:
> You can't have multiple "q" clauses (as opposed to "fq" clauses).
> You could form something like
> q=unstemmed:perl or java&fq=stemmed:manager
> or
> q=+(unstemmed:perl or java) +stemmed:manager
> 
> BTW, this fragment of the query probably doesn't do
> what you expect:
> unstemmed:perl or java
> would be parsed as
> unstemmed:perl OR default_search_field:java
> 
> FWIW
> Erick

Re: Don't snowball depending on terms

Posted by Erick Erickson <er...@gmail.com>.

You can't have multiple "q" clauses (as opposed to "fq" clauses).
You could form something like
q=unstemmed:perl or java&fq=stemmed:manager
or
q=+(unstemmed:perl or java) +stemmed:manager

BTW, this fragment of the query probably doesn't do
what you expect:
unstemmed:perl or java
would be parsed as
unstemmed:perl OR default_search_field:java
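
One way to avoid the stray default-field term is to group the terms
under the field, as in unstemmed:(perl OR java). The sketch below assumes
that grouped form is what was intended; field_group is a hypothetical
helper, and the operator is written in uppercase because that is the form
the standard Lucene query parser treats as an operator.

```python
def field_group(field, terms, operator="OR"):
    """Scope every term to one field, e.g. field:(a OR b)."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    joined = f" {operator} ".join(terms)
    # The parentheses apply the field prefix to every term inside them.
    return f"{field}:({joined})"


skills = field_group("unstemmed", ["perl", "java"])
q = f"+{skills} +stemmed:manager"
print(q)
```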

FWIW
Erick

On Wed, Nov 30, 2011 at 7:39 AM, Rob Brown <ro...@intelcompute.com> wrote:
> I guess I could do a bit of pre-processing, look for any words that are
> quoted, and search in a diff field for those
>
> How is a query like this formulated?
>
> q=unstemmed:perl or java&q=stemmed:manager

Re: Don't snowball depending on terms

Posted by Rob Brown <ro...@intelcompute.com>.

I guess I could do a bit of pre-processing: look for any words that are
quoted, and search a different field for those.

How is a query like this formulated?

q=unstemmed:perl or java&q=stemmed:manager
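
A minimal sketch of that pre-processing step, assuming two fields named
stemmed and unstemmed that hold the same content: quoted words or phrases
are sent to the unstemmed field and everything else to the stemmed one,
all combined into a single q value. The function name route_terms is
hypothetical, every clause is marked as required with "+", and the sketch
does not escape Lucene special characters or handle unbalanced quotes.

```python
import re

# Either a double-quoted run of characters or a bare run of non-spaces.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def route_terms(user_input, stemmed="stemmed", unstemmed="unstemmed"):
    """Send quoted terms to the unstemmed field, the rest to the stemmed one."""
    clauses = []
    for quoted, bare in TOKEN.findall(user_input):
        if quoted:
            # The user quoted this term: match it exactly as typed.
            clauses.append(f'+{unstemmed}:"{quoted}"')
        else:
            # Unquoted term: allow stemming to widen the match.
            clauses.append(f"+{stemmed}:{bare}")
    return " ".join(clauses)


q = route_terms('"manage" exec')
print(q)
```

Because the routing is decided per term, one keyword can be matched
exactly while another in the same search is still stemmed.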

-----Original Message-----
From: Tomas Zerolo <to...@axelspringer.de>
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Don't snowball depending on terms
Date: Wed, 30 Nov 2011 08:49:37 +0100

On Tue, Nov 29, 2011 at 01:53:44PM -0500, François Schiettecatte wrote:
> It won't and depending on how your analyzer is set up the terms are most likely stemmed at index time.
> 
> You could create a separate field for unstemmed terms though, or use a less aggressive stemmer such as EnglishMinimalStemFilterFactory.

This is surprising to me. Snowball introduces new homonyms, meaning it
will lump e.g. "management" and "manage" into one index entry. Thus,
I'd expect a handful of "false positives" (but usually not too many).

That's a "lossy index" (loosely speaking) and could be fixed by
post-filtering (instead of introducing another index, which in
most cases would seem a waste of resources).

Is there no way in SOLR of filtering the results *after* the index
scan? I'd be disappointed!

Regards
-- tomás

Re: Don't snowball depending on terms

Posted by Tomas Zerolo <to...@axelspringer.de>.

On Tue, Nov 29, 2011 at 01:53:44PM -0500, François Schiettecatte wrote:
> It won't and depending on how your analyzer is set up the terms are most likely stemmed at index time.
> 
> You could create a separate field for unstemmed terms though, or use a less aggressive stemmer such as EnglishMinimalStemFilterFactory.

This is surprising to me. Snowball introduces new homonyms, meaning it
will lump e.g. "management" and "manage" into one index entry. Thus,
I'd expect a handful of "false positives" (but usually not too many).

That's a "lossy index" (loosely speaking) and could be fixed by
post-filtering (instead of introducing another index, which in
most cases would seem a waste of resources).

Is there no way in SOLR of filtering the results *after* the index
scan? I'd be disappointed!
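
For illustration only, post-filtering could be approximated outside Solr
by re-checking the returned documents on the client. The sketch below
assumes each result is a dict whose original text is available under a
key named "text" (a hypothetical name, and it requires the field to be
stored); exact_word_filter is likewise hypothetical. Filtering after the
fact also means hit counts and paging from the search no longer match
what the user sees.

```python
import re


def exact_word_filter(docs, word, text_key="text"):
    """Keep only documents whose text contains `word` as a whole word."""
    # \b anchors stop "manage" from matching inside "management".
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [doc for doc in docs if pattern.search(doc.get(text_key, ""))]


docs = [
    {"id": 1, "text": "We manage several teams."},
    {"id": 2, "text": "Senior management approved it."},
]
kept = exact_word_filter(docs, "manage")
print([doc["id"] for doc in kept])
```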

Regards
-- tomás

Re: Don't snowball depending on terms

Posted by François Schiettecatte <fs...@gmail.com>.

It won't, and depending on how your analyzer is set up, the terms are most likely stemmed at index time.

You could create a separate field for unstemmed terms though, or use a less aggressive stemmer such as EnglishMinimalStemFilterFactory.
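
To see why a separate unstemmed field helps, here is a toy in-memory
model of the idea, not Solr code. The function toy_stem is a crude,
hypothetical stand-in for a real stemmer (it is not the Snowball
algorithm), and the two "fields" are plain dictionaries. In Solr itself
this would typically be set up in the schema, for example by copying the
content into a second field whose analyzer has no stemming filter.

```python
from collections import defaultdict


def toy_stem(token):
    """Crude suffix stripper used only for illustration (NOT Snowball)."""
    for suffix in ("ment", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Index every token twice: once as typed, once stemmed."""
    index = {"stemmed": defaultdict(set), "unstemmed": defaultdict(set)}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index["unstemmed"][token].add(doc_id)
            index["stemmed"][toy_stem(token)].add(doc_id)
    return index


def search(index, field, term):
    """Look up a term, applying the same analysis the field used at index time."""
    term = term.lower()
    key = toy_stem(term) if field == "stemmed" else term
    return sorted(index[field].get(key, set()))


docs = {1: "manage the team", 2: "management team"}
index = build_index(docs)
print(search(index, "stemmed", "manage"))
print(search(index, "unstemmed", "manage"))
```

The stemmed lookup returns both documents because "management" was
reduced to the same key at index time, while the unstemmed lookup can
still tell the two words apart.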

François

On Nov 29, 2011, at 12:33 PM, Robert Brown wrote:

> Is it possible to search a field but not be affected by the snowball filter?
> 
> ie, searching for "manage" is matching "management", but a user may want to restrict results to only containing "manage".
> 
> I was hoping that simply quoting the term would do this, but it doesn't appear to make any difference.