Posted to solr-user@lucene.apache.org by Matt Mitchell <go...@gmail.com> on 2010/10/08 17:22:36 UTC

dynamic "stop" words?

Is it possible to have certain query terms not affect the score, if that
same query term is present in a field? For example, I have an index of
hotels. Each hotel has a name and city. If the name of a hotel has the
name of the city in its "name" field, I want to completely ignore
that and not have it influence the score.

Example:

Doc 1
name => "Holiday Inn"
city => "Denver"

Doc 2
name => "Holiday Inn, Denver"
city => "Denver"

q=name:(Holiday Inn, Denver)

I'd like those docs to have the same score in the response. I don't
want Doc2 to have a higher score, just because it has all of the query
terms.

Is this possible without using stop words? I hope this makes sense!

Thanks,
Matt

Re: dynamic "stop" words?

Posted by Matt Mitchell <go...@gmail.com>.
Great, thanks Hoss. I'll try dismax out today and see what happens with this.

Matt

On Tue, Oct 12, 2010 at 7:35 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Is it possible to have certain query terms not affect the score, if that
> : same query term is present in a field? For example, I have an index of
>
> That use case is precisely what the DisjunctionMaxQuery (generated by
> the dismax parser) does for you if you set the "tie" param to "0".
>
> When one of the words in the query results in a high score in fieldA, the
> contribution to the score from that word in all of the other fields is
> ignored (the "tie" value is multiplied by the scores of all the fields
> that are not the "max" score contribution).
>
>
> -Hoss
>

Re: dynamic "stop" words?

Posted by Chris Hostetter <ho...@fucit.org>.
: Is it possible to have certain query terms not affect the score, if that
: same query term is present in a field? For example, I have an index of

That use case is precisely what the DisjunctionMaxQuery (generated by
the dismax parser) does for you if you set the "tie" param to "0".

When one of the words in the query results in a high score in fieldA, the
contribution to the score from that word in all of the other fields is
ignored (the "tie" value is multiplied by the scores of all the fields
that are not the "max" score contribution).
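
As a rough sketch of the arithmetic described above (not Solr's actual implementation): a disjunction-max query scores a term as the best per-field score plus "tie" times the sum of the other per-field scores. The per-field numbers below are invented for illustration, and the request parameters assume the term is searched across both fields with qf=name city, which is an assumption and not something stated in the thread. Real scores also depend on field norms and idf, so the name-only terms may still score slightly differently between the two docs.

```python
from urllib.parse import urlencode


def dismax_score(field_scores, tie=0.0):
    """Best per-field score plus `tie` times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical per-field scores for the single query term "denver".
doc1_scores = [0.0, 0.8]  # name has no match, city matches
doc2_scores = [0.5, 0.8]  # name matches and city matches

# With tie=0 only the best field counts, so both docs get the same
# contribution from "denver"; with tie=1 the fields are simply summed.
for tie in (0.0, 1.0):
    print(tie, dismax_score(doc1_scores, tie), dismax_score(doc2_scores, tie))

# Assumed request parameters for trying this against the dismax parser.
params = {"defType": "dismax", "q": "Holiday Inn Denver", "qf": "name city", "tie": "0"}
query_string = urlencode(params)
print(query_string)
```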


-Hoss

Re: dynamic "stop" words?

Posted by Matt Mitchell <go...@gmail.com>.
Exactly yep. I think that'll work nicely. Thanks Jonathan,

Matt

On Tue, Oct 12, 2010 at 9:47 AM, Jonathan Rochkind <ro...@jhu.edu> wrote:
> You can identify which words are the city name at index time, because they're the ones in the "city" field, right? So why not just strip those words out at index time? Create a new field, name_search, and search on that, not name.
>
> Doc 1
> name => "Holiday Inn"
> name_search => "Holiday Inn"   [analyzed, perhaps lowercase normalized etc]
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> name_search => "Holiday Inn"
> city => "Denver"
>
> Jonathan

RE: dynamic "stop" words?

Posted by Jonathan Rochkind <ro...@jhu.edu>.
You can identify which words are the city name at index time, because they're the ones in the "city" field, right? So why not just strip those words out at index time? Create a new field, name_search, and search on that, not name.

Doc 1
name => "Holiday Inn"
name_search => "Holiday Inn"   [analyzed, perhaps lowercase normalized etc]
city => "Denver"

Doc 2
name => "Holiday Inn, Denver"
name_search => "Holiday Inn"
city => "Denver"
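
A minimal sketch of that index-time step, assuming the documents pass through some client-side code before being sent to Solr. The helper names strip_city and build_doc are hypothetical. Matching is word-by-word and case-insensitive, so spelling variants (bruxelles vs. brussels) and multi-word city names are not handled specially.

```python
import re


def strip_city(name, city):
    """Return `name` with every word that also occurs in `city` removed."""
    city_words = {w.lower() for w in re.findall(r"\w+", city)}
    kept = [w for w in re.findall(r"\w+", name) if w.lower() not in city_words]
    return " ".join(kept)


def build_doc(name, city):
    """Keep the original name for display; search on name_search instead."""
    return {"name": name, "name_search": strip_city(name, city), "city": city}


for name, city in [("Holiday Inn", "Denver"), ("Holiday Inn, Denver", "Denver")]:
    print(build_doc(name, city))
```

Both example docs end up with the same name_search value, which is what lets them score alike when only that field is queried.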

Jonathan

________________________________________
From: Matt Mitchell [goodieboy@gmail.com]
Sent: Tuesday, October 12, 2010 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: dynamic "stop" words?

Thanks for the feedback. I thought about stop words but since I have a
lot of documents spanning lots of different countries, I won't know
all of the possible cities so stop-words could get hard to manage.
Also, the city name is in the same field. I think I might try creating
a new field called name_no_city, and at index time just strip the city
name out?

Matt

Re: dynamic "stop" words?

Posted by Matt Mitchell <go...@gmail.com>.
Thanks for the feedback. I thought about stop words but since I have a
lot of documents spanning lots of different countries, I won't know
all of the possible cities so stop-words could get hard to manage.
Also, the city name is in the same field. I think I might try creating
a new field called name_no_city, and at index time just strip the city
name out?

Matt

On Sat, Oct 9, 2010 at 11:17 AM, Geert-Jan Brits <gb...@gmail.com> wrote:
> That might work, although depending on your use-case it might be hard to
> have a good controlled vocab on citynames (hotel metropole bruxelles, hotel
> metropole brussels, hotel metropole brussel, etc.)  Also 'hotel paris
> bruxelles' stinks...
>
> given your example:
>
>> Doc 1
>> name => "Holiday Inn"
>> city => "Denver"
>>
>> Doc 2
>> name => "Holiday Inn, Denver"
>> city => "Denver"
>>
>> q=name:(Holiday Inn, Denver)
>
> turning it upside down, perhaps an alternative would be to query on:
> q=name:Holiday Inn+city:Denver
>
> and configure field 'name' in such a way that doc1 and doc2 score the same.
> I believe that must be possible, just not sure how to config it exactly at
> the moment.
>
> Of course, it depends on your scenario whether you have enough knowledge on the
> client side to transform:
> q=name:(Holiday Inn, Denver)  to   q=name:Holiday Inn+city:Denver
>
> Hth,
> Geert-Jan

Re: dynamic "stop" words?

Posted by Geert-Jan Brits <gb...@gmail.com>.
That might work, although depending on your use-case it might be hard to
have a good controlled vocab on citynames (hotel metropole bruxelles, hotel
metropole brussels, hotel metropole brussel, etc.)  Also 'hotel paris
bruxelles' stinks...

given your example:

> Doc 1
> name => "Holiday Inn"
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> city => "Denver"
>
> q=name:(Holiday Inn, Denver)

turning it upside down, perhaps an alternative would be to query on:
q=name:Holiday Inn+city:Denver

and configure field 'name' in such a way that doc1 and doc2 score the same.
I believe that must be possible, just not sure how to config it exactly at
the moment.

Of course, it depends on your scenario whether you have enough knowledge on the
client side to transform:
q=name:(Holiday Inn, Denver)  to   q=name:Holiday Inn+city:Denver
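
One way such a client-side transformation could look, assuming the client has a set of known city names available (the dictionary idea mentioned elsewhere in this thread). The function names and the known_cities set are hypothetical. The sketch only handles single-word city names and does no escaping of query-syntax characters.

```python
def split_query(user_query, known_cities):
    """Split free text into name words and city words using a lowercase city set."""
    words = user_query.replace(",", " ").split()
    name_words = [w for w in words if w.lower() not in known_cities]
    city_words = [w for w in words if w.lower() in known_cities]
    return name_words, city_words


def to_fielded_query(user_query, known_cities):
    """Build a fielded query string with separate name and city clauses."""
    name_words, city_words = split_query(user_query, known_cities)
    clauses = []
    if name_words:
        clauses.append("name:(" + " ".join(name_words) + ")")
    if city_words:
        clauses.append("city:(" + " ".join(city_words) + ")")
    return " AND ".join(clauses)


known_cities = {"denver"}  # assumed vocabulary, for illustration only
print(to_fielded_query("Holiday Inn, Denver", known_cities))
```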

Hth,
Geert-Jan

2010/10/9 Otis Gospodnetic <ot...@yahoo.com>

> Matt,
>
> The first thing that came to my mind is that this might be interesting to
> try
> with a dictionary (of city names) if this example is not a made-up one.
>
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>

Re: dynamic "stop" words?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Matt,

The first thing that came to my mind is that this might be interesting to try 
with a dictionary (of city names) if this example is not a made-up one.


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Matt Mitchell <go...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, October 8, 2010 11:22:36 AM
> Subject: dynamic "stop" words?
> 
> Is it possible to have certain query terms not affect the score, if that
> same query term is present in a field? For example, I have an index of
> hotels. Each hotel has a name and city. If the name of a hotel has the
> name of the city in its "name" field, I want to completely ignore
> that and not have it influence the score.
>
> Example:
>
> Doc 1
> name => "Holiday Inn"
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> city => "Denver"
>
> q=name:(Holiday Inn, Denver)
>
> I'd like those docs to have the same score in the response. I don't
> want Doc2 to have a higher score, just because it has all of the query
> terms.
>
> Is this possible without using stop words? I hope this makes sense!
> 
> Thanks,
> Matt
> 

Re: Accented Search in Solr

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Param,

Note that the original value will be stored even if ISOLatin1AccentFilter
removes the accent for indexing / matching purposes.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: "Sethi, Parampreet" <pa...@teamaol.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Fri, October 8, 2010 11:33:02 AM
> Subject: Accented Search in Solr
> 
> Hi All,
> 
> I am using Solr 1.3 in my project. Just wanted to know if there is any other
> way by which below mentioned queries will return the same results:
>
>  Gruyère-and-Zucchini
>  Gruyere-and-Zucchini
>
> The first query has accented characters in it. I was just going through the
> Solr tokenizers and filter factories documentation, there is a filter factory
> listed "solr.ISOLatin1AccentFilterFactory" that can be used to replace accented
> characters with their non-accented counterparts.
>
> Is there any other way to do this search which is independent of how data is
> stored (whether in accented or non-accented form)?
>
> Thanks for the help.
> 
> Regards,
> param
> 

Re: Accented Search in Solr

Posted by Erick Erickson <er...@gmail.com>.
Not that I know of. Do note that whether the query has the accent filter
active or not MUST be matched with the index-time filter. In other words,
if you indexed with the filter but search without it, or vice versa, you
won't get the results you expect.

Also note that no matter what, the original text (without the filter
applied) is what's #stored#, untokenized. This is entirely independent of
what's #indexed#, even though both options are specified for the same field.
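
To make the index-time/query-time symmetry concrete, here is a small stand-alone sketch. It is not Solr's ISOLatin1AccentFilter; it uses Unicode decomposition as a stand-in for accent folding, and the tokenize, analyze, and matches helpers are hypothetical. The stored value is kept as-is while only the indexed terms are folded.

```python
import unicodedata


def fold_accents(text):
    """Drop combining marks after canonical decomposition (e.g. è becomes e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase and split on hyphens and whitespace."""
    return text.lower().replace("-", " ").split()


def analyze(text, fold=True):
    """Tokenize, then optionally fold accents on every token."""
    tokens = tokenize(text)
    return [fold_accents(t) for t in tokens] if fold else tokens


stored = "Gruyère-and-Zucchini"                    # original text, returned as-is
indexed_terms = set(analyze(stored, fold=True))    # what matching actually uses


def matches(query, fold_query):
    """True if every analyzed query term is among the indexed terms."""
    return all(t in indexed_terms for t in analyze(query, fold=fold_query))


print(matches("Gruyere-and-Zucchini", fold_query=True))
print(matches("Gruyère-and-Zucchini", fold_query=True))
print(matches("Gruyère-and-Zucchini", fold_query=False))
```

With folding applied on both sides, the accented and unaccented queries hit the same indexed terms; with folding only at index time, the accented query no longer matches.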

If this is irrelevant, what are you really trying to accomplish? This may be
an "xy" problem, see:
http://people.apache.org/~hossman/#xyproblem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

Erick


On Fri, Oct 8, 2010 at 11:33 AM, Sethi, Parampreet <
parampreet.sethi@teamaol.com> wrote:

> Hi All,
>
> I am using Solr 1.3 in my project. Just wanted to know if there is any
> other way by which below mentioned queries will return the same results:
>
>  Gruyère-and-Zucchini
>  Gruyere-and-Zucchini
>
> The first query has accented characters in it. I was just going through the
> Solr tokenizers and filter factories documentation, there is a filter
> factory listed "solr.ISOLatin1AccentFilterFactory" that can be used to
> replace accented characters with their non-accented counterparts.
>
> Is there any other way to do this search which is independent of how data
> is stored (whether in accented or non-accented form)?
>
> Thanks for the help.
>
> Regards,
> param
>

Re: Accented Search in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Accented Search in Solr
: References: <AA...@mail.gmail.com>
: In-Reply-To: <AA...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss

Accented Search in Solr

Posted by "Sethi, Parampreet" <pa...@teamaol.com>.
Hi All,

I am using Solr 1.3 in my project. Just wanted to know if there is any other way to make the two queries below return the same results:

 Gruyère-and-Zucchini
 Gruyere-and-Zucchini

The first query has accented characters in it. I was just going through the Solr tokenizers and filter factories documentation, and there is a filter factory listed, "solr.ISOLatin1AccentFilterFactory", that can be used to replace accented characters with their non-accented counterparts.

Is there any other way to do this search which is independent of how data is stored (whether in accented or non-accented form)?

Thanks for the help.

Regards,
param