Posted to solr-user@lucene.apache.org by Matt Mitchell <go...@gmail.com> on 2010/10/08 17:22:36 UTC
dynamic "stop" words?
Is it possible to have certain query terms not affect the score, if that
same query term is present in a field? For example, I have an index of
hotels. Each hotel has a name and city. If the name of a hotel has the
name of the city in its "name" field, I want to completely ignore
that and not have it influence the score.
Example:
Doc 1
name => "Holiday Inn"
city => "Denver"
Doc 2
name => "Holiday Inn, Denver"
city => "Denver"
q=name:(Holiday Inn, Denver)
I'd like those docs to have the same score in the response. I don't
want Doc2 to have a higher score, just because it has all of the query
terms.
Is this possible without using stop words? I hope this makes sense!
Thanks,
Matt
Re: dynamic "stop" words?
Posted by Matt Mitchell <go...@gmail.com>.
Great, thanks Hoss. I'll try dismax out today and see what happens with this.
Matt
On Tue, Oct 12, 2010 at 7:35 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Is it possible to have certain query terms not effect score, if that
> : same query term is present in a field? For example, I have an index of
>
> that use case is precisely what the DisjunctionMaxQuery (generated by
> the dismax parser) does for you if you set the "tie" param to "0"
>
> when one of the words in query results in a high score in fieldA, the
> contribution to the score from that word in all of the other fields is
> ignored (the "tie" attribute is multiplied by the score of all the fields
> that are not the "max" score contribution)
>
>
> -Hoss
>
Re: dynamic "stop" words?
Posted by Chris Hostetter <ho...@fucit.org>.
: Is it possible to have certain query terms not effect score, if that
: same query term is present in a field? For example, I have an index of
That use case is precisely what the DisjunctionMaxQuery (generated by
the dismax parser) handles for you if you set the "tie" param to "0".
When one of the words in the query results in a high score in fieldA, the
contribution to the score from that word in all of the other fields is
ignored (the score of every field that is not the "max" score
contribution is multiplied by the "tie" value).
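As a rough illustration of the rule described above (a sketch only, not Solr's actual implementation), the score of one query term can be modelled in Python as the best field score plus "tie" times the sum of the other field scores. The function name and the sample scores are made up for the example.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine one query term's per-field scores, dismax style.

    field_scores: dict mapping field name -> score for this term
                  (hypothetical values, not real Lucene scores).
    tie: multiplier applied to every field score that is not the maximum.
    """
    scores = list(field_scores.values())
    if not scores:
        return 0.0
    best = max(scores)
    # Everything except the single best contribution.
    others = sum(scores) - best
    return best + tie * others


# The term "denver" scored against the two example docs (invented scores).
doc1 = {"city": 2.0}               # "Denver" appears only in city
doc2 = {"city": 2.0, "name": 1.0}  # "Denver" appears in city and in name

print(dismax_score(doc1, tie=0.0))  # 2.0
print(dismax_score(doc2, tie=0.0))  # 2.0, the extra name match is ignored
print(dismax_score(doc2, tie=0.5))  # 2.5, the name match counts at half weight
```

On the Solr side this would presumably correspond to a request along the lines of defType=dismax with qf listing both "name" and "city" and tie=0; the exact field list is an assumption, since the thread does not spell it out.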
-Hoss
Re: dynamic "stop" words?
Posted by Matt Mitchell <go...@gmail.com>.
Exactly yep. I think that'll work nicely. Thanks Jonathan,
Matt
On Tue, Oct 12, 2010 at 9:47 AM, Jonathan Rochkind <ro...@jhu.edu> wrote:
> You can identify what words are the city name at index time, because they're the ones in the "city" field, right? So why not just strip those words out at index time? Create a new field, name_search, and search on that, not name.
>
> Doc 1
> name => "Holiday Inn"
> name_search => "Holiday Inn" [analyzed, perhaps lowercase normalized etc]
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> name_search => "Holiday Inn"
> city => "Denver"
>
> Jonathan
RE: dynamic "stop" words?
Posted by Jonathan Rochkind <ro...@jhu.edu>.
You can identify what words are the city name at index time, because they're the ones in the "city" field, right? So why not just strip those words out at index time? Create a new field, name_search, and search on that, not name.
Doc 1
name => "Holiday Inn"
name_search => "Holiday Inn" [analyzed, perhaps lowercase normalized etc]
city => "Denver"
Doc 2
name => "Holiday Inn, Denver"
name_search => "Holiday Inn"
city => "Denver"
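A minimal sketch of that index-time step, assuming the documents are prepared in client code before being sent to Solr. The helper name strip_city is hypothetical, and the matching is a simple case-insensitive whole-word removal that will not handle spelling variants of a city name.

```python
import re


def strip_city(name, city):
    """Return name with whole-word occurrences of city removed."""
    if not city or not city.strip():
        return name
    # Case-insensitive, whole-word match on the literal city string.
    pattern = re.compile(r"\b" + re.escape(city.strip()) + r"\b", re.IGNORECASE)
    stripped = pattern.sub(" ", name)
    # Collapse the whitespace left behind, then trim stray separators.
    stripped = re.sub(r"\s+", " ", stripped)
    return stripped.strip(" ,;-")


docs = [
    {"name": "Holiday Inn", "city": "Denver"},
    {"name": "Holiday Inn, Denver", "city": "Denver"},
]
for doc in docs:
    # name stays as the display value; name_search is what gets searched.
    doc["name_search"] = strip_city(doc["name"], doc["city"])
    print(doc["name_search"])  # "Holiday Inn" for both docs
```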
Jonathan
________________________________________
From: Matt Mitchell [goodieboy@gmail.com]
Sent: Tuesday, October 12, 2010 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: dynamic "stop" words?
Thanks for the feedback. I thought about stop words but since I have a
lot of documents spanning lots of different countries, I won't know
all of the possible cities so stop-words could get hard to manage.
Also, the city name is in the same field. I think I might try creating
a new field called name_no_city, and at index time just strip the city
name out?
Matt
Re: dynamic "stop" words?
Posted by Matt Mitchell <go...@gmail.com>.
Thanks for the feedback. I thought about stop words but since I have a
lot of documents spanning lots of different countries, I won't know
all of the possible cities so stop-words could get hard to manage.
Also, the city name is in the same field. I think I might try creating
a new field called name_no_city, and at index time just strip the city
name out?
Matt
On Sat, Oct 9, 2010 at 11:17 AM, Geert-Jan Brits <gb...@gmail.com> wrote:
> That might work, although depending on your use-case it might be hard to
> have a good controlled vocab on citynames (hotel metropole bruxelles, hotel
> metropole brussels, hotel metropole brussel, etc.) Also 'hotel paris
> bruxelles' stinks...
>
> given your example:
>
>> Doc 1
>> name => "Holiday Inn"
>> city => "Denver"
>>
>> Doc 2
>> name => "Holiday Inn, Denver"
>> city => "Denver"
>>
>> q=name:(Holiday Inn, Denver)
>
> turning it upside down, perhaps an alternative would be to query on:
> q=name:Holiday Inn+city:Denver
>
> and configure field 'name' in such a way that doc1 and doc2 score the same.
> I believe that must be possible, just not sure how to config it exactly at
> the moment.
>
> Of course, it depends on your scenario if you have enough knowlegde on the
> clientside to transform:
> q=name:(Holiday Inn, Denver) to q=name:Holiday Inn+city:Denver
>
> Hth,
> Geert-Jan
Re: dynamic "stop" words?
Posted by Geert-Jan Brits <gb...@gmail.com>.
That might work, although depending on your use-case it might be hard to
have a good controlled vocab on citynames (hotel metropole bruxelles, hotel
metropole brussels, hotel metropole brussel, etc.) Also 'hotel paris
bruxelles' stinks...
given your example:
> Doc 1
> name => "Holiday Inn"
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> city => "Denver"
>
> q=name:(Holiday Inn, Denver)
turning it upside down, perhaps an alternative would be to query on:
q=name:Holiday Inn+city:Denver
and configure field 'name' in such a way that doc1 and doc2 score the same.
I believe that must be possible, just not sure how to config it exactly at
the moment.
Of course, it depends on your scenario whether you have enough knowledge on
the client side to transform:
q=name:(Holiday Inn, Denver) to q=name:Holiday Inn+city:Denver
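If the client does already know which part of the input is the hotel name and which is the city, the rewritten query could be assembled roughly like this. This is only a sketch: the helper name is hypothetical, the grouping with parentheses and quotes is an assumption about the intended query, and user input is not escaped for Lucene special characters.

```python
from urllib.parse import urlencode, parse_qs


def build_hotel_query(name_terms, city):
    """Build a URL-encoded q parameter that searches name and city separately."""
    # Parentheses keep every name term on the "name" field;
    # quotes keep a multi-word city together as a phrase.
    q = 'name:({}) city:"{}"'.format(name_terms, city)
    return urlencode({"q": q})


params = build_hotel_query("Holiday Inn", "Denver")
# Decode again to show the query string Solr would receive.
print(parse_qs(params)["q"][0])  # name:(Holiday Inn) city:"Denver"
```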
Hth,
Geert-Jan
2010/10/9 Otis Gospodnetic <ot...@yahoo.com>
> Matt,
>
> The first thing that came to my mind is that this might be interesting to
> try
> with a dictionary (of city names) if this example is not a made-up one.
>
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
Re: dynamic "stop" words?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Matt,
The first thing that came to my mind is that this might be interesting to try
with a dictionary (of city names) if this example is not a made-up one.
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
----- Original Message ----
> From: Matt Mitchell <go...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, October 8, 2010 11:22:36 AM
> Subject: dynamic "stop" words?
>
> Is it possible to have certain query terms not effect score, if that
> same query term is present in a field? For example, I have an index of
> hotels. Each hotel has a name and city. If the name of a hotel has the
> name of the city in it's "name" field, I want to completely ignore
> that and not have it influence score.
>
> Example:
>
> Doc 1
> name => "Holiday Inn"
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> city => "Denver"
>
> q=name:(Holiday Inn, Denver)
>
> I'd like those docs to have the same score in the response. I don't
> want Doc2 to have a higher score, just because it has all of the query
> terms.
>
> Is this possible without using stop words? I hope this makes sense!
>
> Thanks,
> Matt
>
Re: Accented Search in Solr
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Param,
Note that the original value will be stored even if ISOLatin1AccentFilter
removes the accent for indexing / matching purposes.
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
----- Original Message ----
> From: "Sethi, Parampreet" <pa...@teamaol.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Fri, October 8, 2010 11:33:02 AM
> Subject: Accented Search in Solr
>
> Hi All,
>
> I am using Solr 1.3 in my project. Just wanted to know if there is any other
>way by which below mentioned queries will return the same results:
>
> Gruyère-and-Zucchini
> Gruyere-and-Zucchini
>
> The first query has accented characters in it. I was just going through the
>Solr tokenizers and filter factories documentation, there is a filter factory
>listed "solr.ISOLatin1AccentFilterFactory" that can be used to replace accented
>characters with their non-accented counterparts.
>
> Is there any other way to do this search which is independent of how data is
>stored (whether in accented or non-accented form)?
>
> Thanks for the help.
>
> Regards,
> param
>
Re: Accented Search in Solr
Posted by Erick Erickson <er...@gmail.com>.
Not that I know of. Do note that the accent filter must be applied
consistently at index time and at query time. In other words, if you
indexed with the filter but search without it, or vice versa, you won't
get the results you expect.
Also note that no matter what, the original text (without the filter
applied) is what's *stored*, untokenized. This is entirely independent of
what's *indexed*, even though both options are specified for the same field.
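To make the stored/indexed distinction concrete, here is a small Python sketch. It uses Unicode decomposition from the standard library as a stand-in for accent folding; it is not the ISOLatin1AccentFilter itself, and the function and variable names are invented for the example.

```python
import unicodedata


def fold_accents(text):
    """Replace accented characters with their unaccented base characters."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks that NFD splits off from the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


stored_value = "Gruyère-and-Zucchini"       # returned to the user unchanged
indexed_value = fold_accents(stored_value)  # what queries are matched against

for query in ("Gruyère-and-Zucchini", "Gruyere-and-Zucchini"):
    # Folding the query the same way as the indexed text makes both forms match.
    print(fold_accents(query) == indexed_value)  # True for both queries
```

If the query side skipped the folding step, only the unaccented query would equal indexed_value, which is the mismatch described above.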
If this is irrelevant, what are you really trying to accomplish? This may be
an "xy" problem, see:
http://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue. Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341
Erick
On Fri, Oct 8, 2010 at 11:33 AM, Sethi, Parampreet <
parampreet.sethi@teamaol.com> wrote:
> Hi All,
>
> I am using Solr 1.3 in my project. Just wanted to know if there is any
> other way by which below mentioned queries will return the same results:
>
> Gruyère-and-Zucchini
> Gruyere-and-Zucchini
>
> The first query has accented characters in it. I was just going through the
> Solr tokenizers and filter factories documentation, there is a filter
> factory listed "solr.ISOLatin1AccentFilterFactory" that can be used to
> replace accented characters with their non-accented counterparts.
>
> Is there any other way to do this search which is independent of how data
> is stored (whether in accented or non-accented form)?
>
> Thanks for the help.
>
> Regards,
> param
>
Re: Accented Search in Solr
Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Accented Search in Solr
: References: <AA...@mail.gmail.com>
: In-Reply-To: <AA...@mail.gmail.com>
http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists
When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.
See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
-Hoss
Accented Search in Solr
Posted by "Sethi, Parampreet" <pa...@teamaol.com>.
Hi All,
I am using Solr 1.3 in my project. I just wanted to know whether there is any other way to make the queries below return the same results:
Gruyère-and-Zucchini
Gruyere-and-Zucchini
The first query has accented characters in it. I was just going through the Solr tokenizers and filter factories documentation, and there is a filter factory listed, "solr.ISOLatin1AccentFilterFactory", that can be used to replace accented characters with their non-accented counterparts.
Is there any other way to do this search which is independent of how data is stored (whether in accented or non-accented form)?
Thanks for the help.
Regards,
param