
Posted to solr-user@lucene.apache.org by PeterKerk <ve...@hotmail.com> on 2010/12/03 16:46:21 UTC

finding exact case insensitive matches on single and multiword values


Users call this URL on my site:
/?search=1&city=den+haag
or even /?search=1&city=Den+Haag (casing of the city name can be anything)


Under the hood I call Solr:
http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:den+haag&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city


but this returns 0 results, even though I KNOW there are exactly 54 records
that have an exact match on "den haag" (in this case even with lower casing
in DB).

citynames are stored with various casings in DB, so when searching with
solr, the search must ignore casing.


my schema.xml

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<field name="city" type="string" indexed="true" stored="true"/>


To check what was going on, I opened my analysis.jsp, 

for field <name> I provide: "city"
for Field value (Index)  I provide: "den haag"
When I analyze this I get:
"den haag"

So that seems correct to me. Why is it that no results are returned?

My requirements summarized:
- I want to search independent of case on cityname:
  when a user searches on "DEn HaAG" he will get the records that have the
  value "Den Haag", but also records that have "den haag" etc.
- citynames may consist of multiple words, but only an exact match is valid,
  so when a user searches for "den", he will not find "den haag" records. And
  a search on "den haag" will only return matches on that and not on other
  cities like "den bosch".

How can I achieve this?

I think I need a new fieldtype  in my schema.xml, but am not sure which
tokenizers and analyzers I need, here's what I tried:

<fieldType name="exactmatch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


Help is really appreciated!
-- 
View this message in context: http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2012207.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: finding exact case insensitive matches on single and multiword values

Posted by PeterKerk <ve...@hotmail.com>.

Alright guys, thanks!

I went for storing everything lowercase in the DB. I also lowercase all Solr
queries, and only on the client, for UX purposes, do I apply the proper
casing logic.
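
A minimal sketch of that split, assuming hypothetical helper names
(to_index_form, to_display_form) that are not from the thread: one
normalization is shared by stored values and queries, and a separate,
display-only step restores the casing.

```python
def to_index_form(city: str) -> str:
    # Same normalization for values written to the DB and for Solr queries.
    return city.lower()


def to_display_form(city: str) -> str:
    # Naive title-casing for display only (an assumption); real place
    # names may need a lookup table of exceptions.
    return city.title()


stored = to_index_form("Den HAAG")
print(stored)
print(to_display_form(stored))
```

With a string field this keeps Solr itself unaware of casing: it only ever
sees lowercase values on both sides.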

Thanks for suggestions!

RE: finding exact case insensitive matches on single and multiword values

Posted by Jonathan Rochkind <ro...@jhu.edu>.

ALL Solr queries are case-sensitive.

The trick is in the analyzers. If you downcase everything at index time before you put it in the index, and downcase all queries at query time too, then you get case-insensitive querying. Not because the Solr search algorithms are case-insensitive, but because you've normalized all values to lowercase at both index and query time, so things will match.

You can only do this kind of normalization through analyzers on a Solr text field, not a Solr string field. It's what the Solr text type is for. 

This wiki page, and this question in particular, will be helpful to you:
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
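
As a rough illustration of why this works (plain Python, not Solr code;
analyze, index, and search are hypothetical names), the sketch below
lowercases whole values on both sides while the lookup itself stays an
exact, case-sensitive comparison:

```python
def analyze(value: str) -> str:
    """Mimic a keyword tokenizer plus lowercase filter: the whole value
    stays one token, and only its case is normalized."""
    return value.lower()


# Index time: every stored city value goes through the analyzer.
cities = ["Den Haag", "Den HAAG", "den haag", "den bosch"]
index = {}
for doc_id, city in enumerate(cities):
    index.setdefault(analyze(city), []).append(doc_id)


def search(city: str) -> list:
    # Query time: the same analyzer runs on the user's input; the
    # dictionary lookup is still an exact match on the normalized token.
    return index.get(analyze(city), [])


print(search("DEn HaAG"))  # all three "den haag" variants
print(search("den"))       # partial values do not match
```

Because the whole value is kept as a single token, "den" cannot match
"den haag", which is the exact-match behaviour asked for in the original
question.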
________________________________________
From: PeterKerk [vetteparty@hotmail.com]
Sent: Saturday, December 04, 2010 6:24 AM
To: solr-user@lucene.apache.org
Subject: Re: finding exact case insensitive matches on single and multiword values

Geert-Jan and Erick, thanks!

What I tried first was making it work with the string type; that works
perfectly for all lowercase values!

What I do not understand is how and why I have to make the casing work at
the client, since the casing differs in the database. Right now in the
database I have values for city:
Den Haag
Den HAAG
den haag
den haag

using &fq=city:(den\ haag) gives me 2 results.

So it seems to me that because of the string type this casing issue cannot
be resolved as long as I'm using this fieldtype?


Then to the solution of tweaking the fieldtype for me to work.
I have this right now:

        <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
          <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>

But I find it difficult to test what the result of the filters is, since,
as Erick already mentioned, the result looks correct but really isn't...
Is there some tool where I can add and remove filters to quickly see what
the output will be (without having to reload schema.xml and re-import)?

Re: finding exact case insensitive matches on single and multiword values

Posted by Ahmet Arslan <io...@yahoo.com>.

> Then to the solution of tweaking the fieldtype for me to work.
> I have this right now:
>
> <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>


Additionally you can add TrimFilterFactory to your analyzer chain. 

And instead of escaping white spaces you can use RawQParserPlugin.
&fq={!raw f=city}den haag
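
One caveat worth keeping in mind: the raw query parser builds a term query
without running any analysis, so the value has to be in its indexed form
already. A small sketch of building such a filter on the client, assuming
the field was indexed through a lowercase and trim chain; raw_city_filter
is a hypothetical helper name:

```python
from urllib.parse import urlencode


def raw_city_filter(city: str) -> str:
    # {!raw} performs no analysis, so normalize by hand to mirror what a
    # lowercase + trim analyzer chain would have put in the index.
    return "{!raw f=city}" + city.strip().lower()


# urlencode percent-encodes the braces and turns the space into "+".
params = urlencode({"q": "*:*", "fq": raw_city_filter("  Den Haag ")})
print(params)
```

If query-time analysis is wanted instead, the field query parser
({!field f=city}) runs the field's analyzer on the value, so the manual
lowercasing would not be needed.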

Re: finding exact case insensitive matches on single and multiword values

Posted by PeterKerk <ve...@hotmail.com>.

Geert-Jan and Erick, thanks!

What I tried first was making it work with the string type; that works
perfectly for all lowercase values!

What I do not understand is how and why I have to make the casing work at
the client, since the casing differs in the database. Right now in the
database I have values for city:
Den Haag
Den HAAG
den haag
den haag

using &fq=city:(den\ haag) gives me 2 results.

So it seems to me that because of the string type this casing issue cannot
be resolved as long as I'm using this fieldtype?


Then to the solution of tweaking the fieldtype for me to work.
I have this right now:
	
	<fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
	  <analyzer>
	    <tokenizer class="solr.KeywordTokenizerFactory"/>
	    <filter class="solr.LowerCaseFilterFactory"/>
	  </analyzer>
	</fieldType>

But I find it difficult to test what the result of the filters is, since,
as Erick already mentioned, the result looks correct but really isn't...
Is there some tool where I can add and remove filters to quickly see what
the output will be (without having to reload schema.xml and re-import)?

Re: finding exact case insensitive matches on single and multiword values

Posted by Erick Erickson <er...@gmail.com>.

Arrrgh, Geert-Jan is right, that's at least the 15th time this has tripped
me up.

I'm pretty sure that text will work if you escape the space, e.g.
city:(den\ haag). The debug output is a little confusing since it has a line
like
city:den haag

which almost looks wrong... but it worked out OK on a couple of queries I
tried.
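
A small sketch of doing that escaping on the client (escape_spaces is a
hypothetical helper, and it only handles whitespace, not the other special
characters of the query syntax):

```python
import re


def escape_spaces(term: str) -> str:
    # Backslash-escape each whitespace character so the query parser
    # keeps the value as one term instead of splitting it.
    return re.sub(r"(\s)", r"\\\1", term)


fq = "city:(" + escape_spaces("den haag") + ")"
print(fq)  # city:(den\ haag)
```

The resulting string still has to be URL-encoded before it goes into the
request.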

Geert-Jan is also right in that filters aren't applied to string types,
so there are two possibilities: either handle the casing on the client
side as he suggests and use string, or make the text type work.


Sorry for the confusion
Erick

On Fri, Dec 3, 2010 at 11:54 AM, Geert-Jan Brits <gb...@gmail.com> wrote:

> when you went from strField to TextField in your config you enabled
> tokenizing (which I believe splits on spaces by default),
> which is why you see separate 'words' / terms in the
> debugQuery-explanation.
>
> I believe you want to keep your old strField config and try quoting:
>
> fq=city:"den+haag" or fq=city:"den haag"
>
> Concerning the lower-casing: wouldn't it be easiest to do that at the
> client? (I'm not sure at the moment how to do lowercasing with a strField.)
>
> Geert-jan
>
>
> 2010/12/3 PeterKerk <ve...@hotmail.com>
>
> >
> >
> > You are right, this is what I see when I append the debug query (very very
> > useful btw!!!) in the old situation:
> > <arr name="parsed_filter_queries">
> >        <str>city:den title:haag</str>
> >        <str>PhraseQuery(themes:"hotel en restaur")</str>
> > </arr>
> >
> >
> >
> > I then changed the schema.xml to:
> >
> > <fieldType name="myField" class="solr.TextField" sortMissingLast="true"
> > omitNorms="true">
> > <analyzer>
> >        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> > </analyzer>
> > </fieldType>
> >
> > <field name="city" type="myField" indexed="true" stored="true"/> <!-- used to be "string" -->
> >
> >
> > I then tried adding parentheses:
> >
> >
> http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
> > also tried (without +):
> > http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den
> > haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
> >
> > Then I get:
> >
> > <arr name="parsed_filter_queries">
> >        <str>city:den city:haag</str>
> > </arr>
> >
> > And still 0 results
> >
> > But as you can see the query is split up into 2 separate words, I don't
> > think that is what I need?
> >
> >
> >
>

Re: finding exact case insensitive matches on single and multiword values

Posted by Geert-Jan Brits <gb...@gmail.com>.

when you went from strField to TextField in your config you enabled
tokenizing (which I believe splits on spaces by default),
which is why you see separate 'words' / terms in the debugQuery-explanation.

I believe you want to keep your old strField config and try quoting:

fq=city:"den+haag" or fq=city:"den haag"

Concerning the lower-casing: wouldn't it be easiest to do that at the
client? (I'm not sure at the moment how to do lowercasing with a strField.)

Geert-jan


2010/12/3 PeterKerk <ve...@hotmail.com>

>
>
> You are right, this is what I see when I append the debug query (very very
> useful btw!!!) in the old situation:
> <arr name="parsed_filter_queries">
>        <str>city:den title:haag</str>
>        <str>PhraseQuery(themes:"hotel en restaur")</str>
> </arr>
>
>
>
> I then changed the schema.xml to:
>
> <fieldType name="myField" class="solr.TextField" sortMissingLast="true"
> omitNorms="true">
> <analyzer>
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
>
> <field name="city" type="myField" indexed="true" stored="true"/> <!-- used
> to be "string" -->
>
>
> I then tried adding parentheses:
>
> http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
> also tried (without +):
> http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den
> haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
>
> Then I get:
>
> <arr name="parsed_filter_queries">
>        <str>city:den city:haag</str>
> </arr>
>
> And still 0 results
>
> But as you can see the query is split up into 2 separate words, I don't
> think that is what I need?
>
>
>

Re: finding exact case insensitive matches on single and multiword values

Posted by PeterKerk <ve...@hotmail.com>.


You are right, this is what I see when I append the debug query (very very
useful btw!!!) in the old situation:
<arr name="parsed_filter_queries">
	<str>city:den title:haag</str>
	<str>PhraseQuery(themes:"hotel en restaur")</str>
</arr>



I then changed the schema.xml to:

<fieldType name="myField" class="solr.TextField" sortMissingLast="true"
omitNorms="true"> 
<analyzer> 
	<tokenizer class="solr.KeywordTokenizerFactory"/> 
	<filter class="solr.LowerCaseFilterFactory"/> 
</analyzer> 
</fieldType> 
	
<field name="city" type="myField" indexed="true" stored="true"/> <!-- used
to be "string" -->


I then tried adding parentheses:
http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
also tried (without +):
http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den
haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city

Then I get:

<arr name="parsed_filter_queries">
	<str>city:den city:haag</str>
</arr>

And still 0 results

But as you can see the query is split up into 2 separate words, I don't think
that is what I need?



Re: finding exact case insensitive matches on single and multiword values

Posted by Erick Erickson <er...@gmail.com>.

The root of your problem, I think, is fq=city:den+haag, which parses into
city:den defaultfield:haag
(the + in the URL is just an encoded space, so haag is searched in the
default field).

Try parens, i.e. city:(den haag).

Attaching &debugQuery=on is often a way to see things like this quickly....
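
To see what the query parser actually receives, the URL decoding step can
be reproduced with Python's standard library. This is only a sketch of the
decoding and encoding, not of Solr's parsing:

```python
from urllib.parse import parse_qs, quote

# In a URL query string "+" is an encoded space, so the filter arrives
# as two whitespace-separated clauses; only the first names the field.
decoded = parse_qs("fq=city:den+haag")["fq"][0]
print(decoded)  # city:den haag

# Percent-encoding the parenthesized form for use in a request URL.
encoded = quote("city:(den haag)")
print(encoded)
```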

Also, if you haven't seen the analysis page from the admin page, it's really
valuable for figuring out the effects of analyzers. You can probably do
something like:

<fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

to get what you want.

Best
Erick

On Fri, Dec 3, 2010 at 10:46 AM, PeterKerk <ve...@hotmail.com> wrote:

>
>
> Users call this URL on my site:
> /?search=1&city=den+haag
> or even /?search=1&city=Den+Haag (casing of the city name can be anything)
>
>
> Under the hood I call Solr:
>
> http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:den+haag&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
>
>
> but this returns 0 results, even though I KNOW there are exactly 54 records
> that have an exact match on "den haag" (in this case even with lower casing
> in DB).
>
> citynames are stored with various casings in DB, so when searching with
> solr, the search must ignore casing.
>
>
> my schema.xml
>
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> omitNorms="true" />
> <field name="city" type="string" indexed="true" stored="true"/>
>
>
> To check what was going on, I opened my analysis.jsp,
>
> for field <name> I provide: "city"
> for Field value (Index)  I provide: "den haag"
> When I analyze this I get:
> "den haag"
>
> So that seems correct to me. Why is it that no results are returned?
>
> My requirements summarized:
> - I want to search independent of case on cityname:
>   when a user searches on "DEn HaAG" he will get the records that have the
>   value "Den Haag", but also records that have "den haag" etc.
> - citynames may consist of multiple words, but only an exact match is valid,
>   so when a user searches for "den", he will not find "den haag" records. And
>   a search on "den haag" will only return matches on that and not on other
>   cities like "den bosch".
>
> How can I achieve this?
>
> I think I need a new fieldtype  in my schema.xml, but am not sure which
> tokenizers and analyzers I need, here's what I tried:
>
> <fieldType name="exactmatch" class="solr.TextField"
> positionIncrementGap="100" >
>  <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_dutch.txt" />
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldType>
>
>
> Help is really appreciated!
>