Posted to solr-user@lucene.apache.org by jo...@aol.com on 2013/10/24 15:20:06 UTC
Searching on special characters
Hi,
How should I set up Solr so that I can search and get hits on special characters such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
My need is, if a user has text like so:
Doc-#1: "(Solr)"
Doc-#2: "Solr"
And they type "(solr)", I want a hit on "(solr)" only in document #1, with the brackets matching. And if they type "solr", they should get a hit in document #2 only.
An additional nice-to-have: optionally, typing "solr" would give a hit in both document #1 and #2.
Here is what my current schema.xml looks like:
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" splitOnNumerics="1" stemEnglishPossessive="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
Currently, special characters are being stripped.
Any idea how I can configure Solr to do this? I'm using Solr 3.6.
Thanks !!
-MJ
Re: Searching on special characters
Posted by jo...@aol.com.
I'm not sure what you mean. Based on what you are saying, is there an example of how I can set up my schema.xml to get the result I need?
Also, the way I execute a search is by using http://localhost:8080/solr/select/?q=<search-term>. Does your solution require me to change this? If so, in what way?
It would be great if all this were documented somewhere, so I won't have to bug you guys!!!
--MJ
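On the question of whether the select URL has to change: one possibility, sketched here under assumptions, is to leave the URL alone and set the query defaults in solrconfig.xml instead. The handler name and default="true" mirror the stock Solr 3.x example config and may differ in your install; the field names (content, content_raw, content_exact_lc, content_stripped_lc) and the boost values are hypothetical placeholders for the copies described in the reply quoted below.

```xml
<!-- solrconfig.xml fragment: a sketch, not a tested configuration. -->
<!-- Field names and boost values are hypothetical. -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- edismax lets one query string search several fields at once -->
    <str name="defType">edismax</str>
    <!-- highest boost on the raw copy, lowest on the general text field -->
    <str name="qf">content_raw^10 content_exact_lc^8 content_stripped_lc^4 content^1</str>
  </lst>
</requestHandler>
```

Note that ( ) and the other characters listed in the question are query-parser syntax, so the search term itself would still need to be wrapped in double quotes or backslash-escaped (for example \(solr\)) and then URL-encoded. Also, boosting only affects ranking: with all four fields in qf, "(solr)" would still match document #2 through the stripped copy, just with a lower score. To get a hit in document #1 only, the query would have to be restricted to the raw or lowercased single-token field.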
-----Original Message-----
From: Jack Krupansky <ja...@basetechnology.com>
To: solr-user <so...@lucene.apache.org>
Sent: Thu, Oct 24, 2013 9:39 am
Subject: Re: Searching on special characters
Keep two or three copies of the text in separate fields. One field could be
a raw string, boosted heavily for exact matches. A second could be a text
field using the keyword tokenizer plus a lowercase filter, also heavily
boosted. The third would be general tokenized text with a lower boost. You
could also add a copy that uses the keyword tokenizer to maintain a single
token but also applies a regex filter to strip special characters, followed
by a lowercase filter, and give that copy an intermediate boost.
-- Jack Krupansky
Re: Searching on special characters
Posted by Jack Krupansky <ja...@basetechnology.com>.
Keep two or three copies of the text in separate fields. One field could be
a raw string, boosted heavily for exact matches. A second could be a text
field using the keyword tokenizer plus a lowercase filter, also heavily
boosted. The third would be general tokenized text with a lower boost. You
could also add a copy that uses the keyword tokenizer to maintain a single
token but also applies a regex filter to strip special characters, followed
by a lowercase filter, and give that copy an intermediate boost.
-- Jack Krupansky
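For readers who want to see the suggestion above as configuration, here is a minimal schema.xml sketch. It is an illustration under assumptions, not a tested setup: every field and type name (content, content_raw, content_exact_lc, content_stripped_lc, text_exact_lc, text_stripped_lc) is hypothetical, text_en stands in for the existing field type whose analyzer is shown in the question, and the regex is just one possible definition of "special characters".

```xml
<!-- schema.xml fragment: merge into an existing schema. All names are hypothetical. -->
<schema name="example" version="1.5">
  <types>
    <!-- Copy 1: raw string, indexed exactly as typed, e.g. "(Solr)" -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

    <!-- Copy 2: whole value kept as a single token, lowercased: "(solr)" -->
    <fieldType name="text_exact_lc" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Copy 3: single token, special characters stripped, lowercased: "solr" -->
    <fieldType name="text_stripped_lc" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- Assumption: anything that is not a letter, digit or whitespace is "special" -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="[^\p{L}\p{N}\s]" replacement="" replace="all"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <!-- General tokenized text; text_en is assumed to be the existing type -->
    <field name="content" type="text_en" indexed="true" stored="true"/>
    <field name="content_raw" type="string" indexed="true" stored="false"/>
    <field name="content_exact_lc" type="text_exact_lc" indexed="true" stored="false"/>
    <field name="content_stripped_lc" type="text_stripped_lc" indexed="true" stored="false"/>
  </fields>

  <!-- Populate the copies automatically from the source field at index time -->
  <copyField source="content" dest="content_raw"/>
  <copyField source="content" dest="content_exact_lc"/>
  <copyField source="content" dest="content_stripped_lc"/>
</schema>
```

Because the keyword tokenizer keeps the whole field value as one token, the raw and single-token copies only match when the query equals the entire value, which fits short values like "(Solr)" but not longer passages. A reindex is needed after changing the schema so that the new fields are populated.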