Posted to solr-user@lucene.apache.org by jo...@aol.com on 2013/10/24 15:20:06 UTC

Searching on special characters

Hi,


How should I set up Solr so that I can search for, and get hits on, special characters such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \


My requirement is this. Suppose a user has text like so:


Doc-#1: "(Solr)"
Doc-#2: "Solr"


If they type "(solr)", I want a hit only in document #1, with the parentheses matching.  If they type "solr", they should get a hit only in document #2.


An additional nice-to-have: if they type "solr", I want a hit in both document #1 and document #2.


Here is what my current schema.xml looks like:



      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" splitOnNumerics="1" stemEnglishPossessive="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>



Currently, special characters are being stripped.



Any idea how I can configure Solr to do this?  I'm using Solr 3.6.



Thanks !!


-MJ

Re: Searching on special characters

Posted by jo...@aol.com.
I'm not sure what you mean.  Based on what you are saying, is there an example of how I can set up my schema.xml to get the result I need?


Also, I currently run searches using http://localhost:8080/solr/select/?q=<search-term>.  Does your solution require me to change this?  If so, in what way?


It would be great if all this were documented somewhere, so I wouldn't have to bug you guys!



--MJ



-----Original Message-----
From: Jack Krupansky <ja...@basetechnology.com>
To: solr-user <so...@lucene.apache.org>
Sent: Thu, Oct 24, 2013 9:39 am
Subject: Re: Searching on special characters


Index two or three copies of the text, each in its own field. One field 
could be a raw string, boosted heavily for exact matches. A second could 
be text that uses the keyword tokenizer plus a lowercase filter, also 
boosted heavily. The third would be general, tokenized text with a lower 
boost. You could also add a copy that uses the keyword tokenizer to keep 
a single token but also applies a regex filter to strip special characters 
and a lowercase filter, and give that an intermediate boost.

-- Jack Krupansky



Re: Searching on special characters

Posted by Jack Krupansky <ja...@basetechnology.com>.
Index two or three copies of the text, each in its own field. One field 
could be a raw string, boosted heavily for exact matches. A second could 
be text that uses the keyword tokenizer plus a lowercase filter, also 
boosted heavily. The third would be general, tokenized text with a lower 
boost. You could also add a copy that uses the keyword tokenizer to keep 
a single token but also applies a regex filter to strip special characters 
and a lowercase filter, and give that an intermediate boost.

-- Jack Krupansky
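
For readers who want a concrete starting point, the multi-field approach 
described above might look roughly like the schema.xml fragment below. 
This is only a sketch, not a tested configuration. All type and field 
names (string_exact, text_lower_kw, text_stripped_kw, text_exact, 
text_lower, text_stripped, and the source field "text") are made up for 
illustration, and the character class in the regex is an assumption about 
which characters count as "special".

      <!-- Field types (these belong in the <types> section). -->

      <!-- Raw string: the whole value is indexed verbatim, case-sensitive. -->
      <fieldType name="string_exact" class="solr.StrField"/>

      <!-- Whole value kept as one token, lowercased; punctuation is kept. -->
      <fieldType name="text_lower_kw" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>

      <!-- Whole value kept as one token, lowercased, then stripped of
           everything except a-z and 0-9 (assumed definition of "special"). -->
      <fieldType name="text_stripped_kw" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
                  pattern="[^a-z0-9]" replacement="" replace="all"/>
        </analyzer>
      </fieldType>

      <!-- Fields (these belong in the <fields> section). The general
           tokenized field "text" is assumed to exist already. -->
      <field name="text_exact"    type="string_exact"     indexed="true" stored="false"/>
      <field name="text_lower"    type="text_lower_kw"    indexed="true" stored="false"/>
      <field name="text_stripped" type="text_stripped_kw" indexed="true" stored="false"/>

      <!-- Copy the source text into each variant at index time. -->
      <copyField source="text" dest="text_exact"/>
      <copyField source="text" dest="text_lower"/>
      <copyField source="text" dest="text_stripped"/>

Assuming each document's field value is exactly "(Solr)" or "Solr", a 
query for "(solr)" against text_lower would match only document #1, and 
a query for "solr" against text_stripped would match both documents. 
Because the keyword tokenizer keeps the whole value as one token, these 
fields only match when the query covers the entire field value. The 
/select URL itself would not need to change, but the query would have to 
name the fields, for example through the qf parameter of a dismax-style 
parser with boosts such as "text_exact^10 text_lower^8 text_stripped^4 
text^1" (arbitrary values). Query-syntax characters such as parentheses 
would also need backslash-escaping or quoting, plus URL encoding.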
