Posted to solr-user@lucene.apache.org by Bill Bell <bi...@gmail.com> on 2011/09/25 06:22:12 UTC
Best Solr escaping?
What is the best algorithm for escaping strings before sending to Solr? Does
someone have some code?
A few things I have witnessed in "q" using DIH handler
* Double quotes - " that are not balanced can cause several issues from an
error (strip the double quote?), to no results.
* Should we use + or %20 and what cases make sense:
> * "Dr. Phil Smith" or "Dr.+Phil+Smith" or "Dr.%20Phil%20Smith" - also what is
> the impact of double quotes?
* Unmatched parentheses, i.e. an opening ( with no closing ).
> * (Dr. Holstein
> * Cardiologist+(Dr. Holstein
Regular encoding of strings does not always work for the whole string due to
several issues like white space:
* White space works better when we escape it with a backslash, as in "Bill\ Bell",
especially when using facets.
Thoughts? Code? Ideas? Better Wikis?
Re: Best Solr escaping?
Posted by Chris Hostetter <ho...@fucit.org>.
a) It depends entirely on what QueryParser you are using.
If your input is "from a human" i would suggest using dismax or edismax
and not escaping anything - unless you get some type of error, and then
maybe give the user a "there was a problem with your query, would you like
to try ____" where you suggest a new query with all meta-characters stripped out.
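A minimal sketch of that fallback in Java, assuming the meta-character list is the usual Lucene/Solr query syntax set (the class and method names are hypothetical, not part of Solr). It replaces each meta-character with a space and collapses the resulting whitespace, which gives a query that could be offered back to the user:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: builds the "would you like to try ____" suggestion.
final class QuerySuggestionSketch {

    // Assumed set of query-syntax meta-characters: + - & | ! ( ) { } [ ] ^ " ~ * ? : \
    private static final Pattern META_CHARACTERS =
            Pattern.compile("[+\\-&|!(){}\\[\\]^\"~*?:\\\\]");

    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every meta-character with a space, then collapses runs of whitespace.
    static String stripMetaCharacters(String userInput) {
        String stripped = META_CHARACTERS.matcher(userInput).replaceAll(" ");
        return WHITESPACE_RUN.matcher(stripped).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Unmatched parenthesis from the original question.
        System.out.println(stripMetaCharacters("Cardiologist+(Dr. Holstein"));
        // Unbalanced double quote.
        System.out.println(stripMetaCharacters("\"Dr. Phil Smith"));
    }
}
```

Stripping rather than escaping is deliberate here: the result is only a suggestion shown after the original query failed, so it favors a query that is certain to parse.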
b) URL escaping is really a completely independent issue...
: * Should we use + or %20 and what cases make sense:
: > * "Dr. Phil Smith" or "Dr.+Phil+Smith" or "Dr.%20Phil%20Smith" - also what is
...Solr doesn't know or care whether you use "+" or "%20" when building up a
URL. By the time Solr sees your input, the servlet container has already
URL-decoded the query params.
in general: if you are even *thinking* about how params are getting URL
encoded, you are probably doing something wrong. writing custom code to
construct Solr query strings is one thing, writing custom code to
construct/escape values in URLs is something else -- i don't know what
client language you are using, but i guarantee you it has an HTTP/CGI API
that completely eliminates any need for you to even think about such
issues.
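To illustrate the point in Java (a sketch only; the class and method names are hypothetical, and Java is assumed as the client language): the standard library's URLEncoder does the encoding, and decoding both spellings shows that "+" and "%20" arrive as the same space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical illustration: let the standard library handle URL encoding.
final class UrlEncodingSketch {

    // Builds just the "q=..." part of a request; host and handler path are left out.
    static String buildQueryParam(String rawQuery) {
        try {
            return "q=" + URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Roughly what a servlet container does to a parameter value before Solr sees it.
    static String decode(String encodedValue) {
        try {
            return URLDecoder.decode(encodedValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildQueryParam("\"Dr. Phil Smith\""));
        // Both spellings of a space decode to the same string.
        System.out.println(decode("Dr.+Phil+Smith").equals(decode("Dr.%20Phil%20Smith")));
    }
}
```

The double quotes in the first call are part of the query itself (a phrase query) and pass through URL encoding untouched in meaning; whether to send them is a query-parser question, not a URL question.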
-Hoss
RE: Best Solr escaping?
Posted by Bob Sandiford <bo...@sirsidynix.com>.
I won't guarantee this is the 'best algorithm', but here's what we use. (This is in a final class with only static helper methods):
// Set of characters / Strings SOLR treats as having special meaning in a query, and the corresponding Escaped versions.
// Note that the actual operators '&&' and '||' don't show up here - we'll just escape the characters '&' and '|' wherever they occur.
private static final String[] SOLR_SPECIAL_CHARACTERS = new String[] {"+", "-", "&", "|", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":", "\\"};
private static final String[] SOLR_REPLACEMENT_CHARACTERS = new String[] {"\\+", "\\-", "\\&", "\\|", "\\!", "\\(", "\\)", "\\{", "\\}", "\\[", "\\]", "\\^", "\\\"", "\\~", "\\*", "\\?", "\\:", "\\\\"};
/**
 * Escapes all special characters from the Search Terms, so they don't get confused with
 * the Solr query language special characters.
 * @param value - Search Term to escape
 * @return - escaped Search value, suitable for a Solr "q" parameter
 */
public static String escapeSolrCharacters(String value)
{
    return StringUtils.replaceEach(value, SOLR_SPECIAL_CHARACTERS, SOLR_REPLACEMENT_CHARACTERS);
}
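The helper above depends on a StringUtils.replaceEach method, presumably the one from Apache Commons Lang. For comparison, here is a dependency-free sketch of the same idea that walks the string once and additionally escapes whitespace, as in the "Bill\ Bell" example from the original question. The class and method names are hypothetical, and the character list is copied from the arrays above:

```java
// Hypothetical, dependency-free variant of the escaping helper.
final class SolrEscapeSketch {

    // Same characters as SOLR_SPECIAL_CHARACTERS above.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    // Prefixes every special character and every whitespace character with a backslash.
    static String escapeQueryTerm(String value) {
        StringBuilder escaped = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeQueryTerm("Bill Bell"));
        System.out.println(escapeQueryTerm("Cardiologist+(Dr. Holstein"));
    }
}
```

Escaping whitespace makes the parser treat the whole value as a single term, which suits exact values such as facet filters but is probably not wanted for free-text input. SolrJ also ships a ClientUtils.escapeQueryChars method that serves a similar purpose, if SolrJ is already on the classpath.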
Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com