Posted to solr-user@lucene.apache.org by Bill Bell <bi...@gmail.com> on 2011/09/25 06:22:12 UTC

Best Solr escaping?

What is the best algorithm for escaping strings before sending to Solr? Does
someone have some code?

A few things I have witnessed in "q" using the DIH handler:
* Double quotes - an unbalanced " can cause several issues, from an
error (strip the double quote?) to no results.
* Should we use + or %20 - and what cases make sense:
  * "Dr. Phil Smith" or "Dr.+Phil+Smith" or "Dr.%20Phil%20Smith" - also what is
    the impact of double quotes?
* Unmatched parentheses, i.e. an opening ( with no closing ):
  * (Dr. Holstein
  * Cardiologist+(Dr. Holstein
Regular encoding of strings does not always work for the whole string because
of issues like white space:
* White space works better when we escape it with a backslash, "Bill\ Bell",
especially when using facets.

Thoughts? Code? Ideas? Better Wikis?




Re: Best Solr escaping?

Posted by Chris Hostetter <ho...@fucit.org>.
a) It depends entirely on what QueryParser you are using.

If your input is "from a human" i would suggest using dismax or edismax 
and not escaping anything - unless you get some type of error, and then 
maybe give the user a "there was a problem with your query, would you like 
to try ____" where you suggest a new query with all meta-characters stripped out.
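
A minimal sketch of the "suggest a cleaned-up query" fallback described above. The class name, method name, and the exact set of meta-characters are assumptions for illustration (the set is the usual Lucene query syntax list), not anything Solr provides:

```java
// Hypothetical helper: builds a "did you mean" query by dropping query-syntax characters.
final class QuerySuggestionSketch {
    // Assumed meta-character set: + - & | ! ( ) { } [ ] ^ " ~ * ? : \
    private static final String META_CHARS = "+-&|!(){}[]^\"~*?:\\";

    // Replaces every meta-character with a space, then collapses runs of whitespace.
    static String stripMetaCharacters(String userInput) {
        StringBuilder sb = new StringBuilder(userInput.length());
        for (int i = 0; i < userInput.length(); i++) {
            char c = userInput.charAt(i);
            sb.append(META_CHARS.indexOf(c) >= 0 ? ' ' : c);
        }
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Only call this after the original query has failed to parse.
        System.out.println(stripMetaCharacters("Cardiologist (Dr. Holstein"));
    }
}
```

Replacing with a space rather than deleting keeps words apart when the meta-character was the only separator, at the cost of splitting hyphenated names.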

b) URL escaping is really a completely independent issue...

: * Should we use + or %20 - and what cases make sense:
: > * "Dr. Phil Smith" or "Dr.+Phil+Smith" or "Dr.%20Phil%20Smith" - also what is

...Solr doesn't know or care whether you use "+" or "%20" when building up a 
URL.  by the time Solr sees your input, the servlet container has already 
url-decoded the query params.

in general: if you are even *thinking* about how params are getting URL 
encoded, you are probably doing something wrong.  writing custom code to 
construct Solr query strings is one thing, writing custom code to 
construct/escape values in URLs is something else -- i don't know what 
client language you are using, but i guarantee you it has an HTTP/CGI API 
that completely eliminates any need for you to even think about such 
issues.
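
As an illustration of the point above, assuming a Java client: the standard library's URLEncoder and URLDecoder handle the URL layer, and "+" and "%20" decode to the same space. The class and method names are made up for this sketch, and the Charset overloads used here need Java 10 or later:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing that URL encoding is separate from query-syntax escaping.
final class UrlEncodingSketch {
    // Let the library encode the value; never hand-build "+" or "%20".
    static String buildQueryString(String q) {
        return "q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    // Roughly what a servlet container does before Solr sees the parameter.
    static String decode(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("\"Dr. Phil Smith\""));
        // Both spellings arrive as the same string.
        System.out.println(decode("Dr.+Phil+Smith").equals(decode("Dr.%20Phil%20Smith")));
    }
}
```

An HTTP client library or SolrJ would normally do this encoding for you, so in practice even this much code is often unnecessary.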


-Hoss

RE: Best Solr escaping?

Posted by Bob Sandiford <bo...@sirsidynix.com>.
I won't guarantee this is the 'best algorithm', but here's what we use.  (This is in a final class with only static helper methods):

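    // StringUtils below is presumably Apache Commons Lang (org.apache.commons.lang.StringUtils; replaceEach exists since 2.4).
    // Set of characters / Strings SOLR treats as having special meaning in a query, and the corresponding Escaped versions.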
    // Note that the actual operators '&&' and '||' don't show up here - we'll just escape the characters '&' and '|' wherever they occur.
    private static final String[] SOLR_SPECIAL_CHARACTERS = new String[] {"+", "-", "&", "|", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":", "\\"};
    private static final String[] SOLR_REPLACEMENT_CHARACTERS = new String[] {"\\+", "\\-", "\\&", "\\|", "\\!", "\\(", "\\)", "\\{", "\\}", "\\[", "\\]", "\\^", "\\\"", "\\~", "\\*", "\\?", "\\:", "\\\\"};


    /**
     * Escapes all special characters from the Search Terms, so they don't get confused with
     * the Solr query language special characters.
     * @param value - Search Term to escape
     * @return - escaped Search value, suitable for a Solr "q" parameter
     */
    public static String escapeSolrCharacters(String value)
    {
        return StringUtils.replaceEach(value, SOLR_SPECIAL_CHARACTERS, SOLR_REPLACEMENT_CHARACTERS);
    }
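
For comparison, a dependency-free sketch of the same idea that also backslash-escapes white space, which the original question mentions for values like Bill\ Bell. The class and method names are hypothetical, and the character set is assumed to match the array above. SolrJ also ships ClientUtils.escapeQueryChars, which may be worth checking before writing your own:

```java
// Hypothetical single-pass variant: escapes one term for the Lucene/Solr query syntax.
final class SolrEscapeSketch {
    // Assumed special-character set, same characters as SOLR_SPECIAL_CHARACTERS.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    // Prefixes each special or whitespace character with a backslash.
    static String escapeTerm(String value) {
        StringBuilder sb = new StringBuilder(value.length() * 2);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeTerm("Bill Bell"));
        System.out.println(escapeTerm("Cardiologist (Dr. Holstein"));
    }
}
```

Escaping white space makes the whole value a single term, so use it only where that is intended (for example a filter on a string field), not for free-text input that should be split into words.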

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com
