Posted to solr-user@lucene.apache.org by Peter Kirk <pk...@alpha-solutions.dk> on 2014/04/08 15:37:11 UTC

Solr special characters like '(' and '&'?

Hi

How to search for Solr special characters like '(' and '&'?

I am trying to execute searches for "products" in my Solr (3.6.1) index, based on the "categories" to which these products belong.
The categories are stored in a multistring field for the products, and are hierarchical, and are fed to the index like:
A
A|B
A|B|C

So this product would actually belong to the category named "C", which is a child of "B", which is a child of "A".

I am able to execute queries for simple category names like this (e.g. fq=categories_string:A|B|C).

But some categories have Solr special characters in their names, like: "D (E & F)"
(Real example: "Power supplies (Battery and Solar)").

A query like fq=categories_string:A|B|D (E & F) simply fails.
But even when I try to escape the special characters, as in
fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
the query does not find the products in this category, and actually matches other, unrelated categories.

What am I doing wrong?

Thanks,
Peter


RE: Solr special characters like '(' and '&'?

Posted by Peter Kirk <pk...@alpha-solutions.dk>.
Thanks for the comments, and for the idea for the term query parser.
This seems to work well (except I still can't get '&' in a category name to work - I can get the (one and only) customer to change the category names).

I'll look into fixing the indexing side of things - could be an idea to strip out the "special characters".
I'm working on the search side of things.

/Peter


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: 8. april 2014 16:15
To: solr-user@lucene.apache.org; Ahmet Arslan
Subject: Re: Solr special characters like '(' and '&'?

I'd seriously consider filtering these characters out when you index and search, this is quite likely very brittle. The same item, say from two different vendors, might have D (E & F) or D E & F. If you just stripped all of the non alpha-num characters you'd likely get less brittle results.

You know your problem domain better than I do though, so whatever makes most sense.

Best,
Erick

On Tue, Apr 8, 2014 at 6:55 AM, Ahmet Arslan <io...@yahoo.com> wrote:
> Hi Peter,
>
> TermQueryParser is useful in your case.
> q={!term f=categories_string}A|B|D (E & F)
>
>
>
> On Tuesday, April 8, 2014 4:37 PM, Peter Kirk <pk...@alpha-solutions.dk> wrote:
> Hi
>
> How to search for Solr special characters like '(' and '&'?
>
> I am trying to execute searches for "products" in my Solr (3.6.1) index, based on the "categories" to which these products belong.
> The categories are stored in a multistring field for the products, and are hierarchical, and are fed to the index like:
> A
> A|B
> A|B|C
>
> So this product would actually belong to category named "C", which is a child of "B", which is a child of "A".
>
> I am able to execute queries for simple category names like this (eg. fq=categories_string:A|B|C).
>
> But some categories have Solr special characters in their names, like: "D (E & F)"
> (Real example: "Power supplies (Battery and Solar)").
>
> A query like fq=categories_string:A|B|D (E & F) simply fails.
> But even if I try
> fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
> (where I try to escape the special characters) does not find the products in this category, and actually finds other unrelated categories.
>
> What am I doing wrong?
>
> Thanks,
> Peter
>

Re: Solr special characters like '(' and '&'?

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi;

I have developed a search API for cases like this; it generates the Solr
query itself, and it has its own query syntax.

When a search request comes into my API, I generate the query and do not
allow something like *:*. I also escape the query string and prepend the
appropriate field, as in field:(escaped_value), so there is no security
concern about exposing the fields of the schema, and no escaping concern.

I think building a search API like that, and handling security, escaping,
etc. within it, is an approach you should consider. If you try to do
something like that, I can answer your questions.

Thanks;
Furkan KAMACI


2014-04-09 18:29 GMT+03:00 Erick Erickson <er...@gmail.com>:

> Note that when I mentioned "filter these characters out" I had
> something like PatternReplaceCharFilterFactory or LowerCaseTokenizer
> in mind rather than you having to do it manually. Doesn't help
> figuring out what to escape on the URL though.
>
> Best,
> Erick
>
> On Wed, Apr 9, 2014 at 8:05 AM, Shawn Heisey <so...@elyograg.org> wrote:
> > On 4/9/2014 8:39 AM, Philip Durbin wrote:
> >> Filtering out special characters sounds like a good idea, or possibly
> >> escaping some of them. I definitely want to avoid brittleness.
> >>
> >> Right now I'm passing the query relatively "as is" which means users
> >> can type "title:foo" to find documents that have "foo" in the "title"
> >> field. But a query for just a colon (":") throws an error
> >> (org.apache.solr.search.SyntaxError: Cannot parse ':') so obviously I
> >> need to do more processing of the query before I pass it to Solr. I
> >> need to escape that colon or something.
> >>
> >> Is there some general advice on doing some sanity checks or escaping
> >> special characters on user-supplied queries before you pass them to
> >> Solr? Is it documented in the wiki? I'm using Solrj but I imagine the
> >> advice applies to everyone.
> >
> > SolrJ has the ClientUtils.escapeQueryChars method, which will
> > automatically escape any character that has special meaning to the query
> > parser.  It does so by preceding it with a backslash.
> >
> >
> > http://lucene.apache.org/solr/4_7_1/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29
> >
> > You do need to be careful with it, though.  For a query formatted like
> > field:(value) you'd only want to apply it to the 'value' part, because
> > if you applied it to the whole query, the colon and parentheses would
> > become part of the query text -- probably not what you want.
> >
> > Thanks,
> > Shawn
> >
>

Re: Solr special characters like '(' and '&'?

Posted by Erick Erickson <er...@gmail.com>.
Note that when I mentioned "filter these characters out" I had
something like PatternReplaceCharFilterFactory or LowerCaseTokenizer
in mind, rather than you having to do it manually. That doesn't help
with figuring out what to escape in the URL, though.

Best,
Erick

On Wed, Apr 9, 2014 at 8:05 AM, Shawn Heisey <so...@elyograg.org> wrote:
> On 4/9/2014 8:39 AM, Philip Durbin wrote:
>> Filtering out special characters sounds like a good idea, or possibly
>> escaping some of them. I definitely want to avoid brittleness.
>>
>> Right now I'm passing the query relatively "as is" which means users
>> can type "title:foo" to find documents that have "foo" in the "title"
>> field. But a query for just a colon (":") throws an error
>> (org.apache.solr.search.SyntaxError: Cannot parse ':') so obviously I
>> need to do more processing of the query before I pass it to Solr. I
>> need to escape that colon or something.
>>
>> Is there some general advice on doing some sanity checks or escaping
>> special characters on user-supplied queries before you pass them to
>> Solr? Is it documented in the wiki? I'm using Solrj but I imagine the
>> advice applies to everyone.
>
> SolrJ has the ClientUtils.escapeQueryChars method, which will
> automatically escape any character that has special meaning to the query
> parser.  It does so by preceding it with a backslash.
>
> http://lucene.apache.org/solr/4_7_1/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29
>
> You do need to be careful with it, though.  For a query formatted like
> field:(value) you'd only want to apply it to the 'value' part, because
> if you applied it to the whole query, the colon and parentheses would
> become part of the query text -- probably not what you want.
>
> Thanks,
> Shawn
>

Re: Solr special characters like '(' and '&'?

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/9/2014 8:39 AM, Philip Durbin wrote:
> Filtering out special characters sounds like a good idea, or possibly
> escaping some of them. I definitely want to avoid brittleness.
> 
> Right now I'm passing the query relatively "as is" which means users
> can type "title:foo" to find documents that have "foo" in the "title"
> field. But a query for just a colon (":") throws an error
> (org.apache.solr.search.SyntaxError: Cannot parse ':') so obviously I
> need to do more processing of the query before I pass it to Solr. I
> need to escape that colon or something.
> 
> Is there some general advice on doing some sanity checks or escaping
> special characters on user-supplied queries before you pass them to
> Solr? Is it documented in the wiki? I'm using Solrj but I imagine the
> advice applies to everyone.

SolrJ has the ClientUtils.escapeQueryChars method, which will
automatically escape any character that has special meaning to the query
parser.  It does so by preceding it with a backslash.

http://lucene.apache.org/solr/4_7_1/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29

You do need to be careful with it, though.  For a query formatted like
field:(value) you'd only want to apply it to the 'value' part, because
if you applied it to the whole query, the colon and parentheses would
become part of the query text -- probably not what you want.
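
To illustrate the difference between escaping only the value and escaping
the whole query, here is a minimal sketch. The escape method below is a
local stand-in modelled on the character list noted in ClientUtils, not the
SolrJ code itself, and the exact set of characters may differ between
releases; in real code you would call ClientUtils.escapeQueryChars. The
class name QueryEscaping and the method fieldQuery are made up for this
example.

```java
final class QueryEscaping {
    // Assumed list of query-syntax characters; the SolrJ list may differ by release.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Precedes every special or whitespace character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Escapes only the user-supplied value; the field name, colon and
    // outer parentheses stay as real query syntax.
    static String fieldQuery(String field, String userValue) {
        return field + ":(" + escape(userValue) + ")";
    }

    public static void main(String[] args) {
        // Value-only escaping keeps the query structure intact.
        System.out.println(fieldQuery("title", "foo:bar"));
        // Escaping the whole query turns the syntax into literal text.
        System.out.println(escape("title:(foo)"));
        // A lone colon becomes a literal instead of a parse error.
        System.out.println(escape(":"));
    }
}
```

With this sketch, fieldQuery("title", "foo:bar") gives title:(foo\:bar),
while escape("title:(foo)") gives title\:\(foo\), which would search for
that literal text rather than for foo in the title field.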

Thanks,
Shawn


Re: Solr special characters like '(' and '&'?

Posted by Philip Durbin <ph...@harvard.edu>.
Filtering out special characters sounds like a good idea, or possibly
escaping some of them. I definitely want to avoid brittleness.

Right now I'm passing the query relatively "as is" which means users
can type "title:foo" to find documents that have "foo" in the "title"
field. But a query for just a colon (":") throws an error
(org.apache.solr.search.SyntaxError: Cannot parse ':') so obviously I
need to do more processing of the query before I pass it to Solr. I
need to escape that colon or something.

Is there some general advice on doing some sanity checks or escaping
special characters on user-supplied queries before you pass them to
Solr? Is it documented in the wiki? I'm using Solrj but I imagine the
advice applies to everyone.

Phil

p.s. I noticed a note saying "These characters are part of the query
syntax and must be escaped" at
https://github.com/apache/lucene-solr/blob/lucene_solr_4_7_0/solr/solrj/src/java/org/apache/solr/client/solrj/util/ClientUtils.java#L231
and learned of this part of the code from
http://lucene.472066.n3.nabble.com/What-is-the-full-list-of-Solr-Special-Characters-td4094053.html

On Tue, Apr 8, 2014 at 10:14 AM, Erick Erickson <er...@gmail.com> wrote:
> I'd seriously consider filtering these characters out when you index
> and search, this is quite likely very brittle. The same item, say from
> two different vendors, might have D (E & F) or D E & F. If you just
> stripped all of the non alpha-num characters you'd likely get less
> brittle results.
>
> You know your problem domain better than I do though, so whatever
> makes most sense.
>
> Best,
> Erick
>
> On Tue, Apr 8, 2014 at 6:55 AM, Ahmet Arslan <io...@yahoo.com> wrote:
>> Hi Peter,
>>
>> TermQueryParser is useful in your case.
>> q={!term f=categories_string}A|B|D (E & F)
>>
>>
>>
>> On Tuesday, April 8, 2014 4:37 PM, Peter Kirk <pk...@alpha-solutions.dk> wrote:
>> Hi
>>
>> How to search for Solr special characters like '(' and '&'?
>>
>> I am trying to execute searches for "products" in my Solr (3.6.1) index, based on the "categories" to which these products belong.
>> The categories are stored in a multistring field for the products, and are hierarchical, and are fed to the index like:
>> A
>> A|B
>> A|B|C
>>
>> So this product would actually belong to category named "C", which is a child of "B", which is a child of "A".
>>
>> I am able to execute queries for simple category names like this (eg. fq=categories_string:A|B|C).
>>
>> But some categories have Solr special characters in their names, like: "D (E & F)"
>> (Real example: "Power supplies (Battery and Solar)").
>>
>> A query like fq=categories_string:A|B|D (E & F) simply fails.
>> But even if I try
>> fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
>> (where I try to escape the special characters) does not find the products in this category, and actually finds other unrelated categories.
>>
>> What am I doing wrong?
>>
>> Thanks,
>> Peter
>>



-- 
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Re: Solr special characters like '(' and '&'?

Posted by Erick Erickson <er...@gmail.com>.
I'd seriously consider filtering these characters out at both index
and query time; matching on them is quite likely to be very brittle.
The same item, say from two different vendors, might have D (E & F) or
D E & F. If you just stripped all of the non-alphanumeric characters
you'd likely get less brittle results.
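
As a rough client-side illustration of that idea (not an analysis-chain
configuration), the sketch below strips everything except ASCII letters,
digits and spaces from each category name. It assumes the "|" separator
from the original post should be preserved, so each path segment is
normalized on its own. The class and method names are hypothetical, and the
ASCII-only rule is an assumption that would need adjusting for
non-English category names.

```java
import java.util.ArrayList;
import java.util.List;

final class CategoryNormalizer {
    // Drops every character that is not an ASCII letter, digit or space,
    // then trims and collapses runs of spaces into one.
    static String normalizeSegment(String segment) {
        return segment.replaceAll("[^A-Za-z0-9 ]", "")
                      .trim()
                      .replaceAll("\\s+", " ");
    }

    // Normalizes each "|"-separated level of a category path separately,
    // so the hierarchy separator itself is kept.
    static String normalizePath(String path) {
        List<String> out = new ArrayList<>();
        for (String part : path.split("\\|")) {
            out.add(normalizeSegment(part));
        }
        return String.join("|", out);
    }

    public static void main(String[] args) {
        // Both vendor spellings end up as the same string.
        System.out.println(normalizeSegment("D (E & F)"));
        System.out.println(normalizeSegment("D E & F"));
        System.out.println(normalizePath("A|B|D (E & F)"));
    }
}
```

Both "D (E & F)" and "D E & F" normalize to "D E F" here, so the same
normalization would have to be applied to the value at index time and to
the filter value at query time for the two to match.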

You know your problem domain better than I do though, so whatever
makes most sense.

Best,
Erick

On Tue, Apr 8, 2014 at 6:55 AM, Ahmet Arslan <io...@yahoo.com> wrote:
> Hi Peter,
>
> TermQueryParser is useful in your case.
> q={!term f=categories_string}A|B|D (E & F)
>
>
>
> On Tuesday, April 8, 2014 4:37 PM, Peter Kirk <pk...@alpha-solutions.dk> wrote:
> Hi
>
> How to search for Solr special characters like '(' and '&'?
>
> I am trying to execute searches for "products" in my Solr (3.6.1) index, based on the "categories" to which these products belong.
> The categories are stored in a multistring field for the products, and are hierarchical, and are fed to the index like:
> A
> A|B
> A|B|C
>
> So this product would actually belong to category named "C", which is a child of "B", which is a child of "A".
>
> I am able to execute queries for simple category names like this (eg. fq=categories_string:A|B|C).
>
> But some categories have Solr special characters in their names, like: "D (E & F)"
> (Real example: "Power supplies (Battery and Solar)").
>
> A query like fq=categories_string:A|B|D (E & F) simply fails.
> But even if I try
> fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
> (where I try to escape the special characters) does not find the products in this category, and actually finds other unrelated categories.
>
> What am I doing wrong?
>
> Thanks,
> Peter
>

Re: Solr special characters like '(' and '&'?

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Peter,

TermQueryParser is useful in your case. 
q={!term f=categories_string}A|B|D (E & F)
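
The term parser treats everything after the closing brace as one literal
term, so nothing needs backslash-escaping for the query parser. The value
still has to be percent-encoded before it goes into the request URL. A
minimal sketch of that step in Java follows; the class and method names
(TermFilterUrl, termFilter, encode) are made up for illustration, and only
java.net.URLEncoder is a real API here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class TermFilterUrl {
    // Builds the raw (not yet URL-encoded) parameter value for the term parser.
    static String termFilter(String field, String value) {
        return "{!term f=" + field + "}" + value;
    }

    // Percent-encodes a parameter value for use in a query string
    // (spaces become '+', '&' becomes %26, and so on).
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String fq = termFilter("categories_string", "A|B|D (E & F)");
        // prints fq=%7B%21term+f%3Dcategories_string%7DA%7CB%7CD+%28E+%26+F%29
        System.out.println("fq=" + encode(fq));
    }
}
```

A client library such as SolrJ normally does this encoding itself when the
value is set as a parameter, so the manual step matters mainly when the URL
is assembled by hand.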



On Tuesday, April 8, 2014 4:37 PM, Peter Kirk <pk...@alpha-solutions.dk> wrote:
Hi

How to search for Solr special characters like '(' and '&'?

I am trying to execute searches for "products" in my Solr (3.6.1) index, based on the "categories" to which these products belong.
The categories are stored in a multistring field for the products, and are hierarchical, and are fed to the index like:
A
A|B
A|B|C

So this product would actually belong to category named "C", which is a child of "B", which is a child of "A".

I am able to execute queries for simple category names like this (eg. fq=categories_string:A|B|C).

But some categories have Solr special characters in their names, like: "D (E & F)"
(Real example: "Power supplies (Battery and Solar)").

A query like fq=categories_string:A|B|D (E & F) simply fails.
But even if I try 
fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
(where I try to escape the special characters) does not find the products in this category, and actually finds other unrelated categories.

What am I doing wrong?

Thanks,
Peter


Re: Solr special characters like '(' and '&'?

Posted by rulinma <ru...@gmail.com>.
mark.




Re: Solr special characters like '(' and '&'?

Posted by "T. Kuro Kurosaka" <ku...@healthline.com>.
I don't think & is special to the parser. Classic examples like "AT&T"
just work, as far as the query parser is concerned.
https://wiki.apache.org/solr/SolrQuerySyntax
also says that you can escape a character's special meaning with a backslash.

"&" is special in the URL, however, and that has to be hex-escaped as %26.

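To make the URL-level point concrete, the sketch below decodes the escaped
value from the original post and compares it with a plain percent-encoding
of the category name. It assumes the %26amp%3B in that post is what was
really sent (it could also be an artifact of the mail archive); if so, the
server would have received the HTML entity "&amp;" rather than a bare "&".
The class name AmpersandEncoding and its helper methods are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class AmpersandEncoding {
    // Decodes a percent-encoded query-string value.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Percent-encodes a value for use in a query string.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // What the attempted value decodes to: D \(E &amp; F\)
        System.out.println(decode("D%20\\(E%20%26amp%3B%20F\\)"));
        // What a plain encoding of the category name looks like: D+%28E+%26+F%29
        System.out.println(encode("D (E & F)"));
    }
}
```

The decoded value contains "&amp;", which would not match a stored category
name containing a plain "&".
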
On 04/08/2014 06:37 AM, Peter Kirk wrote:
> Hi
>
> How to search for Solr special characters like '(' and '&'?
>

Kuro