Posted to solr-user@lucene.apache.org by Callum Lamb <cl...@mintel.com> on 2016/11/16 17:34:11 UTC

Handling ampersands in searches.

I'm having an issue where searches that contain ampersands aren't being
handled correctly. I need them to be dropped at index time *AND* query
time. When documents come in and are indexed the ampersands are
successfully dropped when they go into my stemmed field (When I facet on
the stemmed field they aren't in the list), but when I actually search with
a term containing an ampersand, I get no results.

E.g. if I search for the string "light fit" or "light and fit" I get
results, but when I search for "light & fit" I get none, even though the
SnowballPorterFilterFactory should be dropping the "&" at query time like
it does for the "and", so all 3 queries *should* be equivalent.

I've tried adding a synonym, which shows up in
my _schema_analysis_synonyms_default.json (I only have the one default
file), in both this form and its inverse:

"and":[
      "&",
      "and"],

I've also tried adding the StopFilter to my fieldtype with & in the
stopwords (though this shouldn't be necessary because the SnowballPorter
filter should be dropping it anyway) and it still doesn't work.

Is there some kind of special handling I need for ampersands? I'm thinking
that Solr must be interpreting it as some kind of operator and I need to
tell Solr that it's actually literal text so the SnowballPorter filter
knows to drop it. Using backslashes or URL encoding instead doesn't work
though. Does anyone have any ideas?

I can obviously just remove any ampersands from the q before I submit the
query to Solr and get the correct results, so this is not a game-breaking
problem, but I'm more curious as to *why* this is happening and how to fix
it correctly.

Cheers,

Callum.

Extra info:

I'm using Solr 5.5.2 in cloud mode.

The q in the queries is specified like this and is parsed in the following
way:

"rawquerystring":"stemmed_description:light & fit",
"querystring":"stemmed_description:light & fit",
"parsedquery":"(+(+stemmed_description:light +DisjunctionMaxQuery((stemmed_description:&)) +DisjunctionMaxQuery((stemmed_description:fit))))/no_coord",
"parsedquery_toString":"+(+stemmed_description:light +(stemmed_description:&) +(stemmed_description:fit))",

I have a stemmed field in my schema (schema version 1.5) defined like
this:

<field name="stemmed_description" type="stemmed_text" indexed="true" stored="false" required="false" multiValued="true"/>

with a field type defined like this:

    <!-- Stemmed text type -->
    <fieldType name="stemmed_text" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                catenateWords="1"
                preserveOriginal="0"
                splitOnNumerics="0"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.ManagedSynonymFilterFactory" managed="default"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                catenateWords="1"
                preserveOriginal="1"
                splitOnNumerics="0"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>

        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

-- 

Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
Registered in England: Number 1475918. | VAT Number: GB 232 9342 72

Contact details for our other offices can be found at 
http://www.mintel.com/office-locations.

This email and any attachments may include content that is confidential, 
privileged 
or otherwise protected under applicable law. Unauthorised disclosure, 
copying, distribution 
or use of the contents is prohibited and may be unlawful. If you have 
received this email in error,
including without appropriate authorisation, then please reply to the 
sender about the error 
and delete this email and any attachments.

Re: Handling ampersands in searches.

Posted by Erick Erickson <er...@gmail.com>.

Why do you think that the porter stemmer is involved here? That
takes tokens and tries to reduce them to their base form through
a set of rules. My guess is that the & just falls outside all rules so
is passed through unimpeded.

This is where the admin/analysis page is invaluable. If you look at
your type you'll notice that you have different options on
WordDelimiterFilterFactory for query and index time, in particular
"preserveOriginal" is 0 at index time and 1 at query.

So you get the tokens
light
fit

in your index and

light
&
fit

at query time.

Since the query requires all three terms and "&" was never indexed,
it matches nothing. I'm also guessing you have mm=100% or the default
op set to AND in your edismax configuration.
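
A minimal sketch of one way to line the two chains up, assuming nothing
else on the query side relies on the original token being kept: set
preserveOriginal to 0 on the query-time WordDelimiterFilterFactory so it
matches the index-time setting. Only that attribute differs from the
query analyzer posted in the schema; whether dropping the original token
suits your other queries is an assumption worth checking on the
admin/analysis page.

      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- preserveOriginal now matches the index-time analyzer, so a
             token made only of delimiters (such as "&") emits nothing -->
        <filter class="solr.WordDelimiterFilterFactory"
                catenateWords="1"
                preserveOriginal="0"
                splitOnNumerics="0"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>

This change only affects query analysis, so it should not require
reindexing.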

Anyway, this all kind of starts with choosing WhitespaceTokenizerFactory
as your tokenizer. StandardTokenizerFactory will (I think) remove the
ampersand in both cases. You can also use a CharFilterFactory to apply
some filtering to characters before anything starts going through the
analysis chain. (NOTE: this is a CharFilter, not a Filter! See something
like PatternReplaceCharFilterFactory at
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory)
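
As a rough sketch of the CharFilter approach (an illustration, not a
tested configuration), the fragment below replaces every ampersand with
a space before the tokenizer sees the text. The pattern and replacement
values are my assumptions; note that "&" has to be written as "&amp;"
inside the schema XML. The same charFilter line would go in both the
index and query analyzers so the two stay in step.

      <analyzer type="query">
        <!-- CharFilters run before the tokenizer; "&amp;" is the
             XML-escaped form of a literal ampersand -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&amp;"
                    replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- the rest of the filter chain stays as it is today -->
      </analyzer>

Be aware this also touches ampersands inside words: something like
"AT&T" would reach the tokenizer as "AT T" rather than as one token for
WordDelimiterFilterFactory to split and catenate, so compare the output
on the admin/analysis page before and after, and reindex if you change
the index-time analyzer.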

Best,
Erick
