
Posted to solr-user@lucene.apache.org by Nikolas Tautenhahn <ni...@livinglogic.de> on 2010/08/19 17:33:59 UTC

Proper Escaping of Ampersands

Hi,

I have a problem with company names such as "AT&S".
A job sends data as XML, via Python, to the Solr 1.4 index (I also
tested with 1.4.1), and everything is escaped properly ("&" becomes
"&amp;").

When I search for "at s" (q=%22at%20s%22) using the dismax handler, I
find the record for this company and get all names back (the company
is still called at&s, not something like at&amp;s).

But when I search for q=at%26s (=at&s), I get nothing.
I also tried q=at%5C%26s (=at\&s) and q=at%5C%5C%26s, blindly following
any clues about escaping with backslashes...
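
For reference, the two escaping layers described above can be checked with the Python standard library. This is only a minimal sketch, not the poster's actual indexing job: the field name "titel" is borrowed from the schema quoted elsewhere in this thread, and the <field> wrapper merely illustrates the shape of an XML update message.

```python
from urllib.parse import quote, unquote
from xml.sax.saxutils import escape

name = "AT&S"

# Index time: the ampersand must be escaped inside the XML update message.
# (Illustrative wrapper only; the real job's markup is not shown in the thread.)
xml_field = '<field name="titel">%s</field>' % escape(name)

# Query time: the ampersand must be percent-encoded in the URL, otherwise
# it would be read as a separator between request parameters.
encoded_q = quote("at&s", safe="")

# What the three attempted q values look like once the URL is decoded.
attempts = ["at%26s", "at%5C%26s", "at%5C%5C%26s"]
decoded = [unquote(a) for a in attempts]

print(xml_field)
print(encoded_q)
print(decoded)
```

The first attempt already decodes to the plain string at&s, which suggests that the URL encoding itself is not where the query goes wrong.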


So, my question is: How do I search (correctly) for at&s?


When I use the "Analysis" page in the admin panel, select my field
name, and enter "AT&S" as both Field Value (Index) and Field Value
(Query), it shows me that the query matches, so I assume Solr
doesn't get the correct query string...

If necessary, I can supply the schema.xml definitions for the
fields in use, but since the "Analysis" page showed the match, I don't
think this would be very useful...

best regards,
Nikolas Tautenhahn

Re: Proper Escaping of Ampersands

Posted by Nikolas Tautenhahn <ni...@livinglogic.de>.

Hi all,

just some further information:
https://issues.apache.org/jira/browse/SOLR-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

seems to be the same problem - but searching the archives yielded
nothing I could use.

Any hints on this?

best regards,
Nikolas Tautenhahn


Re: Proper Escaping of Ampersands

Posted by Yonik Seeley <yo...@lucidimagination.com>.

I'd recommend going back to the "textgen" field type as defined in the
example schema.
Your move of the StopFilter is what is causing the problem.
At index time, the "s" gets removed (because the StopFilter is now
after the WDF).
But a query of "at&s" is transformed into "at s" (the s isn't removed
because StopFilter is before WDF for the query analyzer).  Since "s"
isn't in the index, no docs are found.
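
The mismatch described above can be reproduced with a toy model of the two analyzer chains. This is a sketch under loud assumptions, not Solr's implementation: the functions below are hypothetical stand-ins for WordDelimiterFilter (with preserveOriginal=1), LowerCaseFilter, and StopFilter, the stopword set contains only "s", and phrase matching is reduced to comparing positions one by one.

```python
import re

STOPWORDS = {"s"}  # assumption: only the stopword that matters for this example


def word_delimiter(positions, catenate_words):
    """Toy WordDelimiterFilter, preserveOriginal=1, at most one term per position."""
    out = []
    for position in positions:
        if not position:
            out.append([])  # keep the gap left by a removed stopword
            continue
        term = position[0]
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", term) if p]
        if len(parts) < 2:
            out.append([term])
            continue
        slots = [[p] for p in parts]
        slots[0].insert(0, term)              # keep the original token
        if catenate_words:
            slots[-1].append("".join(parts))  # e.g. "ats"
        out.extend(slots)
    return out


def lowercase(positions):
    return [[t.lower() for t in pos] for pos in positions]


def stop_filter(positions):
    # Terms are dropped but their positions are kept.
    return [[t for t in pos if t.lower() not in STOPWORDS] for pos in positions]


def index_as_posted(text):
    # Index chain from the posted schema: WDF, then lowercase, then stopwords.
    return stop_filter(lowercase(word_delimiter([[text]], catenate_words=True)))


def index_stop_first(text):
    # Index chain with the StopFilter moved in front of the WDF.
    return lowercase(word_delimiter(stop_filter([[text]]), catenate_words=True))


def query_chain(text):
    # Query chain: whitespace tokens, stopwords, WDF, lowercase.
    tokens = [[t] for t in text.split()]
    return lowercase(word_delimiter(stop_filter(tokens), catenate_words=False))


def phrase_matches(index_positions, query_positions):
    """Simplified phrase match: align at position 0; empty query slots match anything."""
    if len(query_positions) > len(index_positions):
        return False
    return all(not q or set(q) & set(i)
               for q, i in zip(query_positions, index_positions))


query = query_chain("at&s")
print(index_as_posted("AT&S"), query, phrase_matches(index_as_posted("AT&S"), query))
print(index_stop_first("AT&S"), query, phrase_matches(index_stop_first("AT&S"), query))
```

In this model the posted index chain leaves only "ats" at the second position, so the query's "s" has nothing to match, while running the StopFilter before the WDF keeps "s" in the index and the phrase matches.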

Also, I notice you're using preserveOriginal=1 - make sure you really
need that... it's normally only useful if you are doing wildcard
searches (for example at&*).
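
A small sketch of why the original token matters for wildcard searches, assuming wildcard patterns are compared against indexed terms without going through the analyzer. The term sets are assumptions: the first follows the WordDelimiterFilter output reported in this thread (lowercased), and the second is what would presumably remain with preserveOriginal=0. Python's fnmatch stands in for the wildcard matching.

```python
from fnmatch import fnmatchcase

# Terms produced for "AT&S" with preserveOriginal=1, per the analysis
# output in this thread (lowercased).
with_original = {"at&s", "at", "s", "ats"}

# Assumed result with preserveOriginal=0: the unsplit token is not kept.
without_original = {"at", "s", "ats"}


def wildcard_hits(pattern, terms):
    """Return the terms that match a shell-style wildcard pattern."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))


print(wildcard_hits("at&*", with_original))
print(wildcard_hits("at&*", without_original))
```

Only the set that still contains the unsplit token has anything for the pattern at&* to hit.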

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8



Re: Proper Escaping of Ampersands

Posted by Nikolas Tautenhahn <ni...@livinglogic.de>.

Hi Chris,

On 23.08.2010 21:37, Chris Hostetter wrote:
> : The document is indexed correctly, a search for "at s" found it and all
> : fields looked great ("at&s and not for example, at&amp;s).
> : 
> : As my stopword list does not contain "at" or "&" or "&amp;", I don't
> : quite understand, why my result is found, when I disable the
> : stopword-list. My stopwordlist can be found here
> : 
> : http://pastebin.com/RfLuBHqd
> : 
> : Do you happen to see bad things for a string like "at&s" here?
> 
> "s" is in your stopwords file, which may be part of the problem (but i 
> didn't look hard at your query string to verify that)
> 
> : The analysis page in the admin panel tells me, these steps for the Index
> : Analyzer:
> 	...
> : (StopFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: ats
> : 
> : So, according to this, it should be found even with my stopwords enabled...
> 
> Strange, based on the stopwords file you posted the "s" should definitely 
> be getting removed at index time -- it would also get removed at query 
> time, but because you have it *before* WDF at query time that wouldn't 
> affect this query (even though it did affect the index)
> 
> There was a bug with analysis.jsp and stopwords recently, but that 
> shouldn't have affected 1.4 (you are definitely using 1.4, correct?)
> 
> https://issues.apache.org/jira/browse/SOLR-2051

I am using Solr 1.4 (actually LucidWorks Solr) in production and tried
1.4.1 for testing. Unfortunately, I can't say for sure whether I tried
analysis.jsp in both...

I moved the stopword filter before the WordDelimiterFilter. Thanks for
your hints, Chris and Yonik!

best regards,
Nikolas Tautenhahn

Re: Proper Escaping of Ampersands

Posted by Chris Hostetter <ho...@fucit.org>.

: The document is indexed correctly, a search for "at s" found it and all
: fields looked great ("at&s and not for example, at&amp;s).
: 
: As my stopword list does not contain "at" or "&" or "&amp;", I don't
: quite understand, why my result is found, when I disable the
: stopword-list. My stopwordlist can be found here
: 
: http://pastebin.com/RfLuBHqd
: 
: Do you happen to see bad things for a string like "at&s" here?

"s" is in your stopwords file, which may be part of the problem (but i 
didn't look hard at your query string to verify that)

: The analysis page in the admin panel tells me, these steps for the Index
: Analyzer:
	...
: (StopFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: ats
: 
: So, according to this, it should be found even with my stopwords enabled...

Strange: based on the stopwords file you posted, the "s" should definitely
be removed at index time. The StopFilter also runs at query time, but
because you have it *before* WDF there, it never sees the "s" that WDF
splits off, so it wouldn't affect this query (even though it did affect
the index)

There was a bug with analysis.jsp and stopwords recently, but that 
shouldn't have affected 1.4 (you are definitely using 1.4, correct?)

https://issues.apache.org/jira/browse/SOLR-2051






-Hoss

Re: Proper Escaping of Ampersands

Posted by Nikolas Tautenhahn <ni...@livinglogic.de>.

Hi Yonik,

I got it working, but I think the stopword filter is not behaving as
expected (the document could be found once I disabled the stopword
filter; details later in this mail).

On 20.08.2010 16:57, Yonik Seeley wrote:
> On Thu, Aug 19, 2010 at 11:33 AM, Nikolas Tautenhahn
> <ni...@livinglogic.de> wrote:
>> But when I search for q=at%26s (=at&s), I get nothing.
> 
> That's the correct encoding if you're typing it directly into a
> browser address box.
> http://localhost:8983/solr/select?defType=dismax&qf=text&q=at%26s&debugQuery=true
> 
> But you should be able to verify that solr is getting the correct
> query string by checking out "params" in the response (in the example
> server, by default they are echoed back).  And adding debugQuery=true
> to the request should show you exactly what query is being generated.
> 
> But the real issue likely lies with your fieldType definition.  Can
> you show that?

As I (normally) query multiple fields, I changed my request URL to
http://127.0.0.1:8983/solr/select?q=at%26s&fl=titel&qt=dismax&qf=titel&debugQuery=truefl=*&qt=dismax&qf=titel&debugQuery=true
in order to narrow it down and got this response (cut down to what I
think is the relevant part):

> <str name="rawquerystring">at&s</str>
> <str name="querystring">at&s</str>
> <str name="parsedquery">+DisjunctionMaxQuery((titel:"(at&s at) s")~0.1) ()</str>
> <str name="parsedquery_toString">+(titel:"(at&s at) s")~0.1 ()</str>
> <lst name="explain"/>
> <str name="QParser">DisMaxQParser</str>

on my local debugging instance, using the standard dismax config (from
Solr's example directory).

The "titel" field is configured like this:

>   <field name="titel" type="textgen" indexed="true" stored="true"/>

and "textgen" is configured like this:

>     <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>

The document is indexed correctly: a search for "at s" found it, and all
fields looked fine ("at&s" and not, for example, "at&amp;s").

As my stopword list does not contain "at", "&", or "&amp;", I don't
quite understand why my result is found when I disable the
stopword list. My stopword list can be found here:

http://pastebin.com/RfLuBHqd

Do you happen to see bad things for a string like "at&s" here?

The analysis page in the admin panel shows these steps for the index
analyzer:

(HTMLStripStandardTokenizer) at&s => at&s
(SynonymFilter) at&s => at&s
(WordDelimiterFilter) at&s => term position 1: at&s, at; term position 2: s, ats
(LowerCaseFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: s, ats
(StopFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: ats

So, according to this, it should be found even with my stopwords enabled...


best regards and thanks for your response,
Nikolas Tautenhahn

Re: Proper Escaping of Ampersands

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Thu, Aug 19, 2010 at 11:33 AM, Nikolas Tautenhahn
<ni...@livinglogic.de> wrote:
> Hi,
>
> I have a problem with, for example, company names like "AT&S".
> A Job is sending data to the solr 1.4 (also tested it with 1.4.1) index
> via python in XML, everything is escaped properly ("&" becomes "&amp;").
>
> When I search for "at s"(q=%22at%20s%22), using the dismax handler, I
> find the dataset to this company and I get all names back (The company
> is still called at&s and not something like at&amp;s).
>
> But when I search for q=at%26s (=at&s), I get nothing.

That's the correct encoding if you're typing it directly into a
browser address box.
http://localhost:8983/solr/select?defType=dismax&qf=text&q=at%26s&debugQuery=true

But you should be able to verify that solr is getting the correct
query string by checking out "params" in the response (in the example
server, by default they are echoed back).  And adding debugQuery=true
to the request should show you exactly what query is being generated.
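
As a rough illustration of this check, the request URL can be built and the debug section of the response read with the Python standard library. This is a sketch under assumptions: the SAMPLE response is hand-written to resemble the debug output quoted in this thread, not captured from a server, and the helper names are made up for this example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE = "http://localhost:8983/solr/select"  # example-server address used in this thread


def debug_url(user_query, field):
    """Build a dismax request with debugQuery switched on."""
    params = {"defType": "dismax", "qf": field, "q": user_query, "debugQuery": "true"}
    return BASE + "?" + urlencode(params)


def debug_strings(response_xml):
    """Return the <str> entries of the debug section as a dict."""
    root = ET.fromstring(response_xml)
    debug = root.find("lst[@name='debug']")
    return {s.get("name"): s.text for s in debug.findall("str")}


# Hand-written sample (assumption), shaped like the debug output in this thread.
SAMPLE = """<response>
  <lst name="debug">
    <str name="rawquerystring">at&amp;s</str>
    <str name="parsedquery_toString">+(titel:"(at&amp;s at) s")~0.1 ()</str>
  </lst>
</response>"""

print(debug_url("at&s", "text"))
print(debug_strings(SAMPLE)["rawquerystring"])
```

If rawquerystring comes back as at&s, the query string reached Solr intact, and the parsed query is the next place to look.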

But the real issue likely lies with your fieldType definition.  Can
you show that?

-Yonik
http://www.lucidimagination.com