Posted to solr-user@lucene.apache.org by Mike Phillips <m....@prosperodigital.com> on 2020/02/21 17:57:21 UTC

Is this a bug? Wildcard with PatternReplaceFilterFactory

I am attempting to normalize left and right single and double quotation marks to their plain ASCII equivalents for searches:

‘       Left single quotation mark        '    Single quote
’       Right single quotation mark       '    Single quote
“       Left double quotation mark        "    Double quotes
”       Right double quotation mark       "    Double quotes


     <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
       <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
         <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
         <filter class="solr.PatternReplaceFilterFactory" pattern="‘" replacement="'"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="“" replacement="&quot;"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="”" replacement="&quot;"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="‘" replacement="'"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="“" replacement="&quot;"/>
         <filter class="solr.PatternReplaceFilterFactory" pattern="”" replacement="&quot;"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
     </fieldType>
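
For readers who want to check the intended per-token result outside Solr, here is a minimal Python sketch of what the four PatternReplaceFilterFactory entries followed by the LowerCaseFilterFactory are meant to do. It is only an approximation and not Solr code: the name normalize_quotes is hypothetical, and it assumes every occurrence of each character is replaced.

```python
import re

# (pattern, replacement) pairs mirroring the four PatternReplaceFilterFactory
# entries in the field type above, in the same order.
QUOTE_RULES = [
    ("\u2018", "'"),   # left single quotation mark
    ("\u2019", "'"),   # right single quotation mark
    ("\u201c", '"'),   # left double quotation mark
    ("\u201d", '"'),   # right double quotation mark
]


def normalize_quotes(token):
    """Apply each replacement in order, then lowercase (hypothetical helper)."""
    for pattern, replacement in QUOTE_RULES:
        token = re.sub(pattern, replacement, token)
    return token.lower()


if __name__ == "__main__":
    # "Rod" + U+2019 + "s", the value used throughout this thread.
    print(normalize_quotes("Rod\u2019s"))
```

The sketch works on one token at a time, which is also how a token filter sees its input; it does not model the tokenizer or the word delimiter filter.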

The wildcard query does NOT seem to apply the PatternReplaceFilterFactory.

Rod’s finds fields Rod's and Rod’s, which are now in the index as rod's,

but *Rod’s* finds nothing, because the index now contains only rod's.

Re: Is this a bug? Wildcard with PatternReplaceFilterFactory

Posted by Mike Phillips <mi...@comcast.net>.

It looks like the debug result you are showing me is for Rod's, not Rod’s. But to answer your question, this is why I say "Rod’s finds fields Rod's and Rod’s that are now in the index as rod's":

The analysis page shows Rod’s gets stored in the index as:
rod's rods rod s

Field Value (Index): Rod’s
Analyse Fieldname / FieldType: _text_

WT
text   raw_bytes               start  end  positionLength  type  termFrequency  position
Rod’s  [52 6f 64 e2 80 99 73]  0      5    1               word  1              1

SF
text   raw_bytes               start  end  positionLength  type  termFrequency  position
Rod’s  [52 6f 64 e2 80 99 73]  0      5    1               word  1              1

WDGF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod’s  [52 6f 64 e2 80 99 73]  0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

FGF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod’s  [52 6f 64 e2 80 99 73]  0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod’s  [52 6f 64 e2 80 99 73]  0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod's  [52 6f 64 27 73]        0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod's  [52 6f 64 27 73]        0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod's  [52 6f 64 27 73]        0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

LCF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
rod's  [72 6f 64 27 73]        0      5    2               word  1              1         false
rods   [72 6f 64 73]           0      5    2               word  1              1         false
rod    [72 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false
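
The raw_bytes column is the quickest way to confirm which apostrophe a term really contains, since the two characters can look alike on screen. The following is a small Python sketch, separate from Solr, that decodes the hex values shown above as UTF-8 and names any non-ASCII characters; the helper names decode_raw_bytes and non_ascii_names are hypothetical.

```python
import unicodedata


def decode_raw_bytes(raw):
    """Decode a '[52 6f 64 ...]' hex dump from the analysis page as UTF-8."""
    return bytes.fromhex(raw.strip("[]")).decode("utf-8")


def non_ascii_names(text):
    """Return the Unicode names of all non-ASCII characters in text."""
    return [unicodedata.name(ch) for ch in text if ord(ch) > 127]


if __name__ == "__main__":
    # First term before and after the PatternReplaceFilter stage that changes it.
    before = decode_raw_bytes("[52 6f 64 e2 80 99 73]")
    after = decode_raw_bytes("[52 6f 64 27 73]")
    print(non_ascii_names(before))
    print(non_ascii_names(after))
```

An empty list for the second value means the term contains only ASCII characters, so the right single quotation mark was replaced during index analysis.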



This is what we were trying to achieve with <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>.


The problem is that when using the wildcard query *Rod’s* we get no hits:

"responseHeader": {
  "status": 0,
  "QTime": 2,
  "params": {
    "q": "*Rod’s*",
    "debugQuery": "on",
    "_": "1582315262594"
  }
},
"response": {"numFound": 0, "start": 0, "docs": []},
"debug": {
  "rawquerystring": "*Rod’s*",
  "querystring": "*Rod’s*",
  "parsedquery": "_text_:*rod’s*",
  "parsedquery_toString": "_text_:*rod’s*",
  "explain": {},
  "QParser": "LuceneQParser",
  ...

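The parsedquery above is lowercased but still contains the right single quotation mark (U+2019), while the index holds rod's with a plain apostrophe. The following Python sketch illustrates that mismatch only. It uses glob matching from the standard library as a stand-in for wildcard matching, which is an assumption and not how Lucene implements it, and the name matching_terms is hypothetical.

```python
from fnmatch import fnmatchcase

# Terms the index analyzer produced for "Rod" + U+2019 + "s",
# taken from the final (LCF) stage of the analysis output.
INDEXED_TERMS = ["rod's", "rods", "rod", "s"]


def matching_terms(pattern):
    """Return the indexed terms matched by a glob-style pattern (hypothetical helper)."""
    return [term for term in INDEXED_TERMS if fnmatchcase(term, pattern)]


if __name__ == "__main__":
    # Pattern as it appears in parsedquery: lowercased, U+2019 left in place.
    print(matching_terms("*rod\u2019s*"))
    # The same pattern with U+2019 replaced by a plain apostrophe.
    print(matching_terms("*rod's*"))
```

Under these assumptions the first pattern matches none of the indexed terms and the second matches rod's, which is consistent with the zero hits reported for *Rod’s*.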





On 2/21/2020 11:52 AM, Erick Erickson wrote:
> Why do you say “…that are now in the index as rod’s”? You have WordDelimiterGraphFilterFactory, which breaks things up. When I put your field definition in the schema and use the analysis page, it turns “rod’s” into the following 4 tokens:
>
> rod’s
> rods
> rod
> s
>
> And querying on field:”*Rod’s*” works just fine. I’m using 8.x, and when I add “&debug=query” to the URL, I see:
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 10,
>     "params": {
>       "q": "eoe:\"*Rod's*\"",
>       "debug": "query"
>     }
>   },
>   "response": {
>     "numFound": 1,
>     "start": 0,
>     "docs": [
>       {
>         "id": "1",
>         "eoe": "Rod's",
>         "_version_": 1659176849231577088
>       }
>     ]
>   },
>   "debug": {
>     "rawquerystring": "eoe:\"*Rod's*\"",
>     "querystring": "eoe:\"*Rod's*\"",
>     "parsedquery": "SynonymQuery(Synonym(eoe:*rod's* eoe:rod))",
>     "parsedquery_toString": "Synonym(eoe:*rod's* eoe:rod)",
>     "QParser": "LuceneQParser"
>   }
> }
>
> What do you see?
>
> Best,
> Erick
>
>> On Feb 21, 2020, at 12:57 PM, Mike Phillips <m....@prosperodigital.com> wrote:
>>
>> Rod’s  finds fields Rod's and Rod’s that are now in the index as rod's
>>
>> but *Rod’s* finds nothing because the index now only contains rod's

Re: Is this a bug? Wildcard with PatternReplaceFilterFactory

Posted by Erick Erickson <er...@gmail.com>.

Why do you say “…that are now in the index as rod’s”? You have WordDelimiterGraphFilterFactory, which breaks things up. When I put your field definition in the schema and use the analysis page, it turns “rod’s” into the following 4 tokens:

rod’s
rods
rod
s

And querying on field:”*Rod’s*” works just fine. I’m using 8.x, and when I add “&debug=query” to the URL, I see: 
{
  "responseHeader": {
    "status": 0,
    "QTime": 10,
    "params": {
      "q": "eoe:\"*Rod's*\"",
      "debug": "query"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "eoe": "Rod's",
        "_version_": 1659176849231577088
      }
    ]
  },
  "debug": {
    "rawquerystring": "eoe:\"*Rod's*\"",
    "querystring": "eoe:\"*Rod's*\"",
    "parsedquery": "SynonymQuery(Synonym(eoe:*rod's* eoe:rod))",
    "parsedquery_toString": "Synonym(eoe:*rod's* eoe:rod)",
    "QParser": "LuceneQParser"
  }
}

What do you see?

Best,
Erick

> On Feb 21, 2020, at 12:57 PM, Mike Phillips <m....@prosperodigital.com> wrote:
> 
> Rod’s  finds fields Rod's and Rod’s that are now in the index as rod's
> 
> but *Rod’s* finds nothing because the index now only contains rod's