Posted to solr-user@lucene.apache.org by Mike Phillips <m....@prosperodigital.com> on 2020/02/21 17:57:21 UTC
Is this a bug? Wildcard with PatternReplaceFilterFactory
I am attempting to normalize left and right single and double quotation marks for searches:
‘  Left single quotation mark   ->  '  Single quote
’  Right single quotation mark  ->  '  Single quote
“  Left double quotation mark   ->  "  Double quote
”  Right double quotation mark  ->  "  Double quote
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <!-- required on index analyzers after graph filters -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="‘" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="“" replacement="&quot;"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="”" replacement="&quot;"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="‘" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="“" replacement="&quot;"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="”" replacement="&quot;"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The wildcard query does NOT seem to apply the PatternReplaceFilterFactory.
Rod’s finds fields containing Rod's and Rod’s, which are both now in the index as rod's,
but *Rod’s* finds nothing, because the index now contains only rod's.
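A possible explanation, not confirmed anywhere in this thread: Solr analyzes wildcard terms with a reduced "multiterm" chain rather than the full query analyzer, and PatternReplaceFilterFactory does not appear to take part in that chain by default, while LowerCaseFilterFactory does. That would match the debug output later in the thread, where *Rod’s* is parsed as _text_:*rod’s* (lowercased, quote untouched). Below is a minimal sketch of an explicit multiterm analyzer that could be added inside the fieldType, next to the index and query analyzers. The character-class patterns are a condensation of the four filters above, introduced here for illustration, and the fragment is untested against this schema.

```xml
<!-- Sketch only: an explicit analyzer for wildcard/prefix/regex terms.
     Assumption: Solr uses this chain as-is for multi-term queries on the field. -->
<analyzer type="multiterm">
  <!-- Keep the whole wildcard term, including * and ?, as a single token -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- Same normalization as the index/query chains, condensed into two patterns -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="[‘’]" replacement="'"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="[“”]" replacement="&quot;"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

After reloading the core, running *Rod’s* with debugQuery=on should show whether parsedquery becomes _text_:*rod's*. Since this only changes query-time analysis of wildcard terms, a reindex should not be needed.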
Re: Is this a bug? Wildcard with PatternReplaceFilterFactory
Posted by Mike Phillips <mi...@comcast.net>.
It looks like the debug result you are showing me is for Rod's, not Rod’s, but in answer to your question:
This is why I think "Rod’s finds fields Rod's and Rod’s that are now in the index as rod's".
The analysis page shows that Rod’s gets stored in the index as:
rod's rods rod s
Field Value (Index): Rod’s
Analyse Fieldname / FieldType: _text_

WT
text   raw_bytes               start  end  positionLength  type  termFrequency  position
Rod’s  [52 6f 64 e2 80 99 73]  0      5    1               word  1              1

SF
text   raw_bytes               start  end  positionLength  type  termFrequency  position
Rod’s  [52 6f 64 e2 80 99 73]  0      5    1               word  1              1

WDGF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod’s  [52 6f 64 e2 80 99 73]  0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

FGF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod’s  [52 6f 64 e2 80 99 73]  0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod’s  [52 6f 64 e2 80 99 73]  0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod's  [52 6f 64 27 73]        0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod's  [52 6f 64 27 73]        0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

PRF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
Rod's  [52 6f 64 27 73]        0      5    2               word  1              1         false
Rods   [52 6f 64 73]           0      5    2               word  1              1         false
Rod    [52 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

LCF
text   raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
rod's  [72 6f 64 27 73]        0      5    2               word  1              1         false
rods   [72 6f 64 73]           0      5    2               word  1              1         false
rod    [72 6f 64]              0      3    1               word  1              1         false
s      [73]                    4      5    1               word  1              2         false

This is what we were trying to achieve with the <filter
class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>.
The problem is that when using the wildcard query *Rod’s* we get no hits:
{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{ "q":"*Rod’s*", "debugQuery":"on", "_":"1582315262594"}},
  "response":{"numFound":0,"start":0,"docs":[] },
  "debug":{
    "rawquerystring":"*Rod’s*",
    "querystring":"*Rod’s*",
    "parsedquery":"_text_:*rod’s*",
    "parsedquery_toString":"_text_:*rod’s*",
    "explain":{},
    "QParser":"LuceneQParser",
    ...
On 2/21/2020 11:52 AM, Erick Erickson wrote:
> Why do you say “…that are now in the index as rod’s”? You have WordDelimiterGraphFilterFactory, which breaks things up. When I put your field definition in the schema and use the analysis page, turns “rod’s” into the following 4 tokens:
>
> rod’s
> rods
> rod
> s
>
> And querying on field:”*Rod’s*” works just fine. I’m using 8.x, and when I add “&debug=query” to the URL, I see:
> {
> "responseHeader": {
> "status": 0, "QTime": 10, "params": {
> "q": "eoe:\"*Rod's*\"", "debug": "query"
> }
> }, "response": {
> "numFound": 1, "start": 0, "docs": [
> {
> "id": "1", "eoe": "Rod's", "_version_": 1659176849231577088
> }
> ]
> }, "debug": {
> "rawquerystring": "eoe:\"*Rod's*\"", "querystring": "eoe:\"*Rod's*\"", "parsedquery": "SynonymQuery(Synonym(eoe:*rod's* eoe:rod))", "parsedquery_toString": "Synonym(eoe:*rod's* eoe:rod)", "QParser": "LuceneQParser"
> }
> }
>
> What do you see?
>
> Best,
> Erick
>
>> On Feb 21, 2020, at 12:57 PM, Mike Phillips <m....@prosperodigital.com> wrote:
>>
>> Rod’s finds fields Rod's and Rod’s that are now in the index as rod's
>>
>> but *Rod’s* finds nothing because the index now only contains rod's
Re: Is this a bug? Wildcard with PatternReplaceFilterFactory
Posted by Erick Erickson <er...@gmail.com>.
Why do you say “…that are now in the index as rod’s”? You have WordDelimiterGraphFilterFactory, which breaks things up. When I put your field definition in the schema and use the analysis page, it turns “rod’s” into the following 4 tokens:
rod’s
rods
rod
s
And querying on field:”*Rod’s*” works just fine. I’m using 8.x, and when I add “&debug=query” to the URL, I see:
{
"responseHeader": {
"status": 0, "QTime": 10, "params": {
"q": "eoe:\"*Rod's*\"", "debug": "query"
}
}, "response": {
"numFound": 1, "start": 0, "docs": [
{
"id": "1", "eoe": "Rod's", "_version_": 1659176849231577088
}
]
}, "debug": {
"rawquerystring": "eoe:\"*Rod's*\"", "querystring": "eoe:\"*Rod's*\"", "parsedquery": "SynonymQuery(Synonym(eoe:*rod's* eoe:rod))", "parsedquery_toString": "Synonym(eoe:*rod's* eoe:rod)", "QParser": "LuceneQParser"
}
}
What do you see?
Best,
Erick
> On Feb 21, 2020, at 12:57 PM, Mike Phillips <m....@prosperodigital.com> wrote:
>
> Rod’s finds fields Rod's and Rod’s that are now in the index as rod's
>
> but *Rod’s* finds nothing because the index now only contains rod's
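
Another option that nobody in the thread raised, sketched under assumptions: normalizing the quotes in a char filter, before tokenization, instead of in token filters. MappingCharFilterFactory is documented as multi-term aware, so the mapping would be expected to apply to wildcard terms such as *Rod’s* as well as to indexed text and ordinary queries. The file name mapping-quotes.txt is hypothetical, and the fragment is untested against this schema.

```xml
<!-- Sketch only. Hypothetical file mapping-quotes.txt in the core's conf directory:
       "\u2018" => "'"
       "\u2019" => "'"
       "\u201C" => "\""
       "\u201D" => "\""
-->
<analyzer type="index">
  <!-- Char filters run on the raw text before the tokenizer sees it -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-quotes.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<!-- The query analyzer would get the same charFilter line as its first element,
     with its four PatternReplaceFilterFactory entries removed. -->
```

One caveat with this design: WordDelimiterGraphFilterFactory would then see a straight apostrophe, and its possessive handling may produce different tokens from the rod’s / rods / rod / s shown on the analysis page, so a reindex and a fresh look at the analysis page would be needed before relying on it.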