Posted to solr-user@lucene.apache.org by Alexandre Rafalovitch <ar...@gmail.com> on 2014/05/17 10:05:48 UTC
Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?
Hello,
I am getting weird results that seem to come from eDisMax using the
analyzer chain to break up the input text. I have
WordDelimiterFilterFactory in my chain, which does a lot of
interesting things I did not expect the query parser to be involved in.
Specifically, the string "abc123XYZ" gets split into 3 components on
digits and gets lowercased as well. I thought all that was happening
later, inside individual fields.
All the documentation talks about query parsers splitting on whitespace, so I
don't know where this "full chain" business is coming from. Or maybe I
am misunderstanding which phase the debug output is from.
Here is the field definition:
<fieldType name="wdText" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="wsText" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="wdText" type="wdText" indexed="true" stored="true"/>
<field name="wsText" type="wsText" indexed="true" stored="true"/>
And here is the debug output:
http://localhost:9000/solr/collection1/select?q=hello+big+world+abc123XYZ&wt=json&indent=true&debugQuery=true&defType=edismax&qf=wdText+wsText&stopwords=true&lowercaseOperators=true
"rawquerystring":"hello big world abc123XYZ",
"querystring":"hello big world abc123XYZ",
"parsedquery":"(+(DisjunctionMaxQuery((wdText:hello |
wsText:hello)) DisjunctionMaxQuery((wdText:big | wsText:big))
DisjunctionMaxQuery((wdText:world | wsText:world))
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123
wdText:xyz) | wsText:abc123XYZ))))/no_coord",
"parsedquery_toString":"+((wdText:hello | wsText:hello)
(wdText:big | wsText:big) (wdText:world | wsText:world)
(((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) |
wsText:abc123XYZ))",
Oh, and enabling phrase search on the field type makes things even
weirder. But one problem at a time.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?
Posted by Jack Krupansky <ja...@basetechnology.com>.
Your bad experience seems to have occurred because you chose to use all
default values for the WDF attributes. In particular, the generateWordParts
and generateNumberParts attributes default to "1" (true), resulting in the
discrete "abc", "123", and "xyz" tokens, and the catenateAll attribute
defaults to "0" (false), which means that the "abc123xyz" token is not
generated by that attribute, although "abc123xyz" is generated because you
explicitly specified the preserveOriginal attribute to be "1".
Generally, you need to have asymmetric WDF analyzers, one for indexing that
generates multiple terms for better recall, and one for query that generates
only a sequence of the sub-terms (as if a quoted phrase) for more precise
matching. So, it's fine to use preserveOriginal="1" for indexing, as well as
catenateAll="1" and generateNumberParts="1" and generateWordParts="1", but
for query analysis you should have preserveOriginal="0", catenateAll="0",
catenateWords="0", catenateNumbers="0",
generateNumberParts="1" and generateWordParts="1".
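As a rough sketch of the asymmetric setup described above, a field type
could declare separate index and query analyzers along these lines. The
type name "wdTextAsym" is hypothetical, and the catenate attributes are
written with the spellings the factory uses (catenateWords and
catenateNumbers). Treat this as an untested starting point, not a
recommended configuration:

```xml
<!-- Sketch only: "wdTextAsym" is a made-up name for illustration. -->
<fieldType name="wdTextAsym" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: emit sub-terms, the catenated form and the original token. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query side: emit only the sequence of sub-terms. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents have to be reindexed after changing the index-side analyzer;
a change to the query-side analyzer takes effect once the core is reloaded.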
The distinction between preserveOriginal and catenateAll is whether
punctuation should be included (for the former) or stripped out (the
latter):
abc. => abc. vs. abc
(xyz). => (xyz). vs. xyz
401(k). => 401(k). vs. 401 k
CD-ROM. => CD-ROM. vs. CD ROM
Finally, the default for the splitOnNumerics attribute is "1" (true), which
is why "abc123xyz" is split into three terms. If you don't want that split,
set splitOnNumerics="0".
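For illustration, the filter line from the original wdText definition with
that attribute added might look like the following. This is a sketch only;
the other attributes are left at their defaults, which may or may not suit
your data:

```xml
<!-- Sketch: same filter as in the wdText type, with numeric splitting off. -->
<filter class="solr.WordDelimiterFilterFactory"
        preserveOriginal="1"
        splitOnNumerics="0"/>
```

Re-running the query with debugQuery=true should show whether the wdText
clause still contains the separate abc, 123 and xyz terms.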
There are more details on WDF in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
-- Jack Krupansky
-----Original Message-----
From: Alexandre Rafalovitch
Sent: Saturday, May 17, 2014 1:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?
My understanding was that lower-casing and other things happen on a
per-field basis, as a step after the dismax formula is applied. In
this case, however, this seems to be happening before:
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz)
Hence the question to someone who actually understands those guts. For
eDisMax, what's the correct/expected call sequence between query
parser and field-type parser? Or maybe just a slightly more in-depth
explanation of Michael's statement.
Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
My understanding was that lower-casing and other things happen on a
per-field basis, as a step after the dismax formula is applied. In
this case, however, this seems to be happening before:
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz)
Hence the question to someone who actually understands those guts. For
eDisMax, what's the correct/expected call sequence between query
parser and field-type parser? Or maybe just a slightly more in-depth
explanation of Michael's statement.
Regards,
Alex.
On Sat, May 17, 2014 at 8:28 PM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> Alex - the query parsers generally accept an analyzer, which they must apply
> after they perform their own tokenization. Consider: how would a
> capitalized query term match lower-cased terms in the index without query
> analysis?
>
> -Mike
Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Alex - the query parsers generally accept an analyzer, which they must
apply after they perform their own tokenization. Consider: how would a
capitalized query term match lower-cased terms in the index without
query analysis?
-Mike