Posted to solr-user@lucene.apache.org by Alexandre Rafalovitch <ar...@gmail.com> on 2014/05/17 10:05:48 UTC

Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

Hello,

I am getting weird results that seem to come from eDisMax using the
analyzer chain to break up the input text. I have
WordDelimiterFilterFactory in my chain, which does a lot of
interesting things I did not expect the query parser to be involved in.

Specifically, the string "abc123XYZ" gets split into 3 components on
digits and gets lowercased as well. I thought all that was happening
later, inside individual fields.

All the documentation talks about query parsers splitting on whitespace,
so I don't know where this "full chain" business is coming from. Or maybe
I am misunderstanding which phase the debug output is from.

Here is the field definition:
    <fieldType name="wdText" class="solr.TextField" >
        <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType>
    <fieldType name="wsText" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <field name="wdText"      type="wdText" indexed="true" stored="true" />
    <field name="wsText"      type="wsText" indexed="true" stored="true" />

And here is the debug output:
http://localhost:9000/solr/collection1/select?q=hello+big+world+abc123XYZ&wt=json&indent=true&debugQuery=true&defType=edismax&qf=wdText+wsText&stopwords=true&lowercaseOperators=true

    "rawquerystring":"hello big world abc123XYZ",
    "querystring":"hello big world abc123XYZ",
    "parsedquery":"(+(DisjunctionMaxQuery((wdText:hello | wsText:hello)) DisjunctionMaxQuery((wdText:big | wsText:big)) DisjunctionMaxQuery((wdText:world | wsText:world)) DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) | wsText:abc123XYZ))))/no_coord",
    "parsedquery_toString":"+((wdText:hello | wsText:hello) (wdText:big | wsText:big) (wdText:world | wsText:world) (((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) | wsText:abc123XYZ))",

Oh, and enabling phrase search on the field type makes it even
weirder. But one problem at a time.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

Posted by Jack Krupansky <ja...@basetechnology.com>.

Your unexpected results seem to come from leaving nearly all of the WDF
(WordDelimiterFilter) attributes at their default values. In particular,
the generateWordParts and generateNumberParts attributes default to "1"
(true), which produces the discrete "abc", "123", and "xyz" tokens. The
catenateAll attribute defaults to "0" (false), so it is not the source of
the "abc123xyz" token; that token is generated because you explicitly set
the preserveOriginal attribute to "1".

Generally, you need asymmetric WDF analyzers: one for indexing that
generates multiple terms for better recall, and one for querying that
generates only a sequence of the sub-terms (as if a quoted phrase) for more
precise matching. So it is fine to use preserveOriginal="1" for indexing,
along with catenateAll="1", generateNumberParts="1", and
generateWordParts="1", but for query analysis you should have
preserveOriginal="0", catenateAll="0", catenateWords="0",
catenateNumbers="0", generateNumberParts="1", and generateWordParts="1".
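
A minimal sketch of what such an asymmetric field type might look like,
based on the wdText definition from the original post. The field type name
"wdTextAsym" is made up for illustration, and the attribute values simply
follow the recommendation above. Treat it as a starting point to verify on
the Analysis screen, not as a tested configuration.

```xml
<!-- Hypothetical field type; the name "wdTextAsym" is illustrative only. -->
<fieldType name="wdTextAsym" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: emit sub-terms, the catenated form, and the original token for recall. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query side: emit only the sequence of sub-terms. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the index-side analyzer requires reindexing before existing
documents reflect the new terms.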

The distinction between preserveOriginal and catenateAll is whether 
punctuation should be included (for the former) or stripped out (the 
latter):

abc. => abc. vs. abc

(xyz). => (xyz). vs. xyz

401(k). => 401(k). vs. 401k

CD-ROM. => CD-ROM. vs. CDROM

Finally, the default for the splitOnNumerics attribute is "1" (true), which 
is why "abc123xyz" is split into three terms. If you don't want that split, 
set splitOnNumerics="0".
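
As a rough sketch, the filter line from the original wdText chain with that
one attribute added would look like this (every other attribute is left as
in the original post, so the remaining defaults still apply):

```xml
<!-- Same filter as in the original wdText chain, with letter/digit splitting turned off. -->
<filter class="solr.WordDelimiterFilterFactory"
        preserveOriginal="1" splitOnNumerics="0"/>
```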

There are more details on WDF in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-----Original Message----- 
From: Alexandre Rafalovitch
Sent: Saturday, May 17, 2014 1:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

My understanding was that lowercasing and similar steps happen on a
per-field basis, as a step after the dismax formula is applied. In
this case, however, they seem to be happening before it:
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz)

Hence the question for someone who actually understands those guts. For
eDisMax, what is the correct/expected call sequence between the query
parser and the field-type analyzer? Or maybe just a slightly more in-depth
explanation of Michael's statement.

Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

My understanding was that lowercasing and similar steps happen on a
per-field basis, as a step after the dismax formula is applied. In
this case, however, they seem to be happening before it:
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz)

Hence the question for someone who actually understands those guts. For
eDisMax, what is the correct/expected call sequence between the query
parser and the field-type analyzer? Or maybe just a slightly more in-depth
explanation of Michael's statement.

Regards,
   Alex.


On Sat, May 17, 2014 at 8:28 PM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> Alex - the query parsers generally accept an analyzer, which they must apply
> after they perform their own tokenization.  Consider: how would a
> capitalized query term match lower-cased terms in the index without query
> analysis?
>
> -Mike

Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

Alex - the query parsers generally accept an analyzer, which they must 
apply after they perform their own tokenization.  Consider: how would a 
capitalized query term match lower-cased terms in the index without 
query analysis?

-Mike
