Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2016/08/25 23:41:26 UTC

Inventor-template vs Inventor template - issue with hyphen

Hi,

  I'm trying to figure out search behaviour for two similar terms, one with
a hyphen and one without. They return different result sets: the search
without the hyphen brings back more results than the one with it. Here's the
fieldtype definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms/synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms/synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>

If I run the search terms through the analyzer, the final indexed form of
both terms (with and without the hyphen) is --> *inventor templat*

I was under the impression that, based on my analyzers, both search terms
would produce the same results.

Here's the output from debug and splainer.

*Inventor-template*
*-------------------------*

<str name="parsedquery">(+DisjunctionMaxQuery(((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(1472083200000),date(PublishDate)))+1.0)))/no_coord</str>

<str name="parsedquery_toString">+((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01
1.0/(3.16E-11*float(ms(const(1472083200000),date(PublishDate)))+1.0)</str>

From Splainer:

10.974786 Sum of the following:
 9.203462 Dismax (max plus:0.01 times others)
   9.198681 title:"inventor templat"

   0.4781131 text:"inventor templat"

 1.7644342 Source2:sfdcarticles

 0.006889837 1.0/(3.16E-11*float(ms(const(1472083200000),date(PublishDate)))+1.0)


*Inventor template*
*--------------------------*

<str name="parsedquery">(+(+DisjunctionMaxQuery((CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01) +DisjunctionMaxQuery((CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01)) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(1472083200000),date(PublishDate)))+1.0)))/no_coord</str>

<str name="parsedquery_toString">+(+(CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01 +(CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
1.0/(3.16E-11*float(ms(const(1472083200000),date(PublishDate)))+1.0)</str>

From splainer :

9.915069 Sum of the following:
 5.03947 Dismax (max plus:0.01 times others)
   5.038846 title:templat

   0.062400598 text:templat

 4.767776 Dismax (max plus:0.01 times others)
   4.7674117 title:inventor

   0.03642158 text:inventor

 0.098686054 Source2:CloudHelp

 0.009136423 1.0/(3.16E-11*float(ms(const(1472083200000),date(PublishDate)))+1.0)
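
The Splainer numbers above can be checked by hand. Each "Dismax (max plus:0.01 times others)" line is the best field score plus 0.01 (the tie value shown as ~0.01 in the parsed query) times the remaining field scores. A small Python sketch of that arithmetic, using only the figures quoted above (the function name dismax_score is made up for illustration):

```python
def dismax_score(field_scores, tie):
    """Best field score plus `tie` times the sum of the other field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

TIE = 0.01  # the ~0.01 in the parsed queries

# "Inventor-template": a single dismax clause over the phrase matches
hyphen_dismax = dismax_score([9.198681, 0.4781131], TIE)
hyphen_total = hyphen_dismax + 1.7644342 + 0.006889837

# "Inventor template": one dismax clause per term
templat_dismax = dismax_score([5.038846, 0.062400598], TIE)
inventor_dismax = dismax_score([4.7674117, 0.03642158], TIE)
space_total = templat_dismax + inventor_dismax + 0.098686054 + 0.009136423

print(round(hyphen_dismax, 6), round(hyphen_total, 6))
print(round(templat_dismax, 6), round(inventor_dismax, 6), round(space_total, 6))
```

The recomputed values agree with the reported 9.203462 / 10.974786 and 5.03947 / 4.767776 / 9.915069 to within float rounding. So the two explains differ in the shape of the query (one phrase clause versus two per-term clauses), not in how the scores are combined.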


I'm using edismax.


Just wondering what I'm missing here. Any help will be appreciated.

Regards,
Shamik

Re: Inventor-template vs Inventor template - issue with hyphen

Posted by Erick Erickson <er...@gmail.com>.
This confuses a lot of people. The difference is at the top-level parser, way
before it gets to the analysis chain.

"Inventor-template" comes out of the top-level parser as a single token. From
there it goes through edismax etc. So it's a single token spread across your
fields by edismax. It's only during the field analysis that it's broken into
two tokens.

"Inventor template" is parsed as two distinct tokens and fed to edismax as two
tokens, where they're spread across your fields as a pair of words.
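
A toy model of that two-stage behaviour, as a rough Python sketch rather than Solr's actual code. The helper names (analyze, parse) are hypothetical, and the "stemmer" is a crude placeholder that only strips a trailing "e". The point is the order of operations: the parser splits on whitespace first, and each chunk is analyzed afterwards. When one chunk analyzes to several tokens and the field type has autoGeneratePhraseQueries="true" (as in the fieldType posted above), the result is a phrase query.

```python
import re

def analyze(chunk):
    """Toy stand-in for the query analyzer: split on non-alphanumerics,
    lowercase, and strip a trailing 'e' (a placeholder, NOT Porter stemming)."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", chunk) if p]
    return [re.sub(r"e$", "", p.lower()) for p in parts]

def parse(query, field, auto_phrase=True):
    """Split on whitespace first (the parser step), then analyze each chunk."""
    clauses = []
    for chunk in query.split():
        tokens = analyze(chunk)
        if len(tokens) > 1 and auto_phrase:
            # one chunk, several tokens: emit a phrase query
            clauses.append('%s:"%s"' % (field, " ".join(tokens)))
        else:
            clauses.extend("%s:%s" % (field, t) for t in tokens)
    return clauses

print(parse("Inventor-template", "title"))
print(parse("Inventor template", "title"))
print(parse("Inventor-template", "CommandSrch", auto_phrase=False))
```

The first call yields a single phrase clause and the second yields two independent term clauses, which mirrors the title:"inventor templat" versus title:inventor / title:templat shapes in the debug output. A phrase clause only matches documents where the two terms are adjacent, which would explain the smaller result set for the hyphenated form. The third call mirrors the (+CommandSrch:inventor +CommandSrch:templat) clause. That shape is what one would expect from a field type without autoGeneratePhraseQueries, though the definition of CommandSrch is not shown in this thread.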

Best,
Erick




On Fri, Aug 26, 2016 at 8:09 AM, shamik <sh...@gmail.com> wrote:

> Anyone ?
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Inventor-template-vs-Inventor-template-issue-with-hyphen-tp4293357p4293489.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Inventor-template vs Inventor template - issue with hyphen

Posted by shamik <sh...@gmail.com>.
Anyone ?




Re: Inventor-template vs Inventor template - issue with hyphen

Posted by shamik <sh...@gmail.com>.
Thanks Erick. I did look at the analysis tool and the debug query output, and
posted the results in my original message. WDF correctly strips the "-" from
Inventor-template, so both terms are broken down to "inventor templat". What
I'm not sure about is why the query is constructed differently at query time.
Here's the parsed query:

*Inventor-template*

<str name="parsedquery">
(+DisjunctionMaxQuery(((+CommandSrch:inventor +CommandSrch:templat) |
text:"inventor templat"^1.5 | Description:"inventor templat"^2.0 |
title:"inventor templat"^3.5 | keywords:"inventor templat"^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(1472169600000),date(PublishDate)))+1.0)))/no_coord
</str>

<str name="parsedquery_toString">
+((+CommandSrch:inventor +CommandSrch:templat) | text:"inventor templat"^1.5
| Description:"inventor templat"^2.0 | title:"inventor templat"^3.5 |
keywords:"inventor templat"^1.2)~0.01 Source2:sfdcarticles^9.0
Source2:downloads^5.0 
1.0/(3.16E-11*float(ms(const(1472169600000),date(PublishDate)))+1.0)
</str>

*Inventor template*

<str name="parsedquery">
(+(+DisjunctionMaxQuery((CommandSrch:inventor | text:inventor^1.5 |
Description:inventor^2.0 | title:inventor^3.5 | keywords:inventor^1.2)~0.01)
+DisjunctionMaxQuery((CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01))
Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(1472169600000),date(PublishDate)))+1.0)))/no_coord
</str>

<str name="parsedquery_toString">
+(+(CommandSrch:inventor | text:inventor^1.5 | Description:inventor^2.0 |
title:inventor^3.5 | keywords:inventor^1.2)~0.01 +(CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0 
1.0/(3.16E-11*float(ms(const(1472169600000),date(PublishDate)))+1.0)
</str>

The part I'm confused about is why the two queries are being interpreted
differently.

Thanks,
Shamik




Re: Inventor-template vs Inventor template - issue with hyphen

Posted by Erick Erickson <er...@gmail.com>.
Look at your admin/analysis page. WordDelimiterFilterFactory breaks on
non-alphanumeric characters. Also, adding &debug=query will show you the parsed
form of the query, and that'll help.
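
For comparing the two forms side by side, here is a small Python sketch that only builds the request URLs with debug=query. The host, port, and collection name are placeholders, and it assumes the request handler's defaults supply qf and the boosts:

```python
from urllib.parse import urlencode

def debug_url(base, query):
    """Build a /select URL that asks Solr to include the parsed query."""
    params = {"q": query, "defType": "edismax", "debug": "query", "rows": 0}
    return base + "/select?" + urlencode(params)

# Hypothetical Solr location; replace with your own host and collection.
BASE = "http://localhost:8983/solr/mycollection"

for q in ("Inventor-template", "Inventor template"):
    print(debug_url(BASE, q))
```

Fetching each URL and comparing the parsedquery entries shows where the two queries diverge.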
