Posted to solr-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2014/11/18 20:33:48 UTC

problems when hunspell returns multiple stems

I find that a query for stemmed terms sometimes fails with the edismax 
query parser and hunspell stemmer. Looking at the analysis output for 
the query (text:following), I can see that it generates two different 
terms at the same position: "follow" and "following". Then edismax seems 
to generate a sloppy phrase query from that; in the debug output of the 
query I can see ( text:following text:follow)~2. This doesn't match 
anything, even though the words follow and following (as well as 
followed, follows, etc.) all occur in various documents.

First, I'm confused as to what the source of the sloppy query is. Here 
are the relevant settings from solrconfig:

<str name="defType">edismax</str>
<str name="qf">archive_id^1 author^20 chapter_title^15 isbn^1 
publisher^5 subjects^5 text^1 title^120</str>
<str name="pf">chapter_title~2^1 subjects~2^20 text~10^1 title~2^4</str>
<str name="mm">100%</str>
<str name="q.op">OR</str>

Is there some process that generates a slop query for co-occurring terms?

As an aside, the same query matches one document when we use the lucene 
query parser. But when I search across our unstemmed field, it returns 
more.

It seems as if hunspell returning multiple terms for a single input 
term causes problems?

So in summary: why would hunspell generate "following" as a stem for 
"following"? Probably just a buggy dictionary entry; we could fix that, 
but I wouldn't expect the phrase behavior in that case from edismax 
either.  Can anybody shed some light on what's going on here?

Thanks

-Mike

Re: problems when hunspell returns multiple stems

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

On 18 November 2014 15:52, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> I found a rogue new component in our analyzer

We have a first Solr virus? I thought we were safe until the "upload
the plugin" JIRA was in production :-)

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

Re: problems when hunspell returns multiple stems

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

OK - please disregard; I found a rogue new component in our analyzer 
that was messing everything up.

The hunspell behavior was perhaps a little confusing, but I don't 
believe it leads to broken queries.

-Mike


On 11/18/2014 02:38 PM, Michael Sokolov wrote:
> followup - hunspell has:
>
> follow/SDRZGJ
> follower/M
> following/M
>
> follow/G generates following
>
> I guess the reason for the /M entries is to represent the nouns, which 
> have plural endings, so that
>
> following->followings
>
> -- I'm not really sure where the bug is, but it seems as if generating 
> multiple "stems" causes issues

Re: problems when hunspell returns multiple stems

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

followup - hunspell has:

follow/SDRZGJ
follower/M
following/M

follow/G generates following

I guess the reason for the /M entries is to represent the nouns, which 
have plural endings, so that

following->followings

-- I'm not really sure where the bug is, but it seems as if generating 
multiple "stems" causes issues
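
To illustrate how a single token can come back with two "stems", here is 
a toy Python sketch. It is not Hunspell's actual algorithm: the lookup 
table copies only the three entries above, and the single affix rule 
(flag G appends "ing") is a simplifying assumption made for illustration. 
The names DICTIONARY, SUFFIXES and stems are hypothetical.

```python
# Toy model of dictionary-based stemming. NOT Hunspell's real algorithm.
# Entries mirror the three dictionary lines quoted above.
DICTIONARY = {
    "follow": set("SDRZGJ"),
    "follower": {"M"},
    "following": {"M"},
}

# Assumption for illustration: only flag G is modeled, as "append ing".
SUFFIXES = {"G": "ing"}


def stems(word):
    """Return every dictionary entry the word could be derived from."""
    found = []
    # Case 1: the word is itself a dictionary entry.
    if word in DICTIONARY:
        found.append(word)
    # Case 2: stripping a suffix yields an entry that carries that flag.
    for flag, suffix in SUFFIXES.items():
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            if flag in DICTIONARY.get(root, set()):
                found.append(root)
    return found


print(stems("following"))  # ['following', 'follow']
```

Under these assumptions "following" matches both its own entry and the 
follow/G derivation, which is consistent with the two terms seen at the 
same position in the analysis output.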

