Posted to solr-user@lucene.apache.org by Nicholas Violi <nv...@globalgiving.org> on 2014/10/10 21:19:02 UTC

Stemming breaks highlighting?

Hi all,

I changed some of my fields from text_general to text_en, hoping to take
advantage of stemming and some other improvements, but unfortunately the
change has broken highlighting. It seems that it only wants to highlight
non-stemmed words (i.e. words whose stemmed version is the same as the word
itself, like "child").

I'm using the default fieldType definition:

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="lang/stopwords_en.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="lang/stopwords_en.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

I'm enabling highlighting with hl.fl=title&hl=true in my query. This is
also a faceted search, if that matters.
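
For anyone who wants to reproduce the request, here is a minimal sketch of how such a query URL could be built. The host, port, and core name ("collection1") are placeholders I made up, not details from my setup; only hl=true and hl.fl=title come from the description above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_highlight_url(query, field="title", base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr to highlight matches in `field`."""
    params = [
        ("q", query),
        ("hl", "true"),    # turn highlighting on
        ("hl.fl", field),  # field to highlight, as in the message above
        ("wt", "json"),    # response format; an assumption for readability
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # "title:strides" is just an illustrative query.
    print(build_highlight_url("title:strides"))
```

The printed URL can be pasted into a browser or passed to curl; the "highlighting" section of the response is the part to compare before and after the schema change.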

In this case, as I said, only words that stemming leaves unchanged, like
"child", are highlighted. If I remove the stemming filter from the index
analyzer only (changing the query analyzer seems to have no effect), all
matched words except stopwords are highlighted. Furthermore, if I change
text_en to use the EnglishMinimalStemFilterFactory instead, more words are
highlighted; I assume these are words that the Porter stemmer changes but
the minimal stemmer does not. An example of such a word is "strides".

Does anyone know what's going on?
Thanks,
Nick
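
One way to check what each stemmer does to a word like "strides" is Solr's field analysis handler. The sketch below only builds the request URL. It assumes the handler is registered at /analysis/field (it may not be in every solrconfig.xml), and the host, port, and core name are again made-up placeholders.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: assumes a field analysis handler is registered
# at /analysis/field; host, port, and core name are placeholders.
SOLR_ANALYSIS_URL = "http://localhost:8983/solr/collection1/analysis/field"


def build_analysis_url(field_type, index_text, query_text,
                       base_url=SOLR_ANALYSIS_URL):
    """Return a URL that runs the index and query analyzers of `field_type`."""
    params = [
        ("analysis.fieldtype", field_type),   # e.g. text_en
        ("analysis.fieldvalue", index_text),  # text run through the index analyzer
        ("analysis.query", query_text),       # text run through the query analyzer
        ("analysis.showmatch", "true"),       # mark index tokens matching the query
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_analysis_url("text_en", "strides", "strides"))
```

The response lists the tokens after every filter in the chain, so it shows exactly where "strides" changes and whether the index-side and query-side tokens still agree.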

Re: Stemming breaks highlighting?

Posted by Nicholas Violi <nv...@globalgiving.org>.

Hey Ahmet,
Thanks for the quick response. Yes, I reindexed; to be sure I just wiped
out the whole data directory and made a fresh index again.

Nick

On Fri, Oct 10, 2014 at 3:20 PM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi Nick,
>
> Did you re-index after schema change?
>
> Ahmet

Re: Stemming breaks highlighting?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Nick,

Did you re-index after schema change?

Ahmet


