Posted to solr-user@lucene.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2014/12/15 13:38:47 UTC

My new lemmatizer interferes with the highlighter

I have written a dictionary-based lemmatizer for the University of Oslo 
which I also want to donate back to Apache. Before I do that, I need 
some help to figure out why it interferes with the highlighter. Some 
totally irrelevant words get highlighted, so there is something strange 
going on. It does not happen frequently, and I'm not able to reproduce 
the problem at all if I change back to my default NorwegianMinimalStemmer.

Can someone take a look at the source code I have temporarily placed here?
http://folk.uio.no/erlendfg/solr/

Please ignore the bad parameter names "articles" and "articlePos". They 
will be changed to wordClass and wordClassPos respectively.

As you can see (browse.png), the words "eller" (en: or) and "utenfor" 
(en: outside) get highlighted if I search for the word "grønnest" (en: 
greenest). All the other documents in the search result have correctly 
highlighted words.

This behaviour has nothing to do with Norwegian special characters like 
æ, ø and å. I have seen other examples without these characters as well. 
If I enter the word "grønnest" in the Analyzer, everything seems to work 
as it should, and the same goes for the words which sometimes get wrongly 
highlighted.

Some basic information about my lemmatizer:
- It is not bound to any specific language (I have tested it with German 
dictionaries as well).
- It needs a comma-separated dictionary with at least two columns (word 
and its stem).
- A third column with the word class (verb, noun etc.) is preferable, 
but not mandatory.
- POS tags may optionally be stored.
- A hashmap, kept as small as possible, is loaded into memory at startup 
with entries from the dictionary.
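
A minimal sketch of how such a dictionary could be loaded into a hashmap. 
It is not the actual DictionaryLemmatizerFilter code: the class name 
LemmaDictionary, the 1-based column positions (mirroring stemPos and 
wordPos in the config), and the idea of skipping entries where word and 
stem are identical to keep the map small are all assumptions made for 
illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the code from the thread. */
final class LemmaDictionary {
    // Full word form -> stems/lemmas found for it in the dictionary.
    private final Map<String, List<String>> lemmas = new HashMap<>();

    /**
     * @param lines   comma-separated dictionary lines
     * @param stemPos 1-based column holding the stem (assumed convention)
     * @param wordPos 1-based column holding the full word form (assumed convention)
     */
    LemmaDictionary(List<String> lines, int stemPos, int wordPos) {
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length < Math.max(stemPos, wordPos)) {
                continue; // skip lines with too few columns
            }
            String stem = cols[stemPos - 1].trim();
            String word = cols[wordPos - 1].trim();
            if (word.equals(stem)) {
                continue; // an identity entry adds nothing, so leave it out
            }
            List<String> stems = lemmas.computeIfAbsent(word, k -> new ArrayList<>());
            if (!stems.contains(stem)) {
                stems.add(stem);
            }
        }
    }

    /** Returns the stems for a word, or an empty list if the word is unknown. */
    List<String> lookup(String word) {
        return lemmas.getOrDefault(word, Collections.<String>emptyList());
    }
}
```

With the line "hus,husene,subst" and stemPos=1, wordPos=2, a lookup of 
"husene" would return the single stem "hus".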

My config in schema.xml (notes on the parameters follow the element):
<filter 
class="no.uio.webapps.sok.analysis.DictionaryLemmatizerFilterFactory"
charset="iso-8859-1"
storePosTag="false"
articles="subst,verb,adj"
reduceTo="subst,verb"
stemPos="1"
wordPos="2"
articlePos="3"
dictionaries="fullform_bm.txt.gz,fullform_nn.txt.gz,custom_dic.txt"/>

- charset: charset of the dictionary file
- articles: which word classes to add (note: bad parameter name, will be 
changed)
- reduceTo: words with several stems get reduced to one, in this order 
(optional)
- stemPos: where to find the stems
- wordPos: where to find the words
- articlePos: where to find the word classes (optional; note: bad 
parameter name, will be changed)

Our environment:
- Solr 4.4.0

Our highlighter config:
http://folk.uio.no/erlendfg/solr/highlighter.txt

Re: My new lemmatizer interferes with the highlighter

Posted by Erlend Garåsen <e....@usit.uio.no>.

Thanks Ahmet,

I think I have solved the problem, but I didn't replace the line you 
suggested. Instead I added the createToken method with 
AttributeSource.State as a parameter and overrode the reset method. I 
cannot reproduce the problem anymore.

BTW, what's the purpose of AttributeSource.State? Perhaps that alone has 
solved the problem.
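
On the AttributeSource.State question: in Lucene, captureState() returns 
an AttributeSource.State, which is a snapshot of all attribute values of 
the current token (term, offsets, position increment and so on), and 
restoreState(state) writes that snapshot back. A filter that emits extra 
tokens on later incrementToken() calls typically restores the snapshot 
first, so the extra token carries the offsets of the word it belongs to. 
The plain-Java model below does not use the Lucene API; Tok and 
StackingLemmaFilter are made-up names, and it assumes the original token 
is kept with its lemmas stacked on it. It only illustrates why a saved 
snapshot plus a reset() that clears pending lemmas could matter for 
highlighting. Whether this was the actual cause here is not confirmed in 
the thread.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one token: term text, position increment, character offsets. */
final class Tok {
    final String term;
    final int posInc;
    final int start;
    final int end;

    Tok(String term, int posInc, int start, int end) {
        this.term = term;
        this.posInc = posInc;
        this.start = start;
        this.end = end;
    }
}

/** Plain-Java model of a lemma-stacking filter. It does NOT use the Lucene API. */
final class StackingLemmaFilter {
    private final Map<String, List<String>> lemmas; // surface form -> lemmas
    private final Deque<String> pending = new ArrayDeque<>();
    private Iterator<Tok> input = Collections.emptyIterator();
    private Tok saved; // snapshot of the token the pending lemmas belong to

    StackingLemmaFilter(Map<String, List<String>> lemmas) {
        this.lemmas = lemmas;
    }

    /** Starts a new token stream and drops lemmas left over from the previous one. */
    void reset(List<Tok> tokens) {
        pending.clear();
        saved = null;
        input = tokens.iterator();
    }

    /** Returns the next token, or null at the end of the stream. */
    Tok next() {
        if (!pending.isEmpty()) {
            // Emit a stacked lemma with the saved token's offsets and increment 0.
            return new Tok(pending.poll(), 0, saved.start, saved.end);
        }
        if (!input.hasNext()) {
            return null;
        }
        Tok tok = input.next();
        List<String> found = lemmas.getOrDefault(tok.term, Collections.<String>emptyList());
        if (!found.isEmpty()) {
            pending.addAll(found);
            saved = tok; // plays the role of captureState()
        }
        return tok;
    }
}
```

In this model, if reset() did not clear the pending queue, a lemma left 
over from one text would be emitted at the start of the next one, and a 
highlighter working from its offsets would mark an unrelated word.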

Erlend

On 15.12.14 16:13, Ahmet Arslan wrote:
> Hi Erlend,
>
> I have written a similar token filter. Please see :
>
> https://github.com/iorixxx/lucene-solr-analysis-turkish/blob/master/src/main/java/org/apache/lucene/analysis/tr/Zemberek2DeasciifyFilterFactory.java
>
> replace
>
> final String[] values = stemmer.stem(tokenTerm);
>
> with
>
> stack = stemmer.stem(tokenTerm);
>
> Ahmet

Re: My new lemmatizer interferes with the highlighter

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Erlend,

I have written a similar token filter. Please see:

https://github.com/iorixxx/lucene-solr-analysis-turkish/blob/master/src/main/java/org/apache/lucene/analysis/tr/Zemberek2DeasciifyFilterFactory.java

replace 

final String[] values = stemmer.stem(tokenTerm);

with 

stack = stemmer.stem(tokenTerm);

Ahmet





Re: My new lemmatizer interferes with the highlighter

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

Well, I think your first step should be finding a reproducible test case 
and encoding it as a unit test.  But I suspect ultimately the fix will 
have something to do with positionIncrement ...

-Mike


Re: My new lemmatizer interferes with the highlighter

Posted by Erlend Garåsen <e....@usit.uio.no>.

On 15.12.14 14:11, Michael Sokolov wrote:
> I'm not sure, but is it necessary to set positionIncAttr to 1 when there
> are *not* any lemmas found?  I think the usual pattern is to call
> clearAttributes() at the start of incrementToken

It is set to 0 only if there are stems/lemmas found:
if (!terms.isEmpty()) {
   positionAttr.setPositionIncrement(0);

The terms list will only contain entries if there are lemmas found.

But maybe I should empty this list before I return true, just like this?

if (!terms.isEmpty()) {
   termAtt.setEmpty().append(terms.poll());
   positionAttr.setPositionIncrement(0);
   terms.clear();
   return true;
} else if ...
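
For context on what the increment of 0 means: a position increment says 
how far a token's position advances from the previous token, so 0 stacks 
the lemma on the position of the token before it and 1 moves on to the 
next position. A small plain-Java illustration (the class Positions is 
made up for this sketch and is not part of Lucene), assuming the first 
token carries increment 1 and therefore lands on position 0:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; illustrates position increments without using Lucene. */
final class Positions {
    /** Converts per-token position increments into absolute token positions. */
    static List<Integer> absolute(int[] increments) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1; // start before the first token so that increment 1 gives position 0
        for (int inc : increments) {
            pos += inc;
            positions.add(pos);
        }
        return positions;
    }
}
```

For a word, its lemma with increment 0, and the following word, the 
increments 1, 0, 1 give the positions 0, 0, 1.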

Re: My new lemmatizer interferes with the highlighter

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

I'm not sure, but is it necessary to set positionIncAttr to 1 when there 
are *not* any lemmas found?  I think the usual pattern is to call 
clearAttributes() at the start of incrementToken.

-Mike
