Posted to solr-user@lucene.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2014/12/15 13:38:47 UTC
My new lemmatizer interferes with the highlighter
I have written a dictionary-based lemmatizer for the University of Oslo
which I also want to donate back to Apache. Before I do that, I need
some help to figure out why it interferes with the highlighter. Some
totally irrelevant words get highlighted, so there is something strange
going on. It does not happen frequently, and I cannot reproduce the
problem if I switch back to my default NorwegianMinimalStemmer.
Can someone take a look at the source code I have temporarily placed here?
http://folk.uio.no/erlendfg/solr/
Please ignore the bad parameter names "articles" and "articlePos". They
will be changed to wordClass and wordClassPos respectively.
As you can see (browse.png), the words "eller" (en: or) and "utenfor"
(en: outside) get highlighted if I search for the word "grønnest" (en:
greenest). All the other documents in the search result have correctly
highlighted words.
This behaviour has nothing to do with Norwegian special characters like
æ, ø and å. I have seen other examples without these characters as well.
If I enter the word "grønnest" in the Analyzer, everything seems to work
as it should, including the words which sometimes get wrongly highlighted.
Some basic information about my lemmatizer:
- It is not bound to any specific language (it has also been tested
with German dictionaries).
- It needs a comma-separated dictionary with at least two columns (word
and its stem).
- A third column about the word class (verb, noun etc.) is preferable,
but not mandatory.
- POS tags may optionally be stored.
- A hashmap that is as small as possible is loaded into memory at
startup with entries from the dictionary.
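To make the list above concrete, here is a minimal sketch of how such a dictionary could be read into a hashmap. It is not the lemmatizer's actual code: the class name LemmaDictionary, the method load and its parameters are hypothetical, and skipping rows where the word equals its stem is only one assumed way of keeping the map small. Column positions are 1-based, mirroring stemPos, wordPos and articlePos in the config.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not the lemmatizer's real loader.
final class LemmaDictionary {

    /**
     * Reads comma-separated rows and maps each word to its stems.
     * Positions are 1-based. classPos <= 0 means there is no word-class
     * column, in which case no row is filtered by word class.
     */
    static Map<String, List<String>> load(Reader in, int stemPos, int wordPos,
                                          int classPos, Set<String> keepClasses) {
        Map<String, List<String>> stemsByWord = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",", -1);
                if (cols.length < Math.max(stemPos, wordPos)) {
                    continue; // too few columns (also skips blank lines)
                }
                if (classPos > 0 && (cols.length < classPos
                        || !keepClasses.contains(cols[classPos - 1].trim()))) {
                    continue; // word class missing or not requested
                }
                String stem = cols[stemPos - 1].trim();
                String word = cols[wordPos - 1].trim();
                if (word.isEmpty() || stem.isEmpty() || word.equals(stem)) {
                    continue; // assumed optimisation: identity rows add nothing
                }
                List<String> stems =
                        stemsByWord.computeIfAbsent(word, k -> new ArrayList<>(1));
                if (!stems.contains(stem)) {
                    stems.add(stem); // a word may have several distinct stems
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stemsByWord;
    }
}
```

With the rows "house,houses,subst" and "house,house,subst", and "subst" among the kept classes, the resulting map holds a single entry that maps "houses" to the one stem "house".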
My config in schema.xml:
<!-- charset: charset of the dictionary files.
     articles: which word classes to add (bad parameter name, will be
     changed).
     reduceTo: optional; words with several stems get reduced to one, in
     this order.
     stemPos, wordPos: where to find the stems and the words.
     articlePos: optional; where to find the word classes (bad parameter
     name, will be changed). -->
<filter
  class="no.uio.webapps.sok.analysis.DictionaryLemmatizerFilterFactory"
  charset="iso-8859-1"
  storePosTag="false"
  articles="subst,verb,adj"
  reduceTo="subst,verb"
  stemPos="1"
  wordPos="2"
  articlePos="3"
  dictionaries="fullform_bm.txt.gz,fullform_nn.txt.gz,custom_dic.txt"/>
Our environment:
- Solr 4.4.0
Our highlighter config:
http://folk.uio.no/erlendfg/solr/highlighter.txt
Re: My new lemmatizer interferes with the highlighter
Posted by Erlend Garåsen <e....@usit.uio.no>.
Thanks Ahmet,
I think I have solved the problem, but I didn't replace the line you
suggested. Instead I added the createToken method with
AttributeSource.State as a parameter and overrode the reset method. I
cannot reproduce the problem anymore.
BTW, what's the purpose of AttributeSource.State? Perhaps that alone has
solved the problem.
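For context on the question: in Lucene, AttributeSource.State is the snapshot object returned by captureState() and accepted by restoreState(). A filter that emits extra tokens for one input token commonly saves the state of the original token and restores it before emitting each extra token, so that the extra token carries the original offsets. The sketch below only models that idea in plain Java. TokenAttrs and StackedTokenDemo are hypothetical names, not Lucene classes, and whether missing offsets were the cause of the wrong highlighting here is not established in this thread.

```java
// Plain-Java model of capture/restore; hypothetical names, not Lucene's API.
final class TokenAttrs {
    String term = "";
    int startOffset;
    int endOffset;
    int positionIncrement = 1;

    /** Takes a snapshot of all values, in the spirit of captureState(). */
    TokenAttrs capture() {
        TokenAttrs copy = new TokenAttrs();
        copy.restore(this);
        return copy;
    }

    /** Copies every saved value back, in the spirit of restoreState(). */
    void restore(TokenAttrs saved) {
        term = saved.term;
        startOffset = saved.startOffset;
        endOffset = saved.endOffset;
        positionIncrement = saved.positionIncrement;
    }

    /** Resets everything to defaults, in the spirit of clearAttributes(). */
    void clear() {
        term = "";
        startOffset = 0;
        endOffset = 0;
        positionIncrement = 1;
    }
}

final class StackedTokenDemo {
    /** Emits a lemma stacked on the position of the saved original token. */
    static TokenAttrs emitLemma(TokenAttrs live, TokenAttrs saved, String lemma) {
        live.restore(saved);        // bring back the offsets of the original word
        live.term = lemma;          // then replace only the term text
        live.positionIncrement = 0; // same position as the original token
        return live;
    }
}
```

In this model, a lemma emitted after clear() but without restore() would keep offsets 0 and 0, which is the kind of mismatch an offset-based highlighter would turn into a wrongly placed highlight.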
Erlend
On 15.12.14 16:13, Ahmet Arslan wrote:
> Hi Erlend,
>
> I have written a similar token filter. Please see:
>
> https://github.com/iorixxx/lucene-solr-analysis-turkish/blob/master/src/main/java/org/apache/lucene/analysis/tr/Zemberek2DeasciifyFilterFactory.java
>
> replace
>
> final String[] values = stemmer.stem(tokenTerm);
>
> with
>
> stack = stemmer.stem(tokenTerm);
>
> Ahmet
Re: My new lemmatizer interferes with the highlighter
Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Erlend,
I have written a similar token filter. Please see:
https://github.com/iorixxx/lucene-solr-analysis-turkish/blob/master/src/main/java/org/apache/lucene/analysis/tr/Zemberek2DeasciifyFilterFactory.java
replace
final String[] values = stemmer.stem(tokenTerm);
with
stack = stemmer.stem(tokenTerm);
Ahmet
Re: My new lemmatizer interferes with the highlighter
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Well I think your first step should be finding a reproducible test case
and encoding it as a unit test. But I suspect ultimately the fix will
be something to do with positionIncrement ...
-Mike
Re: My new lemmatizer interferes with the highlighter
Posted by Erlend Garåsen <e....@usit.uio.no>.
On 15.12.14 14:11, Michael Sokolov wrote:
> I'm not sure, but is it necessary to set positionIncAttr to 1 when there
> are *not* any lemmas found? I think the usual pattern is to call
> clearAttributes() at the start of incrementToken
It is set to 0 only if there are stems/lemmas found:
    if (!terms.isEmpty()) {
        positionAttr.setPositionIncrement(0);
The terms list will only contain entries if there are lemmas found.
But maybe I should empty this list before I return true, just like this?
    if (!terms.isEmpty()) {
        termAtt.setEmpty().append(terms.poll());
        positionAttr.setPositionIncrement(0);
        terms.clear();
        return true;
    } else if ...
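One consequence of the variant above can be checked without Lucene: poll() already removes the entry it returns, so a following clear() only discards lemmas that are still waiting. The sketch below is a plain-Java model under two assumptions: that terms is a java.util.Queue of strings, and that a word can have more than one lemma (the reduceTo setting suggests it can). PendingLemmas and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Plain-Java model of the pending-lemma queue; hypothetical names.
final class PendingLemmas {
    private final Queue<String> terms = new ArrayDeque<>();
    private final boolean clearAfterPoll;

    PendingLemmas(List<String> lemmas, boolean clearAfterPoll) {
        terms.addAll(lemmas);
        this.clearAfterPoll = clearAfterPoll;
    }

    /** Models one incrementToken() call: the next lemma, or null if none. */
    String next() {
        if (terms.isEmpty()) {
            return null;
        }
        String lemma = terms.poll(); // poll() removes the head it returns
        if (clearAfterPoll) {
            terms.clear();           // drops every lemma still in the queue
        }
        return lemma;
    }

    /** Calls next() until it returns null and collects what was emitted. */
    static List<String> drain(PendingLemmas pending) {
        List<String> emitted = new ArrayList<>();
        for (String lemma = pending.next(); lemma != null; lemma = pending.next()) {
            emitted.add(lemma);
        }
        return emitted;
    }
}
```

For a word with two lemmas, draining with clearAfterPoll set to true emits only the first one, while draining with false emits both, one per call.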
Re: My new lemmatizer interferes with the highlighter
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I'm not sure, but is it necessary to set positionIncAttr to 1 when there
are *not* any lemmas found? I think the usual pattern is to call
clearAttributes() at the start of incrementToken
-Mike