Posted to solr-user@lucene.apache.org by Andreas Niekler <an...@informatik.uni-leipzig.de> on 2012/11/07 10:02:51 UTC

Stemmer German2

Dear List,

I am seeing unwanted behavior with the German2 stemmer. Take the river
Elbe as an example:

If I input elbe, the word gets reduced to elb.
If I input Elbe, everything is OK and elbe is stored in the index.

If I now query for elbe or Elbe, I of course get different results, so
users cannot use Elbe and elbe interchangeably to get the same results.

Can I add an exception list to the stemmer? Otherwise we will have a
very hard time explaining to some users why this is happening for some words.

Thank you

Andreas

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: Stemmer German2

Posted by André Widhani <An...@digicol.de>.

No - you need to restart Solr to pick up the changes to the schema and you need to re-index the existing documents.

Regards,
André

________________________________________
From: Andreas Niekler [aniekler@informatik.uni-leipzig.de]
Sent: Wednesday, 7 November 2012 16:40
To: solr-user@lucene.apache.org
Subject: Re: Stemmer German2

Hello,

Thanks for the advice. If I now change the schema so that my lowercase
filter factory comes before the stemmer, does the index update itself
after the change? How can I achieve this? I stored all values in the index.

Thanks

andreas

Re: Stemmer German2

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

Hello,

Thanks for the advice. If I now change the schema so that my lowercase
filter factory comes before the stemmer, does the index update itself
after the change? How can I achieve this? I stored all values in the index.

Thanks

andreas

Re: Stemmer German2

Posted by André Widhani <An...@digicol.de>.

Do you use the LowerCaseFilterFactory filter in your analysis chain? You will probably want to add it, and if you already have, make sure it is _before_ the stemming filter so you get consistent results regardless of lower- or uppercase spelling.

You can protect words from being stemmed by adding a KeywordMarkerFilterFactory filter before the stemmer; the protected words are listed in a text file. This filter should be placed after the lowercase filter so you can use lowercase terms in the file.

Some stemmer classes like SnowballPorterFilterFactory also allow you to pass a "protected" attribute (again pointing to a file).
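
As a rough sketch of the filter order described above, a field type in schema.xml might look like the following. The field type name "text_de", the choice of tokenizer, and the file name "protwords.txt" are assumptions for illustration and are not taken from the original poster's schema.

```xml
<!-- Hypothetical field type: the name "text_de", the tokenizer and the
     file name "protwords.txt" are illustrative assumptions. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- 1. Lowercase first, so "Elbe" and "elbe" reach the stemmer as the same token. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 2. Mark the terms listed in protwords.txt as keywords so the stemmer skips them. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- 3. Stem every token that was not marked as a keyword. -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
```

Under these assumptions, protwords.txt would hold one lowercase term per line (for example, a single line containing elbe), because the tokens have already been lowercased by the time they are compared against the list.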

All of this is on the Solr wiki (AnalyzersTokenizersTokenFilters, LanguageAnalysis) if you need more details.

Regards,
André
