Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/05/12 14:03:12 UTC

[GitHub] [lucene] janhoy opened a new pull request #136: LUCENE-9589 Swedish Minimal Stemmer

janhoy opened a new pull request #136:
URL: https://github.com/apache/lucene/pull/136


   PR for https://issues.apache.org/jira/browse/LUCENE-9589, moved to the new repo. See also the old PR at https://github.com/apache/lucene-solr/pull/2062


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] janhoy merged pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
janhoy merged pull request #136:
URL: https://github.com/apache/lucene/pull/136


   




[GitHub] [lucene] karlwettin edited a comment on pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
karlwettin edited a comment on pull request #136:
URL: https://github.com/apache/lucene/pull/136#issuecomment-843614395


   I gave the stemmer a spin on [SAOL](https://en.wikipedia.org/wiki/Svenska_Akademiens_ordlista) 13 (2006). I have to stay within the bounds of fair use and can't publish the complete results.
   
   Generally speaking I think it does a remarkable job with such a small decision tree. Given what it's meant to do, I would merge it.
   
   A few notes that apply more to a not-so-minimal implementation:
   
   The plural suffix-s rule has ~5300 exceptions, where a word ending in s is actually in the nominative singular.
   
   It's also missing the rules defined in LUCENE-1515, especially the 'an' and 'ans' suffixes. Back then I came to the conclusion that 8% of the Swedish language can be inflected that way, but there is a list of ~200 words that need to be set up as exceptions to those rules.
   
   Two standard an/ans-suffixes:
   
   | Stemmed        | Original           |
   | ------------- |:-------------:|
   ättiksgurk | ättiksgurka
   ättiksgurka | ättiksgurkan
   ättiksgurka | ättiksgurkans
   ättiksgurk | ättiksgurkas
   ättiksgurk | ättiksgurkor
   ättiksgurk | ättiksgurkorna
   ättiksgurk | ättiksgurkornas
   ättiksgurk | ättiksgurkors
   
   | Stemmed        | Original           |
   | ------------- |:-------------:|
   ättestup | ättestupa
   ättestupa | ättestupan
   ättestupa | ättestupans
   ättestup | ättestupas
   ättestup | ättestupor
   ättestup | ättestuporna
   ättestup | ättestupornas
   ättestup | ättestupors
   
   There are probably more complete and better examples of this in LUCENE-1515.
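
A rule of that kind could look roughly like the sketch below. It is only an illustration and not the rule set from LUCENE-1515: the class name, the length guards, and the two exception entries ("banan" and "organ", base forms that merely end in "an") are assumptions made up for the example.

```java
import java.util.Set;

public class AnSuffixSketch {
  // Hypothetical exception list: base forms that end in "an" without it being a suffix.
  // A real list (~200 words per the comment above) would also need their inflected forms.
  private static final Set<String> EXCEPTIONS = Set.of("banan", "organ");

  /** Strips a trailing "ans" or "an" unless the word is a listed exception. */
  public static String stripAnSuffix(String word) {
    if (EXCEPTIONS.contains(word)) {
      return word;
    }
    // The minimum lengths are arbitrary guards against over-stemming short words.
    if (word.length() > 5 && word.endsWith("ans")) {
      return word.substring(0, word.length() - 3);
    }
    if (word.length() > 4 && word.endsWith("an")) {
      return word.substring(0, word.length() - 2);
    }
    return word;
  }

  public static void main(String[] args) {
    for (String w : new String[] {"ättiksgurkan", "ättiksgurkans", "banan"}) {
      System.out.println(w + " -> " + stripAnSuffix(w));
    }
  }
}
```

With such a rule, ättiksgurkan and ättiksgurkans would share the stem ättiksgurk with the other forms in the first table, while the listed exceptions pass through unchanged.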
   
   And if I have to go looking for problems, I see these:
   
   | Stemmed        | Original           |
   | ------------- |:-------------:|
   höstmörk | höstmörker
   höstmörk | höstmörkers
   höstmörkr | höstmörkret
   höstmörkr | höstmörkrets
   
   | Stemmed        | Original           |
   | ------------- |:-------------:|
   höstkollektio | höstkollektion
   höstkollektion | höstkollektionen
   höstkollektion | höstkollektionens
   höstkollektion | höstkollektioner
   höstkollektion | höstkollektionerna
   höstkollektion | höstkollektionernas
   höstkollektion | höstkollektioners
   höstkollektio | höstkollektions
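
One way to spot splits like these in bulk is to group the original forms by their stem and look for paradigms that land in more than one group. The sketch below does that for the höstmörker rows above; the class and method names are hypothetical, and the stems are copied from the table rather than computed by the stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StemGroupingSketch {
  // {stemmed, original} pairs taken from the höstmörker table above.
  public static final String[][] PAIRS = {
    {"höstmörk", "höstmörker"},
    {"höstmörk", "höstmörkers"},
    {"höstmörkr", "höstmörkret"},
    {"höstmörkr", "höstmörkrets"},
  };

  /** Groups original word forms by the stem they were reduced to. */
  public static Map<String, List<String>> groupByStem(String[][] pairs) {
    Map<String, List<String>> groups = new TreeMap<>();
    for (String[] pair : pairs) {
      groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }
    return groups;
  }

  public static void main(String[] args) {
    // One inflection paradigm ending up in two groups means the forms will not match each other.
    groupByStem(PAIRS).forEach((stem, forms) -> System.out.println(stem + " <- " + forms));
  }
}
```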
   
   This one covers a number of different words with very different meanings that end up completely mixed together (not all of them nouns):
   
   | Stemmed        | Original           |
   | ------------- |:-------------:|
   hölj | hölj
   hölj | hölja
   hölja | höljan
   höljand | höljande
   hölja | höljans
   hölj | höljas
   höljd | höljd
   höljd | höljda
   höljd | höljde
   höljd | höljdes
   hölj | hölje
   hölj | höljen
   höljen | höljena
   höljen | höljenas
   hölj | höljens
   hölj | höljer
   hölj | höljes
   hölj | höljet
   hölj | höljets
   hölj | höljor
   hölj | höljorna
   hölj | höljornas
   hölj | höljors
   hölj | höljs
   höljt | höljt
   höljt | höljts
   
   I'm afraid it isn't possible to extract stemmer rules and exception lists from SAOL due to copyright issues (unless we find a digital copy that's at least 20 years old), but perhaps an alternative and more global route would be to mine [Wikidata:Lexicographical data](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data)?
   
   https://www.wikidata.org/wiki/Lexeme:L38829








[GitHub] [lucene] janhoy commented on a change in pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
janhoy commented on a change in pull request #136:
URL: https://github.com/apache/lucene/pull/136#discussion_r631081371



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/sv/SwedishMinimalStemmer.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.sv;
+
+/*
+ * This algorithm is updated based on code located at:
+ * http://members.unine.ch/jacques.savoy/clef/
+ *
+ * Full copyright for that code follows:
+ */
+
+/*
+ * Copyright (c) 2005, Jacques Savoy

Review comment:
       Could these copyright notices be moved to LICENSE.txt or NOTICE.txt?






[GitHub] [lucene] janhoy commented on a change in pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
janhoy commented on a change in pull request #136:
URL: https://github.com/apache/lucene/pull/136#discussion_r639745969



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/sv/SwedishMinimalStemmer.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.sv;
+
+/*
+ * This algorithm is updated based on code located at:
+ * http://members.unine.ch/jacques.savoy/clef/
+ *
+ * Full copyright for that code follows:
+ */
+
+/*
+ * Copyright (c) 2005, Jacques Savoy

Review comment:
       Hmm, I see a ton of other stemmers with exactly the same headers. So I'll leave them as is in this PR and do a separate copyright cleanup instead.






[GitHub] [lucene] HoustonPutman commented on a change in pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
HoustonPutman commented on a change in pull request #136:
URL: https://github.com/apache/lucene/pull/136#discussion_r631156602



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/sv/SwedishMinimalStemmer.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.sv;
+
+/*
+ * This algorithm is updated based on code located at:
+ * http://members.unine.ch/jacques.savoy/clef/
+ *
+ * Full copyright for that code follows:
+ */
+
+/*
+ * Copyright (c) 2005, Jacques Savoy

Review comment:
       I believe they should be moved to NOTICE.txt.








[GitHub] [lucene] janhoy commented on pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
janhoy commented on pull request #136:
URL: https://github.com/apache/lucene/pull/136#issuecomment-839810656


   @rmuir you already reviewed this PR over at the lucene-solr repo. I am still trying to get feedback from a native Swede, but otherwise I'm ready to merge this, aiming for 8.9.






[GitHub] [lucene] janhoy commented on pull request #136: LUCENE-9589 Swedish Minimal Stemmer

Posted by GitBox <gi...@apache.org>.
janhoy commented on pull request #136:
URL: https://github.com/apache/lucene/pull/136#issuecomment-848803076


   Since release 8.9 is in feature freeze, I now target this at 9.0.0. I moved the CHANGES entry and @since tags. Will commit later this week.

