Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/11/11 16:43:39 UTC

[jira] Created: (LUCENE-2055) Remove duplicate analysis functionality

Remove duplicate analysis functionality
---------------------------------------

                 Key: LUCENE-2055
                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
             Project: Lucene - Java
          Issue Type: Task
          Components: contrib/analyzers
            Reporter: Robert Muir


I would like to mark the following code as deprecated so that it can be removed.

* analyzers/fr: everything except ElisionFilter, which is unrelated and standalone.
* analyzers/nl: entire package
* analyzers/ru: entire package

Below are excerpts from this code where they proudly proclaim that they use the snowball algorithm.
I think we should delete all of this code in favor of the actual snowball package.


{noformat}
/**
 * A stemmer for French words. 
 * <p>
 * The algorithm is based on the work of
 * Dr Martin Porter on his snowball project<br>
 * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
 * (French stemming algorithm) for details
 * </p>
 */

public class FrenchStemmer {

/**
 * A stemmer for Dutch words. 
 * <p>
 * The algorithm is an implementation of
 * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
 * algorithm in Martin Porter's snowball project.
 * </p>
 */
public class DutchStemmer {

/**
 * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
 */
class RussianStemmer
{noformat}



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828054#action_12828054 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

Here is a short explanation of what I figure might be the controversial part: adding all the language-specific analyzers.

I think it's too difficult for a non-English user to use Lucene.
Let's take the Romanian case. Sure, it's supported by SnowballAnalyzer, but:
* Where are the stopwords? If the user is smart enough they can google this and find Savoy's list... but it contains some stray nouns that should not be in there, and will they get the encoding correct?
* For some languages (French, Dutch, Turkish) we already want to do something different. For French we need the elision filter to tokenize correctly; for Dutch, the special dictionary-based exclusions (I have been told by some that any stemmer that does not handle "fiets" correctly is useless); for Turkish we need the special lowercasing.
* For other languages (German, Swedish, ...) I think we REALLY want to implement decompounding support in the future. For German at least, there is a public domain wordlist just itching to be used for this.
* Oh yeah, and all the javadocs are in English, so writing your own analyzer is another barrier to entry.

So I think it's best instead to have a "recommended default" organized by language, preferably one that we have relevance tested or that is already published. Many of the existing snowball stemmers already have published relevance results available, thus my bias towards them. Sure, it won't meet everyone's needs, and users should still think about using them as a template. But digging up your own stoplist, writing your own analyzer, and figuring out that your language support is really buried in snowball, combined with documentation that is not in your native tongue, adds up to a barrier to entry that is simply too high.
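
To make two of the language-specific steps mentioned above concrete, here is a minimal plain-Java sketch of Turkish lowercasing and French elision stripping. It does not use Lucene at all: the class name LanguageQuirks, its methods, and the (incomplete) article list are assumptions made for illustration, not the contents of any patch on this issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch; LanguageQuirks and its methods are hypothetical names, not Lucene classes.
final class LanguageQuirks {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Illustrative, incomplete list of elided French articles (an assumption for this sketch).
    private static final Set<String> ARTICLES =
        new HashSet<>(Arrays.asList("l", "d", "j", "m", "n", "s", "t", "qu"));

    // Turkish maps capital I to dotless i (U+0131), unlike the locale-neutral rule.
    static String lowerTurkish(String term) {
        return term.toLowerCase(TURKISH);
    }

    // Drops a leading elided article such as the "l'" in "l'avion".
    static String stripElision(String term) {
        int apostrophe = term.indexOf('\'');
        if (apostrophe > 0
                && ARTICLES.contains(term.substring(0, apostrophe).toLowerCase(Locale.ROOT))) {
            return term.substring(apostrophe + 1);
        }
        return term;
    }
}
```

With locale-neutral lowercasing, "ISTANBUL" becomes "istanbul", whereas lowerTurkish produces a leading dotless i, so terms indexed one way would not match queries lowercased the other way.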



> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch
>
>
> would like to remove the stemmers in the following packages and use a SnowballStemFilter in their analyzers instead.
> * analyzers/fr
> * analyzers/nl
> * analyzers/ru
> below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
> I think we should delete all of this custom stemming code in favor of the actual snowball package.
> {noformat}
> /**
>  * A stemmer for French words. 
>  * <p>
>  * The algorithm is based on the work of
>  * Dr Martin Porter on his snowball project<br>
>  * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
>  * (French stemming algorithm) for details
>  * </p>
>  */
> public class FrenchStemmer {
> /**
>  * A stemmer for Dutch words. 
>  * <p>
>  * The algorithm is an implementation of
>  * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
>  * algorithm in Martin Porter's snowball project.
>  * </p>
>  */
> public class DutchStemmer {
> /**
>  * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
>  */
> class RussianStemmer
> {noformat}



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829687#action_12829687 ] 

Simon Willnauer commented on LUCENE-2055:
-----------------------------------------

Robert, nice work!
I have one comment on StemmerOverrideFilter

The ctor should not always copy the given dictionary. If it is already created with a suitable map, we should use the given instance. This is similar to StopFilter vs. StopAnalyzer.
Maybe a CharArrayMap.castOrCopy(Map<?, String>) would be handy in that case.
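
As a rough illustration of the cast-or-copy idea, the plain-Java sketch below reuses a map when it already has the specialised type and copies it only otherwise. StemDict is a hypothetical stand-in for the proposed CharArrayMap, and the signature is simplified to Map<String, String>; none of this is taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a specialised dictionary type such as the proposed CharArrayMap.
final class StemDict extends HashMap<String, String> {
    private static final long serialVersionUID = 1L;

    StemDict(Map<String, String> source) {
        super(source); // copies all entries of the source map
    }

    // Reuse the instance when it already has the right type; copy only otherwise.
    static StemDict castOrCopy(Map<String, String> map) {
        if (map instanceof StemDict) {
            return (StemDict) map;
        }
        return new StemDict(map);
    }
}
```

The point of the pattern is that a caller who already built the specialised map pays no copying cost, while any other Map is still accepted.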


One minor thing: the null check in DutchAnalyzer seems to be unnecessary, but anyway that's fine.
{code}
       if (stemdict != null && !stemdict.isEmpty())
{code}
DutchAnalyzer also has an unused import 

{code}
import java.util.Arrays;
{code}

Apart from those, +1 from my side.




[jira] Commented: (LUCENE-2055) Remove duplicate analysis functionality

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776491#action_12776491 ] 

Mark Miller commented on LUCENE-2055:
-------------------------------------

+1 - let's lose it.



[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

updated patch for the generics policeman



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802079#action_12802079 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

I also wanted to comment here regarding further duplicate analysis; we can look at these in a later issue if we want:
* BrazilianStemmer looks suspiciously like the Snowball Portuguese algorithm, except with different diacritics handling; need to look further.
* ChineseAnalyzer (the one that handles individual Chinese characters) does essentially what StandardAnalyzer does with Chinese text; I do not see any other features.




[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801911#action_12801911 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

Hello DM, my marking this as a bug does not mean it will be a backwards-incompatible fix; I have not even proposed a patch yet.

This is undeniably a bug: each stemmer proudly states that it implements the snowball algorithm, but the implementation is not correct.
It is my understanding that such problems (buggy stemming impls) are the reason the snowball project was created in the first place.

So, we can fix the bug in 2 different ways:
* delete the old stemmers and, in the analyzers, replace them with SnowballStemFilters (this does fix the bug, as they now become correct)
* keep the buggy code and its behavior, selected depending on Version




[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830318#action_12830318 ] 

Uwe Schindler commented on LUCENE-2055:
---------------------------------------

+1, gogogo

> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch


[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803286#action_12803286 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

I'd like to work on getting these bugs fixed, but I'm not sure of the best way to proceed.

Looking at the different possibilities, I came up with two good options, although maybe there are other ways:

* option 1: deprecate and keep the old broken impls and APIs, but depending on Version use the correct ones instead. This gives API and index back compat, but we keep the buggy code and support it for at least some time.
* option 2: deprecate the old APIs, but implement them in terms of the correct one. This gives API back compat only, but we drop the buggy code so maintenance is easier.
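
The two options can be sketched in plain Java as follows. Everything here is hypothetical: the Version enum, the class and method names, and the two placeholder "stemmers" (which only tag the term so the chosen path is visible) are assumptions for illustration, not Lucene's API or real stemming algorithms.

```java
import java.util.function.UnaryOperator;

// Hypothetical names throughout; the two "stemmers" are placeholders, not real algorithms.
final class VersionedStemming {
    enum Version { V30, V31 }

    // Stand-in for the old, buggy hand-written stemmer.
    static final UnaryOperator<String> LEGACY = term -> term + "#legacy";
    // Stand-in for the Snowball-based replacement.
    static final UnaryOperator<String> SNOWBALL = term -> term + "#snowball";

    // Option 1: keep both implementations and pick one by the requested version,
    // so indexes built with the old behavior keep matching.
    static UnaryOperator<String> forVersion(Version matchVersion) {
        return matchVersion.compareTo(Version.V31) >= 0 ? SNOWBALL : LEGACY;
    }

    // Option 2: the old entry point survives but delegates to the correct implementation,
    // so the buggy code can be dropped while callers still compile.
    @Deprecated
    static String legacyStem(String term) {
        return SNOWBALL.apply(term);
    }
}
```

Option 1 preserves old index behavior at the cost of carrying the buggy code; option 2 keeps only the API and changes the terms produced, which implies a reindex.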




[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

Apologies for the large patch.

This patch does the following:
* deprecates RussianTokenizer, RussianStemmer, RussianStemFilter, DutchStemmer, DutchStemFilter, FrenchStemmer, FrenchStemFilter
* uses snowball in the above analyzers instead, depending upon version.
* doesn't deprecate GermanStemmer, but uses snowball instead (which is maintained and relevance-tested and supports things like u+umlaut = ue, etc). The old stemmer is kept because it is a different (alternate) algorithm.
* DutchStemmer had 'dictionary based stemming override' support; to implement this, adds StemmerOverrideFilter, which does this in a generic way with KeywordAttribute
* adds KeywordAttribute support to SnowballFilter
* deprecates SnowballAnalyzer in favor of language-specific analyzers.
* adds Romanian and Turkish stopword lists, since snowball is missing them.
* implements language-specific analyzers in place of all the ones snowball tried to do at once before.
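
The dictionary override with a keyword flag can be illustrated with a small plain-Java sketch. It is not the patch's code: Token, applyOverride, and toyStem are hypothetical names, the boolean field only plays the role that KeywordAttribute plays in the description above, and the suffix stripper is a toy, not the Snowball Dutch algorithm.

```java
import java.util.Map;

// Plain-Java sketch; Token, applyOverride and toyStem are hypothetical, not the patch's classes.
final class OverrideSketch {
    static final class Token {
        String term;
        boolean keyword; // plays the role of the keyword flag: "do not stem me"
        Token(String term) { this.term = term; }
    }

    // Dictionary step: replace the term with its override and mark it as a keyword.
    static void applyOverride(Token token, Map<String, String> dictionary) {
        String replacement = dictionary.get(token.term);
        if (replacement != null) {
            token.term = replacement;
            token.keyword = true;
        }
    }

    // Toy suffix stripper standing in for a real stemmer; it skips keyword tokens.
    static void toyStem(Token token) {
        if (token.keyword) {
            return;
        }
        if (token.term.endsWith("en")) {
            token.term = token.term.substring(0, token.term.length() - 2);
        } else if (token.term.endsWith("s")) {
            token.term = token.term.substring(0, token.term.length() - 1);
        }
    }

    // Runs the two steps in order: override first, then stemming.
    static String analyze(String term, Map<String, String> dictionary) {
        Token token = new Token(term);
        applyOverride(token, dictionary);
        toyStem(token);
        return token.term;
    }
}
```

Because the override step runs first and sets the flag, the stemmer never needs to know about the dictionary, which is what makes the approach generic across languages.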





[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801909#action_12801909 ] 

DM Smith commented on LUCENE-2055:
----------------------------------

I think it is right to fix bad behavior, but such a change is not bw compat. It will require an index rebuild.

I'm happy with the direction that the non-English work is going. I'm hoping that, once it is solid, the bw compat policy will be as strict as core's (whatever that means ;). For any application of Lucene that handles many different languages, it is critical that this "stuff" is stable and solid.



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830217#action_12830217 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

i will commit this monster soon if no one objects.



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828079#action_12828079 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

bq. StemOverrideFilter should be final or have final incrementToken()

ok, thanks, will fix this now.

bq. The usage of mixed CharArraySet and a conventional dictionary map is buggy. The CAS is using a different contains algo and lowercasing, you can get a hit in the CAS but the Map returns null -> NPE. I would not use a CAS for the beginning and always cast to String for now and I will open an issue for extending CAS to be a CharArrayMap

No. Instead I will force it to be case-sensitive to sidestep this; it is silly to have the Dutch stem filter be horribly slow because of some theoretical case like this.





[jira] Commented: (LUCENE-2055) Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801833#action_12801833 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

Now that we have snowball tests, I started looking at integrating snowball and deprecating this custom code. 
So I ran the snowball tests against these hand-coded algorithms to see what the differences are... remember, they all claim to implement Porter:

* RussianStemFilter passes 100% of the snowball tests.

* DutchStemFilter passes 98.9% of the snowball tests. All bugs were in the handling of double consonants.
Examples:
aangetroffen -> aangetrof expected: aangetroff
afvoerbonnen -> afvoerbon expected: afvoerbonn
klommen -> klom expected: klomm

* FrenchStemFilter only passes 93.92% of the snowball tests, but if you put a LowerCaseFilter after it, it passes 99.66%!
The problem is that this stemmer incorrectly creates some uppercase stems from lowercase words. Examples:
  xviii -> xviI expected: xvii
  vouer -> voU expected: vou
  tranquille -> tranqUill expected: tranquill
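
To make the casing observation concrete, here is a minimal Java sketch that re-checks the three French examples above with and without lowercasing the produced stem. StemComparison and its methods are hypothetical names made up for illustration; this is not the snowball test harness, only a sketch of the comparison it implies.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Lucene or the snowball tests. */
final class StemComparison {

    /** The three French mismatches quoted above, as produced stem -> expected stem. */
    static Map<String, String> frenchExamples() {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("xviI", "xvii");
        pairs.put("voU", "vou");
        pairs.put("tranqUill", "tranquill");
        return pairs;
    }

    /** Counts produced stems that equal the expected stem, optionally lowercasing first. */
    static int countMatches(Map<String, String> producedToExpected, boolean lowercase) {
        int matches = 0;
        for (Map.Entry<String, String> pair : producedToExpected.entrySet()) {
            String produced = pair.getKey();
            if (lowercase) {
                // Stands in for running a lowercase filter after the stemmer.
                produced = produced.toLowerCase(Locale.ROOT);
            }
            if (produced.equals(pair.getValue())) {
                matches++;
            }
        }
        return matches;
    }
}
```

For these three pairs, none match as produced and all three match once lowercased, which is consistent with the pass rate rising when a lowercase filter follows the stemmer.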





[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829763#action_12829763 ] 

Uwe Schindler commented on LUCENE-2055:
---------------------------------------

I will apply the patch here later, test everything, and look through all the analyzers, but as far as I can see, I am happy with it. The CharArrayMap code is still ok (if the cast compiles without an unchecked warning; not yet checked), so nothing so far from the generics police :-)

+1 also on having separate "default analyzer" classes for each language.



[jira] Resolved: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2055.
---------------------------------

    Resolution: Fixed

Committed revision 907125. Thanks for the reviews and help, Simon/Uwe!



[jira] Assigned: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-2055:
-----------------------------------

    Assignee: Robert Muir



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828071#action_12828071 ] 

Uwe Schindler commented on LUCENE-2055:
---------------------------------------

Several bugs:
- StemOverrideFilter should be final or have final incrementToken()
- Mixing a CharArraySet with a conventional dictionary Map is buggy. The CAS uses a different contains algorithm and lowercasing, so you can get a hit in the CAS while the Map returns null -> NPE. I would not use a CAS to begin with and would always convert to String for now; I will open an issue for extending CAS to be a CharArrayMap.
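
The failure mode in the second point can be sketched with plain JDK collections. This is only an illustration under stated assumptions: a TreeSet ordered by String.CASE_INSENSITIVE_ORDER stands in for an ignore-case CharArraySet, a HashMap stands in for the dictionary, and MixedLookupDemo and its sample entry are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the set-plus-map lookup; not the code from the patch. */
final class MixedLookupDemo {

    /** Returns the dictionary stem for a term, or the term itself if it is not in the key set. */
    static String lookup(String term) {
        // Assumption: the key set ignores case, the map does not.
        Set<String> keys = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        Map<String, String> dictionary = new HashMap<>();
        keys.add("fiets");
        dictionary.put("fiets", "fiets");

        if (keys.contains(term)) {
            // The set says "present", but the case-sensitive map may still miss.
            return dictionary.get(term);
        }
        return term;
    }
}
```

Here lookup("Fiets") hits the set but returns null from the map, so any caller that dereferences the result without a null check would throw a NullPointerException. Keeping keys and values in a single structure with one case rule avoids the disagreement.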



[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

updated patch:
* implement StemmerOverrideFilter with CharArrayMap
* fix casing problems in french and dutch: they did not call lowercasefilter, instead relying upon the stemmer to lowercase things. this causes inconsistencies with stopwords, dictionary-based stemming, exclusion sets, you name it. the old broken behavior is preserved depending on Version
* add missing standardfilter to greek (depending on Version).
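
A minimal sketch of why the casing bullet matters for stopwords, assuming a chain where the stop check runs before anything has lowercased the token. ChainOrderDemo is a hypothetical name, and the final toLowerCase call is only a stand-in for a stemmer that lowercases as a side effect; none of this is Lucene's analysis API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of filter ordering; not Lucene's analysis API. */
final class ChainOrderDemo {

    static List<String> analyze(List<String> tokens, Set<String> stopwords, boolean lowercaseFirst) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // New behavior: lowercase before the stop check. Old behavior: leave as-is.
            String term = lowercaseFirst ? token.toLowerCase(Locale.ROOT) : token;
            if (stopwords.contains(term)) {
                continue; // dropped as a stopword
            }
            // Stand-in for a stemmer that lowercases on its own, after the stop check.
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

With the stopword set {"de"} and the tokens "De", "fiets", the old order lets "De" slip past the stop check and emits "de", while lowercasing first drops it.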




[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

make StemmerOverrideFilter final, and hardcode it as case-sensitive (it makes sense to come after some sort of lowercasefilter anyway)



[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Description: 
would like to remove stemmers in the following packages, and use a SnowballStemFilter in their analyzers instead.

* analyzers/fr
* analyzers/nl
* analyzers/ru

below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
I think we should delete all of this custom stemming code in favor of the actual snowball package.


{noformat}
/**
 * A stemmer for French words. 
 * <p>
 * The algorithm is based on the work of
 * Dr Martin Porter on his snowball project<br>
 * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
 * (French stemming algorithm) for details
 * </p>
 */

public class FrenchStemmer {

/**
 * A stemmer for Dutch words. 
 * <p>
 * The algorithm is an implementation of
 * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
 * algorithm in Martin Porter's snowball project.
 * </p>
 */
public class DutchStemmer {

/**
 * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
 */
class RussianStemmer
{noformat}



  was:
would like to mark the following code deprecated, so it can be removed.

* analyzers/fr: all except ElisionFilter, this is unrelated and standalone.
* analyzers/nl:entire package
* analyzers/ru: entire package

below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
I think we should delete all of this code in favor of the actual snowball package.


{noformat}
/**
 * A stemmer for French words. 
 * <p>
 * The algorithm is based on the work of
 * Dr Martin Porter on his snowball project<br>
 * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
 * (French stemming algorithm) for details
 * </p>
 */

public class FrenchStemmer {

/**
 * A stemmer for Dutch words. 
 * <p>
 * The algorithm is an implementation of
 * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
 * algorithm in Martin Porter's snowball project.
 * </p>
 */
public class DutchStemmer {

/**
 * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
 */
class RussianStemmer
{noformat}



     Issue Type: Bug  (was: Task)
        Summary: Fix buggy stemmers and Remove duplicate analysis functionality  (was: Remove duplicate analysis functionality)



[jira] Updated: (LUCENE-2055) Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Fix Version/s: 3.1

setting to 3.1, because I would like to make use of Simon's stopword-handling improvements to tie in the snowball stoplists.

this way we are not taking any functionality away.



[jira] Updated: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2055:
--------------------------------

    Attachment: LUCENE-2055.patch

patch addressing Simon's comments, and also fixing javadoc warnings.

while I am here, remove other unused imports in contrib/analyzers.

will commit in a day or two if no one objects.



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829125#action_12829125 ] 

Uwe Schindler commented on LUCENE-2055:
---------------------------------------

Map<?,? extends String> does not make sense because String is final, so it has no subtypes. Use Map<?,String> instead, and likewise CharArrayMap<String>.
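
To illustrate the point about the wildcard, here is a minimal sketch. It is not code from the patch: the class and method names are hypothetical, and a plain java.util.Map stands in for the stem dictionary. Both signatures accept the same maps in practice, so the bounded wildcard only adds noise.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not taken from the LUCENE-2055 patch.
class WildcardOnFinalClassSketch {

    // Bounded wildcard on the value type. String is final, so the only
    // class that can satisfy "? extends String" is String itself.
    static String lookupBounded(Map<?, ? extends String> dict, Object key) {
        return dict.get(key);
    }

    // Simpler signature that says the same thing for callers.
    static String lookupPlain(Map<?, String> dict, Object key) {
        return dict.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("chevaux", "cheval");

        // The same map is accepted by both methods.
        System.out.println(lookupBounded(dict, "chevaux"));
        System.out.println(lookupPlain(dict, "chevaux"));
    }
}
```

Because the key type is already an unbounded wildcard, neither signature lets the method put new entries into the map, so nothing is lost by dropping the bound.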

> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch
>
>
> would like to remove the stemmers in the following packages and have their analyzers use a SnowballStemFilter instead.
> * analyzers/fr
> * analyzers/nl
> * analyzers/ru
> below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
> I think we should delete all of this custom stemming code in favor of the actual snowball package.
> {noformat}
> /**
>  * A stemmer for French words. 
>  * <p>
>  * The algorithm is based on the work of
>  * Dr Martin Porter on his snowball project<br>
>  * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
>  * (French stemming algorithm) for details
>  * </p>
>  */
> public class FrenchStemmer {
> /**
>  * A stemmer for Dutch words. 
>  * <p>
>  * The algorithm is an implementation of
>  * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
>  * algorithm in Martin Porter's snowball project.
>  * </p>
>  */
> public class DutchStemmer {
> /**
>  * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
>  */
> class RussianStemmer
> {noformat}



[jira] Commented: (LUCENE-2055) Remove duplicate analysis functionality

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776503#action_12776503 ] 

Robert Muir commented on LUCENE-2055:
-------------------------------------

One thing I would like to fix: these packages have stoplists, but our snowball implementation is missing the stoplists from the snowball distribution.

These are provided as .txt files in the full snowball distribution, so I think it would be an easy improvement to the snowball package to make them available somehow.
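
As a rough sketch of what making the stoplists available could involve, the snippet below parses the text of such a file into a set of words. It assumes the format I believe the snowball .txt stoplists use: whitespace-separated words, with everything after a '|' on a line treated as a comment. The class and method names are hypothetical and this is not an existing Lucene API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, assuming '|' starts a comment in snowball stoplist files.
class SnowballStoplistSketch {

    // Parses the contents of a stoplist file into an ordered set of words.
    static Set<String> parse(String contents) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : contents.split("\r?\n")) {
            // Drop the comment part of the line, if any.
            int bar = line.indexOf('|');
            if (bar >= 0) {
                line = line.substring(0, bar);
            }
            // A line may hold zero, one, or several words.
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = " | sample stop word list\n"
                + "\n"
                + "au             |  a + le\n"
                + "aux            |  a + les\n";
        System.out.println(parse(sample));
    }
}
```

The resulting set could then be handed to whatever stop filter the analyzer already uses, so the fr, nl, and ru analyzers would keep their stopword behaviour after switching stemmers.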



> Remove duplicate analysis functionality
> ---------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>
> would like to mark the following code deprecated, so it can be removed.
> * analyzers/fr: all except ElisionFilter, this is unrelated and standalone.
> * analyzers/nl:entire package
> * analyzers/ru: entire package
> below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
> I think we should delete all of this code in favor of the actual snowball package.
> {noformat}
> /**
>  * A stemmer for French words. 
>  * <p>
>  * The algorithm is based on the work of
>  * Dr Martin Porter on his snowball project<br>
>  * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
>  * (French stemming algorithm) for details
>  * </p>
>  */
> public class FrenchStemmer {
> /**
>  * A stemmer for Dutch words. 
>  * <p>
>  * The algorithm is an implementation of
>  * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
>  * algorithm in Martin Porter's snowball project.
>  * </p>
>  */
> public class DutchStemmer {
> /**
>  * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
>  */
> class RussianStemmer
> {noformat}
