Posted to general@lucene.apache.org by sarfaraz masood <sa...@yahoo.com> on 2010/07/02 11:08:44 UTC

how to apply stemming to the index ?

I want to stem the terms in my index, but I am currently using StandardAnalyzer, which does not perform any stemming.

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);


After some searching I found code for a PorterStemAnalyzer, but it has some problems:



import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;

import java.io.Reader;
import java.util.Hashtable;


 // PorterStemAnalyzer processes input
 // text by stemming English words to their roots.
 // This Analyzer also converts the input to lower case
 // and removes stop words.  A small set of default stop
 // words is defined in the STOP_WORDS
 // array, but a caller can specify an alternative set
 // of stop words by calling non-default constructor.


public class PorterStemAnalyzer extends Analyzer
{
    private static Hashtable _stopTable;

   
     // An array containing some common English words
     // that are usually not useful for searching.
    
    public static final String[] STOP_WORDS =
    {
        "0", "1", "2", "3", "4", "5", "6", "7", "8",
        "9", "000", "$",
        "about", "after", "all", "also", "an", "and",
        "another", "any", "are", "as", "at", "be",
        "because", "been", "before", "being", "between",
        "both", "but", "by", "came", "can", "come",
        "could", "did", "do", "does", "each", "else",
        "for", "from", "get", "got", "has", "had",
        "he", "have", "her", "here", "him", "himself",
        "his", "how","if", "in", "into", "is", "it",
        "its", "just", "like", "make", "many", "me",
        "might", "more", "most", "much", "must", "my",
        "never", "now", "of", "on", "only", "or",
        "other", "our", "out", "over", "re", "said",
        "same", "see", "should", "since", "so", "some",
        "still", "such", "take", "than", "that", "the",
        "their", "them", "then", "there", "these",
        "they", "this", "those", "through", "to", "too",
        "under", "up", "use", "very", "want", "was",
        "way", "we", "well", "were", "what", "when",
        "where", "which", "while", "who", "will",
        "with", "would", "you", "your",
        "a", "b", "c", "d", "e", "f", "g", "h", "i",
        "j", "k", "l", "m", "n", "o", "p", "q", "r",
        "s", "t", "u", "v", "w", "x", "y", "z"
    };


     // Builds an analyzer.
   
    public PorterStemAnalyzer()
    {
        this(STOP_WORDS);
    }

      //Builds an analyzer with the given stop words.
     
     //@param stopWords a String array of stop words
     
    public PorterStemAnalyzer(String[] stopWords)
    {
        _stopTable = StopFilter.makeStopTable(stopWords);
    }

  
     // Processes the input by first converting it to
     // lower case, then by eliminating stop words, and
     // finally by performing Porter stemming on it.
     //
     // @param reader the Reader that
     //               provides access to the input text
     // @return an instance of TokenStream
     
    public final TokenStream tokenStream(Reader reader)
    {
        return new PorterStemFilter(
            new StopFilter(new LowerCaseTokenizer(reader),
                _stopTable));
    }
}

*Errors marked in bold.


Please let me know if there is an alternative way to apply stemming to the index if this is 


-Sarfaraz




Re: how to apply stemming to the index ?

Posted by sarfaraz masood <sa...@yahoo.com>.
Thanks a lot, Erick.

It worked.

Regards
-Sarfaraz

--- On Mon, 5/7/10, Erick Erickson <er...@gmail.com> wrote:

From: Erick Erickson <er...@gmail.com>
Subject: Re: how to apply stemming to the index ?
To: solr-user@lucene.apache.org
Date: Monday, 5 July, 2010, 6:32 AM

I'm a little confused about what you're trying to accomplish where.
The fact that you posted to the SOLR users list would indicate
you're using SOLR, in which case all you have to do is apply
the stemming in your config file. Something like:

<filter class="solr.PorterStemFilterFactory"/>

in your schema.xml file for your index AND search analyzers.

If you're in Lucene, you can add PorterStemFilter to a filter chain
when making your own analyzer (see the synonym example in
Lucene In Action, first or second edition).

If this is gibberish, perhaps you could provide some more context
for what you're trying to accomplish.

HTH
Erick




Re: how to apply stemming to the index ?

Posted by Erick Erickson <er...@gmail.com>.
I'm a little confused about what you're trying to accomplish where.
The fact that you posted to the SOLR users list would indicate
you're using SOLR, in which case all you have to do is apply
the stemming in your config file. Something like:

<filter class="solr.PorterStemFilterFactory"/>

in your schema.xml file for your index AND search analyzers.
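
To see why the stemming has to be applied on both the index side and the
search side, here is a minimal sketch in plain Java. It does not use Solr
or Lucene at all: the class StemSymmetryDemo and the toyStem function are
hypothetical names made up for this illustration, and toyStem only strips
a couple of suffixes, so it is not the Porter algorithm.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StemSymmetryDemo {

    // Toy stand-in for a stemmer (assumption: NOT the Porter algorithm).
    // Lower-cases the term and strips a trailing "ing" or "s".
    public static String toyStem(String term) {
        String t = term.toLowerCase();
        if (t.length() > 5 && t.endsWith("ing")) {
            return t.substring(0, t.length() - 3);
        }
        if (t.length() > 3 && t.endsWith("s")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Builds a tiny inverted index (stemmed term -> document ids).
    public static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(toyStem(token), k -> new HashSet<>()).add(id);
            }
        }
        return index;
    }

    // Looks up a single query term, with or without stemming the query.
    public static Set<Integer> search(Map<String, Set<Integer>> index,
                                      String queryTerm, boolean stemQuery) {
        String key = stemQuery ? toyStem(queryTerm) : queryTerm.toLowerCase();
        return index.getOrDefault(key, new HashSet<>());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
            buildIndex(Arrays.asList("Indexing documents", "A single document"));
        // Query not stemmed: "documents" is not a term in the stemmed index.
        System.out.println(search(index, "documents", false));
        // Query stemmed the same way as the index: both documents match.
        System.out.println(search(index, "documents", true));
    }
}
```

The index only ever contains stemmed terms, so a query that skips the
stemming step looks for a term that was never stored.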

If you're in Lucene, you can add PorterStemFilter to a filter chain
when making your own analyzer (see the synonym example in
Lucene In Action, first or second edition).
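
As a rough illustration of what a filter chain means, the sketch below
reproduces the three stages described in the comments of the posted
analyzer (lower-case, then remove stop words, then stem) in plain Java.
It deliberately avoids the Lucene classes, whose signatures differ
between versions. FilterChainDemo, its method names, and the suffix
stripping in toyStem are all assumptions made for the example, and
toyStem is not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterChainDemo {

    // Stage 1: split on non-letter characters and lower-case each token.
    public static List<String> lowerCaseTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    // Stage 2: drop every token that appears in the stop word set.
    public static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Stage 3: toy stemmer (assumption: a stand-in, NOT the Porter algorithm).
    public static String toyStem(String t) {
        if (t.length() > 5 && t.endsWith("ing")) {
            return t.substring(0, t.length() - 3);
        }
        if (t.length() > 3 && t.endsWith("s")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    // The chain: each stage consumes the output of the previous one.
    public static List<String> analyze(String text, Set<String> stopWords) {
        List<String> stemmed = new ArrayList<>();
        for (String t : removeStopWords(lowerCaseTokenize(text), stopWords)) {
            stemmed.add(toyStem(t));
        }
        return stemmed;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "are"));
        System.out.println(analyze("The cats are jumping", stopWords)); // [cat, jump]
    }
}
```

Stop words are removed before stemming here so that the stop list can be
written in its natural, unstemmed form.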

If this is gibberish, perhaps you could provide some more context
for what you're trying to accomplish.

HTH
Erick
