
Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/28 17:03:25 UTC

Re: stemming - RESOLVED

Howie,
   Thanks for all the help configuring your stemming addon for version
0.8. I compared query-basic and query-stemmer and the only new feature
that was added is a "host" boost. I made the changes and everything
works perfectly.

I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can
access it at the URL below:

http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c

Take care,
  Matt
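
For readers skimming the thread: the symptom discussed in the quoted messages below (a search for "interview" finds a page containing "interviews", but a search for "interviews" finds nothing) is what one would expect when terms are stemmed at index time but not at query time. Here is a minimal sketch of that effect using only the JDK. The class name, the method names and the toy stemmer are all hypothetical; the toy stemmer merely stands in for Lucene's PorterStemFilter and is not the Porter algorithm.

```java
import java.util.*;

public class StemConsistencyDemo {
    // Hypothetical toy stemmer: lower-cases and strips one trailing "s".
    // It stands in for a real stemmer and is NOT the Porter algorithm.
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Index-time analysis: every token is stemmed before it is stored.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).split("\\s+")) {
                index.computeIfAbsent(toyStem(token), k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Query-time analysis: look the term up raw or stemmed, depending on the flag.
    static Set<Integer> search(Map<String, Set<Integer>> index, String term, boolean stemQuery) {
        String key = stemQuery ? toyStem(term) : term.toLowerCase(Locale.ROOT);
        return index.getOrDefault(key, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex(Arrays.asList("Two interviews were held"));
        System.out.println(search(index, "interviews", false)); // prints [] (raw term is not in the index)
        System.out.println(search(index, "interviews", true));  // prints [0] (stemmed term matches)
        System.out.println(search(index, "interview", false));  // prints [0] (already equal to the stem)
    }
}
```

The index only ever contains "interview", so any query path that looks up the unstemmed term cannot match, which is consistent with the advice in the thread to make sure the query string is run through the stemmer.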

Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>>  The query-stemmer works great as long as query-basic is not enabled. 
>> However, if I don't have query-basic enabled, won't I be missing some 
>> needed functionality?
>>  Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get
>>> rid of the other query filters and put in some print statements in the
>>> query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>>  include.  Any plugin not matching this expression is excluded.
>>>>  In any case you need at least include the nutch-extensionpoints
>>>>  plugin. By default Nutch includes crawling just HTML and plain text
>>>>  via HTTP, and basic indexing and search plugins.
>>>>  </description>
>>>> </property>
>>>>
>>>>
>>>> However, it is still only letting me search for the stemmed term
>>>> (i.e. "Interview" returns results but "interviewed" doesn't, even
>>>> though that's the word that's actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the
>>>> crawl and it didn't load the plugin. However, it still had the same
>>>> stemming functionality. I'm guessing this is due to editing the
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java.
>>>> Should I attempt to copy the needed methods into
>>>> StemmerQueryFilter.java and try to isolate all functionality to the
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>>    Matt
>>>>
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
>>>>>
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code some to 
>>>>>> make up for changes from .7.2 to .8 - mostly having to do with 
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews" 
>>>>>> and I type in the search engine "interview", the stemming takes 
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned. 
>>>>>> (The page with the word interviews on it should be returned).
>>>>>>
>>>>>> Any ideas??
>>>>>> Matt
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi, .
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>>    public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>      TokenStream ts =
>>>>>>>          CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
>>>>>>>      if (field.equals("content") || field.equals("title")) {
>>>>>>>        ts = new LowerCaseFilter(ts);
>>>>>>>        return new PorterStemFilter(ts);
>>>>>>>      } else {
>>>>>>>        return ts;
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>>    TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>>    ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling
>>>>>>> query-stemmer. Here's the full source for the Java file. You can
>>>>>>> copy the build.xml and plugin.xml from query-basic, and alter the
>>>>>>> names for query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization.  All rights reserved.   */
>>>>>>> /* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** The default query filter.  Query terms in the default query field are
>>>>>>>  * expanded to search the url, anchor and content document fields.*/
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>>   private static final String[] FIELDS =
>>>>>>>       {"url", "anchor", "content", "title"};
>>>>>>>   private static final float[] FIELD_BOOSTS =
>>>>>>>       {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>>>>
>>>>>>>   /** Set the boost factor for url matches, relative to content and anchor
>>>>>>>    * matches */
>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url and
>>>>>>>    * content matches. */
>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>>>>    * term matches. */
>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the maximum number of terms permitted between matching terms in a
>>>>>>>    * sloppy phrase match. */
>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>     addTerms(input, output);
>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>     return output;
>>>>>>>   }
>>>>>>>
>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>       Clause c = clauses[i];
>>>>>>>
>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>         continue;                                 // skip non-default fields
>>>>>>>
>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>         Clause o = c;
>>>>>>>         String[] opt;
>>>>>>>
>>>>>>>         // TODO: I'm a little nervous about stemming for all default fields.
>>>>>>>         //       Should keep an eye on this.
>>>>>>>         if (c.isPhrase()) {                       // optimize phrase clauses
>>>>>>>             opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>         } else {
>>>>>>>             System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>             opt = getStemmedWords(o.getTerm().toString());
>>>>>>>         }
>>>>>>>         if (opt.length==1) {
>>>>>>>             o = new Clause(new Term(opt[0]), c.isRequired(), c.isProhibited());
>>>>>>>         } else {
>>>>>>>             o = new Clause(new Phrase(opt), c.isRequired(), c.isProhibited());
>>>>>>>         }
>>>>>>>
>>>>>>>         out.add(o.isPhrase()
>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>                 false, false);
>>>>>>>       }
>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>     }
>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>   }
>>>>>>>
>>>>>>>   private static String[] getStemmedWords(String value) {
>>>>>>>     StringReader sr = new StringReader(value);
>>>>>>>     TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>>     String stemmedValue = "";
>>>>>>>     try {
>>>>>>>       Token token = ts.next();
>>>>>>>       int count = 0;
>>>>>>>       while (token != null) {
>>>>>>>         System.out.println("token = " + token.termText());
>>>>>>>         System.out.println("type = " + token.type());
>>>>>>>
>>>>>>>         if (count == 0)
>>>>>>>           stemmedValue = token.termText();
>>>>>>>         else
>>>>>>>           stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>
>>>>>>>         token = ts.next();
>>>>>>>         count++;
>>>>>>>       }
>>>>>>>     } catch (Exception e) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     if (stemmedValue.equals("")) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>>     for (int j=0; j<stemmedValues.length; j++) {
>>>>>>>       System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>     }
>>>>>>>     return stemmedValues;
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static void addSloppyPhrases(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>                            : SLOP);
>>>>>>>       int sloppyTerms = 0;
>>>>>>>
>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>         Clause c = clauses[i];
>>>>>>>
>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>           continue;                               // skip non-default fields
>>>>>>>
>>>>>>>         if (c.isPhrase())                         // skip exact phrases
>>>>>>>           continue;
>>>>>>>
>>>>>>>         if (c.isProhibited())                     // skip prohibited terms
>>>>>>>           continue;
>>>>>>>
>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>         sloppyTerms++;
>>>>>>>       }
>>>>>>>
>>>>>>>       if (sloppyTerms > 1)
>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>         termQuery(String field, Term term, float boost) {
>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>     result.setBoost(boost);
>>>>>>>     return result;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch phrase. */
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>        exactPhrase(Phrase nutchPhrase,
>>>>>>>                    String field, float boost) {
>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>     }
>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>     return exactPhrase;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and field. */
>>>>>>>   private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>>>>                                                          Term term) {
>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
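
A note on the plugin.includes value quoted in the thread above: it is a regular expression over plugin directory names, and it lists query-stemmer but not query-basic. The following is a minimal sketch of how such a pattern selects plugins, assuming each plugin name must match the whole expression (that matching rule is an assumption here, not something stated in the thread). The class and method names are hypothetical; only the pattern string is taken from the quoted nutch-site.xml.

```java
import java.util.regex.Pattern;

public class PluginIncludesDemo {
    // The plugin.includes value from the thread, copied verbatim.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)"
        + "|index-(basic|more)|query-(more|site|stemmer|url)"
        + "|summary-basic|scoring-opic");

    // Assumption: a plugin is included only when its entire name matches.
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"query-stemmer", "query-basic", "parse-pdf"}) {
            System.out.println(id + " -> " + isIncluded(id));
        }
        // prints:
        // query-stemmer -> true
        // query-basic -> false
        // parse-pdf -> true
    }
}
```

Under that assumption, this configuration runs query-stemmer in place of query-basic, which matches the report in the thread that the stemmer works as long as query-basic is not enabled.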

Re: stemming - RESOLVED

Posted by Matthew Holt <mh...@redhat.com>.

We could, although other than readability, it won't make any difference.

bb300@mail.ru wrote:
> Hi, Matthew
>
> I think we should use fieldName instead of field, or not...
>
> ===============stemming code begin=======================
>
> public TokenStream tokenStream(String field, Reader reader) {
>     Analyzer analyzer;
>     if ("anchor".equals(field)) {
>         analyzer = ANCHOR_ANALYZER;
>     }
>     else {
>         analyzer = CONTENT_ANALYZER;
>     }
>
>     // Build the stream for every field so that each path returns a value.
>     TokenStream ts = analyzer.tokenStream(field, reader);
>     if (field.equals("content") || field.equals("title")) {
>         // Stem only the content and title fields.
>         ts = new LowerCaseFilter(ts);
>         return new PorterStemFilter(ts);
>     }
>     else {
>         return ts;
>     }
> }
>
> ===============stemming code end=======================
>
> P.S. This patch has no effect on Russian-language text.
>
> Regards,
> Alexey
>

Re[2]: stemming - RESOLVED

Posted by bb...@mail.ru.

Hi, Matthew

I think we should use fieldName instead of field, or not...

===============stemming code begin=======================

public TokenStream tokenStream(String field, Reader reader) {
    Analyzer analyzer;
    if ("anchor".equals(field)) {
        analyzer = ANCHOR_ANALYZER;
    }
    else {
        analyzer = CONTENT_ANALYZER;
    }

    // Stem only the content and title fields; other fields pass through unchanged.
    TokenStream ts = analyzer.tokenStream(field, reader);
    if (field.equals("content") || field.equals("title")) {
        ts = new LowerCaseFilter(ts);
        return new PorterStemFilter(ts);
    }
    else {
        return ts;
    }
}

===============stemming code end=======================

P.S. This patch has no effect on Russian text.
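
To illustrate the P.S.: PorterStemFilter applies the Porter algorithm, which only knows English suffixes. The sketch below is a toy suffix stripper, not the Porter algorithm; the class name, the rule set and the sample words are assumptions made up for illustration. It shows that English-only rules leave a Cyrillic word as it is.

```java
import java.util.Locale;

/** Toy suffix stripper for illustration only; this is NOT the Porter algorithm. */
final class ToySuffixStemmer {
    // Hypothetical rule set: a few English suffixes, checked in this order.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    /** Lower-cases the word and strips the first matching English suffix, if any. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters of stem so short words survive.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w; // no rule applied, e.g. for words written in Cyrillic
    }

    public static void main(String[] args) {
        System.out.println(stem("interviews"));
        System.out.println(stem("interviewed"));
        System.out.println(stem("книги"));
    }
}
```

Both English forms reduce to the same stem, while the Russian word comes back unchanged, so a search for another inflection of it would still miss.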

Regards,
Alexey

------------------------------

Howie,
   Thanks for all the help configuring your stemming addon for version 
0.8. I compared query-basic and query-stemmer, and the only new feature 
that was added is a "host" boost. I made the changes and everything 
works perfectly.

I uploaded the code to the wiki for both version 0.7.2 and 0.8. You can 
access it at the URL below:

http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c

Take care,
  Matt

Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>>  The query-stemmer works great as long as query-basic is not enabled. 
>> However, if I don't have query-basic enabled, won't I be missing some 
>> needed functionality?
>>  Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get rid of
>>> the other query filters and put in some print statements in the
>>> query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>>  <name>plugin.includes</name>
>>>>  
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value> 
>>>>
>>>>
>>>>  <description>Regular expression naming plugin directory names to
>>>>  include.  Any plugin not matching this expression is excluded.
>>>>  In any case you need at least include the nutch-extensionpoints plugin. By
>>>>  default Nutch includes crawling just HTML and plain text via HTTP,
>>>>  and basic indexing and search plugins.
>>>>  </description>
>>>> </property>
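
A quick way to see which plugin directory names the quoted value selects is to test it as an ordinary Java regular expression. This is only a sketch: it assumes a plugin is included when its whole directory name matches the expression, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class PluginIncludesCheck {
    // Value copied from the plugin.includes property quoted above.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)"
        + "|index-(basic|more)|query-(more|site|stemmer|url)"
        + "|summary-basic|scoring-opic");

    /** Assumption: a plugin is included only when its whole directory name matches. */
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"query-stemmer", "query-basic", "nutch-extensionpoints"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption query-stemmer is selected and query-basic is not, which is the intended setup. The expression also does not match nutch-extensionpoints, which the property description says is always needed.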
>>>>
>>>>
>>>> However, it is still only letting me search for the stemmed term 
>>>> (i.e. "Interview" returns results but "interviewed" doesn't, even 
>>>> though that's the word that's actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value 
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the 
>>>> crawl and it didn't load the plugin. However, it still had the same 
>>>> stemming functionality. I'm guessing this is due to editing the 
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java. 
>>>> Should I attempt to copy the needed methods into 
>>>> StemmerQueryFilter.java and try to isolate all functionality to the 
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>>    Matt
>>>>
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
>>>>>
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code some to 
>>>>>> make up for changes from .7.2 to .8 - mostly having to do with 
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews" 
>>>>>> and I type in the search engine "interview", the stemming takes 
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned. 
>>>>>> (The page with the word interviews on it should be returned).
>>>>>>
>>>>>> Any ideas??
>>>>>> Matt
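
The symptom described above is what one would expect when terms are stemmed at index time but the query term is looked up unstemmed. A minimal sketch of that mismatch, using a toy stand-in for the stemmer (not Porter) and a plain set instead of a Lucene index; all names here are hypothetical:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StemMismatchDemo {
    /** Toy stand-in for a stemmer: lower-cases and strips one trailing "s". Not Porter. */
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Builds the set of index terms, stemming each word as a stemming analyzer would. */
    static Set<String> indexTerms(String pageText) {
        Set<String> terms = new HashSet<>();
        for (String word : pageText.split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(toyStem(word));
            }
        }
        return terms;
    }

    /** Looks a query word up in the index, optionally stemming it first. */
    static boolean matches(Set<String> index, String queryWord, boolean stemQuery) {
        String term = stemQuery ? toyStem(queryWord) : queryWord.toLowerCase(Locale.ROOT);
        return index.contains(term);
    }

    public static void main(String[] args) {
        Set<String> index = indexTerms("Two interviews today");
        System.out.println(matches(index, "interview", false));  // already the stem
        System.out.println(matches(index, "interviews", false)); // unstemmed query term
        System.out.println(matches(index, "interviews", true));  // stemmed query term
    }
}
```

Only the stem is stored, so the literal word from the page finds nothing until the query side applies the same stemming.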
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>>    public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>      TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader),
>>>>>>>                                             field);
>>>>>>>      if (field.equals("content") || field.equals("title")) {
>>>>>>>        ts = new LowerCaseFilter(ts);
>>>>>>>        return new PorterStemFilter(ts);
>>>>>>>      } else {
>>>>>>>        return ts;
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>>    TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>>    ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling query-stemmer.
>>>>>>> Here's the full source for the Java file. You can copy the build.xml
>>>>>>> and plugin.xml from query-basic, and alter the names for query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization.  All rights reserved.   */
>>>>>>> /* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** The default query filter.  Query terms in the default query field are
>>>>>>> * expanded to search the url, anchor and content document fields.*/
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>>   private static final String[] FIELDS = {"url", "anchor", "content",
>>>>>>>                                           "title"};
>>>>>>>   private static final float[] FIELD_BOOSTS = {URL_BOOST, ANCHOR_BOOST,
>>>>>>>                                                1.0f, 2.0f};
>>>>>>>
>>>>>>>   /** Set the boost factor for url matches, relative to content and anchor
>>>>>>>    * matches */
>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url and
>>>>>>>    * content matches. */
>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>>>>    * term matches. */
>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the maximum number of terms permitted between matching terms in a
>>>>>>>    * sloppy phrase match. */
>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>     addTerms(input, output);
>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>     return output;
>>>>>>>   }
>>>>>>>
>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>       Clause c = clauses[i];
>>>>>>>
>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>         continue;                                 // skip non-default fields
>>>>>>>
>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>         Clause o = c;
>>>>>>>         String[] opt;
>>>>>>>
>>>>>>>         // TODO: I'm a little nervous about stemming for all default fields.
>>>>>>>         //       Should keep an eye on this.
>>>>>>>         if (c.isPhrase()) {                         // optimize phrase clauses
>>>>>>>             opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>         } else {
>>>>>>>             System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>             opt = getStemmedWords(o.getTerm().toString());
>>>>>>>         }
>>>>>>>         if (opt.length==1) {
>>>>>>>             o = new Clause(new Term(opt[0]), c.isRequired(),
>>>>>>>                            c.isProhibited());
>>>>>>>         } else {
>>>>>>>             o = new Clause(new Phrase(opt), c.isRequired(),
>>>>>>>                            c.isProhibited());
>>>>>>>         }
>>>>>>>
>>>>>>>         out.add(o.isPhrase()
>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>                 false, false);
>>>>>>>       }
>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>     }
>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>   }
>>>>>>>
>>>>>>>     private static String[] getStemmedWords(String value) {
>>>>>>>           StringReader sr = new StringReader(value);
>>>>>>>           TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>>           String stemmedValue = "";
>>>>>>>           try {
>>>>>>>               Token token = ts.next();
>>>>>>>               int count = 0;
>>>>>>>               while (token != null) {
>>>>>>>                   System.out.println("token = " + token.termText());
>>>>>>>                   System.out.println("type = " + token.type());
>>>>>>>
>>>>>>>                   if (count == 0)
>>>>>>>                       stemmedValue = token.termText();
>>>>>>>                   else
>>>>>>>                       stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>
>>>>>>>                   token = ts.next();
>>>>>>>                   count++;
>>>>>>>               }
>>>>>>>           } catch (Exception e) {
>>>>>>>               stemmedValue = value;
>>>>>>>           }
>>>>>>>
>>>>>>>           if (stemmedValue.equals("")) {
>>>>>>>               stemmedValue = value;
>>>>>>>           }
>>>>>>>
>>>>>>>           String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>>           for (int j=0; j<stemmedValues.length; j++) {
>>>>>>>               System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>           }
>>>>>>>           return stemmedValues;
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>>   private static void addSloppyPhrases(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>                            : SLOP);
>>>>>>>       int sloppyTerms = 0;
>>>>>>>
>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>         Clause c = clauses[i];
>>>>>>>
>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>           continue;                               // skip non-default fields
>>>>>>>
>>>>>>>         if (c.isPhrase())                         // skip exact phrases
>>>>>>>           continue;
>>>>>>>
>>>>>>>         if (c.isProhibited())                     // skip prohibited terms
>>>>>>>           continue;
>>>>>>>
>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>         sloppyTerms++;
>>>>>>>       }
>>>>>>>
>>>>>>>       if (sloppyTerms > 1)
>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>         termQuery(String field, Term term, float boost) {
>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>     result.setBoost(boost);
>>>>>>>     return result;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch phrase. */
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>        exactPhrase(Phrase nutchPhrase,
>>>>>>>                    String field, float boost) {
>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>     }
>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>     return exactPhrase;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and field. */
>>>>>>>   private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>>>>                                                          Term term) {
>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>   }
>>>>>>>   }
>>>>>>> }
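
For anyone who wants to try the getStemmedWords logic from the quoted class without a Lucene jar on the classpath, here is a rough plain-Java approximation. It is a sketch under assumptions: the splitting loop imitates what LowerCaseTokenizer does (split at non-letters, lower-case), the stemmer is passed in as a function, and dropTrailingS is a toy placeholder rather than the Porter stemmer. Class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class StemmedWordsSketch {
    /** Toy placeholder for a real stemmer: strips one trailing "s". Not Porter. */
    static String dropTrailingS(String w) {
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /**
     * Splits the value at non-letters, lower-cases each token and stems it.
     * If no token is produced, falls back to splitting the raw value on
     * whitespace, as the quoted getStemmedWords does.
     */
    static String[] stemmedWords(String value, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Walk one position past the end so the last token is flushed too.
        for (int i = 0; i <= value.length(); i++) {
            char ch = i < value.length() ? value.charAt(i) : ' ';
            if (Character.isLetter(ch)) {
                current.append(Character.toLowerCase(ch));
            } else if (current.length() > 0) {
                out.add(stemmer.apply(current.toString()));
                current.setLength(0);
            }
        }
        if (out.isEmpty()) {
            return value.split("\\s+");
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = stemmedWords("Job Interviews", StemmedWordsSketch::dropTrailingS);
        System.out.println(String.join(" ", words));
    }
}
```

A single-token result corresponds to the Term clause in addTerms and a multi-token result to the Phrase clause.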
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>