Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/28 17:03:25 UTC
Re: stemming - RESOLVED
Howie,
Thanks for all the help configuring your stemming add-on for version
0.8. I compared query-basic and query-stemmer and the only new feature
that was added is a "host" boost. I made the changes and everything
works perfectly.
I uploaded the code to the wiki for both versions 0.7.2 and 0.8. You can
access it at the URL below:
http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c
Take care,
Matt
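For readers following along, here is a rough sketch of what carrying a "host" boost next to the existing field boosts could look like. The url, anchor, content and title values are the ones in the StemmerQueryFilter source quoted later in this thread. The HOST_BOOST value and the class name HostBoostSketch are assumptions made for illustration only; the thread does not state the value Nutch 0.8 uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HostBoostSketch {
    // Hypothetical value: the thread does not say what boost 0.8 gives "host".
    static final float HOST_BOOST = 2.0f;

    /** Field name -> boost, in the order the fields are searched. */
    public static Map<String, Float> fieldBoosts() {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("url", 4.0f);      // URL_BOOST in the quoted source
        boosts.put("anchor", 2.0f);   // ANCHOR_BOOST in the quoted source
        boosts.put("content", 1.0f);
        boosts.put("title", 2.0f);
        boosts.put("host", HOST_BOOST); // the extra field (assumed boost)
        return boosts;
    }

    public static void main(String[] args) {
        fieldBoosts().forEach((field, boost) ->
            System.out.println(field + "^" + boost));
    }
}
```

Keeping each field and its boost in one map entry avoids the two parallel arrays drifting out of step when a field is added, which is the kind of change described above.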
Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>> The query-stemmer works great as long as query-basic is not enabled.
>> However, if I don't have query-basic enabled, won't I be missing some
>> needed functionality?
>> Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get
>>> rid of the other query filters and put in some print statements in
>>> the query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>> <name>plugin.includes</name>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>
>>>>
>>>> <description>Regular expression naming plugin directory names to
>>>> include. Any plugin not matching this expression is excluded.
>>>> In any case you need at least include the nutch-extensionpoints
>>>> plugin. By
>>>> default Nutch includes crawling just HTML and plain text via HTTP,
>>>> and basic indexing and search plugins.
>>>> </description>
>>>> </property>
>>>>
>>>>
>>>> However, it is still only letting me search for the stemmed term
>>>> (i.e. "Interview" returns results but "interviewed" doesn't, even
>>>> though that's the word that's actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the
>>>> crawl and it didn't load the plugin. However, it still had the same
>>>> stemming functionality. I'm guessing this is due to editing the
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java.
>>>> Should I attempt to copy the needed methods into
>>>> StemmerQueryFilter.java and try to isolate all functionality to the
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
>>>>>
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
>>>>>> OK. I did this for Nutch 0.8 (I had to edit the listed code somewhat
>>>>>> to account for changes from 0.7.2 to 0.8, mostly having to do with
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews"
>>>>>> and I type in the search engine "interview", the stemming takes
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned.
>>>>>> (The page with the word interviews on it should be returned).
>>>>>>
>>>>>> Any ideas??
>>>>>> Matt
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi, .
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>> public TokenStream tokenStream(String field, Reader reader) {
>>>>>>>   TokenStream ts =
>>>>>>>     CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
>>>>>>>   if (field.equals("content") || field.equals("title")) {
>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>     return new PorterStemFilter(ts);
>>>>>>>   } else {
>>>>>>>     return ts;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>> TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>> ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling
>>>>>>> query-stemmer.
>>>>>>> Here's the full source for the Java file. You can copy the
>>>>>>> build.xml
>>>>>>> and plugin.xml from query-basic, and alter the names for
>>>>>>> query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization. All rights reserved. */
>>>>>>> /* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** The default query filter. Query terms in the default query field are
>>>>>>>  * expanded to search the url, anchor and content document fields. */
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>>   private static float URL_BOOST = 4.0f;
>>>>>>>   private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>>   private static int SLOP = Integer.MAX_VALUE;
>>>>>>>   private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>>   private static final String[] FIELDS =
>>>>>>>     {"url", "anchor", "content", "title"};
>>>>>>>   private static final float[] FIELD_BOOSTS =
>>>>>>>     {URL_BOOST, ANCHOR_BOOST, 1.0f, 2.0f};
>>>>>>>
>>>>>>>   /** Set the boost factor for url matches, relative to content and anchor
>>>>>>>    * matches */
>>>>>>>   public static void setUrlBoost(float boost) { URL_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for title/anchor matches, relative to url and
>>>>>>>    * content matches. */
>>>>>>>   public static void setAnchorBoost(float boost) { ANCHOR_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the boost factor for sloppy phrase matches relative to unordered
>>>>>>>    * term matches. */
>>>>>>>   public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>>   /** Set the maximum number of terms permitted between matching terms in a
>>>>>>>    * sloppy phrase match. */
>>>>>>>   public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>>   public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>>     addTerms(input, output);
>>>>>>>     addSloppyPhrases(input, output);
>>>>>>>     return output;
>>>>>>>   }
>>>>>>>
>>>>>>>   private static void addTerms(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int i = 0; i < clauses.length; i++) {
>>>>>>>       Clause c = clauses[i];
>>>>>>>
>>>>>>>       if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>         continue;                       // skip non-default fields
>>>>>>>
>>>>>>>       BooleanQuery out = new BooleanQuery();
>>>>>>>       for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>         Clause o = c;
>>>>>>>         String[] opt;
>>>>>>>
>>>>>>>         // TODO: I'm a little nervous about stemming for all default fields.
>>>>>>>         // Should keep an eye on this.
>>>>>>>         if (c.isPhrase()) {             // optimize phrase clauses
>>>>>>>           opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>>         } else {
>>>>>>>           System.out.println("o.getTerm = " + o.getTerm().toString());
>>>>>>>           opt = getStemmedWords(o.getTerm().toString());
>>>>>>>         }
>>>>>>>         if (opt.length==1) {
>>>>>>>           o = new Clause(new Term(opt[0]), c.isRequired(), c.isProhibited());
>>>>>>>         } else {
>>>>>>>           o = new Clause(new Phrase(opt), c.isRequired(), c.isProhibited());
>>>>>>>         }
>>>>>>>
>>>>>>>         out.add(o.isPhrase()
>>>>>>>                 ? exactPhrase(o.getPhrase(), FIELDS[f], FIELD_BOOSTS[f])
>>>>>>>                 : termQuery(FIELDS[f], o.getTerm(), FIELD_BOOSTS[f]),
>>>>>>>                 false, false);
>>>>>>>       }
>>>>>>>       output.add(out, c.isRequired(), c.isProhibited());
>>>>>>>     }
>>>>>>>     System.out.println("query = " + output.toString());
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Lower-cases and Porter-stems the value, returning one entry per token.
>>>>>>>    * Falls back to the original value if analysis fails or yields nothing. */
>>>>>>>   private static String[] getStemmedWords(String value) {
>>>>>>>     StringReader sr = new StringReader(value);
>>>>>>>     TokenStream ts = new PorterStemFilter(new LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>>     String stemmedValue = "";
>>>>>>>     try {
>>>>>>>       Token token = ts.next();
>>>>>>>       int count = 0;
>>>>>>>       while (token != null) {
>>>>>>>         System.out.println("token = " + token.termText());
>>>>>>>         System.out.println("type = " + token.type());
>>>>>>>
>>>>>>>         if (count == 0)
>>>>>>>           stemmedValue = token.termText();
>>>>>>>         else
>>>>>>>           stemmedValue = stemmedValue + " " + token.termText();
>>>>>>>
>>>>>>>         token = ts.next();
>>>>>>>         count++;
>>>>>>>       }
>>>>>>>     } catch (Exception e) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     if (stemmedValue.equals("")) {
>>>>>>>       stemmedValue = value;
>>>>>>>     }
>>>>>>>
>>>>>>>     String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>>     for (int j=0; j<stemmedValues.length; j++) {
>>>>>>>       System.out.println("stemmedValues = " + stemmedValues[j]);
>>>>>>>     }
>>>>>>>     return stemmedValues;
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static void addSloppyPhrases(Query input, BooleanQuery output) {
>>>>>>>     Clause[] clauses = input.getClauses();
>>>>>>>     for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>>       PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>>       sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>>       sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>>                            ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>>                            : SLOP);
>>>>>>>       int sloppyTerms = 0;
>>>>>>>
>>>>>>>       for (int i = 0; i < clauses.length; i++) {
>>>>>>>         Clause c = clauses[i];
>>>>>>>
>>>>>>>         if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>>           continue;                     // skip non-default fields
>>>>>>>
>>>>>>>         if (c.isPhrase())               // skip exact phrases
>>>>>>>           continue;
>>>>>>>
>>>>>>>         if (c.isProhibited())           // skip prohibited terms
>>>>>>>           continue;
>>>>>>>
>>>>>>>         // Note: the term is added as typed; it is not passed through
>>>>>>>         // getStemmedWords here.
>>>>>>>         sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>>         sloppyTerms++;
>>>>>>>       }
>>>>>>>
>>>>>>>       if (sloppyTerms > 1)
>>>>>>>         output.add(sloppyPhrase, false, false);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>       termQuery(String field, Term term, float boost) {
>>>>>>>     TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>>     result.setBoost(boost);
>>>>>>>     return result;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene exact phrase query for a Nutch phrase. */
>>>>>>>   private static org.apache.lucene.search.Query
>>>>>>>       exactPhrase(Phrase nutchPhrase, String field, float boost) {
>>>>>>>     Term[] terms = nutchPhrase.getTerms();
>>>>>>>     PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>>     for (int i = 0; i < terms.length; i++) {
>>>>>>>       exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>>     }
>>>>>>>     exactPhrase.setBoost(boost);
>>>>>>>     return exactPhrase;
>>>>>>>   }
>>>>>>>
>>>>>>>   /** Utility to construct a Lucene Term given a Nutch query term and field. */
>>>>>>>   private static org.apache.lucene.index.Term
>>>>>>>       luceneTerm(String field, Term term) {
>>>>>>>     return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>>   }
>>>>>>> }
>
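The symptom discussed in this thread (a search for "interview" finds the page but a search for "interviews" does not) is what one would expect when the index holds stemmed terms and the query term is looked up unstemmed. The sketch below is a toy model of that, not Nutch or Lucene code: toyStem is a made-up stand-in for PorterStemFilter, and the class and method names are hypothetical. The requireAll call models one possible reading, not confirmed in the thread, of why query-stemmer only worked once query-basic was disabled: if another filter also requires the unstemmed term, the combined query matches nothing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StemSymmetryDemo {
    /** Toy stand-in for a real stemmer: lower-case, strip one trailing "s". */
    public static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s"))
            ? w.substring(0, w.length() - 1) : w;
    }

    /** Builds a map from stemmed term to the ids of documents containing it. */
    public static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String tok : docs.get(id).split("\\W+")) {
                if (tok.isEmpty()) continue;
                idx.computeIfAbsent(toyStem(tok), k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    /** Exact lookup of a term, as a term query would do. */
    public static Set<Integer> lookup(Map<String, Set<Integer>> idx, String term) {
        return idx.getOrDefault(term, Collections.emptySet());
    }

    /** Documents matching every term, modelling several required clauses. */
    public static Set<Integer> requireAll(Map<String, Set<Integer>> idx,
                                          String... terms) {
        Set<Integer> result = null;
        for (String t : terms) {
            Set<Integer> hits = lookup(idx, t);
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> idx =
            index(List.of("Two interviews were held", "No match here"));
        System.out.println(lookup(idx, "interviews"));          // raw term
        System.out.println(lookup(idx, toyStem("interviews"))); // stemmed term
        System.out.println(requireAll(idx, "interviews", "interview"));
    }
}
```

The point of the sketch is only that the same analysis has to be applied on both sides: whatever transformation produced the indexed terms must also be applied to the query term before lookup.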
Re: stemming - RESOLVED
Posted by Matthew Holt <mh...@redhat.com>.
We could, although other than readability, it won't make any difference.
bb300@mail.ru wrote:
> Hi, Matthew
>
> I think we should use fieldName instead of field, or not...
>
> ===============stemming code begin=======================
>
> public TokenStream tokenStream(String field, Reader reader) {
>   Analyzer analyzer;
>   if ("anchor".equals(field)) {
>     analyzer = ANCHOR_ANALYZER;
>   }
>   else {
>     analyzer = CONTENT_ANALYZER;
>   }
>
>   TokenStream ts = analyzer.tokenStream(field, reader);
>   if (field.equals("content") || field.equals("title")) {
>     ts = new LowerCaseFilter(ts);
>     return new PorterStemFilter(ts);
>   }
>   else {
>     return ts;
>   }
> }
>
> ===============stemming code end=======================
>
> P.S. This patch has no effect on Russian-language text.
>
> Regards,
> Alexey
Re[2]: stemming - RESOLVED
Posted by bb...@mail.ru.
Hi, Matthew
I think we should use fieldName instead of field, or not...
===============stemming code begin=======================
public TokenStream tokenStream(String field, Reader reader) {
  Analyzer analyzer;
  if ("anchor".equals(field)) {
    analyzer = ANCHOR_ANALYZER;
  }
  else {
    analyzer = CONTENT_ANALYZER;
  }
  TokenStream ts = analyzer.tokenStream(field, reader);
  if (field.equals("content") || field.equals("title")) {
    ts = new LowerCaseFilter(ts);
    return new PorterStemFilter(ts);
  }
  else {
    return ts;
  }
}
===============stemming code end=======================
P.S. This patch has no effect on Russian-language text.
Regards,
Alexey
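The tokenStream snippet above applies stemming only to the "content" and "title" fields and leaves every other field alone. A small self-contained model of that per-field decision is sketched below. It does not use Lucene: toyStem is a made-up stand-in for PorterStemFilter, and FieldAwareAnalysisDemo and analyze are hypothetical names. The constant-first comparison ("content".equals(field)) is used throughout so that a null field name falls through to the unstemmed branch instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FieldAwareAnalysisDemo {
    /** Toy stand-in for a real stemmer: strip one trailing "s". */
    static String toyStem(String w) {
        return (w.length() > 3 && w.endsWith("s"))
            ? w.substring(0, w.length() - 1) : w;
    }

    /** Lower-cases and stems tokens for content/title; other fields pass through. */
    public static List<String> analyze(String field, String text) {
        // Constant-first equals is null-safe, unlike field.equals("content").
        boolean stemThisField = "content".equals(field) || "title".equals(field);
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (tok.isEmpty()) continue;
            out.add(stemThisField ? toyStem(tok.toLowerCase(Locale.ROOT)) : tok);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("title", "Interviews Held"));
        System.out.println(analyze("url", "Interviews"));
        System.out.println(analyze(null, "Interviews"));
    }
}
```

Whichever fields are stemmed at index time are the ones a query filter should stem at search time; fields left unstemmed here would need their query terms left unstemmed as well.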
------------------------------
Howie,
Thanks for all the help configuring your stemming add-on for version
0.8. I compared query-basic and query-stemmer and the only new feature
that was added is a "host" boost. I made the changes and everything
works perfectly.
I uploaded the code to the wiki for both versions 0.7.2 and 0.8. You can
access it at the URL below:
http://wiki.apache.org/nutch/FAQ#head-fa0c678473eeecf3771e490b22d385054697232c
Take care,
Matt
Howie Wang wrote:
> Hi, Matt,
>
> In 0.7, you wouldn't miss anything. That code was written to
> replace the basic query filter, and handled all the fields that
> basic query filter was handling. For 0.8, I'm really not sure.
> I'm guessing the code is fairly simple still in 0.8. You can probably
> figure out if query-basic in 0.8 is doing something appreciably different
> than query-stemmer by just visually comparing the files.
>
> Howie
>
>> Howie,
>> The query-stemmer works great as long as query-basic is not enabled.
>> However, if I don't have query-basic enabled, won't I be missing some
>> needed functionality?
>> Matt
>>
>> Howie Wang wrote:
>>> Hi,
>>>
>>> The settings look reasonable. But for testing purposes, I would get
>>> rid of the other query filters and put in some print statements in
>>> the query-stemmer to see what's happening.
>>>
>>> Howie
>>>
>>>> In my nutch-site.xml I overrode the plugin.includes property as below:
>>>>
>>>> <property>
>>>> <name>plugin.includes</name>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>>>
>>>>
>>>> <description>Regular expression naming plugin directory names to
>>>> include. Any plugin not matching this expression is excluded.
>>>> In any case you need at least include the nutch-extensionpoints
>>>> plugin. By
>>>> default Nutch includes crawling just HTML and plain text via HTTP,
>>>> and basic indexing and search plugins.
>>>> </description>
>>>> </property>
>>>>
>>>>
>>>> However, it is still only letting me search for the stemmed term
>>>> (IE "Interview" returns results but "interviewed" doesnt, even
>>>> though thats the word thats actually on the page).
>>>>
>>>> I tried a different approach and removed the query-stemmer value
>>>> from nutch-site.xml to attempt to disable the plugin. I reran the
>>>> crawl and it didn't load the plugin. However, it still had the same
>>>> stemming functionality. I'm guessing this is due to editing the
>>>> main files such as CommonGrams.java and NutchDocumentAnalyzer.java.
>>>> Should I attempt to copy the needed methods into
>>>> StemmerQueryFilter.java and try to isolate all functionality to the
>>>> plugin alone?
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> Howie Wang wrote:
>>>>> It sounds like the query-stemmer is not being called.
>>>>> The query string "interviews" needs to be processed
>>>>> into "interview". Are you sure that your nutch-default.xml
>>>>> is including the query-stemmer correctly? Put print statements
>>>>> in to see if it's getting there.
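The symptom described here (stemmed terms in the index, unstemmed term in the query) can be reproduced without Lucene. The sketch below is only an illustration: toyStem is a hypothetical stand-in that strips a trailing "ed" or "s" and is not the Porter algorithm, and the class and method names are invented for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StemMismatchDemo {
    // Hypothetical toy stemmer: lowercases and strips a trailing "ed" or "s".
    // It is NOT the Porter algorithm; it only stands in for PorterStemFilter.
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ed")) return w.substring(0, w.length() - 2);
        if (w.length() > 3 && w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    // Index-time analysis: every word on the page is stored in stemmed form.
    static Set<String> indexPage(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) terms.add(toyStem(word));
        }
        return terms;
    }

    // Query-time lookup, with or without running the query through the stemmer.
    static boolean search(Set<String> index, String query, boolean stemQuery) {
        String term = stemQuery ? toyStem(query) : query.toLowerCase(Locale.ROOT);
        return index.contains(term);
    }

    public static void main(String[] args) {
        Set<String> index = indexPage("She gave two interviews");
        System.out.println(search(index, "interview", false));  // query equals the stored stem
        System.out.println(search(index, "interviews", false)); // raw query, stem never stored as-is
        System.out.println(search(index, "interviews", true));  // query stemmed before lookup
    }
}
```

If the query filter that stems the query is never invoked, only the second kind of lookup happens, which would match the behaviour reported: "interview" finds the page and "interviews" does not.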
>>>>>
>>>>> By the way, someone recently told me that they
>>>>> were able to put all the stemming code into an indexing
>>>>> filter without touching any of the main code. All they
>>>>> did was to copy some of the code that is being done
>>>>> in NutchDocumentAnalyzer and CommonGrams into
>>>>> their custom index filter. Haven't tried it myself.
>>>>>
>>>>> HTH
>>>>> Howie
>>>>>
>>>>>> Ok. I did this for Nutch 0.8 (had to edit the listed code somewhat to
>>>>>> account for changes from 0.7.2 to 0.8, mostly having to do with
>>>>>> the Configuration type being needed).
>>>>>>
>>>>>> It partially works.
>>>>>>
>>>>>> If the page I'm trying to index contains the word "interviews"
>>>>>> and I type in the search engine "interview", the stemming takes
>>>>>> place and the page with the word "interviews" is returned.
>>>>>> However, if I type in the word "interviews" no page is returned.
>>>>>> (The page with the word interviews on it should be returned).
>>>>>>
>>>>>> Any ideas??
>>>>>> Matt
>>>>>>
>>>>>> Dima Mazmanov wrote:
>>>>>>> Hi, .
>>>>>>>
>>>>>>> I've gotten a couple of questions offlist about stemming
>>>>>>> so I thought I'd just post here with my changes. Sorry that
>>>>>>> some of the changes are in the main code and not in a plugin. It
>>>>>>> seemed that it's more efficient to put in the main analyzer. It
>>>>>>> would be nice if later releases could add support for plugging
>>>>>>> in a custom stemmer/analyzer.
>>>>>>>
>>>>>>> The first change I made is in NutchDocumentAnalyzer.java.
>>>>>>>
>>>>>>> Import the following classes at the top of the file:
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> Change tokenStream to:
>>>>>>>
>>>>>>> public TokenStream tokenStream(String field, Reader reader) {
>>>>>>> TokenStream ts =
>>>>>>> CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
>>>>>>> if (field.equals("content") || field.equals("title")) {
>>>>>>> ts = new LowerCaseFilter(ts);
>>>>>>> return new PorterStemFilter(ts);
>>>>>>> } else {
>>>>>>> return ts;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> The second change is in CommonGrams.java.
>>>>>>> Import the following classes near the top:
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> In optimizePhrase, after this line:
>>>>>>>
>>>>>>> TokenStream ts = getFilter(new ArrayTokens(phrase), field);
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>> ts = new PorterStemFilter(new LowerCaseFilter(ts));
>>>>>>>
>>>>>>> And the rest is a new QueryFilter plugin that I'm calling query-stemmer.
>>>>>>> Here's the full source for the Java file. You can copy the build.xml
>>>>>>> and plugin.xml from query-basic, and alter the names for query-stemmer.
>>>>>>>
>>>>>>> /* Copyright (c) 2003 The Nutch Organization. All rights
>>>>>>> reserved. */
>>>>>>> /* Use subject to the conditions in
>>>>>>> http://www.nutch.org/LICENSE.txt. */
>>>>>>>
>>>>>>> package org.apache.nutch.searcher.stemmer;
>>>>>>>
>>>>>>> import org.apache.lucene.search.BooleanQuery;
>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.Token;
>>>>>>> import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>>>>>> import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>
>>>>>>> import org.apache.nutch.analysis.NutchDocumentAnalyzer;
>>>>>>> import org.apache.nutch.analysis.CommonGrams;
>>>>>>>
>>>>>>> import org.apache.nutch.searcher.QueryFilter;
>>>>>>> import org.apache.nutch.searcher.Query;
>>>>>>> import org.apache.nutch.searcher.Query.*;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.HashSet;
>>>>>>> import java.io.StringReader;
>>>>>>>
>>>>>>> /** A stemming variant of the default query filter. Query terms in the
>>>>>>> * default query field are stemmed and expanded to search the url,
>>>>>>> * anchor, content and title document fields. */
>>>>>>> public class StemmerQueryFilter implements QueryFilter {
>>>>>>>
>>>>>>> private static float URL_BOOST = 4.0f;
>>>>>>> private static float ANCHOR_BOOST = 2.0f;
>>>>>>>
>>>>>>> private static int SLOP = Integer.MAX_VALUE;
>>>>>>> private static float PHRASE_BOOST = 1.0f;
>>>>>>>
>>>>>>> private static final String[] FIELDS = {"url", "anchor",
>>>>>>> "content",
>>>>>>> "title"};
>>>>>>> private static final float[] FIELD_BOOSTS = {URL_BOOST,
>>>>>>> ANCHOR_BOOST,
>>>>>>> 1.0f, 2.0f};
>>>>>>>
>>>>>>> /** Set the boost factor for url matches, relative to content
>>>>>>> and anchor
>>>>>>> * matches */
>>>>>>> public static void setUrlBoost(float boost) { URL_BOOST =
>>>>>>> boost; }
>>>>>>>
>>>>>>> /** Set the boost factor for title/anchor matches, relative to
>>>>>>> url and
>>>>>>> * content matches. */
>>>>>>> public static void setAnchorBoost(float boost) { ANCHOR_BOOST
>>>>>>> = boost; }
>>>>>>>
>>>>>>> /** Set the boost factor for sloppy phrase matches relative to
>>>>>>> * unordered term matches. */
>>>>>>> public static void setPhraseBoost(float boost) { PHRASE_BOOST = boost; }
>>>>>>>
>>>>>>> /** Set the maximum number of terms permitted between matching
>>>>>>> terms in a
>>>>>>> * sloppy phrase match. */
>>>>>>> public static void setSlop(int slop) { SLOP = slop; }
>>>>>>>
>>>>>>> public BooleanQuery filter(Query input, BooleanQuery output) {
>>>>>>> addTerms(input, output);
>>>>>>> addSloppyPhrases(input, output);
>>>>>>> return output;
>>>>>>> }
>>>>>>>
>>>>>>> private static void addTerms(Query input, BooleanQuery output) {
>>>>>>> Clause[] clauses = input.getClauses();
>>>>>>> for (int i = 0; i < clauses.length; i++) {
>>>>>>> Clause c = clauses[i];
>>>>>>>
>>>>>>> if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>> continue; // skip non-default fields
>>>>>>>
>>>>>>> BooleanQuery out = new BooleanQuery();
>>>>>>> for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>> Clause o = c;
>>>>>>> String[] opt;
>>>>>>>
>>>>>>> // TODO: I'm a little nervous about stemming for all default fields.
>>>>>>> // Should keep an eye on this.
>>>>>>> if (c.isPhrase()) { // optimize phrase clauses
>>>>>>> opt = CommonGrams.optimizePhrase(c.getPhrase(), FIELDS[f]);
>>>>>>> } else {
>>>>>>> System.out.println("o.getTerm = " +
>>>>>>> o.getTerm().toString());
>>>>>>> opt = getStemmedWords(o.getTerm().toString());
>>>>>>> }
>>>>>>> if (opt.length==1) {
>>>>>>> o = new Clause(new Term(opt[0]), c.isRequired(),
>>>>>>> c.isProhibited());
>>>>>>> } else {
>>>>>>> o = new Clause(new Phrase(opt), c.isRequired(),
>>>>>>> c.isProhibited());
>>>>>>> }
>>>>>>>
>>>>>>> out.add(o.isPhrase()
>>>>>>> ? exactPhrase(o.getPhrase(), FIELDS[f],
>>>>>>> FIELD_BOOSTS[f])
>>>>>>> : termQuery(FIELDS[f], o.getTerm(),
>>>>>>> FIELD_BOOSTS[f]),
>>>>>>> false, false);
>>>>>>> }
>>>>>>> output.add(out, c.isRequired(), c.isProhibited());
>>>>>>> }
>>>>>>> System.out.println("query = " + output.toString());
>>>>>>> }
>>>>>>>
>>>>>>> private static String[] getStemmedWords(String value) {
>>>>>>> StringReader sr = new StringReader(value);
>>>>>>> TokenStream ts = new PorterStemFilter(new
>>>>>>> LowerCaseTokenizer(sr));
>>>>>>>
>>>>>>> String stemmedValue = "";
>>>>>>> try {
>>>>>>> Token token = ts.next();
>>>>>>> int count = 0;
>>>>>>> while (token != null) {
>>>>>>> System.out.println("token = " +
>>>>>>> token.termText());
>>>>>>> System.out.println("type = " + token.type());
>>>>>>>
>>>>>>> if (count == 0)
>>>>>>> stemmedValue = token.termText();
>>>>>>> else
>>>>>>> stemmedValue = stemmedValue + " " +
>>>>>>> token.termText();
>>>>>>>
>>>>>>> token = ts.next();
>>>>>>> count++;
>>>>>>> }
>>>>>>> } catch (Exception e) {
>>>>>>> stemmedValue = value;
>>>>>>> }
>>>>>>>
>>>>>>> if (stemmedValue.equals("")) {
>>>>>>> stemmedValue = value;
>>>>>>> }
>>>>>>>
>>>>>>> String[] stemmedValues = stemmedValue.split("\\s+");
>>>>>>>
>>>>>>> for (int j=0; j<stemmedValues.length; j++) {
>>>>>>> System.out.println("stemmedValues = " +
>>>>>>> stemmedValues[j]);
>>>>>>> }
>>>>>>> return stemmedValues;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> private static void addSloppyPhrases(Query input, BooleanQuery
>>>>>>> output) {
>>>>>>> Clause[] clauses = input.getClauses();
>>>>>>> for (int f = 0; f < FIELDS.length; f++) {
>>>>>>>
>>>>>>> PhraseQuery sloppyPhrase = new PhraseQuery();
>>>>>>> sloppyPhrase.setBoost(FIELD_BOOSTS[f] * PHRASE_BOOST);
>>>>>>> sloppyPhrase.setSlop("anchor".equals(FIELDS[f])
>>>>>>> ? NutchDocumentAnalyzer.INTER_ANCHOR_GAP
>>>>>>> : SLOP);
>>>>>>> int sloppyTerms = 0;
>>>>>>>
>>>>>>> for (int i = 0; i < clauses.length; i++) {
>>>>>>> Clause c = clauses[i];
>>>>>>>
>>>>>>> if (!c.getField().equals(Clause.DEFAULT_FIELD))
>>>>>>> continue; // skip non-default fields
>>>>>>>
>>>>>>> if (c.isPhrase()) // skip exact phrases
>>>>>>> continue;
>>>>>>>
>>>>>>> if (c.isProhibited()) // skip prohibited terms
>>>>>>> continue;
>>>>>>>
>>>>>>> sloppyPhrase.add(luceneTerm(FIELDS[f], c.getTerm()));
>>>>>>> sloppyTerms++;
>>>>>>> }
>>>>>>>
>>>>>>> if (sloppyTerms > 1)
>>>>>>> output.add(sloppyPhrase, false, false);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> private static org.apache.lucene.search.Query
>>>>>>> termQuery(String field, Term term, float boost) {
>>>>>>> TermQuery result = new TermQuery(luceneTerm(field, term));
>>>>>>> result.setBoost(boost);
>>>>>>> return result;
>>>>>>> }
>>>>>>>
>>>>>>> /** Utility to construct a Lucene exact phrase query for a
>>>>>>> Nutch phrase.
>>>>>>> */
>>>>>>> private static org.apache.lucene.search.Query
>>>>>>> exactPhrase(Phrase nutchPhrase,
>>>>>>> String field, float boost) {
>>>>>>> Term[] terms = nutchPhrase.getTerms();
>>>>>>> PhraseQuery exactPhrase = new PhraseQuery();
>>>>>>> for (int i = 0; i < terms.length; i++) {
>>>>>>> exactPhrase.add(luceneTerm(field, terms[i]));
>>>>>>> }
>>>>>>> exactPhrase.setBoost(boost);
>>>>>>> return exactPhrase;
>>>>>>> }
>>>>>>>
>>>>>>> /** Utility to construct a Lucene Term given a Nutch query
>>>>>>> term and field.
>>>>>>> */
>>>>>>> private static org.apache.lucene.index.Term luceneTerm(String field,
>>>>>>> Term term) {
>>>>>>> return new org.apache.lucene.index.Term(field, term.toString());
>>>>>>> }
>>>>>>> }
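For anyone reading addTerms above, this is roughly how one stemmed term gets expanded across the four fields. It is a plain-Java sketch with no Lucene dependency: the field names and boost values are copied from the filter, while the class FieldExpansionSketch, the method expand, and the "field:term^boost" text format are only illustrative and are not what the plugin prints.

```java
final class FieldExpansionSketch {
    // Field names and boosts copied from StemmerQueryFilter above.
    static final String[] FIELDS = {"url", "anchor", "content", "title"};
    static final float[] FIELD_BOOSTS = {4.0f, 2.0f, 1.0f, 2.0f};

    // Renders one optional clause per field for an already-stemmed term.
    // The "field:term^boost" notation is illustrative output only.
    static String expand(String stemmedTerm) {
        StringBuilder sb = new StringBuilder();
        for (int f = 0; f < FIELDS.length; f++) {
            if (f > 0) sb.append(' ');
            sb.append(FIELDS[f]).append(':').append(stemmedTerm);
            if (FIELD_BOOSTS[f] != 1.0f) sb.append('^').append(FIELD_BOOSTS[f]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("interview"));
    }
}
```

Here expand("interview") returns "url:interview^4.0 anchor:interview^2.0 content:interview title:interview^2.0". Note that the stemmed term is also looked up in url and anchor, while the tokenStream change earlier in this message stems only content and title at index time, so those two fields may not match on a stemmed form.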
>>>>>>>