Posted to user@nutch.apache.org by "Vanderdray, Jacob" <JV...@aarp.org> on 2006/03/31 00:18:52 UTC

Common Terms

	I've added some code to query-basic to log the query after it
has run both addTerms and addPhrases.  This helps me to better
understand what's going on.  I've noticed that when my search contains
words like "the" or "a", those don't appear in the actual query.

	It looks to me like the common-terms.utf8 file is supposed to be
used to strip common words like "the" out of queries for specific
fields, but that doesn't seem to be what's happening.  The term "the"
ends up getting stripped out of the query for all fields (url, content,
anchor, etc.).  I even tried removing "the" from the common-terms.utf8
file, but didn't see any change in behavior.

	Is this file only used when indexing?  If so, what
determines which words get stripped out of searches?

Thanks,
Jake.

Re: Common Terms

Posted by Rajesh Munavalli <ra...@gmail.com>.
There is a list of stop words in the NutchAnalysis class 
(org.apache.nutch.analysis). I guess that's where the common terms are 
removed during analysis.
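
If that guess is right, it would also explain why editing
common-terms.utf8 made no difference: a stop list compiled into the
analysis code is applied when the query is parsed, before any
field-specific handling runs. Below is a minimal sketch of that idea
only. It is not the actual Nutch code; the class name, the method
names, and the word list are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not the real NutchAnalysis class.
class StopWordSketch {

    // Assumed hard-coded stop list; the real list may differ.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // True if the word is in the built-in list (case-insensitive).
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    // Splits a query on whitespace and drops stop words. Because this
    // happens at parse time, every field built from the result
    // (url, content, anchor, ...) sees the same reduced term list.
    static List<String> parseTerms(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.isEmpty() && !isStopWord(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [history, search, engine]
        System.out.println(parseTerms("the history of a search engine"));
    }
}
```

In this sketch no external file is consulted, so changing a config
file cannot bring "the" back; only changing the list in the code would.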

--Rajesh Munavalli
Blog: http://mathsearch.blogspot.com

Vanderdray, Jacob wrote:
> 	I've added some code to query-basic to log the query after it
> has run both addTerms and addPhrases.  This helps me to better
> understand what's going on.  I've noticed that when my search contains
> words like "the" or "a", those don't appear in the actual query.
>
> 	It looks to me like the common-terms.utf8 file is supposed to be
> used to strip common words like "the" out of queries for specific
> fields, but that doesn't seem to be what's happening.  The term "the"
> ends up getting stripped out of the query for all fields (url, content,
> anchor, etc.).  I even tried removing "the" from the common-terms.utf8
> file, but didn't see any change in behavior.
>
> 	Does this file only get used when indexing?  If so what
> determines which words get stripped out of searches?
>
> Thanks,
> Jake.
>
>   

