Posted to user@nutch.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2005/12/28 19:16:36 UTC

Why does Nutch use n-grams in analysis?

I thought n-grams were used only for language identification, but
I see they are also used in another area.

In the source code of CommonGrams and the API doc:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html
I see (tokens representing) n-grams are "inserted" into the token stream.

Does this mean that text such as "Web Services is cool" is represented
by the token sequence {"Web", "Services", "Web Services", ("is"
ignored as a stop word), "cool"}, assuming "web services"
is a commonly used bi-gram? Or something else?

Why does Nutch do this?

-kuro

Re: Why does Nutch use n-grams in analysis?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Teruhiko Kurosaka wrote:

>I thought n-grams are used for language identification only but
>I see they are used in another area.
>
>In the source code of CommonGrams and the API doc:
>http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html
>I see (tokens representing) n-grams are "inserted" to the token stream.
>

These are two different kinds of n-grams. In language identification we
use character-level n-grams (that is, groups of two or more characters);
in NutchAnalysis we use word-level n-grams (that is, groups of two or
more words).
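
To make the distinction concrete, here is a minimal sketch of both kinds of n-gram. The class and method names (NGramKinds, charNGrams, wordNGrams) are made up for illustration and are not part of Nutch; the sketch only shows the general idea, not how Nutch's language identifier or analyzer is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: these names are hypothetical, not Nutch classes.
public class NGramKinds {

    // Character-level n-grams: every run of n consecutive characters.
    public static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Word-level n-grams: every run of n consecutive whitespace-separated words.
    public static List<String> wordNGrams(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("web", 2));                  // [we, eb]
        System.out.println(wordNGrams("web services is cool", 2)); // [web services, services is, is cool]
    }
}
```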


>Does this mean a situation such as "Web Services is cool" is represented
>by token sequence of {"Web", "Services", "Web Services", ("is"
>ignored being a stop word), "cool"}, assuming "web services"
>is a commonly used bi-gram? Or something else?
>

No, in this case, if "web" and "services" were added to 
common-grams.utf8, the result would look like:

web|web-services, services|services-is, cool

where | marks tokens indexed at the same position in the index.
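
The token layout above can be reproduced with a small sketch. This is not the Nutch CommonGrams code; it is a simplified stand-in written under several assumptions: the COMMON set plays the role of common-grams.utf8, "is" is treated as a stop word that is not indexed on its own, and a bigram is emitted (joined with "-") whenever a common word is followed by another word. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a simplified model, not the real Nutch implementation.
public class CommonGramsSketch {

    // Assumed stand-in for the contents of common-grams.utf8.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("web", "services"));
    // Assumed stop-word list.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("is"));

    // Returns one entry per index position; tokens that share
    // a position are joined with '|'.
    public static List<String> analyze(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            if (STOP.contains(word)) {
                continue; // assumption: a stop word gets no position of its own
            }
            StringBuilder slot = new StringBuilder(word);
            if (COMMON.contains(word) && i + 1 < words.length) {
                // The bigram is placed at the same position as its first word.
                slot.append('|').append(word).append('-').append(words[i + 1]);
            }
            positions.add(slot.toString());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Prints: web|web-services, services|services-is, cool
        System.out.println(String.join(", ", analyze("Web Services is cool")));
    }
}
```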

>Why does Nutch do this?
>

Using word-level n-grams helps to lower the frequency of the most common
terms, which reduces the amount of work that the search engine needs to
perform in order to find all occurrences of these words when processing a
phrase query.
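
A toy illustration of the frequency argument, using invented data: the four "documents" below are hand-written token lists of the kind the analysis would produce, and the class name PostingSizeSketch is hypothetical. The sketch builds a term-to-documents map and compares how many entries would have to be examined for the single terms versus the combined gram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: toy inverted index over made-up documents.
public class PostingSizeSketch {

    // Maps each term to the ids of the documents that contain it.
    public static Map<String, List<Integer>> index(List<List<String>> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            // LinkedHashSet drops repeated terms within one document.
            for (String term : new LinkedHashSet<>(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("web", "web-services", "services", "services-is", "cool"),
            Arrays.asList("web", "web-page", "page"),
            Arrays.asList("web", "web-design", "design"),
            Arrays.asList("services", "services-cost", "cost"));
        Map<String, List<Integer>> postings = index(docs);

        // A phrase query on the separate words has to walk both long lists...
        System.out.println("web: " + postings.get("web").size());           // web: 3
        System.out.println("services: " + postings.get("services").size()); // services: 2
        // ...while the combined gram is a single, much rarer term.
        System.out.println("web-services: " + postings.get("web-services").size()); // web-services: 1
    }
}
```

On a real index the gap is far larger, since the most common terms occur in a large share of all documents while any particular gram occurs in far fewer.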

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com