Posted to user@nutch.apache.org by Lourival Júnior <ju...@gmail.com> on 2006/08/10 20:58:51 UTC

common-terms.utf8

Hi,

Could anyone explain what exactly the common-terms.utf8 file does? I
don't understand the real purpose of this file...

Regards,

-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: [Nutch-general] common-terms.utf8

Posted by Andrzej Bialecki <ab...@getopt.org>.
ogjunk-nutch@yahoo.com wrote:
> This is because Nutch turns those common terms into ngrams (not sure of what size), and that increases the size of the index.
> For example, if you have a phrase like:
>
>   vacation time
>
> Normally, Nutch will index this phrase as 2 terms, a total of 12 characters (probably less, if these words are stemmed)
> If those two words are defined as common terms, and Nutch indexes them as ngrams (say bigrams), it will index something like this:
>
>   va ac ca at ti io on ti im me
>   

No, Nutch uses word-level ngrams (where n=2), so using this example it 
would be:

    vacation-time

Or, using a better example:

    "words in common"

becomes:

    words-in in-common
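
As a rough illustration of the word-level bigrams above, here is a minimal
Python sketch. It is not Nutch's code: the function name common_grams and the
rule "join two adjacent words when at least one of them is a common term" are
assumptions made only to reproduce the two examples in this message.

```python
def common_grams(tokens, common_terms):
    """Join adjacent words with '-' when at least one of the pair is a common term.

    Hypothetical helper for illustration only; not taken from Nutch.
    """
    grams = []
    for left, right in zip(tokens, tokens[1:]):
        if left in common_terms or right in common_terms:
            grams.append(f"{left}-{right}")
    return grams


# "in" is assumed to be listed in common-terms.utf8
print(common_grams("words in common".split(), {"in"}))
# → ['words-in', 'in-common']

# both words assumed to be listed as common terms
print(common_grams("vacation time".split(), {"vacation", "time"}))
# → ['vacation-time']
```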

This is especially useful for phrase queries: a bigram is much rarer in 
the index than the common word on its own, which drastically reduces the 
number of postings to check. The cost is a somewhat larger index.
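
The effect on postings can be sketched with a toy inverted index in Python.
Everything here is an assumption for illustration (the build_index helper, the
four sample documents, the choice of common terms); it only shows that a
bigram points at fewer documents than the common word alone.

```python
from collections import defaultdict


def build_index(docs, common_terms):
    """Toy inverted index: term -> set of document ids (hypothetical helper)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # index every single word
        for tok in tokens:
            index[tok].add(doc_id)
        # additionally index word-level bigrams that involve a common term
        for left, right in zip(tokens, tokens[1:]):
            if left in common_terms or right in common_terms:
                index[f"{left}-{right}"].add(doc_id)
    return index


docs = {
    1: "words in common",
    2: "a house in the city",
    3: "in the beginning",
    4: "common words",
}
index = build_index(docs, {"in", "the", "a"})

print(len(index["in"]))        # → 3 (postings for the common word alone)
print(len(index["words-in"]))  # → 1 (postings for the bigram)
```

A phrase query for "words in common" would then only need to look at the
postings of words-in and in-common instead of walking every document that
contains "in".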

You can clearly see the effect of this file if you run your query 
through the Nutch query parser:

    bin/nutch org.apache.nutch.searcher.Query

Give it a phrase query (surrounded by double quotes) containing one of 
the common terms, and see what happens.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Nutch-general] common-terms.utf8

Posted by og...@yahoo.com.
This is because Nutch turns those common terms into ngrams (not sure of what size), and that increases the size of the index.
For example, if you have a phrase like:

  vacation time

Normally, Nutch will index this phrase as 2 terms, a total of 12 characters (probably less, if these words are stemmed)
If those two words are defined as common terms, and Nutch indexes them as ngrams (say bigrams), it will index something like this:

  va ac ca at ti io on ti im me

That's 20 characters.  Hence the index size increase.  Doug Cutting once emailed (off the list) and reported his index doubled in size when he defined 10 of the most common terms in that file.  When he used the 5 most common terms, the index size increased by about 20%.

Also note that every time you change this file, you'll have to reindex.
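
To get a feel for why more common terms mean a bigger index, here is a small
Python sketch. It uses the word-level bigrams described in Andrzej's reply in
this thread, and it assumes the bigrams are stored in addition to the single
words. The function name, the sample sentence and the term lists are made up,
and the counts are toy numbers, not real index sizes.

```python
def count_index_entries(tokens, common_terms):
    """Return (entries without bigrams, entries with bigrams added).

    Hypothetical helper: counts one entry per word, plus one per adjacent
    pair in which at least one word is a common term.
    """
    plain = len(tokens)
    grams = sum(
        1
        for left, right in zip(tokens, tokens[1:])
        if left in common_terms or right in common_terms
    )
    return plain, plain + grams


tokens = "the cat sat on the mat in the sun".split()

# one common term
print(count_index_entries(tokens, {"the"}))
# → (9, 14)

# three common terms: more pairs qualify, so more entries are added
print(count_index_entries(tokens, {"the", "on", "in"}))
# → (9, 16)
```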

Otis

----- Original Message ----
From: Lourival Júnior <ju...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Friday, August 11, 2006 8:19:41 AM
Subject: Re: [Nutch-general] common-terms.utf8

Hi Timo!

I compared the index before and after correctly setting up the
common-terms.utf8 file. Before adding the common terms for my language,
my index was about 3 MB.
After adding them, it is now 5 MB! Why does this happen?

Regards!


Re: common-terms.utf8

Posted by Lourival Júnior <ju...@gmail.com>.
Hi Timo!

I compared the index before and after correctly setting up the
common-terms.utf8 file. Before adding the common terms for my language,
my index was about 3 MB.
After adding them, it is now 5 MB! Why does this happen?

Regards!


-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: common-terms.utf8

Posted by Lourival Júnior <ju...@gmail.com>.
Hi Timo!

Thanks a lot! Now I have a clear understanding of this file. This article
helped a lot too: http://searchenginewatch.com/showPage.html?page=2156061

Thanks again!

On 8/11/06, Timo Scheuer <timo.scheuer@dfki.de> wrote:
>
> Hi,
>
> > Could anyone explain what exactly the common-terms.utf8 file does? I
> > don't understand the real purpose of this file...
>
> During indexing (and also during searching), the common terms are used to
> form n-grams, which makes searches involving common words (articles, for
> example) faster. It is an alternative to using stop words. N-grams keep the
> common words by joining them to the adjacent word. This approach increases
> the selectivity of the indexed terms.
>
>
> Cheers,
> Timo.
>



-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: common-terms.utf8

Posted by Timo Scheuer <ti...@dfki.de>.
Hi,

> Could anyone explain what exactly the common-terms.utf8 file does? I
> don't understand the real purpose of this file...

During indexing (and also during searching), the common terms are used to form 
n-grams, which makes searches involving common words (articles, for example) 
faster. It is an alternative to using stop words. N-grams keep the common words 
by joining them to the adjacent word. This approach increases the selectivity 
of the indexed terms.
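
The difference from stop words can be sketched in a few lines of Python. This
is only an illustration under assumptions: both helper functions and the list
of common words are made up, and the phrase is chosen because it consists of
nothing but common words.

```python
def remove_stop_words(tokens, stop_words):
    """Classic stop-word handling: the listed words are simply dropped."""
    return [t for t in tokens if t not in stop_words]


def common_grams(tokens, common_terms):
    """Hypothetical n-gram handling: keep common words by joining each one
    to its neighbour instead of dropping it."""
    return [
        f"{left}-{right}"
        for left, right in zip(tokens, tokens[1:])
        if left in common_terms or right in common_terms
    ]


phrase = "to be or not to be".split()
common = {"to", "be", "or", "not"}

print(remove_stop_words(phrase, common))
# → []  (nothing left to search for)

print(common_grams(phrase, common))
# → ['to-be', 'be-or', 'or-not', 'not-to', 'to-be']
```

With stop words the phrase disappears entirely, while the n-grams keep it
searchable and each n-gram is far more selective than "to" or "be" alone.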


Cheers,
Timo.