Posted to user@nutch.apache.org by carmmello <ca...@globo.com> on 2006/09/25 20:34:39 UTC

Common terms

I'm using Nutch 0.7.2 and have added some common terms in Portuguese to common-terms.utf8 in the conf folder (and also to the copy under the classes folder, inside the ROOT folder on Tomcat), one per line, like:
....................
content:da
contente:de
contente:eu
..................
However, when I run a search, I get full results for those Portuguese common terms, while the original English terms return zero results. I have even tried listing all the terms in alphabetical order, including the original ones, with the same result. In other words, Nutch does not seem to recognize the added terms as common terms, only the original ones included in the distribution.
Can anyone clarify this?
Thanks
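
One detail worth ruling out: the sample lines above use two different prefixes, `content:` and `contente:`. Whether that is a typo in the email or in the actual file is not stated. The sketch below, in Python, is a hypothetical helper (not part of Nutch) that flags entries whose prefix is not in an expected set. It assumes each entry has the form `field:term`, that `content` is the intended field, and that lines starting with `#` are comments; adjust these assumptions to match the file shipped with your distribution.

```python
def check_common_terms(lines, expected_fields=("content",)):
    """Return (line_number, text, reason) for each suspicious entry.

    expected_fields is an assumption: only `content` appears in the
    thread, so extend it if your file legitimately uses other fields.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not text or text.startswith("#"):
            continue
        field, sep, term = text.partition(":")
        if not sep or not term:
            problems.append((number, text, "missing 'field:term' separator"))
        elif field not in expected_fields:
            problems.append((number, text, f"unexpected field '{field}'"))
    return problems


# The three sample lines quoted in the question.
sample = ["content:da", "contente:de", "contente:eu"]
for number, text, reason in check_common_terms(sample):
    print(f"line {number}: {text} ({reason})")
```

To check a real file, pass its lines instead of the sample list, for example `check_common_terms(open(path, encoding="utf-8"))`, where `path` points to your copy of common-terms.utf8.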

Re: Common terms

Posted by Lourival Júnior <ju...@gmail.com>.
OK. If you are crawling with these settings, you don't need to reindex your
segments again. What about the plugins you are using? Are you using
the language-identifier plugin? If not, try it.

Regards,

P.S. I speak Portuguese :)


-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Common terms

Posted by carmmello <ca...@globo.com>.
This issue happens even when I start a new crawl, so I'm not reindexing
the segments. The indexing is done by Nutch itself, using the intranet
method.
Do you mean that after this is done I have to reindex the segments once
again? But if so, why are the English common terms recognized the first time?
Thanks again





Re: Common terms

Posted by Lourival Júnior <ju...@gmail.com>.
Have you reindexed your segments? It's important, because that is what makes
Nutch recognize your common terms. I've tried it, and the only thing I noticed
was that the index is larger than the original (before using the common
terms).



