Posted to user@nutch.apache.org by Lourival Júnior <ju...@gmail.com> on 2006/07/07 19:20:50 UTC

Number of pages different to number of indexed pages

Hi all!

I have a question. My WebDB currently contains 779 pages with 899 links.
When I use the segread command it also reports 779 pages in the one
segment. However, when I run a search, or when I open the index in Luke,
the maximum number of documents is 437. I've looked at the recrawl logs,
and while the script is fetching pages, some of them show this message:

... failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
Exceeded http.max.delays: retry later.

I think this happens because of some network problem. The fetcher tries to
fetch some pages but does not succeed, so when the segment is indexed,
only the successfully fetched pages appear in the results. This is a
problem for me.

Could someone explain what I should do to refetch these pages and increase
my web search results? Should I change the http.max.delays and
fetcher.server.delay properties in nutch-default.xml?

Regards,


-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Number of pages different to number of indexed pages

Posted by Lourival Júnior <ju...@gmail.com>.
Yes! It really works! I'm executing the recrawl right now, and it is
fetching the pages that it hadn't managed to fetch before... It takes
longer, but the final result is more important.

Thanks a lot!

On 7/7/06, Honda-Search Administrator <ad...@honda-search.com> wrote:
>
> This is typical if you are crawling only a few sites.  I crawl 7 sites
> nightly and often get this error.  I changed my http.max.delays property
> from 3 to 50 and it works without a problem.  The crawl takes longer, but
> I
> get almost all of the pages.
>
> ----- Original Message -----
> From: "Lourival Júnior" <ju...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Friday, July 07, 2006 10:20 AM
> Subject: Number of pages different to number of indexed pages
>
> [quoted original message snipped; see the first post above]
>


-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Number of pages different to number of indexed pages

Posted by Honda-Search Administrator <ad...@honda-search.com>.
This is typical if you are crawling only a few sites.  I crawl 7 sites 
nightly and often get this error.  I changed my http.max.delays property 
from 3 to 50 and it works without a problem.  The crawl takes longer, but I 
get almost all of the pages.
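
For reference, an override like the one described above would normally go
in conf/nutch-site.xml, which takes precedence over nutch-default.xml
(editing nutch-default.xml directly also works, but the change is lost on
upgrade). A sketch, assuming the standard Nutch/Hadoop-style configuration
format; the fetcher.server.delay value shown is illustrative, so tune both
numbers to your own crawl:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- How many times the fetcher will wait on a busy host before
       giving up on a page with "Exceeded http.max.delays".
       Raised from the default of 3, as suggested above. -->
  <property>
    <name>http.max.delays</name>
    <value>50</value>
  </property>

  <!-- Seconds to wait between successive requests to the same host.
       Illustrative value; keep it polite for the sites you crawl. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

</configuration>
```

The intuition: when a crawl covers only a few hosts, many queued pages
point at the same server, so the fetcher keeps running into the per-host
politeness delay. Raising http.max.delays lets it wait through more of
those delays instead of abandoning the page, which is why the crawl takes
longer but retrieves almost everything.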

----- Original Message ----- 
From: "Lourival Júnior" <ju...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Friday, July 07, 2006 10:20 AM
Subject: Number of pages different to number of indexed pages


[quoted original message snipped; see the first post above]