Posted to user@nutch.apache.org by Hilkiah Lavinier <hi...@yahoo.com> on 2008/01/11 22:36:39 UTC

nutch reindex question

Hi, I've got a question about reindexing.

Let's say I index site A, which has 4 pages (links 1, 2, 3 and 4). Now suppose page 3 is deleted and I recrawl/reindex. Would Nutch delete the information stored for page 3, since it is no longer valid?

Also, how can I configure Nutch to index a site's content without storing the content locally (for the cache)? For me, storing a cached version of the site is really a waste of time.
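For example, I'm guessing (and this is only a guess on my part) that the fetcher.store.content property that ships in nutch-default.xml is what controls this, and that an override in conf/nutch-site.xml along these lines would do it:

  <!-- inside the <configuration> element of conf/nutch-site.xml -->
  <!-- assumption: fetcher.store.content (default true) decides whether the raw
       page content is kept in the segments; with it off you probably also need
       fetcher.parse=true so pages are parsed at fetch time -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
    <description>Index pages but do not keep the raw content for caching.</description>
  </property>
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>Parse at fetch time, since no content is stored for a later parse.</description>
  </property>

Is that the right idea, or is there a better way to do it?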

Thanks in advance.

Regards,
 
Hilkiah G. Lavinier MEng (Hons), ACGI 

----- Original Message ----
From: Susam Pal <su...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Monday, January 7, 2008 1:57:27 PM
Subject: Re: nutch crawl problem


What command are you running to crawl the web? If you are using the
'bin/nutch crawl' command, then 'conf/crawl-urlfilter.txt' is used. Is
the question mark after http://www.search.com punctuation, or is it
part of the URL? If it is part of the URL, the second rule, -[?*!@=],
in 'conf/crawl-urlfilter.txt' is filtering it out. There can be a
variety of reasons why your crawl is failing. Please read the
'logs/hadoop.log' file and see if you can find the cause of the error.
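
To make that concrete (a sketch based on the rules you pasted below, not something I have tested against your setup): that rule drops any URL containing '?', '*', '!', '@' or '=', so a query URL such as http://www.search.com/?q=foo (just an invented example) would never be fetched. If you do want URLs with query strings, you could relax that line in 'conf/crawl-urlfilter.txt', for example:

  # skip URLs containing certain characters as probable queries, etc.
  # (relaxed to let '?' and '=' through; query URLs can blow up the size
  # of the crawl, so use this with care)
  -[*!@]

Keep in mind this only changes which URLs are accepted; it does not fix a fetch that is failing for other reasons.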

"Generator: 0 records selected for fetching, exiting ..." - You get
this error if your depth value is high but there are no more URLs to
fetch. This may happen in your case because the fetch in the first
cycle fails. So no new URLs are discovered and as a result there are
no URLs to fetch. Another possibility is that the first set of URLs
fetched in the first cycle do not point to any other pages that is
allowed by 'conf/crawl-urlfilter.txt'.
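
If it helps, this is roughly how I go looking for the cause myself (nothing Nutch-specific, just ordinary log inspection; the grep patterns are only a guess at what your particular failure will look like):

  # show the most recent activity after a failed crawl
  tail -n 100 logs/hadoop.log

  # look for fetch errors and stack traces
  grep -iE 'error|exception|failed' logs/hadoop.log | tail -n 20

Whatever shows up there will usually point you at the real problem (DNS, robots.txt, timeouts, or the URL filters dropping everything).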

Regards,
Susam Pal

On Jan 7, 2008 8:56 AM,  <su...@hotmail.com> wrote:
> Why can I crawl http://game.search.com but I can't crawl http://www.search.com? My conf/crawl-urlfilter is:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> #-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*search.com/
>
> # skip everything else
> +.
>
> And some hosts I can't crawl at all because I get the error "Generator: 0 records selected for fetching, exiting ...". I set the same config for all hosts. Why?
>
>
>
