Posted to user@nutch.apache.org by AJ Chen <an...@sbcglobal.net> on 2005/09/01 08:09:44 UTC

how to fetch all web pages on one site

I'm testing Nutch whole-web crawling with just one URL in a text file. 
But after generate/fetch/updatedb/index, there is only one document in 
the index. Questions:
1. What needs to be set in order to fetch all available web pages on one 
site?
2. Where is the log file I can check to see what's going on?
Thanks,

-AJ



Re: [Nutch-general] scope filter in OC

Posted by Kelvin Tan <ke...@relevanz.com>.
There is an FLFilter in OC which uses Nutch's regex-urlfilter.txt. I believe it's called NutchUrlFLFilter.
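
For readers unfamiliar with the file format: regex-urlfilter.txt holds one rule per line, a "+" (accept) or "-" (reject) followed by a regular expression, and the first rule that matches a URL decides its fate. The sketch below mimics that behaviour in Python to show how a one-domain scope could be expressed. It is an illustration only, not OC's or Nutch's code; the rule set, the example.com domain, and the function names parse_rules and accepts are all made up for this example.

```python
import re

# Hypothetical rules written in the "+regex / -regex" style of
# regex-urlfilter.txt. The domain example.com is a placeholder.
RULES_TEXT = r"""
# skip a few common binary/image suffixes
-\.(gif|jpg|png|css|zip|exe)$
# accept anything on example.com and its subdomains
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in ("http://www.example.com/docs/index.html",
                "http://www.example.com/logo.gif",
                "http://other.org/page.html"):
        print(url, accepts(RULES, url))
```

With rules like these, a seed list of several domains would simply mean one "+" line per domain, placed before the final catch-all "-." line.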

On Tue, 6 Sep 2005 19:32:15 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> Does OC support domain crawling like url-filter.txt?
> If so, how do I pass the list of seed domains to OC?
>
> I looked at OC's org.supermind.crawl.scope package but didn't
> see a similar concept.
>
> thanks,
>
> Michael Ji



scope filter in OC

Posted by Michael Ji <fj...@yahoo.com>.
Hi Kelvin:

Does OC support domain crawling like url-filter.txt?
If so, how do I pass the list of seed domains to OC?

I looked at OC's org.supermind.crawl.scope package but didn't
see a similar concept.

thanks,

Michael Ji


	
		

Re: how to fetch all web pages on one site

Posted by Michael Ji <fj...@yahoo.com>.
I think you need to run several rounds. The first round only
crawls the homepage of the site.
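
To make the point about rounds concrete, here is a toy model in Python (not Nutch code). Each round fetches only the pages already known but not yet fetched, then records the links found on them, so one round from a single seed fetches just that seed. The link graph, the page names, and the crawl function are invented for this illustration.

```python
# Hypothetical link graph for one small site: page -> pages it links to.
LINKS = {
    "/": ["/about", "/docs"],
    "/about": ["/"],
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/b"],
    "/docs/b": [],
}


def crawl(seeds, links, rounds):
    """Run up to `rounds` generate/fetch/update cycles; return fetched pages."""
    known = set(seeds)   # stands in for the database of known URLs
    fetched = set()
    for _ in range(rounds):
        # "generate": the fetch list is whatever is known but not yet fetched
        fetchlist = sorted(known - fetched)
        if not fetchlist:
            break  # nothing new was discovered, so further rounds are no-ops
        for page in fetchlist:
            fetched.add(page)                   # "fetch"
            known.update(links.get(page, []))   # "update": record outlinks
    return fetched


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, sorted(crawl(["/"], LINKS, n)))
```

In this model the number of rounds needed equals the link depth of the deepest page from the seed, which is why a single pass leaves only one document in the index.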

I use the screen output as the log information. I'm not
sure what other logs there are.

Michael Ji,

--- AJ Chen <an...@sbcglobal.net> wrote:

> I'm testing nutch whole-web crawling with just one
> url in a text file. 
> But, after generate/fetch/updatedb/index, there is
> only one document in 
> the index. Questions:
> 1. What needs to be set in order to fetch all
> available web pages on one 
> site?
> 2. Where is the log file that I can check what's
> going on?
> Thanks,
> 
> -AJ
> 
> 
> 

