Posted to user@nutch.apache.org by AJ Chen <an...@sbcglobal.net> on 2005/09/01 08:09:44 UTC
how to fetch all web pages on one site
I'm testing nutch whole-web crawling with just one url in a text file.
But, after generate/fetch/updatedb/index, there is only one document in
the index. Questions:
1. What needs to be set in order to fetch all available web pages on one
site?
2. Where is the log file that I can check what's going on?
Thanks,
-AJ
Re: [Nutch-general] scope filter in OC
Posted by Kelvin Tan <ke...@relevanz.com>.
There is a FLFilter in OC which uses Nutch's regex-urlfilter.txt. I believe it's called NutchUrlFLFilter.
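
To illustrate the kind of scope rule a regex URL filter file expresses, here is a minimal Python sketch. It is a stand-alone model, not OC or Nutch code. It assumes the usual convention for such files: each rule is a `+` (accept) or `-` (reject) followed by a regular expression, the first matching rule decides, and a URL that matches nothing is rejected. The rule text, `example.com`, and the function names `parse_rules` and `accepts` are hypothetical.

```python
import re

# Hypothetical rules in "+regex / -regex" style; example.com is a placeholder.
RULES = r"""
# skip image and archive suffixes
-\.(gif|jpg|png|zip|gz)$
# accept anything on the example.com domain
+^http://([a-z0-9-]*\.)*example\.com/
# reject everything else
-.
"""

def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: first matching rule wins; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

rules = parse_rules(RULES)
for url in ("http://www.example.com/index.html",
            "http://www.example.com/logo.gif",
            "http://other.org/"):
    print(url, accepts(rules, url))
```

Restricting a crawl to a list of seed domains then comes down to one `+` rule per domain, followed by a final catch-all `-` rule.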
On Tue, 6 Sep 2005 19:32:15 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> Does OC support domain crawling like url-filter.txt?
> If so, how do I insert the seed domain list into OC?
>
> I saw OC's org.supermind.crawl.scope package, but didn't
> see a similar concept.
>
> thanks,
>
> Michael Ji
scope filter in OC
Posted by Michael Ji <fj...@yahoo.com>.
Hi Kelvin:
Does OC support domain crawling like url-filter.txt?
If so, how do I insert the seed domain list into OC?
I saw OC's org.supermind.crawl.scope package, but didn't
see a similar concept.
thanks,
Michael Ji
Re: how to fetch all web pages on one site
Posted by Michael Ji <fj...@yahoo.com>.
I think you need to run several rounds. The first round only
crawls the homepage of the site; each later round fetches the
links discovered in the previous one.
I use the screen output as the log information. I'm not
sure what other logs there are.
Michael Ji,
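
Why one generate/fetch/updatedb pass yields a single document can be shown with a toy model. The sketch below is plain Python, not Nutch code. The link graph `LINKS`, the `example.com` URLs, and the function `crawl_round` are all hypothetical. It assumes each round fetches only URLs that were already known when the round started, and that newly discovered links become fetchable in the next round.

```python
# Hypothetical link graph for one small site: page -> outlinks.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": [],
}

def crawl_round(known, fetched):
    """One round: fetch every known-but-unfetched URL, then record its outlinks."""
    to_fetch = sorted(known - fetched)        # "generate": pick the fetch list
    for url in to_fetch:                      # "fetch"
        fetched.add(url)
        known.update(LINKS.get(url, []))      # "updatedb": remember new links
    return to_fetch

known = {"http://example.com/"}               # the single seed URL
fetched = set()
rounds = []
while True:
    batch = crawl_round(known, fetched)
    if not batch:
        break                                 # nothing new left to fetch
    rounds.append(batch)

for number, batch in enumerate(rounds, start=1):
    print("round", number, batch)
```

In this model the first round fetches only the seed page, and it takes three rounds to reach all four pages, which matches the behaviour described in the original question.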
--- AJ Chen <an...@sbcglobal.net> wrote:
> I'm testing nutch whole-web crawling with just one
> url in a text file.
> But, after generate/fetch/updatedb/index, there is
> only one document in
> the index. Questions:
> 1. What needs to be set in order to fetch all
> available web pages on one
> site?
> 2. Where is the log file that I can check what's
> going on?
> Thanks,
>
> -AJ