Posted to user@nutch.apache.org by "wuwuengr@gmail.com" <wu...@gmail.com> on 2008/10/08 06:01:30 UTC

Just to save webpages (Newbie question)

I am new to Nutch.

My goal is to extract content (local listings) from a certain website. I have
obtained the URLs of all the listings (only ~20K), and I have also written a
parser to pull out the content (like address and phone number). All I need now
is to download the pages at those URLs.

But when I used a download tool to batch-download the URLs, I very quickly
started getting 404 responses instead of the pages.

Is there a way I can do this in Nutch? What is the risk of being blocked
again? I just want the pages: no crawling, no indexing, just a plain fetch
that leaves the downloaded pages intact.

Re: Just to save webpages (Newbie question)

Posted by Winton Davies <wd...@cs.stanford.edu>.
Check their TOS. If you are trying to make a business out of 
specifically their data, then they will probably be hostile to it. 
They probably allow big search engines to do it because it gives 
them a quid pro quo in terms of referrals. If you are just trying to 
data-mine their site, they probably don't want it to happen.

Having said that, you might want to check their robots.txt and see if 
for some reason Nutch is hitting them too hard (i.e. isn't honoring 
their robots.txt). The other approach is to distribute the crawl 
among multiple machines and IP address ranges....
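
Before going that far, it is worth looking at the politeness settings in 
conf/nutch-site.xml. As a rough sketch (property names and defaults can 
differ between Nutch versions, and the agent name below is just a placeholder):

  <configuration>
    <!-- identify your crawler; the fetcher refuses to run if this is empty -->
    <property>
      <name>http.agent.name</name>
      <value>my-listings-fetcher</value>
    </property>
    <!-- seconds to wait between successive requests to the same host -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>
    <!-- keep to a single concurrent request per host -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
  </configuration>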

Winton

