Posted to user@nutch.apache.org by kh3rad <kh...@gmail.com> on 2012/05/14 11:46:35 UTC

Couldn't get robots.txt for site

Hi,

I want to crawl a website that denies access to all crawlers. The site is
one of the top sites in the Alexa ranking, and it is a news site. Below are my
logs from Hadoop. I set "Protocol.CHECK_ROBOTS" to false in my nutch-site.xml file.
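
In nutch-site.xml the override looks like this (assuming
protocol.plugin.check.robots is the right property key for
Protocol.CHECK_ROBOTS; I am not sure it is, or that the fetcher honors it):

<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>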

How can I solve this problem and crawl this site with Nutch?


2012-05-14 12:39:51,079 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://farsnews.com/
2012-05-14 12:39:56,615 INFO
org.apache.nutch.protocol.http.api.RobotRulesParser: Couldn't get robots.txt
for http://farsnews.com/: java.net.SocketTimeoutException: Read timed out
2012-05-14 12:40:01,873 INFO org.apache.nutch.fetcher.Fetcher: fetch of
http://farsnews.com/ failed with: java.net.SocketTimeoutException: Read
timed out

Thanks, 
kh3rad


Re: Couldn't get robots.txt for site

Posted by kh3rad <kh...@gmail.com>.
I found that the robots.txt file is unavailable on this site. How can I crawl
farsnews.com?
Thanks


Re: Couldn't get robots.txt for site

Posted by Ken Krugler <kk...@transpac.com>.
Hi Kh3rad,

If a site disallows crawling via robots.txt, then it is CRITICAL that you honor such a directive.

The only time you should ignore robots.txt is if you have explicit permission from the site owner to do so.

And even then, it's better if they edit robots.txt to explicitly allow your user agent.
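
For example, if a site owner wanted to let exactly one crawler in while keeping
everything else blocked, their robots.txt could look like this (the agent name
"my-crawler" is just a placeholder):

User-agent: my-crawler
Disallow:

User-agent: *
Disallow: /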

Having said that, farsnews.com has no robots.txt file. It appears they are explicitly checking for user agent strings that are not regular web browsers and intentionally causing these requests to time out (which is what you see below). The same thing happens if you try to use curl to access that top page.
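
If you want to see this for yourself without Nutch in the picture, the same
kind of check in plain Java looks like the sketch below (just a diagnostic;
the user-agent strings and the 10-second timeouts are arbitrary choices, not
anything Nutch itself sends):

import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

public class AgentProbe {
    // Fetch one URL with a given User-Agent and report the HTTP status,
    // or the read timeout if the server never answers.
    static void probe(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10000); // 10 s to connect
            conn.setReadTimeout(10000);    // 10 s to read the response
            System.out.println(userAgent + " -> HTTP " + conn.getResponseCode());
        } catch (SocketTimeoutException e) {
            System.out.println(userAgent + " -> timed out: " + e.getMessage());
        } catch (Exception e) {
            System.out.println(userAgent + " -> failed: " + e);
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://farsnews.com/";
        // A browser-like agent versus an obviously non-browser agent.
        probe(url, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36");
        probe(url, "my-test-crawler");
    }
}

If the first request comes back with a page and the second hangs until the
timeout, that matches the behavior described above.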

-- Ken

On May 14, 2012, at 2:46am, kh3rad wrote:

> Hi,
> 
> I want to crawl a website that denies access to all crawlers. The site is
> one of the top sites in the Alexa ranking, and it is a news site. Below are
> my logs from Hadoop. I set "Protocol.CHECK_ROBOTS" to false in my
> nutch-site.xml file.
> 
> How can I solve this problem and crawl this site with Nutch?
> 
> 
> 2012-05-14 12:39:51,079 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://farsnews.com/
> 2012-05-14 12:39:56,615 INFO
> org.apache.nutch.protocol.http.api.RobotRulesParser: Couldn't get robots.txt
> for http://farsnews.com/: java.net.SocketTimeoutException: Read timed out
> 2012-05-14 12:40:01,873 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://farsnews.com/ failed with: java.net.SocketTimeoutException: Read
> timed out
> 
> Thanks, 
> kh3rad
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr