Posted to user@nutch.apache.org by Nidhi malik <ni...@gmail.com> on 2008/01/03 19:38:34 UTC

hadoop file and nutch-407 error

While fetching, I get the message below. I have attached the
hadoop.log file.

Fetcher: starting
Fetcher: segment: crawl/segments/20080104002039
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 100000
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com;
digvijayy@it.iitb.ac.in)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url=
http://www.w3schools.com/
Fetcher: done

Re: hadoop file and nutch-407 error

Posted by Susam Pal <su...@gmail.com>.
Hi,

I have replied to this once already, and since you have provided no
additional information, my reply is going to remain almost the same.

Please send the following information:-

1. The Nutch version you are using. (NUTCH-559v0.5 was generated
against the trunk. If you are using Nutch-0.9, the patch might not
apply cleanly. You might have to check manually whether every change
in the patch was applied.)

2. How did the ant build go? Were there any errors, or did the build
complete with the following message:- BUILD SUCCESSFUL ?

3. It would also help if you sent the output of your patch command.

4. The relevant logs from 'log/hadoop.log' with DEBUG enabled for
protocol-httpclient.

To enable DEBUG for protocol-httpclient, please do the following:-

1. Open 'conf/log4j.properties'.

2. Add the following line and save the file:-
    log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

3. Delete log/hadoop.log, run a new crawl and send the 'log/hadoop.log' file.
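
If you would rather script step 2 than edit the file by hand, here is a
minimal Python sketch. The function name enable_httpclient_debug is
hypothetical and not part of Nutch; the only thing taken from the steps
above is the logger line itself.

```python
from pathlib import Path

# The logger line quoted in step 2 above.
DEBUG_LINE = "log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout"


def enable_httpclient_debug(properties_path):
    """Append DEBUG_LINE to a log4j properties file unless it is already there.

    Returns True if the file was modified, False if the line was present.
    (Hypothetical helper for illustration; not shipped with Nutch.)
    """
    path = Path(properties_path)
    text = path.read_text()
    if any(line.strip() == DEBUG_LINE for line in text.splitlines()):
        return False
    # Make sure the appended setting starts on a line of its own.
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + DEBUG_LINE + "\n")
    return True
```

Calling enable_httpclient_debug("conf/log4j.properties") from the Nutch
directory should add the line once; calling it again leaves the file
unchanged.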

Before sending, please make sure that the log file contains the DEBUG
lines. They look like this:-

2008-01-02 21:55:30,177 DEBUG httpclient.Http - url:
https://mail.yahoo.com/robots.txt; status code: 404; bytes received:
2337
2008-01-02 21:55:32,900 DEBUG httpclient.Http - url:
https://mail.yahoo.com/; status code: 200; bytes received: 26291

If DEBUG lines are missing, it means you have either not enabled DEBUG
properly or you have not successfully patched and built Nutch.
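
A quick way to check a log for these lines is a small script. The one
below is only a sketch: the pattern is modelled on the two sample lines
above, so other Nutch versions may format the lines differently, and
find_debug_entries is a hypothetical name, not a Nutch tool.

```python
import re

# Pattern modelled on the two sample DEBUG lines above (assumption: the
# real log uses the same "url: ...; status code: ...; bytes received: ..."
# layout). \s* also tolerates lines that a mail client has wrapped.
DEBUG_RE = re.compile(
    r"DEBUG\s+httpclient\.Http\s+-\s+url:\s*(?P<url>\S+);\s*"
    r"status code:\s*(?P<status>\d+);\s*"
    r"bytes received:\s*(?P<bytes>\d+)"
)


def find_debug_entries(log_text):
    """Return a list of (url, status_code, bytes_received) tuples."""
    return [
        (m.group("url"), int(m.group("status")), int(m.group("bytes")))
        for m in DEBUG_RE.finditer(log_text)
    ]
```

Pass it the contents of the log file as a string. An empty list means
no DEBUG lines from protocol-httpclient were found, which points to one
of the two causes mentioned above.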

Regards,
Susam Pal
