Posted to user@nutch.apache.org by Dean Pullen <de...@semantico.com> on 2012/02/02 17:44:11 UTC

Failed fetching

Hi all,

I'm trying to fetch from http://nutch.apache.org

But after fetching, parsing, and updating the DB I examine the DB for 
'http://nutch.apache.org/' (oddly I must include the last slash) and get:

/URL: http://nutch.apache.org/
Version: 7
Status: 1 (*db_unfetched*)
Fetch time: Fri Feb 03 16:33:13 GMT 2012
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 1
Retry interval: 2592000 seconds (30 days)
Score: 500.0
Signature: null
Metadata: _pst_: *failed*(2), lastModified=0/

Why is the fetch failing and how can I show more nutch logging so as to 
view the failure attempt/message?
Nothing is seen in my access logs when I try to crawl my own external site.

To ensure all URLs are permitted I've changed the regex-urlfilter.txt to:

# accept anything else
+.
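
For illustration only, here is a minimal Python sketch of how a rule file like the one above is usually read: each rule is a "+" or "-" followed by a regex, the first rule whose regex is found in the URL decides, and a URL that matches no rule is dropped. This is an assumption-laden model of regex-urlfilter.txt, not Nutch's own code; the names parse_rules and accepts are hypothetical, and Python's re is standing in for Java regexes.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is ignored

# The single rule from this message: "+." accepts any non-empty URL.
RULES = parse_rules("# accept anything else\n+.")
print(accepts(RULES, "http://nutch.apache.org/"))
```

Under these assumptions the "+." rule lets every URL through, so the URL filter is unlikely to be what blocks the fetch.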

This has been puzzling me all day, I'm hoping someone can help!

Dean.

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

I've downloaded the sources and compiled them myself. 

Both protocol-http and protocol-httpclient (with basic auth) are working
like a charm now.

Thx for the help!

T

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3765295.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by remi tassing <ta...@gmail.com>.

I just used protocol-http and it works!

It's probably a configuration issue. You can download a clean version and
start afresh

Remi

On Wed, Feb 15, 2012 at 3:46 AM, tiagorcs <da...@mitsue.co.jp>wrote:

> So do you suggest that I download Nutch from a different source? Or maybe
> reconfigure Cygwin? Or are there some configuration settings that I might
> have missed?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3745698.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

So do you suggest that I download Nutch from a different source? Or maybe
reconfigure Cygwin? Or are there some configuration settings that I might
have missed?

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3745698.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by Lewis John Mcgibbney <le...@gmail.com>.

This would be great Remi. It would be nice to confirm that protocol-http
does work with Nutch 1.4 in cygwin & windows.

On Tue, Feb 14, 2012 at 6:03 PM, remi tassing <ta...@gmail.com> wrote:

> I'm slowly migrating from Nutch-1.2 to 1.4 and it works with cygwin.
>
> I use protocol-httpclient but could try protocol-http if you want
>
> Remi
>
> On Friday, February 10, 2012, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
> > In all honesty this is strange. We can assure you that 1.4 DOES work for
> > protocol-http!
> >
> > Any cygwin users out there that can lend a hand?
> >
> > On Mon, Feb 6, 2012 at 4:37 AM, tiagorcs <da...@mitsue.co.jp>
> wrote:
> >
> >> Also, this is what I got inside my *plugins* folder
> >>
> >> creativecommons
> >> findsupporter-label
> >> index-more
> >> *lib-http*
> >> lib-xml
> >> parse-ext
> >> parse-swf
> >> feed
> >> index-anchor
> >> index-static
> >> lib-nekohtml
> >> microformats-reltag
> >> parse-html  parse-tika
> >> findsupporter-category
> >> index-basic
> >> language-identifier
> >> lib-regex-filter
> >> nutch-extensionpoints
> >> parse-js
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3718831.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*

Re: Failed fetching

Posted by remi tassing <ta...@gmail.com>.

I'm slowly migrating from Nutch-1.2 to 1.4 and it works with cygwin.

I use protocol-httpclient but could try protocol-http if you want

Remi

On Friday, February 10, 2012, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> In all honesty this is strange. We can assure you that 1.4 DOES work for
> protocol-http!
>
> Any cygwin users out there that can lend a hand?
>
> On Mon, Feb 6, 2012 at 4:37 AM, tiagorcs <da...@mitsue.co.jp>
wrote:
>
>> Also, this is what I got inside my *plugins* folder
>>
>> creativecommons
>> findsupporter-label
>> index-more
>> *lib-http*
>> lib-xml
>> parse-ext
>> parse-swf
>> feed
>> index-anchor
>> index-static
>> lib-nekohtml
>> microformats-reltag
>> parse-html  parse-tika
>> findsupporter-category
>> index-basic
>> language-identifier
>> lib-regex-filter
>> nutch-extensionpoints
>> parse-js
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3718831.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Lewis*
>

Re: Failed fetching

Posted by Lewis John Mcgibbney <le...@gmail.com>.

In all honesty this is strange. We can assure you that 1.4 DOES work for
protocol-http!

Any cygwin users out there that can lend a hand?

On Mon, Feb 6, 2012 at 4:37 AM, tiagorcs <da...@mitsue.co.jp> wrote:

> Also, this is what I got inside my *plugins* folder
>
> creativecommons
> findsupporter-label
> index-more
> *lib-http*
> lib-xml
> parse-ext
> parse-swf
> feed
> index-anchor
> index-static
> lib-nekohtml
> microformats-reltag
> parse-html  parse-tika
> findsupporter-category
> index-basic
> language-identifier
> lib-regex-filter
> nutch-extensionpoints
> parse-js
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3718831.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

Also, this is what I got inside my *plugins* folder

creativecommons         
findsupporter-label  
index-more           
*lib-http*   
lib-xml                
parse-ext   
parse-swf
feed                    
index-anchor         
index-static         
lib-nekohtml      
microformats-reltag    
parse-html  parse-tika
findsupporter-category  
index-basic          
language-identifier  
lib-regex-filter  
nutch-extensionpoints  
parse-js
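
A quick way to cross-check a listing like this against plugin.includes is sketched below in Python. It is only a sketch: it assumes that plugin.includes is a regular expression that must match a whole plugin directory name, it uses the nutch-default.xml value quoted elsewhere in this thread, and the helper names included and protocol_plugins are made up for the example.

```python
import re

# Directory names copied from the plugins folder listing above.
PLUGIN_DIRS = """
creativecommons findsupporter-label index-more lib-http lib-xml parse-ext
parse-swf feed index-anchor index-static lib-nekohtml microformats-reltag
parse-html parse-tika findsupporter-category index-basic language-identifier
lib-regex-filter nutch-extensionpoints parse-js
""".split()

# plugin.includes value as quoted from nutch-default.xml in this thread.
INCLUDES = ("protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
            "|scoring-opic|urlnormalizer-(pass|regex|basic)")

def included(dirs, includes):
    """Installed directories whose full name matches the includes regex."""
    pattern = re.compile(includes)
    return [d for d in dirs if pattern.fullmatch(d)]

def protocol_plugins(dirs):
    """Installed directories that look like protocol plugins."""
    return [d for d in dirs if d.startswith("protocol-")]

print("matched by plugin.includes:", included(PLUGIN_DIRS, INCLUDES))
print("protocol plugins installed:", protocol_plugins(PLUGIN_DIRS))
```

With the listing above, the second list comes out empty: there is a lib-http directory but no protocol-* directory at all, which would be consistent with the "protocol not found for url=http" error regardless of what plugin.includes says.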

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3718831.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

I downloaded Nutch 1.4 once more, and I got all of the command options this
time. I am, however, receiving the same error.

This is what I get with indexchecker for my URL and for another one outside
my intranet (Cygwin + Windows 7; the same setup works with Nutch 1.3):

$ ./nutch org.apache.nutch.indexer.IndexingFiltersChecker http://testsite.mydomain.co.jp/
fetching: http://testsite.mydomain.co.jp/
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:63)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:112)

$ ./nutch org.apache.nutch.indexer.IndexingFiltersChecker http://www.inf.ufrgs.br/~oliveira/
fetching: http://www.inf.ufrgs.br/~oliveira/
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:63)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:112)

I have already removed the *plugin.includes* property from my nutch-site.xml,
and I am using its definition from nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description> ...
  </description>
</property>


Any ideas? Because I have run out of...

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3718761.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by Dean Pullen <de...@semantico.com>.

Thanks for the reply - I'm using 1.4

The problem was, as previously described, that nutch-site.xml didn't have
protocol-http in the plugin.includes value - I had presumed the file was
copied from the 1.4 nutch-default.xml, but it was in fact left over from an
older version. Adding protocol-http fixed it!

As usual, I struggle all day to find the answer, post to a newsgroup,
and then solve it myself five minutes later....!
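
For anyone who wants to check their own file the same way, here is a rough Python sketch. It assumes that plugin.includes is treated as a regex that must match the whole plugin id, which is a simplification rather than Nutch's actual loading code; the function names plugin_includes and enables are hypothetical, and the XML is the old nutch-site.xml value from this thread wrapped in a <configuration> root.

```python
import re
import xml.etree.ElementTree as ET

def plugin_includes(xml_text):
    """Return the plugin.includes value from a Hadoop-style config, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None

def enables(includes, plugin_id):
    """True if the includes regex matches the whole plugin id (assumed semantics)."""
    return re.fullmatch(includes, plugin_id) is not None

# The left-over value from the older nutch-site.xml quoted in this thread.
OLD_SITE_XML = """<configuration>
<property>
<name>plugin.includes</name>
<value>
protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
</value>
</property>
</configuration>"""

old = plugin_includes(OLD_SITE_XML)
fixed = "protocol-http|" + old  # the change described above

print("old value enables protocol-http:  ", enables(old, "protocol-http"))
print("fixed value enables protocol-http:", enables(fixed, "protocol-http"))
```

The old value names protocol-httpclient only, so under a whole-name match it does not cover protocol-http; prepending protocol-http does.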


Dean.



On 02/02/2012 18:01, Lewis John Mcgibbney wrote:
> Looks like you're using an old version of Nutch here.
>
> Please try upgrading to 1.4 Dean
>
> hth
>
> On Thu, Feb 2, 2012 at 5:22 PM, Dean Pullen<de...@semantico.com>wrote:
>
>> What I see in logs/userlogs/myfetchjobxx/syslog is:
>>
>> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
>> http://nutch.apache.org/ failed with: org.apache.nutch.protocol.ProtocolNotFound:
>> protocol not found for url=http
>>
>> I look at the nutch-site.xml file and see:
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>
>> protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
>> </value>
>> </property>
>>
>> Do we have to manually add the protocol-http to it?! Surely this should be
>> there by default?
>>
>> Dean.
>>
>>
>> On 02/02/2012 17:11, Dean Pullen wrote:
>>
>>> I've added:
>>>
>>> <property>
>>> <name>http.verbose</name>
>>> <value>true</value>
>>> <description>If true, HTTP will log more verbosely.</description>
>>> </property>
>>> <property>
>>> <name>fetcher.verbose</name>
>>> <value>true</value>
>>> <description>If true, fetcher will log more verbosely.</description>
>>> </property>
>>>
>>>
>>> To the nutch-site.xml in an attempt for more info....
>>>
>>> On 02/02/2012 16:44, Dean Pullen wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to fetch from http://nutch.apache.org
>>>>
>>>> But after fetching, parsing, and updating the DB I examine the DB for '
>>>> http://nutch.apache.org/' (oddly I must include the last slash) and get:
>>>>
>>>> /URL: http://nutch.apache.org/
>>>> Version: 7
>>>> Status: 1 (*db_unfetched*)
>>>> Fetch time: Fri Feb 03 16:33:13 GMT 2012
>>>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>>>> Retries since fetch: 1
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 500.0
>>>> Signature: null
>>>> Metadata: _pst_: *failed*(2), lastModified=0/
>>>>
>>>> Why is the fetch failing and how can I show more nutch logging so as to
>>>> view the failure attempt/message?
>>>> Nothing is seen in my access logs when I try to crawl my own external
>>>> site.
>>>>
>>>> To ensure all URLs are permitted I've changed the regex-urlfilter.txt to:
>>>>
>>>> /# accept anything else
>>>> +./
>>>>
>>>> This has been puzzling me all day, I'm hoping someone can help!
>>>>
>>>> Dean.
>>>>
>>>>
>

Re: Failed fetching

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Looks like you're using an old version of Nutch here.

Please try upgrading to 1.4 Dean

hth

On Thu, Feb 2, 2012 at 5:22 PM, Dean Pullen <de...@semantico.com>wrote:

> What I see in logs/userlogs/myfetchjobxx/syslog is:
>
> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://nutch.apache.org/ failed with: org.apache.nutch.protocol.ProtocolNotFound:
> protocol not found for url=http
>
> I look at the nutch-site.xml file and see:
>
> <property>
> <name>plugin.includes</name>
> <value>
> protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
> </value>
> </property>
>
> Do we have to manually add the protocol-http to it?! Surely this should be
> there by default?
>
> Dean.
>
>
> On 02/02/2012 17:11, Dean Pullen wrote:
>
>> I've added:
>>
>> <property>
>> <name>http.verbose</name>
>> <value>true</value>
>> <description>If true, HTTP will log more verbosely.</description>
>> </property>
>> <property>
>> <name>fetcher.verbose</name>
>> <value>true</value>
>> <description>If true, fetcher will log more verbosely.</description>
>> </property>
>>
>>
>> To the nutch-site.xml in an attempt for more info....
>>
>> On 02/02/2012 16:44, Dean Pullen wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to fetch from http://nutch.apache.org
>>>
>>> But after fetching, parsing, and updating the DB I examine the DB for '
>>> http://nutch.apache.org/' (oddly I must include the last slash) and get:
>>>
>>> /URL: http://nutch.apache.org/
>>> Version: 7
>>> Status: 1 (*db_unfetched*)
>>> Fetch time: Fri Feb 03 16:33:13 GMT 2012
>>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>>> Retries since fetch: 1
>>> Retry interval: 2592000 seconds (30 days)
>>> Score: 500.0
>>> Signature: null
>>> Metadata: _pst_: *failed*(2), lastModified=0/
>>>
>>> Why is the fetch failing and how can I show more nutch logging so as to
>>> view the failure attempt/message?
>>> Nothing is seen in my access logs when I try to crawl my own external
>>> site.
>>>
>>> To ensure all URLs are permitted I've changed the regex-urlfilter.txt to:
>>>
>>> /# accept anything else
>>> +./
>>>
>>> This has been puzzling me all day, I'm hoping someone can help!
>>>
>>> Dean.
>>>
>>>
>>
>


-- 
*Lewis*

Re: Failed fetching

Posted by Markus Jelsma <ma...@openindex.io>.

You said you were using Nutch 1.4. Perhaps you forgot to update the
bin/nutch script?

> I don't have that either. Is there a different package which contains all
> these classes? I've sent in my last post the commands I have available.
> Should I send it again?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712707.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

I don't have that either. Is there a different package which contains all
these classes? I've sent in my last post the commands I have available.
Should I send it again?

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712707.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by Markus Jelsma <ma...@openindex.io>.

Then try parsechecker.

> My URL is
> 
> http://testsite.my.domain.which.I.cannot.reveal.co.jp
> 
> and it works fine with Nutch 1.3 (Cygwin + Windows 7 and Redhat Linux)
> 
> my bin/nutch does not seem to have the *indexchecker* command (and class
> *org.apache.nutch.indexer.IndexingFiltersChecker* is not found)... Here is
> the list of commands I have available
> 
> 
> 
> Did I miss something?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712692.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

My URL is

http://testsite.my.domain.which.I.cannot.reveal.co.jp

and it works fine with Nutch 1.3 (Cygwin + Windows 7 and Redhat Linux)

my bin/nutch does not seem to have the *indexchecker* command (and class
*org.apache.nutch.indexer.IndexingFiltersChecker* is not found)... Here is
the list of commands I have available



Did I miss something?

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712692.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by Markus Jelsma <ma...@openindex.io>.

Ah, it's visible at nabble.

Anyway, something is likely wrong with your URL. The plugin.includes seems 
fine. Please try using the indexchecker tool with your URL.

> I'd posted it with my previous message. Sending again then.
> 
> Nutch crawl output (where http://xxx.xxx.xxx/ is an intranet URL)
> 
> 
> 
> My nutch-site.xml (where aaa.bbb.ccc.ddd is my IP number)
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712620.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

I'd posted it with my previous message. Sending again then.

Nutch crawl output (where http://xxx.xxx.xxx/ is an intranet URL)



My nutch-site.xml (where aaa.bbb.ccc.ddd is my IP number)



--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712620.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by Lewis John Mcgibbney <le...@gmail.com>.

There are no log files attached.

On Fri, Feb 3, 2012 at 10:06 AM, tiagorcs <da...@mitsue.co.jp>wrote:

> Forgot to mention I am using Nutch 1.4, and that I have no problems with
> the
> exact same setup for Nutch 1.3.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712590.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

Forgot to mention I am using Nutch 1.4, and that I have no problems with the
exact same setup for Nutch 1.3.

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712590.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by tiagorcs <da...@mitsue.co.jp>.

Hi

Actually, regardless of whether I use *protocol-http* or
*protocol-httpclient*, I am getting the same error.

These are my log files:

Nutch crawl output (where *http://xxx.xxx.xxx/* is an intranet URL)



My nutch-site.xml (where *aaa.bbb.ccc.ddd* is my IP number)




Any ideas?

Thx in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3712586.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Failed fetching

Posted by Markus Jelsma <ma...@openindex.io>.

It is the default, but you override it in nutch-site.xml. Use protocol-http
if you can and stay away from protocol-httpclient.

> What I see in logs/userlogs/myfetchjobxx/syslog is:
> 
> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://nutch.apache.org/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
> 
> I look at the nutch-site.xml file and see:
> 
> <property>
> <name>plugin.includes</name>
> <value>
> 
> protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
> </value>
> </property>
> 
> Do we have to manually add the protocol-http to it?! Surely this should
> be there by default?
> 
> Dean.
> 
> On 02/02/2012 17:11, Dean Pullen wrote:
> > I've added:
> > 
> > <property>
> > <name>http.verbose</name>
> > <value>true</value>
> > <description>If true, HTTP will log more verbosely.</description>
> > </property>
> > <property>
> > <name>fetcher.verbose</name>
> > <value>true</value>
> > <description>If true, fetcher will log more verbosely.</description>
> > </property>
> > 
> > 
> > To the nutch-site.xml in an attempt for more info....
> > 
> > On 02/02/2012 16:44, Dean Pullen wrote:
> >> Hi all,
> >> 
> >> I'm trying to fetch from http://nutch.apache.org
> >> 
> >> But after fetching, parsing, and updating the DB I examine the DB for
> >> 'http://nutch.apache.org/' (oddly I must include the last slash) and
> >> get:
> >> 
> >> /URL: http://nutch.apache.org/
> >> Version: 7
> >> Status: 1 (*db_unfetched*)
> >> Fetch time: Fri Feb 03 16:33:13 GMT 2012
> >> Modified time: Thu Jan 01 01:00:00 GMT 1970
> >> Retries since fetch: 1
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 500.0
> >> Signature: null
> >> Metadata: _pst_: *failed*(2), lastModified=0/
> >> 
> >> Why is the fetch failing and how can I show more nutch logging so as
> >> to view the failure attempt/message?
> >> Nothing is seen in my access logs when I try to crawl my own external
> >> site.
> >> 
> >> To ensure all URLs are permitted I've changed the regex-urlfilter.txt
> >> to:
> >> 
> >> /# accept anything else
> >> +./
> >> 
> >> This has been puzzling me all day, I'm hoping someone can help!
> >> 
> >> Dean.

Re: Failed fetching

Posted by Dean Pullen <de...@semantico.com>.

What I see in logs/userlogs/myfetchjobxx/syslog is:

2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of 
http://nutch.apache.org/ failed with: 
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
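
In case it helps anyone else searching their job logs, below is a small Python sketch that pulls the URL, exception class, and message out of lines shaped like the one above. The regex is written against this single sample line only (joined back onto one line, since the mail client wrapped it), so treat the pattern and the name fetch_failures as assumptions rather than a description of every Fetcher log format.

```python
import re

# Matches "... fetch of <url> failed with: <exception class>: <message>".
FAILED = re.compile(
    r"fetch of (?P<url>\S+) failed with: (?P<error>[\w.$]+)"
    r"(?::\s*(?P<detail>.*))?"
)

def fetch_failures(lines):
    """Yield (url, exception class, message) for each failed-fetch log line."""
    for line in lines:
        match = FAILED.search(line)
        if match:
            detail = (match.group("detail") or "").strip()
            yield match.group("url"), match.group("error"), detail

# The log line from this message, unwrapped.
SAMPLE = ("2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: "
          "fetch of http://nutch.apache.org/ failed with: "
          "org.apache.nutch.protocol.ProtocolNotFound: "
          "protocol not found for url=http")

for url, error, detail in fetch_failures([SAMPLE]):
    print(url, error, detail, sep=" | ")
```

To scan a real file, pass an open file object instead of the one-element list; lines without "failed with:" are skipped.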

I look at the nutch-site.xml file and see:

<property>
<name>plugin.includes</name>
<value>
             
protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
</value>
</property>

Do we have to manually add the protocol-http to it?! Surely this should 
be there by default?

Dean.

On 02/02/2012 17:11, Dean Pullen wrote:
> I've added:
>
> <property>
> <name>http.verbose</name>
> <value>true</value>
> <description>If true, HTTP will log more verbosely.</description>
> </property>
> <property>
> <name>fetcher.verbose</name>
> <value>true</value>
> <description>If true, fetcher will log more verbosely.</description>
> </property>
>
>
> To the nutch-site.xml in an attempt for more info....
>
> On 02/02/2012 16:44, Dean Pullen wrote:
>> Hi all,
>>
>> I'm trying to fetch from http://nutch.apache.org
>>
>> But after fetching, parsing, and updating the DB I examine the DB for 
>> 'http://nutch.apache.org/' (oddly I must include the last slash) and 
>> get:
>>
>> /URL: http://nutch.apache.org/
>> Version: 7
>> Status: 1 (*db_unfetched*)
>> Fetch time: Fri Feb 03 16:33:13 GMT 2012
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 1
>> Retry interval: 2592000 seconds (30 days)
>> Score: 500.0
>> Signature: null
>> Metadata: _pst_: *failed*(2), lastModified=0/
>>
>> Why is the fetch failing and how can I show more nutch logging so as 
>> to view the failure attempt/message?
>> Nothing is seen in my access logs when I try to crawl my own external 
>> site.
>>
>> To ensure all URLs are permitted I've changed the regex-urlfilter.txt 
>> to:
>>
>> /# accept anything else
>> +./
>>
>> This has been puzzling me all day, I'm hoping someone can help!
>>
>> Dean.
>>
>

Re: Failed fetching

Posted by Dean Pullen <de...@semantico.com>.

I've added:

<property>
<name>http.verbose</name>
<value>true</value>
<description>If true, HTTP will log more verbosely.</description>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
<description>If true, fetcher will log more verbosely.</description>
</property>


to the nutch-site.xml in an attempt to get more info....

On 02/02/2012 16:44, Dean Pullen wrote:
> Hi all,
>
> I'm trying to fetch from http://nutch.apache.org
>
> But after fetching, parsing, and updating the DB I examine the DB for 
> 'http://nutch.apache.org/' (oddly I must include the last slash) and get:
>
> /URL: http://nutch.apache.org/
> Version: 7
> Status: 1 (*db_unfetched*)
> Fetch time: Fri Feb 03 16:33:13 GMT 2012
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 1
> Retry interval: 2592000 seconds (30 days)
> Score: 500.0
> Signature: null
> Metadata: _pst_: *failed*(2), lastModified=0/
>
> Why is the fetch failing and how can I show more nutch logging so as 
> to view the failure attempt/message?
> Nothing is seen in my access logs when I try to crawl my own external 
> site.
>
> To ensure all URLs are permitted I've changed the regex-urlfilter.txt to:
>
> /# accept anything else
> +./
>
> This has been puzzling me all day, I'm hoping someone can help!
>
> Dean.
>