Posted to user@nutch.apache.org by Dean Pullen <de...@semantico.com> on 2012/02/02 17:44:11 UTC
Failed fetching
Hi all,
I'm trying to fetch from http://nutch.apache.org
But after fetching, parsing, and updating the DB I examine the DB for
'http://nutch.apache.org/' (oddly I must include the last slash) and get:
URL: http://nutch.apache.org/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Feb 03 16:33:13 GMT 2012
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 1
Retry interval: 2592000 seconds (30 days)
Score: 500.0
Signature: null
Metadata: _pst_: failed(2), lastModified=0
Why is the fetch failing and how can I show more nutch logging so as to
view the failure attempt/message?
Nothing is seen in my access logs when I try to crawl my own external site.
To ensure all URLs are permitted I've changed the regex-urlfilter.txt to:
# accept anything else
+.
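A minimal sketch of how a rule file like this one is evaluated, under the assumption (not stated in this thread) that regex-urlfilter.txt rules are tried top to bottom, that the first pattern found anywhere in the URL decides, that '+' accepts and '-' rejects, and that a URL matching no rule is rejected. The helpers parse_rules and accepts are hypothetical names for illustration, not Nutch APIs.

```python
import re

# The two-line filter from the message above.
RULES = """# accept anything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    print(accepts(parse_rules(RULES), "http://nutch.apache.org/"))
```

If these assumptions hold, the single "+." rule lets every non-empty URL through, so the URL filter is unlikely to be what blocks the fetch here.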
This has been puzzling me all day, I'm hoping someone can help!
Dean.
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
I've downloaded the sources and compiled them myself.
Both protocol-http and protocol-httpclient (with basic auth) are working
like a charm now.
Thx for the help!
T
--
View this message in context: http://lucene.472066.n3.nabble.com/Failed-fetching-tp3710422p3765295.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Failed fetching
Posted by remi tassing <ta...@gmail.com>.
I just used protocol-http and it works!
It's probably a configuration issue. You can download a clean version and
start afresh
Remi
On Wed, Feb 15, 2012 at 3:46 AM, tiagorcs <da...@mitsue.co.jp> wrote:
> So do you suggest that I download Nutch from a different source? Or maybe
> reconfigure Cygwin? Or are there some configuration settings that I might
> have missed?
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
So do you suggest that I download Nutch from a different source? Or maybe
reconfigure Cygwin? Or are there some configuration settings that I might
have missed?
Re: Failed fetching
Posted by Lewis John Mcgibbney <le...@gmail.com>.
This would be great Remi. It would be nice to confirm that protocol-http
does work with Nutch 1.4 in cygwin & windows.
On Tue, Feb 14, 2012 at 6:03 PM, remi tassing <ta...@gmail.com> wrote:
> I'm slowly migrating from Nutch-1.2 to 1.4 and it works with cygwin.
>
> I use protocol-httpclient but could try protocol-http if you want
>
> Remi
--
*Lewis*
Re: Failed fetching
Posted by remi tassing <ta...@gmail.com>.
I'm slowly migrating from Nutch-1.2 to 1.4 and it works with cygwin.
I use protocol-httpclient but could try protocol-http if you want.
Remi
On Friday, February 10, 2012, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> In all honesty this is strange. We can assure you that 1.4 DOES work for
> protocol-http!
>
> Any cygwin users out there that can lend a hand?
Re: Failed fetching
Posted by Lewis John Mcgibbney <le...@gmail.com>.
In all honesty this is strange. We can assure you that 1.4 DOES work for
protocol-http!
Any cygwin users out there that can lend a hand?
On Mon, Feb 6, 2012 at 4:37 AM, tiagorcs <da...@mitsue.co.jp> wrote:
> Also, this is what I got inside my *plugins* folder
>
> creativecommons
> findsupporter-label
> index-more
> *lib-http*
> lib-xml
> parse-ext
> parse-swf
> feed
> index-anchor
> index-static
> lib-nekohtml
> microformats-reltag
> parse-html
> parse-tika
> findsupporter-category
> index-basic
> language-identifier
> lib-regex-filter
> nutch-extensionpoints
> parse-js
--
*Lewis*
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
Also, this is what I got inside my *plugins* folder
creativecommons
findsupporter-label
index-more
*lib-http*
lib-xml
parse-ext
parse-swf
feed
index-anchor
index-static
lib-nekohtml
microformats-reltag
parse-html
parse-tika
findsupporter-category
index-basic
language-identifier
lib-regex-filter
nutch-extensionpoints
parse-js
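One thing that can be checked against this listing: it contains lib-http but no directory whose name starts with "protocol-". The sketch below makes that check explicit. It assumes, as a simplification, that plugin directory names equal plugin ids and that plugin.includes must match a whole id; selected and protocol_plugins are hypothetical helper names, and nothing here calls Nutch itself.

```python
import re

# Directory names copied from the plugins listing above.
INSTALLED = [
    "creativecommons", "findsupporter-label", "index-more", "lib-http",
    "lib-xml", "parse-ext", "parse-swf", "feed", "index-anchor",
    "index-static", "lib-nekohtml", "microformats-reltag", "parse-html",
    "parse-tika", "findsupporter-category", "index-basic",
    "language-identifier", "lib-regex-filter", "nutch-extensionpoints",
    "parse-js",
]

# plugin.includes value from nutch-default.xml as quoted in this thread.
DEFAULT_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|"
    "scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def selected(installed, includes):
    """Installed plugins whose full name matches the includes pattern."""
    return [name for name in installed if re.fullmatch(includes, name)]


def protocol_plugins(installed):
    """Installed plugins that look like protocol implementations."""
    return [name for name in installed if name.startswith("protocol-")]


if __name__ == "__main__":
    print("selected:", selected(INSTALLED, DEFAULT_INCLUDES))
    print("protocol plugins installed:", protocol_plugins(INSTALLED))
```

Under these assumptions the pattern asks for protocol-http, but the folder offers no protocol plugin at all, which would be consistent with "protocol not found for url=http". This is only an observation about the listing, not a diagnosis confirmed in the thread.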
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
I downloaded Nutch 1.4 once more, and I got all of the command options this
time. I am, however, receiving the same error.
This is what I get with indexchecker for my URL and for another one outside
my intranet (Cygwin + Windows 7; works with Nutch 1.3):
$ ./nutch org.apache.nutch.indexer.IndexingFiltersChecker http://testsite.mydomain.co.jp/
fetching: http://testsite.mydomain.co.jp/
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
    at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:63)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:112)
$ ./nutch org.apache.nutch.indexer.IndexingFiltersChecker http://www.inf.ufrgs.br/~oliveira/
fetching: http://www.inf.ufrgs.br/~oliveira/
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
    at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:63)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:112)
I have already removed the *plugin.includes* property from my nutch-site.xml, and I
am using its definition from nutch-default.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description> ...
  </description>
</property>
Any ideas? Because I have run out of...
Re: Failed fetching
Posted by Dean Pullen <de...@semantico.com>.
Thanks for the reply - I'm using 1.4
The problem was, as previously described, that the nutch-site.xml didn't have
protocol-http in its plugin.includes value. I had presumed this file was
copied from the 1.4 nutch-default.xml, but it was in fact left over from an
older version. Adding protocol-http fixed it!
As usual, I struggle all day to find the answer, post to a newsgroup,
and then solve it myself five minutes later....!
Dean.
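A minimal sketch of why the stale value did not select protocol-http, assuming (this is not confirmed in the thread) that plugin.includes is treated as a single regular expression that must match a plugin id in full. Python's re.fullmatch stands in for that matching; is_included is a hypothetical helper, and FIXED_INCLUDES is only one plausible reading of "adding protocol-http".

```python
import re

# Value found in the stale nutch-site.xml, copied from this thread.
OLD_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|"
    "index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|"
    "summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|"
    "url-query-normalizer"
)

# Assumption: the fix added protocol-http as one more alternative.
FIXED_INCLUDES = "protocol-http|" + OLD_INCLUDES


def is_included(includes, plugin_id):
    """Hypothetical helper: True if the whole plugin id matches the pattern."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for label, pattern in (("old", OLD_INCLUDES), ("fixed", FIXED_INCLUDES)):
        print(label, "selects protocol-http:", is_included(pattern, "protocol-http"))
```

With whole-id matching, "protocol-httpclient" in the pattern does not also cover "protocol-http", so the plugin has to be named explicitly.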
On 02/02/2012 18:01, Lewis John Mcgibbney wrote:
> Looks like you're using an old version of Nutch here.
>
> Please try upgrading to 1.4 Dean
>
> hth
Re: Failed fetching
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Looks like you're using an old version of Nutch here.
Please try upgrading to 1.4 Dean
hth
On Thu, Feb 2, 2012 at 5:22 PM, Dean Pullen <de...@semantico.com> wrote:
> What I see in logs/userlogs/myfetchjobxx/syslog is:
>
> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://nutch.apache.org/ failed with: org.apache.nutch.protocol.ProtocolNotFound:
> protocol not found for url=http
>
> I look at the nutch-site.xml file and see:
>
> <property>
> <name>plugin.includes</name>
> <value>
> protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
> </value>
> </property>
>
> Do we have to manually add the protocol-http to it?! Surely this should be
> there by default?
>
> Dean.
--
*Lewis*
Re: Failed fetching
Posted by Markus Jelsma <ma...@openindex.io>.
You said you were using Nutch 1.4. Forgot to update the bin/nutch script
perhaps?
> I don't have that either. Is there a different package which contains all
> these classes? I've sent in my last post the commands I have available.
> Should I send it again?
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
I don't have that either. Is there a different package which contains all
these classes? I've sent in my last post the commands I have available.
Should I send it again?
Re: Failed fetching
Posted by Markus Jelsma <ma...@openindex.io>.
Then try parsechecker.
> My URL is
>
> http://testsite.my.domain.which.I.cannot.reveal.co.jp
>
> and it works fine with Nutch 1.3 (Cygwin + Windows 7 and Redhat Linux)
>
> my bin/nutch does not seem to have the *indexchecker* command (and class
> *org.apache.nutch.indexer.IndexingFiltersChecker* is not found)... Here
> is the list of commands I have available
>
>
>
> Did I miss something?
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
My URL is
http://testsite.my.domain.which.I.cannot.reveal.co.jp
and it works fine with Nutch 1.3 (Cygwin + Windows 7 and Redhat Linux)
my bin/nutch does not seem to have the *indexchecker* command (and class
*org.apache.nutch.indexer.IndexingFiltersChecker* is not found)... Here
is the list of commands I have available
Did I miss something?
Re: Failed fetching
Posted by Markus Jelsma <ma...@openindex.io>.
Ah, it's visible at nabble.
Anyway, something is likely wrong with your URL. The plugin.includes seems
fine. Please try using the indexchecker tool with your URL.
> I'd posted it with my previous message. Sending again then.
>
> Nutch crawl output (where http://xxx.xxx.xxx/ is an intranet URL)
>
>
>
> My nutch-site.xml (where aaa.bbb.ccc.ddd is my IP number)
>
>
>
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
I'd posted it with my previous message. Sending again then.
Nutch crawl output (where http://xxx.xxx.xxx/ is an intranet URL)
My nutch-site.xml (where aaa.bbb.ccc.ddd is my IP number)
Re: Failed fetching
Posted by Lewis John Mcgibbney <le...@gmail.com>.
There are no log files attached.
On Fri, Feb 3, 2012 at 10:06 AM, tiagorcs <da...@mitsue.co.jp> wrote:
> Forgot to mention I am using Nutch 1.4, and that I have no problems with the
> exact same setup for Nutch 1.3.
--
*Lewis*
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
Forgot to mention I am using Nutch 1.4, and that I have no problems with the
exact same setup for Nutch 1.3.
Re: Failed fetching
Posted by tiagorcs <da...@mitsue.co.jp>.
Hi,
Actually, regardless of whether I use *protocol-http* or *protocol-httpclient*, I
am getting the same error.
These are my log files:
Nutch crawl output (where *http://xxx.xxx.xxx/* is an intranet URL)
My nutch-site.xml (where *aaa.bbb.ccc.ddd* is my IP number)
Any ideas?
Thx in advance
Re: Failed fetching
Posted by Markus Jelsma <ma...@openindex.io>.
It is the default, but you override it in nutch-site.xml. Use protocol-http if you can
and stay away from protocol-httpclient.
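A minimal sketch of the override behaviour described here, assuming that a property set in nutch-site.xml simply replaces the property of the same name from nutch-default.xml. The two XML strings are trimmed stand-ins built from values quoted in this thread, and read_properties and effective_config are hypothetical helpers, not part of Nutch or Hadoop.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml (value as quoted in this thread).
DEFAULT_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>"""

# Stand-in for the stale nutch-site.xml (value as quoted in this thread).
SITE_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>
      protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
    </value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Collect <property> name/value pairs, stripping surrounding whitespace."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def effective_config(default_xml, site_xml):
    """Site values win over defaults, property by property."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


if __name__ == "__main__":
    print(effective_config(DEFAULT_XML, SITE_XML)["plugin.includes"])
```

Printing the effective value this way shows at a glance whether the default list or a leftover site-level list is the one in force.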
> What I see in logs/userlogs/myfetchjobxx/syslog is:
>
> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://nutch.apache.org/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
>
> I look at the nutch-site.xml file and see:
>
> <property>
> <name>plugin.includes</name>
> <value>
> protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
> </value>
> </property>
>
> Do we have to manually add the protocol-http to it?! Surely this should
> be there by default?
>
> Dean.
Re: Failed fetching
Posted by Dean Pullen <de...@semantico.com>.
What I see in logs/userlogs/myfetchjobxx/syslog is:
2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
http://nutch.apache.org/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
I look at the nutch-site.xml file and see:
<property>
  <name>plugin.includes</name>
  <value>
    protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
  </value>
</property>
Do we have to manually add the protocol-http to it?! Surely this should
be there by default?
Dean.
On 02/02/2012 17:11, Dean Pullen wrote:
> I've added:
>
> <property>
> <name>http.verbose</name>
> <value>true</value>
> <description>If true, HTTP will log more verbosely.</description>
> </property>
> <property>
> <name>fetcher.verbose</name>
> <value>true</value>
> <description>If true, fetcher will log more verbosely.</description>
> </property>
>
>
> To the nutch-site.xml in an attempt for more info....
Re: Failed fetching
Posted by Dean Pullen <de...@semantico.com>.
I've added:
<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
to the nutch-site.xml in an attempt to get more info....