Posted to user@nutch.apache.org by Jan Philippe Wimmer <in...@jepse.net> on 2012/12/17 13:24:43 UTC

Re: shouldFetch rejected

Hi again.

I still have this issue. I start with a completely new crawl directory 
structure and get the following error:

-shouldFetch rejected 'http://www.lequipe.fr/Football/', 
fetchTime=1359626286623, curTime=1355738313780

Full log:
crawl started in: /opt/project/current/crawl_project/nutch/crawl/1300
rootUrlDir = /opt/project/current/crawl_project/nutch/urls/url_1300
threads = 20
depth = 3
solrUrl=http://192.168.1.144:8983/solr/
topN = 400
Injector: starting at 2012-12-17 10:57:36
Injector: crawlDb: 
/opt/project/current/crawl_project/nutch/crawl/1300/crawldb
Injector: urlDir: /opt/project/current/crawl_project/nutch/urls/url_1300
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-12-17 10:57:51, elapsed: 00:00:14
Generator: starting at 2012-12-17 10:57:51
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 400
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: 
/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
Generator: finished at 2012-12-17 10:58:06, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 
'http.robots.agents' property.
Fetcher: starting at 2012-12-17 10:58:06
Fetcher: segment: 
/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
Using queue mode : byHost
Fetcher: threads: 20
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.lequipe.fr/Football/
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-12-17 10:58:13, elapsed: 00:00:07
ParseSegment: starting at 2012-12-17 10:58:13
ParseSegment: segment: 
/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
ParseSegment: finished at 2012-12-17 10:58:20, elapsed: 00:00:07
CrawlDb update: starting at 2012-12-17 10:58:20
CrawlDb update: db: 
/opt/project/current/crawl_project/nutch/crawl/1300/crawldb
CrawlDb update: segments: 
[/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-12-17 10:58:33, elapsed: 00:00:13
Generator: starting at 2012-12-17 10:58:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 400
Generator: jobtracker is 'local', generating exactly one partition.
-shouldFetch rejected 'http://www.lequipe.fr/Football/', 
fetchTime=1359626286623, curTime=1355738313780
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-12-17 10:58:40
LinkDb: linkdb: /opt/project/current/crawl_project/nutch/crawl/1300/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: 
file:/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
LinkDb: finished at 2012-12-17 10:58:47, elapsed: 00:00:07
SolrIndexer: starting at 2012-12-17 10:58:47
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
SolrIndexer: finished at 2012-12-17 10:59:09, elapsed: 00:00:22
SolrDeleteDuplicates: starting at 2012-12-17 10:59:09
SolrDeleteDuplicates: Solr url: http://192.168.1.144:8983/solr/
SolrDeleteDuplicates: finished at 2012-12-17 10:59:47, elapsed: 00:00:37

On 25.11.2012 21:02, Sebastian Nagel wrote:
>> But, I create a completely new crawl dir for every crawl.
> Then all should work as expected.
>
>> why the crawler sets a "page to fetch" to rejected. Because obviously
>> the crawler never saw this page before (because I deleted all the old crawl dirs).
>> In the crawl log I see many pages to fetch, but at the end all of them are rejected
> Are you sure they aren't fetched at all? This debug log output in the Generator mapper
> is also shown for URLs fetched in previous cycles. You should check the complete
> log for the "rejected" URLs.
>
>
> On 11/24/2012 04:46 PM, Jan Philippe Wimmer wrote:
>> Hey Sebastian! Thanks for your answer.
>>
>> But, I create a completely new crawl dir for every crawl. In other words, I just have the crawl data of
>> the current, running crawl process. When I recrawl a URL set, I delete the old crawl dir and create a
>> new one. At the end of each crawl I index it to Solr. So I keep all crawled content in the index. I
>> don't need any Nutch crawl dirs, because I want to crawl all relevant pages in every crawl process,
>> again and again.
>>
>> I totally don't understand why the crawler sets a "page to fetch" to rejected, because obviously
>> the crawler never saw this page before (because I deleted all the old crawl dirs). In the crawl log
>> I see many pages to fetch, but at the end all of them are rejected. Any ideas?
>>
>> On 24.11.2012 16:36, Sebastian Nagel wrote:
>>>> I want my crawler to crawl the complete page without setting up schedulers at all. Every crawl
>>>> process should crawl every page again without having to set up wait intervals.
>>> That's quite easy: remove all data and launch the crawl again.
>>> - Nutch 1.x : remove crawldb, segments, and linkdb
>>> - 2.x : drop 'webpage' (or similar, depends on the chosen data store)
>>>
>>> On 11/24/2012 12:17 PM, Jan Philippe Wimmer wrote:
>>>> Hi there,
>>>>
>>>> How can I avoid the following error:
>>>> -shouldFetch rejected 'http://www.page.com/shop', fetchTime=1356347311285, curTime=1353755337755
>>>>
>>>> I want my crawler to crawl the complete page without setting up schedulers at all. Every crawl
>>>> process should crawl every page again without having to set up wait intervals.
>>>>
>>>> Any solutions?
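
For reference, Sebastian's Nutch 1.x option quoted above amounts to roughly the following; the directory below is the one from the log in this message and only serves as an example:

    CRAWL_DIR=/opt/project/current/crawl_project/nutch/crawl/1300
    rm -r "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" "$CRAWL_DIR/linkdb"

With an empty crawldb, every injected URL gets a fetch time of "now", so nothing is rejected by shouldFetch in the first generate cycle of the new crawl.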


RE: shouldFetch rejected

Posted by Markus Jelsma <ma...@openindex.io>.
You're doing nothing wrong; it's just a debug entry. curTime is just the current time, and fetchTime is the time in the future after which the record must be fetched again. The fetch time is controlled by your fetch scheduler; see the API docs for AbstractFetchSchedule. 

I assume http://www.lequipe.fr/Football is already fetched. Check if this is true using the readdb tool.
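
For example, using the crawldb path from the log above (adjust to your own layout), something like:

    bin/nutch readdb /opt/project/current/crawl_project/nutch/crawl/1300/crawldb \
        -url http://www.lequipe.fr/Football/

If that prints a status of db_fetched and a fetch time in the future, the URL was already fetched (here most likely in the first depth of the same crawl) and the generator is simply skipping it until that fetch time is reached.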
 

Re: shouldFetch rejected

Posted by Jan Philippe Wimmer <in...@jepse.net>.
Ahh, but I crawl other URLs with the same settings and there it works. 
What am I doing wrong? What is the correct setting? Which setting is 
responsible for fetchTime being ahead of curTime?
On 17.12.2012 13:40, Markus Jelsma wrote:
> Hi - curTime does not exceed fetchTime, thus the record is not eligible for fetch.


RE: shouldFetch rejected

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - curTime does not exceed fetchTime, thus the record is not eligible for fetch.
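
The size of that gap is set by the fetch schedule when a page is fetched: the class from db.fetch.schedule.class (DefaultFetchSchedule out of the box) pushes fetchTime ahead by db.fetch.interval.default seconds. As a minimal sketch, assuming stock Nutch 1.x property names, shortening that interval in conf/nutch-site.xml makes records become due again sooner:

    <property>
      <name>db.fetch.interval.default</name>
      <!-- re-fetch interval in seconds; nutch-default.xml ships with 2592000 (30 days) -->
      <value>86400</value>
    </property>

Note that even with a fresh crawl directory per run, a URL fetched at depth 1 will be rejected by shouldFetch at later depths of the same crawl, because its fetchTime has already been moved into the future.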
 
 