Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2014/09/08 23:09:20 UTC

Nutch not crawling deep enough into directory structure

Hi Guys

Reposting this since I think it got lost in the tail end of the last post.

I have a website serving a series of documents (PDFs) and am using Nutch
1.8 to index them in Solr.  The base URL is http://localhost/ and the
documents are stored in a series of directories under
http://localhost/doccontrol/, e.g.

/
|_doccontrol
    |_DC-10 Incoming Correspondence
    |_DC-11 Outgoing Correspondence

If the folders DC-10 and DC-11 already contain all the files to be indexed
when I first run Nutch, then Nutch crawls everything without a problem -
GOOD :-)

If I add a new folder or new documents to the root or doccontrol folder,
then the next time Nutch runs it crawls all the new files and indexes them -
GOOD :-)

However, any new files added to the DC-10 or DC-11 directories are not
indexed; Nutch's output is as follows (summarised):

Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
.
.
.
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index ->
http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication


Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

BAD - :-(

What I'd like Nutch to do is index any newly added documents, at whatever
level they were added.

My nutch command is as follows:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4
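
(For reference - and please correct me if I have misread the stock Nutch 1.8
crawl script - the arguments are:

bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

so the trailing 4 is the number of generate/fetch/parse/update rounds per
invocation; the script exits early when the generator selects no URLs, which
is what the output above shows.)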

My nutch-site.xml contains:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
 </property>
 <property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <description>Whether existing records in the CrawlDB will be overwritten
  by injected records.
  </description>
 </property>
 <property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
 </property>

 <property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
 </property>
 <property>
  <name>db.fetch.interval.default</name>
  <value>1209600</value>
  <description>The default number of seconds between re-fetches of a page
  (14 days).
  </description>
 </property>
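
(One thing I suspect may matter, though I am not certain, is the fetch
schedule above: db.fetch.interval.default and the adaptive schedule control
when Nutch re-fetches the directory listing pages, and new files in DC-10 or
DC-11 can presumably only be discovered when those listing pages are
re-fetched.  If it helps, I believe a page's status and next fetch time can
be checked with something like (the URL here is just an illustration):

bin/nutch readdb crawl/crawldb -url "http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence/"

which should print that entry's fetch status and fetch time from the crawl db.)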

Is what I am trying to do (recrawl any newly added documents at any level)
impossible?

Or (more likely) am I missing something in the config?

Can anyone point me in the right direction?

Many thanks

P

Re: Nutch not crawling deep enough into directory structure

Posted by Paul Rogers <pa...@gmail.com>.
Hi Chris

Many thanks for the response.  Sorry it's taken a few days to test things
here.

I have added the properties you suggested:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

They were indeed set to the defaults.

I have also updated my command to:

bin/crawl urls crawl http://localhost:8983/solr/collection1 16

I have checked the max.outlinks and it is set as follows:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
 </property>

which I believe should be OK.

After each change I have deleted the crawl database, run my initial crawl
(everything is added no matter what depth), then added some additional
documents and re-run the crawl.

As before, no change unfortunately.
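
(I assume the contents of the crawl db can be inspected after each run with
something like:

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump

where "crawldb-dump" is just an example output directory; the dump should show
whether the new documents under DC-10/DC-11 ever make it into the db at all.)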

If the files are added at the root or doccontrol level, everything is
added/crawled (I can even add a new directory at this level and the files
within it are crawled/added).  But any new documents added into the other
folders (DC-10 Incoming Correspondence or DC-11 Outgoing Correspondence)
are ignored.  Whenever I run the crawl/add it always stops at the second
pass (irrespective of the final parameter on the command line), i.e. it
always finishes with:

Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

Would the seed.txt or regex-url.txt affect things in this manner?

seed.txt entry: http://ws0895/doccontrol/

regex-url.txt entry: http://([a-z0-9]*\.)*ws0895/
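
(If that file is the standard conf/regex-urlfilter.txt, then as far as I know
each rule needs a leading + or - to accept or reject URLs, e.g.

+http://([a-z0-9]*\.)*ws0895/

so the entry above may simply have lost its + when I pasted it into this mail.)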

Any further suggestions?

Many thanks


P

On 8 September 2014 16:15, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi Paul,
>
> Try expanding your last parameter (which is the # of crawling rounds).
>
> Also make sure to check these properties:
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> The first can be set to false so that Nutch actually processes inlinks
> from the same host, and the second to true so that Nutch ignores external
> links (if necessary).
>
> Also check your max outlinks per page property.
>
> HTH,
> Chris
>

RE: Nutch not crawling deep enough into directory structure

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Paul,

Try expanding your last parameter (which is the # of crawling rounds).
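
For example, keeping your existing command line but with more rounds:

bin/crawl urls crawl http://localhost:8983/solr/collection1 16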

Also make sure to check these properties:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

The first can be set to false so that Nutch actually processes inlinks
from the same host, and the second to true so that Nutch ignores external
links (if necessary).

Also check your max outlinks per page property.

HTH,
Chris
