Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2014/08/29 22:39:57 UTC

Re: New documents still not being added by nutch

Hi Guys

I'm still struggling with this.  In summary my directory structure is as
follows

/
|_doccontrol
    |_DC-10 Incoming Correspondence
    |_DC-11 Outgoing Correspondence

If the DC-10 and DC-11 folders already contain all the files to be indexed
when I first run Nutch, then Nutch crawls everything without a problem - GOOD :-)

If I add a new folder or documents to the root or the doccontrol folder, then
the next time Nutch runs it crawls all the new files and indexes them -
GOOD :-)

However, any new files added to DC-10 or DC-11 are not indexed. Nutch's
output is as follows (summarised):

Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
.
.
.
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index ->
http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication


Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

BAD - :-(

What I'd like Nutch to do is index any newly added docs, whatever level they
were added at.
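
If it helps anyone spot the problem, this is roughly how I check what the
CrawlDb currently thinks about the directory listing (the paths are the ones
from the output above, and I'm going from my reading of the readdb options,
so apologies if I've got them slightly wrong):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -url http://ws0895/doccontrol/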

My nutch command is as follows:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4
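
(As far as I understand the 1.8 crawl script, the arguments are seed dir,
crawl dir, Solr URL and number of rounds, i.e.

bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

so the command above reads seeds from ./urls, keeps the crawl data under
./crawl, indexes into collection1 and runs up to 4 generate/fetch rounds.)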

My nutch-site.xml now contains:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
 </property>
 <property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <description>Whether existing records in the CrawlDB will be overwritten
  by injected records.
  </description>
 </property>
 <property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule
  simply adds the original fetchInterval to the last fetch time, regardless
  of page changes.</description>
 </property>

 <property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
 </property>
 <property>
  <name>db.fetch.interval.default</name>
  <value>1209600</value>
  <description>The default number of seconds between re-fetches of a page
  (14 days).
  </description>
 </property>
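
Based on Sebastian's suggestions (quoted below), my next attempt will be
something like the following. This is only a sketch: I'm assuming the seed
file is urls/seed.txt and guessing at the URL encoding of the folder names.

# seed the directory listings themselves so that, with
# db.injector.overwrite=true, injection resets them to unfetched and they
# get re-fetched straight away, picking up the new files inside them
echo "http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence/" >> urls/seed.txt
echo "http://ws0895/doccontrol/DC-11%20Outgoing%20Correspondence/" >> urls/seed.txt

# the crawl script injects the urls dir on every run, so re-running it
# should pick the new seeds up and then fetch/parse/index as usual
bin/crawl urls crawl http://localhost:8983/solr/collection1 4

But that's a manual step for every folder that changes, which brings me to
the question: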

Is what I am trying to do (recrawl any newly added documents at any level)
impossible?

Or (more likely) am I still missing something?

Many thanks

P
On 22 August 2014 13:06, Sebastian Nagel <wa...@googlemail.com> wrote:

> Hi Paul,
>
> > Not sure why nutch is not adding new URL's.  Is it because
> > http://localhost/doccontrol is not the "root" and will only be scanned
> > again in 30 days time?
>
> Every document, even a seed (including "root"), is re-crawled after 30 days
> by default.
>
> > I thought the db.update.additions.allowed fixed this but am I missing
> > something?
>
> That will not help: db.update.additions.allowed is true by default, and if
> it were false no new documents would be added via links at all.
>
> You could either:
>
> - inject the new directory URLs: they are not known yet,
>   consequently, they'll get fetched (if not filtered away)
>
> - force a re-fetch of all URLs containing links to the new directories:
>   i.e. re-fetch "root". Have a look at the property "db.injector.overwrite".
>   The injected CrawlDatum with status unfetched will replace the old one
>   and the URL will get fetched immediately.
>
> Sebastian
>
> On 08/21/2014 10:38 PM, Paul Rogers wrote:
> > Hi All
> >
> > I have a web site serving a series of documents (pdf's) and am using
> > Nutch 1.8 to index them in solr.  The base url is http://localhost/ and the
> > documents are stored in a series of directories in the directory
> > http://localhost/doccontrol/.  To start with, while I was experimenting,
> > this directory contained a single directory
> > (http://localhost/doccontrol/DC-10 Incoming Correspondence) containing
> > approximately 2500 pdf documents.
> >  Nutch successfully crawled and indexed this directory and all the files
> > contained in it.
> >
> > I have now added two further directories to doccontrol
> > (http://localhost/doccontrol/DC-11 Outgoing Correspondence and
> > http://localhost/doccontrol/DC-16 MEETINGS MINUTES).  Each has about
> > 2500 pdf documents in it.
> >
> > However when I run Nutch no further documents are added to the index and
> > nutch gives the following output.
> >
> > bin/crawl urls crawl http://localhost:8983/solr/collection1 4
> >
> > Injector: starting at 2014-08-21 14:06:25
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: total number of urls rejected by filters: 0
> > Injector: total number of urls injected after normalization and filtering: 1
> > Injector: Merging injected urls into crawl db.
> > Injector: overwrite: false
> > Injector: update: false
> > Injector: finished at 2014-08-21 14:06:28, elapsed: 00:00:02
> > Thu Aug 21 14:06:28 EST 2014 : Iteration 1 of 4
> > Generating a new segment
> > Generator: starting at 2014-08-21 14:06:29
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: topN: 50000
> > Generator: 0 records selected for fetching, exiting ...
> >
> > I have the following in my nutch-site.xml
> >
> >  <property>
> >   <name>db.update.additions.allowed</name>
> >   <value>true</value>
> >   <description>If true, updatedb will add newly discovered URLs, if false
> >   only already existing URLs in the CrawlDb will be updated and no new
> >   URLs will be added.
> >   </description>
> >  </property>
> >  <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>-1</value>
> >   <description>The maximum number of outlinks that we'll process for a page.
> >   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
> >   will be processed for a page; otherwise, all outlinks will be processed.
> >   </description>
> >  </property>
> >
> > Not sure why nutch is not adding new URL's.  Is it because
> > http://localhost/doccontrol is not the "root" and will only be scanned
> > again in 30 days time?
> >
> > I thought the db.update.additions.allowed fixed this but am I missing
> > something?
> >
> > Why are the new directories and folders not being added?  Can anyone
> > point me in the right direction?
> >
> > Many thanks
> >
> > Paul
> >
>
>