Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2014/08/21 22:38:26 UTC

New documents not being added by nutch

Hi All

I have a web site serving a series of documents (PDFs) and am using Nutch
1.8 to index them in Solr.  The base URL is http://localhost/ and the
documents are stored in a series of directories under
http://localhost/doccontrol/.  To start with, while I was experimenting, this
directory contained a single subdirectory (http://localhost/doccontrol/DC-10
Incoming Correspondence) containing approximately 2500 PDF documents.
Nutch successfully crawled and indexed this directory and all the files
contained in it.

I have now added two further directories to doccontrol (
http://localhost/doccontrol/DC-11 Outgoing Correspondence and
http://localhost/doccontrol/DC-16 MEETINGS MINUTES).  Each has about 2500
PDF documents in it.

However, when I run Nutch, no further documents are added to the index, and
Nutch gives the following output.

bin/crawl urls crawl http://localhost:8983/solr/collection1 4

Injector: starting at 2014-08-21 14:06:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-08-21 14:06:28, elapsed: 00:00:02
Thu Aug 21 14:06:28 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-21 14:06:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

I have the following in my nutch-site.xml

 <property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
 </property>

I'm not sure why Nutch is not adding new URLs.  Is it because
http://localhost/doccontrol is not the "root" and will only be scanned
again in 30 days' time?

I thought db.update.additions.allowed fixed this, but am I missing
something?

Why are the new directories and documents not being added?  Can anyone point
me in the right direction?

Many thanks

Paul

Re: New documents not being added by nutch

Posted by Paul Rogers <pa...@gmail.com>.
Hi Sebastian

A couple of questions, if I may.

The directories are updated regularly (at 22:00 each day), with 20 or so
documents added to each one each time.

Option 1 (injecting the new directory URLs) won't help for the new documents
added later.  The URLs for these will only get injected when the parent
directory is recrawled (sorry if I'm mixing terms).  Is my understanding
correct?

Do changes to nutch-site.xml get picked up at the next crawl?  The reason I
ask is that when I updated db.max.outlinks.per.page from 100 to -1, Nutch
still returned 0 new documents until I deleted the crawldb.

Updating db.injector.overwrite has now caused the new directories to be
crawled.  The description for this option states "Whether existing records
in the CrawlDB will be overwritten by injected records."  Does this mean
that, when combined with db.update.additions.allowed = true, any new links
will be added and crawled, or that all the existing links will always be
recrawled?
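
In case it helps, this is roughly the relevant part of my nutch-site.xml as it
stands now (just a sketch of my current setup, so do say if this isn't what you
had in mind):

 <property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <!-- injected records overwrite existing CrawlDb entries -->
 </property>
 <property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <!-- updatedb may add newly discovered URLs to the CrawlDb -->
 </property>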

Do you know of a resource I might have missed that explains how all this
fits together?  Or is it trial and error and the help of you guys?

many thanks

P




On 22 August 2014 13:06, Sebastian Nagel <wa...@googlemail.com> wrote:

> Hi Paul,
>
> > I'm not sure why Nutch is not adding new URLs.  Is it because
> > http://localhost/doccontrol is not the "root" and will only be scanned
> > again in 30 days' time?
>
> Every document, even the seeds (including the "root"), is re-crawled after
> 30 days by default.
>
> > I thought db.update.additions.allowed fixed this, but am I missing
> > something?
>
> That will not help. db.update.additions.allowed should be true anyway; if it
> is false, no new documents will be discovered via links.
>
> You could either:
>
> - inject the new directory URLs: they are not known yet,
>   consequently, they'll get fetched (if not filtered away)
>
> - force a re-fetch of all URLs containing links to the new directories:
>   i.e. re-fetch "root". Have a look at property "db.injector.overwrite."
>   The injected CrawlDatum with status unfetched will replace the old one
>   and the URL will get fetched immediately.
>
> Sebastian
>

Re: New documents not being added by nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Paul,

> I'm not sure why Nutch is not adding new URLs.  Is it because
> http://localhost/doccontrol is not the "root" and will only be scanned
> again in 30 days' time?

Every document, even the seeds (including the "root"), is re-crawled after
30 days by default.
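
That interval is controlled by db.fetch.interval.default (2592000 seconds by
default, i.e. 30 days).  If you want the directory listings re-fetched sooner,
you could lower it in nutch-site.xml; a sketch (86400 = one day, pick whatever
suits you, and note it applies to every URL, not only the listings):

 <property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <!-- re-fetch pages after one day instead of the default 30 days -->
 </property>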

> I thought db.update.additions.allowed fixed this, but am I missing
> something?

That will not help. db.update.additions.allowed should be true anyway; if it
is false, no new documents will be discovered via links.
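
By the way, you can check what actually made it into your CrawlDb (just a
suggestion, using the crawldb path from your log) with:

bin/nutch readdb crawl/crawldb -stats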

You could either:

- inject the new directory URLs: they are not known yet,
  consequently, they'll get fetched (if not filtered away); see the
  example below

- force a re-fetch of all URLs containing links to the new directories:
  i.e. re-fetch "root". Have a look at property "db.injector.overwrite."
  The injected CrawlDatum with status unfetched will replace the old one
  and the URL will get fetched immediately.
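
For the first option, something like this should do it (a sketch only: I'm
assuming your seed directory is urls/ with a file urls/seed.txt in it, your
crawldb lives in crawl/crawldb, and the spaces in the directory names are
URL-encoded):

echo "http://localhost/doccontrol/DC-11%20Outgoing%20Correspondence/" >> urls/seed.txt
echo "http://localhost/doccontrol/DC-16%20MEETINGS%20MINUTES/" >> urls/seed.txt
bin/nutch inject crawl/crawldb urls

The next generate/fetch/updatedb cycle (or simply the next bin/crawl run)
should then pick the new directories up.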

Sebastian
