Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2014/08/18 22:03:51 UTC

Nutch not crawling all documents in a directory

Hi All

I'm having problems with Nutch not crawling all the documents in a
directory:

The directory in question can be found at:

http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/

There are 2460 documents (PDFs) in the directory.  Nutch enters the
directory, indexes the first 100 or so documents and then completes its
crawl.  The command issued is:

HOST=localhost
PORT=8983
CORE=collection1
cd /opt/nutch
bin/crawl urls crawl http://localhost:8983/solr/collection1 4
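
For reference, the seed setup is assumed to be a single file under urls/
containing just the directory URL (an assumption, but consistent with the
injector reporting exactly one injected URL in the output below):

# assumed seed file: one line with the directory URL
echo "http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/" > urls/seed.txt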

Any attempt to recrawl the directory gives the following output:

Injector: starting at 2014-08-18 14:58:26
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-18 14:58:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
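
The "0 records selected for fetching" line means the URLs already in the
CrawlDb (the directory listing plus the ~100 PDFs picked up on the first
pass) have all been fetched recently and are not yet due for refetch with
the default fetch interval, so each iteration exits immediately. One way
to see what the CrawlDb actually holds is the readdb tool (a sketch,
assuming the crawl directory from the command above):

cd /opt/nutch
# overall CrawlDb statistics: number of URLs per status (fetched, unfetched, ...)
bin/nutch readdb crawl/crawldb -stats
# the individual record for the directory listing URL
bin/nutch readdb crawl/crawldb -url "http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/"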

I have the following in conf/nutch-site.xml

 <property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
 </property>

I think this must be a config issue but am unsure where to look next.

Can anyone point me in the right direction?

Thanks

P

Re: Nutch not crawling all documents in a directory

Posted by Paul Rogers <pa...@gmail.com>.
Hey Sebastian

Thank you so much!!  You're a star.

P


On 19 August 2014 12:39, Sebastian Nagel <wa...@googlemail.com> wrote:

> Hi Paul,
>
> documents in a directory are first just links.
> There is a limit on the max. number of links per page.
> You may guess: the default is 100 :)
> Increase it, or even set it to -1, see below.
>
> Cheers,
> Sebastian
>
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a
> page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>

Re: Nutch not crawling all documents in a directory

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Paul,

documents in a directory are first just links.
There is a limit on the max. number of links per page.
You may guess: the default is 100 :)
Increase it, or even set it to -1, see below.

Cheers,
Sebastian


<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
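
In practice that means overriding the property in conf/nutch-site.xml, for
example with -1 for no limit (a sketch; note that the directory listing has
to be fetched and parsed again with the new limit in effect, e.g. by
starting over with a fresh crawl directory, before the remaining outlinks
can enter the CrawlDb):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Process all outlinks found on a page (no per-page limit).
  </description>
</property>

-1 is reasonable here because the listing is a single known page; on an
open web crawl the limit is usually kept finite to bound the effect of
link-heavy or spammy pages.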

