Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2014/08/18 22:03:51 UTC
Nutch not crawling all documents in a directory
Hi All
I'm having problems with Nutch not crawling all the documents in a
directory:
The directory in question can be found at:
http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/
There are 2460 documents (PDFs) in the directory. Nutch enters the
directory, indexes the first 100 or so documents, and then completes its
crawl. The command issued is:
HOST=localhost
PORT=8983
CORE=collection1
cd /opt/nutch
bin/crawl urls crawl http://localhost:8983/solr/collection1 4
Any attempt to recrawl the directory gives the following output:
Injector: starting at 2014-08-18 14:58:26
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-18 14:58:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
I have the following in conf/nutch-site.xml
<property>
<name>db.update.additions.allowed</name>
<value>true</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
I think this must be a config issue but am unsure where to look next.
Can anyone point me in the right direction?
Thanks
P
Re: Nutch not crawling all documents in a directory
Posted by Paul Rogers <pa...@gmail.com>.
Hey Sebastian
Thank you so much!! You're a star.
P
Re: Nutch not crawling all documents in a directory
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Paul,
The documents in a directory are first seen only as links on the directory
listing page, and there is a limit on the maximum number of outlinks
processed per page. You may guess the default: 100 :)
Increase it, or even set it to -1 (no limit), see below.
Cheers,
Sebastian
<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
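A minimal override along these lines would go into conf/nutch-site.xml next to
the properties you already have there (this is just a sketch of the suggestion
above; -1 removes the limit):
<!-- sketch: override for conf/nutch-site.xml, -1 = no outlink limit -->
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>Process all outlinks found on a page (no limit), so every
document linked from the directory listing can be added to the CrawlDb.
</description>
</property>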