Posted to user@nutch.apache.org by shano <Sh...@gmail.com> on 2012/03/27 13:02:58 UTC
Nutch limiting crawl to 100 documents per directory
I've configured a nutch file system crawl on the server.
My directory has thousands of htm files.
But no matter how I run nutch, it only crawls 100 documents per
directory, which is no use since I have just 1 directory with many
files.
I've looked everywhere for the setting that's telling it only to crawl 100
docs but can't find it.
The command I'm running is
bin/nutch crawl urls -solr http://localhost:8085/solr/ -dir crawl -threads 5 -depth 2
Here's a snippet from the logs.
QueueFeeder always specifies 100 records.
And fetchQueues.totalSize starts at 99 working down to 0.
I have already successfully crawled & indexed a local web server. And the
100 local files that are being crawled are being indexed & are searchable in
solr. So I'm stumped with this limitation of 100.
Any help would be appreciated.
--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-limiting-crawl-to-100-documents-per-directory-tp3861006p3861006.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch limiting crawl to 100 documents per directory
Posted by shano <Sh...@gmail.com>.
Thank you Ken!
db.max.outlinks.per.page was the property I needed to change.
Once I changed it from the default of 100, the cap was lifted.
It's flying through the docs now.
Many thanks.
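
For readers who hit the same cap: the likely explanation is that in a file system crawl the directory listing is parsed like a page, with each file as an outlink, so the default of 100 outlinks per page limits each directory to 100 documents. A minimal sketch of the override, assuming a Nutch 1.x layout where local settings go in conf/nutch-site.xml. The value -1 is an illustrative choice, not necessarily what was used in this thread; per the property's description in nutch-default.xml, a negative value means all outlinks are processed.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default of 100 from conf/nutch-default.xml.
       A negative value removes the per-page outlink limit. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
```

A large positive value should also work if you would rather keep some upper bound. Re-run the crawl after the change so the directory listing is parsed again with the new limit.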
Re: Nutch limiting crawl to 100 documents per directory
Posted by Ken Krugler <kk...@transpac.com>.
It's been many, many years since I've used Nutch to crawl a file system, so I don't know if db.max.outlinks.per.page comes into play.
But it's set to 100 by default.
-- Ken
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
Re: Nutch limiting crawl to 100 documents per directory
Posted by Elisabeth Adler <el...@gmail.com>.
Hi,
In your command you specify -depth 2. Try changing this to a higher value...
Best,
Elisabeth