Posted to user@nutch.apache.org by shano <Sh...@gmail.com> on 2012/03/27 13:02:58 UTC

Nutch limiting crawl to 100 documents per directory

I've configured a Nutch file system crawl on the server.
My directory has thousands of .htm files.

But no matter how I run Nutch, it only crawls 100 documents per
directory, which is no use since I have a single directory with many
files.

I've looked everywhere for the setting that limits the crawl to 100
docs but can't find it.

The command I'm running is:
bin/nutch crawl urls -solr http://localhost:8085/solr/ -dir crawl -threads 5 -depth 2

Here's a snippet from the logs. 
QueueFeeder always specifies 100 records. 
And fetchQueues.totalSize starts at 99 working down to 0. 



I have already successfully crawled and indexed a local web server, and the
100 local files that do get crawled are indexed and searchable in
Solr. So I'm stumped by this limit of 100.

Any help would be appreciated. 



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-limiting-crawl-to-100-documents-per-directory-tp3861006p3861006.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch limiting crawl to 100 documents per directory

Posted by shano <Sh...@gmail.com>.

Thank you Ken! 
db.max.outlinks.per.page was the property I needed to change.

Once I changed it from 100, the cap was lifted.

It's flying through the docs now. 
Many thanks. 
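
For anyone landing on this thread later, a minimal sketch of what that override might look like. The thread does not say which value was used or which file was edited; conf/nutch-site.xml and the value -1 are assumptions here. As far as I know, the property description in nutch-default.xml says a negative value means all outlinks are processed, but check that against your Nutch version. The likely reason the cap showed up per directory is that a directory listing is treated as one page, with each file in it counted as an outlink.

```xml
<!-- conf/nutch-site.xml (assumed location): overrides conf/nutch-default.xml -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Default is 100. -1 is an assumed choice meaning "no limit";
         a large positive number should also lift the cap in practice. -->
    <value>-1</value>
  </property>
</configuration>
```

The crawl command itself stays the same; the property only needs to be in place before the next crawl runs.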




Re: Nutch limiting crawl to 100 documents per directory

Posted by Ken Krugler <kk...@transpac.com>.

It's been many, many years since I've used Nutch to crawl a file system, so I don't know if db.max.outlinks.per.page comes into play.

But it's set to 100 by default.

-- Ken


--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: Nutch limiting crawl to 100 documents per directory

Posted by Elisabeth Adler <el...@gmail.com>.

Hi,
In your command you specify -depth 2. Try changing this to a higher value...
Best,
Elisabeth
