Posted to user@nutch.apache.org by Rafael Pappert <rp...@fwpsystems.com> on 2011/11/16 14:54:43 UTC

Crawler fetches only a few pages at each run

Hello List,

I am trying to set up a crawler for ~10K URLs and their subpages (just the internal ones).
I set topN to 10000 (nutch crawl urls -dir crawl -depth 1 -topN 10000 -threads 10),
but the fetch job only fetches 2 * generate.max.count pages per run.
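
For reference, here is the step-by-step equivalent I expect the crawl
command to run (just a sketch; the paths follow from the -dir option
above, and I pick the newest segment by hand):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
SEGMENT=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEGMENT -threads 10
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT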

The Hadoop map task list looks like this:

task_201111161348_0005_m_000000	100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 1.0 (1) pages/s, 524.0 (65536) kbits/s, 
16-Nov-2011 13:53:16
16-Nov-2011 13:53:22 (6sec)

task_201111161348_0005_m_000001	100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s, 92.0 (34552) kbits/s, 
16-Nov-2011 13:53:19
16-Nov-2011 13:53:28 (9sec)

task_201111161348_0005_m_000002	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 13:53:22
16-Nov-2011 13:53:28 (6sec)

task_201111161348_0005_m_000003	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 13:53:25
16-Nov-2011 13:53:31 (6sec)

task_201111161348_0005_m_000004	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 13:53:28
16-Nov-2011 13:53:34 (6sec)

task_201111161348_0005_m_000005	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 13:53:31
16-Nov-2011 13:53:37 (6sec)

task_201111161348_0005_m_000006	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 13:53:34
16-Nov-2011 13:53:40 (6sec)

task_201111161348_0005_m_000007	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 13:53:37
16-Nov-2011 13:53:43 (6sec)

But "readdb -stats" is like that, after a few runs:

11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 1 (db_unfetched):	14653
11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 2 (db_fetched):	17
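
(That output comes from something like "bin/nutch readdb crawl/crawldb -stats";
the crawldb path follows from the -dir crawl option above.)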

Server:

One node running Hadoop 0.20.203.0 (later I will add more nodes) and Nutch 1.4.

Config files:

// nutch-site.xml

<property>
  <name>http.accept.language</name>
  <value>de-de,de,en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting a non-English language as the default one to retrieve.
  It is a useful setting for search engines built for a certain national group.
  </description>
</property>

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>linkbutler|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  are ignored, limiting the crawl to the initially injected hosts.</description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generate.max.count.
  Default value is 'host' but it can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>

<property>
  <name>generate.max.count</name>
  <value>1</value>
  <description>The maximum number of URLs in a single
  fetchlist. -1 if unlimited. The URLs are counted according
  to the value of the parameter generate.count.mode.
  </description>
</property>

// mapred-site.xml

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of map tasks that will be run simultaneously by a task tracker. </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker. </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>8</value>
  <description>The default number of map tasks per job.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <description>The default number of reduce tasks per job.</description>
</property>

What's wrong with my configuration? Please correct me if I'm wrong, but I assume
topN = 10k and depth = 1 means Nutch should fetch 10k pages in one run?

Thanks in advance,
Rafael



Re: Crawler fetches only a few pages at each run

Posted by Rafael Pappert <rp...@fwpsystems.com>.
Hi,

I changed generate.max.count to -1, but the result is nearly the same.
Now the fetch task fetches 500 URLs, but there are still 6 map tasks
with 0 queues.

task_201111161504_0005_m_000002	100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s, 92.0 (34540) kbits/s, 
16-Nov-2011 15:08:24
16-Nov-2011 15:08:33 (9sec)

task_201111161504_0005_m_000003	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 15:08:27
16-Nov-2011 15:08:33 (6sec)

task_201111161504_0005_m_000004	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 15:08:30
16-Nov-2011 15:08:36 (6sec)

task_201111161504_0005_m_000005	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 15:08:33
16-Nov-2011 15:08:39 (6sec)

task_201111161504_0005_m_000006	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 15:08:36
16-Nov-2011 15:08:42 (6sec)

task_201111161504_0005_m_000007	100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s, 
16-Nov-2011 15:08:39
16-Nov-2011 15:08:45 (6sec)

task_201111161504_0005_m_000000	42.83%
10 threads, 1 queues, 500 URLs queued, 38 pages, 0 errors, 0.0 (0) pages/s, 96.0 (0) kbits/s, 
16-Nov-2011 15:08:18

task_201111161504_0005_m_000001	98.43%
10 threads, 1 queues, 500 URLs queued, 32 pages, 0 errors, 0.0 (0) pages/s, 76.0 (0) kbits/s, 
16-Nov-2011 15:08:21

Why does the generator create only 2 fetch lists, and why does it pick the same 2 hosts
again and again? After 20 runs, I have 2000 pages fetched, but only from 2 different hosts.
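
For reference, this is how I generate and inspect a segment by hand
(a sketch, with the same paths as above; -numFetchers should control
the number of fetch lists, if I understand the Generator correctly):

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -numFetchers 8
bin/nutch readseg -list `ls -d crawl/segments/* | tail -1`
bin/nutch readdb crawl/crawldb -stats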

Best regards,
Rafael

On 16 Nov 2011, at 15:01, Markus Jelsma wrote:

>  <name>generate.max.count</name>
>  <value>1</value>
> 
> I think this is the problem. Please increase it: with your count mode set
> to host, each generate cycle will contain only 1 URL per host.
> 
> Set it to -1 or a higher value.


Re: Crawler fetches only a few pages at each run

Posted by Markus Jelsma <ma...@openindex.io>.
  <name>generate.max.count</name>
  <value>1</value>

I think this is the problem. Please increase it: with your count mode set
to host, each generate cycle will contain only 1 URL per host.

Set it to -1 or a higher value.
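
For example (just a sketch; -1 removes the per-host cap entirely, while
a value like 100 would cap each host's share of a fetchlist):

<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>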


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350