Posted to user@nutch.apache.org by imehesz <im...@gmail.com> on 2013/02/01 15:17:40 UTC

Why is my Nutch-crawling so slow?

Hello,

I'm trying to crawl a large number of sites (eventually) one by one. After
playing with Nutch and Solr for a couple of days, I'm not really sure why
crawling takes such a long time.

I was crawling ONE website that has 5 pages of very minimal text content
and about 10 pictures, and it took ~3 minutes.

- I turned off external-link crawling in the configuration,
- command: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 2
-topN 10000
- the URL file has one URL in it (MYDOMAIN.COM as an example!),
- and conf/crawl-urlfilter.txt has 1 rule set:
+^http://([a-z0-9]*\.)*MYDOMAIN\.COM/

Is there a way I can speed this up?

thanks,
--i



--
View this message in context: http://lucene.472066.n3.nabble.com/Why-is-my-Nutch-crawling-so-slow-tp4037964.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Why is my Nutch-crawling so slow?

Posted by Tejas Patil <te...@gmail.com>.
Great!
`-threads` is the command-line option that overrides the config
`fetcher.threads.fetch`.
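As a sketch, the original command from this thread with an explicit thread
count would look like the following (50 is an illustrative value, not a
recommendation, and this is the old one-shot `crawl` command used here):

```shell
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 2 -topN 10000 -threads 50
```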

Thanks,
Tejas Patil


On Tue, Feb 19, 2013 at 6:12 AM, imehesz <im...@gmail.com> wrote:

> Tejas,
>
> it seems like passing in `-threads 200` did the trick!
>
> thank you!
>
> --iM

Re: Why is my Nutch-crawling so slow?

Posted by imehesz <im...@gmail.com>.
Tejas,

it seems like passing in `-threads 200` did the trick!

thank you!

--iM



--
View this message in context: http://lucene.472066.n3.nabble.com/Why-is-my-Nutch-crawling-so-slow-tp4037964p4041290.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Why is my Nutch-crawling so slow?

Posted by Tejas Patil <te...@gmail.com>.
Hi Imehesz,

Nutch has a bunch of configs that can be tuned to improve the crawl rate.
Apart from that, you also need to check the average size of the documents
being crawled and then evaluate the crawl speed.

Do you want to crawl images too? If not, you can add rules to
conf/regex-urlfilter.txt to omit them. (By default, Nutch will not crawl
images.)
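As an illustration, a rule of this shape in conf/regex-urlfilter.txt skips
URLs by suffix (the extension list below is a hedged example; check the
default file shipped with your Nutch version for the exact rule):

```
# skip image and other binary suffixes (illustrative list)
-\.(gif|GIF|jpg|JPG|jpeg|JPEG|png|PNG|ico|ICO|bmp|BMP)$
```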

Configs:
If you want only a single thread to crawl any given host at a time, these
are the relevant params (the values shown are the defaults; to override
them, put the properties into conf/nutch-site.xml with new values):

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>

If you want multiple threads to crawl a given host simultaneously, then you
need to modify these params:

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a queue at one time. Replaces
    deprecated parameter 'fetcher.threads.per.host'.
   </description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>
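Putting the overrides together, a minimal conf/nutch-site.xml sketch might
look like this (the values 4, 1.0, and 0.5 are illustrative assumptions, not
recommendations; stay polite to servers you don't own):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- allow several fetcher threads on the same host queue -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>4</value>
  </property>
  <!-- reduce the politeness delay between requests to one server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <!-- minimum delay used when per-queue threads is greater than 1 -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value>
  </property>
</configuration>
```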

Thanks,
Tejas Patil

