Posted to user@nutch.apache.org by Matt MacDonald <ma...@nearbyfyi.com> on 2012/09/26 14:42:13 UTC

Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Hi,

I have just performed my first full crawl of a collection of sites
within a vertical domain using Nutch 2.1. This is a restricted crawl
where I am limiting the results to just the collection of urls in the
seed.txt file and setting db.ignore.external.links to true. This crawl
is being performed on a development server and deployment in
production is likely to be EC2. I'm posting my crawl results here
hoping that others might share how they have configured their
environments, moving from smaller crawls on development machines to
larger, distributed crawls.

I'm specifically interested in suggestions for speeding up both the
initial crawl and subsequent re-crawls in local mode first. Also
suggestions and approaches to determine an optimal cost/speed EC2
configuration. Initially I'd like to avoid a multi-server setup if
possible as I'm likely to only have funds for a single server for the
time being.

I've only been poking around with Nutch off and on for a few weeks,
apologies if I'm butchering terminology or concepts.

I've read in other posts that using the native Hadoop libraries is likely to
improve performance, but I have been unable to find helpful information
about how to load them on OS X while using Nutch 2. Getting rid of this
warning is presumably what would help:
* util.NativeCodeLoader - Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable

Single Server
-----------------------
* OS X 10.7.4
* 2.7 GHz Intel Core i5 Quad Core
* 16GB memory
* 25Mbps download speed over consumer broadband (RCN)
* 1.95Mbps upload speed
* 1TB SATA hard drive, 7200 rpm

Nutch 2.X HEAD
-----------------------
* Local mode bin/nutch crawl urls -depth 8 -topN 10000
* Using Hbase 0.90.6 (was encountering hung threads with 0.90.4)
* Portions of my configuration... (are there other more relevant bits?)

  <property>
    <name>fetcher.server.delay</name>
    <value>4.0</value>
    <description>The number of seconds the fetcher will delay between
     successive requests to the same server.</description>
  </property>

  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
    <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection). The total
    number of threads running in distributed mode will be the number of
    fetcher threads * number of nodes as fetcher has one map task per node.
    </description>
  </property>

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>20</value>
    <description>This number is the maximum number of threads that
      should be allowed to access a queue at one time.</description>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
    <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
  </property>

Data points
-----------------------
* Currently 177 unique websites; will grow to ~30,000 websites
* PDF/Word/Excel-heavy sites; ~50% of the documents are non-HTML
* 64,000 unique webpage documents reported in 'webpage' Hbase table
* Hbase 'webpage' table is 14GB

Crawl performance
-----------------------
* bin/nutch crawl urls -depth 8 -topN 10000
* Initial crawl takes ~4 hours
* jconsole reports that heap usage usually hovers around 1GB, with
occasional spikes to about 2GB
* max heap size is set to default
* CPU use is typically below 10%
* Network peak data received: 30Mbps
* Subsequent crawl takes ~2 hours

Indexing
-----------------------
* Using ElasticSearch for the index
* ElasticSearch index is 752MB with 50,000 unique documents
* Custom index plugin adds specific vertical domain information to the index
* Indexing 64,000 documents via bin/nutch elasticindex takes ~5 minutes

Questions
-----------------------
* Given my current setup, is the crawl that I'm performing taking
roughly the same time that others might expect?
* If this crawl is taking much longer than you might expect what would
you suggest trying to decrease the crawl time?
* Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
and calling each step individually? Why?
* Are there more recent/improved versions of
http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
2.x?
* Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
* What other questions should I be asking?

Thanks,
Matt

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
An answer to one of my own questions. I'd still love help with the others.

> Some questions:
> -------------------------
> 1) After 12 iterations I'm still seeing more than 4,500 documents out
> of 45,000 that are unfetched. How might I go about determining why the
> unfetched urls are not being fetched?

I ran bin/nutch readdb -stats -dump links and found that many of the
documents that were unfetched had hit 500 errors, 404s, or socket
timeouts, or simply sat at a depth greater than the number of iterations
I ran (markers show {dist=13}). The 500 errors relate directly to my
other question, though, about making sure that I'm not saturating a
website and causing those 500 errors myself. I checked some of the urls
that reported 500 errors in the webpage table by hand and they are
returning 200 response codes now.
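
For the by-hand checks, something like this is enough (the URL below is
only a placeholder, not one of my seeds):

bin/nutch readdb -url http://example.org/reports/budget-2012.pdf

The record that comes back includes the status, protocolStatus and retry
fields, which is where the 500s, 404s and timeouts show up.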

>
> 2) Any suggestions for modifying the iteration steps and/or
> parameters for each step in successive iterations to decrease crawl
> times and/or increase the number of fetched urls? topN? threads?
>
> 3) Any additional information on what the mapred related parameters do?
> mapred.reduce.tasks.speculative.execution=false
> mapred.map.tasks.speculative.execution=false
> mapred.compress.map.output=true
> mapred.skip.attempts.to.start.skipping=2
> mapred.skip.map.max.skip.records=1
>
> 4) During my local, single node crawl I've seen a few sites throw 500
> errors and become unresponsive. How can I ensure that I'm not DOSing
> and crashing the sites I'm crawling?
> * fetcher.server.delay=5.0
> * fetcher.threads.fetch=100
> * fetcher.threads.per.queue=100
> * fetcher.threads.per.host=100
> * db.fetch.schedule.class=org.apache.nutch.crawl.AdaptiveFetchSchedule
> * http.timeout=30000
> * db.ignore.external.links=true
>
> 5) What value should I set for gora.buffer.read.limit? Currently it's
> set to the default of 10000. During fetch steps #6-#12 nearly 50% of
> the time was spent reading from HBase. I was seeing
> gora.buffer.read.limit=10000 show up for several minutes in the logs.
>
> Thanks,
> Matt
>
> On Fri, Sep 28, 2012 at 8:21 AM, Julien Nioche
> <li...@gmail.com> wrote:
>> Hi Matt
>>
>>
>>> > the fetch step is likely to take most of the time and the time it takes
>>> > is mostly a matter of the distribution of hosts/IP/domains in your
>>> > fetchlist. Search the WIKI for details on performance tips
>>>
>>> Thanks. Most of the urls that I'm fetching are each on their own
>>> IP/hosts and unique servers.
>>>
>>
>> Ok, you might want to use a large number of threads then
>> (fetcher.threads.fetch)
>>
>> [...]
>>
>>
>>>
>>> >
>>> >
>>> >> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>>> >>
>>> >
>>> > redirections? sounds quite a lot though
>>>
>>> Thoughts for how I would identify which are redirects?
>>>
>>
>> try using 'nutch readdb' to dump the content of the webtable and inspect
>> the URLs
>>
>> Julien
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

I'd like to add that 'gora.buffer.read.limit' has no significance with regard
to the HBaseStore. What this property means is that after every N records,
it closes and reopens the store scanners (continuing at the last processed
row). I'm not sure how this relates to other stores, but for HBase you
might as well set it to a very high number, so that just one HBase Scanner
is used for a mapreduce task. (Just do not set it too low, or there will be
unnecessary closing and reopening of scanners.)
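
As an illustration (the value is arbitrary, anything large will do), it can
be passed to a job on the command line like the other -D properties in this
thread:

bin/nutch fetch -D gora.buffer.read.limit=100000 <BATCH_ID>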

For HBase, the property that decides how many rows are read at once is
hbase.client.scanner.caching (a client-side property). Setting this too high
means that the regionservers get overloaded by the responses (check for
responseTooLarge errors in the regionserver logs). Because how fat the input
rows are depends on the type of job, it is difficult to pick a single value
for all Nutch jobs. For example, the GeneratorJob rows are slim (just a few
small columns are read), but the ParserJob rows are fat, because the content
field is read as well. What I have done is set hbase.client.scanner.caching
to a high number, i.e. 1000, but also set a limit on the size in bytes that
responses can reach. This is determined by the property
'hbase.client.scanner.max.result.size', which you can set to 50100100
(roughly 50MB) or something like that. This property should be set both
server side (regionserver) and client side (so defined within the properties
of the submitted job), otherwise you get missing rows:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/27919
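
As a sketch, the client-side half can be passed on the job command line
(values as above; the same max.result.size still has to be present in
hbase-site.xml on the regionservers):

bin/nutch parse -D hbase.client.scanner.caching=1000 \
  -D hbase.client.scanner.max.result.size=50100100 \
  <BATCH_ID>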

Ferdy.

On Wed, Oct 3, 2012 at 9:36 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Matt,
>
> I know there is a pile of stuff to add to this but for the time being
> (until I dive into your response in detail) please see below
>
> On Tue, Oct 2, 2012 at 11:17 PM, Matt MacDonald <ma...@nearbyfyi.com>
> wrote:
> > Hi,
> ...
> >
> > 5) What value should I set for gora.buffer.read.limit? Currently it's
> > set to the default of 10000. During fetch steps #6-#12 nearly 50% of
> > the time was spent reading from HBase. I was seeing
> > gora.buffer.read.limit=10000 show up for several minutes in the logs.
> >
>
> Oddly enough we have little documentation on configuration properties
> for Gora, however for the time being please see the link below for an
> indication of buffered reads and writes in Gora
>
>
> http://techvineyard.blogspot.co.uk/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency
>
> hth
> Lewis
>

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Matt,

I know there is a pile of stuff to add to this but for the time being
(until I dive into your response in detail) please see below

On Tue, Oct 2, 2012 at 11:17 PM, Matt MacDonald <ma...@nearbyfyi.com> wrote:
> Hi,
...
>
> 5) What value should I set for gora.buffer.read.limit? Currently it's
> set to the default of 10000. During fetch steps #6-#12 nearly 50% of
> the time was spent reading from HBase. I was seeing
> gora.buffer.read.limit=10000 show up for several minutes in the logs.
>

Oddly enough we have little documentation on configuration properties
for Gora, however for the time being please see the link below for an
indication of buffered reads and writes in Gora

http://techvineyard.blogspot.co.uk/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency

hth
Lewis

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi,

Great feedback, suggestions and activity on this list.

Based on guidance from the list I stopped using the bin/nutch crawl
command and am now calling each step individually. Julien, you
suggested that I start with
https://issues.apache.org/jira/secure/attachment/12535851/NUTCH-1087-2.1.patch.
I'm more comfortable working with Ruby than shell scripts so I ported
the script to Ruby and added some additional logging to help me better
understand the timing and output of each step.

There are a few parameters used in the shell script where I'm unclear what
impact they have, or whether they are even being used, and I'd love
feedback on what they mean and how I might tweak them.

Called during Generate & Fetch:
-------------------------
mapred.reduce.tasks.speculative.execution=false
mapred.map.tasks.speculative.execution=false
mapred.compress.map.output=true

Called during Parse:
-------------------------
mapred.skip.attempts.to.start.skipping=2
mapred.skip.map.max.skip.records=1


I've run 12 crawl iterations over the 177 websites that I'm crawling
and I'm wondering if the results are what others might expect.

These are my crawling commands:
-------------------------
0) nutch inject #{options[:seed_dir]}

Loop
1) nutch generate -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -numFetchers 1 -noFilter

2) nutch fetch -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true <BATCH_ID>

3) nutch parse -D mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 <BATCH_ID>

4) nutch updatedb
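
Put together as a plain shell sketch (my actual loop is the Ruby port
mentioned above; BATCH_ID stands in for the batch id produced by the
generate step):

COMMON="-D mapred.map.tasks=2 -D mapred.reduce.tasks=2 \
 -D mapred.child.java.opts=-Xmx1000m \
 -D mapred.reduce.tasks.speculative.execution=false \
 -D mapred.map.tasks.speculative.execution=false \
 -D mapred.compress.map.output=true"

bin/nutch generate $COMMON -topN 50000 -numFetchers 1 -noFilter
bin/nutch fetch $COMMON "$BATCH_ID"
bin/nutch parse -D mapred.skip.attempts.to.start.skipping=2 \
 -D mapred.skip.map.max.skip.records=1 "$BATCH_ID"
bin/nutch updatedb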


Iterations #2-#5 resulted in:
-------------------------
Average iteration time: 30-35 minutes

Iterations #6-#12 resulted in (realized I should be timing each step):
-------------------------
Average generate time: 250 seconds
Average fetch time: 400 seconds
Average parse time: 450 seconds
Average update time: 300 seconds
Average total iteration time: 20-25 minutes

HBase size after 12 iterations: 11.02GB

After the 12th iteration readdb -stats resulted in the following output
-------------------------
WebTable statistics start
Statistics for WebTable:
status 2 (status_fetched):  39611
min score:  0.0
retry 0:  43146
jobs: {db_stats-job_local_0001={jobID=job_local_0001,
jobName=db_stats, counters={File Input Format Counters
={BYTES_READ=0}, Map-Reduce
Framework={MAP_OUTPUT_MATERIALIZED_BYTES=7829,
MAP_INPUT_RECORDS=45859, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=1339,
MAP_OUTPUT_BYTES=2430527, COMMITTED_HEAP_BYTES=27249123328,
COMBINE_INPUT_RECORDS=183436, SPLIT_RAW_BYTES=78062,
REDUCE_INPUT_RECORDS=463, REDUCE_INPUT_GROUPS=118,
COMBINE_OUTPUT_RECORDS=463, REDUCE_OUTPUT_RECORDS=118,
MAP_OUTPUT_RECORDS=183436},
FileSystemCounters={FILE_BYTES_READ=32439253,
FILE_BYTES_WRITTEN=32913783}, File Output Format Counters
={BYTES_WRITTEN=2520}}}}
retry 1:  2713
status 5 (status_redir_perm): 1373
max score:  19.345
TOTAL urls: 45859
status 4 (status_redir_temp): 346
status 1 (status_unfetched):  4529
avg score:  0.04870584
WebTable statistics: done
status 2 (status_fetched):  39611
min score:  0.0
retry 0:  43146
jobs: {db_stats-job_local_0001={jobID=job_local_0001,
jobName=db_stats, counters={File Input Format Counters
={BYTES_READ=0}, Map-Reduce
Framework={MAP_OUTPUT_MATERIALIZED_BYTES=7829,
MAP_INPUT_RECORDS=45859, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=1339,
MAP_OUTPUT_BYTES=2430527, COMMITTED_HEAP_BYTES=27249123328,
COMBINE_INPUT_RECORDS=183436, SPLIT_RAW_BYTES=78062,
REDUCE_INPUT_RECORDS=463, REDUCE_INPUT_GROUPS=118,
COMBINE_OUTPUT_RECORDS=463, REDUCE_OUTPUT_RECORDS=118,
MAP_OUTPUT_RECORDS=183436},
FileSystemCounters={FILE_BYTES_READ=32439253,
FILE_BYTES_WRITTEN=32913783}, File Output Format Counters
={BYTES_WRITTEN=2520}}}}
retry 1:  2713
status 5 (status_redir_perm): 1373
max score:  19.345
TOTAL urls: 45859
status 4 (status_redir_temp): 346
status 1 (status_unfetched):  4529
avg score:  0.04870584


Some questions:
-------------------------
1) After 12 iterations I'm still seeing more than 4,500 documents out
of 45,000 that are unfetched. How might I go about determining why the
unfetched urls are not being fetched?

2) Any suggestions for modifying the iteration steps and/or
parameters for each step in successive iterations to decrease crawl
times and/or increase the number of fetched urls? topN? threads?

3) Any additional information on what the mapred related parameters do?
mapred.reduce.tasks.speculative.execution=false
mapred.map.tasks.speculative.execution=false
mapred.compress.map.output=true
mapred.skip.attempts.to.start.skipping=2
mapred.skip.map.max.skip.records=1

4) During my local, single node crawl I've seen a few sites throw 500
errors and become unresponsive. How can I ensure that I'm not DOSing
and crashing the sites I'm crawling?
* fetcher.server.delay=5.0
* fetcher.threads.fetch=100
* fetcher.threads.per.queue=100
* fetcher.threads.per.host=100
* db.fetch.schedule.class=org.apache.nutch.crawl.AdaptiveFetchSchedule
* http.timeout=30000
* db.ignore.external.links=true

5) What value should I set for gora.buffer.read.limit? Currently it's
set to the default of 10000. During fetch steps #6-#12 nearly 50% of
the time was spent reading from HBase. I was seeing
gora.buffer.read.limit=10000 show up for several minutes in the logs.

Thanks,
Matt

On Fri, Sep 28, 2012 at 8:21 AM, Julien Nioche
<li...@gmail.com> wrote:
> Hi Matt
>
>
>> > the fetch step is likely to take most of the time and the time it takes
>> > is mostly a matter of the distribution of hosts/IP/domains in your
>> > fetchlist. Search the WIKI for details on performance tips
>>
>> Thanks. Most of the urls that I'm fetching are each on their own
>> IP/hosts and unique servers.
>>
>
> Ok, you might want to use a large number of threads then
> (fetcher.threads.fetch)
>
> [...]
>
>
>>
>> >
>> >
>> >> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>> >>
>> >
>> > redirections? sounds quite a lot though
>>
>> Thoughts for how I would identify which are redirects?
>>
>
> try using 'nutch readdb' to dump the content of the webtable and inspect
> the URLs
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Julien Nioche <li...@gmail.com>.
Hi Matt


> > the fetch step is likely to take most of the time and the time it takes
> > is mostly a matter of the distribution of hosts/IP/domains in your
> > fetchlist. Search the WIKI for details on performance tips
>
> Thanks. Most of the urls that I'm fetching are each on their own
> IP/hosts and unique servers.
>

Ok, you might want to use a large number of threads then
(fetcher.threads.fetch)
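
e.g. (the number is only an illustration, tune to taste):

bin/nutch fetch -D fetcher.threads.fetch=200 <BATCH_ID>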

[...]


>
> >
> >
> >> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
> >>
> >
> > redirections? sounds quite a lot though
>
> Thoughts for how I would identify which are redirects?
>

try using 'nutch readdb' to dump the content of the webtable and inspect
the URLs
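
something along these lines should do (the output directory is arbitrary,
and the grep pattern assumes the dump prints the same status names that
-stats uses):

bin/nutch readdb -dump /tmp/webtable-dump
grep -c 'status_redir' /tmp/webtable-dump/part-*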

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Julien,

> the fetch step is likely to take most of the time and the time it takes is
> mostly a matter of the distribution of hosts/IP/domains in your fetchlist.
> Search the WIKI for details on performance tips

Thanks. Most of the urls that I'm fetching are each on their own
IP/hosts and unique servers.

>
>
>> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
>> and calling each step individually? Why?
>>
>
> This has been discussed several times on the mailing list : you get more
> control with a script + all in one crawl command can have issues with
> runaway parsing threads, etc...

Understood.

>
>
>> * Are there more recent/improved versions of
>> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
>> 2.x?
>>
>
> yes, see patch in https://issues.apache.org/jira/browse/NUTCH-1087

Thanks. I'll review that.

>
>
>> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>>
>
> redirections? sounds quite a lot though

Thoughts for how I would identify which are redirects?

>
> HTH
>
> J
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Julien Nioche <li...@gmail.com>.
 Hi


> * Given my current setup, is the crawl that I'm performing taking
> roughly the same time that others might expect?
> * If this crawl is taking much longer than you might expect what would
> you suggest trying to decrease the crawl time?
>

the fetch step is likely to take most of the time and the time it takes is
mostly a matter of the distribution of hosts/IP/domains in your fetchlist.
Search the WIKI for details on performance tips


> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
> and calling each step individually? Why?
>

This has been discussed several times on the mailing list : you get more
control with a script + all in one crawl command can have issues with
runaway parsing threads, etc...


> * Are there more recent/improved versions of
> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
> 2.x?
>

yes, see patch in https://issues.apache.org/jira/browse/NUTCH-1087


> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>

redirections? sounds quite a lot though

HTH

J



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

Posted by Bai Shen <ba...@gmail.com>.
Two things to note.  db.ignore.external.links doesn't quite work the way
you think it should.  If you have a url inside the domain that resolves to
a url outside the domain, nutch will end up indexing that domain as well.
The way to get around this is to use the whitelist instead of
db.ignore.external.links.

Also, you should set your fetcher.threads.per.host to at least 2.  I've
seen nutch take forever in a fetch because a single host is left in the
queue.  Don't set this too high, however, as you can cause a DOS.  Also,
set limits on your
generates.  You want to have urls from as many different servers as
possible in order to spread the load around.
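
A rough sketch of both suggestions (the whitelist part assumes the
urlfilter-domain plugin is enabled in plugin.includes and reads its default
domain-urlfilter.txt; domains and values are only examples):

cat > conf/domain-urlfilter.txt <<'EOF'
example-town.gov
example-city.org
EOF

bin/nutch fetch -D fetcher.threads.per.host=2 <BATCH_ID>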

On Wed, Sep 26, 2012 at 8:42 AM, Matt MacDonald <ma...@nearbyfyi.com> wrote:

> Hi,
>
> I have just performed my first full crawl of a collection of sites
> within a vertical domain using Nutch 2.1. This is a restricted crawl
> where I am limiting the results to just the collection of urls in the
> seed.txt file and setting db.ignore.external.links to true. This crawl
> is being performed on a development server and deployment in
> production is likely to be EC2. I'm posting my crawl results here
> hoping that others might share how they have configured their
> environments, moving from smaller crawls on development machines to
> larger, distributed crawls.
>
> I'm specifically interested in suggestions for speeding up both the
> initial crawl and subsequent re-crawls in local mode first. Also
> suggestions and approaches to determine an optimal cost/speed EC2
> configuration. Initially I'd like to avoid a multi-server setup if
> possible as I'm likely to only have funds for a single server for the
> time being.
>
> I've only been poking around with Nutch off and on for a few weeks,
> apologies if I'm butchering terminology or concepts.
>
> I've read in other posts that using the native Hadoop libraries is likely to
> improve performance, but I have been unable to find helpful information
> about how to load them on OS X while using Nutch 2. Getting rid of this
> warning is presumably what would help:
> * util.NativeCodeLoader - Unable to load native-hadoop library for
> your platform... using builtin-java classes where applicable
>
> Single Server
> -----------------------
> * OS X 10.7.4
> * 2.7 GHz Intel Core i5 Quad Core
> * 16GB memory
> * 25Mbps download speed over consumer broadband (RCN)
> * 1.95Mbps upload speed
> * 1TB SATA hard drive, 7200 rpm
>
> Nutch 2.X HEAD
> -----------------------
> * Local mode bin/nutch crawl urls -depth 8 -topN 10000
> * Using Hbase 0.90.6 (was encountering hung threads with 0.90.4)
> * Portions of my configuration... (are there other more relevant bits?)
>
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>4.0</value>
>     <description>The number of seconds the fetcher will delay between
>      successive requests to the same server.</description>
>   </property>
>
>   <property>
>     <name>fetcher.threads.fetch</name>
>     <value>50</value>
>     <description>The number of FetcherThreads the fetcher should use.
>     This also determines the maximum number of requests that are
>     made at once (each FetcherThread handles one connection). The total
>     number of threads running in distributed mode will be the number of
>     fetcher threads * number of nodes as fetcher has one map task per node.
>     </description>
>   </property>
>
>   <property>
>     <name>fetcher.threads.per.queue</name>
>     <value>20</value>
>     <description>This number is the maximum number of threads that
>       should be allowed to access a queue at one time.</description>
>   </property>
>
>   <property>
>     <name>fetcher.threads.per.host</name>
>     <value>1</value>
>     <description>This number is the maximum number of threads that
>     should be allowed to access a host at one time.</description>
>   </property>
>
> Data points
> -----------------------
> * Currently 177 unique websites; will grow to ~30,000 websites
> * PDF/Word/Excel-heavy sites; ~50% of the documents are non-HTML
> * 64,000 unique webpage documents reported in 'webpage' Hbase table
> * Hbase 'webpage' table is 14GB
>
> Crawl performance
> -----------------------
> * bin/nutch crawl urls -depth 8 -topN 10000
> * Initial crawl takes ~4 hours
> * jconsole reports that heap usage usually hovers around 1GB, with
> occasional spikes to about 2GB
> * max heap size is set to default
> * CPU use is typically below 10%
> * Network peak data received: 30Mbps
> * Subsequent crawl takes ~2 hours
>
> Indexing
> -----------------------
> * Using ElasticSearch for the index
> * ElasticSearch index is 752MB with 50,000 unique documents
> * Custom index plugin adds specific vertical domain information to the
> index
> * Indexing 64,000 documents via bin/nutch elasticindex takes ~5 minutes
>
> Questions
> -----------------------
> * Given my current setup, is the crawl that I'm performing taking
> roughly the same time that others might expect?
> * If this crawl is taking much longer than you might expect what would
> you suggest trying to decrease the crawl time?
> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
> and calling each step individually? Why?
> * Are there more recent/improved versions of
> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
> 2.x?
> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
> * What other questions should I be asking?
>
> Thanks,
> Matt
>