Posted to user@nutch.apache.org by Steve Cohen <ma...@gmail.com> on 2010/09/27 17:24:39 UTC

What is nutch doing?

Hello,

I've been given the task of figuring out why nutch is running slower on
Solaris than on Linux with the same configuration. Looking at the log file,
I see a big gap between the time the fetcher stops fetching and the time it
says it is done, and I would love to know what is going on. Here is the log
snippet.

2010-09-24 11:04:28,413 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0
2010-09-24 11:05:32,782 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2010-09-24 11:05:33,469 INFO  plugin.PluginRepository - Plugins: looking in:
/opt/nutch/build/plugins
2010-09-24 11:05:34,052 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository - Registered Plugins:
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Jakarta POI
- Java API To Access Microsoft Format Files (lib-jakarta-poi)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More
Indexing Filter (index-more)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         HTTP
Framework (lib-http)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More Query
Filter (query-more)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Regex URL
Filter (urlfilter-regex)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         XML
Libraries (lib-xml)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Http
Protocol Plug-in (protocol-http)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         MSExcel
Parse Plug-in (parse-msexcel)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         XML Response
Writer Plug-in (response-xml)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Zip Parse
Plug-in (parse-zip)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Anchor
Indexing Filter (index-anchor)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         URL Query
Filter (query-url)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Parse MS
Documents Framework (lib-parsems)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Regex URL
Filter Framework (lib-regex-filter)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         JSON
Response Writer Plug-in (response-json)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         the nutch
core extension points (nutch-extensionpoints)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         MSPowerPoint
Parse Plug-in (parse-mspowerpoint)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Basic Query
Filter (query-basic)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         RSS Parse
Plug-in (parse-rss)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
Indexing Filter (index-basic)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Site Query
Filter (query-site)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
Summarizer Plug-in (summary-basic)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Text Parse
Plug-in (parse-text)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         CyberNeko
HTML Parser (lib-nekohtml)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         File
Protocol Plug-in (protocol-file)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository - Registered
Extension-Points:
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Field
Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Search
Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch Online
Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Content Parser (org.apache.nutch.parse.Parser)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2010-09-24 11:47:04,995 INFO  fetcher.Fetcher - Fetcher: done
2010-09-24 11:47:10,151 INFO  crawl.CrawlDb - CrawlDb update: starting

So at 11:04, the fetcher winds down and has no more threads to run. At 11:05
it logs a warning about not having native hadoop libraries (I am going to
build them today) and loads plugins. Then, at 11:47, more than 40 minutes
later, Fetcher logs that it is done and CrawlDb starts. What did Fetcher do
for those 40-plus minutes?
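
For reference, the length of the gap can be checked from the timestamps
themselves. Below is a minimal Java sketch, assuming the log4j-style
timestamp format shown in the snippet; the class and method names are made
up for illustration. It measures from the last activeThreads=0 line to the
"Fetcher: done" line.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: computes the time between two log timestamps. */
public class LogGap {
    // Timestamp layout as it appears in the log snippet, e.g. 2010-09-24 11:04:29,200
    private static final DateTimeFormatter LOG_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the elapsed time between two timestamps in the log format. */
    public static Duration gap(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_FORMAT);
        LocalDateTime to = LocalDateTime.parse(end, LOG_FORMAT);
        return Duration.between(from, to);
    }

    public static void main(String[] args) {
        // Last fetcher activity line and the "Fetcher: done" line from the snippet.
        Duration d = gap("2010-09-24 11:04:29,200", "2010-09-24 11:47:04,995");
        // Prints: 42 min 35 s
        System.out.println(d.toMinutes() + " min " + (d.getSeconds() % 60) + " s");
    }
}
```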

Thanks,
Steve

Re: What is nutch doing?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-29 01:13, Steve Cohen wrote:

>   fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
>   fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
>   fc12c078 *
> *org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18
> (line 180)
>   fc12c078 *
> *org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38
> (line 234)
>   fc12c078 *
> *org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50
> (line 184)
>   fc12c078 *
> *org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992
> (line 555)
>   fc16a680 *
> *org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V
> [compiled] +20 (line 226)
>   fc16a680 *
> *org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120

This fragment of the stack trace suggests two things:

* You are running Fetcher in parsing mode. This is discouraged: if
parsing gets stuck or crashes, you will have to re-fetch from scratch.

* Regex URL filtering can be slow at times. There are many weird URLs
out there; my favorite one was 64kB long and consisted partially of NULL
characters. Java regex can be VERY slow on such URLs, so slow that the
task appears to hang, and sometimes the TaskTracker decides it really is
hung and kills it. For large crawls I tend to avoid the regex urlfilter
and instead use a combination of prefix / suffix / domain / custom
filters that don't use regex, or I sanitize the URLs first.
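
The prefix / suffix approach with a sanitizing step might look roughly
like the following. This is only a sketch in plain Java, not the Nutch
URLFilter plugin API; the class name, the length limit, and the prefix
and suffix lists are all assumptions chosen for illustration. It keeps
the String-in, String-out shape of the filter(String) calls visible in
the stack trace: it returns the URL to keep it, or null to drop it.

```java
import java.util.Locale;

/** Hypothetical regex-free URL filter; not part of Nutch. */
public class SimpleUrlFilter {
    // Assumed limit and rule lists, for illustration only.
    private static final int MAX_URL_LENGTH = 2048;
    private static final String[] ALLOWED_PREFIXES = {"http://", "https://"};
    private static final String[] REJECTED_SUFFIXES = {".gif", ".jpg", ".png", ".zip", ".exe"};

    /** Returns the URL if it passes all checks, or null if it should be dropped. */
    public static String filter(String url) {
        // Sanitize first: drop empty or oversized URLs before any further work.
        if (url == null || url.isEmpty() || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        // Drop URLs containing control characters such as NUL.
        for (int i = 0; i < url.length(); i++) {
            if (Character.isISOControl(url.charAt(i))) {
                return null;
            }
        }
        String lower = url.toLowerCase(Locale.ROOT);
        // Prefix check: plain string comparison, linear in the prefix length.
        boolean prefixOk = false;
        for (String prefix : ALLOWED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                prefixOk = true;
                break;
            }
        }
        if (!prefixOk) {
            return null;
        }
        // Suffix check: again plain string comparison, no backtracking involved.
        for (String suffix : REJECTED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/page.html")); // kept
        System.out.println(filter("http://example.com/logo.gif"));  // null (suffix)
        System.out.println(filter("http://example.com/a" + (char) 0 + "b")); // null (NUL)
    }
}
```

Every check here costs at most one pass over the URL, so a single
pathological URL cannot stall the task the way a backtracking regex can.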

> I have a feeling I know why it is only using one core. I set
> mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a
> setting for mapred.reduce.tasks which is set to 1. Do I need to raise it
> to 4 as well?

Yes. The first property specifies how many reduce tasks a tasktracker
can run at the same time, while the second property sets the default
number of reduce tasks in a job (jobs may override this setting, but
usually don't, so this will usually be the number of reducers per job).
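
To see why mapred.reduce.tasks=1 leaves a single busy core, it helps to
look at how keys are spread across reduce tasks. The sketch below
assumes hash partitioning with the same formula as Hadoop's default
HashPartitioner. It uses plain String keys and made-up URLs instead of
Hadoop's Text type, so the exact bucket numbers will differ from a real
job; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical illustration of how keys map to reduce partitions. */
public class ReducePartitionSketch {
    /** Non-negative hash modulo the number of reduce tasks. */
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Counts how many keys land in each reduce partition. */
    public static int[] distribute(String[] keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partition(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for the fetcher's output keys.
        String[] urls = {
            "http://example.com/a", "http://example.com/b",
            "http://example.org/c", "http://example.org/d",
            "http://example.net/e", "http://example.net/f",
            "http://example.edu/g", "http://example.edu/h"
        };
        // One reduce task: every key falls into partition 0.
        System.out.println(Arrays.toString(distribute(urls, 1)));
        // Four reduce tasks: the keys are split across up to four partitions.
        System.out.println(Arrays.toString(distribute(urls, 4)));
    }
}
```

With one reduce task every key lands in partition 0, so there is only
one reduce task to schedule no matter how many slots
mapred.tasktracker.reduce.tasks.maximum allows.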


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: What is nutch doing?

Posted by Steve Cohen <ma...@gmail.com>.
On Mon, Sep 27, 2010 at 1:26 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-09-27 17:24, Steve Cohen wrote:
>
>> Hello,
>>
>> I've been given the task of figuring out why nutch is running slower on
>> Solaris than on Linux with the same configuration. Looking at the log file,
>> I see a big gap between the time the fetcher stops fetching and the time it
>> says it is done, and I would love to know what is going on. Here is the log
>> snippet.
>>
>> [...]
>>
>> So at 11:04, the fetcher winds down and has no more threads to run. At
>> 11:05 it logs a warning about not having native hadoop libraries (I am
>> going to build them today) and loads plugins. Then, at 11:47, more than 40
>> minutes later, Fetcher logs that it is done and CrawlDb starts. What did
>> Fetcher do for those 40-plus minutes?
>>
>
> It was diligently running the "reduce" phase, which consists of sorting and
> the reduce() proper. If you run Fetcher in parsing mode, then another
> possibility is that some of the parsers run slower on Solaris. Yet another
> possibility, which you mentioned, is that Hadoop can use the native
> compression libs on Linux, but there are no such libs pre-compiled for
> Solaris.
>
> Also, while reduce() speed is mostly determined by the Reducer
> implementation (and very little by IO), the sorting speed is very much
> dependent on disk IO and the size of the dataset that was partitioned to a
> given reduce task. All other config factors being equal, I suspect that your
> Solaris box could have a slower disk.
>
> You can verify these hypotheses with top/iostat/vmstat and see whether the
> tasks are bound by CPU or by diskwait.

I tried running nutch again after changing the job tracker setting from
"local" to localhost:<port> and setting
mapred.tasktracker.reduce.tasks.maximum and
mapred.tasktracker.map.tasks.maximum, hoping it would use multiple threads
and cores to speed up the portion after the fetcher threads drop to 0 but
before it tells me that fetcher is done.

It seems I need to do more configuration: it is only using one thread during
this down time.

But I do know what it is doing. I ran the Solaris command pstack on the pid
to see what was going on, and you were right: it is running map reduce.

-----------------  lwp# 2 / thread# 2  --------------------
 fc0e3324 *
*java/util/regex/Pattern$CharProperty.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z
[compiled]
 fc0f9b9c *
*java/util/regex/Pattern$Slice.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z
[compiled] +152 (line 6979)
 fc0f9b9c *
*java/util/regex/Pattern$Begin.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z+62
(line 6244)
 fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
 fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
 fc12c078 *
*org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18
(line 180)
 fc12c078 *
*org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38
(line 234)
 fc12c078 *
*org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50
(line 184)
 fc12c078 *
*org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992
(line 555)
 fc16a680 *
*org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V
[compiled] +20 (line 226)
 fc16a680 *
*org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120
(line 186)
 fc16a680 *
*org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V+20
(line 140)
 fc16a680 *
*org/apache/hadoop/mapred/ReduceTask$3.collect(Ljava/lang/Object;Ljava/lang/Object;)V+14
(line 821)
 fc16a680 *
*org/apache/hadoop/mapred/lib/IdentityReducer.reduce(Ljava/lang/Object;Ljava/util/Iterator;Lorg/apache/hadoop/mapred/OutputCollector;Lorg/apache/hadoop/mapred/Reporter;)V+36
(line 79)
 fc005fd0 *
org/apache/hadoop/mapred/ReduceTask.run(Lorg/apache/hadoop/mapred/JobConf;Lorg/apache/hadoop/mapred/TaskUmbilicalProtocol;)V+610
(line 751)
 fc005ab0 * org/apache/hadoop/mapred/Child.main([Ljava/lang/String;)V+440
(line 227)
 fc00021c * StubRoutines (1)
 fe5594fc
__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_
(fc0001c0, 35400, 1, 0, 779d6e98, fe37ff08) + 208
 fe5fd1d4 jni_CallStaticVoidMethod (35510, 35c4c, 35848, 35400, 35840,
2c648) + 4b8
 00013ab0 JavaMain (368e8, 2b8f4, 2af28, 35510, 4, fee32d24) + 15f0
 ff2c8950 _lwp_start (0, 0, 0, 0, 0, 0)


I have a feeling I know why it is only using one core. I set
mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a
setting for mapred.reduce.tasks which is set to 1. Do I need to raise it to
4 as well?

Thanks,
Steve Cohen

Re: What is nutch doing?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-27 17:24, Steve Cohen wrote:
> Hello,
>
> I've been given the task of figuring out why nutch is running slower on
> Solaris than on Linux with the same configuration. Looking at the log file,
> I see a big gap between the time the fetcher stops fetching and the time it
> says it is done, and I would love to know what is going on. Here is the log
> snippet.
>
> [...]
>
> So at 11:04, the fetcher winds down and has no more threads to run. At 11:05
> it logs a warning about not having native hadoop libraries (I am going to
> build them today) and loads plugins. Then, at 11:47, more than 40 minutes
> later, Fetcher logs that it is done and CrawlDb starts. What did Fetcher do
> for those 40-plus minutes?

It was diligently running the "reduce" phase, which consists of sorting
and the reduce() proper. If you run Fetcher in parsing mode, then
another possibility is that some of the parsers run slower on Solaris.
Yet another possibility, which you mentioned, is that Hadoop can use the
native compression libs on Linux, but there are no such libs
pre-compiled for Solaris.

Also, while reduce() speed is mostly determined by the Reducer 
implementation (and very little by IO), the sorting speed is very much 
dependent on disk IO and the size of the dataset that was partitioned to 
a given reduce task. All other config factors being equal, I suspect 
that your Solaris box could have a slower disk.

You can verify these hypotheses with top/iostat/vmstat and see whether 
the tasks are bound by CPU or by diskwait.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com