Posted to user@nutch.apache.org by Steve Cohen <ma...@gmail.com> on 2010/09/27 17:24:39 UTC
What is nutch doing?
Hello,
I've been given the task of figuring out why nutch is running slower on
Solaris than on Linux with the same configuration. I am looking at the log
file and I see this big gap between the time fetcher stops fetching and it
says it is done and I would love to know what is going on. Here is the log
snippet.
2010-09-24 11:04:28,413 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0
2010-09-24 11:05:32,782 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2010-09-24 11:05:33,469 INFO plugin.PluginRepository - Plugins: looking in:
/opt/nutch/build/plugins
2010-09-24 11:05:34,052 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Registered Plugins:
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Jakarta POI
- Java API To Access Microsoft Format Files (lib-jakarta-poi)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - More
Indexing Filter (index-more)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - MSWord Parse
Plug-in (parse-msword)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - More Query
Filter (query-more)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - XML
Libraries (lib-xml)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - MSExcel
Parse Plug-in (parse-msexcel)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - XML Response
Writer Plug-in (response-xml)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Zip Parse
Plug-in (parse-zip)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Anchor
Indexing Filter (index-anchor)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Parse MS
Documents Framework (lib-parsems)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - JSON
Response Writer Plug-in (response-json)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - the nutch
core extension points (nutch-extensionpoints)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - MSPowerPoint
Parse Plug-in (parse-mspowerpoint)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - RSS Parse
Plug-in (parse-rss)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - CyberNeko
HTML Parser (lib-nekohtml)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - File
Protocol Plug-in (protocol-file)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Registered
Extension-Points:
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch Field
Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch Search
Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2010-09-24 11:47:04,995 INFO fetcher.Fetcher - Fetcher: done
2010-09-24 11:47:10,151 INFO crawl.CrawlDb - CrawlDb update: starting
So at 11:04 the fetcher winds down and has no more threads to run. At 11:05 it
logs a warning about not having native hadoop libraries (I am going to build
them today) and loads plugins. Then Fetcher reports that it is done at 11:47,
about 42 minutes later, and CrawlDb starts. What did Fetcher do in that time?
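A minimal sketch for locating gaps like this mechanically, assuming the timestamp layout shown in the log lines above. The function names and the `top` parameter are made up for illustration and are not part of Nutch.

```python
from datetime import datetime

# Timestamp layout used by the log lines above: "2010-09-24 11:04:28,413"
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # number of characters occupied by the timestamp prefix


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None for wrapped lines."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def largest_gaps(lines, top=3):
    """Return the `top` largest (seconds, line_before, line_after) gaps."""
    stamped = [(parse_timestamp(line), line.rstrip()) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    gaps = [
        ((b_ts - a_ts).total_seconds(), a_line, b_line)
        for (a_ts, a_line), (b_ts, b_line) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


# A few lines copied from the snippet above, including one wrapped line.
sample = [
    "2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0",
    "2010-09-24 11:05:32,782 WARN util.NativeCodeLoader - Unable to load",
    "native-hadoop library for your platform... using builtin-java classes where",
    "2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Ontology",
    "2010-09-24 11:47:04,995 INFO fetcher.Fetcher - Fetcher: done",
]

for seconds, before, after in largest_gaps(sample, top=2):
    print(f"{seconds / 60:6.1f} min between:\n  {before}\n  {after}")
```

On the sample lines, the largest gap is the one that ends at "Fetcher: done".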
Thanks,
Steve
Re: What is nutch doing?
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-29 01:13, Steve Cohen wrote:
> fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
> fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
> fc12c078 *
> *org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18
> (line 180)
> fc12c078 *
> *org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38
> (line 234)
> fc12c078 *
> *org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50
> (line 184)
> fc12c078 *
> *org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992
> (line 555)
> fc16a680 *
> *org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V
> [compiled] +20 (line 226)
> fc16a680 *
> *org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120
This fragment of the stack trace suggests two things:
* you are running Fetcher in parsing mode. This is discouraged: if the
parser gets stuck or crashes, you will have to re-fetch from scratch.
* regex URL filtering can be slow at times. There are many weird URLs
out there; my favorite one was 64kB long and consisted partially of NULL
characters. Java regex can be VERY slow on such URLs, so slow
that the task appears to hang, and sometimes the TaskTracker decides it
really is hung and kills it. For large crawls I tend to avoid the regex
urlfilter and instead use a combination of prefix / suffix / domain /
custom filters that don't use regex, or I sanitize the URLs first.
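The last point can be sketched as follows. This is only an illustration in Python, not the Nutch plugin API: the length cap, the domain list, the suffix list, and the function names are all assumptions. It returns the URL when accepted and None when rejected, which mirrors the String-in/String-out shape of the filter(...) frames in the stack trace.

```python
from urllib.parse import urlsplit

# All of these values are illustrative assumptions, not Nutch defaults.
MAX_URL_LENGTH = 2048
ALLOWED_PREFIXES = ("http://", "https://")
ALLOWED_DOMAINS = {"example.org"}
REJECTED_SUFFIXES = (".gif", ".jpg", ".png", ".zip", ".exe")


def sanitize(url):
    """Return None for URLs that are too long or contain control characters."""
    if len(url) > MAX_URL_LENGTH:
        return None
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url):
        return None
    return url


def accept(url):
    """Return the URL if it passes prefix, domain, and suffix checks, else None."""
    url = sanitize(url)
    if url is None:
        return None
    if not url.lower().startswith(ALLOWED_PREFIXES):
        return None
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:  # malformed authority, e.g. an unbalanced IPv6 bracket
        return None
    # Domain check: exact match or a subdomain of an allowed domain.
    if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
        return None
    if parts.path.lower().endswith(REJECTED_SUFFIXES):
        return None
    return url
```

Every check here is a plain string comparison, so the cost stays proportional to the URL length, and the length cap bounds even that.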
> I have a feeling I know why it is only using one core. I set
> mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a
> setting for mapred.reduce.tasks which is set to 1. Do I need to up it to 4
> as well?
Yes. The first property limits how many reduce tasks a single tasktracker
can run at once, while the second property sets the default number of
reduce tasks in a job (jobs may override this setting, but usually
don't, so this is usually the number of reducers per job).
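As a rough model of how the two properties interact (a simplification that ignores scheduling details and other jobs sharing the cluster; the function name is hypothetical):

```python
def concurrent_reducers(reduce_tasks, slots_per_tracker, trackers=1):
    """Upper bound on the reduce tasks of one job that can run at the same time.

    reduce_tasks      -- mapred.reduce.tasks (reduce tasks per job)
    slots_per_tracker -- mapred.tasktracker.reduce.tasks.maximum
    trackers          -- number of tasktrackers (1 on a single machine)
    """
    return min(reduce_tasks, slots_per_tracker * trackers)


# Four slots but only one reduce task per job: one reducer runs.
print(concurrent_reducers(reduce_tasks=1, slots_per_tracker=4))  # 1
# Raising the per-job count to 4 lets all four slots be used.
print(concurrent_reducers(reduce_tasks=4, slots_per_tracker=4))  # 4
```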
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: What is nutch doing?
Posted by Steve Cohen <ma...@gmail.com>.
On Mon, Sep 27, 2010 at 1:26 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2010-09-27 17:24, Steve Cohen wrote:
>
>> Hello,
>>
>> I've been given the task of figuring out why nutch is running slower on
>> Solaris than on Linux with the same configuration. I am looking at the log
>> file and I see this big gap between the time fetcher stops fetching and it
>> says it is done and I would love to know what is going on. Here is the log
>> snippet.
>>
>> So at 11:04 the fetcher winds down and has no more threads to run. At 11:05
>> it logs a warning about not having native hadoop libraries (I am going to
>> build them today) and loads plugins. Then Fetcher reports that it is done at
>> 11:47, about 42 minutes later, and CrawlDb starts. What did Fetcher do in
>> that time?
>>
>
> It was diligently running the "reduce" phase, which consists of sorting and
> the reduce() proper. If you run Fetcher in the parsing mode then another
> possibility is that some of the parsers run slower on Solaris. Yet another
> possibility, that you mentioned, is that Hadoop can use the native
> compression libs on Linux, but there are no such libs pre-compiled for
> Solaris.
>
> Also, while reduce() speed is mostly determined by the Reducer
> implementation (and very little by IO), the sorting speed is very much
> dependent on disk IO and the size of the dataset that was partitioned to a
> given reduce task. All other config factors being equal, I suspect that your
> Solaris box could have a slower disk.
>
> You can verify these hypotheses with top/iostat/vmstat and see whether the
> tasks are bound by CPU or by diskwait.
I tried running nutch again after changing the job tracker setting from local
to localhost:<port> and setting mapred.tasktracker.reduce.tasks.maximum and
mapred.tasktracker.map.tasks.maximum, hoping it would use multiple threads
and cores to speed up the portion after fetcher.Fetcher reports activeThreads=0
but before it tells me that the fetcher is done.
It seems I need to do more configuration. It is only using one thread during
this down time.
But I do know what it is doing. I ran the Solaris command pstack on the pid
to see what is going on, and you were right. It is running map reduce.
----------------- lwp# 2 / thread# 2 --------------------
fc0e3324 *
*java/util/regex/Pattern$CharProperty.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z
[compiled]
fc0f9b9c *
*java/util/regex/Pattern$Slice.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z
[compiled] +152 (line 6979)
fc0f9b9c *
*java/util/regex/Pattern$Begin.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z+62
(line 6244)
fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
fc12c078 *
*org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18
(line 180)
fc12c078 *
*org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38
(line 234)
fc12c078 *
*org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50
(line 184)
fc12c078 *
*org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992
(line 555)
fc16a680 *
*org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V
[compiled] +20 (line 226)
fc16a680 *
*org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120
(line 186)
fc16a680 *
*org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V+20
(line 140)
fc16a680 *
*org/apache/hadoop/mapred/ReduceTask$3.collect(Ljava/lang/Object;Ljava/lang/Object;)V+14
(line 821)
fc16a680 *
*org/apache/hadoop/mapred/lib/IdentityReducer.reduce(Ljava/lang/Object;Ljava/util/Iterator;Lorg/apache/hadoop/mapred/OutputCollector;Lorg/apache/hadoop/mapred/Reporter;)V+36
(line 79)
fc005fd0 *
org/apache/hadoop/mapred/ReduceTask.run(Lorg/apache/hadoop/mapred/JobConf;Lorg/apache/hadoop/mapred/TaskUmbilicalProtocol;)V+610
(line 751)
fc005ab0 * org/apache/hadoop/mapred/Child.main([Ljava/lang/String;)V+440
(line 227)
fc00021c * StubRoutines (1)
fe5594fc
__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_
(fc0001c0, 35400, 1, 0, 779d6e98, fe37ff08) + 208
fe5fd1d4 jni_CallStaticVoidMethod (35510, 35c4c, 35848, 35400, 35840,
2c648) + 4b8
00013ab0 JavaMain (368e8, 2b8f4, 2af28, 35510, 4, fee32d24) + 15f0
ff2c8950 _lwp_start (0, 0, 0, 0, 0, 0)
I have a feeling I know why it is only using one core. I set
mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a
setting for mapred.reduce.tasks which is set to 1. Do I need to up it to 4
as well?
Thanks,
Steve Cohen
Re: What is nutch doing?
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-27 17:24, Steve Cohen wrote:
> Hello,
>
> I've been given the task of figuring out why nutch is running slower on
> Solaris than on Linux with the same configuration. I am looking at the log
> file and I see this big gap between the time fetcher stops fetching and it
> says it is done and I would love to know what is going on. Here is the log
> snippet.
>
> So at 11:04 the fetcher winds down and has no more threads to run. At 11:05 it
> logs a warning about not having native hadoop libraries (I am going to build
> them today) and loads plugins. Then Fetcher reports that it is done at 11:47,
> about 42 minutes later, and CrawlDb starts. What did Fetcher do in that time?
It was diligently running the "reduce" phase, which consists of sorting
and the reduce() proper. If you run Fetcher in the parsing mode then
another possibility is that some of the parsers run slower on Solaris.
Yet another possibility, that you mentioned, is that Hadoop can use the
native compression libs on Linux, but there are no such libs
pre-compiled for Solaris.
Also, while reduce() speed is mostly determined by the Reducer
implementation (and very little by IO), the sorting speed is very much
dependent on disk IO and the size of the dataset that was partitioned to
a given reduce task. All other config factors being equal, I suspect
that your Solaris box could have a slower disk.
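To make the two steps concrete, here is a toy in-memory version of a reduce phase: sort the records by key, then call the reducer once per key. It assumes everything fits in memory, whereas Hadoop merges sorted runs on disk, which is why disk IO matters for the sort step. The identity reducer mirrors the IdentityReducer frame visible in the pstack output elsewhere in this thread; the function names are invented for this sketch.

```python
from itertools import groupby
from operator import itemgetter


def identity_reduce(key, values):
    """Pass every (key, value) pair through unchanged."""
    for value in values:
        yield key, value


def run_reduce_phase(records, reducer=identity_reduce):
    """Sort (key, value) records by key, then call the reducer once per key."""
    # Step 1: the sort. In Hadoop this is the disk-heavy part.
    ordered = sorted(records, key=itemgetter(0))
    # Step 2: reduce() proper, once per distinct key.
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


records = [("b", 2), ("a", 1), ("b", 1)]
print(run_reduce_phase(records))
```

With an identity reducer the reduce() step itself does almost nothing, so any real cost comes from the sort and from whatever the output writer does with each record.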
You can verify these hypotheses with top/iostat/vmstat and see whether
the tasks are bound by CPU or by diskwait.
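Alongside those tools, the same CPU-versus-wait question can be asked from inside a single-threaded piece of code by comparing CPU time with wall-clock time. This is a different technique from top/iostat/vmstat, offered only as a sketch; the 0.8 threshold and the function names are arbitrary assumptions.

```python
import time


def measure(work):
    """Run work() and return (cpu_seconds, wall_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return time.process_time() - cpu_start, time.perf_counter() - wall_start


def classify(cpu_seconds, wall_seconds, threshold=0.8):
    """Label a run as cpu-bound when most of the wall time was spent on CPU."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    ratio = cpu_seconds / wall_seconds
    return "cpu-bound" if ratio >= threshold else "wait-bound"


# Sleeping uses almost no CPU, so this run is classified as waiting.
cpu, wall = measure(lambda: time.sleep(0.2))
print(classify(cpu, wall))
```

Note that process_time() sums CPU time over all threads of the process, so the ratio is only easy to interpret for single-threaded work.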
--
Best regards,
Andrzej Bialecki <><