Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/27 13:54:52 UTC
Output for plugin.PluginRepository repeats in logs
Hi,
I have seen this in every version of Nutch I've used. The snippet below repeats many times for each job I execute. Why?
2010-09-26 00:13:50,036 INFO plugin.PluginRepository - Plugins: looking in: /home/markus/src/nutch/branch-1.2/build/plugins
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Registered Plugins:
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - More Indexing Filter (index-more)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Subcollection indexing and query filter (subcollection)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Registered Extension-Points:
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-26 00:13:50,209 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
Cheers,
Re: Output for plugin.PluginRepository repeats in logs
Posted by Markus Jelsma <ma...@buyways.nl>.
I see, complex indeed. I'll manage for now. Thanks for your answer.
On Tuesday 28 September 2010 14:18:06 Andrzej Bialecki wrote:
> On 2010-09-28 13:55, Markus Jelsma wrote:
> > Thanks. Could we modify the code so it will only output the info before
> > the tasks are initialized? If so, how to proceed?
>
> This is a bit tricky, because the code is executed differently depending
> on whether it executes in local mode (or from a local application) or
> in distributed mode (or from one of the mapreduce tasks).
>
> In local mode resources are taken from a classpath determined during the
> execution of the driver application (the one with main()), and these may
> include (and often do!) multiple copies of local files, such as
> conf/nutch-site.xml and nutch-site.xml that is packed inside a job jar.
> Furthermore, plugins in local mode are NOT loaded from nutch.job, but
> instead from the plugins/ directory... so their composition may be
> different than the one that is used by distributed tasks.
>
> Now, the crux of the matter is that in order to print this list only
> once you would have to do this from the driver application - but when
> you run Nutch in distributed mode the driver application uses a
> different classpath than each of the tasks will use, so the list could
> be different, which would be very confusing...
>
> All in all, I think it's best to print it possibly many times from
> tasks, or not at all. This choice could be implemented as a logging
> level, or as a config property.
>
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Output for plugin.PluginRepository repeats in logs
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-28 13:55, Markus Jelsma wrote:
> Thanks. Could we modify the code so it will only output the info before the
> tasks are initialized? If so, how to proceed?
This is a bit tricky, because the code is executed differently depending
on whether it executes in local mode (or from a local application) or
in distributed mode (or from one of the mapreduce tasks).
In local mode resources are taken from a classpath determined during the
execution of the driver application (the one with main()), and these may
include (and often do!) multiple copies of local files, such as
conf/nutch-site.xml and nutch-site.xml that is packed inside a job jar.
Furthermore, plugins in local mode are NOT loaded from nutch.job, but
instead from the plugins/ directory... so their composition may be
different than the one that is used by distributed tasks.
Now, the crux of the matter is that in order to print this list only
once you would have to do this from the driver application - but when
you run Nutch in distributed mode the driver application uses a
different classpath than each of the tasks will use, so the list could
be different, which would be very confusing...
All in all, I think it's best to print it possibly many times from
tasks, or not at all. This choice could be implemented as a logging
level, or as a config property.
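
[Editorial sketch, not part of the original message.] The config-property
option mentioned above might look roughly like the following. This is a
minimal sketch only: the class, the method and the property name
"demo.plugin.list.log" are all hypothetical and are not part of Nutch. The
check runs inside each task, so whatever is printed still reflects that
task's own classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class PluginListLogging {
    // Hypothetical property name, invented for this sketch (not a Nutch property).
    static final String LOG_PROPERTY = "demo.plugin.list.log";

    // Returns the lines a repository would log when it is initialized.
    static List<String> initRepository(Properties conf, List<String> plugins) {
        List<String> out = new ArrayList<>();
        // Defaulting to "true" keeps the existing behaviour: print on every init.
        boolean verbose = Boolean.parseBoolean(conf.getProperty(LOG_PROPERTY, "true"));
        if (verbose) {
            out.add("Registered Plugins:");
            for (String p : plugins) {
                out.add("  " + p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> plugins = Arrays.asList("urlnormalizer-basic", "parse-html");
        Properties quiet = new Properties();
        quiet.setProperty(LOG_PROPERTY, "false");
        // Default configuration: header line plus one line per plugin.
        System.out.println(initRepository(new Properties(), plugins).size()); // prints 3
        // Property set to false: nothing is logged.
        System.out.println(initRepository(quiet, plugins).size()); // prints 0
    }
}
```

The logging-level alternative would need no code change at all if the list
were logged at a lower level and the logger for the repository class were
configured accordingly.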
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Output for plugin.PluginRepository repeats in logs
Posted by Markus Jelsma <ma...@buyways.nl>.
Thanks. Could we modify the code so it will only output the info before the
tasks are initialized? If so, how to proceed?
On Monday 27 September 2010 14:49:54 Andrzej Bialecki wrote:
> On 2010-09-27 13:54, Markus Jelsma wrote:
> > Hi,
> >
> >
> >
> > I have seen this in every version of Nutch I've used. The snippet below
> > repeats many times for each job I execute. Why?
>
> It's printed as many times as the PluginRepository is initialized, and
> this in turn happens for each map or reduce task where plugins are used.
> Since typically there are many tasks per job, this initialization occurs
> many times.
>
> Traditionally this information was logged at INFO level to help in
> debugging plugin config issues... but I agree that printing it over and
> over again is overkill.
>
Re: Output for plugin.PluginRepository repeats in logs
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-27 13:54, Markus Jelsma wrote:
> Hi,
>
>
>
> I have seen this in every version of Nutch I've used. The snippet below repeats many times for each job I execute. Why?
It's printed as many times as the PluginRepository is initialized, and
this in turn happens for each map or reduce task where plugins are used.
Since typically there are many tasks per job, this initialization occurs
many times.
Traditionally this information was logged at INFO level to help in
debugging plugin config issues... but I agree that printing it over and
over again is overkill.
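
[Editorial sketch, not part of the original message.] The effect described
above can be illustrated with a small simulation. DemoRepository is a
hypothetical stand-in, not the Nutch PluginRepository class, and the loop
merely imitates several tasks each building their own repository.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatedInitDemo {
    // Hypothetical stand-in for a plugin repository; not the Nutch class.
    static class DemoRepository {
        DemoRepository(List<String> log) {
            // Every construction reports the registered plugins again.
            log.add("Registered Plugins:");
            log.add("  Basic URL Normalizer (urlnormalizer-basic)");
        }
    }

    // Simulates one job split into `tasks` map or reduce tasks, each of
    // which initializes its own repository and therefore logs the list.
    static List<String> runJob(int tasks) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            new DemoRepository(log);
        }
        return log;
    }

    public static void main(String[] args) {
        // Three tasks produce the same two lines three times over.
        runJob(3).forEach(System.out::println);
    }
}
```

With three simulated tasks the two log lines appear three times, which is
the same pattern as the repeated block in the job logs.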
--
Best regards,
Andrzej Bialecki <><