Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/27 13:54:52 UTC

Output for plugin.PluginRepository repeats in logs

Hi,

 

I have seen this in all versions of Nutch I've used. The snippet below keeps repeating itself many times for each job I execute. Why?

 

 

2010-09-26 00:13:50,036 INFO  plugin.PluginRepository - Plugins: looking in: /home/markus/src/nutch/branch-1.2/build/plugins
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository - Registered Plugins:
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         More Indexing Filter (index-more)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Subcollection indexing and query filter (subcollection)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository - Registered Extension-Points:
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-26 00:13:50,209 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
 

Cheers,

Re: Output for plugin.PluginRepository repeats in logs

Posted by Markus Jelsma <ma...@buyways.nl>.
I see, complex indeed. I'll manage for now. Thanks for your answer.

On Tuesday 28 September 2010 14:18:06 Andrzej Bialecki wrote:
> On 2010-09-28 13:55, Markus Jelsma wrote:
> > Thanks. Could we modify the code so it will only output the info before
> > the tasks are initialized? If so, how to proceed?
> 
> This is a bit tricky, because the code is executed differently depending
> on whether it executes in local mode (or from a local application) or
> in distributed mode (or from one of the mapreduce tasks).
> 
> In local mode resources are taken from a classpath determined during the
> execution of the driver application (the one with main()), and these may
> include (and often do!) multiple copies of local files, such as
> conf/nutch-site.xml and nutch-site.xml that is packed inside a job jar.
> Furthermore, plugins in local mode are NOT loaded from nutch.job, but
> instead from the plugins/ directory... so their composition may be
> different than the one that is used by distributed tasks.
> 
> Now, the crux of the matter is that in order to print this list only
> once you would have to do this from the driver application - but when
> you run Nutch in distributed mode the driver application uses a
> different classpath than each of the tasks will use, so the list could
> be different, which would be very confusing...
> 
> All in all, I think it's best to print it possibly many times from
> tasks, or not at all. This choice could be implemented as a logging
> level, or as a config property.
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Output for plugin.PluginRepository repeats in logs

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-28 13:55, Markus Jelsma wrote:
> Thanks. Could we modify the code so it will only output the info before the
> tasks are initialized? If so, how to proceed?

This is a bit tricky, because the code is executed differently depending
on whether it executes in local mode (or from a local application) or
in distributed mode (or from one of the mapreduce tasks).

In local mode resources are taken from a classpath determined during the 
execution of the driver application (the one with main()), and these may 
include (and often do!) multiple copies of local files, such as 
conf/nutch-site.xml and nutch-site.xml that is packed inside a job jar. 
Furthermore, plugins in local mode are NOT loaded from nutch.job, but 
instead from the plugins/ directory... so their composition may be 
different than the one that is used by distributed tasks.

Now, the crux of the matter is that in order to print this list only 
once you would have to do this from the driver application - but when 
you run Nutch in distributed mode the driver application uses a 
different classpath than each of the tasks will use, so the list could 
be different, which would be very confusing...
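To make the mismatch concrete, here is a toy sketch that compares the plugin list a driver might see with the one a task might see. The class name ClasspathMismatchSketch and both plugin sets are invented for illustration; this is not Nutch code and does not inspect a real classpath.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: both plugin sets below are made up.
class ClasspathMismatchSketch {
    // Returns the plugins present in view a but missing from view b.
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical: what the driver finds in its local plugins/ directory.
        Set<String> driverView = Set.of("parse-html", "index-basic", "urlfilter-regex");
        // Hypothetical: what a distributed task finds inside the job file.
        Set<String> taskView = Set.of("parse-html", "index-basic", "parse-tika");

        System.out.println("Only seen by driver: " + onlyIn(driverView, taskView));
        System.out.println("Only seen by task:   " + onlyIn(taskView, driverView));
    }
}
```

If the listing were printed once from the driver, anything in the second line would be active in the tasks without ever showing up in the log.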

All in all, I think it's best to print it possibly many times from 
tasks, or not at all. This choice could be implemented as a logging 
level, or as a config property.
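Either option might look roughly like the sketch below. It is a standalone illustration built on java.util.logging and java.util.Properties rather than Nutch's own logging and Configuration classes. The property name plugin.repository.log.listing and the class PluginListingSketch are hypothetical, invented for this example.

```java
import java.util.List;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: not the actual Nutch PluginRepository code.
class PluginListingSketch {
    // Hypothetical property name, invented for this example.
    static final String LOG_LISTING_KEY = "plugin.repository.log.listing";

    private static final Logger LOG = Logger.getLogger("plugin.PluginRepository");

    // Config-property option: the listing can be switched off entirely.
    static boolean listingEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(LOG_LISTING_KEY, "true"));
    }

    // Logs the listing at the given level and returns the number of lines
    // handed to the logger (0 if disabled by config or filtered by level).
    static int logPlugins(Properties conf, Level level, List<String> pluginNames) {
        if (!listingEnabled(conf) || !LOG.isLoggable(level)) {
            return 0;
        }
        LOG.log(level, "Registered Plugins:");
        for (String name : pluginNames) {
            LOG.log(level, "        " + name);
        }
        return pluginNames.size() + 1;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        List<String> plugins = List.of("parse-html", "index-basic");

        // Logging-level option: at FINE the listing is skipped unless the
        // logger has been configured to be more verbose than INFO.
        System.out.println("FINE lines: " + logPlugins(conf, Level.FINE, plugins));

        // Config-property option: disabled regardless of level.
        conf.setProperty(LOG_LISTING_KEY, "false");
        System.out.println("disabled lines: " + logPlugins(conf, Level.INFO, plugins));
    }
}
```

The logging-level route needs no new property but ties the listing to whatever else is logged at that level. The property route keeps the two concerns separate.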

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Output for plugin.PluginRepository repeats in logs

Posted by Markus Jelsma <ma...@buyways.nl>.
Thanks. Could we modify the code so it will only output the info before the 
tasks are initialized? If so, how to proceed?

On Monday 27 September 2010 14:49:54 Andrzej Bialecki wrote:
> On 2010-09-27 13:54, Markus Jelsma wrote:
> > Hi,
> >
> >
> >
> > I have seen this in all versions of Nutch I've used. The snippet below
> > keeps repeating itself many times for each job I execute. Why?
> 
> It's printed as many times as the PluginRepository is initialized, and
> this in turn happens for each map or reduce task where plugins are used.
> Since typically there are many tasks per job, this initialization occurs
> many times.
> 
> Traditionally this information was logged at INFO level to help in
> debugging plugin config issues... but I agree that printing it over and
> over again is overkill.
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Output for plugin.PluginRepository repeats in logs

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-27 13:54, Markus Jelsma wrote:
> Hi,
>
>
>
> I have seen this in all versions of Nutch I've used. The snippet below keeps repeating itself many times for each job I execute. Why?

It's printed as many times as the PluginRepository is initialized, and 
this in turn happens for each map or reduce task where plugins are used. 
Since typically there are many tasks per job, this initialization occurs 
many times.

Traditionally this information was logged at INFO level to help in 
debugging plugin config issues... but I agree that printing it over and 
over again is overkill.
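The effect can be modelled with a toy sketch in which every simulated task builds its own repository and the constructor writes the listing header. ToyRepository, runJob and the task ids are made up for illustration; this is not how Nutch or Hadoop is actually implemented, only a picture of "one initialization per task, one listing per initialization".

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Nutch code: one repository per task, one listing per repository.
class RepeatedListingSketch {
    // Stand-in for a plugin repository whose constructor logs the listing.
    static class ToyRepository {
        ToyRepository(List<String> log, String taskId) {
            log.add(taskId + " INFO  plugin.PluginRepository - Registered Plugins:");
        }
    }

    // Runs numTasks simulated tasks; each one initializes its own repository.
    static List<String> runJob(int numTasks) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            new ToyRepository(log, "task-" + i);
        }
        return log;
    }

    public static void main(String[] args) {
        // Four tasks produce four copies of the listing header.
        for (String line : runJob(4)) {
            System.out.println(line);
        }
    }
}
```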

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com