Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2012/04/16 14:15:54 UTC

[Nutch Wiki] Trivial Update of "AboutPlugins" by LewisJohnMcgibbney

The "AboutPlugins" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/AboutPlugins?action=diff&rev1=8&rev2=9

  Nutch's plugin system is based on the one used in [[http://www.eclipse.org/articles/Article-Plug-in-architecture/plugin_architecture.html|Eclipse 2.x]].  Plugins are central to how nutch works.  All of the parsing, indexing and searching that nutch does is actually accomplished by various plugins.
  
- In writing a plugin, you're actually providing one or more ''extensions'' of the existing ''extension-points'' . The core Nutch ''extension-points'' are themselves defined in a plugin, the [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/plugin/ExtensionPoint.html|NutchExtensionPoints]] plugin (they are listed in the !NutchExtensionPoints [[http://svn.apache.org/viewcvs.cgi/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml?view=markup|plugin.xml]] file). Each ''extension-point'' defines an interface that must be implemented by the ''extension''. The core extension points are:
+ In writing a plugin, you're actually providing one or more ''extensions'' of the existing ''extension-points'' . The core Nutch ''extension-points'' are themselves defined in a plugin, the [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/plugin/ExtensionPoint.html|NutchExtensionPoints]] plugin (they are listed in the !NutchExtensionPoints [[http://svn.apache.org/viewcvs.cgi/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml?view=markup|plugin.xml]] file). Each ''extension-point'' defines an interface that must be implemented by the ''extension''. The core extension points are:
  
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/clustering/OnlineClusterer.html|OnlineClusterer]] -- An extension point interface for online search results clustering algorithms (from javadoc).
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/indexer/IndexingFilter.html|IndexingFilter]] -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/indexer/IndexingFilter.html|IndexingFilter]] -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/ontology/Ontology.html|Ontology]]
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/parse/Parser.html|Parser]] -- Parser implementations read through fetched documents in order to extract data to be indexed.  This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/parse/Parser.html|Parser]] -- Parser implementations read through fetched documents in order to extract data to be indexed.  This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/parse/HtmlParseFilter.html|HtmlParseFilter]] -- Permits one to add additional metadata to HTML parses (from javadoc).
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/parse/HtmlParseFilter.html|HtmlParseFilter]] -- Permits one to add additional metadata to HTML parses (from javadoc).
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/protocol/Protocol.html|Protocol]] -- Protocol implementations allow nutch to use different protocols (ftp, http, etc.) to fetch documents.
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/protocol/Protocol.html|Protocol]] -- Protocol implementations allow nutch to use different protocols (ftp, http, etc.) to fetch documents.
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/searcher/QueryFilter.html|QueryFilter]] -- Extension point for query translation. Permits one to add metadata to a query (from javadoc).
-  * [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/net/URLFilter.html|URLFilter]] -- URLFilter implementations limit the URLs that nutch attempts to fetch.  The [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/net/RegexURLFilter.html|RegexURLFilter]] distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLFilter.html|URLFilter]] -- URLFilter implementations limit the URLs that nutch attempts to fetch.  The [[http://nutch.apache.org/apidocs-1.1/org/apache/nutch/net/RegexURLFilter.html|RegexURLFilter]] distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html|URLNormalizer]] -- Interface used to convert URLs to normal form and optionally perform substitutions.
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/scoring/ScoringFilter.html|ScoringFilter]] -- A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments. 
+  * [[http://nutch.apache.org/apidocs-1.4/org/apache/nutch/segment/SegmentMergeFilter.html|SegmentMergeFilter]] -- Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page. 
   * [[http://svn.apache.org/viewcvs.cgi/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java?view=markup|NutchAnalyzer]] -- An extension point that provides some language specific analyzers (see MultiLingualSupport proposal). ''Since it is in development stage, it is not in released javadoc''.
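
As a rough illustration of what "an extension implements the interface defined by an extension-point" means, here is a minimal Java sketch modelled on the URLFilter case. It does not use the Nutch classes: `UrlFilterSketch` is a hypothetical stand-in that only mirrors the shape of the URLFilter contract (return the URL to accept it, null to reject it), and `HostSuffixFilter`, `FilterChainDemo` and `applyAll` are invented names. A real plugin would implement the Nutch interface itself and be declared in its own plugin.xml, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the URLFilter extension point (not the Nutch class).
// Assumed contract: return the URL to accept it, or null to reject it.
interface UrlFilterSketch {
    String filter(String urlString);
}

// Hypothetical extension: accept only http(s) URLs on a given host or its subdomains.
class HostSuffixFilter implements UrlFilterSketch {
    private final Pattern allowed;

    HostSuffixFilter(String hostSuffix) {
        // Optional "subdomain." prefix, then the literal host, then a port, a path, or nothing.
        this.allowed = Pattern.compile(
            "^https?://([^/:]+\\.)?" + Pattern.quote(hostSuffix) + "([:/].*)?$");
    }

    @Override
    public String filter(String urlString) {
        return allowed.matcher(urlString).matches() ? urlString : null;
    }
}

public class FilterChainDemo {
    // Apply each filter in order; the first filter that returns null rejects the URL.
    static String applyAll(List<UrlFilterSketch> filters, String url) {
        for (UrlFilterSketch f : filters) {
            if (url == null) {
                return null;
            }
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        List<UrlFilterSketch> filters =
            Arrays.<UrlFilterSketch>asList(new HostSuffixFilter("apache.org"));
        System.out.println(applyAll(filters, "http://nutch.apache.org/"));
        System.out.println(applyAll(filters, "http://example.com/"));
    }
}
```

The chaining in `applyAll` is only one plausible way to combine several extensions of the same extension-point; how Nutch orders and combines the filters it loads should be checked against the javadoc linked above.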