Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/07/13 11:26:52 UTC

[Nutch Wiki] Trivial Update of "OldPluginCentral" by LewisJohnMcgibbney

The "OldPluginCentral" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/OldPluginCentral

New page:
OldPluginCentral is a repository for pre-Nutch 1.3 plugins. In retrospect, it contains a wealth of Nutch plugin resources as well as tutorials for building plugins.

== Plugins that Come with Nutch (0.9) ==

To have Nutch use any of these plugins, edit your conf/nutch-site.xml file and add the name of the plugin to the plugin.includes property.
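
As a rough sketch of what such an edit might look like, the fragment below overrides plugin.includes in conf/nutch-site.xml. The particular selection of plugins is only an illustration assembled from names in the list that follows, not the default shipped with Nutch, and it assumes the property value is read as a regular expression matched against plugin names, as in Nutch's stock configuration. A working installation will probably need further plugins beyond those shown here.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of the default settings. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value only (an assumption, not the shipped default).
         Alternatives are separated by "|"; a group such as parse-(text|html)
         covers both parse-text and parse-html. Here languageidentifier has
         been added to a small base set taken from the list on this page. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|languageidentifier</value>
  </property>
</configuration>
```

Keeping the override in nutch-site.xml rather than changing the defaults file means the local plugin selection stays separate from the settings distributed with Nutch.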

 * '''[[ClusteringPlugin|clustering-carrot2]]''' - Online Search Results Clustering using Carrot2's components.
 * '''creativecommons''' - Support for crawling and searching Creative-Commons licensed content.
 * '''index-basic''' - Adds url, content and anchor fields to the index.
 * '''index-more''' - Adds date, content-length, contentType, primaryType and subtype fields to the index.
 * '''languageidentifier''' - Adds a lang field to the index and allows you to query against it.
 * '''[[OntologyPlugin|ontology]]''' - Helps refine queries based on OWL files.
 * '''parse-ext''' - A wrapper that invokes an external command to do the actual parsing.
 * '''parse-html''' - Parses HTML documents
 * '''parse-js''' - Parses Java``Script
 * '''parse-mp3''' - Parses MP3s
 * '''parse-zip''' - Parses ZIP archives
 * '''parse-mspowerpoint''' - Parses Microsoft PowerPoint files
 * '''parse-msword''' - Parses MS Word documents
 * '''parse-msexcel''' - Parses MS Excel documents
 * '''parse-pdf''' - Parses PDFs
 * '''parse-rss''' - Parses RSS feeds
 * '''parse-oo''' - Parses OpenOffice files
 * '''parse-swf''' - Parses Shockwave Flash
 * '''parse-rtf''' - Parses RTF files
 * '''parse-text''' - Parses text documents
 * '''protocol-file''' - Retrieves documents from the filesystem
 * '''protocol-ftp''' - Retrieves documents through FTP
 * '''protocol-http''' - Retrieves documents through HTTP
 * '''protocol-httpclient''' - Retrieves documents through HTTP and HTTPS
 * '''query-basic''' - Runs queries against content, url and anchor fields
 * '''query-more''' - Runs queries against date, content-length, contentType, primaryType and subType fields.
 * '''query-site''' - Runs queries against site field
 * '''query-url''' - Runs queries against url field.
 * '''urlfilter-prefix'''
 * '''urlfilter-regex'''

== Additional Plugins in Dev Branch (0.8) ==

 * '''analysis-de'''
 * '''analysis-fr'''
 * '''lib-commons-httpclient'''
 * '''lib-http'''
 * '''lib-jakarta-poi'''
 * '''lib-log4j''' 
 * '''lib-lucene-analyzers''' - Lucene analyzers
 * '''lib-nekohtml''' - Automatic tag balancer
 * '''lib-parsems''' - Framework for parsing MS documents
 * '''parse-msexcel''' - Parses MS Excel documents
 * '''parse-mspowerpoint''' - Parses MS PowerPoint documents
 * '''parse-oo''' - Parses OpenOffice and StarOffice documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
 * '''parse-swf''' - Parses Flash SWF files
 * '''microformats-reltag''' - Adds [[http://www.microformats.org/wiki/Rel-Tag|rel-tag]] fields to the index and runs queries against them.
 * '''parse-zip'''