Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2008/08/26 23:46:35 UTC
[Nutch Wiki] Update of "Features" by Paul Ruiz
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Paul Ruiz:
http://wiki.apache.org/nutch/Features
------------------------------------------------------------------------------
* Judging from the names of the available parser plugins, this is probably the list. However, only the plain text and HTML parsers are enabled by default. To handle other document types, edit conf/nutch-site.xml and add the corresponding plugins to the value of the plugin.includes property:
* Plain Text (plugin: parse-text)
* HTML (parse-html)
+ * XML (parse-xml) uses XPath and namespaces to do the mapping between XML elements and Lucene fields.
* Java``Script (for extracting links only?) (parse-js)
* Microsoft PowerPoint, .ppt files (parse-mspowerpoint)
* Microsoft Word, .doc files (parse-msword)
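
As a rough illustration of the plugin.includes edit described above, the fragment below sketches a conf/nutch-site.xml override that enables the Word and PowerPoint parsers alongside the default text and HTML ones. The parse-* names come from the list above. The remaining entries (protocol-http, urlfilter-regex, index-basic, query-*) are assumptions about a typical installation and should be copied from the plugin.includes value in your own conf/nutch-default.xml rather than from here.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: a minimal sketch, not a complete configuration -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- The value is a regular expression matched against plugin names.
         Only the parse-(...) group reflects the list above; every other
         entry is an assumed default and may differ in your version. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|msword|mspowerpoint)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```

A property set in nutch-site.xml replaces the default value as a whole, so the non-parser plugins you still need have to be repeated in the expression.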