Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/01/02 23:35:40 UTC
[Nutch Wiki] Update of "Features" by KurosakaTeruhiko
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by KurosakaTeruhiko:
http://wiki.apache.org/nutch/Features
------------------------------------------------------------------------------
*How does the search engine handle punctuation and special characters? (and what's configurable?)
*Which document formats are supported?
+ * Judging from the names of the available parser plugins, the supported formats are probably:
+ * Plain text (in a single preconfigured charset only)
+ * HTML (in almost any charset)
+ * JavaScript (for extracting links only?)
+ * Microsoft PowerPoint (.ppt files)
+ * Microsoft Word (.doc files)
+ * Adobe PDF
+ * RSS
+ * RTF
+ * MP3 (?) Does an MP3 file contain any text?
+ * ZIP (?) This seems to expand a ZIP archive of plain-text files and return their concatenated text.
+
*What post-coordination options are available? (hey Karen, what does this mean?)
*How easy is Nutch to configure?