Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/01/02 23:35:40 UTC

[Nutch Wiki] Update of "Features" by KurosakaTeruhiko

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KurosakaTeruhiko:
http://wiki.apache.org/nutch/Features

------------------------------------------------------------------------------
  
   *How does the search engine handle punctuation and special characters? (and what's configurable?)
   *Which document formats are supported?
+   * Guessing from the names of the available parser plugins, this is probably the list:
+    * Plain text (in a fixed, preconfigured charset only)
+    * HTML (in almost any charset)
+    * JavaScript (for extracting links only?)
+    * Microsoft PowerPoint (.ppt files)
+    * Microsoft Word (.doc files)
+    * Adobe PDF
+    * RSS
+    * RTF
+    * MP3 (?) Is there any text in MP3?
+    * ZIP (?) This seems to expand a zip archive of plain text files and return the concatenated text.
+ 
   *What post-coordination options are available? (hey Karen, what does this mean?)
  
   *How easy is Nutch to configure?