Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/01/03 06:32:01 UTC

[Lucene-java Wiki] Update of "LuceneFAQ" by HossMan

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by HossMan:
http://wiki.apache.org/lucene-java/LuceneFAQ

The comment on the change is:
be more explicit in question for people who skim instead of search

------------------------------------------------------------------------------
  See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].
  
  
- ==== How can I index file formats like OpenDocument, MS-Word, MS-Excel, etc? ====
+ ==== How can I index file formats like OpenDocument (aka OpenOffice.org), Microsoft Word, Excel, PowerPoint, Visio, etc? ====
  
  Have a look at [http://lucene.apache.org/tika/ Tika, the content analysis toolkit].
  
- Some background information: Many modern office file formats (.odt, .sxw, .sxc, etc) are ZIP archives that contain XML files. You can uncompress the file using Java's ZIP support, then parse e.g. meta.xml to get the title and e.g. content.xml to get the document's content. You can then add these to the Lucene index, typically using one Lucene field per property.
+ Alternately: Many modern office file formats (.odt, .sxw, .sxc, etc) are ZIP archives that contain XML files. You can uncompress the file using Java's ZIP support, then parse e.g. meta.xml to get the title and e.g. content.xml to get the document's content. You can then add these to the Lucene index, typically using one Lucene field per property.
  
  You can also use the LIUS framework (http://www.bibl.ulaval.ca/lius/) to index !OpenOffice.org documents. LIUS supports metadata and full-text indexing using XPath.
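
The ZIP-plus-XML approach described in the FAQ entry above can be sketched roughly as follows. This is a minimal illustration using only JDK classes, not the method Tika or LIUS uses. The class name OdfTextExtractor, its method names, and the "title"/"content" map keys are hypothetical. It assumes the title lives in a Dublin Core dc:title element inside meta.xml, and it flattens content.xml by joining all text nodes with spaces, which can split a word that spans several inline elements.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: pulls a title and the body text out of a ZIP-based office file.
final class OdfTextExtractor {

    // Dublin Core namespace, assumed here to qualify the <dc:title> element in meta.xml.
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns a map with the keys "title" and "content" (empty strings if not found). */
    static Map<String, String> extractFields(String path)
            throws IOException, ParserConfigurationException, SAXException {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(path)) {
            String title = "";
            Document meta = parseEntry(zip, "meta.xml");
            if (meta != null) {
                NodeList titles = meta.getElementsByTagNameNS(DC_NS, "title");
                if (titles.getLength() > 0) {
                    title = titles.item(0).getTextContent().trim();
                }
            }
            fields.put("title", title);

            StringBuilder text = new StringBuilder();
            Document content = parseEntry(zip, "content.xml");
            if (content != null) {
                collectText(content.getDocumentElement(), text);
            }
            fields.put("content", text.toString());
        }
        return fields;
    }

    /** Parses one XML entry of the archive, or returns null if the entry is absent. */
    private static Document parseEntry(ZipFile zip, String entryName)
            throws IOException, ParserConfigurationException, SAXException {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) {
            return null;
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Resolve any external DTD reference to an empty document so parsing
        // does not depend on a DTD file that is not available locally.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        try (InputStream in = zip.getInputStream(entry)) {
            return builder.parse(in);
        }
    }

    /** Appends every non-blank text node under the given node, separated by single spaces. */
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }
}
```

Each entry of the returned map could then be added to a Lucene document as its own field, in line with the "one Lucene field per property" suggestion. The exact field-construction calls depend on the Lucene version in use, so they are left out here.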