Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/03 17:21:36 UTC

[Solr Wiki] Trivial Update of "ExtractingRequestHandler" by EricPugh

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by EricPugh.
The comment on this change is: fix URLs to the Tika project now that it's out of incubation.  Don't deep link to the formats page since it is version dependent and Tika versions change.
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=66&rev2=67

--------------------------------------------------

  = Introduction =
  <!> [[Solr1.4]]
  
- A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats.  The [[http://incubator.apache.org/tika/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
+ A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats.  The [[http://tika.apache.org/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
  
  Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
  
@@ -17, +17 @@

   * Tika will automatically attempt to determine the input document type (word, pdf, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter.
   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
   * Solr then reacts to Tika's SAX events and creates the fields to index.
-  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore.  See http://lucene.apache.org/tika/formats.html for the file types supported.
+  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore.  See http://tika.apache.org/ site for the file types supported.
   * All of the extracted text is added to the "content" field.
   * We can map Tika's metadata fields to Solr fields.  We can also boost these fields.
   * We can also pass in literals for field values.
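
  The options listed above (an explicit MIME type, field mapping, boosts, and literals) can be sketched as a small helper that assembles the request URL. Only stream.type is named in this excerpt. The /update/extract path, the literal., fmap. and boost. parameter prefixes, the commit parameter, and the localhost base URL are assumptions here and should be checked against the full page and your solrconfig.xml. The helper name build_extract_url is hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, stream_type=None,
                      field_map=None, boosts=None, literals=None,
                      commit=True):
    """Assemble a URL for posting a binary file to the extracting handler.

    Assumed (not stated in this excerpt): the handler path and the
    literal./fmap./boost. parameter prefixes.
    """
    # A literal value supplies the document's unique key.
    params = [("literal.id", doc_id)]

    # Optionally override Tika's automatic type detection.
    if stream_type:
        params.append(("stream.type", stream_type))

    # Map Tika field names (source) to Solr field names (target).
    for source, target in (field_map or {}).items():
        params.append((f"fmap.{source}", target))

    # Index-time boosts for selected fields.
    for field, boost in (boosts or {}).items():
        params.append((f"boost.{field}", str(boost)))

    # Additional literal field values passed in with the request.
    for field, value in (literals or {}).items():
        params.append((f"literal.{field}", value))

    if commit:
        params.append(("commit", "true"))

    return f"{base_url}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url(
        "http://localhost:8983/solr",   # assumed default local install
        "doc1",
        stream_type="application/pdf",
        field_map={"content": "text"},
        boosts={"title": 2.5},
        literals={"category": "manuals"},
    ))
```

  The file itself would then be sent as the body of a POST to the resulting URL. The sketch stops at URL construction so it can be run without a Solr server.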
@@ -224, +224 @@

   * Commit
  
  = Additional Resources =
- * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.7/formats.html|Supported document formats via Tika (0.7)]]
+ * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.9/formats.html|Supported document formats via Tika (0.9)]]
  
  = What's in a Name =
  Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction".  This then led to an "acronym":  Solr CEL, which then gets mashed to: Solr Cell.  Hence, the project name is "Solr Cell".  It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell