Posted to commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2018/05/26 16:13:34 UTC
[Solr Wiki] Update of "ExtractingRequestHandler" by ShawnHeisey

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by ShawnHeisey:
https://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=84&rev2=85

Comment:
Add link to the page describing why the handler isn't recommended for production use. Improve first line of page.

- a.k.a the "Solr Cell" project!
+ a.k.a the !SolrCell project!
  
  <<TableOfContents>>
  
@@ -10, +10 @@

  A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats.  The [[http://tika.apache.org/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
  
  Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
+ 
+ Although this handler is a great proof of concept, it is not actually recommended for production use.  See RecommendCustomIndexingWithTika for more information.
  
  = Concepts =
  Before getting started, there are a few concepts that are helpful to understand.
@@ -143, +145 @@

  }}}
  == MultiCore config ==
   * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib
-  * Lib resources in solrconfig.xml must point to the lib folder relative form where the actual used solrconfig.xml. 
+  * Lib resources in solrconfig.xml must point to the lib folder relative form where the actual used solrconfig.xml.
-  * For multi cores with common solrconfig and schema the can use the same instanceDir  
+  * For multi cores with common solrconfig and schema the can use the same instanceDir
  
  = Metadata =
  As has been implied up to now, Tika produces Metadata about the document.  Metadata often contains things like the author of the file or the number of pages, etc.  The Metadata produced depends on the type of document submitted.  For instance, PDFs have different metadata from Word docs.
@@ -277, +279 @@

   * Commit
  
  = Additional Resources =
-  * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] 
+  * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]]
   * [[http://tika.apache.org/1.2/formats.html|Supported document formats via Tika (1.2)]]
  
  = What's in a Name =