Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/06/16 01:08:38 UTC

[Solr Wiki] Update of "DocumentProcessing" by levitski

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by levitski:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=15&rev2=16

  
  = Problem =
  Solr would benefit from a flexible document processing framework that meets the requirements of enterprise-grade content integration. Most search projects need to process incoming content before indexing, for example:
+ 
   * Language identification
   * Text extraction (Tika)
   * Entity extraction and classification
@@ -16, +17 @@

  
  There are many processing pipeline frameworks to draw inspiration from, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]] (now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]], [[http://found.no/products/piped/|Piped]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This gives loose coupling but also little reuse of code: each new project has to choose which of the pipelines to invest in.
  
- The community would benefit from an official processing framework -- and more importantly an official repository of processing stages which are shared and reused. The sharing part is crucial. If a company develops, say a Geo``Names stage to translate address into lat/lon, the whole community can benefit from that by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
+ The community would benefit from an official processing framework -- and, more importantly, an official repository of processing stages that are shared and reused. The sharing part is crucial: if a company develops, say, a GeoNames stage that translates addresses into lat/lon coordinates, the whole community can benefit by fetching the stage from the shared repository. This will not happen as long as there is no single preferred integration point.
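  
  As an illustration, the sketch below shows how such a GeoNames stage could plug into Solr's existing UpdateRequestProcessor extension point today. Only the Solr base classes are real; the lookup client, field names and coordinate format are invented for the example.
  
  {{{
  import java.io.IOException;
  
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  
  /**
   * Hypothetical shared stage: looks up the "address" field in a
   * GeoNames-style service and stores the result in "latlon".
   */
  public class GeoNamesUpdateProcessor extends UpdateRequestProcessor {
  
    /** Stub standing in for a real GeoNames service client. */
    static class GeoNamesClient {
      String lookupLatLon(String address) {
        return "0.0,0.0"; // a real client would query the service here
      }
    }
  
    private final GeoNamesClient geo = new GeoNamesClient();
  
    public GeoNamesUpdateProcessor(UpdateRequestProcessor next) {
      super(next);
    }
  
    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      Object address = doc.getFieldValue("address");
      if (address != null) {
        doc.setField("latlon", geo.lookupLatLon(address.toString()));
      }
      super.processAdd(cmd); // hand the document on to the next processor
    }
  }
  }}}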
  
- There have recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this blog post]] for thoughts from Find``Wise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline for Solr]].
+ There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010, [[http://findabilityblog.se/solr-processing-pipeline|this blog post]] with thoughts from FindWise, and the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline for Solr]].
  
  = Solution proposal =
  Develop a simple, scalable, easily scriptable and configurable document processing framework for Solr that builds on existing best practices. The framework should be simple and lightweight enough for use with a single Solr node, yet powerful enough to scale out to a separate document processing cluster simply by changing configuration.
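  
  A minimal sketch of what such a framework's core contract might look like (all names below are invented for illustration, not a proposed API):
  
  {{{
  import java.util.List;
  import java.util.Map;
  
  import org.apache.solr.common.SolrInputDocument;
  
  /** Hypothetical stage contract: one document in, one document out. */
  public interface Stage {
    /** Called once with the stage's configuration before any documents flow. */
    void init(Map<String, String> config);
  
    /** Transform or enrich the document and return it for the next stage. */
    SolrInputDocument process(SolrInputDocument doc);
  }
  
  /** Hypothetical linear pipeline: applies the configured stages in order. */
  public class Pipeline {
    private final List<Stage> stages;
  
    public Pipeline(List<Stage> stages) {
      this.stages = stages;
    }
  
    public SolrInputDocument run(SolrInputDocument doc) {
      for (Stage stage : stages) {
        doc = stage.process(doc);
      }
      return doc;
    }
  }
  }}}
  
  Under this kind of contract, whether the stages run in-process with a single Solr node or in a separate processing cluster becomes a configuration and deployment concern rather than a code change.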
@@ -78, +79 @@

  
  = Q&A =
  == Your question here ==
- 
   * Q: Is there a JIRA issue that tracks the development of this feature?
   * A: Not yet
  
@@ -86, +86 @@

   * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]). Here we're talking about improving the whole UpdateProcessor framework, either by replacing it or by enhancing the existing one.
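  
  For context, stages plug into the existing framework through a factory that is referenced from an <updateRequestProcessorChain> in solrconfig.xml. A minimal sketch, reusing the hypothetical GeoNames processor from above (assumed to be in the same package):
  
  {{{
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
  
  /** Factory for the hypothetical GeoNames stage sketched earlier. */
  public class GeoNamesUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
      // Wrap the rest of the chain with the GeoNames stage.
      return new GeoNamesUpdateProcessor(next);
    }
  }
  }}}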
  
   * Q: Will the pipelines have to be linear? For instance, could we implement a first stage in the pipeline that acts as a splitter? The splitter could, for example, break up a large XML document into chapters and push each chapter to the next stage, where other processing would take place. In the end, the Lucene index would have one document per chapter. (See the illustrative sketch below.)
-  * A: 
+  * A:
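  
  While this question is still open, the scenario can at least be illustrated: a one-to-many variant of the hypothetical Stage contract sketched above would let a splitter emit one output document per chapter. All names below are invented.
  
  {{{
  import java.util.ArrayList;
  import java.util.Collection;
  import java.util.List;
  
  import org.apache.solr.common.SolrInputDocument;
  
  /** Hypothetical one-to-many variant of the Stage contract. */
  public interface SplittingStage {
    List<SolrInputDocument> process(SolrInputDocument doc);
  }
  
  /** Splits a document with a multi-valued "chapter" field into one document per chapter. */
  public class ChapterSplitter implements SplittingStage {
    @Override
    public List<SolrInputDocument> process(SolrInputDocument doc) {
      List<SolrInputDocument> out = new ArrayList<SolrInputDocument>();
      Collection<Object> chapters = doc.getFieldValues("chapter");
      if (chapters == null) {
        out.add(doc); // nothing to split; pass the document through unchanged
        return out;
      }
      int i = 0;
      for (Object chapter : chapters) {
        SolrInputDocument child = new SolrInputDocument();
        child.setField("id", doc.getFieldValue("id") + "-" + (i++));
        child.setField("text", chapter);
        out.add(child);
      }
      return out;
    }
  }
  }}}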
  
-  * Q: (Your question here)
+  * Q: How will the pipelines support compound files, e.g. archives or e-mail messages with attachments (which could themselves be archives)? This could be a problem if the pipelines are linear.
-  * A: 
+  * A: