Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/10/20 14:03:45 UTC

[Solr Wiki] Trivial Update of "DocumentProcessing" by Alex Brasetvik

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by Alex Brasetvik:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=17&rev2=18

Comment:
Corrected Piped-URL

  
  The built-in UpdateRequestProcessorChain is capable of simple processing jobs, but it is built only for local execution on the indexer node, in the same thread. This means that any performance-heavy processing chain will slow down the indexers, with no way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.
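
To illustrate how tightly the current chain is coupled to the indexing path, here is a minimal sketch of a custom stage. The class name and the indexed_at field are made up for the example; the UpdateRequestProcessor and UpdateRequestProcessorFactory classes are the extension points such stages use. Whatever processAdd does runs synchronously, in the indexing request thread, before the document is handed to the next stage.

{{{#!java
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Illustrative stage: stamps each incoming document with an "indexed_at" field.
// The class and field names are hypothetical, chosen only for this sketch.
public class TimestampProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        doc.setField("indexed_at", new java.util.Date());
        // All work happens here, in the same thread that handles the update
        // request, before the document is passed on down the chain.
        super.processAdd(cmd);
      }
    };
  }
}
}}}

Such a factory is referenced from an updateRequestProcessorChain definition in solrconfig.xml, which is why the only place it can run is the node that receives the update.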
  
- There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]] (now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache commons pipeline]], [[http://found.no/products/piped/|Piped]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This means weak coupling but also weak re-use of code. Each new project will have to choose which of the pipelines to invest in.
+ There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]] (now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache commons pipeline]], [[http://www.piped.io/|Piped]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This means weak coupling but also weak re-use of code. Each new project will have to choose which of the pipelines to invest in.
  
  The community would benefit from an official processing framework -- and, more importantly, an official repository of processing stages that are shared and reused. The sharing part is crucial. If a company develops, say, a GeoNames stage to translate addresses into lat/lon coordinates, the whole community can benefit by fetching that stage from the shared repository. This will not happen as long as there is no single preferred integration point.
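
As a rough sketch of what such a shared stage could look like today, expressed as an UpdateRequestProcessor: the GeonamesClient interface, the address and latlon field names, and the coordinate formatting below are assumptions made for illustration, not an existing implementation.

{{{#!java
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Stand-in for whatever geocoding lookup the stage would wrap (hypothetical).
interface GeonamesClient {
  double[] latLonFor(String address); // returns {lat, lon}, or null if unknown
}

// Hypothetical shared stage: resolves an "address" field to a "latlon" field.
public class GeoNamesProcessor extends UpdateRequestProcessor {
  private final GeonamesClient geonames;

  public GeoNamesProcessor(GeonamesClient geonames, UpdateRequestProcessor next) {
    super(next);
    this.geonames = geonames;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object address = doc.getFieldValue("address");
    if (address != null) {
      double[] latLon = geonames.latLonFor(address.toString());
      if (latLon != null) {
        doc.setField("latlon", latLon[0] + "," + latLon[1]); // "lat,lon" string
      }
    }
    super.processAdd(cmd);
  }
}
}}}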