Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/07 00:37:57 UTC

[Solr Wiki] Trivial Update of "DocumentProcessing" by JanHoydahl

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Removing unwanted wikilinks.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]] and others. Indeed, many of these are already being used with Solr as a pre-processing server. 
  
- However, the Solr community needs one single solution and more importantly a repository of processing stages which can be shared and reused. The sharing part is crucial. If a company develops, say a ~GeoNames stage to translate address into lat/lon, the whole community can benefit from that by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
+ However, the Solr community needs one single solution and, more importantly, a repository of processing stages which can be shared and reused. The sharing part is crucial. If a company develops, say, a Geo``Names stage to translate addresses into lat/lon, the whole community can benefit by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
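
To make the sharing idea concrete, such a stage could already be written against Solr's existing UpdateRequestProcessor API. A minimal sketch, assuming hypothetical field names ("address", "latlon") and a hypothetical lookupLatLon() helper:

{{{
// A minimal sketch, not a finished stage: a GeoNames-style processor written
// against Solr's existing UpdateRequestProcessor API. The field names
// ("address", "latlon") and the lookupLatLon() helper are hypothetical.
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class GeoNamesProcessor extends UpdateRequestProcessor {

  public GeoNamesProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.solrDoc;
    Object address = doc.getFieldValue("address");
    if (address != null) {
      String latLon = lookupLatLon(address.toString()); // hypothetical gazetteer lookup
      if (latLon != null) {
        doc.setField("latlon", latLon);
      }
    }
    super.processAdd(cmd); // hand the document to the next processor in the chain
  }

  // Placeholder: a real stage would call a GeoNames-style service here.
  private String lookupLatLon(String address) {
    return null;
  }
}
}}}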
  
  There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 for thoughts from Findwise, and now is the time to move.
  
@@ -26, +26 @@

   * Java based
   * Lightweight
   * Support for multiple named pipelines, addressable at document ingestion
-  * Must work with existing RequestHandlers (XML, CSV, DIH, Binary etc) as entry point
+  * Must work with existing Request``Handlers (XML, CSV, DIH, Binary etc) as entry point
   * Support for metadata on document and field level (e.g. tokenized=true, language=en)
   * Allow scaling out processing to multiple dedicated servers for heavy tasks
   * Well defined API for the processing stages
@@ -48, +48 @@

   * Wrappers for custom FAST ESP stages to work with minor modification
  
  = Proposed architecture =
- Hook into the context of the existing ~UpdateRequestProcessorChain (integrate in ~ContentStreamHandlerBase) by providing a dispatcher class, SolrPipelineDispatcher. The dispatcher would be enabled and configured through update parameters pipeline.name and pipeline.mode, either from the update request or in solrconfig.xml.
+ Hook into the context of the existing UpdateRequestProcessorChain (integrate in Content``Stream``Handler``Base) by providing a dispatcher class, Solr``Pipeline``Dispatcher. The dispatcher would be enabled and configured through update parameters pipeline.name and pipeline.mode, either from the update request or in solrconfig.xml.
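
A sketch of how this wiring might look, assuming the dispatcher is exposed as an UpdateRequestProcessorFactory so it can be listed in an existing updateRequestProcessorChain; the Solr``Pipeline``Dispatcher processor itself is the proposal's hypothetical class:

{{{
// Sketch of the proposed hook, assuming the dispatcher is exposed as an
// UpdateRequestProcessorFactory. SolrPipelineDispatcher is the proposed
// (not yet existing) class; everything else is the real Solr update API.
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SolrPipelineDispatcherFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    SolrParams params = req.getParams();
    String name = params.get("pipeline.name");          // which named pipeline to run
    String mode = params.get("pipeline.mode", "local"); // "local" or "distributed"
    if (name == null) {
      return next; // no pipeline requested: behave like a plain chain
    }
    return new SolrPipelineDispatcher(name, mode, next); // hypothetical processor
  }
}
}}}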
  
- SolrPipelineDispatcher would have two modes: "local" and "distributed". In case of local mode, the pipeline executes locally and results in the ProcessorChain being completed with RunUpdateProcessorFactory submitting the content to local index. This would work well for single-node as well as low load scenarios.
+ Solr``Pipeline``Dispatcher would have two modes: "local" and "distributed". In local mode, the pipeline executes locally and the ProcessorChain completes with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios.
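
Continuing the sketch above, the local mode inside the hypothetical dispatcher's processAdd() could look roughly like this; Pipeline and PipelineRegistry are invented placeholders for whatever pipeline abstraction the framework would define:

{{{
// Sketch of the local mode. Pipeline and PipelineRegistry are hypothetical.
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  if ("local".equals(mode)) {
    Pipeline pipeline = PipelineRegistry.lookup(pipelineName); // hypothetical lookup
    pipeline.process(cmd.solrDoc); // stages mutate the document in place
    super.processAdd(cmd);         // rest of the chain runs; RunUpdateProcessor indexes locally
  } else {
    dispatchToWorker(cmd);         // "distributed" mode, sketched below
  }
}
}}}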
  
  The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes who executes the actual pipeline. This means that indexing will not (necessarily) happen locally. Thus we introduce the possibility for a Solr node which takes on the role of RequestHandler + Dispatcher only. 
  
- On the remote end, there will be a Solr installation with a new PipelineRequestHandler (cmd=processPipeline) which receives a stream of updateRequests and executes the correct pipeline. When the pipeline has finished executing, the resulting documents enter the SolrPipelineDispatcher again and gets dispatched to the correct shard for indexing. For this to work, the shard ID must be configured or calculated somewhere (sounds like a good time to introduce general distributed indexing!).
+ On the remote end, there will be a Solr installation with a new Pipeline``Request``Handler (cmd=processPipeline) which receives a stream of updateRequests and executes the correct pipeline. When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher again and get dispatched to the correct shard for indexing. For this to work, the shard ID must be configured or calculated somewhere (sounds like a good time to introduce general distributed indexing!).
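
As a placeholder for the open shard-ID question, the simplest possible calculation would be to hash the document's unique key over a configured shard list; this is purely illustrative:

{{{
// One simple answer to "the shard ID must be configured or calculated
// somewhere": hash the document's unique key over a configured list of
// shard masters. Purely illustrative; a real design would also need to
// survive shard topology changes.
public class HashShardRouter {

  private final String[] shardUrls; // configured shard master URLs

  public HashShardRouter(String[] shardUrls) {
    this.shardUrls = shardUrls;
  }

  public String shardFor(String uniqueKey) {
    int slot = (uniqueKey.hashCode() & 0x7fffffff) % shardUrls.length;
    return shardUrls[slot];
  }
}
}}}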
  
- The shard masters which are the final targets for the pipeline will then receive the processed documents through the PipelineRequestHandler (cmd=index) and finalize indexing.
+ The shard masters that are the final targets for the pipeline will then receive the processed documents through the Pipeline``Request``Handler (cmd=index) and finalize indexing.
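
Pulling the two commands together, a skeleton of the proposed Pipeline``Request``Handler might look like this, assuming it extends Solr's RequestHandlerBase; pipeline execution and the final indexing call are omitted, since neither is designed yet:

{{{
// Skeleton of the proposed PipelineRequestHandler and its two commands.
// Only the command dispatch is sketched; the bodies are placeholders.
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

public class PipelineRequestHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    String cmd = req.getParams().get("cmd", "processPipeline");
    if ("processPipeline".equals(cmd)) {
      // Worker role: run the named pipeline over the incoming document
      // stream, then hand the results back to SolrPipelineDispatcher for
      // shard routing. (Pipeline execution is hypothetical and omitted.)
      rsp.add("status", "pipeline executed");
    } else if ("index".equals(cmd)) {
      // Shard master role: documents arrive already processed; pass them
      // to the standard update chain for final indexing. (Omitted.)
      rsp.add("status", "indexed");
    }
  }

  @Override
  public String getDescription() {
    return "Executes document processing pipelines and indexes the results";
  }
}
}}}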
  
  The pipeline itself could be based on [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]] or on code from one of the other existing pipeline projects. A benefit of Commons Pipeline is that it is already an Apache library built for scalability; however, it may need to be adapted to suit our needs.
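
For illustration, a stage on top of Commons Pipeline might look like the following. This is a sketch assuming the sandbox API's BaseStage and its emit() method, and it uses a plain Map as a stand-in for a Solr document:

{{{
// Sketch of a processing stage on top of Apache Commons Pipeline, assuming
// the sandbox API's BaseStage and emit(). A plain Map stands in for the
// document type the framework would actually define.
import java.util.Map;

import org.apache.commons.pipeline.StageException;
import org.apache.commons.pipeline.stage.BaseStage;

public class LowercaseTitleStage extends BaseStage {

  @Override
  public void process(Object obj) throws StageException {
    @SuppressWarnings("unchecked")
    Map<String, Object> doc = (Map<String, Object>) obj;
    Object title = doc.get("title");
    if (title != null) {
      doc.put("title", title.toString().toLowerCase());
    }
    emit(doc); // pass the document to the next stage in the pipeline
  }
}
}}}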