Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/10/18 22:49:37 UTC

[Solr Wiki] Update of "DocumentProcessing" by JanHoydahl

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=16&rev2=17

Comment:
Answering some questions

   * A: Not yet
  
   * Q: How is this related to https://issues.apache.org/jira/browse/SOLR-2129?
-  * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]). Here we're talking about improving the whole UpdateProcessor framework, either by replacing it or enhancing the existing.
+  * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]). Here we're talking about improving the whole UpdateProcessor framework, either by enhancing the existing one or by creating a new project.
  
   * Q: Will the pipelines have to be linear? For instance, could we implement a first stage in the pipeline that acts as a splitter? The splitter could, for example, break up a large XML document into chapters, then push each chapter to the next stage where other processing will take place. In the end, the Lucene index would have one document per chapter.
-  * A:
+  * A: In [[https://issues.apache.org/jira/browse/SOLR-2841|SOLR-2841]] we suggest a way to make pipelines non-linear. For splitting into chapters, however, I think an UpdateRequestHandler may be a better choice; see http://wiki.apache.org/solr/XsltUpdateRequestHandler
  
   * Q: How will the pipelines support compound files, e.g. archives, e-mail messages with attachments (which could be archives), etc.? This could be a problem if pipelines are linear.
-  * A:
+  * A: Again, you have a choice as to whether your UpdateRequestHandler should understand the input format and do the splitting for you. But it should also be possible to write an UpdateProcessor which splits the incoming SolrInputDocument into multiple sub-documents, generating a unique ID for each. You would then need to inject these sub-documents again somehow, either by using SolrJ from your UpdateProcessor or by instantiating a "sub chain" in another thread to push the sub-docs into the index. This is, however, left as an exercise for the user :)
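
The core of the splitting idea above (one sub-document per part, with a generated unique ID and a link back to the parent) can be sketched in plain Java. Note this is only an illustration: the `Map<String, Object>` here is a hypothetical stand-in for Solr's `SolrInputDocument`, and the field names (`id`, `parent_id`) and the ID-suffix scheme are assumptions, not an existing Solr API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the sub-document splitting step an UpdateProcessor might do.
// Map<String, Object> stands in for SolrInputDocument (hypothetical).
public class SubDocumentSplitter {

    /**
     * Splits a parent document into one sub-document per element of parts,
     * copying all other fields, assigning a unique id per sub-document by
     * suffixing the parent id, and keeping a parent_id link.
     */
    public static List<Map<String, Object>> split(Map<String, Object> parent,
                                                  String splitField,
                                                  List<?> parts) {
        List<Map<String, Object>> subDocs = new ArrayList<>();
        String parentId = String.valueOf(parent.get("id"));
        int i = 0;
        for (Object part : parts) {
            Map<String, Object> sub = new LinkedHashMap<>(parent); // copy shared fields
            sub.put("id", parentId + "_" + i++);  // unique id for each sub-document
            sub.put(splitField, part);            // one chapter/attachment per sub-doc
            sub.put("parent_id", parentId);       // link back to the parent document
            subDocs.add(sub);
        }
        return subDocs;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "book-1");
        doc.put("title", "Example Book");
        for (Map<String, Object> sub :
                split(doc, "chapter", List.of("chapter 1 text", "chapter 2 text"))) {
            System.out.println(sub.get("id") + " -> " + sub.get("chapter"));
        }
    }
}
```

In a real UpdateProcessor, each resulting sub-document would then be re-injected, e.g. pushed to Solr via SolrJ or fed into a separate processor chain, as the answer above describes.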