Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/05/28 01:47:02 UTC
[jira] [Comment Edited] (CONNECTORS-946) Add support for pipeline connector

    [ https://issues.apache.org/jira/browse/CONNECTORS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010505#comment-14010505 ] 

Karl Wright edited comment on CONNECTORS-946 at 5/27/14 11:45 PM:
------------------------------------------------------------------

On second thought, to preserve the ability to detect configuration changes, the Pipeline Connector will need a version string.  This changes the design quite a bit:

- The pipeline connection list for processing is built right in the job
- Each job has an ordered list of pipeline connections that it runs on every document (stored in a new database table)
- Pipeline connections can have job tabs in the UI (although we need a way to avoid collisions when the same connection type appears more than once in one job -- maybe by passing the pipeline connection name as a parameter to the UI methods)
- There's a TranslationSpecification, analogous to an OutputSpecification or DocumentSpecification, and a pipeline connector method that explicitly maps the TranslationSpecification to a version string
- The transformDocument() method accepts the version string and uses it where appropriate to control the transformation
- The ingeststatus table has a new sidecar table that holds onto pipeline connection version strings, for comparison
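
To make the version-string idea concrete, here is a minimal sketch, not the actual ManifoldCF API. Every name in it (TranslationSpecSketch, PipelineConnectorSketch, CaseFoldingConnector, the "case" parameter) is hypothetical, and documents are reduced to plain strings. It only shows one connector method mapping a specification to a version string and transformDocument() being driven by that string:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the proposed TranslationSpecification: a flat parameter map.
final class TranslationSpecSketch {
    final Map<String, String> params = new HashMap<>();
}

// Hypothetical pipeline connector interface; signatures are illustrative assumptions.
interface PipelineConnectorSketch {
    // Explicitly maps the per-job specification to a version string.
    String getVersionString(TranslationSpecSketch spec);

    // Transforms the document, using the version string to control the transformation.
    String transformDocument(String content, String versionString);
}

// Example connector whose only setting is the made-up "case" parameter.
final class CaseFoldingConnector implements PipelineConnectorSketch {
    @Override
    public String getVersionString(TranslationSpecSketch spec) {
        // Any change to the setting yields a different version string.
        return "case=" + spec.params.getOrDefault("case", "keep");
    }

    @Override
    public String transformDocument(String content, String versionString) {
        // The version string alone decides what the transformation does.
        return "case=upper".equals(versionString)
            ? content.toUpperCase(Locale.ROOT)
            : content;
    }
}
```

Because the version string captures everything that affects the transformation, a change to the specification shows up as a changed string, which is what makes configuration changes detectable.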

It's critical that the performance of the ingeststatus table does not suffer when no pipeline steps are configured.  I think that is relatively straightforward to achieve, since the worker threads will request pipeline version strings directly when evaluating whether a document has changed.
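
As a rough sketch of that comparison, assuming the stored and current pipeline version strings are available as ordered lists (the class name, method name, and parameters below are all hypothetical, and how the sidecar table is actually queried is not decided here):

```java
import java.util.List;

final class PipelineChangeCheck {
    // Returns true when the document must be re-ingested.
    // storedPipelineVersions: strings recorded at the last ingestion (assumed to come from the sidecar table).
    // currentPipelineVersions: strings computed from the job's ordered pipeline connections.
    static boolean needsReprocessing(String storedDocVersion,
                                     String currentDocVersion,
                                     List<String> storedPipelineVersions,
                                     List<String> currentPipelineVersions) {
        // The pipeline comparison is only evaluated when the document version itself is unchanged.
        // List.equals is order-sensitive, so reordering pipeline steps also counts as a change.
        return !currentDocVersion.equals(storedDocVersion)
            || !currentPipelineVersions.equals(storedPipelineVersions);
    }
}
```

In this sketch a job with no pipeline connections compares two empty lists, so the extra check costs almost nothing.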



> Add support for pipeline connector
> ----------------------------------
>
>                 Key: CONNECTORS-946
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-946
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In the Amazon Search Connector, we finally found an example of an output connector that needed to do full document processing in order to work.  This ticket represents work in the framework to create a concept of "pipeline connector".  Pipeline connections would receive RepositoryDocument objects, and transform them to new RepositoryDocument objects.  There would be a single important method:
> {code}
> public void transformDocument(RepositoryDocument rd, ITransformationActivities activities) throws ServiceInterruption, ManifoldCFException;
> {code}
> ... where ITransformationActivities would include a method that would send a RepositoryDocument object onward to either the output connection or to the next pipeline connection.
> Each pipeline connection would have:
> - A name
> - A description
> - Configuration data
> - An optional prerequisite pipeline connection
> Every output connection would have a new field, which is an optional prerequisite pipeline connection.
> This design is based loosely on how mapping connections and authority connections interrelate.  An alternate design would involve having per-job specification information, but I think this would wind up being way too complex for very little benefit, since each pipeline connection/stage would be expected to do relatively simple/granular things, not usually involving interaction with an external system.
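
The "send onward" behaviour in the quoted description can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the proposed framework code: documents are plain strings instead of RepositoryDocument objects, exceptions such as ServiceInterruption are omitted, and the names ActivitiesSketch, TransformerSketch and PipelineChainSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ITransformationActivities: one method that sends a document onward.
interface ActivitiesSketch {
    void sendDocument(String document);
}

// Hypothetical stand-in for a pipeline connector with the single transformDocument method.
interface TransformerSketch {
    void transformDocument(String document, ActivitiesSketch activities);
}

final class PipelineChainSketch {
    // Builds the activities object handed to stage `index`.
    // Past the last stage, documents go straight to the output.
    static ActivitiesSketch activitiesFor(List<TransformerSketch> stages, int index, ActivitiesSketch output) {
        if (index >= stages.size()) {
            return output;
        }
        return doc -> stages.get(index)
            .transformDocument(doc, activitiesFor(stages, index + 1, output));
    }

    // Runs one document through all stages and collects what reaches the output.
    static List<String> run(List<TransformerSketch> stages, String document) {
        List<String> delivered = new ArrayList<>();
        activitiesFor(stages, 0, delivered::add).sendDocument(document);
        return delivered;
    }
}
```

Each stage sees only its own activities object, so it does not need to know whether the next hop is another pipeline connection or the output connection.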


