Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/06/11 18:43:11 UTC

[jira] [Comment Edited] (CONNECTORS-962) Support multiple output connections for a single job

    [ https://issues.apache.org/jira/browse/CONNECTORS-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027990#comment-14027990 ] 

Karl Wright edited comment on CONNECTORS-962 at 6/11/14 4:42 PM:
-----------------------------------------------------------------

Solution to the RepositoryDocument "multiple consumers" problem:
- RepositoryDocument actually contains an input stream wrapper, which is backed by the original InputStream
- The wrapper "mints" a fresh InputStream whenever asked
- As the original stream is read, the data is stored locally in a temporary file, and the file name is kept by the wrapper object
- When the wrapper object is closed, the temporary file is deleted (if it was created in the first place)
- The tricky part: in order to know whether to create the temporary file, we *must* know up front whether the stream will have more than one consumer.  One possibility is RepositoryDocument.setMultipleConsumers(), provided it is called before the first read.  The framework can interrogate each connector in a pipeline to find out how many consumers there will be, and set this parameter before any of them are called.
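
A rough sketch of the wrapper idea described in the list above, in Java. This is not ManifoldCF code: the class name MultiConsumerStream and the method openStream() are hypothetical, and setMultipleConsumers() is only the possibility floated above, here placed on the wrapper itself. As a simplification, the sketch copies the whole original stream into the temporary file on the first request, rather than storing data incrementally as the original stream is read.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical wrapper; names are illustrative, not ManifoldCF API.
final class MultiConsumerStream implements Closeable {
    private final InputStream original;
    private boolean multipleConsumers = false;
    private boolean started = false;
    private Path tempFile = null; // created only when multiple consumers are expected

    MultiConsumerStream(InputStream original) {
        this.original = original;
    }

    // Must be called before the first stream is handed out.
    void setMultipleConsumers(boolean value) {
        if (started) {
            throw new IllegalStateException("must be set before the first read");
        }
        multipleConsumers = value;
    }

    // "Mints" a fresh InputStream positioned at the start of the data.
    InputStream openStream() throws IOException {
        if (!started) {
            started = true;
            if (!multipleConsumers) {
                // Single consumer: no temporary file, hand out the original stream.
                return original;
            }
            // Simplification: spool everything to the temporary file up front.
            tempFile = Files.createTempFile("repodoc", ".tmp");
            Files.copy(original, tempFile, StandardCopyOption.REPLACE_EXISTING);
        }
        if (tempFile == null) {
            throw new IllegalStateException("stream already consumed in single-consumer mode");
        }
        return Files.newInputStream(tempFile);
    }

    // Closes the original stream and deletes the temporary file, if one was created.
    @Override
    public void close() throws IOException {
        try {
            original.close();
        } finally {
            if (tempFile != null) {
                Files.deleteIfExists(tempFile);
            }
        }
    }
}
```

In this sketch the framework would call setMultipleConsumers(true) after counting the consumers in the pipeline, each consumer would call openStream() and close the stream it received, and the framework would close the wrapper once all consumers are done.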

This functionality is also very useful for transformation connectors, so even if the multiple outputs logic is not implemented yet, I believe I will go ahead and write the stream wrapper logic.




> Support multiple output connections for a single job
> ----------------------------------------------------
>
>                 Key: CONNECTORS-962
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-962
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> Zaizi has a requirement to support multiple outputs for a single job.  In theory this requirement can be met by doing the following:
> - Allow multiple output connections, and multiple pipelines, per job
> - Keep a distinct ingeststatus record for each document/output combination
> - Modify WorkerThread to call IncrementalIndexer multiple times for every document fetched
> Places where different things need to happen are:
> - RepositoryDocument - because one binary stream will not do for multiple outputs
> - UI, obviously, because there will need to be multiple pipelines, not just one, and in addition it would probably be important to be able to "split" the pipeline at arbitrary points


