Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/07/20 17:24:38 UTC

[jira] [Resolved] (CONNECTORS-989) Support virtual child document model

     [ https://issues.apache.org/jira/browse/CONNECTORS-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-989.
------------------------------------

       Resolution: Fixed
    Fix Version/s: ManifoldCF 1.7

r1612102

> Support virtual child document model
> ------------------------------------
>
>                 Key: CONNECTORS-989
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-989
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>
> In some cases, documents that are indexed may be virtual children of those that are queued.  A good example of this is RSS feeds where the data being indexed all comes from the feed.
> In order to implement this, the following changes would be required:
> (1) IProcessActivity.ingestDocument() has a variant which allows you to include a virtual child document identifier in addition to the main document identifier.
> (2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys -- one for main (queued) document identifier, one for child virtual document identifier.
> (3) IIncrementalIngester has two new methods: beginDocument() and endDocument(), both of which take a main (queued) document identifier as an argument.
> (4) ingeststatus table has two additional columns: a state, and a child key.
> (5) The flow is: at beginDocument() time, put all records relating to a document into a "processing" state.  Documents that are seen have their state changed.  Documents never encountered are deleted at the end.
> (6) Incremental decisions not to update an output record will STILL require that the record be touched and its state set.
> (7) DocumentIngest records for the entire set of children will be fetched when the document is queued.
> (8) The getDocumentVersions() method must be modified to allow return of version strings for all children, although there can be "shortcuts" as well (where a single version string applies to all children).
> (9) The decision about whether to refetch a document is based on the returned version strings and on those fetched by the stuffer thread.
> (10) Similarly, processDocuments() receives version strings for all virtual children.
> (11) There is no need to actively reset the state of document records on restart; the current logic should be robust enough to be able to generate the required deletions.
> (12) Deleting a document deletes ALL child virtual documents.  This happens within the incremental ingester.
> (13) The requeuing interval must be computed across all children, taking the minimum, since there's no requirement that an ingeststatus record exist for the parent.
> (14) All other logic, including making sure only one agent operates on a url at a time, is the same.
> (15) Interrupting the delete phase is safe because next time the doc is processed the records will be removed.
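
The flow described in items (2), (5), (6), and (12) above can be sketched as a mark-and-sweep over the child records of one queued document. This is only an illustrative sketch, not ManifoldCF code: the class IngestStatusSketch, its Row type, the State values, and the in-memory map standing in for the ingeststatus table are all hypothetical, and the method signatures are simplified assumptions rather than the real IIncrementalIngester API.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical in-memory stand-in for the ingeststatus table; not ManifoldCF code.
class IngestStatusSketch {
    enum State { PROCESSING, SEEN }

    // Hypothetical row: a version string plus the proposed "state" column.
    static final class Row {
        String version;
        State state;
        Row(String version, State state) { this.version = version; this.state = state; }
    }

    // parent (queued) document id -> child key -> row
    private final Map<String, Map<String, Row>> table = new HashMap<>();

    // Item (5): put every record belonging to the parent into the "processing" state.
    void beginDocument(String parentId) {
        for (Row row : table.computeIfAbsent(parentId, k -> new HashMap<>()).values()) {
            row.state = State.PROCESSING;
        }
    }

    // Items (2) and (6): a child that is seen is marked even when its version is unchanged.
    // Returns true when the child would actually be sent to the output.
    boolean addOrReplaceDocument(String parentId, String childKey, String version) {
        Map<String, Row> children = table.computeIfAbsent(parentId, k -> new HashMap<>());
        Row row = children.get(childKey);
        if (row == null) {
            children.put(childKey, new Row(version, State.SEEN));
            return true;
        }
        boolean changed = !row.version.equals(version);
        row.version = version;
        row.state = State.SEEN;
        return changed;
    }

    // Item (5), final part: children never encountered are deleted. Returns the count removed.
    int endDocument(String parentId) {
        Map<String, Row> children = table.get(parentId);
        if (children == null) return 0;
        int removed = 0;
        for (Iterator<Row> it = children.values().iterator(); it.hasNext(); ) {
            if (it.next().state == State.PROCESSING) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // Item (12): deleting the parent deletes all of its virtual children.
    int deleteDocument(String parentId) {
        Map<String, Row> children = table.remove(parentId);
        return children == null ? 0 : children.size();
    }

    int childCount(String parentId) {
        Map<String, Row> children = table.get(parentId);
        return children == null ? 0 : children.size();
    }

    public static void main(String[] args) {
        IngestStatusSketch s = new IngestStatusSketch();
        // First crawl: the feed contains two items.
        s.beginDocument("feed");
        s.addOrReplaceDocument("feed", "item-1", "v1");
        s.addOrReplaceDocument("feed", "item-2", "v1");
        s.endDocument("feed");
        // Second crawl: item-2 has disappeared from the feed.
        s.beginDocument("feed");
        boolean reindexed = s.addOrReplaceDocument("feed", "item-1", "v1");
        int removed = s.endDocument("feed");
        System.out.println("reindexed=" + reindexed + " removed=" + removed
            + " remaining=" + s.childCount("feed"));
    }
}
```

Because an interrupted run leaves unseen rows in the "processing" state, the next beginDocument()/endDocument() pass over the same parent sweeps them up, which is the property items (11) and (15) rely on.
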
> Analysis:
> - The critical thing is making the non-virtual case no worse.
> - For a virtual child document, instead of one db access, there are two.
> - For document records that are not changed, there are two additional writes that were not needed before.
> - There's an additional index (or the document key index has another subfield).
> - If the queries can be written in such a way as to treat the standard (no child document) case specially, we may be able to avoid much of the impact; the cost would be only two index queries per document, each returning zero rows.
> - If we handle the standard case using the same mechanism, the WorkerThread logic dealing with deletions can go away.
> Summary:
> - Additional database overhead in the non-virtual indexing case consists of one additional write and one additional zero-row query, OR two additional zero-row queries.
> - Additional database overhead in the non-virtual skip case consists of two additional writes, OR two additional zero-row queries.
> - The overhead is low, but it is still significant enough to impact overall framework performance.
> - The up-sides are as follows: (a) handling an important but infrequent case better; (b) less connector involvement in indexing (e.g., IProcessActivity.deleteDocument() does nothing now, and can be deprecated).
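
Item (13) of the proposal, which takes the minimum requeuing interval across all children, could look roughly like the following. This is a sketch under assumptions: the class RequeueIntervalSketch, the method minInterval, and the idea that each child record yields one interval in milliseconds are hypothetical and not taken from the ManifoldCF code base.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper; not part of ManifoldCF.
class RequeueIntervalSketch {
    // Returns the smallest per-child interval, or an empty result when the
    // parent has no child records (no parent record is assumed to exist).
    static OptionalLong minInterval(List<Long> childIntervalsMillis) {
        return childIntervalsMillis.stream().mapToLong(Long::longValue).min();
    }

    public static void main(String[] args) {
        List<Long> intervals = Arrays.asList(3_600_000L, 900_000L, 86_400_000L);
        System.out.println(minInterval(intervals));
        System.out.println(minInterval(Collections.<Long>emptyList()).isPresent());
    }
}
```

Returning an empty result for a parent without children leaves the choice of a default requeuing interval to the caller.
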



--
This message was sent by Atlassian JIRA
(v6.2#6252)