Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/12/15 13:35:14 UTC

[jira] [Commented] (CONNECTORS-1122) Explore ways to make job start be faster in systems with lots of documents

    [ https://issues.apache.org/jira/browse/CONNECTORS-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246612#comment-14246612 ] 

Karl Wright commented on CONNECTORS-1122:
-----------------------------------------

The fundamental issue is that document bins are *not* stored in the schema; connectors compute the document bins for a given document in code.  When a job starts, certain documents in the job's queue are put into a state where their priorities need to be determined.  Similarly, when a job is aborted, documents that had priorities in that job must have those priorities rescinded.  Because document bins are global, either change makes the existing allocation of document priorities incorrect whenever other jobs have prioritized documents that share bins with the documents whose state is changing.  This is why ManifoldCF currently reprioritizes all documents whenever (say) a job starts or ends.

At job start time, if only the documents being marked active for the new job were marked for prioritization, any of those documents whose bins overlap with bins used by existing jobs would be placed at the back of the line: *no* documents from the overlapping bins would be processed in the new job until *all* the documents currently prioritized in the older jobs had been processed.
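
The "back of the line" effect can be illustrated with a toy model.  This is only a sketch under stated assumptions, not ManifoldCF's actual prioritization code: the class BinPrioritySketch, its per-bin slot counters, and the round-robin interleaving used to stand in for full reprioritization are all hypothetical.  A lower number means "process sooner".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT ManifoldCF's real priority calculation.
public class BinPrioritySketch {
    // Hypothetical per-bin counter: the next free priority slot in each bin.
    private final Map<String, Integer> nextSlot = new HashMap<>();

    // Hand out the next free slot in the given bin (lower = processed sooner).
    public int assign(String bin) {
        int slot = nextSlot.getOrDefault(bin, 0);
        nextSlot.put(bin, slot + 1);
        return slot;
    }

    // Incremental approach: the older job's documents keep their slots and
    // only the new job's documents are prioritized, on top of existing counters.
    // Returns the priorities given to the new job's documents.
    public static List<Integer> incremental(int oldDocs, int newDocs, String bin) {
        BinPrioritySketch s = new BinPrioritySketch();
        for (int i = 0; i < oldDocs; i++) {
            s.assign(bin);
        }
        List<Integer> newJobPriorities = new ArrayList<>();
        for (int i = 0; i < newDocs; i++) {
            newJobPriorities.add(s.assign(bin));
        }
        return newJobPriorities;
    }

    // Full reprioritization, modeled (as an assumption) by resetting the
    // counters and interleaving documents from both jobs round-robin.
    // Returns the priorities given to the new job's documents.
    public static List<Integer> full(int oldDocs, int newDocs, String bin) {
        BinPrioritySketch s = new BinPrioritySketch();
        List<Integer> newJobPriorities = new ArrayList<>();
        int o = 0;
        int n = 0;
        while (o < oldDocs || n < newDocs) {
            if (o < oldDocs) {
                s.assign(bin);
                o++;
            }
            if (n < newDocs) {
                newJobPriorities.add(s.assign(bin));
                n++;
            }
        }
        return newJobPriorities;
    }
}
```

With three documents already prioritized in a shared bin and two arriving from a new job, incremental(3, 2, "example.com") gives the new documents slots 3 and 4, behind everything in the older job, while full(3, 2, "example.com") gives them slots 1 and 3.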

At job end time, rescinding document priorities suddenly leaves "holes" in the prioritization, and ManifoldCF distributes documents less efficiently as a result.

For the start case, it may be acceptable not to reprioritize fully.  This is one change that would be easy to explore.  For the job abort case, skipping full reprioritization is not going to work; the reprioritization must take place.
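
For the start case, the two shortcuts mentioned in the issue description (a job with no documents, and a bin-level overlap check) might be sketched as follows.  This is a hypothetical illustration, not existing ManifoldCF code: the class and method names are invented, and it assumes the two bin sets have already been gathered somehow, which is itself the hard part, since bins are not stored in the schema and would have to be computed through the connectors.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical helper: not part of ManifoldCF.
public class ReprioritizationCheck {
    /**
     * Decide whether a job start would need full reprioritization.
     *
     * @param newJobBins bins of the documents being made active by the
     *                   starting job (assumed to be precomputed)
     * @param activeBins bins of documents already prioritized in other jobs
     *                   (assumed to be precomputed)
     */
    public static boolean needsFullReprioritization(Set<String> newJobBins,
                                                    Set<String> activeBins) {
        // A job with no documents cannot disturb any existing priorities.
        if (newJobBins.isEmpty()) {
            return false;
        }
        // Priorities only interact within a shared bin, so with no common
        // bin the existing allocation stays valid.
        return !Collections.disjoint(newJobBins, activeBins);
    }
}
```

When this returns false, only the new job's documents would need priorities assigned; when it returns true, the sketch simply falls back to the current behavior of reprioritizing everything.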


> Explore ways to make job start be faster in systems with lots of documents
> --------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1122
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1122
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.9, ManifoldCF 2.1
>
>
> Job start currently requires all documents to be marked as needing reprioritization.  We should consider ways in which we can reduce the need to do this as much as possible.  For example, if there are NO documents at all for a job, reprioritization is by definition unneeded.  Alternatively, if we could come up with a way of determining whether there are any bin-level overlaps between documents made active by a job start and documents elsewhere, we could be more targeted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)