Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2011/01/09 21:22:45 UTC

[jira] Commented: (CONNECTORS-146) Logic for dealing with unreachable documents at the end of a non-continuous job run does not handle hopcount and carrydown correctly

    [ https://issues.apache.org/jira/browse/CONNECTORS-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979400#action_12979400 ] 

Karl Wright commented on CONNECTORS-146:
----------------------------------------

Looking at this more deeply, I realized that there is actually a fairly significant case buried here.  To wit:

**Problem:
Deleting records at cleanup time has a bad side effect: the carrydown information
of a child may well change!  Say (for example) that A->B, B->C, A->D, and D->C.
When A changes so that it no longer ->D, then D is orphaned and will be cleaned up.
BUT: the carrydown information for C has changed!  So, C needs reindexing.
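To make the scenario concrete, here is a small self-contained sketch (illustrative names only, not the ManifoldCF API) of the A/B/C/D graph above: removing the A->D edge orphans D, but it also changes the set of parents contributing carrydown data to C.

```java
import java.util.*;

// Illustrative sketch of the carrydown scenario described above.
// Edges: A->B, B->C, A->D, D->C.  Not the real ManifoldCF data model.
public class CarrydownExample {
    // parent -> set of children
    static Map<String, Set<String>> edges = new HashMap<>();

    static void addEdge(String parent, String child) {
        edges.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
    }

    // Documents reachable from the given seed document
    static Set<String> reachable(String seed) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(seed);
        while (!stack.isEmpty()) {
            String doc = stack.pop();
            if (seen.add(doc))
                for (String child : edges.getOrDefault(doc, Set.of()))
                    stack.push(child);
        }
        return seen;
    }

    // Parents contributing carrydown data to a given child
    static Set<String> carrydownParents(String child) {
        Set<String> parents = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : edges.entrySet())
            if (e.getValue().contains(child)) parents.add(e.getKey());
        return parents;
    }

    public static void main(String[] args) {
        addEdge("A", "B"); addEdge("B", "C");
        addEdge("A", "D"); addEdge("D", "C");
        System.out.println(carrydownParents("C")); // contributors: B and D

        // A no longer references D: D is orphaned and gets cleaned up...
        edges.get("A").remove("D");
        edges.remove("D");
        // ...but C's carrydown contributor set has changed, so C needs reindexing.
        System.out.println(reachable("A").contains("D")); // false
        System.out.println(carrydownParents("C"));        // only B remains
    }
}
```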

**One solution:
Do nothing.  On the next run, the change to C will be detected.  Or will it?  If the connector's
seeding method doesn't detect the change, it will never be picked up.  So this will not work.

**Another solution:
Put the child documents into PENDINGPURGATORY and return the job to the active state at the end
of the SHUTTINGDOWN phase.  The return can be automatic; the existence of PENDINGPURGATORY
records when there are no remaining PURGATORY records can help the crawler decide.  The
documents should go to PENDINGPURGATORY only if they are in the COMPLETED state; they should
not if they are in the PURGATORY state.
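A minimal sketch of that rule, with stand-in enum names mirroring the queue states discussed here (the real jobqueue logic differs in detail): only COMPLETED children move to PENDINGPURGATORY, and the job returns to active exactly when pending records exist and PURGATORY is empty.

```java
import java.util.*;

// Hedged sketch of the proposed queue-state rule; enum names mirror the
// discussion, but this is not the actual ManifoldCF jobqueue implementation.
public class ShutdownRequeueSketch {
    enum QueueState { COMPLETED, PURGATORY, PENDINGPURGATORY }

    // A child whose carrydown changed is requeued only if it was COMPLETED;
    // PURGATORY documents are left alone.
    static QueueState onCarrydownChange(QueueState current) {
        return current == QueueState.COMPLETED ? QueueState.PENDINGPURGATORY : current;
    }

    // At the end of SHUTTINGDOWN: return to the active state iff there is
    // pending work but nothing left in PURGATORY.
    static boolean shouldReturnToActive(Collection<QueueState> states) {
        boolean pending = states.contains(QueueState.PENDINGPURGATORY);
        boolean purgatory = states.contains(QueueState.PURGATORY);
        return pending && !purgatory;
    }

    public static void main(String[] args) {
        System.out.println(onCarrydownChange(QueueState.COMPLETED)); // PENDINGPURGATORY
        System.out.println(onCarrydownChange(QueueState.PURGATORY)); // PURGATORY (unchanged)
        System.out.println(shouldReturnToActive(
            List.of(QueueState.PENDINGPURGATORY, QueueState.COMPLETED))); // true
    }
}
```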

In order to implement this solution, we want to call this JobManager method:

  /** Note the deletion of a set of documents, as a result of document processing by a job thread.
  *@param jobID is the identifier of the job.
  *@param legalLinkTypes is the set of legal link types for the job.
  *@param documentDescriptions are the set of description objects for the documents that were processed.
  *@param hopcountMethod describes how to handle deletions for hopcount purposes.
  *@return the set of documents for which carrydown data was changed by this operation.  These documents are likely
  *  to be requeued as a result of the change.
  */
  public DocumentDescription[] markDocumentDeletedMultiple(Long jobID, String[] legalLinkTypes, DocumentDescription[] documentDescriptions,
    int hopcountMethod)
    throws ManifoldCFException
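The intended call pattern can be sketched as follows, using illustrative stand-in types rather than the real JobManager/DocumentDescription classes: delete the orphans, then requeue whatever documents the call reports as having changed carrydown data.

```java
import java.util.*;

// Stand-in types to illustrate the call pattern only; the real ManifoldCF
// classes have richer state and throw ManifoldCFException.
public class DeletionCallSketch {
    static class DocumentDescription {
        final String id;
        DocumentDescription(String id) { this.id = id; }
    }

    interface JobManager {
        DocumentDescription[] markDocumentDeletedMultiple(Long jobID, String[] legalLinkTypes,
            DocumentDescription[] docs, int hopcountMethod);
    }

    // Delete the orphaned documents, then collect the documents whose
    // carrydown data changed; those would be requeued (PENDINGPURGATORY).
    static List<String> cleanup(JobManager mgr, Long jobID, String[] legalLinkTypes,
            int hopcountMethod, DocumentDescription[] orphans) {
        DocumentDescription[] affected =
            mgr.markDocumentDeletedMultiple(jobID, legalLinkTypes, orphans, hopcountMethod);
        List<String> requeued = new ArrayList<>();
        for (DocumentDescription dd : affected)
            requeued.add(dd.id);
        return requeued;
    }

    public static void main(String[] args) {
        // Toy JobManager: deleting "D" reports that "C"'s carrydown changed.
        JobManager mgr = (jobID, links, docs, method) ->
            new DocumentDescription[] { new DocumentDescription("C") };
        System.out.println(cleanup(mgr, 1L, new String[0], 0,
            new DocumentDescription[] { new DocumentDescription("D") }));
    }
}
```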


**Problem:
We need to obtain legalLinkTypes and hopcountMethod in order to call this method instead of the method

  public void cleanupIngestedDocumentIdentifiers(DocumentDescription[] identifiers)
    throws ManifoldCFException

... which we call today.

**Solution: I have all the necessary information in DocumentCleanupThread.  I just need to rework the
thread code to correlate the documents with their per-job legalLinkTypes and hopcountMethod values
so that the right method gets called.

**Problem:
When, during cleanup stuffing, I detect documents that are legal candidates for cleanup but are shared
with other jobs that are not active, what should I do?

**Solution:
Since the right database cleanup involves calling markDocumentDeletedMultiple(), the documents
must still be queued, with a signal flag that tells DocumentCleanupThread not to actually delete them
from the index.  But is there a race condition here?  Since we cannot queue the same document for another
job until the processing is complete, there probably isn't.  But we need to add a special bit to the
queue, which signals whether to delete the document from the search index or not, and also change
both the stuffer and the cleanup threads to do the right thing with that bit.
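A rough sketch of what that extra bit could look like on a cleanup queue entry (names are illustrative, not the actual schema): database-side cleanup always runs, but the index delete is gated by the bit the stuffer sets.

```java
// Illustrative sketch of the proposed "delete from index?" bit on a cleanup
// queue entry.  Documents shared with other, inactive jobs are still queued
// so database cleanup runs, but the cleanup thread skips the index delete.
public class CleanupQueueEntrySketch {
    static class CleanupEntry {
        final String documentID;
        final boolean deleteFromIndex;  // the new bit set by the stuffer
        CleanupEntry(String documentID, boolean deleteFromIndex) {
            this.documentID = documentID;
            this.deleteFromIndex = deleteFromIndex;
        }
    }

    // Returns a description of the actions taken for the entry.
    static String process(CleanupEntry entry) {
        // Database-side cleanup (hopcount/carrydown) always happens...
        String action = "db-cleanup(" + entry.documentID + ")";
        // ...but the index delete is conditional on the bit.
        if (entry.deleteFromIndex)
            action += " + index-delete(" + entry.documentID + ")";
        return action;
    }

    public static void main(String[] args) {
        System.out.println(process(new CleanupEntry("D", true)));  // orphan: both actions
        System.out.println(process(new CleanupEntry("S", false))); // shared doc: db only
    }
}
```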


> Logic for dealing with unreachable documents at the end of a non-continuous job run does not handle hopcount and carrydown correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-146
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-146
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Framework crawler agent
>            Reporter: Karl Wright
>
> The same logic is used for deleting documents that belong to jobs that are going away, and documents that belong to jobs that are just cleaning up after a crawl.  A shortcut in the logic makes it appropriate at this time only for jobs that are going away entirely.  No hopcount or carrydown cleanup is ever done, for instance.
> A solution may involve having separate stuffer and worker threads for these two circumstances.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.