Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2019/04/25 18:42:00 UTC

[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

    [ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826349#comment-16826349 ] 

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

Continuous crawling bases the next crawl time on when the document last changed.  In general, it doubles the crawl interval each time, up to the maximum, before retrying.  So if your document doesn't change very often, the crawler may wait quite some time before revisiting it.
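
As a rough illustration of the doubling described above, here is a minimal Python sketch. It is not ManifoldCF's actual code: the function names, the use of minutes as the unit, and the 60/480 values in the usage note are all assumptions made for the example, and it only models a document that never changes.

```python
from typing import List


def next_interval(current_minutes: int, max_minutes: int) -> int:
    """Double the recrawl interval, capped at the maximum (hypothetical helper)."""
    return min(current_minutes * 2, max_minutes)


def recrawl_offsets(start_minutes: int, max_minutes: int, checks: int) -> List[int]:
    """Return the minutes after the first fetch at which an unchanged
    document would be rechecked, assuming the interval doubles after
    every check until it reaches the maximum."""
    offsets = []
    elapsed = 0
    interval = start_minutes
    for _ in range(checks):
        elapsed += interval
        offsets.append(elapsed)
        interval = next_interval(interval, max_minutes)
    return offsets
```

Under these assumptions, recrawl_offsets(60, 480, 5) gives rechecks at 60, 180, 420, 900 and 1380 minutes, which shows how quickly an unchanging document drifts toward being visited only once per maximum interval.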

The best way to see what it is going to do is to find the document in the Document Status report, and see when ManifoldCF intends to recrawl it.



> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
>                 Key: CONNECTORS-1602
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> When crawling a website in continuous crawling mode, we saw that not all documents are recrawled.
> The site is quite extensive. We figured out that, after being crawled, a document/page gets a recrawl timestamp somewhere between the recrawl interval and the max recrawl interval.
> But if these timestamps fall within the first crawl, Manifold starts recrawling those documents but seems to ignore the rest of the website. Also, some documents get recrawled 5 times while others don't get recrawled at all, apparently due to the same issue.
>  
> Is it possible to shed a bit more light on the continuous crawling?
> Is it a good system to use for crawling an (extensive) website?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)