Posted to dev@manifoldcf.apache.org by "Tim Steenbeke (JIRA)" <ji...@apache.org> on 2018/12/18 07:49:00 UTC
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due
to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782 ]
Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM:
---------------------------------------------------------------------
[~kwright@metacarta.com]
If we update to ManifoldCF 2.12, can we then use the seedmap as we originally intended?
We would create a job with X seeds, Elasticsearch output, web input, and a hop count of 0 for both links and redirects:
# Put X seeds in the seedmap
# run job
# X documents get pushed to ES
# update job to have X minus 20 seeds
# wait till scheduled time
# run job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# wait till scheduled time
# ...
Will it work like this?
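To make the expectation concrete, here is a minimal sketch of the outcome described in the steps above. It only models the expected result as set arithmetic on the seed lists; it is not ManifoldCF code and does not call any ManifoldCF or Elasticsearch API. The function name plan_cleanup and the example.com URLs are hypothetical, and the sketch assumes that with a hop count of 0 the set of indexed documents equals the set of seeds.

```python
def plan_cleanup(previous_seeds, current_seeds):
    """Model the expected result of rerunning a job after its seeds change.

    Assumption: with hop count 0 for links and redirects, only the seed
    URLs themselves are indexed, so indexed documents == seeds.
    """
    previous = set(previous_seeds)
    current = set(current_seeds)
    return {
        # Seeds that were dropped: expected to be deleted from the index.
        "delete": sorted(previous - current),
        # Seeds present in both runs: expected to be re-crawled and updated.
        "update": sorted(previous & current),
        # Seeds that are new in this run: expected to be added.
        "add": sorted(current - previous),
    }


# Hypothetical seed lists: X = 100 seeds, then X minus 20.
first_run = [f"https://example.com/page{i}" for i in range(100)]
second_run = first_run[:-20]

plan = plan_cleanup(first_run, second_run)
print(len(plan["delete"]), len(plan["update"]), len(plan["add"]))  # 20 80 0
```

If the fix behaves as hoped, the second run would delete the 20 dropped seeds from ES and update the remaining 80.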
> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after I change the seeds and rerun the job.
> I update my job to change the seedmap and then rerun it, or let the scheduler keep it running after the update.
> After the rerun, the unreachable documents don't get deleted.
> The job only adds documents when they can be reached.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)