Posted to dev@manifoldcf.apache.org by lalit jangra <la...@gmail.com> on 2014/02/18 10:30:13 UTC

How ManifoldCF scheduling behaves?

Hi,

I am working on integrating ManifoldCF (MCF) with the Alfresco CMS as a
repository connector using CMIS queries, with Solr as the output channel
where all the indexes are stored. This works fine and I can search documents
in the Solr index.

Now, as part of the implementation, I am planning to introduce multiple
repositories such as SharePoint and file systems, so I will have three
document repositories: Alfresco, SharePoint and the file system. I am planning
to have scheduled jobs that crawl each of these repositories at particular
intervals, but I have the following concerns.

1. Although I am scheduling jobs at frequent intervals, I want to make
sure that MCF jobs pick up only content that is new or updated. For example,
if I have 100 docs during the current job run and 110 at the next run, I want
the job to process only the 10 new docs, not all 110.
2. As there are relatively few MCF tutorials available, I have no means of
confirming that MCF jobs behave this way. I assume MCF is intelligent enough
to do so, but I have no proof to substantiate it.
3. I want to know more about the MCF job schedule types: scan every document
once/rescan documents directly. Similarly, I want to know more about the job
invocation options: complete/minimal. Apologies for being a newbie.
4. I am also considering some custom coding to ensure that only new or updated
docs are eligible for processing, but I would have to work from the code
alone, as little documentation is available.
5. Is it wise to do custom coding in this case, or does MCF provide all these
features OOTB?

I would appreciate any response.

Regards,
Lalit Jangra.

Re: How ManifoldCF scheduling behaves?

Posted by Karl Wright <da...@gmail.com>.
Hi Lalit,

First --- you may want to sign up for this list -- or, since your question
is really more of a user question, sign up for
users@manifoldcf.apache.org instead.  Otherwise I need to moderate your
mail through.

Second --- the answer to your question about incremental crawling is, "yes,
MCF does this out of the box".  I highly suggest buying the book to
understand how it works.  http://www.manning.com/wright .
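
As a rough illustration of what incremental crawling means in general, here
is a toy sketch in Python. It is not ManifoldCF code: the function name
incremental_crawl, the version strings, and the dictionaries are all
hypothetical, and it assumes change detection works by comparing a
per-document version string against the one recorded on the previous run.

```python
def incremental_crawl(repository, seen_versions):
    """Return the IDs of documents that need (re)indexing.

    repository:    dict of doc_id -> current version string (hypothetical)
    seen_versions: dict of doc_id -> version recorded on the last run;
                   updated in place so the next run can compare against it
    """
    to_index = []
    for doc_id, version in repository.items():
        # A document is processed only if it is new or its version changed.
        if seen_versions.get(doc_id) != version:
            to_index.append(doc_id)
            seen_versions[doc_id] = version
    return to_index


# First run: 100 documents, all of them new.
repo = {f"doc{i}": "v1" for i in range(100)}
seen = {}
first = incremental_crawl(repo, seen)

# Second run: 10 documents were added, nothing else changed.
repo.update({f"doc{i}": "v1" for i in range(100, 110)})
second = incremental_crawl(repo, seen)

print(len(first), len(second))  # prints: 100 10
```

In this sketch the second run touches only the 10 added documents, which is
the behaviour asked about in the original question.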

Third --- there are two ways to run a job: "normal" and "minimal".  The
"normal" crawling cycle discovers documents, crawls those and any documents
they reference, and then cleans up those documents that could not be
reached during the crawl.  "Minimal" does the same but doesn't try to clean
up removed documents at the end of the crawl.  For your purposes you will
need both cycles; "minimal" most of the time, "normal" once in a while.
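
The difference between the two cycles described above can be sketched with a
toy model in Python. This is only an illustration under simplifying
assumptions, not how ManifoldCF is implemented: run_job, reachable and index
are hypothetical names, and documents are reduced to plain IDs in a set.

```python
def run_job(reachable, index, minimal=False):
    """Toy crawl cycle.

    reachable: set of document IDs the crawl could reach this time
    index:     set of document IDs currently in the output index (mutated)
    minimal:   if True, skip the cleanup of unreachable documents
    Returns the set of document IDs removed from the index.
    """
    # Both cycles discover and index whatever the crawl reached.
    index |= reachable
    if minimal:
        return set()
    # Only the "normal" cycle cleans up documents that were not reached.
    unreachable = index - reachable
    index -= unreachable
    return unreachable


index = {"a", "b", "c"}   # indexed by an earlier run
crawl = {"a", "b", "d"}   # "c" was deleted at the source, "d" is new

removed_minimal = run_job(crawl, index, minimal=True)
after_minimal = set(index)
removed_normal = run_job(crawl, index)

print(sorted(after_minimal))   # prints: ['a', 'b', 'c', 'd']
print(sorted(removed_normal))  # prints: ['c']
print(sorted(index))           # prints: ['a', 'b', 'd']
```

After the "minimal" run the stale document "c" is still in the index; it
only disappears once a "normal" run performs the cleanup, which is why an
occasional "normal" run is needed.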

Fourth --- continuous crawling is OK for some tasks but not others.  It's
well suited to a situation where you really only want fresh content and
you want to expire older content.

I really recommend reading the book, though, because I can't give you all
the detail you will need in a post really.

Karl


