Posted to dev@manifoldcf.apache.org by Raman Gupta <ro...@gmail.com> on 2019/05/24 20:24:34 UTC

Repository connector for source with delta API

My team is creating a new repository connector. The source system has
a delta API that lets us know of all new, modified, and deleted
individual folders and documents since the last call to the API. Each
call to the delta API provides the changes, as well as a token which
can be provided on subsequent calls to get changes since that token
was generated/returned.

What is the best approach to building a repo connector to a system
that has this type of delta API?

Our first design was an implementation that specifies
`MODEL_ADD_CHANGE_DELETE` and then:

* In addSeedDocuments, on the initial call we seed every document in
the source system. On subsequent calls, we use the delta API to seed
every added, modified, or deleted file. We return the delta API token
as the version value of addSeedDocuments, so that it can be used on
subsequent calls.

* In processDocuments, we do the usual thing for each document identifier.
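
A rough sketch of that first design's addSeedDocuments (DeltaClient and
DeltaResult here are stand-ins for our source system's SDK, not
ManifoldCF classes):

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  // Hypothetical client for the source system's delta API
  DeltaClient client = createDeltaClient();
  if (lastSeedVersion == null || lastSeedVersion.isEmpty())
  {
    // First run: seed every document in the source system
    for (String id : client.listAllDocumentIds())
      activities.addSeedDocument(id);
    // The returned token is stored by the framework as the seed version
    return client.currentToken();
  }
  // Later runs: seed only documents added, modified, or deleted since the token
  DeltaResult delta = client.changesSince(lastSeedVersion);
  for (String id : delta.changedDocumentIds())
    activities.addSeedDocument(id);
  return delta.newToken();
}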

On prototyping, this works for new docs, but "processDocuments" is
never triggered for modified and deleted docs.

A second design we are considering is to use
MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
one "virtual" document, which represents the root of the remote repo.

Then, in "processDocuments" the new "document" is used to determine
all the child documents of that delta call, which are then added to
the queue via `activities.addDocumentReference`. To force the "virtual
seed" to trigger processDocuments again on the next call to
`addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
well.

With this alternative design, the stage 1 seed effectively becomes a
no-op, and is just used as a mechanism to trigger stage 2.
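
In processDocuments, the virtual-seed handling would then look roughly
like this (deltaClient, lastToken, VIRTUAL_ROOT_ID, and the "in_repo"
link name are placeholders from our own code, not ManifoldCF API):

for (String documentIdentifier : documentIdentifiers)
{
  if (documentIdentifier.equals(VIRTUAL_ROOT_ID))
  {
    // Ask the delta API for everything changed since the stored token
    DeltaResult delta = deltaClient.changesSince(lastToken);
    for (String childId : delta.changedDocumentIds())
    {
      // Queue each changed document as a child of the virtual root
      activities.addDocumentReference(childId, VIRTUAL_ROOT_ID, "in_repo");
    }
    // Delete the virtual root so the next addSeedDocuments call re-seeds
    // it and this branch runs again
    activities.deleteDocument(VIRTUAL_ROOT_ID);
  }
  else
  {
    // ... normal fetch / version check / ingest for real documents ...
  }
}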

Thoughts?

Regards,
Raman Gupta

Re: Repository connector for source with delta API

Posted by Raman Gupta <ro...@gmail.com>.
On Mon, May 27, 2019 at 5:58 PM Karl Wright <da...@gmail.com> wrote:
>
> (1) There should be no new tables needed for any of this.  Your seed list
> can be stored in the job specification information.  See the rss connector
> for simple example of how this might be done.

Are you assuming the seed list is static? The RSS connector only
changes the job specification in the `processSpecificationPost`
method, and I assume that the job spec is read-only in
`addSeedDocuments`.

As I thought I made clear in previous messages, my seed list is
dynamic, which is why I switched to MODEL_ALL -- on each call to
addSeedDocuments, I can dynamically determine which seeds are
relevant, and only provide that list. To provide additional context,
the dynamic seed list is based on regular expression matches, where
the underlying seeds/roots can come and go based on which ones match
the regexes, and the regexes are present in the document spec.
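
Concretely, our addSeedDocuments does roughly the following (the
"rootregex" node name and sourceClient.listRoots() are specific to our
connector, not ManifoldCF API):

// Collect the regexes configured in the document specification
List<Pattern> patterns = new ArrayList<>();
for (int i = 0; i < spec.getChildCount(); i++)
{
  SpecificationNode node = spec.getChild(i);
  if (node.getType().equals("rootregex"))
    patterns.add(Pattern.compile(node.getAttributeValue("value")));
}
// Seed every root in the source system that matches at least one regex
for (String root : sourceClient.listRoots())
{
  for (Pattern p : patterns)
  {
    if (p.matcher(root).matches())
    {
      activities.addSeedDocument(root);
      break;
    }
  }
}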

> (2) If you have switched to MODEL_ALL then all you need to do is provide a
> mechanism for any given document for determining which seed it comes from,
> and simply look for that in the job specification.  If not there, call
> activities.removeDocument().

See above.

Regards,
Raman

> Karl
>
>
> On Mon, May 27, 2019 at 5:16 PM Raman Gupta <ro...@gmail.com> wrote:
>
> > One seed per job is an interesting approach but in the interests of
> > fully understanding the alternatives let me consider choice #2.
> >
> > >  you might want to combine this all into one job, but then you would
> > need to link your documents somehow to the seed they came from, so that if
> > the seed was no longer part of the job specification, it could always be
> > detected as a deletion.
> >
> > There are good reasons for me to prefer a single job, so how would I
> > accomplish this? Should my connector create its own tables and manage
> > this state there? Or is there another more light-weight approach?
> >
> > > Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE,
> > because under that scheme you'd need to *detect* the deletion, because you
> > wouldn't be told by the repository that somebody had changed the
> > configuration.
> >
> > That is fine and I understand completely -- I forgot to mention in my
> > previous message that I've already switched to MODEL_ALL, and am
> > detecting and providing the list of currently active seeds on every
> > call to addSeedDocuments.
> >
> > Regards,
> > Raman
> >
> > On Mon, May 27, 2019 at 4:55 PM Karl Wright <da...@gmail.com> wrote:
> > >
> > > This is very different from the design you originally told me you were
> > > going to do.
> > >
> > > Generally, using hopcounts for managing your documents is a bad practice;
> > > this is expensive to do and almost always yields unexpected results.
> > > You could have one job per seed, which means all you have to do to make
> > the
> > > seed go away is delete the job corresponding to it.  If you have way too
> > > many seeds for that, you might want to combine this all into one job, but
> > > then you would need to link your documents somehow to the seed they came
> > > from, so that if the seed was no longer part of the job specification, it
> > > could always be detected as a deletion.  Unfortunately, this is
> > > inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme
> > you'd
> > > need to *detect* the deletion, because you wouldn't be told by the
> > > repository that somebody had changed the configuration.
> > >
> > > So two choices: (1) Exactly one seed per job, or (2) don't use
> > > MODEL_ADD_CHANGE_DELETE.
> > >
> > > Karl
> > >
> > >
> > > On Mon, May 27, 2019 at 4:38 PM Raman Gupta <ro...@gmail.com>
> > wrote:
> > >
> > > > Thanks for your help Karl. So I think I'm converging on a design.
> > > > First of all, per your recommendation, I've switched to scheduled
> > > > crawl and it executes as expected every minute with the "schedule
> > > > window anytime" setting.
> > > >
> > > > My next problem is dealing with seed deletion. My upstream source
> > > > actually has multiple "roots" i.e. each root has its own set of
> > > > documents, and the delta API must be called once for each "root". To
> > > > deal with this, I'm specifying each "root" as  a "seed document", and
> > > > each such root/seed creates "contained_in" documents. It is also
> > > > possible for a "root" to be deleted by a user of the upstream system.
> > > >
> > > > My job is defined with an accurate hopcount as follows:
> > > >
> > > > "job": {
> > > >   ... snip naming, scheduling, output connectors, doc spec....
> > > >   "hopcount_mode" to "accurate"
> > > >   "hopcount" to json {
> > > >     "link_type" to "contained_in"
> > > >     "count" to 1
> > > >   },
> > > >
> > > > For each seed, in processDocuments I am doing:
> > > >
> > > > activities.addDocumentReference("... doc identifier ...",
> > > > seedDocumentIdentifier, "contained_in");
> > > >
> > > > and then this triggers processDocuments for each of those documents,
> > > > as expected.
> > > >
> > > > How do I code the connector such that I can now remove the documents
> > > > that are now unreachable due to the deleted seed? I don't see any
> > > > calls to `processDocuments` via the framework that would allow me to
> > > > do this.
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 7:29 PM Karl Wright <da...@gmail.com>
> > wrote:
> > > > >
> > > > > Hi Raman,
> > > > >
> > > > > (1) Continuous crawl is not a good model for you.  It's meant for
> > > > crawling
> > > > > large web domains, not the kind of task you are doing.
> > > > > (2) Scheduled crawl will work fine for you if you simply tell it
> > "start
> > > > > within schedule window" and make sure your schedule completely covers
> > > > 7x24
> > > > > times.  So you can do this with one record, which triggers on every
> > day
> > > > of
> > > > > the week, that has a schedule window of 24 hours.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <ro...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Yes, we are indeed running it in continuous crawl mode. Scheduled
> > mode
> > > > > > works, but given we have a delta API, we thought this is what makes
> > > > > > sense, as the delta API is efficient and we don't need to wait an
> > > > > > entire day for a scheduled job to run. I see that if I change
> > recrawl
> > > > > > interval and max recrawl interval also to 1 minute, then my
> > documents
> > > > > > do get processed each time. However, now we have the opposite
> > problem:
> > > > > > now the documents are reprocessed every minute, regardless of
> > whether
> > > > > > they were reseeded or not, which makes no sense to me. If I am
> > using
> > > > > > MODEL_ADD_CHANGE_DELETE and not returning anything in my seed
> > method,
> > > > > > then why are the same documents being reprocessed over and over? I
> > > > > > have sent the output to the NullOutput using
> > > > > > `ingestDocumentWithException` and the status shows OK, and yet the
> > > > > > same documents are repeatedly sent to processDocuments.
> > > > > >
> > > > > > I just want to process the particular documents I specify on each
> > > > > > iteration every 60 seconds -- no more, no less, and yet I seem
> > unable
> > > > > > to build a connector that does this.
> > > > > >
> > > > > > If I move to a non-contiguous mode, do I really have to create 1440
> > > > > > schedule objects, one for each minute of each day? The way the
> > > > > > schedule seems to be put together, I don't see a way to just
> > schedule
> > > > > > every minute with one schedule. I would have expected schedules to
> > > > > > just use cron expressions.
> > > > > >
> > > > > > If I move to the design #2 in my OP and have one "virtual
> > document" to
> > > > > > just avoid the seeding stage all-together, then is there some place
> > > > > > where I can store the delta token state? Or does my connector have
> > to
> > > > > > create its own db table to store this?
> > > > > >
> > > > > > Regards,
> > > > > > Raman
> > > > > >
> > > > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > > > >
> > > > > > > You were saying that every minute a addSeedDocuments is being
> > called,
> > > > > > > correct?  It sounds to me like you are running this job in
> > continuous
> > > > > > crawl
> > > > > > > mode.  Can you try running the job in non-continuous mode, and
> > just
> > > > > > > repeating the job run once it completes?
> > > > > > >
> > > > > > > The reason I ask is because continuous crawling has very unique
> > > > kinds of
> > > > > > > ways of dealing with documents it has crawled.  It uses
> > "exponential
> > > > > > > backoff" to schedule the next document crawl and that is probably
> > > > why you
> > > > > > > see the documents in the queue but not being processed; you
> > simply
> > > > > > haven't
> > > > > > > waited long enough.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <
> > rocketraman@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Here are my addSeedDocuments and processDocuments methods
> > > > simplifying
> > > > > > > > them down to the minimum necessary to show what is happening:
> > > > > > > >
> > > > > > > > @Override
> > > > > > > > public String addSeedDocuments(ISeedingActivity activities,
> > > > > > Specification
> > > > > > > > spec,
> > > > > > > >                                String lastSeedVersion, long
> > > > seedTime,
> > > > > > > > int jobMode)
> > > > > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > > > > {
> > > > > > > >   // return the same 3 docs every time, simulating an initial
> > > > load, and
> > > > > > > > then
> > > > > > > >   // these 3 docs changing constantly
> > > > > > > >   System.out.println(String.format("-=-= SeedTime=%s",
> > seedTime));
> > > > > > > >   activities.addSeedDocument("100");
> > > > > > > >   activities.addSeedDocument("110");
> > > > > > > >   activities.addSeedDocument("120");
> > > > > > > >   System.out.println("SEEDING DONE");
> > > > > > > >   return null;
> > > > > > > > }
> > > > > > > >
> > > > > > > > @Override
> > > > > > > > public void processDocuments(String[] documentIdentifiers,
> > > > > > > > IExistingVersions statuses, Specification spec,
> > > > > > > >                              IProcessActivity activities, int
> > > > jobMode,
> > > > > > > > boolean usesDefaultAuthority)
> > > > > > > >   throws ManifoldCFException, ServiceInterruption {
> > > > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > > > > > > Arrays.deepToString(documentIdentifiers) );
> > > > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > > > >   //  activities.deleteDocument(documentIdentifier);
> > > > > > > >   //}
> > > > > > > >
> > > > > > > >   // I've commented out all subsequent logic here, but adding
> > the
> > > > call
> > > > > > to
> > > > > > > >   // activities.ingestDocumentWithException(documentIdentifier,
> > > > > > > > version, documentUri, rd);
> > > > > > > >   // does not change anything
> > > > > > > > }
> > > > > > > >
> > > > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > > > > > > MODEL_ADD_CHANGE, the output of this is:
> > > > > > > >
> > > > > > > > -=-= SeedTime=1558733436082
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [200]
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [220]
> > > > > > > > -=--=-= PROCESS DOCUMENTS: [210]
> > > > > > > > -=-= SeedTime=1558733549367
> > > > > > > > -=-= SeedTime=1558733609384
> > > > > > > > -=-= SeedTime=1558733436082
> > > > > > > > etc.
> > > > > > > >
> > > > > > > >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and
> > > > then
> > > > > > > > never again, even though "SEEDING DONE" is printing every
> > minute.
> > > > If
> > > > > > > > and only if I uncomment the for loop which deletes the
> > documents
> > > > does
> > > > > > > > "processDocuments" get called again for those seed document
> > ids.
> > > > > > > >
> > > > > > > > I do note that the queue shows documents 100, 110, and 120 in
> > state
> > > > > > > > "Waiting for processing", and nothing I do seems to affect
> > that.
> > > > The
> > > > > > > > database update in JobQueue.updateExistingRecordInitial is a
> > no-op
> > > > for
> > > > > > > > these docs, as the status of them is STATUS_PENDINGPURGATORY
> > and
> > > > the
> > > > > > > > update does not actually change anything in the db.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Raman
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <
> > daddywri@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > For any given job run, all documents that are added via
> > > > > > > > addSeedDocuments()
> > > > > > > > > should be processed.  There is no magic in the framework that
> > > > somehow
> > > > > > > > knows
> > > > > > > > > that a document has been created vs. modified vs. deleted
> > until
> > > > > > > > > processDocuments() is called.  If your claim is that this
> > > > contract
> > > > > > is not
> > > > > > > > > being honored, could you try changing your connector model to
> > > > > > > > > MODEL_ADD_CHANGE, just temporarily, to see if everything
> > seems to
> > > > > > work
> > > > > > > > > using that model.  If it does *not* then clearly you've got
> > some
> > > > > > kind of
> > > > > > > > > implementation problem at the addSeedDocuments() level
> > because
> > > > most
> > > > > > of
> > > > > > > > the
> > > > > > > > > Manifold connectors use that model.
> > > > > > > > >
> > > > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step
> > is
> > > > to
> > > > > > figure
> > > > > > > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <
> > > > rocketraman@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <
> > > > daddywri@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > For ADD_CHANGE_DELETE, the contract for
> > addSeedDocuments()
> > > > > > basically
> > > > > > > > says
> > > > > > > > > > > that you have to include *at least* the documents that
> > were
> > > > > > changed,
> > > > > > > > > > added,
> > > > > > > > > > > or deleted since the previous stamp, and if no stamp is
> > > > > > provided, it
> > > > > > > > > > should
> > > > > > > > > > > return ALL specified documents.  Are you doing that?
> > > > > > > > > >
> > > > > > > > > > Yes, the delta API gives us all the changed, added, and
> > deleted
> > > > > > > > > > documents, and those are exactly the ones that we are
> > > > including.
> > > > > > > > > >
> > > > > > > > > > > If you are, the next thing to look at is the computation
> > of
> > > > the
> > > > > > > > version
> > > > > > > > > > > string.  The version string is what is used to figure
> > out if
> > > > a
> > > > > > change
> > > > > > > > > > took
> > > > > > > > > > > place.  You need this IN ADDITION TO the
> > addSeedDocuments()
> > > > > > doing the
> > > > > > > > > > right
> > > > > > > > > > > thing.  For deleted documents, obviously the
> > > > processDocuments()
> > > > > > > > should
> > > > > > > > > > call
> > > > > > > > > > > the activities.deleteDocument() method.
> > > > > > > > > >
> > > > > > > > > > The version String is calculated by `processDocuments`.
> > Since
> > > > after
> > > > > > > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > > > > > > `processDocuments` is never called again for that document,
> > > > even
> > > > > > > > > > though it has been modified to document A version 2.
> > > > Therefore, our
> > > > > > > > > > connector never gets a chance to return the "version 2"
> > string.
> > > > > > > > > >
> > > > > > > > > > > Does this sound like what your code is doing?
> > > > > > > > > >
> > > > > > > > > > Yes, as far as we can go given the fact that
> > > > `processDocuments` is
> > > > > > > > > > only called once for any particular document identifier.
> > > > > > > > > >
> > > > > > > > > > > Karl
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <
> > > > > > rocketraman@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > My team is creating a new repository connector. The
> > source
> > > > > > system
> > > > > > > > has
> > > > > > > > > > > > a delta API that lets us know of all new, modified, and
> > > > deleted
> > > > > > > > > > > > individual folders and documents since the last call
> > to the
> > > > > > API.
> > > > > > > > Each
> > > > > > > > > > > > call to the delta API provides the changes, as well as
> > a
> > > > token
> > > > > > > > which
> > > > > > > > > > > > can be provided on subsequent calls to get changes
> > since
> > > > that
> > > > > > token
> > > > > > > > > > > > was generated/returned.
> > > > > > > > > > > >
> > > > > > > > > > > > What is the best approach to building a repo connector
> > to a
> > > > > > system
> > > > > > > > > > > > that has this type of delta API?
> > > > > > > > > > > >
> > > > > > > > > > > > Our first design was an implementation that specifies
> > > > > > > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > > > >
> > > > > > > > > > > > * In addSeedDocuments, on the initial call we seed
> > every
> > > > > > document
> > > > > > > > in
> > > > > > > > > > > > the source system. On subsequent calls, we use the
> > delta
> > > > API to
> > > > > > > > seed
> > > > > > > > > > > > every added, modified, or deleted file. We return the
> > > > delta API
> > > > > > > > token
> > > > > > > > > > > > as the version value of addSeedDocuments, so that it
> > an be
> > > > > > used on
> > > > > > > > > > > > subsequent calls.
> > > > > > > > > > > >
> > > > > > > > > > > > * In processDocuments, we do the usual thing for each
> > > > document
> > > > > > > > > > identifier.
> > > > > > > > > > > >
> > > > > > > > > > > > On prototyping, this works for new docs, but
> > > > > > "processDocuments" is
> > > > > > > > > > > > never triggered for modified and deleted docs.
> > > > > > > > > > > >
> > > > > > > > > > > > A second design we are considering is to use
> > > > > > > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have
> > addSeedDocuments
> > > > > > return
> > > > > > > > only
> > > > > > > > > > > > one "virtual" document, which represents the root of
> > the
> > > > remote
> > > > > > > > repo.
> > > > > > > > > > > >
> > > > > > > > > > > > Then, in "processDocuments" the new "document" is used
> > to
> > > > > > determine
> > > > > > > > > > > > all the child documents of that delta call, which are
> > then
> > > > > > added to
> > > > > > > > > > > > the queue via `activities.addDocumentReference`. To
> > force
> > > > the
> > > > > > > > "virtual
> > > > > > > > > > > > seed" to trigger processDocuments again on the next
> > call to
> > > > > > > > > > > > `addSeedDocuments`, we do
> > > > > > > > `activities.deleteDocument(virtualDocId)` as
> > > > > > > > > > > > well.
> > > > > > > > > > > >
> > > > > > > > > > > > With this alternative design, the stage 1 seed
> > effectively
> > > > > > becomes
> > > > > > > > a
> > > > > > > > > > > > no-op, and is just used as a mechanism to trigger
> > stage 2.
> > > > > > > > > > > >
> > > > > > > > > > > > Thoughts?
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Raman Gupta
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >

Re: Repository connector for source with delta API

Posted by Karl Wright <da...@gmail.com>.
(1) There should be no new tables needed for any of this.  Your seed list
can be stored in the job specification information.  See the rss connector
for a simple example of how this might be done.

(2) If you have switched to MODEL_ALL then all you need to do is provide a
mechanism for any given document for determining which seed it comes from,
and simply look for that in the job specification.  If not there, call
activities.removeDocument().
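
Something like this inside processDocuments, in other words (rootOf()
and seedsFromSpecification() stand for whatever lookup your connector
uses):

Set<String> activeSeeds = seedsFromSpecification(spec);
for (String documentIdentifier : documentIdentifiers)
{
  // Work out which seed/root this document belongs to
  String seed = rootOf(documentIdentifier);
  if (!activeSeeds.contains(seed))
  {
    // Its seed is no longer in the job specification, so get rid of it
    activities.removeDocument(documentIdentifier);
    continue;
  }
  // ... normal version check and ingestion ...
}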

Karl


On Mon, May 27, 2019 at 5:16 PM Raman Gupta <ro...@gmail.com> wrote:

> One seed per job is an interesting approach but in the interests of
> fully understanding the alternatives let me consider choice #2.
>
> >  you might want to combine this all into one job, but then you would
> need to link your documents somehow to the seed they came from, so that if
> the seed was no longer part of the job specification, it could always be
> detected as a deletion.
>
> There are good reasons for me to prefer a single job, so how would I
> accomplish this? Should my connector create its own tables and manage
> this state there? Or is there another more light-weight approach?
>
> > Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE,
> because under that scheme you'd need to *detect* the deletion, because you
> wouldn't be told by the repository that somebody had changed the
> configuration.
>
> That is fine and I understand completely -- I forgot to mention in my
> previous message that I've already switched to MODEL_ALL, and am
> detecting and providing the list of currently active seeds on every
> call to addSeedDocuments.
>
> Regards,
> Raman
>
> On Mon, May 27, 2019 at 4:55 PM Karl Wright <da...@gmail.com> wrote:
> >
> > This is very different from the design you originally told me you were
> > going to do.
> >
> > Generally, using hopcounts for managing your documents is a bad practice;
> > this is expensive to do and almost always yields unexpected results.
> > You could have one job per seed, which means all you have to do to make
> the
> > seed go away is delete the job corresponding to it.  If you have way too
> > many seeds for that, you might want to combine this all into one job, but
> > then you would need to link your documents somehow to the seed they came
> > from, so that if the seed was no longer part of the job specification, it
> > could always be detected as a deletion.  Unfortunately, this is
> > inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme
> you'd
> > need to *detect* the deletion, because you wouldn't be told by the
> > repository that somebody had changed the configuration.
> >
> > So two choices: (1) Exactly one seed per job, or (2) don't use
> > MODEL_ADD_CHANGE_DELETE.
> >
> > Karl
> >
> >
> > On Mon, May 27, 2019 at 4:38 PM Raman Gupta <ro...@gmail.com>
> wrote:
> >
> > > Thanks for your help Karl. So I think I'm converging on a design.
> > > First of all, per your recommendation, I've switched to scheduled
> > > crawl and it executes as expected every minute with the "schedule
> > > window anytime" setting.
> > >
> > > My next problem is dealing with seed deletion. My upstream source
> > > actually has multiple "roots" i.e. each root has its own set of
> > > documents, and the delta API must be called once for each "root". To
> > > deal with this, I'm specifying each "root" as  a "seed document", and
> > > each such root/seed creates "contained_in" documents. It is also
> > > possible for a "root" to be deleted by a user of the upstream system.
> > >
> > > My job is defined with an accurate hopcount as follows:
> > >
> > > "job": {
> > >   ... snip naming, scheduling, output connectors, doc spec....
> > >   "hopcount_mode" to "accurate"
> > >   "hopcount" to json {
> > >     "link_type" to "contained_in"
> > >     "count" to 1
> > >   },
> > >
> > > For each seed, in processDocuments I am doing:
> > >
> > > activities.addDocumentReference("... doc identifier ...",
> > > seedDocumentIdentifier, "contained_in");
> > >
> > > and then this triggers processDocuments for each of those documents,
> > > as expected.
> > >
> > > How do I code the connector such that I can now remove the documents
> > > that are now unreachable due to the deleted seed? I don't see any
> > > calls to `processDocuments` via the framework that would allow me to
> > > do this.
> > >
> > > Regards,
> > > Raman
> > >
> > >
> > > On Fri, May 24, 2019 at 7:29 PM Karl Wright <da...@gmail.com>
> wrote:
> > > >
> > > > Hi Raman,
> > > >
> > > > (1) Continuous crawl is not a good model for you.  It's meant for
> > > crawling
> > > > large web domains, not the kind of task you are doing.
> > > > (2) Scheduled crawl will work fine for you if you simply tell it
> "start
> > > > within schedule window" and make sure your schedule completely covers
> > > 7x24
> > > > times.  So you can do this with one record, which triggers on every
> day
> > > of
> > > > the week, that has a schedule window of 24 hours.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <ro...@gmail.com>
> > > wrote:
> > > >
> > > > > Yes, we are indeed running it in continuous crawl mode. Scheduled
> mode
> > > > > works, but given we have a delta API, we thought this is what makes
> > > > > sense, as the delta API is efficient and we don't need to wait an
> > > > > entire day for a scheduled job to run. I see that if I change
> recrawl
> > > > > interval and max recrawl interval also to 1 minute, then my
> documents
> > > > > do get processed each time. However, now we have the opposite
> problem:
> > > > > now the documents are reprocessed every minute, regardless of
> whether
> > > > > they were reseeded or not, which makes no sense to me. If I am
> using
> > > > > MODEL_ADD_CHANGE_DELETE and not returning anything in my seed
> method,
> > > > > then why are the same documents being reprocessed over and over? I
> > > > > have sent the output to the NullOutput using
> > > > > `ingestDocumentWithException` and the status shows OK, and yet the
> > > > > same documents are repeatedly sent to processDocuments.
> > > > >
> > > > > I just want to process the particular documents I specify on each
> > > > > iteration every 60 seconds -- no more, no less, and yet I seem
> unable
> > > > > to build a connector that does this.
> > > > >
> > > > > If I move to a non-contiguous mode, do I really have to create 1440
> > > > > schedule objects, one for each minute of each day? The way the
> > > > > schedule seems to be put together, I don't see a way to just
> schedule
> > > > > every minute with one schedule. I would have expected schedules to
> > > > > just use cron expressions.
> > > > >
> > > > > If I move to the design #2 in my OP and have one "virtual
> document" to
> > > > > just avoid the seeding stage all-together, then is there some place
> > > > > where I can store the delta token state? Or does my connector have
> to
> > > > > create its own db table to store this?
> > > > >
> > > > > Regards,
> > > > > Raman
> > > > >
> > > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > > >
> > > > > > You were saying that every minute a addSeedDocuments is being
> called,
> > > > > > correct?  It sounds to me like you are running this job in
> continuous
> > > > > crawl
> > > > > > mode.  Can you try running the job in non-continuous mode, and
> just
> > > > > > repeating the job run once it completes?
> > > > > >
> > > > > > The reason I ask is because continuous crawling has very unique
> > > kinds of
> > > > > > ways of dealing with documents it has crawled.  It uses
> "exponential
> > > > > > backoff" to schedule the next document crawl and that is probably
> > > why you
> > > > > > see the documents in the queue but not being processed; you
> simply
> > > > > haven't
> > > > > > waited long enough.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <
> rocketraman@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Here are my addSeedDocuments and processDocuments methods
> > > simplifying
> > > > > > > them down to the minimum necessary to show what is happening:
> > > > > > >
> > > > > > > @Override
> > > > > > > public String addSeedDocuments(ISeedingActivity activities,
> > > > > Specification
> > > > > > > spec,
> > > > > > >                                String lastSeedVersion, long
> > > seedTime,
> > > > > > > int jobMode)
> > > > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > > > {
> > > > > > >   // return the same 3 docs every time, simulating an initial
> > > load, and
> > > > > > > then
> > > > > > >   // these 3 docs changing constantly
> > > > > > >   System.out.println(String.format("-=-= SeedTime=%s",
> seedTime));
> > > > > > >   activities.addSeedDocument("100");
> > > > > > >   activities.addSeedDocument("110");
> > > > > > >   activities.addSeedDocument("120");
> > > > > > >   System.out.println("SEEDING DONE");
> > > > > > >   return null;
> > > > > > > }
> > > > > > >
> > > > > > > @Override
> > > > > > > public void processDocuments(String[] documentIdentifiers,
> > > > > > > IExistingVersions statuses, Specification spec,
> > > > > > >                              IProcessActivity activities, int
> > > jobMode,
> > > > > > > boolean usesDefaultAuthority)
> > > > > > >   throws ManifoldCFException, ServiceInterruption {
> > > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > > > > > Arrays.deepToString(documentIdentifiers) );
> > > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > > >   //  activities.deleteDocument(documentIdentifier);
> > > > > > >   //}
> > > > > > >
> > > > > > >   // I've commented out all subsequent logic here, but adding
> the
> > > call
> > > > > to
> > > > > > >   // activities.ingestDocumentWithException(documentIdentifier,
> > > > > > > version, documentUri, rd);
> > > > > > >   // does not change anything
> > > > > > > }
> > > > > > >
> > > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > > > > > MODEL_ADD_CHANGE, the output of this is:
> > > > > > >
> > > > > > > -=-= SeedTime=1558733436082
> > > > > > > -=--=-= PROCESS DOCUMENTS: [200]
> > > > > > > -=--=-= PROCESS DOCUMENTS: [220]
> > > > > > > -=--=-= PROCESS DOCUMENTS: [210]
> > > > > > > -=-= SeedTime=1558733549367
> > > > > > > -=-= SeedTime=1558733609384
> > > > > > > -=-= SeedTime=1558733436082
> > > > > > > etc.
> > > > > > >
> > > > > > >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and
> > > then
> > > > > > > never again, even though "SEEDING DONE" is printing every
> minute.
> > > If
> > > > > > > and only if I uncomment the for loop which deletes the
> documents
> > > does
> > > > > > > "processDocuments" get called again for those seed document
> ids.
> > > > > > >
> > > > > > > I do note that the queue shows documents 100, 110, and 120 in
> state
> > > > > > > "Waiting for processing", and nothing I do seems to affect
> that.
> > > The
> > > > > > > database update in JobQueue.updateExistingRecordInitial is a
> no-op
> > > for
> > > > > > > these docs, as the status of them is STATUS_PENDINGPURGATORY
> and
> > > the
> > > > > > > update does not actually change anything in the db.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Raman
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <
> daddywri@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > For any given job run, all documents that are added via
> > > > > > > addSeedDocuments()
> > > > > > > > should be processed.  There is no magic in the framework that
> > > somehow
> > > > > > > knows
> > > > > > > > that a document has been created vs. modified vs. deleted
> until
> > > > > > > > processDocuments() is called.  If your claim is that this
> > > contract
> > > > > is not
> > > > > > > > being honored, could you try changing your connector model to
> > > > > > > > MODEL_ADD_CHANGE, just temporarily, to see if everything
> seems to
> > > > > work
> > > > > > > > using that model.  If it does *not* then clearly you've got
> some
> > > > > kind of
> > > > > > > > implementation problem at the addSeedDocuments() level
> because
> > > most
> > > > > of
> > > > > > > the
> > > > > > > > Manifold connectors use that model.
> > > > > > > >
> > > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step
> is
> > > to
> > > > > figure
> > > > > > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > > >
> > > > > > > > Karl
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <
> > > rocketraman@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <
> > > daddywri@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > For ADD_CHANGE_DELETE, the contract for
> addSeedDocuments()
> > > > > basically
> > > > > > > says
> > > > > > > > > > that you have to include *at least* the documents that
> were
> > > > > changed,
> > > > > > > > > added,
> > > > > > > > > > or deleted since the previous stamp, and if no stamp is
> > > > > provided, it
> > > > > > > > > should
> > > > > > > > > > return ALL specified documents.  Are you doing that?
> > > > > > > > >
> > > > > > > > > Yes, the delta API gives us all the changed, added, and
> deleted
> > > > > > > > > documents, and those are exactly the ones that we are
> > > including.
> > > > > > > > >
> > > > > > > > > > If you are, the next thing to look at is the computation
> of
> > > the
> > > > > > > version
> > > > > > > > > > string.  The version string is what is used to figure
> out if
> > > a
> > > > > change
> > > > > > > > > took
> > > > > > > > > > place.  You need this IN ADDITION TO the
> addSeedDocuments()
> > > > > doing the
> > > > > > > > > right
> > > > > > > > > > thing.  For deleted documents, obviously the
> > > processDocuments()
> > > > > > > should
> > > > > > > > > call
> > > > > > > > > > the activities.deleteDocument() method.
> > > > > > > > >
> > > > > > > > > The version String is calculated by `processDocuments`.
> Since
> > > after
> > > > > > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > > > > > `processDocuments` is never called again for that document,
> > > even
> > > > > > > > > though it has been modified to document A version 2.
> > > Therefore, our
> > > > > > > > > connector never gets a chance to return the "version 2"
> string.
> > > > > > > > >
> > > > > > > > > > Does this sound like what your code is doing?
> > > > > > > > >
> > > > > > > > > Yes, as far as we can go given the fact that
> > > `processDocuments` is
> > > > > > > > > only called once for any particular document identifier.
> > > > > > > > >
> > > > > > > > > > Karl
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <
> > > > > rocketraman@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > My team is creating a new repository connector. The
> source
> > > > > system
> > > > > > > has
> > > > > > > > > > > a delta API that lets us know of all new, modified, and
> > > deleted
> > > > > > > > > > > individual folders and documents since the last call
> to the
> > > > > API.
> > > > > > > Each
> > > > > > > > > > > call to the delta API provides the changes, as well as
> a
> > > token
> > > > > > > which
> > > > > > > > > > > can be provided on subsequent calls to get changes
> since
> > > that
> > > > > token
> > > > > > > > > > > was generated/returned.
> > > > > > > > > > >
> > > > > > > > > > > What is the best approach to building a repo connector
> to a
> > > > > system
> > > > > > > > > > > that has this type of delta API?
> > > > > > > > > > >
> > > > > > > > > > > Our first design was an implementation that specifies
> > > > > > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > > >
> > > > > > > > > > > * In addSeedDocuments, on the initial call we seed
> every
> > > > > document
> > > > > > > in
> > > > > > > > > > > the source system. On subsequent calls, we use the
> delta
> > > API to
> > > > > > > seed
> > > > > > > > > > > every added, modified, or deleted file. We return the
> > > delta API
> > > > > > > token
> > > > > > > > > > > as the version value of addSeedDocuments, so that it
> an be
> > > > > used on
> > > > > > > > > > > subsequent calls.
> > > > > > > > > > >
> > > > > > > > > > > * In processDocuments, we do the usual thing for each
> > > document
> > > > > > > > > identifier.
> > > > > > > > > > >
> > > > > > > > > > > On prototyping, this works for new docs, but
> > > > > "processDocuments" is
> > > > > > > > > > > never triggered for modified and deleted docs.
> > > > > > > > > > >
> > > > > > > > > > > A second design we are considering is to use
> > > > > > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have
> addSeedDocuments
> > > > > return
> > > > > > > only
> > > > > > > > > > > one "virtual" document, which represents the root of
> the
> > > remote
> > > > > > > repo.
> > > > > > > > > > >
> > > > > > > > > > > Then, in "processDocuments" the new "document" is used
> to
> > > > > determine
> > > > > > > > > > > all the child documents of that delta call, which are
> then
> > > > > added to
> > > > > > > > > > > the queue via `activities.addDocumentReference`. To
> force
> > > the
> > > > > > > "virtual
> > > > > > > > > > > seed" to trigger processDocuments again on the next
> call to
> > > > > > > > > > > `addSeedDocuments`, we do
> > > > > > > `activities.deleteDocument(virtualDocId)` as
> > > > > > > > > > > well.
> > > > > > > > > > >
> > > > > > > > > > > With this alternative design, the stage 1 seed
> effectively
> > > > > becomes
> > > > > > > a
> > > > > > > > > > > no-op, and is just used as a mechanism to trigger
> stage 2.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Raman Gupta
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

Re: Repository connector for source with delta API

Posted by Raman Gupta <ro...@gmail.com>.
One seed per job is an interesting approach but in the interests of
fully understanding the alternatives let me consider choice #2.

>  you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was no longer part of the job specification, it could always be detected as a deletion.

There are good reasons for me to prefer a single job, so how would I
accomplish this? Should my connector create its own tables and manage
this state there? Or is there another more light-weight approach?

> Unfortunately, this is inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd need to *detect* the deletion, because you wouldn't be told by the repository that somebody had changed the configuration.

That is fine and I understand completely -- I forgot to mention in my
previous message that I've already switched to MODEL_ALL, and am
detecting and providing the list of currently active seeds on every
call to addSeedDocuments.
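
Concretely, the switch just means our connector's model declaration is now:

@Override
public int getConnectorModel()
{
  return MODEL_ALL;
}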

Regards,
Raman

On Mon, May 27, 2019 at 4:55 PM Karl Wright <da...@gmail.com> wrote:
>
> This is very different from the design you originally told me you were
> going to do.
>
> Generally, using hopcounts for managing your documents is a bad practice;
> this is expensive to do and almost always yields unexpected results.
> You could have one job per seed, which means all you have to do to make the
> seed go away is delete the job corresponding to it.  If you have way too
> many seeds for that, you might want to combine this all into one job, but
> then you would need to link your documents somehow to the seed they came
> from, so that if the seed was no longer part of the job specification, it
> could always be detected as a deletion.  Unfortunately, this is
> inconsistent with MODEL_ADD_CHANGE_DELETE, because under that scheme you'd
> need to *detect* the deletion, because you wouldn't be told by the
> repository that somebody had changed the configuration.
>
> So two choices: (1) Exactly one seed per job, or (2) don't use
> MODEL_ADD_CHANGE_DELETE.
>
> Karl
>
>
> On Mon, May 27, 2019 at 4:38 PM Raman Gupta <ro...@gmail.com> wrote:
>
> > Thanks for your help Karl. So I think I'm converging on a design.
> > First of all, per your recommendation, I've switched to scheduled
> > crawl and it executes as expected every minute with the "schedule
> > window anytime" setting.
> >
> > My next problem is dealing with seed deletion. My upstream source
> > actually has multiple "roots" i.e. each root has its own set of
> > documents, and the delta API must be called once for each "root". To
> > deal with this, I'm specifying each "root" as  a "seed document", and
> > each such root/seed creates "contained_in" documents. It is also
> > possible for a "root" to be deleted by a user of the upstream system.
> >
> > My job is defined with an accurate hopcount as follows:
> >
> > "job": {
> >   ... snip naming, scheduling, output connectors, doc spec....
> >   "hopcount_mode" to "accurate"
> >   "hopcount" to json {
> >     "link_type" to "contained_in"
> >     "count" to 1
> >   },
> >
> > For each seed, in processDocuments I am doing:
> >
> > activities.addDocumentReference("... doc identifier ...",
> > seedDocumentIdentifier, "contained_in");
> >
> > and then this triggers processDocuments for each of those documents,
> > as expected.
> >
> > How do I code the connector such that I can now remove the documents
> > that are now unreachable due to the deleted seed? I don't see any
> > calls to `processDocuments` via the framework that would allow me to
> > do this.
> >
> > Regards,
> > Raman
> >
> >
> > On Fri, May 24, 2019 at 7:29 PM Karl Wright <da...@gmail.com> wrote:
> > >
> > > Hi Raman,
> > >
> > > (1) Continuous crawl is not a good model for you.  It's meant for
> > crawling
> > > large web domains, not the kind of task you are doing.
> > > (2) Scheduled crawl will work fine for you if you simply tell it "start
> > > within schedule window" and make sure your schedule completely covers
> > 7x24
> > > times.  So you can do this with one record, which triggers on every day
> > of
> > > the week, that has a schedule window of 24 hours.
> > >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <ro...@gmail.com>
> > wrote:
> > >
> > > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode
> > > > works, but given we have a delta API, we thought this is what makes
> > > > sense, as the delta API is efficient and we don't need to wait an
> > > > entire day for a scheduled job to run. I see that if I change recrawl
> > > > interval and max recrawl interval also to 1 minute, then my documents
> > > > do get processed each time. However, now we have the opposite problem:
> > > > now the documents are reprocessed every minute, regardless of whether
> > > > they were reseeded or not, which makes no sense to me. If I am using
> > > > MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method,
> > > > then why are the same documents being reprocessed over and over? I
> > > > have sent the output to the NullOutput using
> > > > `ingestDocumentWithException` and the status shows OK, and yet the
> > > > same documents are repeatedly sent to processDocuments.
> > > >
> > > > I just want to process the particular documents I specify on each
> > > > iteration every 60 seconds -- no more, no less, and yet I seem unable
> > > > to build a connector that does this.
> > > >
> > > > If I move to a non-contiguous mode, do I really have to create 1440
> > > > schedule objects, one for each minute of each day? The way the
> > > > schedule seems to be put together, I don't see a way to just schedule
> > > > every minute with one schedule. I would have expected schedules to
> > > > just use cron expressions.
> > > >
> > > > If I move to the design #2 in my OP and have one "virtual document" to
> > > > just avoid the seeding stage all-together, then is there some place
> > > > where I can store the delta token state? Or does my connector have to
> > > > create its own db table to store this?
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com>
> > wrote:
> > > > >
> > > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > > >
> > > > > You were saying that every minute a addSeedDocuments is being called,
> > > > > correct?  It sounds to me like you are running this job in continuous
> > > > crawl
> > > > > mode.  Can you try running the job in non-continuous mode, and just
> > > > > repeating the job run once it completes?
> > > > >
> > > > > The reason I ask is because continuous crawling has very unique
> > kinds of
> > > > > ways of dealing with documents it has crawled.  It uses "exponential
> > > > > backoff" to schedule the next document crawl and that is probably
> > why you
> > > > > see the documents in the queue but not being processed; you simply
> > > > haven't
> > > > > waited long enough.
> > > > >
> > > > > Karl
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <ro...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Here are my addSeedDocuments and processDocuments methods
> > simplifying
> > > > > > them down to the minimum necessary to show what is happening:
> > > > > >
> > > > > > @Override
> > > > > > public String addSeedDocuments(ISeedingActivity activities,
> > > > Specification
> > > > > > spec,
> > > > > >                                String lastSeedVersion, long
> > seedTime,
> > > > > > int jobMode)
> > > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > > {
> > > > > >   // return the same 3 docs every time, simulating an initial
> > load, and
> > > > > > then
> > > > > >   // these 3 docs changing constantly
> > > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > > >   activities.addSeedDocument("100");
> > > > > >   activities.addSeedDocument("110");
> > > > > >   activities.addSeedDocument("120");
> > > > > >   System.out.println("SEEDING DONE");
> > > > > >   return null;
> > > > > > }
> > > > > >
> > > > > > @Override
> > > > > > public void processDocuments(String[] documentIdentifiers,
> > > > > > IExistingVersions statuses, Specification spec,
> > > > > >                              IProcessActivity activities, int
> > jobMode,
> > > > > > boolean usesDefaultAuthority)
> > > > > >   throws ManifoldCFException, ServiceInterruption {
> > > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > > > > Arrays.deepToString(documentIdentifiers) );
> > > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > > >   //  activities.deleteDocument(documentIdentifier);
> > > > > >   //}
> > > > > >
> > > > > >   // I've commented out all subsequent logic here, but adding the
> > call
> > > > to
> > > > > >   // activities.ingestDocumentWithException(documentIdentifier,
> > > > > > version, documentUri, rd);
> > > > > >   // does not change anything
> > > > > > }
> > > > > >
> > > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > > > > MODEL_ADD_CHANGE, the output of this is:
> > > > > >
> > > > > > -=-= SeedTime=1558733436082
> > > > > > -=--=-= PROCESS DOCUMENTS: [200]
> > > > > > -=--=-= PROCESS DOCUMENTS: [220]
> > > > > > -=--=-= PROCESS DOCUMENTS: [210]
> > > > > > -=-= SeedTime=1558733549367
> > > > > > -=-= SeedTime=1558733609384
> > > > > > -=-= SeedTime=1558733436082
> > > > > > etc.
> > > > > >
> > > > > >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and
> > then
> > > > > > never again, even though "SEEDING DONE" is printing every minute.
> > If
> > > > > > and only if I uncomment the for loop which deletes the documents
> > does
> > > > > > "processDocuments" get called again for those seed document ids.
> > > > > >
> > > > > > I do note that the queue shows documents 100, 110, and 120 in state
> > > > > > "Waiting for processing", and nothing I do seems to affect that.
> > The
> > > > > > database update in JobQueue.updateExistingRecordInitial is a no-op
> > for
> > > > > > these docs, as the status of them is STATUS_PENDINGPURGATORY and
> > the
> > > > > > update does not actually change anything in the db.
> > > > > >
> > > > > > Regards,
> > > > > > Raman
> > > > > >
> > > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > For any given job run, all documents that are added via
> > > > > > addSeedDocuments()
> > > > > > > should be processed.  There is no magic in the framework that
> > somehow
> > > > > > knows
> > > > > > > that a document has been created vs. modified vs. deleted until
> > > > > > > processDocuments() is called.  If your claim is that this
> > contract
> > > > is not
> > > > > > > being honored, could you try changing your connector model to
> > > > > > > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to
> > > > work
> > > > > > > using that model.  If it does *not* then clearly you've got some
> > > > kind of
> > > > > > > implementation problem at the addSeedDocuments() level because
> > most
> > > > of
> > > > > > the
> > > > > > > Manifold connectors use that model.
> > > > > > >
> > > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is
> > to
> > > > figure
> > > > > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <
> > rocketraman@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <
> > daddywri@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments()
> > > > basically
> > > > > > says
> > > > > > > > > that you have to include *at least* the documents that were
> > > > changed,
> > > > > > > > added,
> > > > > > > > > or deleted since the previous stamp, and if no stamp is
> > > > provided, it
> > > > > > > > should
> > > > > > > > > return ALL specified documents.  Are you doing that?
> > > > > > > >
> > > > > > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > > > > > documents, and those are exactly the ones that we are
> > including.
> > > > > > > >
> > > > > > > > > If you are, the next thing to look at is the computation of
> > the
> > > > > > version
> > > > > > > > > string.  The version string is what is used to figure out if
> > a
> > > > change
> > > > > > > > took
> > > > > > > > > place.  You need this IN ADDITION TO the addSeedDocuments()
> > > > doing the
> > > > > > > > right
> > > > > > > > > thing.  For deleted documents, obviously the
> > processDocuments()
> > > > > > should
> > > > > > > > call
> > > > > > > > > the activities.deleteDocument() method.
> > > > > > > >
> > > > > > > > The version String is calculated by `processDocuments`. Since
> > after
> > > > > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > > > > `processDocuments` is never called again for that document,
> > even
> > > > > > > > though it has been modified to document A version 2.
> > Therefore, our
> > > > > > > > connector never gets a chance to return the "version 2" string.
> > > > > > > >
> > > > > > > > > Does this sound like what your code is doing?
> > > > > > > >
> > > > > > > > Yes, as far as we can go given the fact that
> > `processDocuments` is
> > > > > > > > only called once for any particular document identifier.
> > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <
> > > > rocketraman@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > My team is creating a new repository connector. The source
> > > > system
> > > > > > has
> > > > > > > > > > a delta API that lets us know of all new, modified, and
> > deleted
> > > > > > > > > > individual folders and documents since the last call to the
> > > > API.
> > > > > > Each
> > > > > > > > > > call to the delta API provides the changes, as well as a
> > token
> > > > > > which
> > > > > > > > > > can be provided on subsequent calls to get changes since
> > that
> > > > token
> > > > > > > > > > was generated/returned.
> > > > > > > > > >
> > > > > > > > > > What is the best approach to building a repo connector to a
> > > > system
> > > > > > > > > > that has this type of delta API?
> > > > > > > > > >
> > > > > > > > > > Our first design was an implementation that specifies
> > > > > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > > >
> > > > > > > > > > * In addSeedDocuments, on the initial call we seed every
> > > > document
> > > > > > in
> > > > > > > > > > the source system. On subsequent calls, we use the delta
> > API to
> > > > > > seed
> > > > > > > > > > every added, modified, or deleted file. We return the
> > delta API
> > > > > > token
> > > > > > > > > > as the version value of addSeedDocuments, so that it an be
> > > > used on
> > > > > > > > > > subsequent calls.
> > > > > > > > > >
> > > > > > > > > > * In processDocuments, we do the usual thing for each
> > document
> > > > > > > > identifier.
> > > > > > > > > >
> > > > > > > > > > On prototyping, this works for new docs, but
> > > > "processDocuments" is
> > > > > > > > > > never triggered for modified and deleted docs.
> > > > > > > > > >
> > > > > > > > > > A second design we are considering is to use
> > > > > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments
> > > > return
> > > > > > only
> > > > > > > > > > one "virtual" document, which represents the root of the
> > remote
> > > > > > repo.
> > > > > > > > > >
> > > > > > > > > > Then, in "processDocuments" the new "document" is used to
> > > > determine
> > > > > > > > > > all the child documents of that delta call, which are then
> > > > added to
> > > > > > > > > > the queue via `activities.addDocumentReference`. To force
> > the
> > > > > > "virtual
> > > > > > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > > > > > `addSeedDocuments`, we do
> > > > > > `activities.deleteDocument(virtualDocId)` as
> > > > > > > > > > well.
> > > > > > > > > >
> > > > > > > > > > With this alternative design, the stage 1 seed effectively
> > > > becomes
> > > > > > a
> > > > > > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Raman Gupta
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >

Re: Repository connector for source with delta API

Posted by Karl Wright <da...@gmail.com>.
This is very different from the design you originally told me you were
going to do.

Generally, using hopcounts for managing your documents is a bad practice;
it is expensive to do and almost always yields unexpected results.

You could have one job per seed, which means that all you have to do to make
a seed go away is delete the job corresponding to it.  If you have far too
many seeds for that, you might want to combine this all into one job, but
then you would need to link your documents somehow to the seed they came
from, so that a document whose seed is no longer part of the job
specification could always be detected as a deletion.  Unfortunately, that
is inconsistent with MODEL_ADD_CHANGE_DELETE: under that scheme you would
have to *detect* the deletion yourself, because the repository will not tell
you that somebody changed the configuration.

So two choices: (1) Exactly one seed per job, or (2) don't use
MODEL_ADD_CHANGE_DELETE.
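
To make the second choice concrete, here is a rough sketch of what "linking
documents to the seed they came from" could look like in processDocuments().
Everything here is illustrative: the "<seedId>/<path>" identifier scheme and
the isSeedStillConfigured() helper are assumptions, not existing API.

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
  Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
  throws ManifoldCFException, ServiceInterruption
{
  for (String documentIdentifier : documentIdentifiers)
  {
    // Assumes identifiers look like "<seedId>/<path>", so each document can be
    // traced back to the seed it came from.
    String seedId = documentIdentifier.split("/", 2)[0];
    if (!isSeedStillConfigured(seedId, spec))  // hypothetical helper reading the job spec
    {
      // The seed is gone from the job specification, so treat the document as deleted.
      activities.deleteDocument(documentIdentifier);
      continue;
    }
    // ... normal fetch / version / ingest logic for documents whose seed still exists ...
  }
}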

Karl


On Mon, May 27, 2019 at 4:38 PM Raman Gupta <ro...@gmail.com> wrote:

> Thanks for your help Karl. So I think I'm converging on a design.
> First of all, per your recommendation, I've switched to scheduled
> crawl and it executes as expected every minute with the "schedule
> window anytime" setting.
>
> My next problem is dealing with seed deletion. My upstream source
> actually has multiple "roots" i.e. each root has its own set of
> documents, and the delta API must be called once for each "root". To
> deal with this, I'm specifying each "root" as  a "seed document", and
> each such root/seed creates "contained_in" documents. It is also
> possible for a "root" to be deleted by a user of the upstream system.
>
> My job is defined with an accurate hopcount as follows:
>
> "job": {
>   ... snip naming, scheduling, output connectors, doc spec....
>   "hopcount_mode" to "accurate"
>   "hopcount" to json {
>     "link_type" to "contained_in"
>     "count" to 1
>   },
>
> For each seed, in processDocuments I am doing:
>
> activities.addDocumentReference("... doc identifier ...",
> seedDocumentIdentifier, "contained_in");
>
> and then this triggers processDocuments for each of those documents,
> as expected.
>
> How do I code the connector such that I can now remove the documents
> that are now unreachable due to the deleted seed? I don't see any
> calls to `processDocuments` via the framework that would allow me to
> do this.
>
> Regards,
> Raman
>
>
> On Fri, May 24, 2019 at 7:29 PM Karl Wright <da...@gmail.com> wrote:
> >
> > Hi Raman,
> >
> > (1) Continuous crawl is not a good model for you.  It's meant for
> crawling
> > large web domains, not the kind of task you are doing.
> > (2) Scheduled crawl will work fine for you if you simply tell it "start
> > within schedule window" and make sure your schedule completely covers
> 7x24
> > times.  So you can do this with one record, which triggers on every day
> of
> > the week, that has a schedule window of 24 hours.
> >
> > Karl
> >
> >
> > On Fri, May 24, 2019 at 7:12 PM Raman Gupta <ro...@gmail.com>
> wrote:
> >
> > > Yes, we are indeed running it in continuous crawl mode. Scheduled mode
> > > works, but given we have a delta API, we thought this is what makes
> > > sense, as the delta API is efficient and we don't need to wait an
> > > entire day for a scheduled job to run. I see that if I change recrawl
> > > interval and max recrawl interval also to 1 minute, then my documents
> > > do get processed each time. However, now we have the opposite problem:
> > > now the documents are reprocessed every minute, regardless of whether
> > > they were reseeded or not, which makes no sense to me. If I am using
> > > MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method,
> > > then why are the same documents being reprocessed over and over? I
> > > have sent the output to the NullOutput using
> > > `ingestDocumentWithException` and the status shows OK, and yet the
> > > same documents are repeatedly sent to processDocuments.
> > >
> > > I just want to process the particular documents I specify on each
> > > iteration every 60 seconds -- no more, no less, and yet I seem unable
> > > to build a connector that does this.
> > >
> > > If I move to a non-contiguous mode, do I really have to create 1440
> > > schedule objects, one for each minute of each day? The way the
> > > schedule seems to be put together, I don't see a way to just schedule
> > > every minute with one schedule. I would have expected schedules to
> > > just use cron expressions.
> > >
> > > If I move to the design #2 in my OP and have one "virtual document" to
> > > just avoid the seeding stage all-together, then is there some place
> > > where I can store the delta token state? Or does my connector have to
> > > create its own db table to store this?
> > >
> > > Regards,
> > > Raman
> > >
> > > On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com>
> wrote:
> > > >
> > > > So MODEL_ADD_CHANGE does not work for you, eh?
> > > >
> > > > You were saying that every minute a addSeedDocuments is being called,
> > > > correct?  It sounds to me like you are running this job in continuous
> > > crawl
> > > > mode.  Can you try running the job in non-continuous mode, and just
> > > > repeating the job run once it completes?
> > > >
> > > > The reason I ask is because continuous crawling has very unique
> kinds of
> > > > ways of dealing with documents it has crawled.  It uses "exponential
> > > > backoff" to schedule the next document crawl and that is probably
> why you
> > > > see the documents in the queue but not being processed; you simply
> > > haven't
> > > > waited long enough.
> > > >
> > > > Karl
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <ro...@gmail.com>
> > > wrote:
> > > >
> > > > > Here are my addSeedDocuments and processDocuments methods
> simplifying
> > > > > them down to the minimum necessary to show what is happening:
> > > > >
> > > > > @Override
> > > > > public String addSeedDocuments(ISeedingActivity activities,
> > > Specification
> > > > > spec,
> > > > >                                String lastSeedVersion, long
> seedTime,
> > > > > int jobMode)
> > > > >   throws ManifoldCFException, ServiceInterruption
> > > > > {
> > > > >   // return the same 3 docs every time, simulating an initial
> load, and
> > > > > then
> > > > >   // these 3 docs changing constantly
> > > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > > >   activities.addSeedDocument("100");
> > > > >   activities.addSeedDocument("110");
> > > > >   activities.addSeedDocument("120");
> > > > >   System.out.println("SEEDING DONE");
> > > > >   return null;
> > > > > }
> > > > >
> > > > > @Override
> > > > > public void processDocuments(String[] documentIdentifiers,
> > > > > IExistingVersions statuses, Specification spec,
> > > > >                              IProcessActivity activities, int
> jobMode,
> > > > > boolean usesDefaultAuthority)
> > > > >   throws ManifoldCFException, ServiceInterruption {
> > > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > > > Arrays.deepToString(documentIdentifiers) );
> > > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > > >   //  activities.deleteDocument(documentIdentifier);
> > > > >   //}
> > > > >
> > > > >   // I've commented out all subsequent logic here, but adding the
> call
> > > to
> > > > >   // activities.ingestDocumentWithException(documentIdentifier,
> > > > > version, documentUri, rd);
> > > > >   // does not change anything
> > > > > }
> > > > >
> > > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > > > MODEL_ADD_CHANGE, the output of this is:
> > > > >
> > > > > -=-= SeedTime=1558733436082
> > > > > -=--=-= PROCESS DOCUMENTS: [200]
> > > > > -=--=-= PROCESS DOCUMENTS: [220]
> > > > > -=--=-= PROCESS DOCUMENTS: [210]
> > > > > -=-= SeedTime=1558733549367
> > > > > -=-= SeedTime=1558733609384
> > > > > -=-= SeedTime=1558733436082
> > > > > etc.
> > > > >
> > > > >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and
> then
> > > > > never again, even though "SEEDING DONE" is printing every minute.
> If
> > > > > and only if I uncomment the for loop which deletes the documents
> does
> > > > > "processDocuments" get called again for those seed document ids.
> > > > >
> > > > > I do note that the queue shows documents 100, 110, and 120 in state
> > > > > "Waiting for processing", and nothing I do seems to affect that.
> The
> > > > > database update in JobQueue.updateExistingRecordInitial is a no-op
> for
> > > > > these docs, as the status of them is STATUS_PENDINGPURGATORY and
> the
> > > > > update does not actually change anything in the db.
> > > > >
> > > > > Regards,
> > > > > Raman
> > > > >
> > > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > For any given job run, all documents that are added via
> > > > > addSeedDocuments()
> > > > > > should be processed.  There is no magic in the framework that
> somehow
> > > > > knows
> > > > > > that a document has been created vs. modified vs. deleted until
> > > > > > processDocuments() is called.  If your claim is that this
> contract
> > > is not
> > > > > > being honored, could you try changing your connector model to
> > > > > > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to
> > > work
> > > > > > using that model.  If it does *not* then clearly you've got some
> > > kind of
> > > > > > implementation problem at the addSeedDocuments() level because
> most
> > > of
> > > > > the
> > > > > > Manifold connectors use that model.
> > > > > >
> > > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is
> to
> > > figure
> > > > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <
> rocketraman@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <
> daddywri@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments()
> > > basically
> > > > > says
> > > > > > > > that you have to include *at least* the documents that were
> > > changed,
> > > > > > > added,
> > > > > > > > or deleted since the previous stamp, and if no stamp is
> > > provided, it
> > > > > > > should
> > > > > > > > return ALL specified documents.  Are you doing that?
> > > > > > >
> > > > > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > > > > documents, and those are exactly the ones that we are
> including.
> > > > > > >
> > > > > > > > If you are, the next thing to look at is the computation of
> the
> > > > > version
> > > > > > > > string.  The version string is what is used to figure out if
> a
> > > change
> > > > > > > took
> > > > > > > > place.  You need this IN ADDITION TO the addSeedDocuments()
> > > doing the
> > > > > > > right
> > > > > > > > thing.  For deleted documents, obviously the
> processDocuments()
> > > > > should
> > > > > > > call
> > > > > > > > the activities.deleteDocument() method.
> > > > > > >
> > > > > > > The version String is calculated by `processDocuments`. Since
> after
> > > > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > > > `processDocuments` is never called again for that document,
> even
> > > > > > > though it has been modified to document A version 2.
> Therefore, our
> > > > > > > connector never gets a chance to return the "version 2" string.
> > > > > > >
> > > > > > > > Does this sound like what your code is doing?
> > > > > > >
> > > > > > > Yes, as far as we can go given the fact that
> `processDocuments` is
> > > > > > > only called once for any particular document identifier.
> > > > > > >
> > > > > > > > Karl
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <
> > > rocketraman@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > My team is creating a new repository connector. The source
> > > system
> > > > > has
> > > > > > > > > a delta API that lets us know of all new, modified, and
> deleted
> > > > > > > > > individual folders and documents since the last call to the
> > > API.
> > > > > Each
> > > > > > > > > call to the delta API provides the changes, as well as a
> token
> > > > > which
> > > > > > > > > can be provided on subsequent calls to get changes since
> that
> > > token
> > > > > > > > > was generated/returned.
> > > > > > > > >
> > > > > > > > > What is the best approach to building a repo connector to a
> > > system
> > > > > > > > > that has this type of delta API?
> > > > > > > > >
> > > > > > > > > Our first design was an implementation that specifies
> > > > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > > >
> > > > > > > > > * In addSeedDocuments, on the initial call we seed every
> > > document
> > > > > in
> > > > > > > > > the source system. On subsequent calls, we use the delta
> API to
> > > > > seed
> > > > > > > > > every added, modified, or deleted file. We return the
> delta API
> > > > > token
> > > > > > > > > as the version value of addSeedDocuments, so that it an be
> > > used on
> > > > > > > > > subsequent calls.
> > > > > > > > >
> > > > > > > > > * In processDocuments, we do the usual thing for each
> document
> > > > > > > identifier.
> > > > > > > > >
> > > > > > > > > On prototyping, this works for new docs, but
> > > "processDocuments" is
> > > > > > > > > never triggered for modified and deleted docs.
> > > > > > > > >
> > > > > > > > > A second design we are considering is to use
> > > > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments
> > > return
> > > > > only
> > > > > > > > > one "virtual" document, which represents the root of the
> remote
> > > > > repo.
> > > > > > > > >
> > > > > > > > > Then, in "processDocuments" the new "document" is used to
> > > determine
> > > > > > > > > all the child documents of that delta call, which are then
> > > added to
> > > > > > > > > the queue via `activities.addDocumentReference`. To force
> the
> > > > > "virtual
> > > > > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > > > > `addSeedDocuments`, we do
> > > > > `activities.deleteDocument(virtualDocId)` as
> > > > > > > > > well.
> > > > > > > > >
> > > > > > > > > With this alternative design, the stage 1 seed effectively
> > > becomes
> > > > > a
> > > > > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Raman Gupta
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

Re: Repository connector for source with delta API

Posted by Raman Gupta <ro...@gmail.com>.
Thanks for your help Karl. So I think I'm converging on a design.
First of all, per your recommendation, I've switched to scheduled
crawl and it executes as expected every minute with the "schedule
window anytime" setting.

My next problem is dealing with seed deletion. My upstream source
actually has multiple "roots" i.e. each root has its own set of
documents, and the delta API must be called once for each "root". To
deal with this, I'm specifying each "root" as a "seed document", and
each such root/seed creates "contained_in" documents. It is also
possible for a "root" to be deleted by a user of the upstream system.

My job is defined with an accurate hopcount as follows:

"job": {
  ... snip naming, scheduling, output connectors, doc spec....
  "hopcount_mode" to "accurate"
  "hopcount" to json {
    "link_type" to "contained_in"
    "count" to 1
  },

For each seed, in processDocuments I am doing:

activities.addDocumentReference("... doc identifier ...",
seedDocumentIdentifier, "contained_in");

and then this triggers processDocuments for each of those documents,
as expected.
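
For clarity, the seed-handling branch of processDocuments looks roughly like
the sketch below. DeltaClient/DeltaEntry are just placeholders for the real
source-system API, not actual classes:

// Sketch only: deltaClient and DeltaEntry stand in for the upstream delta API.
private void processRoot(String seedDocumentIdentifier, IProcessActivity activities)
  throws ManifoldCFException, ServiceInterruption
{
  // Ask the upstream system for everything that changed under this root.
  for (DeltaEntry entry : deltaClient.changesFor(seedDocumentIdentifier, savedToken))
  {
    // Queue each child under its root with the "contained_in" link type,
    // so the hopcount rule above ties it back to its seed.
    activities.addDocumentReference(entry.getDocumentIdentifier(),
      seedDocumentIdentifier, "contained_in");
  }
}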

How do I code the connector such that I can remove the documents
that are now unreachable due to the deleted seed? I don't see any
calls to `processDocuments` via the framework that would allow me to
do this.

Regards,
Raman


On Fri, May 24, 2019 at 7:29 PM Karl Wright <da...@gmail.com> wrote:
>
> Hi Raman,
>
> (1) Continuous crawl is not a good model for you.  It's meant for crawling
> large web domains, not the kind of task you are doing.
> (2) Scheduled crawl will work fine for you if you simply tell it "start
> within schedule window" and make sure your schedule completely covers 7x24
> times.  So you can do this with one record, which triggers on every day of
> the week, that has a schedule window of 24 hours.
>
> Karl
>
>
> On Fri, May 24, 2019 at 7:12 PM Raman Gupta <ro...@gmail.com> wrote:
>
> > Yes, we are indeed running it in continuous crawl mode. Scheduled mode
> > works, but given we have a delta API, we thought this is what makes
> > sense, as the delta API is efficient and we don't need to wait an
> > entire day for a scheduled job to run. I see that if I change recrawl
> > interval and max recrawl interval also to 1 minute, then my documents
> > do get processed each time. However, now we have the opposite problem:
> > now the documents are reprocessed every minute, regardless of whether
> > they were reseeded or not, which makes no sense to me. If I am using
> > MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method,
> > then why are the same documents being reprocessed over and over? I
> > have sent the output to the NullOutput using
> > `ingestDocumentWithException` and the status shows OK, and yet the
> > same documents are repeatedly sent to processDocuments.
> >
> > I just want to process the particular documents I specify on each
> > iteration every 60 seconds -- no more, no less, and yet I seem unable
> > to build a connector that does this.
> >
> > If I move to a non-contiguous mode, do I really have to create 1440
> > schedule objects, one for each minute of each day? The way the
> > schedule seems to be put together, I don't see a way to just schedule
> > every minute with one schedule. I would have expected schedules to
> > just use cron expressions.
> >
> > If I move to the design #2 in my OP and have one "virtual document" to
> > just avoid the seeding stage all-together, then is there some place
> > where I can store the delta token state? Or does my connector have to
> > create its own db table to store this?
> >
> > Regards,
> > Raman
> >
> > On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com> wrote:
> > >
> > > So MODEL_ADD_CHANGE does not work for you, eh?
> > >
> > > You were saying that every minute a addSeedDocuments is being called,
> > > correct?  It sounds to me like you are running this job in continuous
> > crawl
> > > mode.  Can you try running the job in non-continuous mode, and just
> > > repeating the job run once it completes?
> > >
> > > The reason I ask is because continuous crawling has very unique kinds of
> > > ways of dealing with documents it has crawled.  It uses "exponential
> > > backoff" to schedule the next document crawl and that is probably why you
> > > see the documents in the queue but not being processed; you simply
> > haven't
> > > waited long enough.
> > >
> > > Karl
> > >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <ro...@gmail.com>
> > wrote:
> > >
> > > > Here are my addSeedDocuments and processDocuments methods simplifying
> > > > them down to the minimum necessary to show what is happening:
> > > >
> > > > @Override
> > > > public String addSeedDocuments(ISeedingActivity activities,
> > Specification
> > > > spec,
> > > >                                String lastSeedVersion, long seedTime,
> > > > int jobMode)
> > > >   throws ManifoldCFException, ServiceInterruption
> > > > {
> > > >   // return the same 3 docs every time, simulating an initial load, and
> > > > then
> > > >   // these 3 docs changing constantly
> > > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > > >   activities.addSeedDocument("100");
> > > >   activities.addSeedDocument("110");
> > > >   activities.addSeedDocument("120");
> > > >   System.out.println("SEEDING DONE");
> > > >   return null;
> > > > }
> > > >
> > > > @Override
> > > > public void processDocuments(String[] documentIdentifiers,
> > > > IExistingVersions statuses, Specification spec,
> > > >                              IProcessActivity activities, int jobMode,
> > > > boolean usesDefaultAuthority)
> > > >   throws ManifoldCFException, ServiceInterruption {
> > > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > > Arrays.deepToString(documentIdentifiers) );
> > > >   // for (String documentIdentifier : documentIdentifiers) {
> > > >   //  activities.deleteDocument(documentIdentifier);
> > > >   //}
> > > >
> > > >   // I've commented out all subsequent logic here, but adding the call
> > to
> > > >   // activities.ingestDocumentWithException(documentIdentifier,
> > > > version, documentUri, rd);
> > > >   // does not change anything
> > > > }
> > > >
> > > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > > MODEL_ADD_CHANGE, the output of this is:
> > > >
> > > > -=-= SeedTime=1558733436082
> > > > -=--=-= PROCESS DOCUMENTS: [200]
> > > > -=--=-= PROCESS DOCUMENTS: [220]
> > > > -=--=-= PROCESS DOCUMENTS: [210]
> > > > -=-= SeedTime=1558733549367
> > > > -=-= SeedTime=1558733609384
> > > > -=-= SeedTime=1558733436082
> > > > etc.
> > > >
> > > >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> > > > never again, even though "SEEDING DONE" is printing every minute. If
> > > > and only if I uncomment the for loop which deletes the documents does
> > > > "processDocuments" get called again for those seed document ids.
> > > >
> > > > I do note that the queue shows documents 100, 110, and 120 in state
> > > > "Waiting for processing", and nothing I do seems to affect that. The
> > > > database update in JobQueue.updateExistingRecordInitial is a no-op for
> > > > these docs, as the status of them is STATUS_PENDINGPURGATORY and the
> > > > update does not actually change anything in the db.
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com>
> > wrote:
> > > > >
> > > > > For any given job run, all documents that are added via
> > > > addSeedDocuments()
> > > > > should be processed.  There is no magic in the framework that somehow
> > > > knows
> > > > > that a document has been created vs. modified vs. deleted until
> > > > > processDocuments() is called.  If your claim is that this contract
> > is not
> > > > > being honored, could you try changing your connector model to
> > > > > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to
> > work
> > > > > using that model.  If it does *not* then clearly you've got some
> > kind of
> > > > > implementation problem at the addSeedDocuments() level because most
> > of
> > > > the
> > > > > Manifold connectors use that model.
> > > > >
> > > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to
> > figure
> > > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <ro...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments()
> > basically
> > > > says
> > > > > > > that you have to include *at least* the documents that were
> > changed,
> > > > > > added,
> > > > > > > or deleted since the previous stamp, and if no stamp is
> > provided, it
> > > > > > should
> > > > > > > return ALL specified documents.  Are you doing that?
> > > > > >
> > > > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > > > documents, and those are exactly the ones that we are including.
> > > > > >
> > > > > > > If you are, the next thing to look at is the computation of the
> > > > version
> > > > > > > string.  The version string is what is used to figure out if a
> > change
> > > > > > took
> > > > > > > place.  You need this IN ADDITION TO the addSeedDocuments()
> > doing the
> > > > > > right
> > > > > > > thing.  For deleted documents, obviously the processDocuments()
> > > > should
> > > > > > call
> > > > > > > the activities.deleteDocument() method.
> > > > > >
> > > > > > The version String is calculated by `processDocuments`. Since after
> > > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > > `processDocuments` is never called again for that document, even
> > > > > > though it has been modified to document A version 2. Therefore, our
> > > > > > connector never gets a chance to return the "version 2" string.
> > > > > >
> > > > > > > Does this sound like what your code is doing?
> > > > > >
> > > > > > Yes, as far as we can go given the fact that `processDocuments` is
> > > > > > only called once for any particular document identifier.
> > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <
> > rocketraman@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > My team is creating a new repository connector. The source
> > system
> > > > has
> > > > > > > > a delta API that lets us know of all new, modified, and deleted
> > > > > > > > individual folders and documents since the last call to the
> > API.
> > > > Each
> > > > > > > > call to the delta API provides the changes, as well as a token
> > > > which
> > > > > > > > can be provided on subsequent calls to get changes since that
> > token
> > > > > > > > was generated/returned.
> > > > > > > >
> > > > > > > > What is the best approach to building a repo connector to a
> > system
> > > > > > > > that has this type of delta API?
> > > > > > > >
> > > > > > > > Our first design was an implementation that specifies
> > > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > > >
> > > > > > > > * In addSeedDocuments, on the initial call we seed every
> > document
> > > > in
> > > > > > > > the source system. On subsequent calls, we use the delta API to
> > > > seed
> > > > > > > > every added, modified, or deleted file. We return the delta API
> > > > token
> > > > > > > > as the version value of addSeedDocuments, so that it an be
> > used on
> > > > > > > > subsequent calls.
> > > > > > > >
> > > > > > > > * In processDocuments, we do the usual thing for each document
> > > > > > identifier.
> > > > > > > >
> > > > > > > > On prototyping, this works for new docs, but
> > "processDocuments" is
> > > > > > > > never triggered for modified and deleted docs.
> > > > > > > >
> > > > > > > > A second design we are considering is to use
> > > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments
> > return
> > > > only
> > > > > > > > one "virtual" document, which represents the root of the remote
> > > > repo.
> > > > > > > >
> > > > > > > > Then, in "processDocuments" the new "document" is used to
> > determine
> > > > > > > > all the child documents of that delta call, which are then
> > added to
> > > > > > > > the queue via `activities.addDocumentReference`. To force the
> > > > "virtual
> > > > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > > > `addSeedDocuments`, we do
> > > > `activities.deleteDocument(virtualDocId)` as
> > > > > > > > well.
> > > > > > > >
> > > > > > > > With this alternative design, the stage 1 seed effectively
> > becomes
> > > > a
> > > > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Raman Gupta
> > > > > > > >
> > > > > >
> > > >
> >

Re: Repository connector for source with delta API

Posted by Karl Wright <da...@gmail.com>.
Hi Raman,

(1) Continuous crawl is not a good model for you.  It's meant for crawling
large web domains, not the kind of task you are doing.
(2) Scheduled crawl will work fine for you if you simply tell it "start
within schedule window" and make sure your schedule completely covers all
24x7 hours.  You can do this with one schedule record that triggers on every
day of the week and has a schedule window of 24 hours.

Karl


On Fri, May 24, 2019 at 7:12 PM Raman Gupta <ro...@gmail.com> wrote:

> Yes, we are indeed running it in continuous crawl mode. Scheduled mode
> works, but given we have a delta API, we thought this is what makes
> sense, as the delta API is efficient and we don't need to wait an
> entire day for a scheduled job to run. I see that if I change recrawl
> interval and max recrawl interval also to 1 minute, then my documents
> do get processed each time. However, now we have the opposite problem:
> now the documents are reprocessed every minute, regardless of whether
> they were reseeded or not, which makes no sense to me. If I am using
> MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method,
> then why are the same documents being reprocessed over and over? I
> have sent the output to the NullOutput using
> `ingestDocumentWithException` and the status shows OK, and yet the
> same documents are repeatedly sent to processDocuments.
>
> I just want to process the particular documents I specify on each
> iteration every 60 seconds -- no more, no less, and yet I seem unable
> to build a connector that does this.
>
> If I move to a non-contiguous mode, do I really have to create 1440
> schedule objects, one for each minute of each day? The way the
> schedule seems to be put together, I don't see a way to just schedule
> every minute with one schedule. I would have expected schedules to
> just use cron expressions.
>
> If I move to the design #2 in my OP and have one "virtual document" to
> just avoid the seeding stage all-together, then is there some place
> where I can store the delta token state? Or does my connector have to
> create its own db table to store this?
>
> Regards,
> Raman
>
> On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com> wrote:
> >
> > So MODEL_ADD_CHANGE does not work for you, eh?
> >
> > You were saying that every minute a addSeedDocuments is being called,
> > correct?  It sounds to me like you are running this job in continuous
> crawl
> > mode.  Can you try running the job in non-continuous mode, and just
> > repeating the job run once it completes?
> >
> > The reason I ask is because continuous crawling has very unique kinds of
> > ways of dealing with documents it has crawled.  It uses "exponential
> > backoff" to schedule the next document crawl and that is probably why you
> > see the documents in the queue but not being processed; you simply
> haven't
> > waited long enough.
> >
> > Karl
> >
> > Karl
> >
> >
> > On Fri, May 24, 2019 at 5:36 PM Raman Gupta <ro...@gmail.com>
> wrote:
> >
> > > Here are my addSeedDocuments and processDocuments methods simplifying
> > > them down to the minimum necessary to show what is happening:
> > >
> > > @Override
> > > public String addSeedDocuments(ISeedingActivity activities,
> Specification
> > > spec,
> > >                                String lastSeedVersion, long seedTime,
> > > int jobMode)
> > >   throws ManifoldCFException, ServiceInterruption
> > > {
> > >   // return the same 3 docs every time, simulating an initial load, and
> > > then
> > >   // these 3 docs changing constantly
> > >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> > >   activities.addSeedDocument("100");
> > >   activities.addSeedDocument("110");
> > >   activities.addSeedDocument("120");
> > >   System.out.println("SEEDING DONE");
> > >   return null;
> > > }
> > >
> > > @Override
> > > public void processDocuments(String[] documentIdentifiers,
> > > IExistingVersions statuses, Specification spec,
> > >                              IProcessActivity activities, int jobMode,
> > > boolean usesDefaultAuthority)
> > >   throws ManifoldCFException, ServiceInterruption {
> > >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > > Arrays.deepToString(documentIdentifiers) );
> > >   // for (String documentIdentifier : documentIdentifiers) {
> > >   //  activities.deleteDocument(documentIdentifier);
> > >   //}
> > >
> > >   // I've commented out all subsequent logic here, but adding the call
> to
> > >   // activities.ingestDocumentWithException(documentIdentifier,
> > > version, documentUri, rd);
> > >   // does not change anything
> > > }
> > >
> > > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > > MODEL_ADD_CHANGE, the output of this is:
> > >
> > > -=-= SeedTime=1558733436082
> > > -=--=-= PROCESS DOCUMENTS: [200]
> > > -=--=-= PROCESS DOCUMENTS: [220]
> > > -=--=-= PROCESS DOCUMENTS: [210]
> > > -=-= SeedTime=1558733549367
> > > -=-= SeedTime=1558733609384
> > > -=-= SeedTime=1558733436082
> > > etc.
> > >
> > >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> > > never again, even though "SEEDING DONE" is printing every minute. If
> > > and only if I uncomment the for loop which deletes the documents does
> > > "processDocuments" get called again for those seed document ids.
> > >
> > > I do note that the queue shows documents 100, 110, and 120 in state
> > > "Waiting for processing", and nothing I do seems to affect that. The
> > > database update in JobQueue.updateExistingRecordInitial is a no-op for
> > > these docs, as the status of them is STATUS_PENDINGPURGATORY and the
> > > update does not actually change anything in the db.
> > >
> > > Regards,
> > > Raman
> > >
> > > On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com>
> wrote:
> > > >
> > > > For any given job run, all documents that are added via
> > > addSeedDocuments()
> > > > should be processed.  There is no magic in the framework that somehow
> > > knows
> > > > that a document has been created vs. modified vs. deleted until
> > > > processDocuments() is called.  If your claim is that this contract
> is not
> > > > being honored, could you try changing your connector model to
> > > > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to
> work
> > > > using that model.  If it does *not* then clearly you've got some
> kind of
> > > > implementation problem at the addSeedDocuments() level because most
> of
> > > the
> > > > Manifold connectors use that model.
> > > >
> > > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to
> figure
> > > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <ro...@gmail.com>
> > > wrote:
> > > >
> > > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments()
> basically
> > > says
> > > > > > that you have to include *at least* the documents that were
> changed,
> > > > > added,
> > > > > > or deleted since the previous stamp, and if no stamp is
> provided, it
> > > > > should
> > > > > > return ALL specified documents.  Are you doing that?
> > > > >
> > > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > > documents, and those are exactly the ones that we are including.
> > > > >
> > > > > > If you are, the next thing to look at is the computation of the
> > > version
> > > > > > string.  The version string is what is used to figure out if a
> change
> > > > > took
> > > > > > place.  You need this IN ADDITION TO the addSeedDocuments()
> doing the
> > > > > right
> > > > > > thing.  For deleted documents, obviously the processDocuments()
> > > should
> > > > > call
> > > > > > the activities.deleteDocument() method.
> > > > >
> > > > > The version String is calculated by `processDocuments`. Since after
> > > > > calling `addSeedDocuments` once for document A version 1,
> > > > > `processDocuments` is never called again for that document, even
> > > > > though it has been modified to document A version 2. Therefore, our
> > > > > connector never gets a chance to return the "version 2" string.
> > > > >
> > > > > > Does this sound like what your code is doing?
> > > > >
> > > > > Yes, as far as we can go given the fact that `processDocuments` is
> > > > > only called once for any particular document identifier.
> > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <
> rocketraman@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > My team is creating a new repository connector. The source
> system
> > > has
> > > > > > > a delta API that lets us know of all new, modified, and deleted
> > > > > > > individual folders and documents since the last call to the
> API.
> > > Each
> > > > > > > call to the delta API provides the changes, as well as a token
> > > which
> > > > > > > can be provided on subsequent calls to get changes since that
> token
> > > > > > > was generated/returned.
> > > > > > >
> > > > > > > What is the best approach to building a repo connector to a
> system
> > > > > > > that has this type of delta API?
> > > > > > >
> > > > > > > Our first design was an implementation that specifies
> > > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > > >
> > > > > > > * In addSeedDocuments, on the initial call we seed every
> document
> > > in
> > > > > > > the source system. On subsequent calls, we use the delta API to
> > > seed
> > > > > > > every added, modified, or deleted file. We return the delta API
> > > token
> > > > > > > as the version value of addSeedDocuments, so that it an be
> used on
> > > > > > > subsequent calls.
> > > > > > >
> > > > > > > * In processDocuments, we do the usual thing for each document
> > > > > identifier.
> > > > > > >
> > > > > > > On prototyping, this works for new docs, but
> "processDocuments" is
> > > > > > > never triggered for modified and deleted docs.
> > > > > > >
> > > > > > > A second design we are considering is to use
> > > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments
> return
> > > only
> > > > > > > one "virtual" document, which represents the root of the remote
> > > repo.
> > > > > > >
> > > > > > > Then, in "processDocuments" the new "document" is used to
> determine
> > > > > > > all the child documents of that delta call, which are then
> added to
> > > > > > > the queue via `activities.addDocumentReference`. To force the
> > > "virtual
> > > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > > `addSeedDocuments`, we do
> > > `activities.deleteDocument(virtualDocId)` as
> > > > > > > well.
> > > > > > >
> > > > > > > With this alternative design, the stage 1 seed effectively
> becomes
> > > a
> > > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Raman Gupta
> > > > > > >
> > > > >
> > >
>

Re: Repository connector for source with delta API

Posted by Raman Gupta <ro...@gmail.com>.
Yes, we are indeed running it in continuous crawl mode. Scheduled mode
works, but given that we have a delta API, we thought continuous mode made
more sense: the delta API is efficient, and we don't need to wait an
entire day for a scheduled job to run. I see that if I change the recrawl
interval and max recrawl interval also to 1 minute, then my documents
do get processed each time. However, now we have the opposite problem:
now the documents are reprocessed every minute, regardless of whether
they were reseeded or not, which makes no sense to me. If I am using
MODEL_ADD_CHANGE_DELETE and not returning anything in my seed method,
then why are the same documents being reprocessed over and over? I
have sent the output to the NullOutput using
`ingestDocumentWithException` and the status shows OK, and yet the
same documents are repeatedly sent to processDocuments.

I just want to process the particular documents I specify on each
iteration every 60 seconds -- no more, no less, and yet I seem unable
to build a connector that does this.

If I move to a non-continuous mode, do I really have to create 1440
schedule objects, one for each minute of each day? The way the
schedule seems to be put together, I don't see a way to just schedule
every minute with one schedule. I would have expected schedules to
just use cron expressions.

If I move to design #2 in my OP and have one "virtual document" just to
avoid the seeding stage altogether, then is there some place
where I can store the delta token state? Or does my connector have to
create its own db table to store this?
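
(For the token itself, one option that avoids a new table -- whether we keep
real seeds or use the virtual one -- is to keep returning it from
addSeedDocuments, since whatever that method returns is handed back as
lastSeedVersion on the next invocation. A rough sketch, with DeltaClient
standing in for the real upstream API:)

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
  String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  if (lastSeedVersion == null || lastSeedVersion.length() == 0)
  {
    // First run ever: seed the full corpus and start a delta cursor.
    for (String id : deltaClient.listAllDocumentIds())   // placeholder call
      activities.addSeedDocument(id);
    return deltaClient.currentToken();                   // placeholder call
  }
  // Later runs: seed only what the delta API reports since the stored token.
  DeltaResult delta = deltaClient.changesSince(lastSeedVersion);  // placeholder call
  for (String id : delta.changedAddedOrDeletedIds())
    activities.addSeedDocument(id);
  return delta.nextToken();
}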

Regards,
Raman

On Fri, May 24, 2019 at 6:18 PM Karl Wright <da...@gmail.com> wrote:
>
> So MODEL_ADD_CHANGE does not work for you, eh?
>
> You were saying that every minute a addSeedDocuments is being called,
> correct?  It sounds to me like you are running this job in continuous crawl
> mode.  Can you try running the job in non-continuous mode, and just
> repeating the job run once it completes?
>
> The reason I ask is because continuous crawling has very unique kinds of
> ways of dealing with documents it has crawled.  It uses "exponential
> backoff" to schedule the next document crawl and that is probably why you
> see the documents in the queue but not being processed; you simply haven't
> waited long enough.
>
> Karl
>
> Karl
>
>
> On Fri, May 24, 2019 at 5:36 PM Raman Gupta <ro...@gmail.com> wrote:
>
> > Here are my addSeedDocuments and processDocuments methods simplifying
> > them down to the minimum necessary to show what is happening:
> >
> > @Override
> > public String addSeedDocuments(ISeedingActivity activities, Specification
> > spec,
> >                                String lastSeedVersion, long seedTime,
> > int jobMode)
> >   throws ManifoldCFException, ServiceInterruption
> > {
> >   // return the same 3 docs every time, simulating an initial load, and
> > then
> >   // these 3 docs changing constantly
> >   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
> >   activities.addSeedDocument("100");
> >   activities.addSeedDocument("110");
> >   activities.addSeedDocument("120");
> >   System.out.println("SEEDING DONE");
> >   return null;
> > }
> >
> > @Override
> > public void processDocuments(String[] documentIdentifiers,
> > IExistingVersions statuses, Specification spec,
> >                              IProcessActivity activities, int jobMode,
> > boolean usesDefaultAuthority)
> >   throws ManifoldCFException, ServiceInterruption {
> >   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> > Arrays.deepToString(documentIdentifiers) );
> >   // for (String documentIdentifier : documentIdentifiers) {
> >   //  activities.deleteDocument(documentIdentifier);
> >   //}
> >
> >   // I've commented out all subsequent logic here, but adding the call to
> >   // activities.ingestDocumentWithException(documentIdentifier,
> > version, documentUri, rd);
> >   // does not change anything
> > }
> >
> > When I run this code with MODEL_ADD_CHANGE_DELETE or with
> > MODEL_ADD_CHANGE, the output of this is:
> >
> > -=-= SeedTime=1558733436082
> > -=--=-= PROCESS DOCUMENTS: [200]
> > -=--=-= PROCESS DOCUMENTS: [220]
> > -=--=-= PROCESS DOCUMENTS: [210]
> > -=-= SeedTime=1558733549367
> > -=-= SeedTime=1558733609384
> > -=-= SeedTime=1558733436082
> > etc.
> >
> >  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> > never again, even though "SEEDING DONE" is printing every minute. If
> > and only if I uncomment the for loop which deletes the documents does
> > "processDocuments" get called again for those seed document ids.
> >
> > I do note that the queue shows documents 100, 110, and 120 in state
> > "Waiting for processing", and nothing I do seems to affect that. The
> > database update in JobQueue.updateExistingRecordInitial is a no-op for
> > these docs, as the status of them is STATUS_PENDINGPURGATORY and the
> > update does not actually change anything in the db.
> >
> > Regards,
> > Raman
> >
> > On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com> wrote:
> > >
> > > For any given job run, all documents that are added via
> > addSeedDocuments()
> > > should be processed.  There is no magic in the framework that somehow
> > knows
> > > that a document has been created vs. modified vs. deleted until
> > > processDocuments() is called.  If your claim is that this contract is not
> > > being honored, could you try changing your connector model to
> > > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
> > > using that model.  If it does *not* then clearly you've got some kind of
> > > implementation problem at the addSeedDocuments() level because most of
> > the
> > > Manifold connectors use that model.
> > >
> > > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> > > out why MODEL_ADD_CHANGE_DELETE is failing.
> > >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <ro...@gmail.com>
> > wrote:
> > >
> > > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com>
> > wrote:
> > > > >
> > > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically
> > says
> > > > > that you have to include *at least* the documents that were changed,
> > > > added,
> > > > > or deleted since the previous stamp, and if no stamp is provided, it
> > > > should
> > > > > return ALL specified documents.  Are you doing that?
> > > >
> > > > Yes, the delta API gives us all the changed, added, and deleted
> > > > documents, and those are exactly the ones that we are including.
> > > >
> > > > > If you are, the next thing to look at is the computation of the
> > version
> > > > > string.  The version string is what is used to figure out if a change
> > > > took
> > > > > place.  You need this IN ADDITION TO the addSeedDocuments() doing the
> > > > right
> > > > > thing.  For deleted documents, obviously the processDocuments()
> > should
> > > > call
> > > > > the activities.deleteDocument() method.
> > > >
> > > > The version String is calculated by `processDocuments`. Since after
> > > > calling `addSeedDocuments` once for document A version 1,
> > > > `processDocuments` is never called again for that document, even
> > > > though it has been modified to document A version 2. Therefore, our
> > > > connector never gets a chance to return the "version 2" string.
> > > >
> > > > > Does this sound like what your code is doing?
> > > >
> > > > Yes, as far as we can go given the fact that `processDocuments` is
> > > > only called once for any particular document identifier.
> > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <ro...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > My team is creating a new repository connector. The source system
> > has
> > > > > > a delta API that lets us know of all new, modified, and deleted
> > > > > > individual folders and documents since the last call to the API.
> > Each
> > > > > > call to the delta API provides the changes, as well as a token
> > which
> > > > > > can be provided on subsequent calls to get changes since that token
> > > > > > was generated/returned.
> > > > > >
> > > > > > What is the best approach to building a repo connector to a system
> > > > > > that has this type of delta API?
> > > > > >
> > > > > > Our first design was an implementation that specifies
> > > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > > >
> > > > > > * In addSeedDocuments, on the initial call we seed every document
> > in
> > > > > > the source system. On subsequent calls, we use the delta API to
> > seed
> > > > > > every added, modified, or deleted file. We return the delta API
> > token
> > > > > > as the version value of addSeedDocuments, so that it an be used on
> > > > > > subsequent calls.
> > > > > >
> > > > > > * In processDocuments, we do the usual thing for each document
> > > > identifier.
> > > > > >
> > > > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > > > never triggered for modified and deleted docs.
> > > > > >
> > > > > > A second design we are considering is to use
> > > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return
> > only
> > > > > > one "virtual" document, which represents the root of the remote
> > repo.
> > > > > >
> > > > > > Then, in "processDocuments" the new "document" is used to determine
> > > > > > all the child documents of that delta call, which are then added to
> > > > > > the queue via `activities.addDocumentReference`. To force the
> > "virtual
> > > > > > seed" to trigger processDocuments again on the next call to
> > > > > > `addSeedDocuments`, we do
> > `activities.deleteDocument(virtualDocId)` as
> > > > > > well.
> > > > > >
> > > > > > With this alternative design, the stage 1 seed effectively becomes
> > a
> > > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Regards,
> > > > > > Raman Gupta
> > > > > >
> > > >
> >

Re: Repository connector for source with delta API

Posted by Karl Wright <da...@gmail.com>.
So MODEL_ADD_CHANGE does not work for you, eh?

You were saying that addSeedDocuments is being called every minute,
correct?  It sounds to me like you are running this job in continuous crawl
mode.  Can you try running the job in non-continuous mode, and just
repeating the job run once it completes?

The reason I ask is that continuous crawling has its own way of dealing
with documents it has crawled.  It uses "exponential
backoff" to schedule the next document crawl and that is probably why you
see the documents in the queue but not being processed; you simply haven't
waited long enough.

Karl


On Fri, May 24, 2019 at 5:36 PM Raman Gupta <ro...@gmail.com> wrote:

> Here are my addSeedDocuments and processDocuments methods simplifying
> them down to the minimum necessary to show what is happening:
>
> @Override
> public String addSeedDocuments(ISeedingActivity activities, Specification
> spec,
>                                String lastSeedVersion, long seedTime,
> int jobMode)
>   throws ManifoldCFException, ServiceInterruption
> {
>   // return the same 3 docs every time, simulating an initial load, and
> then
>   // these 3 docs changing constantly
>   System.out.println(String.format("-=-= SeedTime=%s", seedTime));
>   activities.addSeedDocument("100");
>   activities.addSeedDocument("110");
>   activities.addSeedDocument("120");
>   System.out.println("SEEDING DONE");
>   return null;
> }
>
> @Override
> public void processDocuments(String[] documentIdentifiers,
> IExistingVersions statuses, Specification spec,
>                              IProcessActivity activities, int jobMode,
> boolean usesDefaultAuthority)
>   throws ManifoldCFException, ServiceInterruption {
>   System.out.println("-=--=-= PROCESS DOCUMENTS: " +
> Arrays.deepToString(documentIdentifiers) );
>   // for (String documentIdentifier : documentIdentifiers) {
>   //  activities.deleteDocument(documentIdentifier);
>   //}
>
>   // I've commented out all subsequent logic here, but adding the call to
>   // activities.ingestDocumentWithException(documentIdentifier,
> version, documentUri, rd);
>   // does not change anything
> }
>
> When I run this code with MODEL_ADD_CHANGE_DELETE or with
> MODEL_ADD_CHANGE, the output of this is:
>
> -=-= SeedTime=1558733436082
> -=--=-= PROCESS DOCUMENTS: [200]
> -=--=-= PROCESS DOCUMENTS: [220]
> -=--=-= PROCESS DOCUMENTS: [210]
> -=-= SeedTime=1558733549367
> -=-= SeedTime=1558733609384
> -=-= SeedTime=1558733436082
> etc.
>
>  "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
> never again, even though "SEEDING DONE" is printing every minute. If
> and only if I uncomment the for loop which deletes the documents does
> "processDocuments" get called again for those seed document ids.
>
> I do note that the queue shows documents 100, 110, and 120 in state
> "Waiting for processing", and nothing I do seems to affect that. The
> database update in JobQueue.updateExistingRecordInitial is a no-op for
> these docs, as the status of them is STATUS_PENDINGPURGATORY and the
> update does not actually change anything in the db.
>
> Regards,
> Raman
>
> On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com> wrote:
> >
> > For any given job run, all documents that are added via
> addSeedDocuments()
> > should be processed.  There is no magic in the framework that somehow
> knows
> > that a document has been created vs. modified vs. deleted until
> > processDocuments() is called.  If your claim is that this contract is not
> > being honored, could you try changing your connector model to
> > MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
> > using that model.  If it does *not* then clearly you've got some kind of
> > implementation problem at the addSeedDocuments() level because most of
> the
> > Manifold connectors use that model.
> >
> > If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> > out why MODEL_ADD_CHANGE_DELETE is failing.
> >
> > Karl
> >
> >
> > On Fri, May 24, 2019 at 5:06 PM Raman Gupta <ro...@gmail.com>
> wrote:
> >
> > > On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com>
> wrote:
> > > >
> > > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically
> says
> > > > that you have to include *at least* the documents that were changed,
> > > added,
> > > > or deleted since the previous stamp, and if no stamp is provided, it
> > > should
> > > > return ALL specified documents.  Are you doing that?
> > >
> > > Yes, the delta API gives us all the changed, added, and deleted
> > > documents, and those are exactly the ones that we are including.
> > >
> > > > If you are, the next thing to look at is the computation of the
> version
> > > > string.  The version string is what is used to figure out if a change
> > > took
> > > > place.  You need this IN ADDITION TO the addSeedDocuments() doing the
> > > right
> > > > thing.  For deleted documents, obviously the processDocuments()
> should
> > > call
> > > > the activities.deleteDocument() method.
> > >
> > > The version String is calculated by `processDocuments`. Since after
> > > calling `addSeedDocuments` once for document A version 1,
> > > `processDocuments` is never called again for that document, even
> > > though it has been modified to document A version 2. Therefore, our
> > > connector never gets a chance to return the "version 2" string.
> > >
> > > > Does this sound like what your code is doing?
> > >
> > > Yes, as far as we can go given the fact that `processDocuments` is
> > > only called once for any particular document identifier.
> > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <ro...@gmail.com>
> > > wrote:
> > > >
> > > > > My team is creating a new repository connector. The source system
> has
> > > > > a delta API that lets us know of all new, modified, and deleted
> > > > > individual folders and documents since the last call to the API.
> Each
> > > > > call to the delta API provides the changes, as well as a token
> which
> > > > > can be provided on subsequent calls to get changes since that token
> > > > > was generated/returned.
> > > > >
> > > > > What is the best approach to building a repo connector to a system
> > > > > that has this type of delta API?
> > > > >
> > > > > Our first design was an implementation that specifies
> > > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > > >
> > > > > * In addSeedDocuments, on the initial call we seed every document
> in
> > > > > the source system. On subsequent calls, we use the delta API to
> seed
> > > > > every added, modified, or deleted file. We return the delta API
> token
> > > > > as the version value of addSeedDocuments, so that it an be used on
> > > > > subsequent calls.
> > > > >
> > > > > * In processDocuments, we do the usual thing for each document
> > > identifier.
> > > > >
> > > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > > never triggered for modified and deleted docs.
> > > > >
> > > > > A second design we are considering is to use
> > > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return
> only
> > > > > one "virtual" document, which represents the root of the remote
> repo.
> > > > >
> > > > > Then, in "processDocuments" the new "document" is used to determine
> > > > > all the child documents of that delta call, which are then added to
> > > > > the queue via `activities.addDocumentReference`. To force the
> "virtual
> > > > > seed" to trigger processDocuments again on the next call to
> > > > > `addSeedDocuments`, we do
> `activities.deleteDocument(virtualDocId)` as
> > > > > well.
> > > > >
> > > > > With this alternative design, the stage 1 seed effectively becomes
> a
> > > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Regards,
> > > > > Raman Gupta
> > > > >
> > >
>

Re: Repository connector for source with with delta API

Posted by Raman Gupta <ro...@gmail.com>.
Here are my addSeedDocuments and processDocuments methods simplifying
them down to the minimum necessary to show what is happening:

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  // Return the same 3 docs every time, simulating an initial load and then
  // these 3 docs changing constantly.
  System.out.println(String.format("-=-= SeedTime=%s", seedTime));
  activities.addSeedDocument("100");
  activities.addSeedDocument("110");
  activities.addSeedDocument("120");
  System.out.println("SEEDING DONE");
  return null;
}

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                             Specification spec, IProcessActivity activities,
                             int jobMode, boolean usesDefaultAuthority)
  throws ManifoldCFException, ServiceInterruption
{
  System.out.println("-=--=-= PROCESS DOCUMENTS: " +
    Arrays.deepToString(documentIdentifiers));
  // for (String documentIdentifier : documentIdentifiers) {
  //   activities.deleteDocument(documentIdentifier);
  // }

  // I've commented out all subsequent logic here, but adding the call to
  //   activities.ingestDocumentWithException(documentIdentifier, version, documentUri, rd);
  // does not change anything.
}

When I run this code with MODEL_ADD_CHANGE_DELETE or with
MODEL_ADD_CHANGE, the output of this is:

-=-= SeedTime=1558733436082
-=--=-= PROCESS DOCUMENTS: [200]
-=--=-= PROCESS DOCUMENTS: [220]
-=--=-= PROCESS DOCUMENTS: [210]
-=-= SeedTime=1558733549367
-=-= SeedTime=1558733609384
-=-= SeedTime=1558733436082
etc.

 "PROCESS DOCUMENTS: [100, 110, 120]" output is shown once, and then
never again, even though "SEEDING DONE" is printing every minute. If
and only if I uncomment the for loop which deletes the documents does
"processDocuments" get called again for those seed document ids.

I do note that the queue shows documents 100, 110, and 120 in the state
"Waiting for processing", and nothing I do seems to affect that. The
database update in JobQueue.updateExistingRecordInitial is a no-op for
these docs, as their status is STATUS_PENDINGPURGATORY and the
update does not actually change anything in the db.

Regards,
Raman

On Fri, May 24, 2019 at 5:13 PM Karl Wright <da...@gmail.com> wrote:
>
> For any given job run, all documents that are added via addSeedDocuments()
> should be processed.  There is no magic in the framework that somehow knows
> that a document has been created vs. modified vs. deleted until
> processDocuments() is called.  If your claim is that this contract is not
> being honored, could you try changing your connector model to
> MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
> using that model.  If it does *not* then clearly you've got some kind of
> implementation problem at the addSeedDocuments() level because most of the
> Manifold connectors use that model.
>
> If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
> out why MODEL_ADD_CHANGE_DELETE is failing.
>
> Karl
>
>
> On Fri, May 24, 2019 at 5:06 PM Raman Gupta <ro...@gmail.com> wrote:
>
> > On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com> wrote:
> > >
> > > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> > > that you have to include *at least* the documents that were changed,
> > added,
> > > or deleted since the previous stamp, and if no stamp is provided, it
> > should
> > > return ALL specified documents.  Are you doing that?
> >
> > Yes, the delta API gives us all the changed, added, and deleted
> > documents, and those are exactly the ones that we are including.
> >
> > > If you are, the next thing to look at is the computation of the version
> > > string.  The version string is what is used to figure out if a change
> > took
> > > place.  You need this IN ADDITION TO the addSeedDocuments() doing the
> > right
> > > thing.  For deleted documents, obviously the processDocuments() should
> > call
> > > the activities.deleteDocument() method.
> >
> > The version String is calculated by `processDocuments`. After
> > calling `addSeedDocuments` once for document A version 1,
> > `processDocuments` is never called again for that document, even
> > though it has been modified to document A version 2. Therefore, our
> > connector never gets a chance to return the "version 2" string.
> >
> > > Does this sound like what your code is doing?
> >
> > Yes, as far as we can go given the fact that `processDocuments` is
> > only called once for any particular document identifier.
> >
> > > Karl
> > >
> > >
> > > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <ro...@gmail.com>
> > wrote:
> > >
> > > > My team is creating a new repository connector. The source system has
> > > > a delta API that lets us know of all new, modified, and deleted
> > > > individual folders and documents since the last call to the API. Each
> > > > call to the delta API provides the changes, as well as a token which
> > > > can be provided on subsequent calls to get changes since that token
> > > > was generated/returned.
> > > >
> > > > What is the best approach to building a repo connector to a system
> > > > that has this type of delta API?
> > > >
> > > > Our first design was an implementation that specifies
> > > > `MODEL_ADD_CHANGE_DELETE` and then:
> > > >
> > > > * In addSeedDocuments, on the initial call we seed every document in
> > > > the source system. On subsequent calls, we use the delta API to seed
> > > > every added, modified, or deleted file. We return the delta API token
> > > > as the version value of addSeedDocuments, so that it can be used on
> > > > subsequent calls.
> > > >
> > > > * In processDocuments, we do the usual thing for each document
> > identifier.
> > > >
> > > > On prototyping, this works for new docs, but "processDocuments" is
> > > > never triggered for modified and deleted docs.
> > > >
> > > > A second design we are considering is to use
> > > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > > > one "virtual" document, which represents the root of the remote repo.
> > > >
> > > > Then, in "processDocuments" the new "document" is used to determine
> > > > all the child documents of that delta call, which are then added to
> > > > the queue via `activities.addDocumentReference`. To force the "virtual
> > > > seed" to trigger processDocuments again on the next call to
> > > > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > > > well.
> > > >
> > > > With this alternative design, the stage 1 seed effectively becomes a
> > > > no-op, and is just used as a mechanism to trigger stage 2.
> > > >
> > > > Thoughts?
> > > >
> > > > Regards,
> > > > Raman Gupta
> > > >
> >

Re: Repository connector for source with with delta API

Posted by Karl Wright <da...@gmail.com>.
For any given job run, all documents that are added via addSeedDocuments()
should be processed.  There is no magic in the framework that somehow knows
that a document has been created vs. modified vs. deleted until
processDocuments() is called.  If your claim is that this contract is not
being honored, could you try changing your connector model to
MODEL_ADD_CHANGE, just temporarily, to see if everything seems to work
using that model.  If it does *not* then clearly you've got some kind of
implementation problem at the addSeedDocuments() level because most of the
Manifold connectors use that model.

If MODEL_ADD_CHANGE mostly works for you, then the next step is to figure
out why MODEL_ADD_CHANGE_DELETE is failing.
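
For illustration only, here is a minimal sketch of how the model could be
switched for that experiment. It assumes the connector extends
BaseRepositoryConnector and that the MODEL_* constants are inherited from
IRepositoryConnector, as in the stock connectors; the class name is made up.

import org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector;

public class MyDeltaConnector extends BaseRepositoryConnector
{
  @Override
  public int getConnectorModel()
  {
    // Temporarily report MODEL_ADD_CHANGE for the test; switch back to
    // MODEL_ADD_CHANGE_DELETE once seeding/processing behaves as expected.
    return MODEL_ADD_CHANGE;
  }
}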

Karl


On Fri, May 24, 2019 at 5:06 PM Raman Gupta <ro...@gmail.com> wrote:

> On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com> wrote:
> >
> > For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> > that you have to include *at least* the documents that were changed,
> added,
> > or deleted since the previous stamp, and if no stamp is provided, it
> should
> > return ALL specified documents.  Are you doing that?
>
> Yes, the delta API gives us all the changed, added, and deleted
> documents, and those are exactly the ones that we are including.
>
> > If you are, the next thing to look at is the computation of the version
> > string.  The version string is what is used to figure out if a change
> took
> > place.  You need this IN ADDITION TO the addSeedDocuments() doing the
> right
> > thing.  For deleted documents, obviously the processDocuments() should
> call
> > the activities.deleteDocument() method.
>
> The version String is calculated by `processDocuments`. After
> calling `addSeedDocuments` once for document A version 1,
> `processDocuments` is never called again for that document, even
> though it has been modified to document A version 2. Therefore, our
> connector never gets a chance to return the "version 2" string.
>
> > Does this sound like what your code is doing?
>
> Yes, as far as we can go given the fact that `processDocuments` is
> only called once for any particular document identifier.
>
> > Karl
> >
> >
> > On Fri, May 24, 2019 at 4:25 PM Raman Gupta <ro...@gmail.com>
> wrote:
> >
> > > My team is creating a new repository connector. The source system has
> > > a delta API that lets us know of all new, modified, and deleted
> > > individual folders and documents since the last call to the API. Each
> > > call to the delta API provides the changes, as well as a token which
> > > can be provided on subsequent calls to get changes since that token
> > > was generated/returned.
> > >
> > > What is the best approach to building a repo connector to a system
> > > that has this type of delta API?
> > >
> > > Our first design was an implementation that specifies
> > > `MODEL_ADD_CHANGE_DELETE` and then:
> > >
> > > * In addSeedDocuments, on the initial call we seed every document in
> > > the source system. On subsequent calls, we use the delta API to seed
> > > every added, modified, or deleted file. We return the delta API token
> > > as the version value of addSeedDocuments, so that it can be used on
> > > subsequent calls.
> > >
> > > * In processDocuments, we do the usual thing for each document
> identifier.
> > >
> > > On prototyping, this works for new docs, but "processDocuments" is
> > > never triggered for modified and deleted docs.
> > >
> > > A second design we are considering is to use
> > > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > > one "virtual" document, which represents the root of the remote repo.
> > >
> > > Then, in "processDocuments" the new "document" is used to determine
> > > all the child documents of that delta call, which are then added to
> > > the queue via `activities.addDocumentReference`. To force the "virtual
> > > seed" to trigger processDocuments again on the next call to
> > > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > > well.
> > >
> > > With this alternative design, the stage 1 seed effectively becomes a
> > > no-op, and is just used as a mechanism to trigger stage 2.
> > >
> > > Thoughts?
> > >
> > > Regards,
> > > Raman Gupta
> > >
>

Re: Repository connector for source with with delta API

Posted by Raman Gupta <ro...@gmail.com>.
On Fri, May 24, 2019 at 4:41 PM Karl Wright <da...@gmail.com> wrote:
>
> For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
> that you have to include *at least* the documents that were changed, added,
> or deleted since the previous stamp, and if no stamp is provided, it should
> return ALL specified documents.  Are you doing that?

Yes, the delta API gives us all the changed, added, and deleted
documents, and those are exactly the ones that we are including.

> If you are, the next thing to look at is the computation of the version
> string.  The version string is what is used to figure out if a change took
> place.  You need this IN ADDITION TO the addSeedDocuments() doing the right
> thing.  For deleted documents, obviously the processDocuments() should call
> the activities.deleteDocument() method.

The version String is calculated by `processDocuments`. After
calling `addSeedDocuments` once for document A version 1,
`processDocuments` is never called again for that document, even
though it has been modified to document A version 2. Therefore, our
connector never gets a chance to return the "version 2" string.

> Does this sound like what your code is doing?

Yes, as far as we can go, given that `processDocuments` is
only called once for any particular document identifier.

> Karl
>
>
> On Fri, May 24, 2019 at 4:25 PM Raman Gupta <ro...@gmail.com> wrote:
>
> > My team is creating a new repository connector. The source system has
> > a delta API that lets us know of all new, modified, and deleted
> > individual folders and documents since the last call to the API. Each
> > call to the delta API provides the changes, as well as a token which
> > can be provided on subsequent calls to get changes since that token
> > was generated/returned.
> >
> > What is the best approach to building a repo connector to a system
> > that has this type of delta API?
> >
> > Our first design was an implementation that specifies
> > `MODEL_ADD_CHANGE_DELETE` and then:
> >
> > * In addSeedDocuments, on the initial call we seed every document in
> > the source system. On subsequent calls, we use the delta API to seed
> > every added, modified, or deleted file. We return the delta API token
> > as the version value of addSeedDocuments, so that it can be used on
> > subsequent calls.
> >
> > * In processDocuments, we do the usual thing for each document identifier.
> >
> > On prototyping, this works for new docs, but "processDocuments" is
> > never triggered for modified and deleted docs.
> >
> > A second design we are considering is to use
> > MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> > one "virtual" document, which represents the root of the remote repo.
> >
> > Then, in "processDocuments" the new "document" is used to determine
> > all the child documents of that delta call, which are then added to
> > the queue via `activities.addDocumentReference`. To force the "virtual
> > seed" to trigger processDocuments again on the next call to
> > `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> > well.
> >
> > With this alternative design, the stage 1 seed effectively becomes a
> > no-op, and is just used as a mechanism to trigger stage 2.
> >
> > Thoughts?
> >
> > Regards,
> > Raman Gupta
> >

Re: Repository connector for source with with delta API

Posted by Karl Wright <da...@gmail.com>.
For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
that you have to include *at least* the documents that were changed, added,
or deleted since the previous stamp, and if no stamp is provided, it should
return ALL specified documents.  Are you doing that?
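
For illustration, a minimal sketch of how that contract might map onto a
delta-token API. DeltaClient, DeltaResult, and getClient() are hypothetical
stand-ins for the real source-system client; only the ManifoldCF types and
the addSeedDocument() call are from the framework.

// Hypothetical wrapper around the source system's delta API.
interface DeltaClient {
  java.util.List<String> listAllDocumentIds();
  String currentDeltaToken();
  DeltaResult changesSince(String token);
}
interface DeltaResult {
  java.util.List<String> addedChangedOrDeletedIds();
  String newToken();
}

@Override
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
                               String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption
{
  DeltaClient client = getClient();  // hypothetical helper that builds the API client
  if (lastSeedVersion == null || lastSeedVersion.length() == 0)
  {
    // No stamp yet: seed ALL documents covered by the job specification.
    for (String id : client.listAllDocumentIds())
      activities.addSeedDocument(id);
    return client.currentDeltaToken();
  }
  // Stamp present: seed at least every document added, changed, or deleted
  // since that token was handed out.
  DeltaResult delta = client.changesSince(lastSeedVersion);
  for (String id : delta.addedChangedOrDeletedIds())
    activities.addSeedDocument(id);
  return delta.newToken();
}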

If you are, the next thing to look at is the computation of the version
string.  The version string is what is used to figure out if a change took
place.  You need this IN ADDITION TO the addSeedDocuments() doing the right
thing.  For deleted documents, obviously the processDocuments() should call
the activities.deleteDocument() method.
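
For illustration, a minimal sketch of the shape processDocuments() might take
under MODEL_ADD_CHANGE_DELETE. DocInfo, fetchMetadata(), and buildDocument()
are hypothetical stand-ins for the real repository calls; deleteDocument(),
ingestDocumentWithException(), and checkDocumentNeedsReindexing() are assumed
to be the standard 2.x IProcessActivity methods, as used by other connectors.

@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
                             Specification spec, IProcessActivity activities,
                             int jobMode, boolean usesDefaultAuthority)
  throws ManifoldCFException, ServiceInterruption
{
  for (String documentIdentifier : documentIdentifiers)
  {
    DocInfo info = fetchMetadata(documentIdentifier);  // hypothetical repository lookup
    if (info == null || info.isDeleted())
    {
      // The document is gone from the repository: remove it from the index.
      activities.deleteDocument(documentIdentifier);
      continue;
    }
    // The version string must change whenever the document content changes.
    String versionString = String.valueOf(info.lastModified());
    if (!activities.checkDocumentNeedsReindexing(documentIdentifier, versionString))
      continue;
    RepositoryDocument rd = buildDocument(info);  // hypothetical document builder
    activities.ingestDocumentWithException(documentIdentifier, versionString,
                                           info.uri(), rd);
  }
}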

Does this sound like what your code is doing?

Karl


On Fri, May 24, 2019 at 4:25 PM Raman Gupta <ro...@gmail.com> wrote:

> My team is creating a new repository connector. The source system has
> a delta API that lets us know of all new, modified, and deleted
> individual folders and documents since the last call to the API. Each
> call to the delta API provides the changes, as well as a token which
> can be provided on subsequent calls to get changes since that token
> was generated/returned.
>
> What is the best approach to building a repo connector to a system
> that has this type of delta API?
>
> Our first design was an implementation that specifies
> `MODEL_ADD_CHANGE_DELETE` and then:
>
> * In addSeedDocuments, on the initial call we seed every document in
> the source system. On subsequent calls, we use the delta API to seed
> every added, modified, or deleted file. We return the delta API token
> as the version value of addSeedDocuments, so that it can be used on
> subsequent calls.
>
> * In processDocuments, we do the usual thing for each document identifier.
>
> On prototyping, this works for new docs, but "processDocuments" is
> never triggered for modified and deleted docs.
>
> A second design we are considering is to use
> MODEL_CHAINED_ADD_CHANGE_DELETE and have addSeedDocuments return only
> one "virtual" document, which represents the root of the remote repo.
>
> Then, in "processDocuments" the new "document" is used to determine
> all the child documents of that delta call, which are then added to
> the queue via `activities.addDocumentReference`. To force the "virtual
> seed" to trigger processDocuments again on the next call to
> `addSeedDocuments`, we do `activities.deleteDocument(virtualDocId)` as
> well.
>
> With this alternative design, the stage 1 seed effectively becomes a
> no-op, and is just used as a mechanism to trigger stage 2.
>
> Thoughts?
>
> Regards,
> Raman Gupta
>