
Posted to dev@manifoldcf.apache.org by ju...@francelabs.com on 2020/02/29 13:10:14 UTC

Delta deletion

Hi dev community,

 

I am trying to develop a connector for an API that exposes a hierarchical
tree of documents: each document can have child documents.

During the initial crawl, the child documents are referenced in the MCF
connector through the method
activities.addDocumentReference(childDocumentIdentifier,
parentDocumentIdentifier, parentDataNames, parentDataValues)

The API is able to provide delta modifications/deletions since a given
date but, when a document that has children is deleted, the API only returns
the id of the document, not those of its children. On the MCF connector side,
I thought that, as I have referenced the children, deleting the parent
document would delete all its children with it, but this does not appear to
be the case.

So my question is: did I miss something? Is there another way to perform
delta deletions? Unfortunately, if I cannot find a way to solve this issue, I
will not be able to take advantage of the delta feature, and I will have
to use the "add_modify" connector type and test every id on a delta crawl to
figure out which ids are missing. This would be a huge loss of performance.

 

Regards,

Julien Massiera

Re: Delta deletion

Posted by Karl Wright <da...@gmail.com>.

The framework uses the model to figure out what the seeds mean, and really
nothing else.  So making it dynamic would be pretty much useless.

Karl
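
To make the point about the model concrete, here is a minimal, self-contained sketch. It is not ManifoldCF code: the Model enum only loosely mirrors the MODEL_* constants of the real connector interface, and documentsToProcess is a hypothetical helper that simplifies what a framework might do with the declared model. It only illustrates that the model is a fixed promise about what seeding returns, which the framework uses to decide whether previously crawled documents must be rechecked.

```java
import java.util.Set;
import java.util.TreeSet;

public class SeedModelSketch {
    // Hypothetical stand-in for the connector model constants.
    enum Model { ADD, ADD_CHANGE, ADD_CHANGE_DELETE }

    // A connector declares its model once; it is a static property of the
    // connector, not something derived from a job specification.
    static Model getConnectorModel() {
        return Model.ADD_CHANGE_DELETE;
    }

    // Simplified assumption: if seeding promises to report deletions too,
    // only the seeded documents need processing; otherwise every known
    // document has to be checked again to detect deletions.
    static Set<String> documentsToProcess(Model model, Set<String> known, Set<String> seeded) {
        Set<String> result = new TreeSet<>(seeded);
        if (model != Model.ADD_CHANGE_DELETE) {
            result.addAll(known);
        }
        return result;
    }
}
```

In this toy model, a connector that cannot report every deletion through seeding has to declare a weaker model, and the price is a recheck of all known documents on each incremental crawl.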


RE: Delta deletion

Posted by ju...@francelabs.com.

Thanks Karl!

The key part of the solution I needed was: "Carrydown changes force the children that rely on them to be processed again"

I have one last question: is it possible to "dynamically" change the return value of the getConnectorModel method? For example, is it possible to base it on a job specification parameter? Or is it too late because the framework has already considered the value at this step?

Julien


Re: Delta deletion

Posted by Karl Wright <da...@gmail.com>.

Just to be clear, this is what I think would work:

(1) Your addSeedDocuments() method adds your seeds.
(2) Each seed document, when processed, either decides it's dead, or it
calls IProcessActivity.addDocumentReference()
to populate children AND pass on a "Parent is alive" value.
(3) The seed document, inside processDocuments(), when it decides it is
dead, does nothing other than call IProcessActivity.removeDocument() on
itself.  If it's still alive, it checks whether it needs to be indexed,
using the standard IProcessActivity method for that, and if so, calls the
indexing method.
(4) processDocuments, when called for a child document, looks for the
"Parent is alive" value.  If it does not find it, it should call
IProcessActivity.removeDocument() on itself.  If it does find it, it should
check whether it needs indexing, and so on.  Note that if the parent
deletes itself, the carrydown data from the parent will be removed -- it
will not be changed, just gone.
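
The four steps above can be sketched as a small, self-contained simulation. This is not ManifoldCF code: ToyFramework, its method signatures, and the in-memory "repository" map are all hypothetical stand-ins, chosen only to show the control flow in which every document removes itself and a parent's removal makes its carrydown data vanish and its children get reprocessed.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CarrydownDeleteSketch {
    /** Toy in-memory model of the framework state; NOT the ManifoldCF API. */
    static class ToyFramework {
        final Set<String> indexed = new TreeSet<>();
        // child id -> ids of parents currently passing a "Parent is alive" value down
        final Map<String, Set<String>> carrydown = new HashMap<>();
        final Deque<String> queue = new ArrayDeque<>();

        void addDocumentReference(String childId, String parentId) {
            carrydown.computeIfAbsent(childId, k -> new HashSet<>()).add(parentId);
            queue.add(childId);
        }

        boolean parentIsAlive(String childId) {
            return !carrydown.getOrDefault(childId, Collections.emptySet()).isEmpty();
        }

        void removeDocument(String docId) {
            indexed.remove(docId);
            // Carrydown contributed by a removed parent disappears, and the
            // children that relied on it are queued for processing again.
            for (Map.Entry<String, Set<String>> e : carrydown.entrySet()) {
                if (e.getValue().remove(docId)) {
                    queue.add(e.getKey());
                }
            }
        }
    }

    /** Steps (2) to (4): a document only ever removes itself. */
    static void processDocument(String docId, Set<String> seeds,
                                Map<String, List<String>> repository, ToyFramework fw) {
        if (seeds.contains(docId)) {
            List<String> children = repository.get(docId);
            if (children == null) {          // the seed decides it is dead
                fw.removeDocument(docId);
                return;
            }
            for (String child : children) {  // populate children, pass "alive" down
                fw.addDocumentReference(child, docId);
            }
            fw.indexed.add(docId);
        } else if (fw.parentIsAlive(docId)) {
            fw.indexed.add(docId);           // child with a live parent
        } else {
            fw.removeDocument(docId);        // child whose carrydown data is gone
        }
    }

    /** Step (1) plus the processing loop. */
    static Set<String> crawl(Set<String> seeds, Map<String, List<String>> repository, ToyFramework fw) {
        fw.queue.addAll(seeds);
        while (!fw.queue.isEmpty()) {
            processDocument(fw.queue.poll(), seeds, repository, fw);
        }
        return fw.indexed;
    }
}
```

In this sketch a delta crawl only needs the deleted parent as a seed: removing it empties the children's carrydown data, which queues them again, and each child then removes itself.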

Note that the child document may be processed more than once in any given
job run depending on the order in which things happen.  Carrydown changes force the
children that rely on them to be processed again; that's how ManifoldCF
keeps it all straight and consistent.

I haven't thought through whether calling IProcessActivity.noDocument() is
better than IProcessActivity.removeDocument().  I suspect it is because it
will leave a versioned gravemarker around while
IProcessActivity.removeDocument() gets rid of all traces of the document.
You would know best which makes the most sense in your context.

Karl
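
The distinction Karl suspects between the two calls can be pictured with a toy model. Everything here is an assumption made for illustration: ToyState, its fields, and the simplified method signatures are hypothetical, and the behavior is modeled only on the wording above ("versioned gravemarker" versus "all traces"), not on the actual ManifoldCF implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GravemarkerSketch {
    /** Toy index plus status table; hypothetical, not ManifoldCF classes. */
    static class ToyState {
        final Set<String> index = new HashSet<>();
        final Map<String, String> versions = new HashMap<>(); // doc id -> last recorded version

        void ingest(String id, String version) {
            index.add(id);
            versions.put(id, version);
        }

        // Gone from the index, but a version record (the "gravemarker") stays.
        void noDocument(String id, String version) {
            index.remove(id);
            versions.put(id, version);
        }

        // Gone from the index and from the status table.
        void removeDocument(String id) {
            index.remove(id);
            versions.remove(id);
        }

        // A document needs work only if its version differs from the record.
        boolean needsProcessing(String id, String currentVersion) {
            return !currentVersion.equals(versions.get(id));
        }
    }
}
```

Under these assumptions, a document handled with noDocument() can be skipped on a later pass while its version is unchanged, whereas one handled with removeDocument() looks brand new if it is ever seen again.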


On Thu, Mar 26, 2020 at 11:11 PM Karl Wright <da...@gmail.com> wrote:

> There is no such restriction.  The only requirement is that every document
> deletes itself and cannot delete any other documents.
>
> Perhaps you can share some code snippets?
>
>
> On Thu, Mar 26, 2020, 11:52 AM <ju...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> I tried to use the carrydown mechanism to delete child documents, but
>> I am facing a problem:
>>
>> During the first crawl, the connector registers the child documents of a
>> document as carrydown data in the processDocuments method through the
>> activities.addDocumentReference method, and all is working well.
>> During a delta crawl, the addSeedDocuments method declares deleted parent
>> documents, but in processDocuments, although I am able to retrieve the
>> child document ids of the parent document thanks to the carrydown data, I
>> am unable to delete them. My guess is that this is because the ids I want
>> to delete have not been declared in the addSeedDocuments method. If this
>> is correct, is there any way to change this behavior?
>> Otherwise, is there another way to do things? As I cannot retrieve
>> carrydown data in addSeedDocuments, I seem to be at a dead end.
>>
>> Julien
>>
>>
>> -----Original Message-----
>> From: Julien Massiera <ju...@francelabs.com>
>> Sent: Monday, March 9, 2020 14:24
>> To: dev@manifoldcf.apache.org
>> Subject: Re: Delta deletion
>>
>> Yes I consider the confluence connector complete.
>>
>> As you suggest, I will try to use the "carrydown" mechanism to do what I
>> want.
>>
>> Thanks,
>>
>> Julien
>>
>> On 09/03/2020 13:59, Karl Wright wrote:
>> > Do you consider the confluence connector in the branch complete?
>> > If so I'll look at it as time permits later today.
>> >
>> > As far as your proposal is concerned, maintaining lists of
>> > dependencies for all documents is quite expensive.  We do this for hop
>> > counting and we basically tell people to only use it if they must,
>> > because of the huge amount of database overhead involved.  We also
>> > maintain "carrydown" data which is accessible during document
>> > processing.  It is typically used for ingestion, but maybe you could
>> > use that for a signal that child documents should delete themselves or
>> > something.
>> >
>> > Major crawling model changes are a gigantic effort; there are always
>> > many things to consider and many problems encountered that need to be
>> > worked around.  If you are concerned simply with the load on your API
>> > to handle deletions, I'd suggest using one of the existing mechanisms
>> > for reducing that.  But I can see no straightforward way to
>> > incrementally add dependency deletion to the current framework.
>> >
>> > Karl
>> >
>> >
>> > On Mon, Mar 9, 2020 at 5:53 AM Julien Massiera <
>> > julien.massiera@francelabs.com> wrote:
>> >
>> >> Hi Karl,
>> >>
>> >> Now that I finished the confluence connector, I am getting back to
>> >> the other one I was working on, and it would greatly help me to have
>> >> your thoughts on my proposal below.
>> >>
>> >> Thanks,
>> >> Julien
>> >>
>> >> On 02/03/2020 16:40, julien.massiera@francelabs.com wrote:
>> >>> Hi Karl,
>> >>>
>> >>> Thanks for your answer.
>> >>>
>> >>> Your explanations confirm what I was anticipating about the way MCF
>> >>> currently implements its model. As you stated, this does mean that
>> >>> in order to use the _DELETE model properly, the seeding process has
>> >>> to provide the complete list of deleted documents.
>> >>>
>> >>> Yet wouldn't it be a useful improvement to update the
>> >>> activities.deleteDocument method (or create an additional delete
>> >>> method) so that it automatically, and optionally, removes the
>> >>> documents referenced by a document id?
>> >>>
>> >>> For instance, since the activities.addDocumentReference method
>> >>> already asks for the document identifier of the "parent" document,
>> >>> couldn't we maintain in postgres a list of "child ids" and use it
>> >>> during the delete process to delete them?
>> >>>
>> >>> This is very useful in the use case I already described, but I am
>> >>> sure it would be useful for other types of connectors and/or future
>> >>> connectors. The benefits of such a modification increase with the
>> >>> number of crawled documents.
>> >>>
>> >>> Here is an illustration of the benefits of this MCF modification:
>> >>>
>> >>> With my current connector, if my first crawl ingests 1M documents
>> >>> and on the delta crawl only 1 document that has 2 children is
>> >>> deleted, it must rely on the processDocuments method to check the
>> >>> version of each of the 1M documents to figure out and delete the 3
>> >>> concerned ones (so at least 1M calls to the API of the targeted
>> >>> repository). With the suggested optional modification, the seeding
>> >>> process would use the delta API of the targeted repository and
>> >>> declare the parent document (only one API call), then the
>> >>> processDocuments method would be triggered only once to check the
>> >>> version of the document (one more API call), figure out that it no
>> >>> longer exists and delete it with its 2 children. That is 2 API
>> >>> calls vs 1M... Even if on the framework side we have one more
>> >>> request to perform to postgres, I think it is worth the processing
>> >>> time.
>> >>>
>> >>> What do you think?
>> >>>
>> >>> Julien
>> >>>
>> >>> -----Original Message-----
>> >>> From: Karl Wright <da...@gmail.com>
>> >>> Sent: Saturday, February 29, 2020 15:51
>> >>> To: dev <de...@manifoldcf.apache.org>
>> >>> Subject: Re: Delta deletion
>> >>>
>> >>> Hi Julien,
>> >>>
>> >>> First, ALL models rely on individual existence checks for documents.
>> >>> That is, when your connector fetches a deleted document, the
>> >>> framework has to be told that the document is gone, or it will not be
>> >>> removed.  There is no "discovery" process for deleted documents other
>> >>> than seeding (and only when the model includes _DELETE).
>> >>>
>> >>> The upshot of this is that IF your seeding method does not return
>> >>> all documents that have been removed THEN it cannot be a _DELETE model.
>> >>>
>> >>> I hope this helps.
>> >>>
>> >>> Karl
>> >>>
>> >>>
>> >> --
>> >> Julien MASSIERA
>> >> Directeur développement produit
>> >> France Labs – Les experts du Search
>> >> Datafari – Vainqueur du trophée Big Data 2018 au Digital Innovation
>> >> Makers Summit www.francelabs.com
>> >>
>> >>
>> --
>> Julien MASSIERA
>> Directeur développement produit
>> France Labs – Les experts du Search
>> Datafari – Vainqueur du trophée Big Data 2018 au Digital Innovation
>> Makers Summit www.francelabs.com
>>
>>
>>

Re: Delta deletion

Posted by Karl Wright <da...@gmail.com>.

There is no such restriction.  The only requirement is that every document
deletes itself and cannot delete any other documents.

Perhaps you can share some code snippets?
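
As a rough illustration of the rule stated above, the following Java sketch models a processing scope that is bound to a single document identifier. SelfDeleteSketch, Scope, and delete are hypothetical names, not ManifoldCF classes or methods, and the thread does not say how the real framework reacts to a request to delete a different document; this sketch simply ignores such a request.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a crawler index; none of these names are ManifoldCF classes.
class SelfDeleteSketch {
    private final Set<String> indexed = new HashSet<>();

    void add(String id) { indexed.add(id); }

    boolean contains(String id) { return indexed.contains(id); }

    // A processing scope bound to exactly one document identifier.
    final class Scope {
        private final String currentId;

        Scope(String currentId) { this.currentId = currentId; }

        // Removal is honoured only for the document currently being processed.
        boolean delete(String id) {
            if (!currentId.equals(id)) {
                return false; // assumption: a request for another document is ignored
            }
            return indexed.remove(id);
        }
    }

    Scope process(String id) { return new Scope(id); }

    public static void main(String[] args) {
        SelfDeleteSketch index = new SelfDeleteSketch();
        index.add("parent");
        index.add("child");
        Scope scope = index.process("parent");
        System.out.println(scope.delete("child"));  // attempt to delete another document
        System.out.println(scope.delete("parent")); // the document deletes itself
    }
}
```

In this model, a deleted parent cannot remove its children directly; each child has to be processed in its own scope and remove itself.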



RE: Delta deletion

Posted by ju...@francelabs.com.

Hi Karl,

I tried to use the carrydown mechanism to delete child documents, but I am facing a problem:

During the first crawl, the connector registers the child documents of a document as carrydown data in the processDocuments method through the activities.addDocumentReference method, and everything works well.
During a delta crawl, the addSeedDocuments method declares the deleted parent documents. In processDocuments, however, although I can retrieve the child document ids of the parent document from the carrydown data, I am unable to delete them. My guess is that this is because the ids I want to delete have not been declared in the addSeedDocuments method. If that is correct, is there any way to change this behavior?
Otherwise, is there another way to do this? As I cannot retrieve carrydown data in addSeedDocuments, I seem to be at a dead end.

Julien



Re: Delta deletion

Posted by Julien Massiera <ju...@francelabs.com>.

Yes, I consider the Confluence connector complete.

As you suggest, I will try to use the "carrydown" mechanism to do what I
want.

Thanks,

Julien

-- 
Julien MASSIERA
Product Development Director
France Labs – The Search experts
Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation Makers Summit
www.francelabs.com

Re: Delta deletion

Posted by Karl Wright <da...@gmail.com>.

Do you consider the confluence connector in the branch complete?
If so I'll look at it as time permits later today.

As far as your proposal is concerned, maintaining lists of dependencies for
all documents is quite expensive.  We do this for hop counting and we
basically tell people to only use it if they must, because of the huge
amount of database overhead involved.  We also maintain "carrydown" data
which is accessible during document processing.  It is typically used for
ingestion, but maybe you could use that for a signal that child documents
should delete themselves or something.
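
The carrydown idea above can be pictured with a small, self-contained Java simulation. Every name here (CarrydownSketch, processLiveParent, processDeletedParent, processChild, drainQueue) is hypothetical and none of it is ManifoldCF code. It assumes, as stated elsewhere in this thread, that a change in carrydown data causes the children relying on it to be processed again, and it simplifies the crawl by indexing children as soon as their parent registers them.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of "children delete themselves when the carrydown signal disappears".
class CarrydownSketch {
    // childId -> (parentId -> value carried down from that parent)
    private final Map<String, Map<String, String>> carrydown = new HashMap<>();
    private final Set<String> indexed = new HashSet<>();
    private final Deque<String> queue = new ArrayDeque<>();

    // Parent processing: index the parent and register each child with a carried-down marker.
    void processLiveParent(String parentId, List<String> childIds) {
        indexed.add(parentId);
        for (String child : childIds) {
            carrydown.computeIfAbsent(child, k -> new HashMap<>()).put(parentId, "alive");
            indexed.add(child); // simplification: the child is indexed immediately
        }
    }

    // Parent found missing: it deletes only itself. Its carrydown entries vanish,
    // and every child that relied on them is queued for reprocessing.
    void processDeletedParent(String parentId) {
        indexed.remove(parentId);
        for (Map.Entry<String, Map<String, String>> entry : carrydown.entrySet()) {
            if (entry.getValue().remove(parentId) != null) {
                queue.add(entry.getKey());
            }
        }
    }

    // Child processing: no remaining marker means no live parent, so the child deletes itself.
    void processChild(String childId) {
        Map<String, String> values = carrydown.getOrDefault(childId, Collections.emptyMap());
        if (values.isEmpty()) {
            indexed.remove(childId);
            carrydown.remove(childId);
        }
    }

    // Process every child that was queued because its carrydown data changed.
    void drainQueue() {
        while (!queue.isEmpty()) {
            processChild(queue.poll());
        }
    }

    boolean isIndexed(String id) { return indexed.contains(id); }
}
```

The point of the design is that the parent never touches its children: it only changes the data they depend on, and each child decides for itself when it is processed again.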

Major crawling model changes are a gigantic effort; there are always many
things to consider and many problems encountered that need to be worked
around.  If you are concerned simply with the load on your API to handle
deletions, I'd suggest using one of the existing mechanisms for reducing
that.  But I can see no straightforward way to incrementally add dependency
deletion to the current framework.

Karl



Re: Delta deletion

Posted by Julien Massiera <ju...@francelabs.com>.

Hi Karl,

Now that I have finished the Confluence connector, I am getting back to
the other one I was working on, and it would greatly help me to have
your thoughts on my proposal below.

Thanks,
Julien

On 02/03/2020 16:40, julien.massiera@francelabs.com wrote:
> Hi Karl,
>
> Thanks for your answer.
>
> Your explanations validate what I was anticipating on the way MCF is currently implementing its model. As you stated, this does mean that in order to use the _DELETE model properly, the seeding process has to provide the complete list of deleted documents.
>
> Yet wouldn't it be a useful improvement to update the activities.deleteDocument method (or create an additional delete method) so that it automatically – and optionnaly - removes the referenced documents of a document Id ?
>
> For instance, since the activities.addDocumentReference method already asks the document identifier of the "parent" document, couldn’t we maintain in postgres a list of "child ids" and use it during the delete process to delete them ?
>
> This is very useful in the use case I already described but I am sure it would be useful for other type of connectors and/or future connectors. The benefits of such modification increase with the number of crawled documents.
>
> Here is an illustration of the benefits of this MCF modification:
>
> With my current connector, if my first crawl ingests 1M documents and on the delta crawl only 1 document that has 2 children is deleted, it must rely on the processDocument method to check the version of each of the 1M documents to figure out and delete the 3 concerned ones (so at least 1M calls to the API of the targeted repository). With the suggested optional modification, the seeding process would use the delta API of the targeted repository and declare the parent document (only one API call), then the processDocuments method would be triggered only one time to check the version of the document (another one API call), figure out that it does not exists anymore and delete it with its 2 children. Its 2 API calls vs 1M... even if on framework side we have one more request to perform to postgres, I think it worth the processing time.
>
> What do you think ?
>
> Julien
>
> -----Message d'origine-----
> De : Karl Wright <da...@gmail.com>
> Envoyé : samedi 29 février 2020 15:51
> À : dev <de...@manifoldcf.apache.org>
> Objet : Re: Delta deletion
>
> Hi Julien,
>
> First, ALL models rely on individual existence checks for documents.  That is, when your connector fetches a deleted document, the framework has to be told that the document is gone, or it will not be removed.  There is no "discovery" process for deleted documents other than seeding (and only when the model includes _DELETE).
>
> The upshot of this is that IF your seeding method does not return all documents that have been removed THEN it cannot be a _DELETE model.
>
> I hope this helps.
>
> Karl
>
>
> On Sat, Feb 29, 2020 at 8:10 AM <ju...@francelabs.com> wrote:
>
>> Hi dev community,
>>
>>
>>
>> I am trying to develop a connector for an API that exposes a
>> hierarchical arborescence of documents: each document can have children documents.
>>
>> During the init crawl, the child documents are referenced in the MCF
>> connector through the method
>> activities.addDocumentRefenrece(childDocumentIdentifier,
>> parentDocumentIdentifier, parentDataNames, parentDataValues)
>>
>> The API is able to provide delta modifications/deletions from a
>> provided date but, when a document that has children is deleted, the
>> API only returns the id of the document, not its children. On the MCF
>> connector side, I thought that, as I have referenced the children, by
>> deleting the parent document all its children would be deleted with
>> it, but it appears that it is not the case.
>>
>> So my question is : did I miss something ? Is there another way to
>> perform delta deletions ? Unfortunately if I don't find a way to solve
>> this issue, I will not be able to take advantage of the delta feature
>> and thus I will have to use the "add_modify" connector type and test
>> every id on a delta crawl to figure out which ids are missing. This
>> would be a huge performance loss.
>>
>>
>>
>> Regards,
>>
>> Julien Massiera
>>
>>
-- 
Julien MASSIERA
Head of Product Development
France Labs – The Search Experts
Datafari – Winner of the 2018 Big Data trophy at the Digital Innovation Makers Summit
www.francelabs.com

RE: Delta deletion

Posted by ju...@francelabs.com.

Hi Karl,

Thanks for your answer.

Your explanation confirms what I expected about how MCF currently implements its models. As you stated, this means that to use a _DELETE model properly, the seeding process has to provide the complete list of deleted documents.

Yet wouldn't it be a useful improvement to update the activities.deleteDocument method (or add a new delete method) so that it automatically, and optionally, removes the documents referenced by a given document id?

For instance, since the activities.addDocumentReference method already takes the identifier of the "parent" document, couldn't we maintain in Postgres a list of "child ids" and use it during deletion to remove the children as well?

This would be very useful in the use case I described, and I am sure it would also help other types of connectors, current and future. The benefit of such a modification grows with the number of crawled documents.

Here is an illustration of the benefits of this MCF modification:

With my current connector, if my first crawl ingests 1M documents and the delta crawl sees only 1 deleted document that has 2 children, it must rely on the processDocuments method to check the version of each of the 1M documents to find and delete the 3 affected ones (so at least 1M calls to the API of the target repository). With the suggested optional modification, the seeding process would use the delta API of the target repository and declare the parent document (only one API call); the processDocuments method would then be triggered only once to check the version of that document (one more API call), find that it no longer exists, and delete it along with its 2 children. That is 2 API calls versus 1M. Even if the framework has to perform one more request to Postgres, I think it is worth the processing time.
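
The proposal above can be illustrated with a minimal, self-contained sketch. It is not ManifoldCF code: the class CascadeDeleteSketch and its methods addReference and deleteWithChildren are hypothetical names, and an in-memory map stands in for the "child ids" table that would live in Postgres.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in ManifoldCF.
final class CascadeDeleteSketch {
  // parent id -> ids of the children declared under it (the proposed "child ids" list).
  private final Map<String, List<String>> childrenByParent = new HashMap<>();
  // Ids currently held in the (simulated) index.
  private final Set<String> indexed = new HashSet<>();

  // Records a document and, when it has a parent, the parent -> child link.
  void addReference(String childId, String parentId) {
    indexed.add(childId);
    if (parentId != null) {
      childrenByParent.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }
  }

  // Deletes the document and all of its descendants; returns the ids actually removed.
  List<String> deleteWithChildren(String documentId) {
    List<String> removed = new ArrayList<>();
    Deque<String> pending = new ArrayDeque<>();
    pending.push(documentId);
    while (!pending.isEmpty()) {
      String id = pending.pop();
      if (indexed.remove(id)) {
        removed.add(id);
      }
      // Removing the map entry also guarantees each parent is expanded only once.
      List<String> children = childrenByParent.remove(id);
      if (children != null) {
        children.forEach(pending::push);
      }
    }
    return removed;
  }

  int size() {
    return indexed.size();
  }

  public static void main(String[] args) {
    CascadeDeleteSketch index = new CascadeDeleteSketch();
    index.addReference("parent", null);
    index.addReference("child-1", "parent");
    index.addReference("child-2", "parent");
    index.addReference("unrelated", null);

    // The delta API reports only "parent"; the children follow from the stored links.
    List<String> removed = index.deleteWithChildren("parent");
    System.out.println(removed.size() + " removed, " + index.size() + " left");
  }
}
```

In this sketch one reported id is enough to remove the parent and both children, which is the "2 API calls versus 1M" case. The cost moves to maintaining and querying the parent-to-child links on the framework side.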

What do you think ? 

Julien

-----Original Message-----
From: Karl Wright <da...@gmail.com>
Sent: Saturday, February 29, 2020 15:51
To: dev <de...@manifoldcf.apache.org>
Subject: Re: Delta deletion

Hi Julien,

First, ALL models rely on individual existence checks for documents.  That is, when your connector fetches a deleted document, the framework has to be told that the document is gone, or it will not be removed.  There is no "discovery" process for deleted documents other than seeding (and only when the model includes _DELETE).

The upshot of this is that IF your seeding method does not return all documents that have been removed THEN it cannot be a _DELETE model.

I hope this helps.

Karl

On Sat, Feb 29, 2020 at 8:10 AM <ju...@francelabs.com> wrote:

> Hi dev community,
>
>
>
> I am trying to develop a connector for an API that exposes a
> hierarchical tree of documents: each document can have child documents.
>
> During the initial crawl, the child documents are referenced in the MCF
> connector through the method
> activities.addDocumentReference(childDocumentIdentifier,
> parentDocumentIdentifier, parentDataNames, parentDataValues)
>
> The API is able to provide delta modifications/deletions from a 
> provided date but, when a document that has children is deleted, the 
> API only returns the id of the document, not its children. On the MCF 
> connector side, I thought that, as I have referenced the children, by 
> deleting the parent document all its children would be deleted with 
> it, but it appears that it is not the case.
>
> So my question is : did I miss something ? Is there another way to 
> perform delta deletions ? Unfortunately if I don't find a way to solve 
> this issue, I will not be able to take advantage of the delta feature 
> and thus I will have to use the "add_modify" connector type and test
> every id on a delta crawl to figure out which ids are missing. This
> would be a huge performance loss.
>
>
>
> Regards,
>
> Julien Massiera
>
>

Re: Delta deletion

Posted by Karl Wright <da...@gmail.com>.

Hi Julien,

First, ALL models rely on individual existence checks for documents.  That
is, when your connector fetches a deleted document, the framework has to be
told that the document is gone, or it will not be removed.  There is no
"discovery" process for deleted documents other than seeding (and only when
the model includes _DELETE).

The upshot of this is that IF your seeding method does not return all
documents that have been removed THEN it cannot be a _DELETE model.
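
As a rough illustration of the point above, here is a minimal sketch that is not ManifoldCF code. The class ExistenceCheckSketch and its method processSeeds are hypothetical, and plain sets stand in for the index and the repository. It assumes that only seeded ids get an existence check, so a delta feed that omits the children leaves them in the index.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names exist in ManifoldCF.
final class ExistenceCheckSketch {
  // Checks only the seeded ids against the repository and drops from the
  // index those the repository no longer has. Unseeded ids are never looked at.
  static Set<String> processSeeds(Set<String> index, Set<String> repository, List<String> seeds) {
    Set<String> result = new HashSet<>(index);
    for (String id : seeds) {
      if (!repository.contains(id)) {
        result.remove(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> index = Set.of("parent", "child-1", "child-2");
    Set<String> repository = Set.of(); // the parent and both children are gone

    // The delta feed reports only the parent: the two children stay indexed.
    System.out.println(processSeeds(index, repository, List.of("parent")).size()); // prints 2

    // The delta feed reports every removed document: the index ends up clean.
    System.out.println(
        processSeeds(index, repository, List.of("parent", "child-1", "child-2")).size()); // prints 0
  }
}
```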

I hope this helps.

Karl

On Sat, Feb 29, 2020 at 8:10 AM <ju...@francelabs.com> wrote:

> Hi dev community,
>
>
>
> I am trying to develop a connector for an API that exposes a hierarchical
> tree of documents: each document can have child documents.
>
> During the initial crawl, the child documents are referenced in the MCF
> connector through the method
> activities.addDocumentReference(childDocumentIdentifier,
> parentDocumentIdentifier, parentDataNames, parentDataValues)
>
> The API is able to provide delta modifications/deletions from a provided
> date but, when a document that has children is deleted, the API only
> returns the id of the document, not its children. On the MCF connector
> side, I thought that, as I have referenced the children, by deleting the
> parent document all its children would be deleted with it, but it appears
> that it is not the case.
>
> So my question is : did I miss something ? Is there another way to perform
> delta deletions ? Unfortunately if I don't find a way to solve this issue,
> I will not be able to take advantage of the delta feature and thus I will
> have to use the "add_modify" connector type and test every id on a delta
> crawl to figure out which ids are missing. This would be a huge
> performance loss.
>
>
>
> Regards,
>
> Julien Massiera
>
>