
Posted to dev@manifoldcf.apache.org by ju...@francelabs.com on 2021/03/21 14:45:35 UTC

How to override carry down data

Hi Karl,

 

I am using carry-down data in a repository connector, but I have found
that I am unable to update/override a value that has already been set.
Even though I am using the same key and the same parent identifier, the
values are stacked. So, when I retrieve carry-down data through the key, I
get more and more values in the array instead of a single updated one.
It seems I misunderstood the documentation: I believed that the
carry-down data values are stacked only if there are several parent
identifiers for the same key.
What can I do to maintain only one carry-down data value for a given key and
a given parent identifier?

 

Regards,

Julien

RE: How to override carry down data

Posted by ju...@francelabs.com.

I think I found the problem: I also set carry-down data on the parent with the same carry-down key "content". In that case, retrieveParentData for the childIdentifier returns both the data for itself and the data for the parent.
I simply have to change the carry-down key used for the parent; this is something I have to keep in mind!

Thanks for your help, Karl.
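
The effect of sharing a key can be pictured with a small toy model. This is only a sketch, not ManifoldCF code: the class CarrydownKeyDemo and its add/retrieve methods are hypothetical, the key name "parentContent" and the source name "otherSource" are made up, and the model assumes that a lookup by child identifier and key returns every value stored under that key, whichever document supplied it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: NOT ManifoldCF code. All names here are hypothetical. */
final class CarrydownKeyDemo {
  /** One stored value: which document supplied it, for which child, under which key. */
  private static final class Entry {
    final String source;
    final String child;
    final String key;
    final String value;

    Entry(String source, String child, String key, String value) {
      this.source = source;
      this.child = child;
      this.key = key;
      this.value = value;
    }
  }

  private final List<Entry> entries = new ArrayList<>();

  /** Store a value for a child under a key, remembering where it came from. */
  public void add(String source, String child, String key, String value) {
    entries.add(new Entry(source, child, key, value));
  }

  /** Assumption: lookup is by child and key only, so all sources are merged. */
  public List<String> retrieve(String child, String key) {
    List<String> out = new ArrayList<>();
    for (Entry e : entries) {
      if (e.child.equals(child) && e.key.equals(key)) {
        out.add(e.value);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Same key used twice: the child sees both values.
    CarrydownKeyDemo shared = new CarrydownKeyDemo();
    shared.add("parentDoc", "childDoc", "content", "newContent");
    shared.add("otherSource", "childDoc", "content", "parentOwnContent");
    System.out.println(shared.retrieve("childDoc", "content"));

    // Distinct keys: the child sees only the value meant for it.
    CarrydownKeyDemo split = new CarrydownKeyDemo();
    split.add("parentDoc", "childDoc", "content", "newContent");
    split.add("otherSource", "childDoc", "parentContent", "parentOwnContent");
    System.out.println(split.retrieve("childDoc", "content"));
  }
}
```

In this model, giving the two uses different keys is enough to keep the child's array down to a single value.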


RE: How to override carry down data

Posted by ju...@francelabs.com.

activities.noDocument is called on the parentIdentifier, and activities.ingestDocumentWithException is called on the child. They should trigger the method you mention, shouldn't they?


Re: How to override carry down data

Posted by Karl Wright <da...@gmail.com>.

It gets called during JobManager.finishDocuments(), here:

  @Override
  public DocumentDescription[] finishDocuments(Long jobID, String[] legalLinkTypes, String[] parentIdentifierHashes, int hopcountMethod)
    throws ManifoldCFException
...
          // A certain set of carrydown records are going to be deleted by the ensuing restoreRecords command.  Calculate that set of records!
          rval = calculateAffectedRestoreCarrydownChildren(jobID,parentIdentifierHashes);
          carryDown.restoreRecords(jobID,parentIdentifierHashes);
          database.performCommit();
...

Is your connector calling the IProcessActivity methods meant to signal that
document processing has finished?  If not, that is the problem!

Karl




Re: How to override carry down data

Posted by Karl Wright <da...@gmail.com>.

Ah, so it appears that the way this works is subtle and clever.

Values are added or updated in one phase of activity.  At this time the
records are flagged with either "new" or "existing".  At a later time,
values still in the "base state" are removed, and the "new" and "existing"
states are mapped back to the base state.
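
The two phases described above can be modelled in a few lines. This is a hedged sketch, not the real Carrydown class: CarrydownStateDemo, addOrUpdate and values are hypothetical names, the model tracks a single (parent, child, key) triple in memory rather than database rows, and it assumes that a lookup made before the finish step sees every stored value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the three-state bookkeeping. NOT the real Carrydown class. */
final class CarrydownStateDemo {
  enum State { BASE, NEW, EXISTING }

  // Stored values for one fixed (parent, child, key) triple, with their state.
  private final Map<String, State> records = new LinkedHashMap<>();

  /** Crawl phase: an unseen value becomes NEW, a value already stored becomes EXISTING. */
  public void addOrUpdate(String value) {
    State current = records.get(value);
    boolean unseen = current == null || current == State.NEW;
    records.put(value, unseen ? State.NEW : State.EXISTING);
  }

  /** Finish phase: delete what is still BASE, then map NEW and EXISTING back to BASE. */
  public void restoreRecords() {
    records.values().removeIf(s -> s == State.BASE);
    records.replaceAll((value, s) -> State.BASE);
  }

  /** Assumption: a lookup returns every stored value, whatever its state. */
  public List<String> values() {
    return new ArrayList<>(records.keySet());
  }

  public static void main(String[] args) {
    CarrydownStateDemo demo = new CarrydownStateDemo();

    demo.addOrUpdate("someContent");   // first crawl
    demo.restoreRecords();
    System.out.println("after first crawl: " + demo.values());

    demo.addOrUpdate("newContent");    // second crawl, finish step not yet run
    System.out.println("before restore:    " + demo.values());

    demo.restoreRecords();             // finish step sweeps the stale value
    System.out.println("after restore:     " + demo.values());
  }
}
```

In this model the old value survives only until the finish step runs for the parent; if that step never runs, the stale value is never swept.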

This is the Carrydown class method that supposedly does the deletion and
rejiggering of the states:

  /** Return all records belonging to the specified parent documents to the base state,
  * and delete the old (eliminated) child records.
  */
  public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
    throws ManifoldCFException

... and it appears that it does the right thing:

    // Delete
    StringBuilder sb = new StringBuilder("WHERE ");
    ArrayList newList = new ArrayList();

    sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
      new UnitaryClause(jobIDField,jobID),
      new MultiClause(parentIDHashField,list)})).append(" AND ");

    sb.append(newField).append("=?");
    newList.add(statusToString(ISNEW_BASE));
    performDelete(sb.toString(),newList,null);

    // Restore new values
    sb = new StringBuilder("WHERE ");
    newList.clear();

    sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
      new UnitaryClause(jobIDField,jobID),
      new MultiClause(parentIDHashField,list)})).append(" AND ");

    sb.append(newField).append(" IN (?,?)");
    newList.add(statusToString(ISNEW_EXISTING));
    newList.add(statusToString(ISNEW_NEW));

    HashMap map = new HashMap();
    map.put(newField,statusToString(ISNEW_BASE));
    map.put(processIDField,null);
    performUpdate(map,sb.toString(),newList,null);

    noteModifications(0,list.size(),0);

So the question becomes: does it get called appropriately?

Karl



>> -----Original Message-----
>> From: Karl Wright <da...@gmail.com>
>> Sent: Sunday, 21 March 2021 16:05
>> To: dev <de...@manifoldcf.apache.org>
>> Subject: Re: How to override carry down data
>>
>> Can you give me a code example?
>> The carry-down information is set by the parent, as you say.  The
>> specific information is keyed to the parent so when the child is added to
>> the queue, all old carrydown information from the same parent is deleted at
>> that time, and until that happens the carrydown information is preserved
>> for every child.  As you say, it can be augmented by other parents that
>> refer to the same child, but it is never *replaced* by carrydown info from
>> a different parent, just augmented.
>>
>> If it didn't work this way, MCF would have horrendous order dependencies
>> in what documents got processed first.  As it is, when the carrydown
>> information changes because another parent is discovered, the children are
>> queued for processing to achieve stable results.
>>
>> Karl
>>
>>

Re: How to override carry down data

Posted by Karl Wright <da...@gmail.com>.

I've tried to refresh my memory by looking at the carrydown code, which is
quite old at this point.  But one thing is very clear: that code never
removes carrydown data values unless the child or parent document goes
away, and wasn't intended to.

It's not at all trivial to do, but the code here could be modified to set
the carrydown values to exactly what is specified in the reference for the
given parent.  However, I worry that changing this behavior will break
something.  Carrydown has a built-in assumption that if the reference is
added multiple times with different data during a crawl, the data will
eventually stabilize and no more downstream processing will be necessary.
Incautious changes to carrydown will result in jobs that never complete.
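
As a rough illustration of that stabilization assumption, here is a toy
model. It is not the actual carrydown code, and every class and method name
in it is made up for the example. It counts how often the stored data changes
when the same reference is re-added with alternating values: once when values
are only ever accumulated, and once when each update replaces the previous
value. Each change stands in for "the child must be queued for processing
again".

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is not ManifoldCF code, and all names are hypothetical.
class CarrydownStabilityDemo {

  /** Counts how often the stored data changes when values are only ever added. */
  static int changesWhenAccumulating(List<String> updates) {
    Set<String> stored = new LinkedHashSet<>();
    int changes = 0;
    for (String value : updates) {
      // Set.add returns true only when the value was not already present.
      if (stored.add(value)) {
        changes++;
      }
    }
    return changes;
  }

  /** Counts how often the stored data changes when each update replaces the previous value. */
  static int changesWhenReplacing(List<String> updates) {
    String stored = null;
    int changes = 0;
    for (String value : updates) {
      if (!value.equals(stored)) {
        stored = value;
        changes++;
      }
    }
    return changes;
  }

  public static void main(String[] args) {
    // The same reference is re-added with alternating data during one crawl.
    List<String> updates = Arrays.asList("A", "B", "A", "B", "A", "B");
    System.out.println("accumulate: " + changesWhenAccumulating(updates));
    System.out.println("replace:    " + changesWhenReplacing(updates));
  }
}
```

In this model the accumulating variant stops changing once every distinct
value has been seen, while the replacing variant reports a change on every
alternation, which is the kind of behavior that could keep a job from
finishing.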

I think it is worth looking at changing the behavior so that no
accumulation of values takes place, though.  It's not an easy change, I
fear.  I'll look into how to make it happen.

Karl



On Sun, Mar 21, 2021 at 1:18 PM <ju...@francelabs.com> wrote:

> ---------------------------- First crawl
> -----------------------------------------
>
> In the processDocument method, the following code is triggered on the
> parentIdentifier:
>
> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
> new String[] { "content" }, new String[][] { { "someContent" } });
>
> Then the childIdentifier is processed and the following code is triggered
> in the processDocument method:
>
> final String[] contentArray =
> activities.retrieveParentData(childIdentifier, "content");
>
> At this point, the childIdentifier correctly retrieves a contentArray
> containing one value, "someContent".
>
> ---------------------------- Second crawl
> -----------------------------------------
>
> In the processDocument method, the following code is triggered on the
> parentIdentifier:
>
> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
> new String[] { "content" }, new String[][] { { "newContent" } });
>
> Then the childIdentifier is processed and the following code is triggered
> in the processDocument method:
>
> final String[] contentArray =
> activities.retrieveParentData(childIdentifier, "content");
>
> At this point, the childIdentifier retrieves a contentArray containing 2
> values, the old one "someContent", and the new one "newContent"
>
> I can guarantee that the parentIdentifier is the same in both crawls and
> that only "newContent" is added on the second crawl. I debugged the code
> to confirm this.
>
>
>
> Julien
>

RE: How to override carry down data

Posted by ju...@francelabs.com.

---------------------------- First crawl -----------------------------------------

In the processDocument method, the following code is triggered on the parentIdentifier:

activities.addDocumentReference(childIdentifier, parentIdentifier, null, new String[] { "content" }, new String[][] { { "someContent" } });

Then the childIdentifier is processed and the following code is triggered in the processDocument method:

final String[] contentArray = activities.retrieveParentData(childIdentifier, "content");

At this point, the childIdentifier correctly retrieves a contentArray containing one value, "someContent".

---------------------------- Second crawl -----------------------------------------

In the processDocument method, the following code is triggered on the parentIdentifier:

activities.addDocumentReference(childIdentifier, parentIdentifier, null, new String[] { "content" }, new String[][] { { "newContent" } });

Then the childIdentifier is processed and the following code is triggered in the processDocument method:

final String[] contentArray = activities.retrieveParentData(childIdentifier, "content");

At this point, the childIdentifier retrieves a contentArray containing two values: the old one, "someContent", and the new one, "newContent".

I can guarantee that the parentIdentifier is the same in both crawls and that only "newContent" is added on the second crawl. I debugged the code to confirm this.
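
To make the stacking visible as soon as it happens, a connector could wrap the retrieval in a small guard. The sketch below is only an illustration: ParentDataSource is a hypothetical stand-in that mirrors the retrieveParentData call shown above, not a ManifoldCF interface, and requireSingleValue is a made-up helper name.

```java
import java.util.Arrays;

// Sketch only: ParentDataSource is a hypothetical stand-in for the part of the
// activities object used above; it is not a ManifoldCF interface.
class CarrydownCheck {

  interface ParentDataSource {
    String[] retrieveParentData(String documentIdentifier, String dataName);
  }

  /** Returns the single carried-down value, or fails with the full list so stacked values are visible. */
  static String requireSingleValue(ParentDataSource source, String documentIdentifier, String dataName) {
    String[] values = source.retrieveParentData(documentIdentifier, dataName);
    if (values == null || values.length != 1) {
      throw new IllegalStateException("Expected exactly one value for key '" + dataName
          + "' on " + documentIdentifier + " but got " + Arrays.toString(values));
    }
    return values[0];
  }

  public static void main(String[] args) {
    // Simulates the second crawl described above, where two values come back.
    ParentDataSource stacked = (id, name) -> new String[] { "someContent", "newContent" };
    try {
      requireSingleValue(stacked, "childIdentifier", "content");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Failing loudly here is a design choice for debugging. A production connector might prefer to log the unexpected values and continue.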

Julien

-----Original Message-----
From: Karl Wright <da...@gmail.com>
Sent: Sunday, March 21, 2021 16:05
To: dev <de...@manifoldcf.apache.org>
Subject: Re: How to override carry down data

Can you give me a code example?
The carry-down information is set by the parent, as you say.  The specific information is keyed to the parent so when the child is added to the queue, all old carrydown information from the same parent is deleted at that time, and until that happens the carrydown information is preserved for every child.  As you say, it can be augmented by other parents that refer to the same child, but it is never *replaced* by carrydown info from a different parent, just augmented.

If it didn't work this way, MCF would have horrendous order dependencies in what documents got processed first.  As it is, when the carrydown information changes because another parent is discovered, the children are queued for processing to achieve stable results.

Karl

Re: How to override carry down data

Posted by Karl Wright <da...@gmail.com>.

Can you give me a code example?
The carry-down information is set by the parent, as you say.  The specific
information is keyed to the parent so when the child is added to the queue,
all old carrydown information from the same parent is deleted at that time,
and until that happens the carrydown information is preserved for every
child.  As you say, it can be augmented by other parents that refer to the
same child, but it is never *replaced* by carrydown info from a different
parent, just augmented.

If it didn't work this way, MCF would have horrendous order dependencies in
what documents got processed first.  As it is, when the carrydown
information changes because another parent is discovered, the children are
queued for processing to achieve stable results.
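
One way to picture the semantics described above is the toy model below. It
is only a sketch of one reading of that description, not the real carrydown
implementation, and all class and method names are hypothetical. Each
parent's contribution to a child is stored separately, re-adding a reference
from the same parent replaces only that parent's data, and the child sees the
union over all of its parents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: not ManifoldCF code; class and method names are hypothetical.
class PerParentCarrydownModel {
  // child -> parent -> key -> values
  private final Map<String, Map<String, Map<String, List<String>>>> store = new HashMap<>();

  /** Records a reference; whatever this parent previously carried down to this child is discarded. */
  void addReference(String child, String parent, String key, String... values) {
    Map<String, Map<String, List<String>>> byParent = store.get(child);
    if (byParent == null) {
      byParent = new LinkedHashMap<>();
      store.put(child, byParent);
    }
    Map<String, List<String>> fresh = new HashMap<>();
    fresh.put(key, new ArrayList<>(Arrays.asList(values)));
    byParent.put(parent, fresh); // replaces only this parent's contribution
  }

  /** Returns the values for a key as seen by the child: the union over all of its parents. */
  List<String> retrieve(String child, String key) {
    List<String> result = new ArrayList<>();
    Map<String, Map<String, List<String>>> byParent = store.get(child);
    if (byParent == null) {
      return result;
    }
    for (Map<String, List<String>> data : byParent.values()) {
      List<String> values = data.get(key);
      if (values != null) {
        result.addAll(values);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    PerParentCarrydownModel model = new PerParentCarrydownModel();
    model.addReference("child", "parentA", "content", "someContent");
    model.addReference("child", "parentB", "content", "otherContent");
    model.addReference("child", "parentA", "content", "newContent");
    // parentA's old value is gone; parentB's value is still present.
    System.out.println(model.retrieve("child", "content"));
  }
}
```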

Karl

On Sun, Mar 21, 2021 at 10:45 AM <ju...@francelabs.com> wrote:

> Hi Karl,
>
>
>
> I am using carry-down data in a repository connector but I have figured out
> that I am unable to update/override a value that already have been set.
> Indeed, despite I am using the same key and the same parent identifier, the
> values are stacked. So, when I retrieve carry-down data through the key I
> get more and more values in the array instead of only one that is updated.
> It seems I misunderstood the documentation, I was believing that the
> carry-down data values are stacked only if there are several parent
> identifiers for the same key.
> What can I do to maintain only one carry-down data value for a given key
> and a given parent identifier ?
>
>
>
> Regards,
>
> Julien
>
>
>
>