Posted to dev@manifoldcf.apache.org by Rafa Haro <rh...@apache.org> on 2014/09/17 16:41:32 UTC

Index Purged if no new documents are seeded

Hi folks,

We have been working on an “unofficial” Alfresco connector that is currently more or less working with ManifoldCF 1.7. You can check the code here: https://github.com/rafaharo/alfresco-webscript-manifold-connector. The README.md file is out of date, so please ignore it. Basically, the connector uses a client that consumes a set of Alfresco webscripts for crawling content and metadata. Document seeding is based on Alfresco transactions: the connector keeps asking Alfresco for a fixed number of transactions until no new transactions are found. The transaction info indicates, among other things, whether a document has been deleted, so that later, while processing the documents, those documents are marked for deletion.
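
To illustrate the seeding loop described above, here is a minimal Java sketch. It is not taken from the connector's code: the Transaction, TransactionSource and SeedResult types, the fetch signature and the in-memory source are all hypothetical, and the sketch assumes transactions carry increasing numeric ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical. */
public class TransactionSeedingSketch {

    /** One repository transaction: its id plus the documents it changed or deleted. */
    public static final class Transaction {
        public final long id;
        public final List<String> changedDocs;
        public final List<String> deletedDocs;

        public Transaction(long id, List<String> changedDocs, List<String> deletedDocs) {
            this.id = id;
            this.changedDocs = changedDocs;
            this.deletedDocs = deletedDocs;
        }
    }

    /** Returns up to maxCount transactions with id greater than afterId, ascending by id. */
    public interface TransactionSource {
        List<Transaction> fetch(long afterId, int maxCount);
    }

    /** What one seeding pass collected. */
    public static final class SeedResult {
        public final List<String> toProcess = new ArrayList<>();
        public final List<String> toDelete = new ArrayList<>();
        public long lastTransactionId;
    }

    /** Keeps fetching fixed-size batches until a fetch returns no new transactions. */
    public static SeedResult seed(TransactionSource source, long lastSeenId, int batchSize) {
        SeedResult result = new SeedResult();
        result.lastTransactionId = lastSeenId;
        while (true) {
            List<Transaction> batch = source.fetch(result.lastTransactionId, batchSize);
            if (batch.isEmpty()) {
                break; // no new transactions: nothing (more) to seed
            }
            for (Transaction tx : batch) {
                result.toProcess.addAll(tx.changedDocs);
                result.toDelete.addAll(tx.deletedDocs);
                result.lastTransactionId = Math.max(result.lastTransactionId, tx.id);
            }
        }
        return result;
    }

    /** In-memory source for trying the loop; assumes the list is sorted by ascending id. */
    public static TransactionSource inMemory(List<Transaction> all) {
        return (afterId, maxCount) -> {
            List<Transaction> out = new ArrayList<>();
            for (Transaction tx : all) {
                if (tx.id > afterId && out.size() < maxCount) {
                    out.add(tx);
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        List<Transaction> log = Arrays.asList(
            new Transaction(1, Arrays.asList("docA", "docB"), Collections.<String>emptyList()),
            new Transaction(2, Collections.<String>emptyList(), Arrays.asList("docA")),
            new Transaction(3, Arrays.asList("docC"), Collections.<String>emptyList()));
        SeedResult first = seed(inMemory(log), 0, 2);
        System.out.println(first.toProcess + " " + first.toDelete + " " + first.lastTransactionId);
        // A second pass starting from the last seen id finds nothing new to seed.
        SeedResult second = seed(inMemory(log), first.lastTransactionId, 2);
        System.out.println(second.toProcess.isEmpty());
    }
}
```

The second pass in main is the case this thread is about: there are no new transactions, so the seeding step hands over no documents at all.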

On the first run, all available document identifiers are seeded. On subsequent runs, our idea was to seed only the documents affected by new transactions (new documents, changes at any level, or deletions). What is happening right now is this: if there are no new transactions, no documents are seeded and the whole index is purged (all previously indexed documents are deleted).

My question is: is this normal behavior? How can we avoid it? Is there a configuration option for the jobs? We have read about minimal and complete runs, but they are still not clear to us.

Thanks a lot!
Cheers,
Rafa



Re: Index Purged if no new documents are seeded

Posted by Rafa Haro <rh...@gmail.com>.
Hi Karl, 

As always, thanks for your quick response. Changing the model to MODEL_ADD_CHANGE_DELETE did the trick. As for the seeding string, we already handle that case.

Thanks a lot. We hope the community will test this connector as well :-)

Cheers,
Rafa



Re: Index Purged if no new documents are seeded

Posted by Karl Wright <da...@gmail.com>.
Hi Rafa,

You probably need to do a few things to get your connector working right.
First, what connector model are you using?  MODEL_ALL is the default, and
it tells ManifoldCF that your seeding method supplies ALL matching
documents, and that's probably not right.  Maybe you want MODEL_ADD_CHANGE
instead.  Second, please be sure your connector deals properly with the
situation where the previous seeding string is empty.  The seeding string
is set to empty whenever someone changes the document specification for a
job.  In that case, you should always seed as if from the beginning of time.
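
A toy Java sketch may make the two points above more concrete. It is not ManifoldCF code: the SeedingModel enum is only a stand-in for the MODEL_* constants named in this thread, purgedAfterRun and lastSeenTransaction are hypothetical helpers, and the sketch rests on two simplifying assumptions: only the "all documents" model treats an unseeded document as gone, and the seeding string stores the last seen transaction id as a decimal number.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy illustration only; not ManifoldCF code. */
public class SeedingModelSketch {

    /** Hypothetical stand-ins for the MODEL_* constants mentioned in the thread. */
    public enum SeedingModel { ALL, ADD_CHANGE, ADD_CHANGE_DELETE }

    /**
     * Documents that would be dropped at the end of a run, under the simplifying
     * assumption that only ALL treats "not seeded this run" as "no longer exists".
     */
    public static Set<String> purgedAfterRun(SeedingModel model,
                                             Set<String> indexed,
                                             Set<String> seeded) {
        Set<String> purged = new LinkedHashSet<>();
        if (model == SeedingModel.ALL) {
            purged.addAll(indexed);
            purged.removeAll(seeded);
        }
        return purged;
    }

    /**
     * Hypothetical helper: a null or empty seeding string means "seed from the
     * beginning of time", represented here as transaction id 0. Otherwise the
     * string is assumed to hold the last seen transaction id.
     */
    public static long lastSeenTransaction(String previousSeedingString) {
        if (previousSeedingString == null || previousSeedingString.isEmpty()) {
            return 0L;
        }
        return Long.parseLong(previousSeedingString);
    }

    public static void main(String[] args) {
        Set<String> indexed = new LinkedHashSet<>(Arrays.asList("doc1", "doc2"));
        Set<String> seeded = new LinkedHashSet<>(); // a run with no new transactions
        for (SeedingModel model : SeedingModel.values()) {
            System.out.println(model + " purges " + purgedAfterRun(model, indexed, seeded));
        }
        System.out.println(lastSeenTransaction(""));   // document specification was changed
        System.out.println(lastSeenTransaction("42")); // resume after a previous run
    }
}
```

With nothing seeded, the ALL case reports both documents as purged, which matches the symptom described in the original message, while the two incremental cases report none.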

I will not have a chance to review your code for a while due to other
issues I'm currently looking at, but based on your description of the
problem, you've probably chosen the wrong seeding model.

Thanks,
Karl



