
Posted to user@manifoldcf.apache.org by Paththamestrige Perera <pr...@gmail.com> on 2014/07/15 23:06:01 UTC

Question about using ManifoldCF Repository Connectors

Hello All,

I'm new to Apache ManifoldCF and have spent some time with the book
'ManifoldCF in Action' as well. I have started using ManifoldCF with the
available repository connectors: the CMIS Repository Connector, the
Alfresco Repository Connector and the File System Connector.

I have used them as continuous crawlers with specific re-crawl intervals.
What I have noticed is that, regardless of the document version (whether it
has changed or not), the CMIS and Alfresco connectors process all seeded
documents in every re-crawl job. I took a look at their implementations
and, as far as I can see, these repository connectors do not use the
'scanOnly' property when processing seeded documents, which hints whether
the document version has changed. It seems intentional by design. So I'm
hoping to learn why it is necessary to process all seeded documents (as
opposed to processing only the documents that were updated within the
re-crawl interval)?

Thanks!

Prasad Perera.

Re: Question about using ManifoldCF Repository Connectors

Posted by Karl Wright <da...@gmail.com>.

Hi Prasad,

Yes, please create a ticket for any problems you find in the connector
implementations.  If a document is missing in CMIS, the CMIS connector
should definitely be calling IProcessActivity.deleteDocument().

Karl
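
Editorial sketch: the idea discussed in this reply (treat a "not found"
error from the repository as a deletion and report it to the framework)
might look roughly like the following. It is a minimal, self-contained
illustration and does not use the real ManifoldCF or OpenCMIS classes:
ObjectNotFoundException, Activity and Repository are hypothetical stand-ins
for CmisObjectNotFoundException, IProcessActivity and a CMIS session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class DeleteOnMissingSketch {
    // Hypothetical stand-in for CmisObjectNotFoundException.
    static class ObjectNotFoundException extends RuntimeException {
        ObjectNotFoundException(String id) { super("not found: " + id); }
    }

    // Hypothetical stand-in for the parts of IProcessActivity used here.
    interface Activity {
        void ingestDocument(String id, String content);
        void deleteDocument(String id);
    }

    // Hypothetical stand-in for a repository session.
    interface Repository {
        String fetch(String id); // throws ObjectNotFoundException if the object is gone
    }

    static void processDocuments(List<String> ids, Repository repo, Activity activity) {
        for (String id : ids) {
            try {
                activity.ingestDocument(id, repo.fetch(id));
            } catch (ObjectNotFoundException e) {
                // The document vanished from the repository: report the deletion
                // so the output connector can remove it from the index.
                activity.deleteDocument(id);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = Map.of("doc1", "hello");
        Repository repo = id -> {
            String content = store.get(id);
            if (content == null) throw new ObjectNotFoundException(id);
            return content;
        };
        List<String> ingested = new ArrayList<>();
        List<String> deleted = new ArrayList<>();
        Activity activity = new Activity() {
            public void ingestDocument(String id, String content) { ingested.add(id); }
            public void deleteDocument(String id) { deleted.add(id); }
        };
        processDocuments(List.of("doc1", "doc2"), repo, activity);
        System.out.println("ingested=" + ingested + " deleted=" + deleted);
        // prints: ingested=[doc1] deleted=[doc2]
    }
}
```

In a real connector the catch block would sit around the actual repository
call, and the exact exception type and activity method should be checked
against the ManifoldCF and CMIS client versions in use.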



On Wed, Jul 16, 2014 at 9:07 AM, Paththamestrige Perera <
prasad.srimal.perera@gmail.com> wrote:

> Hello Karl,
>
> Thanks for the reply! I'm going to checkout ManifoldCF trunk and check out
> this new changes. I see now that the WorkerThread has delegated version
> check to connectors level which is a better approach!
> I also saw CONNECTORS-994
> <https://issues.apache.org/jira/browse/CONNECTORS-994>. Thanks for that.
>
> One other thing I would wish to see is,
> handling CmisObjectNotFoundException for CMIS connector (also corresponding
> exception handling for Alfresco), which can be useful in sending delete
> document calls to output connectors. Would you think its a proper approach ?
> I would be happy to create a ticket for that.
>
> Thanks!
>
> Prasad Perera.
>
>
> On Tue, Jul 15, 2014 at 5:41 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Prasad,
>>
>> All changes to connector API's will be backwards compatible provided you
>> extend the base connector class.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Jul 15, 2014 at 5:35 PM, Paththamestrige Perera <
>> prasad.srimal.perera@gmail.com> wrote:
>>
>>> Hello Karl,
>>>
>>> Thanks for the quick reply!
>>>
>>> I'm using MCF 1.6 and I haven't checked version 1.7 yet (I see it has a
>>> release date set to 31st of August).
>>>
>>> Regarding the API changes (I assume) you have mentioned in the second
>>> reply, will there be major changes for the output connector as well ? (for
>>> example, the interfaces addOrReplaceDocument & removeDocument will be
>>> altered as well ?). I have my own output connector, working with a
>>> customize indexing system and curious to know how things may change from
>>> 1.6 to 1.7.
>>>
>>> If it matters, I would be glad to create a ticket regarding the document
>>> version handling for repository connectors for the version 1.6 and would be
>>> happy to get those changes in to my project space.
>>>
>>> Thanks!
>>>
>>> Prasad Perera.
>>>
>>>
>>> On Tue, Jul 15, 2014 at 5:16 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Prasad,
>>>>
>>>> Re: the scanOnly flag: Technically it is up to your connector to
>>>> determine how to use this flag.  It is set when the document has not
>>>> changed from the previous run.
>>>>
>>>> The flag was originally added to help support chained models before
>>>> explicit CHAINED model choices were implemented in the framework.  For
>>>> chained models, discovery would not necessarily work correctly unless all
>>>> references could be rediscovered at all times.  In MCF 1.7, all of this
>>>> will be deprecated, and the getDocumentVersions() and processDocuments()
>>>> methods are in fact merged into one method, and an IProcessActivity method
>>>> is provided to check for differences from the previous indexing.
>>>>
>>>> Hope this answers your question.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Jul 15, 2014 at 5:06 PM, Paththamestrige Perera <
>>>> prasad.srimal.perera@gmail.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I'm new to Apache ManifoldCF and I have spent sometime referring the
>>>>> publication 'ManifoldCF in Action' as well. I have started using the
>>>>> ManifoldCF system with the available repository connectors, CMIS Repository
>>>>> Connector, Alfresco Repository Connector and File System Connector.
>>>>>
>>>>> I have used them as continuous crawlers with specific re-crawl
>>>>> intervals. What I have noticed is that, irrelevant to the Document version
>>>>> (whether it has changed or not), in all re-crawl jobs, CMIS and Alfresco
>>>>> connectors process all seeded documents. I took a look at their
>>>>> implementations and as I could see, these repository connectors does not
>>>>> use the property 'scanOnly' at the processing time of seeded documents
>>>>> which hints if the document version has changed. It seems intentional by
>>>>> design. So I'm hoping to know why is it necessary to process all seeded
>>>>> documents (oppose to only process documents that were updated within the
>>>>> re-crawling interval) ?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Prasad Perera.
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Question about using ManifoldCF Repository Connectors

Posted by Paththamestrige Perera <pr...@gmail.com>.

Hello Karl,

Thanks for the reply! I'm going to check out the ManifoldCF trunk and look
at these new changes. I see now that the WorkerThread has delegated the
version check to the connector level, which is a better approach!
I also saw CONNECTORS-994
<https://issues.apache.org/jira/browse/CONNECTORS-994>. Thanks for that.

One other thing I would like to see is handling of
CmisObjectNotFoundException in the CMIS connector (and the corresponding
exception handling for Alfresco), which can be useful for sending delete
document calls to output connectors. Do you think it's a proper approach?
I would be happy to create a ticket for that.

Thanks!

Prasad Perera.


Re: Question about using ManifoldCF Repository Connectors

Posted by Karl Wright <da...@gmail.com>.

Hi Prasad,

All changes to connector APIs will be backwards compatible provided you
extend the base connector class.

Thanks,
Karl
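
Editorial sketch: one common way a framework can keep older connectors
working when its API changes is to have the base class implement the new
entry point in terms of the older methods. The example below illustrates
that general pattern only; all names (Connector, BaseConnector,
getVersions, processLegacy) are hypothetical and are not the ManifoldCF
signatures.

```java
import java.util.ArrayList;
import java.util.List;

class BaseClassCompatSketch {
    /** Hypothetical new-style API: a single entry point. */
    interface Connector {
        void process(List<String> ids, List<String> log);
    }

    /** Base class: implements the new method by delegating to the older two-step hooks. */
    static abstract class BaseConnector implements Connector {
        // Older-style hook: one version string per document (default: empty strings).
        protected List<String> getVersions(List<String> ids) {
            List<String> versions = new ArrayList<>();
            for (String id : ids) versions.add("");
            return versions;
        }

        // Older-style hook: process documents given their versions (default: do nothing).
        protected void processLegacy(List<String> ids, List<String> versions, List<String> log) {}

        @Override
        public void process(List<String> ids, List<String> log) {
            processLegacy(ids, getVersions(ids), log);
        }
    }

    /** A connector written against the older hooks only; it never mentions process(). */
    static class LegacyConnector extends BaseConnector {
        @Override
        protected List<String> getVersions(List<String> ids) {
            List<String> versions = new ArrayList<>();
            for (String id : ids) versions.add(id + "@v1");
            return versions;
        }

        @Override
        protected void processLegacy(List<String> ids, List<String> versions, List<String> log) {
            for (int i = 0; i < ids.size(); i++) log.add("indexed " + versions.get(i));
        }
    }

    public static void main(String[] args) {
        Connector connector = new LegacyConnector();
        List<String> log = new ArrayList<>();
        connector.process(List.of("doc1", "doc2"), log); // the framework calls the new method
        System.out.println(log); // prints: [indexed doc1@v1, indexed doc2@v1]
    }
}
```

A connector that implements the interface directly, without extending the
base class, gets no such bridge, which is why the guarantee is conditional.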


Re: Question about using ManifoldCF Repository Connectors

Posted by Paththamestrige Perera <pr...@gmail.com>.

Hello Karl,

Thanks for the quick reply!

I'm using MCF 1.6 and I haven't checked version 1.7 yet (I see it has a
release date set to the 31st of August).

Regarding the API changes (I assume) you mentioned in the second reply,
will there be major changes for the output connector as well? (For
example, will the methods addOrReplaceDocument and removeDocument be
altered too?) I have my own output connector, working with a customized
indexing system, and I'm curious to know how things may change from
1.6 to 1.7.

If it matters, I would be glad to create a ticket regarding document
version handling in the repository connectors for version 1.6, and would be
happy to get those changes into my project space.

Thanks!

Prasad Perera.


Re: Question about using ManifoldCF Repository Connectors

Posted by Karl Wright <da...@gmail.com>.

Hi Prasad,

Re: the scanOnly flag: Technically it is up to your connector to determine
how to use this flag.  It is set when the document has not changed from the
previous run.

The flag was originally added to help support chained models before
explicit CHAINED model choices were implemented in the framework.  For
chained models, discovery would not necessarily work correctly unless all
references could be rediscovered at all times.  In MCF 1.7, all of this
will be deprecated, and the getDocumentVersions() and processDocuments()
methods are in fact merged into one method, and an IProcessActivity method
is provided to check for differences from the previous indexing.

Hope this answers your question.

Karl
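
Editorial sketch: the merged approach described for MCF 1.7 (one processing
method that asks the framework whether a document differs from what was
previously indexed) could be pictured as below. This is a self-contained
illustration under stated assumptions: the Activity class and its
needsReindexing() and ingest() methods are hypothetical stand-ins, since
the message does not name the real IProcessActivity method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergedProcessingSketch {
    /** Hypothetical stand-in for the framework's activity object. */
    static class Activity {
        final Map<String, String> indexedVersions = new HashMap<>();
        final List<String> ingested = new ArrayList<>();

        // Hypothetical name: true when the version differs from the one last indexed.
        boolean needsReindexing(String id, String version) {
            return !version.equals(indexedVersions.get(id));
        }

        // Records the document as indexed at the given version.
        void ingest(String id, String version) {
            indexedVersions.put(id, version);
            ingested.add(id);
        }
    }

    /** Single method: compute the version, ask the framework, index only if needed. */
    static void processDocuments(Map<String, String> currentVersions, Activity activity) {
        for (Map.Entry<String, String> doc : currentVersions.entrySet()) {
            if (!activity.needsReindexing(doc.getKey(), doc.getValue())) {
                continue; // unchanged since the previous indexing: skip fetch and ingest
            }
            activity.ingest(doc.getKey(), doc.getValue());
        }
    }

    public static void main(String[] args) {
        Activity activity = new Activity();

        // First crawl: nothing indexed yet, so both documents are ingested.
        Map<String, String> crawl1 = new TreeMap<>(Map.of("a", "v1", "b", "v1"));
        processDocuments(crawl1, activity);
        activity.ingested.clear();

        // Re-crawl: only "b" has a new version.
        Map<String, String> crawl2 = new TreeMap<>(Map.of("a", "v1", "b", "v2"));
        processDocuments(crawl2, activity);
        System.out.println("re-crawl ingested " + activity.ingested);
        // prints: re-crawl ingested [b]
    }
}
```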

Re: Question about using ManifoldCF Repository Connectors

Posted by Karl Wright <da...@gmail.com>.

Hi Prasad,

Since the CMIS and Alfresco connectors do not pay attention to the scanOnly
flag, they are not correctly written and should be fixed.  Could you create
a ticket to address this?

Thanks,
Karl
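
Editorial sketch: a connector that does pay attention to the scanOnly flag
might behave roughly as follows. The flag is set for documents that have
not changed since the previous run, so the connector can still rediscover
their references while skipping the expensive fetch and ingest. The types
and method names below (Activity, addReference, ingest, the children map)
are hypothetical and only loosely modeled on a 1.6-style processDocuments
call with parallel arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ScanOnlySketch {
    /** Hypothetical stand-in for the framework's activity object. */
    interface Activity {
        void addReference(String childId); // report a discovered document
        void ingest(String id);            // send a document to the output connector
    }

    static void processDocuments(String[] ids, boolean[] scanOnly,
                                 Map<String, List<String>> children, Activity activity) {
        for (int i = 0; i < ids.length; i++) {
            // Discovery always runs, so references can be rediscovered on every crawl.
            for (String child : children.getOrDefault(ids[i], List.of())) {
                activity.addReference(child);
            }
            if (scanOnly[i]) {
                continue; // unchanged since the previous run: do not re-ingest
            }
            activity.ingest(ids[i]);
        }
    }

    public static void main(String[] args) {
        List<String> references = new ArrayList<>();
        List<String> ingested = new ArrayList<>();
        Activity activity = new Activity() {
            public void addReference(String childId) { references.add(childId); }
            public void ingest(String id) { ingested.add(id); }
        };
        String[] ids = {"folder", "report"};
        boolean[] scanOnly = {true, false}; // "folder" is unchanged, "report" changed
        Map<String, List<String>> children = Map.of("folder", List.of("report", "memo"));
        processDocuments(ids, scanOnly, children, activity);
        System.out.println("references=" + references + " ingested=" + ingested);
        // prints: references=[report, memo] ingested=[report]
    }
}
```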


