Posted to user@manifoldcf.apache.org by Jitu <ab...@gmail.com> on 2014/07/29 17:31:48 UTC

question regarding manifoldcf

Hi,

I am a freelancer. For my current project I am using the ManifoldCF
framework to pull documents from a CMIS repository and send them to a Solr
output connector.

I noticed that when I set the job type to continuous, it crawls all the
files every time, whether or not they have been modified. My requirement is
to crawl a file again only if it has been modified.

How can I do this with ManifoldCF?

Regards,
abjitu

Re: question regarding manifoldcf

Posted by Karl Wright <da...@gmail.com>.
Hi Abjitu,

Some CMIS implementations do not support versioning.  See
connectors/cmis/connector/src/main/java/org/apache/manifoldcf/connectors/cmis/CmisRepositoryConnector.java,
around line 1306:

>>>>>>
        //we have to check if this CMIS repository support versioning
        // or if the versioning is disabled for this content
        if(StringUtils.isNotEmpty(document.getVersionLabel())){
          rval[i] = document.getVersionLabel();
        } else {
          //a CMIS document that doesn't contain versioning information will
          // always be processed
          rval[i] = StringUtils.EMPTY;
        }
<<<<<<

In other words, if your repository does not support getVersionLabel() in a
meaningful way, ManifoldCF cannot either.  You can confirm this by adding
appropriate System.out.println statements in the above block of code.
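
To make the effect of the empty version string concrete, here is a minimal,
self-contained sketch. It is not ManifoldCF code: the class
VersionCheckSketch and its methods are hypothetical names, and the
comparison rule (an empty version string always forces processing;
otherwise a document is processed only when its version string differs from
the one recorded on the previous crawl) is an assumption based on the
comment in the block above.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionCheckSketch {
  // Hypothetical store of the version string recorded at the previous crawl.
  private final Map<String, String> lastSeen = new HashMap<>();

  // Mirrors the connector snippet: fall back to "" when there is no label.
  public static String versionString(String versionLabel) {
    return (versionLabel != null && !versionLabel.isEmpty()) ? versionLabel : "";
  }

  // Returns true when the document should be sent to the output connector.
  public boolean needsProcessing(String documentId, String versionLabel) {
    String current = versionString(versionLabel);
    String previous = lastSeen.put(documentId, current);
    if (current.isEmpty()) {
      return true; // no versioning information: always processed
    }
    return !current.equals(previous);
  }

  public static void main(String[] args) {
    VersionCheckSketch crawl = new VersionCheckSketch();
    System.out.println(crawl.needsProcessing("doc1", "1.0")); // true, first time seen
    System.out.println(crawl.needsProcessing("doc1", "1.0")); // false, unchanged
    System.out.println(crawl.needsProcessing("doc2", null));  // true
    System.out.println(crawl.needsProcessing("doc2", null));  // true again
  }
}
```

Under these assumptions, a repository that never returns a version label
behaves like doc2: the document is handed to the output connector on every
pass.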

Thanks,
Karl



Re: question regarding manifoldcf

Posted by Jitu <ab...@gmail.com>.
I checked out trunk from the location below and built it, but I can still
see it crawling the same file again and again.

svn checkout http://svn.apache.org/repos/asf/manifoldcf/trunk mcf-trunk

My configuration :
Nuxeo input connector


Max connections:    10
Connection type:    CMIS
Authority group:    None (global authority)
Parameters:
username=Administrator
password=********
binding=atom
protocol=http
server=localhost
port=8080
path=/nuxeo/atom/cmis
repositoryId=

Output connector: Solr connector with max connections 10. As far as I
know, the output connector has no information about whether it is the same
file or a different one.

Job configuration:
Priority:     5
Start method:     Start at beginning of schedule window
Schedule type:     Rescan documents dynamically
Minimum recrawl interval:     10 minutes
Maximum recrawl interval:     Infinity
Expiration interval:     Infinity
Reseed interval:     60 minutes
No scheduled run times
No forced metadata
Maximum hop count for link type 'child':     Unlimited
Hop count mode:     Delete unreachable documents


I have only one file in my Nuxeo repository, and every 10 minutes the same
file is sent to the output connector again. That is, the call reaches the
addOrReplaceDocument method inside the output connector even though the
file in the Nuxeo repository has not changed.

regards,
Jitu



Re: question regarding manifoldcf

Posted by Jitu <ab...@gmail.com>.
Thanks Karl and Prasad. It's great to hear back so quickly. Thanks for the
info, it really helped me.

Thanks for the support

Regards,
Jitu


Re: question regarding manifoldcf

Posted by Karl Wright <da...@gmail.com>.
Hi Jitu,

The bug is that the CMIS and Alfresco connectors reindexed documents even
though they had not changed.  This is now corrected.

Karl



Re: question regarding manifoldcf

Posted by Paththamestrige Perera <pr...@gmail.com>.
Hello Jitu,

You are correct! The scanOnly flag indicates whether the document needs to
be crawled again (most likely when it has changed). You can refer to this
mail thread: "Question about using ManifoldCF Repository Connectors" at
http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201407.mbox/browser
to get an idea of how the fix came about.
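
As a rough illustration of what honoring scanOnly means, here is a
self-contained sketch. It does not use the real ManifoldCF interfaces:
ScanOnlySketch, processDocuments and the list standing in for the output
connector are hypothetical, and it assumes that scanOnly[i] being true
means document i is unchanged and should not be ingested again.

```java
import java.util.ArrayList;
import java.util.List;

public class ScanOnlySketch {
  // Returns the identifiers that would be handed to the output connector.
  public static List<String> processDocuments(String[] documentIds, boolean[] scanOnly) {
    List<String> ingested = new ArrayList<>();
    for (int i = 0; i < documentIds.length; i++) {
      if (scanOnly[i]) {
        // Assumed meaning: unchanged document, so skip ingestion. A real
        // connector might still scan it for references to other documents.
        continue;
      }
      ingested.add(documentIds[i]);
    }
    return ingested;
  }

  public static void main(String[] args) {
    String[] ids = {"a.pdf", "b.pdf", "c.pdf"};
    boolean[] scanOnly = {true, false, true};
    System.out.println(processDocuments(ids, scanOnly)); // prints [b.pdf]
  }
}
```

A connector that ignores the flag would behave as if the scanOnly check
were missing, so every document would reach the output connector on every
pass, which matches the behavior Jitu described.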

Prasad.



Re: question regarding manifoldcf

Posted by Jitu <ab...@gmail.com>.
Hi Prasad,

Thanks for the reply. The bug says "The CMIS and Alfresco connectors
currently do not look at scanOnly but should". Does that mean the CMIS and
Alfresco connectors crawl all the files and hand them over to the output
connector whether or not they have been modified? Ideally, a file should be
crawled again only if it has been modified. Am I correct?

regards,
jitu





Re: question regarding manifoldcf

Posted by Paththamestrige Perera <pr...@gmail.com>.
Hello Jitu, I had the same issue, and it was fixed by CONNECTORS-994
<https://issues.apache.org/jira/browse/CONNECTORS-994> for MCF 1.7.
If you check out mcf-trunk, it will work as expected.


