Posted to user@manifoldcf.apache.org by VINAY Bengaluru <vi...@gmail.com> on 2018/07/31 08:05:00 UTC

Scheduler not working as we expected

Hi Karl,
               We have set up a scheduler for our jobs with input connector
as file system and output connector as Solr.
We have set up a scheduler as follows :
Schedule type: Rescan documents dynamically
Recrawl interval: blank
Schedule time: appropriate times with job invocation as complete.

We see that the job is not picking up documents at the scheduled intervals.

Why doesn't the job pick up new docs at the scheduled interval? Is anything
wrong with our job configuration or our understanding?

Thanks and regards,
Vinay

Re: Scheduler not working as we expected

Posted by Karl Wright <da...@gmail.com>.
It's obviously a configuration problem.  Are you using the extract update
handler?  If not, do you have tika in the pipeline?

Karl


On Tue, Sep 25, 2018 at 4:24 AM Ronny Heylen <ro...@gmail.com> wrote:

> Hi,
> We have been using SOLR for a few years and now the server has been
> transferred to the VMs in our HQ (and reinstalled).
> We are having the following issue now:
> Forcing SOLR indexation by curl works, as we can see from:
>
> curl "http://gbsloappwp0083.corp.qbe.com:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@z:\qbere_bru\common\testsolr.txt"
>
> which has successfully indexed testsolr.txt.
> As can be checked by:
> http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=ella
> giving:
> <result name="response" numFound="1" start="0">
> Searching for john returns 0 files:
> http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=john
> <result name="response" numFound="0" start="0"/>
> and searching for any gives also 1 file:
> http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=*
> <result name="response" numFound="1" start="0">
>
> However, launching a job from ManifoldCF doesn't seem to work.
> We see the folder names in file definition, we see that the job indexes
> documents (or at least seems to do so), but SOLR API:
> http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=*
> still returns 1 file only, the one we have manually indexed.
>
> If anybody has any suggestion, we would be really grateful.
>
> Ronny.Heylen@qbere.com
>
>
>
> On Tue, Jul 31, 2018 at 12:12 Karl Wright <da...@gmail.com> wrote:
>
>> Hi Vinay,
>>
>> Dynamic rescan is meant for web-crawling and revisits already crawled
>> documents based on how often they have changed in the past.  It is
>> therefore wholly inappropriate for something like a file crawl, since
>> directory contents (one of the kinds of documents there are in a file
>> crawl) change very infrequently.
>>
>> Instead, I recommend that you run complete crawls, non-dynamic.  You can
>> even run minimal crawls fairly often, which will pick up new and changed
>> documents, and run non-minimal crawls on a less frequent schedule to
>> capture deletions.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Jul 31, 2018 at 4:05 AM VINAY Bengaluru <vi...@gmail.com>
>> wrote:
>>
>>> Hi Karl,
>>>                We have set up a scheduler for our jobs with input
>>> connector as file system and output connector as Solr.
>>> We have set up a scheduler as follows :
>>> Schedule type: Rescan documents dynamically
>>> Recrawl interval: blank
>>> Schedule time: appropriate times with job invocation as complete.
>>>
>>> We see that the job is not picking up documents at the scheduled
>>> intervals.
>>>
>>> Why doesn't the job pick up new docs at the scheduled interval? Is anything
>>> wrong with our job configuration or our understanding?
>>>
>>> Thanks and regards,
>>> Vinay
>>>
>>>

Re: Scheduler not working as we expected

Posted by Ronny Heylen <ro...@gmail.com>.
Hi,
We have been using SOLR for a few years and now the server has been
transferred to the VMs in our HQ (and reinstalled).
We are having the following issue now:
Forcing SOLR indexation by curl works, as we can see from:

curl "http://gbsloappwp0083.corp.qbe.com:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@z:\qbere_bru\common\testsolr.txt"

which has successfully indexed testsolr.txt.
As can be checked by:
http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=ella
giving:
<result name="response" numFound="1" start="0">
Searching for john returns 0 files:
http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=john
<result name="response" numFound="0" start="0"/>
and searching for any gives also 1 file:
http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=*
<result name="response" numFound="1" start="0">

However, launching a job from ManifoldCF doesn't seem to work.
We see the folder names in file definition, we see that the job indexes
documents (or at least seems to do so), but SOLR API:
http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1/select?q=*
still returns 1 file only, the one we have manually indexed.
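
One way to narrow this down is to compare the core's total document count before and after a ManifoldCF job run; if the count never moves, the job is posting to a different core (or failing silently). This is only a sketch using the standard library, with the host and core names taken from the URLs above:

```python
import json
import urllib.request

def parse_num_found(payload: dict) -> int:
    """Extract numFound from an already-parsed Solr JSON response."""
    return payload["response"]["numFound"]

def num_found(core_url: str) -> int:
    """Query a Solr core for its total document count (match-all, zero rows)."""
    url = core_url + "/select?q=*:*&rows=0&wt=json"
    with urllib.request.urlopen(url) as resp:
        return parse_num_found(json.load(resp))

# Usage against the core mentioned in this thread:
# count = num_found("http://gbsloappwp0083.corp.qbe.com:8080/solr/collection1")
```

Running this before and after the crawl (and against any other cores on the server) shows quickly whether ManifoldCF's documents are landing somewhere other than collection1.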

If anybody has any suggestion, we would be really grateful.

Ronny.Heylen@qbere.com



On Tue, Jul 31, 2018 at 12:12 Karl Wright <da...@gmail.com> wrote:

> Hi Vinay,
>
> Dynamic rescan is meant for web-crawling and revisits already crawled
> documents based on how often they have changed in the past.  It is
> therefore wholly inappropriate for something like a file crawl, since
> directory contents (one of the kinds of documents there are in a file
> crawl) change very infrequently.
>
> Instead, I recommend that you run complete crawls, non-dynamic.  You can
> even run minimal crawls fairly often, which will pick up new and changed
> documents, and run non-minimal crawls on a less frequent schedule to
> capture deletions.
>
> Thanks,
> Karl
>
>
> On Tue, Jul 31, 2018 at 4:05 AM VINAY Bengaluru <vi...@gmail.com>
> wrote:
>
>> Hi Karl,
>>                We have set up a scheduler for our jobs with input
>> connector as file system and output connector as Solr.
>> We have set up a scheduler as follows :
>> Schedule type: Rescan documents dynamically
>> Recrawl interval: blank
>> Schedule time: appropriate times with job invocation as complete.
>>
>> We see that the job is not picking up documents at the scheduled
>> intervals.
>>
>> Why doesn't the job pick up new docs at the scheduled interval? Is anything
>> wrong with our job configuration or our understanding?
>>
>> Thanks and regards,
>> Vinay
>>
>>

Re: Scheduler not working as we expected

Posted by Karl Wright <da...@gmail.com>.
Hi Vinay,

Dynamic rescan is meant for web-crawling and revisits already crawled
documents based on how often they have changed in the past.  It is
therefore wholly inappropriate for something like a file crawl, since
directory contents (one of the kinds of documents there are in a file
crawl) change very infrequently.

Instead, I recommend that you run complete crawls, non-dynamic.  You can
even run minimal crawls fairly often, which will pick up new and changed
documents, and run non-minimal crawls on a less frequent schedule to
capture deletions.
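
[Editor's note: the cadence Karl describes, frequent minimal crawls plus an occasional non-minimal crawl to capture deletions, can be expressed as a small policy. The helper below is purely illustrative and not part of the ManifoldCF API; the Sunday/weekday split is an assumed example schedule.]

```python
import datetime

def crawl_mode(day: datetime.date) -> str:
    """Illustrative cadence: run a non-minimal ("complete") crawl on Sundays
    to capture deletions, and a minimal crawl on every other day to pick up
    new and changed documents. weekday() returns 6 for Sunday."""
    return "complete" if day.weekday() == 6 else "minimal"
```

In ManifoldCF itself this corresponds to defining two schedule records on the job: one invoked as "Minimal" on the frequent slots, and one invoked as "Complete" on the infrequent slot.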

Thanks,
Karl


On Tue, Jul 31, 2018 at 4:05 AM VINAY Bengaluru <vi...@gmail.com>
wrote:

> Hi Karl,
>                We have set up a scheduler for our jobs with input
> connector as file system and output connector as Solr.
> We have set up a scheduler as follows :
> Schedule type: Rescan documents dynamically
> Recrawl interval: blank
> Schedule time: appropriate times with job invocation as complete.
>
> We see that the job is not picking up documents at the scheduled intervals.
>
> Why doesn't the job pick up new docs at the scheduled interval? Is anything
> wrong with our job configuration or our understanding?
>
> Thanks and regards,
> Vinay
>
>