Posted to user@manifoldcf.apache.org by Gaurav G <go...@gmail.com> on 2019/02/08 15:10:55 UTC

Sharepoint Job - Incremental Crawling

Hi All,

We're trying to crawl a SharePoint repository with about 30,000 documents. Ideally
we would like to synchronize changes with the repository within 30 minutes, so we
are scheduling incremental crawls. Our observation is that a full crawl takes about
60-75 minutes. If we schedule the incremental crawl every 30 minutes, in what order
would it process the changes? Would it pick up the adds and updates first and then
process the rest of the documents? What logic does the incremental crawl use?
We also tried a continuous crawl to achieve this, but for some reason the
continuous crawl was not picking up new documents.

Thanks,
Gaurav

Re: Sharepoint Job - Incremental Crawling

Posted by Steph van Schalkwyk <st...@remcam.net>.
Hi. I just saw this thread.
I believe Microsoft recommends a dedicated document-source instance for larger
corpora.
I know that in my SharePoint days we often frustrated users by making SharePoint
very slow while we were crawling, which was mostly solved by having a dedicated
source node.
S

Re: Sharepoint Job - Incremental Crawling

Posted by Karl Wright <da...@gmail.com>.
Hi Gaurav,

The number of connections you permit should depend on the resources of the
SharePoint instance you're crawling.  ManifoldCF will limit the number of
connections to that instance to the number you select.  Making it larger
might help if there are ample resources on the SharePoint side, but in my
experience that's usually not realistic, and simply increasing the connection
count can even have a paradoxical effect.  So that will require some back and
forth with the people running the SharePoint instances.

Once you can confirm that SharePoint is no longer the bottleneck (I'm
pretty certain it is right now), then the next step would be database
performance optimization.  For Postgres running on Linux, you should be
pretty much pegging the CPUs on the DB machine if you've got all the other
bottlenecks eliminated.  If you aren't pegging those CPUs and/or the
machine is IO bound, there has to be another bottleneck somewhere and
you'll need to find it.

Karl
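
For reference, one quick way to check whether the Postgres host is CPU-bound or
I/O-bound while a crawl is running is to sample /proc/stat on that machine. A
minimal sketch (Linux only; the sampling interval and the interpretation at the
end are arbitrary, not ManifoldCF-specific):

#!/usr/bin/env python3
# Sample /proc/stat twice and report how busy the CPUs are and how much time
# is spent waiting on I/O. Run this on the DB host while a crawl is active.
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]          # aggregate "cpu" line
    user, nice, system, idle, iowait = (int(x) for x in fields[:5])
    busy = user + nice + system
    total = busy + idle + iowait + sum(int(x) for x in fields[5:])
    return busy, iowait, total

b1, w1, t1 = read_cpu_times()
time.sleep(5)                                      # sampling interval in seconds
b2, w2, t2 = read_cpu_times()

dt = (t2 - t1) or 1
print(f"CPU busy: {100.0 * (b2 - b1) / dt:5.1f}%")
print(f"I/O wait: {100.0 * (w2 - w1) / dt:5.1f}%")
# Roughly: a high busy percentage means the database itself is the limit;
# a high I/O-wait percentage with low busy suggests the disks are the bottleneck.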


Re: Sharepoint Job - Incremental Crawling

Posted by Gaurav G <go...@gmail.com>.
Hi Karl,

Thanks for your insights. I'm thinking of exploring the following options to get
the best performance. What are your thoughts? Is the first option the one that
might give the most bang for the buck?

1) Ask the SharePoint application team to dedicate a web and app server
specifically for crawling. On a related point, is there an optimal value for the
number of concurrent repository connections? We currently have it at about 40,
and I'm not sure whether increasing it further will improve speeds.
2) Split the crawling between two sets of ManifoldCF and Postgres servers running
on four different VMs, but with a smaller configuration, say 4 cores and 12 GB
RAM each.
3) Co-locate the crawlers in the same data center as the SharePoint servers.
Currently they are in different DCs with dedicated MPLS connectivity.

Thanks,
Gaurav

Re: Sharepoint Job - Incremental Crawling

Posted by Karl Wright <da...@gmail.com>.
The problem is not the speed of ManifoldCF, but rather the work it has to do
and the performance of SharePoint.  All the speed in the world in the
crawler will not fix the bottleneck that is SharePoint.

Karl


Re: Sharepoint Job - Incremental Crawling

Posted by Gaurav G <go...@gmail.com>.
Got it.
Is there any way we can increase the speed of the minimal crawl? Currently we are
running one VM for ManifoldCF with 8 cores and 32 GB RAM. Postgres runs on another
machine with a similar configuration. We have tuned the Postgres and ManifoldCF
parameters as per the recommendations, and we run a full vacuum once daily.

Would switching to a multi-process configuration, with ManifoldCF running on two
servers, give a boost?

Thanks,
Gaurav
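
As an aside, the daily full vacuum mentioned above is easy to script with
PostgreSQL's vacuumdb utility and run from cron outside crawl windows, since
VACUUM FULL takes exclusive locks. A sketch; the host, database name, and user
below are placeholders:

#!/usr/bin/env python3
# Nightly maintenance sketch: run VACUUM FULL + ANALYZE on the ManifoldCF
# database using the standard vacuumdb client utility.
import subprocess

CMD = [
    "vacuumdb",
    "--full",            # VACUUM FULL (rewrites tables, reclaims space)
    "--analyze",         # refresh planner statistics afterwards
    "--host", "db-host.example.com",   # placeholder host
    "--username", "manifoldcf",        # placeholder user
    "--dbname", "dbname",              # placeholder database name
]

result = subprocess.run(CMD, capture_output=True, text=True)
if result.returncode != 0:
    raise SystemExit(f"vacuumdb failed: {result.stderr.strip()}")
print("vacuumdb completed:", result.stdout.strip() or "ok")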

Re: Sharepoint Job - Incremental Crawling

Posted by Karl Wright <da...@gmail.com>.
It does the minimum necessary.  That means it can't be done in any less time.  If
this is a business requirement, then you should be angry with whoever made this
requirement.

SharePoint doesn't give you the ability to grab all changed or added documents
up front.  You have to crawl to discover them.  That is how it is built and
ManifoldCF cannot change it.

Karl
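
To make that concrete: even an incremental run still enumerates the libraries and
lists, and a document is skipped only after its current version string (for
example, last-modified information from SharePoint) is compared against what was
recorded on the previous run. A toy sketch of that idea follows; it is not
ManifoldCF's actual connector API, and all names in it are made up:

# Toy illustration of incremental crawling, NOT ManifoldCF's real API:
# every document is still enumerated, but only those whose version string
# changed since the last run are fetched and re-indexed.
from typing import Callable, Dict, Iterable

def incremental_crawl(
    list_documents: Callable[[], Iterable[str]],   # enumerate all doc ids (still O(corpus))
    get_version: Callable[[str], str],             # e.g. last-modified info from the repository
    fetch_and_index: Callable[[str], None],
    saved_versions: Dict[str, str],                # versions recorded by the previous run
) -> Dict[str, str]:
    new_versions: Dict[str, str] = {}
    for doc_id in list_documents():
        version = get_version(doc_id)
        new_versions[doc_id] = version
        if saved_versions.get(doc_id) != version:  # added or updated since last run
            fetch_and_index(doc_id)
    # Anything recorded before but not seen now was deleted; handling those
    # deletions is what the periodic "full" job run is for.
    return new_versions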

Re: Sharepoint Job - Incremental Crawling

Posted by Gaurav G <go...@gmail.com>.
Hi Karl,

Thanks for the response. We tried scheduling a minimal crawl every 15 minutes. At
the end of fifteen minutes it stops with about 3,000 documents still in the
processing state and takes about 20-25 minutes to shut down. The question then
becomes when to schedule the next crawl. Also, in those 15 minutes, would it have
picked up all the adds and updates first, or could some of them be among the 3,000
documents still in the processing state, to be picked up on the next run? The
number of documents that actually change in a 30-minute period won't be more than
200.

Being able to capture adds and updates within 30 minutes is a key business
requirement.

Thanks,
Gaurav

Re: Sharepoint Job - Incremental Crawling

Posted by Karl Wright <da...@gmail.com>.
Hi Gaurav,

The right way to do this is to schedule "minimal" crawls every 15 minutes
(which will process only the minimum needed to deal with adds and updates),
and periodically perform "full" crawls (which will also include deletions).

Thanks,
Karl
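
If the schedule is driven from outside the Crawler UI, job runs can also be kicked
off through ManifoldCF's JSON API service. The sketch below assumes the quick-start
default base URL and the start/startminimal resource names; both the URL and the
paths should be verified against the Programmatic Operation documentation for the
version in use, and the job id is a placeholder:

#!/usr/bin/env python3
# Sketch: trigger ManifoldCF job runs from an external scheduler (e.g. cron).
# Assumptions: the API service is reachable at the quick-start default below,
# JOB_ID is the job's identifier from the UI, and the start/startminimal
# resource names match your ManifoldCF version's Programmatic Operation docs.
import sys
import urllib.request

BASE = "http://localhost:8345/mcf-api-service/json"
JOB_ID = "1234567890"    # placeholder job id

def trigger(resource: str) -> None:
    req = urllib.request.Request(f"{BASE}/{resource}/{JOB_ID}", method="PUT")
    with urllib.request.urlopen(req) as resp:
        print(resource, "->", resp.status)

if __name__ == "__main__":
    # e.g. cron every 15 minutes with "minimal", plus a weekly run with "full"
    mode = sys.argv[1] if len(sys.argv) > 1 else "minimal"
    trigger("startminimal" if mode == "minimal" else "start")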

