Posted to user@manifoldcf.apache.org by Jorge Alonso Garcia <ja...@gmail.com> on 2020/01/27 09:04:53 UTC

Re: sharepoint crawler documents limit

Hi,
We changed the timeout on SharePoint IIS, and now the process is able to
crawl all documents.
Thanks for your help
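
For anyone landing on this thread later, here is a minimal sketch of why a failing request can look like a clean finish. It assumes, as Karl infers further down, that the third call came back with no results; the thread does not show what the web service actually returned. The class and method names are made up for illustration and are not ManifoldCF APIs. The loop has the same shape as the connector loop quoted below: a page with fewer entries than requested is taken as the end of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical sketch; none of these names are ManifoldCF APIs. */
final class PagingSketch {

    /**
     * Requests pages of pageSize until a page comes back with fewer than
     * pageSize entries. pageSource takes (startIndex, pageSize).
     */
    static List<String> fetchAll(BiFunction<Integer, Integer, List<String>> pageSource,
                                 int pageSize) {
        List<String> all = new ArrayList<>();
        int startingIndex = 0;
        while (true) {
            List<String> page = pageSource.apply(startingIndex, pageSize);
            all.addAll(page);
            if (page.size() < pageSize) {
                break; // a short (or empty) page is read as "end of library"
            }
            startingIndex += page.size();
        }
        return all;
    }

    /**
     * Simulated library holding `total` documents. Any call whose start index
     * is at or beyond failFrom returns an empty page (assumption: this stands
     * in for a request that yields nothing, e.g. after a server-side timeout).
     */
    static BiFunction<Integer, Integer, List<String>> library(int total, int failFrom) {
        return (start, limit) -> {
            List<String> page = new ArrayList<>();
            if (start >= failFrom) {
                return page;
            }
            int end = Math.min(total, start + limit);
            for (int i = start; i < end; i++) {
                page.add("doc-" + i);
            }
            return page;
        };
    }

    public static void main(String[] args) {
        // Healthy service: every page is served.
        System.out.println(fetchAll(library(150_000, Integer.MAX_VALUE), 10_000).size());
        // Service that yields nothing from index 20,000 onward.
        System.out.println(fetchAll(library(150_000, 20_000), 10_000).size());
    }
}
```

In the second case the loop ends normally with two full pages and raises no error, which matches a job that finishes OK with exactly 20,000 documents.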



On Mon, Dec 30, 2019 at 12:18, Gaurav G (<go...@gmail.com>)
wrote:

> We faced a similar issue: our repo had 100,000 documents, but our crawler
> stopped after 50,000. The cause turned out to be that the SharePoint query
> fired by the SharePoint web service gets progressively slower, and
> eventually the connection starts timing out before the next 10,000 records
> are returned. We increased a timeout parameter on SharePoint to 10 minutes,
> and after that we were able to crawl all documents successfully.  I believe
> we increased the parameter described in the link below:
>
>
> https://weblogs.asp.net/jeffwids/how-to-increase-the-timeout-for-a-sharepoint-2010-website
>
>
>
> On Fri, Dec 20, 2019 at 6:27 PM Karl Wright <da...@gmail.com> wrote:
>
>> Hi Priya,
>>
>> This has nothing to do with anything in ManifoldCF.
>>
>> Karl
>>
>>
>> On Fri, Dec 20, 2019 at 7:56 AM Priya Arora <pr...@smartshore.nl> wrote:
>>
>>> Hi All,
>>>
>>> Does this issue have something to do with the values/parameters below, set
>>> in properties.xml?
>>> [image: image.png]
>>>
>>>
>>> On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia <ja...@gmail.com>
>>> wrote:
>>>
>>>> And what other SharePoint parameter could I check?
>>>>
>>>> Jorge Alonso Garcia
>>>>
>>>>
>>>>
>>>> On Fri, Dec 20, 2019 at 12:47, Karl Wright (<da...@gmail.com>)
>>>> wrote:
>>>>
>>>>> The code seems correct and many people are using it without
>>>>> encountering this problem.  There may be another SharePoint configuration
>>>>> parameter you also need to look at somewhere.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <
>>>>> jalongar@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi Karl,
>>>>>> In SharePoint the list view threshold is 150,000, but we only receive
>>>>>> 20,000 from MCF.
>>>>>> [image: image.png]
>>>>>>
>>>>>>
>>>>>> Jorge Alonso Garcia
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 19:19, Karl Wright (<da...@gmail.com>)
>>>>>> wrote:
>>>>>>
>>>>>>> If the job finished without error it implies that the number of
>>>>>>> documents returned from this one library was 10000 when the service is
>>>>>>> called the first time (starting at doc 0), 10000 when it's called the
>>>>>>> second time (starting at doc 10000), and zero when it is called the third
>>>>>>> time (starting at doc 20000).
>>>>>>>
>>>>>>> The plugin code is unremarkable and actually gets results in chunks
>>>>>>> of 1000 under the covers:
>>>>>>>
>>>>>>> >>>>>>
>>>>>>>   SPQuery listQuery = new SPQuery();
>>>>>>>   listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
>>>>>>>   listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
>>>>>>>   listQuery.ViewAttributes = "Scope=\"Recursive\"";
>>>>>>>   listQuery.ViewFields = "<FieldRef Name='FileRef' />";
>>>>>>>   listQuery.RowLimit = 1000;
>>>>>>>
>>>>>>>   XmlDocument doc = new XmlDocument();
>>>>>>>   retVal = doc.CreateElement("GetListItems",
>>>>>>>       "http://schemas.microsoft.com/sharepoint/soap/directory/");
>>>>>>>   XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");
>>>>>>>
>>>>>>>   uint counter = 0;
>>>>>>>   do
>>>>>>>   {
>>>>>>>       if (counter >= startRowParam + rowLimitParam)
>>>>>>>           break;
>>>>>>>
>>>>>>>       SPListItemCollection collListItems = oList.GetItems(listQuery);
>>>>>>>
>>>>>>>       foreach (SPListItem oListItem in collListItems)
>>>>>>>       {
>>>>>>>           if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
>>>>>>>           {
>>>>>>>               XmlNode resultNode = doc.CreateElement("GetListItemsResult");
>>>>>>>               XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
>>>>>>>               idAttribute.Value = oListItem.Url;
>>>>>>>               resultNode.Attributes.Append(idAttribute);
>>>>>>>               XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
>>>>>>>               //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
>>>>>>>               urlAttribute.Value = string.Format("{0}?ID={1}",
>>>>>>>                   oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>>>>>>>                   oListItem.ID);
>>>>>>>               resultNode.Attributes.Append(urlAttribute);
>>>>>>>               getListItemsNode.AppendChild(resultNode);
>>>>>>>           }
>>>>>>>           counter++;
>>>>>>>       }
>>>>>>>
>>>>>>>       listQuery.ListItemCollectionPosition = collListItems.ListItemCollectionPosition;
>>>>>>>
>>>>>>>   } while (listQuery.ListItemCollectionPosition != null);
>>>>>>>
>>>>>>>   retVal.AppendChild(getListItemsNode);
>>>>>>> <<<<<<
>>>>>>>
>>>>>>> The code is clearly working if you get 20000 results returned, so I
>>>>>>> submit that perhaps there's a configured limit in your SharePoint instance
>>>>>>> that prevents listing more than 20000.  That's the only way I can explain
>>>>>>> this.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <
>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> The job finishes OK (several times), but always with these 20,000
>>>>>>>> documents; for some reason the loop only executes twice.
>>>>>>>>
>>>>>>>> Jorge Alonso Garcia
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 18:14, Karl Wright (<da...@gmail.com>)
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If they are all in one library, then you'd be running this code:
>>>>>>>>>
>>>>>>>>> >>>>>>
>>>>>>>>>         int startingIndex = 0;
>>>>>>>>>         int amtToRequest = 10000;
>>>>>>>>>         while (true)
>>>>>>>>>         {
>>>>>>>>>           com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>>>>>>>             itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>>>>>>>>>
>>>>>>>>>           MessageElement[] itemsList = itemsResult.get_any();
>>>>>>>>>
>>>>>>>>>           if (Logging.connectors.isDebugEnabled()){
>>>>>>>>>             Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           if (itemsList.length != 1)
>>>>>>>>>             throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>>>>>>>>>
>>>>>>>>>           MessageElement items = itemsList[0];
>>>>>>>>>           if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>>>>>>>             throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>>>>>>>
>>>>>>>>>           int resultCount = 0;
>>>>>>>>>           Iterator iter = items.getChildElements();
>>>>>>>>>           while (iter.hasNext())
>>>>>>>>>           {
>>>>>>>>>             MessageElement child = (MessageElement)iter.next();
>>>>>>>>>             if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>>>>>>>             {
>>>>>>>>>               Iterator resultIter = child.getChildElements();
>>>>>>>>>               while (resultIter.hasNext())
>>>>>>>>>               {
>>>>>>>>>                 MessageElement result = (MessageElement)resultIter.next();
>>>>>>>>>                 if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>>>>>>>                 {
>>>>>>>>>                   resultCount++;
>>>>>>>>>                   String relPath = result.getAttribute("FileRef");
>>>>>>>>>                   String displayURL = result.getAttribute("ListItemURL");
>>>>>>>>>                   fileStream.addFile( relPath, displayURL );
>>>>>>>>>                 }
>>>>>>>>>               }
>>>>>>>>>
>>>>>>>>>             }
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           if (resultCount < amtToRequest)
>>>>>>>>>             break;
>>>>>>>>>
>>>>>>>>>           startingIndex += resultCount;
>>>>>>>>>         }
>>>>>>>>> <<<<<<
>>>>>>>>>
>>>>>>>>> What this does is request library content URLs in chunks of
>>>>>>>>> 10000.  It stops when it receives fewer than 10000 documents from any one
>>>>>>>>> request.
>>>>>>>>>
>>>>>>>>> If the documents were all in one library, then one call to the web
>>>>>>>>> service yielded 10000 documents, the second call yielded 10000
>>>>>>>>> documents, and there was no third call, for reasons I cannot figure out.
>>>>>>>>> Since 10000 documents were returned each time, the loop ought to just
>>>>>>>>> continue, unless there was some kind of error.  Does the job succeed, or
>>>>>>>>> does it abort?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <da...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you are using the MCF plugin, and selecting the appropriate
>>>>>>>>>> version of Sharepoint in the connection configuration, there is no hard
>>>>>>>>>> limit I'm aware of for any Sharepoint job.  We have lots of other people
>>>>>>>>>> using SharePoint and nobody has reported this ever before.
>>>>>>>>>>
>>>>>>>>>> If your SharePoint connection says "SharePoint 2003" as the
>>>>>>>>>> SharePoint version, then sure, that would be expected behavior.  So please
>>>>>>>>>> check that first.
>>>>>>>>>>
>>>>>>>>>> The other question I have is your description of you first
>>>>>>>>>> getting 10001 documents and then later 20002.  That's not how ManifoldCF
>>>>>>>>>> works.  At the start of the crawl, seeds are added; this would start out
>>>>>>>>>> just being the root, and then other documents would be discovered as the
>>>>>>>>>> crawl proceeded, after subsites and libraries are discovered.  So I am
>>>>>>>>>> still trying to square that with your description of how this is working
>>>>>>>>>> for you.
>>>>>>>>>>
>>>>>>>>>> Are all of your documents in one library?  Or two libraries?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <
>>>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> The UI shows 20,002 documents (in a first phase it showed 10,001, and
>>>>>>>>>>> after some processing time it rose to 20,002).
>>>>>>>>>>> It looks like a hard limit; there are more files in SharePoint
>>>>>>>>>>> matching the criteria used.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 19, 2019 at 16:05, Karl Wright (<
>>>>>>>>>>> daddywri@gmail.com>) wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Jorge,
>>>>>>>>>>>>
>>>>>>>>>>>> When you run the job, do you see more than 20,000 documents as
>>>>>>>>>>>> part of it?
>>>>>>>>>>>>
>>>>>>>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>>>>>>>
>>>>>>>>>>>> Unless you are seeing a hard number like that in the UI for
>>>>>>>>>>>> that job on the job status page, I doubt very much that the problem is a
>>>>>>>>>>>> numerical limitation in the number of documents.  I would suspect that the
>>>>>>>>>>>> inclusion criteria, e.g. the mime type or maximum length, is excluding
>>>>>>>>>>>> documents.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <
>>>>>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>> We have installed the SharePoint plugin, and can properly access
>>>>>>>>>>>>> http:/server/_vti_bin/MCPermissions.asmx
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>> SharePoint has more than 20,000 documents, but when we execute
>>>>>>>>>>>>> the job it only extracts these 20,000. How can I check where the issue is?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 12:52, Karl Wright (<
>>>>>>>>>>>>> daddywri@gmail.com>) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> By "stop at 20,000" do you mean that it finds more than
>>>>>>>>>>>>>> 20,000 but stops crawling at that time?  Or what exactly do you mean here?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>>>>>>>>>> installed the SharePoint plugin and may have selected a version of
>>>>>>>>>>>>>> SharePoint that is inappropriate.  All SharePoint versions after 2008 limit
>>>>>>>>>>>>>> the number of documents returned using the standard web services methods.
>>>>>>>>>>>>>> The plugin allows us to bypass that hard limit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <
>>>>>>>>>>>>>> jalongar@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>>>>>>>> There is a job that crawls SharePoint 2016, but it is not
>>>>>>>>>>>>>>> retrieving all files; it stops at 20,000 documents without any error.
>>>>>>>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>>>>>>>> limitation?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: sharepoint crawler documents limit

Posted by Karl Wright <da...@gmail.com>.

I'm glad you got by this.  Thanks for letting us know what the issue was.
Karl
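
A rough way to see why later requests could take longer, going only by the plugin code quoted earlier in this thread: each call starts its counter at 0 and walks the list from the beginning, in chunks of 1000, until the counter reaches startRowParam + rowLimitParam. The sketch below counts the rows one call walks over. It is a hypothetical model with made-up names; it counts rows only and says nothing about SharePoint's real query cost.

```java
/** Hypothetical model of the plugin's row walking; not the plugin itself. */
final class ScanCostSketch {

    /**
     * Counts the rows visited by one call, following the quoted logic:
     * read chunks of chunkSize starting from row 0, count every row, and stop
     * once the counter reaches startRow + rowLimit or the list is exhausted.
     */
    static long rowsScanned(long totalRows, long startRow, long rowLimit, long chunkSize) {
        long counter = 0;
        long position = 0; // stands in for ListItemCollectionPosition
        do {
            if (counter >= startRow + rowLimit) {
                break;
            }
            long chunk = Math.min(chunkSize, totalRows - position);
            counter += chunk;  // the foreach visits every row in the chunk
            position += chunk;
        } while (position < totalRows);
        return counter;
    }

    public static void main(String[] args) {
        long total = 150_000;
        long[] starts = {0, 10_000, 20_000, 140_000};
        for (long start : starts) {
            System.out.println("start=" + start + " rows visited="
                + rowsScanned(total, start, 10_000, 1_000));
        }
    }
}
```

Under this model the call starting at row 20,000 visits 30,000 rows to return 10,000, and the work per call keeps growing with the start index. That would fit Gaurav's description of requests getting progressively slower until one exceeds the timeout.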
