Posted to user@manifoldcf.apache.org by Cheng Zeng <ze...@hotmail.co.uk> on 2016/04/08 09:32:46 UTC

Sharepoint 2013 Crawling a large list

Hi,
I am trying to extract web pages and attachments from SharePoint 2013 and upload the data to Solr for indexing.
I have installed the SharePoint plugin on the SharePoint 2013 server and have been able to use ManifoldCF to fetch items from lists with fewer than 160 items. My problem is that a few lists have more than 4,900 items. When ManifoldCF crawls these large lists, it starts to process items very slowly and seems to stop working after about 2,100 items have been processed. I tried to slow down the rate at which items are uploaded to the Solr instance by forcing the worker thread to sleep for 3 seconds after every 50 items are added to the pipeline. I tried several different rates, but ManifoldCF always slows down once roughly 2,100 items in the list have been processed. Notably, the slowdown begins around 30 minutes after the crawling job starts, and the following errors are tossed:
WARN 2016-04-08 12:29:14,762 (Worker thread '19') - Service interruption reported for job 1460088455222 connection 'SharepointRepoistoryConn': Remote procedure exception: ; nested exception is:
	java.lang.ArrayIndexOutOfBoundsException
FATAL 2016-04-08 12:29:14,777 (Worker thread '28') - Error tossed: null
java.lang.NullPointerException
FATAL 2016-04-08 12:30:37,611 (Worker thread '29') - Error tossed: null
java.lang.NullPointerException
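
For reference, the throttling I added is roughly the following simplified
sketch (the class and method names here are just illustrative, not part of
ManifoldCF itself):

    // Crude batch throttle: pause the worker thread for 3 seconds after
    // every 50 items are pushed to the output pipeline.
    public class BatchThrottle {
        private static final int BATCH_SIZE = 50;
        private static final long PAUSE_MS = 3000L;
        private int itemsInBatch = 0;

        // Called once per item after it has been handed to the pipeline.
        public synchronized void afterItemPushed() throws InterruptedException {
            if (++itemsInBatch >= BATCH_SIZE) {
                itemsInBatch = 0;
                Thread.sleep(PAUSE_MS);
            }
        }
    }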

The log is attached.  If someone could help me, I would really appreciate it.
Best regards,
Cheng

RE: Sharepoint 2013 Crawling a large list

Posted by Cheng Zeng <ze...@hotmail.co.uk>.


Hi Karl,
Thank you very much for your reply. MCF processed all the items in the large list with no errors once I switched to PostgreSQL. Your suggestion was very helpful.
Best regards,
Cheng


Re: Sharepoint 2013 Crawling a large list

Posted by Karl Wright <da...@gmail.com>.
Hi Cheng,

That is a pretty impressively messed up system!

Let's start with what we know and then go on to what we don't.

The "Remote procedure exception" error is due to an
org.apache.axis.AxisFault exception that is not apparently coming from the
server.  That's pretty weird in its own right.  Equally weird is the NPE
coming from within HttpClient during NTLM processing.  Unfortunately we
aren't seeing the actual stack traces themselves, which would allow us to
figure out what was happening; instead you are getting
ArrayIndexOutOfBounds and NullPointerExceptions doing basic things like
array copying (!).

Can you include one or two of the actual traces (with line numbers)?

My sense is that (a) you are using a non-standard JVM that is (b) running
out of memory, but not throwing an out-of-memory exception when that
happens.  Rather, it's blowing up when it fails to allocate the memory it
needs.  It's most likely running out of memory because (c) you are using
HSQLDB, which keeps its database tables in memory.

I would recommend either (1) giving MCF more memory, or (2) better yet,
switching to PostgreSQL.  If this keeps happening under either scenario,
please include a few of the full traces so I can make better sense of the
problem.
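
For reference, the database implementation is selected in ManifoldCF's
properties.xml.  A minimal sketch of the PostgreSQL switch is below; the
host, port, database name and credentials are example values, and the exact
property names should be checked against the "how to build and deploy"
documentation for your ManifoldCF version:

    <!-- Example properties.xml entries: use PostgreSQL instead of the
         default HSQLDB.  Values shown here are placeholders. -->
    <property name="org.apache.manifoldcf.databaseimplementationclass"
              value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
    <property name="org.apache.manifoldcf.postgresql.hostname" value="localhost"/>
    <property name="org.apache.manifoldcf.postgresql.port" value="5432"/>
    <property name="org.apache.manifoldcf.database.name" value="dbname"/>
    <property name="org.apache.manifoldcf.database.username" value="manifoldcf"/>
    <property name="org.apache.manifoldcf.database.password" value="local_pg"/>

For option (1), the heap for the ManifoldCF process is raised via the JVM
-Xmx setting in whatever script or service definition starts it.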

Please let us know what happens.

Thanks,
Karl

