Posted to user@manifoldcf.apache.org by Phil Riethmuller <pr...@funnelback.com> on 2016/02/16 06:22:38 UTC

HTTP 302 error causing job to abort

Hi -

When crawling a Sharepoint repository, I'm receiving a HTTP 302 error which
is causing the manifold job to abort. How do I prevent the crawler from
aborting the job?

I'm using v2.3 of Manifold with a postgres database.

Regards,
Phil



Re: HTTP 302 error causing job to abort

Posted by Phil Riethmuller <pr...@funnelback.com>.
Thanks Karl,

I'll take a look at this today.

Regards,

Phil Riethmuller
Technical Consultant
 
Funnelback | 437 Kent Street, Sydney, NSW 2000
T +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>

AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES


Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>  -
Twitter


From:  Karl Wright <da...@gmail.com>
Reply-To:  <us...@manifoldcf.apache.org>
Date:  Monday, 22 February 2016 11:32 pm
To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Any news on this research?
Karl






Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Any news on this research?
Karl


On Fri, Feb 19, 2016 at 12:46 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Phil,
>
> Thanks -- this information is more helpful.
>
> So my understanding is that there is an external site reference in your
> site/subsite hierarchy?  And the *root* site (the one that you point at
> when you configure the connection itself) is *not* external after all?
>
> If that is the case, then the external site must be being "discovered"
> through the Webs service API call.  There are two ways forward:
>
> (1)  We can change the Webs response parsing to detect external sites and
> not include those in the crawl, or
> (2) We can try to make decisions based on whether a 302 comes back as a
> response code.
>
> (1) is by far the best approach but it will require some cooperation and
> execution of sample code on your part.  Essentially I'll need to see what
> the xml is that is coming back that first describes the external site and
> see if there is an attribute that lets us know it is external.  That way I
> can properly just skip it entirely.
>
> We can have a look at what comes back from SharePoint for this API
> response if you enable connector debugging in properties.xml:
>
> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>
> ... and restart.  You will then need to do a crawl.  The following line
> will be what you look for:
>
> Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);
>
> This xml response will contain "Url" and "Title" nodes; what I need to
> know is whether there's any attribute of the "Url" node, or parallel node
> other than "Url" or "Title", that contains an indication of whether the Url
> that describes the external site is indeed external.  So you look for the
> Url that describes the SharePoint URL that has the redirection, and tell me
> if there's anything special about it in the associated getSites response.
> Does that make sense?
>
> If this is too hard, alternative (2) is possible, but it will require tons
> of individual changes.  So let's look into (1) first.
>
> Thanks
> Karl
>
>
> On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller <
> priethmuller@funnelback.com> wrote:
>
>> Hi Karl,
>>
>> Some further info:
>>
>>    - The problem document that Manifold reported is redirecting to an
>>    external site.
>>    - We tried crawling a smaller subset of content on the same
>>    Sharepoint site that definitely doesn’t contain any external links in the
>>    content, and this works OK.
>>    - The job that errors with the 302 says it has found 529 docs so far
>>    and processed 127 of them. This seems to indicate that it has in fact found
>>    some documents.
>>
>> I’m not sure what you mean that the error is being generated from the API
>> call, and not an individual document? The info appears to indicate it is
>> not all documents, but just selected documents.
>>
>> There really isn’t much we can do about this from the Sharepoint
>> configuration side. Is there any way we can test if it is as simple as the
>> 302 coming from the documents themselves?
>>
>> Thanks for your help to date.
>>
>> Phil
>>
>>
>> From: Karl Wright <da...@gmail.com>
>> Reply-To: <us...@manifoldcf.apache.org>
>> Date: Thursday, 18 February 2016 10:31 am
>>
>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi Phil,
>>
>> The 302 error is not coming from a single document.  If it *was* coming
>> from the fetch of an individual document, it would be easy to work around.
>> But, from your stack trace, it is clear that this error is coming from an
>> API call, specifically a call to enumerate subsites of a given site.  That
>> means that some or all of the SharePoint hierarchy is not accessible
>> through POST requests.  I have never seen this kind of behavior from
>> SharePoint before.
>>
>> This is not something that I can work around without more information.
>> In order to get that information, you will at the very minimum need to turn
>> on connector debugging, and probably turning on http wire debugging would
>> be helpful too.  And, if what you said about the View page for this
>> connection is true and it also shows a 302 error, I very much suspect that
>> something changed on the server end and you are currently unable to crawl
>> *any* documents at all.
>>
>> I am sorry I cannot make this any clearer.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <
>> priethmuller@funnelback.com> wrote:
>>
>>> Hi Karl,
>>>
>>> Thanks for the update.
>>>
>>> I’m not 100% sure how many documents have this redirect in them, but
>>> I’ll see if I can get a better estimate. The content we are crawling is
>>> substantially large, and comes from many different authors so it’s
>>> difficult to manage how these Sharepoint documents are created. It makes it
>>> extremely difficult to pinpoint all the documents that contain redirects.
>>>
>>> Am I correct in assuming a single 302 error causes the job to fail, or
>>> is there some other logic that determines this?
>>>
>>> How plausible would it be to include in the product an option for
>>> treating 302’s as a warning, rather than a fatal error? Possibly just an
>>> option in the Job setup?
>>>
>>> Regards,
>>> Phil
>>>
>>>
>>> From: Karl Wright <da...@gmail.com>
>>> Reply-To: <us...@manifoldcf.apache.org>
>>> Date: Thursday, 18 February 2016 1:39 am
>>>
>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi again Phil,
>>>
>>> The HttpClient team points out that POST requests (as we do for the
>>> SharePoint repository requests) are not allowed to follow 302 redirections
>>> according to RFC2616.  We use POST requests because, for SOAP, there is
>>> often quite a bit of XML data that goes along with the request, and we
>>> would otherwise have size issues.  So we cannot use GET instead of POST.
>>> See CONNECTORS-1279 for details.
>>>
>>> If you still believe that it is only a couple of URLs that are returning
>>> 302 for you, I'd like some analysis of why you believe that to be true.  I
>>> would be happy to consider recognition of an occasional 302 response as
>>> meaning "skip this document".  On the other hand, based on your stack
>>> trace, it really appears that you have a far more systemic problem; it is
>>> failing while obtaining information for an entire site, so not much would
>>> get crawled in that case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>> Hi Phil,
>>>>
>>>> It is not surprising that the connector doesn't like 302 responses and
>>>> doesn't know what to do with them, because it isn't supposed to ever be
>>>> getting any of these.
>>>>
>>>> I am puzzled by your statement that "only a couple of documents have
>>>> redirections in them", because the connector crawls Lists and Library
>>>> documents within SharePoint *only*, and these are very specifically
>>>> accessible through a SharePoint URL hierarchy structure.  There's no room
>>>> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
>>>> feel pretty certain you have a problem with your configuration and it is
>>>> not just "a couple of documents".
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
>>>> priethmuller@funnelback.com> wrote:
>>>>
>>>>> Thanks Karl,
>>>>>
>>>>> The majority of content is not going to the redirect; it’s probably
>>>>> just a handful of documents that are behaving this way.
>>>>>
>>>>> I’d agree that it’s of lesser concern whether or not the document
>>>>> itself is indexed; however, I wouldn’t expect the 302 to be treated as a
>>>>> fatal error that causes the job to come to a halt. I’d expect the document
>>>>> to be passed over, and the crawl to continue.
>>>>>
>>>>> Is the only solution at this point to remove the documents which
>>>>> return a 302 to get the crawl to run in full?
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Phil Riethmuller*
>>>>> Technical Consultant
>>>>>
>>>>>
>>>>> From: Karl Wright <da...@gmail.com>
>>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>>> Date: Wednesday, 17 February 2016 8:58 am
>>>>>
>>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> You probably want to point your SharePoint repository connection to
>>>>> the proper server and site, and not rely on redirections.  It's also
>>>>> possible that you are missing the site entirely and the redirection you are
>>>>> seeing is taking you to some error page somewhere.
>>>>>
>>>>> I will be raising the question of redirections with the
>>>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>>>> SharePoint connector code.  However, if your connection is properly set up,
>>>>> redirections should be unneeded.
>>>>>
>>>>> I would read the documentation on the Wiki page for debugging
>>>>> SharePoint connections at the bottom of this page:
>>>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
>>>>> priethmuller@funnelback.com> wrote:
>>>>>
>>>>>> Do you mean in the job status in the Manifold CF interface?
>>>>>>
>>>>>> The job status also shows the same:
>>>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>>>> (302)HTTP/1.0 302 Found
>>>>>>
>>>>>> I agree, I wouldn’t have thought that the crawler would follow any
>>>>>> links or redirections.
>>>>>>
>>>>>> What sort of settings could be incorrectly configured, that I
>>>>>> could look at revising?
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>>
>>>>>> From: Karl Wright <da...@gmail.com>
>>>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>>>
>>>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> When you view the repository connection in the UI, do you get a 302
>>>>>> error also?
>>>>>>
>>>>>> I have looked at the code; HttpClient is supposedly configured to
>>>>>> honor redirections.  Obviously it is not doing that, so I'll have to dig
>>>>>> deeper into why that is.  On the other hand, I would not expect you to be
>>>>>> getting any redirections, unless you have configured your connection
>>>>>> incorrectly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>>>>>> priethmuller@funnelback.com> wrote:
>>>>>>
>>>>>>> Thanks Karl -
>>>>>>>
>>>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>>>> trace:
>>>>>>>
>>>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Phil Riethmuller*
>>>>>>> Technical Consultant
>>>>>>>
>>>>>>>
>>>>>>> From: Karl Wright <da...@gmail.com>
>>>>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> A HTTP 302 response is simply a redirection.  It should not, by
>>>>>>> itself, cause a job to abort.  I would expect that to go by in wire/http
>>>>>>> logging, but you should not see it anywhere else.  So it is not clear to me
>>>>>>> what you are really seeing here.
>>>>>>>
>>>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>>>>>> priethmuller@funnelback.com> wrote:
>>>>>>>
>>>>>>>> Hi -
>>>>>>>>
>>>>>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302
>>>>>>>> error which is causing the manifold job to abort. How do I prevent the
>>>>>>>> crawler from aborting the job?
>>>>>>>>
>>>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Phil
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Hi Phil,

Thanks -- this information is more helpful.

So my understanding is that there is an external site reference in your
site/subsite hierarchy?  And the *root* site (the one that you point at
when you configure the connection itself) is *not* external after all?

If that is the case, then the external site must be being "discovered"
through the Webs service API call.  There are two ways forward:

(1)  We can change the Webs response parsing to detect external sites and
not include those in the crawl, or
(2) We can try to make decisions based on whether a 302 comes back as a
response code.

(1) is by far the best approach but it will require some cooperation and
execution of sample code on your part.  Essentially I'll need to see what
the xml is that is coming back that first describes the external site and
see if there is an attribute that lets us know it is external.  That way I
can properly just skip it entirely.

We can have a look at what comes back from SharePoint for this API response
if you enable connector debugging in properties.xml:

<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

... and restart.  You will then need to do a crawl.  The log line to look
for is the one written by this statement:

Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);

This xml response will contain "Url" and "Title" nodes; what I need to know
is whether there's any attribute of the "Url" node, or any parallel node
other than "Url" or "Title", that indicates whether the Url describing the
external site is indeed external.  So look for the Url that corresponds to
the SharePoint URL that has the redirection, and tell me if there's anything
special about it in the associated getSites response.
Does that make sense?
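
To make the inspection above concrete, here is a minimal Java sketch that
dumps every element and attribute of a getSites xml response once it has been
copied out of the log.  It is not connector code: the class name
GetSitesResponseDump and the sample xml shape in the comment are assumptions
for illustration only, and the real response may be structured differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative helper only; not part of ManifoldCF.
final class GetSitesResponseDump {

    // Returns one indented line per element: its name followed by all attributes.
    // Example input (hypothetical shape):
    //   <Webs><Web Title="A" Url="http://host/a"/></Webs>
    static List<String> describe(String xmlResponse) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse(new InputSource(new StringReader(xmlResponse)));
            List<String> lines = new ArrayList<>();
            walk(doc.getDocumentElement(), 0, lines);
            return lines;
        } catch (Exception e) {
            // Parser configuration, I/O and malformed-xml errors all end up here.
            throw new IllegalArgumentException("Could not parse xml response", e);
        }
    }

    // Depth-first walk that records the element, then recurses into child elements.
    private static void walk(Element element, int depth, List<String> lines) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        line.append(element.getNodeName());
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            line.append(' ').append(attribute.getNodeName())
                .append("=\"").append(attribute.getNodeValue()).append('"');
        }
        lines.add(line.toString());
        for (Node child = element.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, depth + 1, lines);
            }
        }
    }
}
```

Running this over the logged response and comparing the entry for the
redirecting site with the entry for an ordinary subsite should show whether
any extra attribute or sibling node distinguishes the two.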

If this is too hard, alternative (2) is possible, but it will require tons
of individual changes.  So let's look into (1) first.
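
For comparison, the idea behind alternative (2) can be sketched at a single
call site as below.  This is only an illustration of "treat a 302 on subsite
enumeration as skip this site": SubsiteLister, HttpStatusException and
listOrSkip are hypothetical names, not ManifoldCF classes, and the real change
would have to be repeated wherever the connector makes a SharePoint call.

```java
import java.util.Collections;
import java.util.List;

// Illustrative only; none of these types exist in ManifoldCF.
final class SubsiteEnumerationSketch {

    // Hypothetical unchecked exception carrying the HTTP status of a failed call.
    static final class HttpStatusException extends RuntimeException {
        final int status;

        HttpStatusException(int status) {
            super("Unexpected http error code " + status);
            this.status = status;
        }
    }

    // Hypothetical stand-in for whatever performs the subsite enumeration.
    interface SubsiteLister {
        List<String> list(String siteUrl);
    }

    // Returns the subsites of siteUrl.  A 302 is recorded in skippedSites and
    // treated as "no subsites"; every other failure is rethrown unchanged.
    static List<String> listOrSkip(SubsiteLister lister, String siteUrl,
                                   List<String> skippedSites) {
        try {
            return lister.list(siteUrl);
        } catch (HttpStatusException e) {
            if (e.status == 302) {
                skippedSites.add(siteUrl);
                return Collections.emptyList();
            }
            throw e;
        }
    }
}
```

The sketch also shows the weakness of this route: a 302 caused by a
misconfigured connection would be skipped just as silently as one caused by a
genuinely external site.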

Thanks
Karl


On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Hi Karl,
>
> Some further info:
>
>    - The problem document that Manifold reported, is redirecting to an
>    external site.
>    - We tried crawling a smaller subset of content on the same Sharepoint
>    site that definitely doesn’t contain any external links in the content, and
>    this works OK.
>    - The job that errors with the 302 says it has found 529 docs so far
>    and processed 127 of them. This seems to indicate that it has in fact found
>    some documents.
>
> I’m not sure what you mean that the error is being generated from the API
> call, and not an individual document? The info appears to indicate it is
> not all documents, but just selected documents.
>
> There really isn’t much we can do about this from the Sharepoint
> configuration side, is there any way we can test if it is as simple as the
> 302 coming from the documents themselves?
>
> Thanks for your help to date.
>
> Phil
>
>
> From: Karl Wright <da...@gmail.com>
> Reply-To: <us...@manifoldcf.apache.org>
> Date: Thursday, 18 February 2016 10:31 am
>
> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject: Re: HTTP 302 error causing job to abort
>
> Hi Phil,
>
> The 302 error is not coming from a single document.  If it *was* coming
> from the fetch of an individual document, it would be easy to work around.
> But, from your stack trace, it is clear that this error is coming from an
> API call, specifically a call to enumerate subsites of a given site.  That
> means that some or all of the SharePoint hierarchy is not accessible
> through POST requests.  I have never seen this kind of behavior from
> SharePoint before.
>
> This is not something that I can work around without more information.  In
> order to get that information, you will at the very minimum need to turn on
> connector debugging, and probably turning on http wire debugging would be
> helpful too.  And, if what you said about the View page for this connection
> is true and it also shows a 302 error, I very much suspect that something
> changed on the server end and you are currently unable to crawl *any*
> documents at all.
>
> I am sorry I cannot make this any clearer.
>
> Thanks,
> Karl
>
>
>
>
> On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <
> priethmuller@funnelback.com> wrote:
>
>> Hi Karl,
>>
>> Thanks for the update.
>>
>> I’m not 100% sure how many documents have this redirect in them, but I’ll
>> see if I can get a better estimate. The content we are crawling is
>> substantially large, and comes from many different authors so it’s
>> difficult to manage how these Sharepoint documents are created. It makes it
>> extremely difficult to pinpoint all the documents that contain redirects.
>>
>> Am I correct in assuming a single 302 error causes the job to fail, or is
>> there some other logic that determines this?
>>
>> How plausible would it be to include in the product an option for
>> treating 302’s as a warning, rather than a fatal error? Possibly just an
>> option in the Job setup?
>>
>> Regards,
>> Phil
>>
>>
>> From: Karl Wright <da...@gmail.com>
>> Reply-To: <us...@manifoldcf.apache.org>
>> Date: Thursday, 18 February 2016 1:39 am
>>
>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi again Phil,
>>
>> The HttpClient team points out that POST requests (as we do for the
>> SharePoint repository requests) are not allowed to follow 302 redirections
>> according to RFC2616.  We use POST requests because, for SOAP, there is
>> often quite a bit of XML data that goes along with the request, and we
>> would otherwise have size issues.  So we cannot use GET instead of POST.
>> See CONNECTORS-1279 for details.
>>
>> If you still believe that it is only a couple of URLs that are returning
>> 302 for you, I'd like some analysis of why you believe that to be true.  I
>> would be happy to consider recognition of an occasional 302 response as
>> meaning "skip this document".  On the other hand, based on your stack
>> trace, it really appears that you have a far more systemic problem; it is
>> failing while obtaining information for an entire site, so not much would
>> get crawled in that case.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi Phil,
>>>
>>> It is not surprising that the connector doesn't like 302 responses and
>>> doesn't know what to do with them, because it isn't supposed to ever be
>>> getting any of these.
>>>
>>> I am puzzled by your statement that "only a couple of documents have
>>> redirections in them", because the connector crawls Lists and Library
>>> documents within SharePoint *only*, and these are very specifically
>>> accessible through a SharePoint URL hierarchy structure.  There's no room
>>> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
>>> feel pretty certain you have a problem with your configuration and it is
>>> not just "a couple of documents".
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
>>> priethmuller@funnelback.com> wrote:
>>>
>>>> Thanks Karl,
>>>>
>>>> The majority of content is not going to the redirect, it’s probably
>>>> just a handful of documents that are behaving this way.
>>>>
>>>> I’d agree that it’s of lesser concern whether or not the document
>>>> itself is indexing, however I wouldn’t expect the 302 to be treated as a
>>>> fatal error that causes the job to come to a halt. I’d expect the document
>>>> to be passed over, and the crawl to continue.
>>>>
>>>> Is the only solution at this point to remove the documents which
>>>> redirect to a 302 to get the crawl to run in full?
>>>>
>>>> Regards,
>>>>
>>>> *Phil Riethmuller*
>>>> Technical Consultant
>>>>
>>>>
>>>> From: Karl Wright <da...@gmail.com>
>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>> Date: Wednesday, 17 February 2016 8:58 am
>>>>
>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>
>>>> Hi Phil,
>>>>
>>>> You probably want to point your SharePoint repository connection to the
>>>> proper server and site, and not rely on redirections.  It's also possible
>>>> that you are missing the site entirely and the redirection you are seeing
>>>> is taking you to some error page somewhere.
>>>>
>>>> I will be raising the question of redirections with the
>>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>>> SharePoint connector code.  However, if your connection is properly set up,
>>>> redirections should be unneeded.
>>>>
>>>> I would read the documentation on the Wiki page for debugging
>>>> SharePoint connections at the bottom of this page:
>>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
>>>> priethmuller@funnelback.com> wrote:
>>>>
>>>>> Do you mean in the job status in the Manifold CF interface?
>>>>>
>>>>> The job status also shows the same:
>>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>>> (302)HTTP/1.0 302 Found
>>>>>
>>>>> I agree, I wouldn’t have thought that the crawler would follow any links
>>>>> or redirections.
>>>>>
>>>>> What sort of configurations could be incorrectly configured, that I
>>>>> could look at revising?
>>>>>
>>>>> Phil
>>>>>
>>>>>
>>>>> From: Karl Wright <da...@gmail.com>
>>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>>
>>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Thanks.
>>>>>
>>>>> When you view the repository connection in the UI, do you get a 302
>>>>> error also?
>>>>>
>>>>> I have looked at the code; Httpclient is supposedly configured to
>>>>> honor redirections.  Obviously it is not doing that, so I'll have to dig
>>>>> deeper into why that is.  On the other hand, I would not expect you to be
>>>>> getting any redirections, unless you have configured your connection
>>>>> incorrectly.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>>>>> priethmuller@funnelback.com> wrote:
>>>>>
>>>>>> Thanks Karl -
>>>>>>
>>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>>> trace:
>>>>>>
>>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception
>>>>>> tossed: Unexpected http error code 302 accessing SharePoint at <URL>:
>>>>>> (302)HTTP/1.0 302 Found
>>>>>>
>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
>>>>>> http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>
>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>>
>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>>
>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>
>>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>>
>>>>>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>>
>>>>>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>>
>>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>>
>>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>>
>>>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>>
>>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>>
>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>>
>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>>
>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>>
>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>>
>>>>>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>>
>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> *Phil Riethmuller*
>>>>>> Technical Consultant
>>>>>>
>>>>>>
>>>>>> From: Karl Wright <da...@gmail.com>
>>>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>
>>>>>> Hi Phil,
>>>>>>
>>>>>> A HTTP 302 response is simply a redirection.  It should not, by
>>>>>> itself, cause a job to abort.  I would expect that to go by in wire/http
>>>>>> logging, but you should not see it anywhere else.  So it is not clear to me
>>>>>> what you are really seeing here.
>>>>>>
>>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>>>>> priethmuller@funnelback.com> wrote:
>>>>>>
>>>>>>> Hi -
>>>>>>>
>>>>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302
>>>>>>> error which is causing the manifold job to abort. How do I prevent the
>>>>>>> crawler from aborting the job?
>>>>>>>
>>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Phil
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: HTTP 302 error causing job to abort

Posted by Phil Riethmuller <pr...@funnelback.com>.
Hi Karl,

Some further info:
* The problem document that Manifold reported is redirecting to an external
site.
* We tried crawling a smaller subset of content on the same Sharepoint site
that definitely doesn’t contain any external links in the content, and this
works OK.
* The job that errors with the 302 says it has found 529 docs so far and
processed 127 of them. This seems to indicate that it has in fact found some
documents.

I’m not sure what you mean that the error is being generated from the API
call, and not an individual document? The info appears to indicate it is not
all documents, but just selected documents.

There really isn’t much we can do about this from the Sharepoint
configuration side. Is there any way we can test if it is as simple as the
302 coming from the documents themselves?

Thanks for your help to date.

Phil


From:  Karl Wright <da...@gmail.com>
Reply-To:  <us...@manifoldcf.apache.org>
Date:  Thursday, 18 February 2016 10:31 am
To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Hi Phil,

The 302 error is not coming from a single document.  If it *was* coming from
the fetch of an individual document, it would be easy to work around.  But,
from your stack trace, it is clear that this error is coming from an API
call, specifically a call to enumerate subsites of a given site.  That means
that some or all of the SharePoint hierarchy is not accessible through POST
requests.  I have never seen this kind of behavior from SharePoint before.

This is not something that I can work around without more information.  In
order to get that information, you will at the very minimum need to turn on
connector debugging, and probably turning on http wire debugging would be
helpful too.  And, if what you said about the View page for this connection
is true and it also shows a 302 error, I very much suspect that something
changed on the server end and you are currently unable to crawl *any*
documents at all.

I am sorry I cannot make this any clearer.

Thanks,
Karl




On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller
<pr...@funnelback.com> wrote:
> Hi Karl,
> 
> Thanks for the update.
> 
> I’m not 100% sure how many documents have this redirect in them, but I’ll see
> if I can get a better estimate. The content we are crawling is substantially
> large, and comes from many different authors so it’s difficult to manage how
> these Sharepoint documents are created. It makes it extremely difficult to
> pinpoint all the documents that contain redirects.
> 
> Am I correct in assuming a single 302 error causes the job to fail, or is
> there some other logic that determines this?
> 
> How plausible would it be to include in the product an option for treating
> 302’s as a warning, rather than a fatal error? Possibly just an option in the
> Job setup?
> 
> Regards,
> Phil
> 
> 
> From:  Karl Wright <da...@gmail.com>
> Reply-To:  <us...@manifoldcf.apache.org>
> Date:  Thursday, 18 February 2016 1:39 am
> 
> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject:  Re: HTTP 302 error causing job to abort
> 
> Hi again Phil,
> 
> The HttpClient team points out that POST requests (as we do for the SharePoint
> repository requests) are not allowed to follow 302 redirections according to
> RFC2616.  We use POST requests because, for SOAP, there is often quite a bit
> of XML data that goes along with the request, and we would otherwise have size
> issues.  So we cannot use GET instead of POST.  See CONNECTORS-1279 for
> details.
> 
> If you still believe that it is only a couple of URLs that are returning 302
> for you, I'd like some analysis of why you believe that to be true.  I would
> be happy to consider recognition of an occasional 302 response as meaning
> "skip this document".  On the other hand, based on your stack trace, it really
> appears that you have a far more systemic problem; it is failing while
> obtaining information for an entire site, so not much would get crawled in
> that case.
> 
> Thanks,
> Karl
> 
> 
> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <da...@gmail.com> wrote:
>> Hi Phil,
>> 
>> It is not surprising that the connector doesn't like 302 responses and
>> doesn't know what to do with them, because it isn't supposed to ever be
>> getting any of these.
>> 
>> I am puzzled by your statement that "only a couple of documents have
>> redirections in them", because the connector crawls Lists and Library
>> documents within SharePoint *only*, and these are very specifically
>> accessible through a SharePoint URL hierarchy structure.  There's no room in
>> any of that for a 302 redirection.  Since you see a 302 in the UI, I feel
>> pretty certain you have a problem with your configuration and it is not just
>> "a couple of documents".
>> 
>> Karl
>> 
>> 
>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller
>> <pr...@funnelback.com> wrote:
>>> Thanks Karl,
>>> 
>>> The majority of content is not going to the redirect, it’s probably just a
>>> handful of documents that are behaving this way.
>>> 
>>> I’d agree that it’s of lesser concern whether or not the document itself is
>>> indexing, however I wouldn’t expect the 302 to be treated as a fatal error
>>> that causes the job to come to a halt. I’d expect the document to be passed
>>> over, and the crawl to continue.
>>> 
>>> Is the only solution at this point to remove the documents which redirect to
>>> a 302 to get the crawl to run in full?
>>> 
>>> Regards,
>>> 
>>> Phil Riethmuller
>>> Technical Consultant
>>> 
>>> 
>>> From:  Karl Wright <da...@gmail.com>
>>> Reply-To:  <us...@manifoldcf.apache.org>
>>> Date:  Wednesday, 17 February 2016 8:58 am
>>> 
>>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>> Subject:  Re: HTTP 302 error causing job to abort
>>> 
>>> Hi Phil,
>>> 
>>> You probably want to point your SharePoint repository connection to the
>>> proper server and site, and not rely on redirections.  It's also possible
>>> that you are missing the site entirely and the redirection you are seeing is
>>> taking you to some error page somewhere.
>>> 
>>> I will be raising the question of redirections with the
>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>> SharePoint connector code.  However, if your connection is properly set up,
>>> redirections should be unneeded.
>>> 
>>> I would read the documentation on the Wiki page for debugging SharePoint
>>> connections at the bottom of this page:
>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>> 
>>> Thanks,
>>> Karl
>>> 
>>> 
>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller
>>> <pr...@funnelback.com> wrote:
>>>> Do you mean in the job status in the Manifold CF interface?
>>>> 
>>>> The job status also shows the same:
>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>> (302)HTTP/1.0 302 Found
>>>> 
>>>> I agree, I wouldn’t have thought that the crawler would follow any links or
>>>> redirections.
>>>> 
>>>> What sort of configurations could be incorrectly configured, that I could
>>>> look at revising?
>>>> 
>>>> Phil
>>>> 
>>>> 
>>>> From:  Karl Wright <da...@gmail.com>
>>>> Reply-To:  <us...@manifoldcf.apache.org>
>>>> Date:  Wednesday, 17 February 2016 8:45 am
>>>> 
>>>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>> Subject:  Re: HTTP 302 error causing job to abort
>>>> 
>>>> Thanks.
>>>> 
>>>> When you view the repository connection in the UI, do you get a 302 error
>>>> also?
>>>> 
>>>> I have looked at the code; Httpclient is supposedly configured to honor
>>>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>>>> into why that is.  On the other hand, I would not expect you to be getting
>>>> any redirections, unless you have configured your connection incorrectly.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller
>>>> <pr...@funnelback.com> wrote:
>>>>> Thanks Karl -
>>>>> 
>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>> trace:
>>>>> 
>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed:
>>>>> Unexpected http error code 302 accessing SharePoint at <URL>:
>>>>> (302)HTTP/1.0 302 Found
>>>>> 
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http
>>>>> error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>> 
>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>> 
>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>> 
>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>> 
>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>> 
>>>>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>> 
>>>>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>> 
>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>> 
>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>> 
>>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>> 
>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>> 
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>> 
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>> 
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>> 
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>> 
>>>>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>> 
>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Phil Riethmuller
>>>>> Technical Consultant
>>>>> 
>>>>> 
>>>>> From:  Karl Wright <da...@gmail.com>
>>>>> Reply-To:  <us...@manifoldcf.apache.org>
>>>>> Date:  Tuesday, 16 February 2016 6:54 pm
>>>>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>> Subject:  Re: HTTP 302 error causing job to abort
>>>>> 
>>>>> Hi Phil,
>>>>> 
>>>>> A HTTP 302 response is simply a redirection.  It should not, by itself,
>>>>> cause a job to abort.  I would expect that to go by in wire/http logging,
>>>>> but you should not see it anywhere else.  So it is not clear to me what
>>>>> you are really seeing here.
>>>>> 
>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>> 
>>>>> Karl
>>>>>  
>>>>> 
>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
>>>>> <pr...@funnelback.com> wrote:
>>>>>> Hi -
>>>>>> 
>>>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
>>>>>> which is causing the manifold job to abort. How do I prevent the crawler
>>>>>> from aborting the job?
>>>>>> 
>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>> 
>>>>>> Regards,
>>>>>> Phil
>>>>> 
>>>> 
>>> 
>> 
> 




Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Hi Phil,

The 302 error is not coming from a single document.  If it *was* coming
from the fetch of an individual document, it would be easy to work around.
But, from your stack trace, it is clear that this error is coming from an
API call, specifically a call to enumerate subsites of a given site.  That
means that some or all of the SharePoint hierarchy is not accessible
through POST requests.  I have never seen this kind of behavior from
SharePoint before.
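
For background on why a 302 on these calls surfaces as an error instead of
being followed: RFC 2616 (raised elsewhere in this thread) says in section
10.3.3 that a 302 received for a request other than GET or HEAD must not be
redirected automatically.  The rule is small enough to state as a sketch;
RedirectRule is a made-up name for illustration and is not HttpClient or
ManifoldCF code.

```java
// Illustrative only; not HttpClient or ManifoldCF code.
final class RedirectRule {

    // RFC 2616 section 10.3.3: a 302 received in response to a method other
    // than GET or HEAD must not be followed automatically.
    static boolean mayFollow302Automatically(String method) {
        return "GET".equals(method) || "HEAD".equals(method);
    }
}
```

Since the SharePoint web service calls are POSTs, the answer for them is
always "no", which is why the 302 reaches the connector as an unexpected
response code.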

This is not something that I can work around without more information.  In
order to get that information, you will at the very minimum need to turn on
connector debugging, and probably turning on http wire debugging would be
helpful too.  And, if what you said about the View page for this connection
is true and it also shows a 302 error, I very much suspect that something
changed on the server end and you are currently unable to crawl *any*
documents at all.

I am sorry I cannot make this any clearer.

Thanks,
Karl




On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Hi Karl,
>
> Thanks for the update.
>
> I’m not 100% sure how many documents have this redirect in them, but I’ll
> see if I can get a better estimate. The content we are crawling is
> substantially large, and comes from many different authors so it’s
> difficult to manage how these Sharepoint documents are created. It makes it
> extremely difficult to pinpoint all the documents that contain redirects.
>
> Am I correct in assuming a single 302 error causes the job to fail, or is
> there some other logic that determines this?
>
> How plausible would it be to include in the product an option for treating
> 302’s as a warning, rather than a fatal error? Possibly just an option in
> the Job setup?
>
> Regards,
> Phil
>
>
> From: Karl Wright <da...@gmail.com>
> Reply-To: <us...@manifoldcf.apache.org>
> Date: Thursday, 18 February 2016 1:39 am
>
> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject: Re: HTTP 302 error causing job to abort
>
> Hi again Phil,
>
> The HttpClient team points out that POST requests (as we do for the
> SharePoint repository requests) are not allowed to follow 302 redirections
> according to RFC2616.  We use POST requests because, for SOAP, there is
> often quite a bit of XML data that goes along with the request, and we
> would otherwise have size issues.  So we cannot use GET instead of POST.
> See CONNECTORS-1279 for details.
>
> If you still believe that it is only a couple of URLs that are returning
> 302 for you, I'd like some analysis of why you believe that to be true.  I
> would be happy to consider recognition of an occasional 302 response as
> meaning "skip this document".  On the other hand, based on your stack
> trace, it really appears that you have a far more systemic problem; it is
> failing while obtaining information for an entire site, so not much would
> get crawled in that case.
>
> Thanks,
> Karl
>
>
> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Phil,
>>
>> It is not surprising that the connector doesn't like 302 responses and
>> doesn't know what to do with them, because it isn't supposed to ever be
>> getting any of these.
>>
>> I am puzzled by your statement that "only a couple of documents have
>> redirections in them", because the connector crawls Lists and Library
>> documents within SharePoint *only*, and these are very specifically
>> accessible through a SharePoint URL hierarchy structure.  There's no room
>> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
>> feel pretty certain you have a problem with your configuration and it is
>> not just "a couple of documents".
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
>> priethmuller@funnelback.com> wrote:
>>
>>> Thanks Karl,
>>>
>>> The majority of content is not going to the redirect, it’s probably just
>>> a handful of documents that are behaving this way.
>>>
>>> I’d agree that it’s of lesser concern whether or not the document itself
>>> is indexing, however I wouldn’t expect the 302 to be treated as a fatal
>>> error that causes the job to come to a halt. I’d expect the document to be
>>> passed over, and the crawl to continue.
>>>
>>> Is the only solution at this point to remove the documents which
>>> redirect to a 302 to get the crawl to run in full?
>>>
>>> Regards,
>>>
>>> *Phil Riethmuller*
>>> Technical Consultant
>>>
>>>
>>> From: Karl Wright <da...@gmail.com>
>>> Reply-To: <us...@manifoldcf.apache.org>
>>> Date: Wednesday, 17 February 2016 8:58 am
>>>
>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi Phil,
>>>
>>> You probably want to point your SharePoint repository connection to the
>>> proper server and site, and not rely on redirections.  It's also possible
>>> that you are missing the site entirely and the redirection you are seeing
>>> is taking you to some error page somewhere.
>>>
>>> I will be raising the question of redirections with the
>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>> SharePoint connector code.  However, if your connection is properly set up,
>>> redirections should be unneeded.
>>>
>>> I would read the documentation on the Wiki page for debugging SharePoint
>>> connections at the bottom of this page:
>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
>>> priethmuller@funnelback.com> wrote:
>>>
>>>> Do you mean in the job status in the Manifold CF interface?
>>>>
>>>> The job status also shows the same:
>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>> (302)HTTP/1.0 302 Found
>>>>
>>>> I agree, I wouldn’t have thought that the crawler would follow any links
>>>> or redirections.
>>>>
>>>> What sort of configurations could be incorrectly configured, that I
>>>> could look at revising?
>>>>
>>>> Phil
>>>>
>>>>
>>>> From: Karl Wright <da...@gmail.com>
>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>
>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>
>>>> Thanks.
>>>>
>>>> When you view the repository connection in the UI, do you get a 302
>>>> error also?
>>>>
>>>> I have looked at the code; Httpclient is supposedly configured to honor
>>>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>>>> into why that is.  On the other hand, I would not expect you to be getting
>>>> any redirections, unless you have configured your connection incorrectly.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>>>> priethmuller@funnelback.com> wrote:
>>>>
>>>>> Thanks Karl -
>>>>>
>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>> trace:
>>>>>
>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed:
>>>>> Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0
>>>>> 302 Found
>>>>>
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
>>>>> http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>
>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>
>>>>>         at
>>>>> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>
>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>
>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>
>>>>>         at
>>>>> org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>
>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>
>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>
>>>>>         at
>>>>> com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>
>>>>>         at
>>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Phil Riethmuller*
>>>>>
>>>>>
>>>>> From: Karl Wright <da...@gmail.com>
>>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> A HTTP 302 response is simply a redirection.  It should not, by
>>>>> itself, cause a job to abort.  I would expect that to go by in wire/http
>>>>> logging, but you should not see it anywhere else.  So it is not clear to me
>>>>> what you are really seeing here.
>>>>>
>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>>>> priethmuller@funnelback.com> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
>>>>>> which is causing the manifold job to abort. How do I prevent the crawler
>>>>>> from aborting the job?
>>>>>>
>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>>
>>>>>> Regards,
>>>>>> Phil
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: HTTP 302 error causing job to abort

Posted by Phil Riethmuller <pr...@funnelback.com>.
Hi Karl,

Thanks for the update.

I'm not 100% sure how many documents have this redirect in them, but I'll
see if I can get a better estimate. The content we are crawling is
very large and comes from many different authors, so it's difficult
to manage how these SharePoint documents are created. That makes it extremely
difficult to pinpoint all the documents that contain redirects.

Am I correct in assuming a single 302 error causes the job to fail, or is
there some other logic that determines this?

How plausible would it be to include in the product an option for treating
302s as a warning, rather than a fatal error? Possibly just an option in
the Job setup?

Regards,
Phil


From:  Karl Wright <da...@gmail.com>
Reply-To:  <us...@manifoldcf.apache.org>
Date:  Thursday, 18 February 2016 1:39 am
To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Hi again Phil,

The HttpClient team points out that POST requests (as we do for the
SharePoint repository requests) are not allowed to follow 302 redirections
automatically, according to RFC 2616.  We use POST requests because, for SOAP,
there is often quite a bit of XML data that goes along with the request, and we
would otherwise have size issues.  So we cannot use GET instead of POST.  See
CONNECTORS-1279 for details.

If you still believe that it is only a couple of URLs that are returning 302
for you, I'd like some analysis of why you believe that to be true.  I would
be happy to consider recognition of an occasional 302 response as meaning
"skip this document".  On the other hand, based on your stack trace, it
really appears that you have a far more systemic problem; it is failing
while obtaining information for an entire site, so not much would get
crawled in that case.

Thanks,
Karl


On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <da...@gmail.com> wrote:
> Hi Phil,
> 
> It is not surprising that the connector doesn't like 302 responses and doesn't
> know what to do with them, because it isn't supposed to ever be getting any of
> these.
> 
> I am puzzled by your statement that "only a couple of documents have
> redirections in them", because the connector crawls Lists and Library
> documents within SharePoint *only*, and these are very specifically accessible
> through a SharePoint URL hierarchy structure.  There's no room in any of that
> for a 302 redirection.  Since you see a 302 in the UI, I feel pretty certain
> you have a problem with your configuration and it is not just "a couple of
> documents".
> 
> Karl
> 
> 
> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller
> <pr...@funnelback.com> wrote:
>> Thanks Karl,
>> 
>> The majority of content is not going to the redirect, it's probably just a
>> handful of documents that are behaving this way.
>> 
>> I'd agree that it's of lesser concern whether or not the document itself is
>> indexing, however I wouldn't expect the 302 to be treated as a fatal error
>> that causes the job to come to a halt. I'd expect the document to be passed
>> over, and the crawl to continue.
>> 
>> Is the only solution at this point to remove the documents which redirect to
>> a 302 to get the crawl to run in full?
>> 
>> Regards,
>> 
>> Phil Riethmuller
>> 
>> 
>> From:  Karl Wright <da...@gmail.com>
>> Reply-To:  <us...@manifoldcf.apache.org>
>> Date:  Wednesday, 17 February 2016 8:58 am
>> 
>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject:  Re: HTTP 302 error causing job to abort
>> 
>> Hi Phil,
>> 
>> You probably want to point your SharePoint repository connection to the
>> proper server and site, and not rely on redirections.  It's also possible
>> that you are missing the site entirely and the redirection you are seeing is
>> taking you to some error page somewhere.
>> 
>> I will be raising the question of redirections with the
>> HttpComponents/HttpClient team, since I see no obvious problems with the
>> SharePoint connector code.  However, if your connection is properly set up,
>> redirections should be unneeded.
>> 
>> I would read the documentation on the Wiki page for debugging SharePoint
>> connections at the bottom of this page:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>> 
>> Thanks,
>> Karl
>> 
>> 
>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller
>> <pr...@funnelback.com> wrote:
>>> Do you mean in the job status in the Manifold CF interface?
>>> 
>>> The job status also shows the same:
>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>> (302)HTTP/1.0 302 Found
>>> 
>>> I agree, I wouldn't have thought that the crawler would follow any links or
>>> redirections.
>>> 
>>> What sort of configurations could be incorrectly configured, that I could
>>> look at revising?
>>> 
>>> Phil
>>> 
>>> 
>>> From:  Karl Wright <da...@gmail.com>
>>> Reply-To:  <us...@manifoldcf.apache.org>
>>> Date:  Wednesday, 17 February 2016 8:45 am
>>> 
>>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>> Subject:  Re: HTTP 302 error causing job to abort
>>> 
>>> Thanks.
>>> 
>>> When you view the repository connection in the UI, do you get a 302 error
>>> also?
>>> 
>>> I have looked at the code; Httpclient is supposedly configured to honor
>>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>>> into why that is.  On the other hand, I would not expect you to be getting
>>> any redirections, unless you have configured your connection incorrectly.
>>> 
>>> Karl
>>> 
>>> 
>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller
>>> <pr...@funnelback.com> wrote:
>>>> Thanks Karl -
>>>> 
>>>> I've replaced the actual URL with <URL> below, but here is the stack trace:
>>>> 
>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed:
>>>> Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0
>>>> 302 Found
>>>> 
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http
>>>> error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>> 
>>>>         at
>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>
>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>
>>>>         at
>>>> org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>
>>>>         at
>>>> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>
>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>
>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>
>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>
>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>
>>>>         at
>>>> com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> 
>>>> Phil Riethmuller
>>>> 
>>>> 
>>>> From:  Karl Wright <da...@gmail.com>
>>>> Reply-To:  <us...@manifoldcf.apache.org>
>>>> Date:  Tuesday, 16 February 2016 6:54 pm
>>>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>> Subject:  Re: HTTP 302 error causing job to abort
>>>> 
>>>> Hi Phil,
>>>> 
>>>> A HTTP 302 response is simply a redirection.  It should not, by itself,
>>>> cause a job to abort.  I would expect that to go by in wire/http logging,
>>>> but you should not see it anywhere else.  So it is not clear to me what you
>>>> are really seeing here.
>>>> 
>>>> Can you include an example stack trace from the manifoldcf log?
>>>> 
>>>> Karl
>>>>  
>>>> 
>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
>>>> <pr...@funnelback.com> wrote:
>>>>> Hi -
>>>>> 
>>>>> When crawling a Sharepoint repository, I'm receiving a HTTP 302 error
>>>>> which is causing the manifold job to abort. How do I prevent the crawler
>>>>> from aborting the job?
>>>>> 
>>>>> I'm using v2.3 of Manifold with a postgres database.
>>>>> 
>>>>> Regards,
>>>>> Phil
>>>> 
>>> 
>> 
> 




Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Hi again Phil,

The HttpClient team points out that POST requests (which we use for
SharePoint repository requests) are not allowed to follow 302 redirections
automatically, according to RFC 2616.  We use POST requests because, for SOAP,
there is often quite a bit of XML data that goes along with the request, and we
would otherwise have size issues.  So we cannot use GET instead of POST.
See CONNECTORS-1279 for details.
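
The RFC 2616 rule mentioned above can be sketched roughly as follows. This is
only an illustration, not ManifoldCF or HttpClient code; the class and method
names are hypothetical. It encodes the RFC's statement that a 301, 302 or 307
received for a request other than GET or HEAD must not be redirected
automatically, while a 303 asks the client to fetch the new URI with GET.

```java
import java.util.Locale;

/** Illustrative sketch of the RFC 2616 automatic-redirect rule (hypothetical names). */
final class RedirectRule {
    private RedirectRule() {}

    /**
     * Returns true when a client may follow the redirect without user confirmation.
     * For 301, 302 and 307, RFC 2616 allows this only for GET and HEAD requests.
     */
    static boolean mayFollowAutomatically(String method, int statusCode) {
        String m = method.toUpperCase(Locale.ROOT);
        boolean getOrHead = m.equals("GET") || m.equals("HEAD");
        switch (statusCode) {
            case 301:
            case 302:
            case 307:
                return getOrHead;
            case 303:
                // 303 See Other: the new URI is meant to be retrieved with GET.
                return true;
            default:
                // Not one of the redirect codes this sketch covers.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("GET 302 followed: " + mayFollowAutomatically("GET", 302));
        System.out.println("POST 302 followed: " + mayFollowAutomatically("POST", 302));
    }
}
```

Under this rule a SOAP POST that receives a 302 surfaces the 302 to the caller
instead of being retried at the new location, which matches the error in the
stack trace.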

If you still believe that it is only a couple of URLs that are returning
302 for you, I'd like some analysis of why you believe that to be true.  I
would be happy to consider recognition of an occasional 302 response as
meaning "skip this document".  On the other hand, based on your stack
trace, it really appears that you have a far more systemic problem; it is
failing while obtaining information for an entire site, so not much would
get crawled in that case.
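
To make the "skip this document" idea concrete, here is a minimal sketch of
what such a policy could look like. It is purely hypothetical: the class, the
enum and the maxSkips limit are assumptions for illustration, not existing
ManifoldCF settings or connector code. An occasional 302 skips the document,
while repeated 302s are treated as a sign of a systemic problem and abort.

```java
/** Hypothetical policy: tolerate a limited number of 302 responses per job. */
final class SkipOn302Policy {
    enum Action { PROCESS, SKIP_DOCUMENT, ABORT_JOB }

    private final int maxSkips; // assumed limit, not a real ManifoldCF option
    private int skipped = 0;

    SkipOn302Policy(int maxSkips) {
        this.maxSkips = maxSkips;
    }

    /** Decides what to do with one response, counting the 302s seen so far. */
    Action onResponse(int statusCode) {
        if (statusCode == 200) {
            return Action.PROCESS;
        }
        if (statusCode == 302 && skipped < maxSkips) {
            skipped++;
            return Action.SKIP_DOCUMENT;
        }
        // Too many 302s, or any other status in this simplified sketch.
        return Action.ABORT_JOB;
    }

    int skippedCount() {
        return skipped;
    }

    public static void main(String[] args) {
        SkipOn302Policy policy = new SkipOn302Policy(2);
        int[] statuses = {200, 302, 200, 302, 302};
        for (int status : statuses) {
            System.out.println(status + " -> " + policy.onResponse(status));
        }
    }
}
```

A limit like this would not help when the 302 comes back from a site-level
call such as getSites, since there is no single document to skip in that case.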

Thanks,
Karl


On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Phil,
>
> It is not surprising that the connector doesn't like 302 responses and
> doesn't know what to do with them, because it isn't supposed to ever be
> getting any of these.
>
> I am puzzled by your statement that "only a couple of documents have
> redirections in them", because the connector crawls Lists and Library
> documents within SharePoint *only*, and these are very specifically
> accessible through a SharePoint URL hierarchy structure.  There's no room
> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
> feel pretty certain you have a problem with your configuration and it is
> not just "a couple of documents".
>
> Karl
>
>
> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
> priethmuller@funnelback.com> wrote:
>
>> Thanks Karl,
>>
>> The majority of content is not going to the redirect, it’s probably just
>> a handful of documents that are behaving this way.
>>
>> I’d agree that it’s of lesser concern whether or not the document itself
>> is indexing, however I wouldn’t expect the 302 to be treated as a fatal
>> error that causes the job to come to a halt. I’d expect the document to be
>> passed over, and the crawl to continue.
>>
>> Is the only solution at this point to remove the documents which redirect
>> to a 302 to get the crawl to run in full?
>>
>> Regards,
>>
>> *Phil Riethmuller*
>>
>>
>> From: Karl Wright <da...@gmail.com>
>> Reply-To: <us...@manifoldcf.apache.org>
>> Date: Wednesday, 17 February 2016 8:58 am
>>
>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi Phil,
>>
>> You probably want to point your SharePoint repository connection to the
>> proper server and site, and not rely on redirections.  It's also possible
>> that you are missing the site entirely and the redirection you are seeing
>> is taking you to some error page somewhere.
>>
>> I will be raising the question of redirections with the
>> HttpComponents/HttpClient team, since I see no obvious problems with the
>> SharePoint connector code.  However, if your connection is properly set up,
>> redirections should be unneeded.
>>
>> I would read the documentation on the Wiki page for debugging SharePoint
>> connections at the bottom of this page:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
>> priethmuller@funnelback.com> wrote:
>>
>>> Do you mean in the job status in the Manifold CF interface?
>>>
>>> The job status also shows the same:
>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>> (302)HTTP/1.0 302 Found
>>>
>>> I agree, I wouldn’t have thought that the crawler would follow any links
>>> or redirections.
>>>
>>> What sort of configurations could be incorrectly configured, that I
>>> could look at revising?
>>>
>>> Phil
>>>
>>>
>>> From: Karl Wright <da...@gmail.com>
>>> Reply-To: <us...@manifoldcf.apache.org>
>>> Date: Wednesday, 17 February 2016 8:45 am
>>>
>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Thanks.
>>>
>>> When you view the repository connection in the UI, do you get a 302
>>> error also?
>>>
>>> I have looked at the code; Httpclient is supposedly configured to honor
>>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>>> into why that is.  On the other hand, I would not expect you to be getting
>>> any redirections, unless you have configured your connection incorrectly.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>>> priethmuller@funnelback.com> wrote:
>>>
>>>> Thanks Karl -
>>>>
>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>> trace:
>>>>
>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed:
>>>> Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0
>>>> 302 Found
>>>>
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
>>>> http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>
>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>
>>>>         at
>>>> org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>
>>>>         at
>>>> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>
>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>
>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>
>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>
>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>
>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>
>>>>         at
>>>> com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>
>>>>         at
>>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> *Phil Riethmuller*
>>>>
>>>>
>>>> From: Karl Wright <da...@gmail.com>
>>>> Reply-To: <us...@manifoldcf.apache.org>
>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>
>>>> Hi Phil,
>>>>
>>>> A HTTP 302 response is simply a redirection.  It should not, by itself,
>>>> cause a job to abort.  I would expect that to go by in wire/http logging,
>>>> but you should not see it anywhere else.  So it is not clear to me what you
>>>> are really seeing here.
>>>>
>>>> Can you include an example stack trace from the manifoldcf log?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>>> priethmuller@funnelback.com> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
>>>>> which is causing the manifold job to abort. How do I prevent the crawler
>>>>> from aborting the job?
>>>>>
>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>
>>>>> Regards,
>>>>> Phil
>>>>>
>>>>
>>>>
>>>
>>
>

Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Hi Phil,

It is not surprising that the connector doesn't like 302 responses and
doesn't know what to do with them, because it isn't supposed to ever be
getting any of these.

I am puzzled by your statement that "only a couple of documents have
redirections in them", because the connector crawls Lists and Library
documents within SharePoint *only*, and these are very specifically
accessible through a SharePoint URL hierarchy structure.  There's no room
in any of that for a 302 redirection.  Since you see a 302 in the UI, I
feel pretty certain you have a problem with your configuration and it is
not just "a couple of documents".
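
One way to check this is to compare where a 302 points (the standard Location
response header) with the site URL configured on the connection. The sketch
below is an assumption-laden illustration, not connector code: the class and
method names are hypothetical and the example.com URLs are placeholders. It
uses only java.net.URI from the JDK.

```java
import java.net.URI;

/** Hypothetical helper: does a redirect target stay inside the configured site? */
final class RedirectTargetCheck {
    private RedirectTargetCheck() {}

    /**
     * Resolves the Location value against the site URL and checks that the
     * host matches and the path stays under the site's path.
     * URI.create throws IllegalArgumentException for malformed input.
     */
    static boolean staysWithinSite(String siteUrl, String location) {
        URI site = URI.create(siteUrl);
        URI target = site.resolve(location); // also handles relative Location values
        String siteHost = site.getHost();
        String targetHost = target.getHost();
        if (siteHost == null || targetHost == null) {
            return false;
        }
        String sitePath = withTrailingSlash(site.getPath());
        String targetPath = withTrailingSlash(target.getPath());
        return siteHost.equalsIgnoreCase(targetHost) && targetPath.startsWith(sitePath);
    }

    private static String withTrailingSlash(String path) {
        return path.endsWith("/") ? path : path + "/";
    }

    public static void main(String[] args) {
        String site = "http://sp.example.com/sites/docs";
        System.out.println(staysWithinSite(site, "/sites/docs/Lists/Tasks"));
        System.out.println(staysWithinSite(site, "http://login.example.org/auth"));
    }
}
```

A target on another host, or on an error or login page outside the site path,
would suggest a connection configuration problem rather than a few stray
documents.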

Karl


On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Thanks Karl,
>
> The majority of content is not going to the redirect, it’s probably just a
> handful of documents that are behaving this way.
>
> I’d agree that it’s of lesser concern whether or not the document itself
> is indexing, however I wouldn’t expect the 302 to be treated as a fatal
> error that causes the job to come to a halt. I’d expect the document to be
> passed over, and the crawl to continue.
>
> Is the only solution at this point to remove the documents which redirect
> to a 302 to get the crawl to run in full?
>
> Regards,
>
> *Phil Riethmuller*
>
>
> From: Karl Wright <da...@gmail.com>
> Reply-To: <us...@manifoldcf.apache.org>
> Date: Wednesday, 17 February 2016 8:58 am
>
> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject: Re: HTTP 302 error causing job to abort
>
> Hi Phil,
>
> You probably want to point your SharePoint repository connection to the
> proper server and site, and not rely on redirections.  It's also possible
> that you are missing the site entirely and the redirection you are seeing
> is taking you to some error page somewhere.
>
> I will be raising the question of redirections with the
> HttpComponents/HttpClient team, since I see no obvious problems with the
> SharePoint connector code.  However, if your connection is properly set up,
> redirections should be unneeded.
>
> I would read the documentation on the Wiki page for debugging SharePoint
> connections at the bottom of this page:
> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>
> Thanks,
> Karl
>
>
> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
> priethmuller@funnelback.com> wrote:
>
>> Do you mean in the job status in the Manifold CF interface?
>>
>> The job status also shows the same:
>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>> (302)HTTP/1.0 302 Found
>>
>> I agree, I wouldn’t have thought that the crawler would follow any links or
>> redirections.
>>
>> What sort of configurations could be incorrectly configured, that I could
>> look at revising?
>>
>> Phil
>>
>>
>> From: Karl Wright <da...@gmail.com>
>> Reply-To: <us...@manifoldcf.apache.org>
>> Date: Wednesday, 17 February 2016 8:45 am
>>
>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Thanks.
>>
>> When you view the repository connection in the UI, do you get a 302 error
>> also?
>>
>> I have looked at the code; Httpclient is supposedly configured to honor
>> redirections.  Obviously it is not doing that, so I'll have to dig deeper
>> into why that is.  On the other hand, I would not expect you to be getting
>> any redirections, unless you have configured your connection incorrectly.
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>> priethmuller@funnelback.com> wrote:
>>
>>> Thanks Karl -
>>>
>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>> trace:
>>>
>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed:
>>> Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0
>>> 302 Found
>>>
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
>>> http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>
>>>         at
>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>
>>>         at
>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>
>>>         at
>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>
>>> Caused by: (302)HTTP/1.0 302 Found
>>>
>>>         at
>>> org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>
>>>         at
>>> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>
>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>
>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>
>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>
>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>
>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>
>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>
>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>
>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>
>>>         at
>>> com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>
>>>         at
>>> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>
>>>
>>>
>>> Regards,
>>>
>>> *Phil Riethmuller*
>>>
>>>
>>> From: Karl Wright <da...@gmail.com>
>>> Reply-To: <us...@manifoldcf.apache.org>
>>> Date: Tuesday, 16 February 2016 6:54 pm
>>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi Phil,
>>>
>>> A HTTP 302 response is simply a redirection.  It should not, by itself,
>>> cause a job to abort.  I would expect that to go by in wire/http logging,
>>> but you should not see it anywhere else.  So it is not clear to me what you
>>> are really seeing here.
>>>
>>> Can you include an example stack trace from the manifoldcf log?
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>> priethmuller@funnelback.com> wrote:
>>>
>>>> Hi -
>>>>
>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
>>>> which is causing the manifold job to abort. How do I prevent the crawler
>>>> from aborting the job?
>>>>
>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>
>>>> Regards,
>>>> Phil
>>>>
>>>
>>>
>>
>

Re: HTTP 302 error causing job to abort

Posted by Phil Riethmuller <pr...@funnelback.com>.
Thanks Karl,

The majority of content is not going to the redirect; it's probably just a
handful of documents that are behaving this way.

I'd agree that it's of lesser concern whether or not the document itself is
indexed; however, I wouldn't expect the 302 to be treated as a fatal error
that causes the job to come to a halt. I'd expect the document to be passed
over, and the crawl to continue.

Is the only solution at this point to remove the documents that return
a 302 to get the crawl to run in full?

Regards,

Phil Riethmuller


From:  Karl Wright <da...@gmail.com>
Reply-To:  <us...@manifoldcf.apache.org>
Date:  Wednesday, 17 February 2016 8:58 am
To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Hi Phil,

You probably want to point your SharePoint repository connection to the
proper server and site, and not rely on redirections.  It's also possible
that you are missing the site entirely and the redirection you are seeing is
taking you to some error page somewhere.

I will be raising the question of redirections with the
HttpComponents/HttpClient team, since I see no obvious problems with the
SharePoint connector code.  However, if your connection is properly set up,
redirections should be unneeded.

I would read the documentation on the Wiki page for debugging SharePoint
connections at the bottom of this page:
https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections

Thanks,
Karl


On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller
<pr...@funnelback.com> wrote:
> Do you mean in the job status in the Manifold CF interface?
> 
> The job status also shows the same:
> Error: Unexpected http error code 302 accessing SharePoint at <url>:
> (302)HTTP/1.0 302 Found
> 
> I agree, I wouldn't have thought that the crawler would follow any links or
> redirections.
> 
> What sort of configurations could be incorrectly configured, that I could look
> at revising?
> 
> Phil
> 
> 
> From:  Karl Wright <da...@gmail.com>
> Reply-To:  <us...@manifoldcf.apache.org>
> Date:  Wednesday, 17 February 2016 8:45 am
> 
> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject:  Re: HTTP 302 error causing job to abort
> 
> Thanks.
> 
> When you view the repository connection in the UI, do you get a 302 error
> also?
> 
> I have looked at the code; Httpclient is supposedly configured to honor
> redirections.  Obviously it is not doing that, so I'll have to dig deeper into
> why that is.  On the other hand, I would not expect you to be getting any
> redirections, unless you have configured your connection incorrectly.
> 
> Karl
> 
> 
> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller
> <pr...@funnelback.com> wrote:
>> Thanks Karl -
>> 
>> I've replaced the actual URL with <URL> below, but here is the stack trace:
>> 
>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> Caused by: (302)HTTP/1.0 302 Found
>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>> 
>> 
>> 
>> 
>> Regards,
>> 
>> Phil Riethmuller
>> Technical Consultant
>>  
>> Funnelback | 437 Kent Street, Sydney, NSW 2000
>> T +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>> 
>> AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>> 
>> 
>> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>  -
>> Twitter
>> 
>> 
>> From:  Karl Wright <da...@gmail.com>
>> Reply-To:  <us...@manifoldcf.apache.org>
>> Date:  Tuesday, 16 February 2016 6:54 pm
>> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject:  Re: HTTP 302 error causing job to abort
>> 
>> Hi Phil,
>> 
>> A HTTP 302 response is simply a redirection.  It should not, by itself, cause
>> a job to abort.  I would expect that to go by in wire/http logging, but you
>> should not see it anywhere else.  So it is not clear to me what you are
>> really seeing here.
>> 
>> Can you include an example stack trace from the manifoldcf log?
>> 
>> Karl
>>  
>> 
>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
>> <pr...@funnelback.com> wrote:
>>> Hi -
>>> 
>>> When crawling a Sharepoint repository, I'm receiving a HTTP 302 error which
>>> is causing the manifold job to abort. How do I prevent the crawler from
>>> aborting the job?
>>> 
>>> I'm using v2.3 of Manifold with a postgres database.
>>> 
>>> Regards,
>>> Phil
>> 
> 




Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Hi Phil,

You probably want to point your SharePoint repository connection to the
proper server and site, and not rely on redirections.  It's also possible
that you are missing the site entirely and the redirection you are seeing
is taking you to some error page somewhere.

I will be raising the question of redirections with the
HttpComponents/HttpClient team, since I see no obvious problems with the
SharePoint connector code.  However, if your connection is properly set up,
redirections should be unneeded.

I would read the documentation on the Wiki page for debugging SharePoint
connections at the bottom of this page:
https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections

Thanks,
Karl


On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Do you mean in the job status in the Manifold CF interface?
>
> The job status also shows the same:
> Error: Unexpected http error code 302 accessing SharePoint at <url>:
> (302)HTTP/1.0 302 Found
>
> I agree, I wouldn't have thought that the crawler would follow any links or
> redirections.
>
> What sort of settings could be misconfigured that I could look at revising?
>
> Phil
>
>
> From: Karl Wright <da...@gmail.com>
> Reply-To: <us...@manifoldcf.apache.org>
> Date: Wednesday, 17 February 2016 8:45 am
>
> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject: Re: HTTP 302 error causing job to abort
>
> Thanks.
>
> When you view the repository connection in the UI, do you get a 302 error
> also?
>
> I have looked at the code; Httpclient is supposedly configured to honor
> redirections.  Obviously it is not doing that, so I'll have to dig deeper
> into why that is.  On the other hand, I would not expect you to be getting
> any redirections, unless you have configured your connection incorrectly.
>
> Karl
>
>
> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
> priethmuller@funnelback.com> wrote:
>
>> Thanks Karl -
>>
>> I’ve replaced the actual URL with <URL> below, but here is the stack
>> trace:
>>
>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> Caused by: (302)HTTP/1.0 302 Found
>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>
>>
>>
>> Regards,
>>
>> *Phil Riethmuller*
>> Technical Consultant
>>
>> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
>> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>>
>> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>>
>> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback> -
>> *Twitter*
>>
>>
>> From: Karl Wright <da...@gmail.com>
>> Reply-To: <us...@manifoldcf.apache.org>
>> Date: Tuesday, 16 February 2016 6:54 pm
>> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi Phil,
>>
>> A HTTP 302 response is simply a redirection.  It should not, by itself,
>> cause a job to abort.  I would expect that to go by in wire/http logging,
>> but you should not see it anywhere else.  So it is not clear to me what you
>> are really seeing here.
>>
>> Can you include an example stack trace from the manifoldcf log?
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>> priethmuller@funnelback.com> wrote:
>>
>>> Hi -
>>>
>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
>>> which is causing the manifold job to abort. How do I prevent the crawler
>>> from aborting the job?
>>>
>>> I’m using v2.3 of Manifold with a postgres database.
>>>
>>> Regards,
>>> Phil
>>>
>>
>>
>

Re: HTTP 302 error causing job to abort

Posted by Phil Riethmuller <pr...@funnelback.com>.
Do you mean in the job status in the Manifold CF interface?

The job status also shows the same:
Error: Unexpected http error code 302 accessing SharePoint at <url>:
(302)HTTP/1.0 302 Found

I agree, I wouldn't have thought that the crawler would follow any links or
redirections.

What sort of settings could be misconfigured that I could look at revising?

Phil


From:  Karl Wright <da...@gmail.com>
Reply-To:  <us...@manifoldcf.apache.org>
Date:  Wednesday, 17 February 2016 8:45 am
To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Thanks.

When you view the repository connection in the UI, do you get a 302 error
also?

I have looked at the code; Httpclient is supposedly configured to honor
redirections.  Obviously it is not doing that, so I'll have to dig deeper
into why that is.  On the other hand, I would not expect you to be getting
any redirections, unless you have configured your connection incorrectly.

Karl


On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller
<pr...@funnelback.com> wrote:
> Thanks Karl -
> 
> I've replaced the actual URL with <URL> below, but here is the stack trace:
> 
> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: (302)HTTP/1.0 302 Found
>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
> 
> 
> 
> 
> Regards,
> 
> Phil Riethmuller
> Technical Consultant
>  
> Funnelback | 437 Kent Street, Sydney, NSW 2000
> T +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
> 
> AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
> 
> 
> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>  -
> Twitter
> 
> 
> From:  Karl Wright <da...@gmail.com>
> Reply-To:  <us...@manifoldcf.apache.org>
> Date:  Tuesday, 16 February 2016 6:54 pm
> To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject:  Re: HTTP 302 error causing job to abort
> 
> Hi Phil,
> 
> A HTTP 302 response is simply a redirection.  It should not, by itself, cause
> a job to abort.  I would expect that to go by in wire/http logging, but you
> should not see it anywhere else.  So it is not clear to me what you are really
> seeing here.
> 
> Can you include an example stack trace from the manifoldcf log?
> 
> Karl
>  
> 
> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
> <pr...@funnelback.com> wrote:
>> Hi -
>> 
>> When crawling a Sharepoint repository, I'm receiving a HTTP 302 error which
>> is causing the manifold job to abort. How do I prevent the crawler from
>> aborting the job?
>> 
>> I'm using v2.3 of Manifold with a postgres database.
>> 
>> Regards,
>> Phil
> 




Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Thanks.

When you view the repository connection in the UI, do you get a 302 error
also?

I have looked at the code; HttpClient is supposedly configured to honor
redirections.  Obviously it is not doing that, so I'll have to dig deeper
into why that is.  On the other hand, I would not expect you to be getting
any redirections, unless you have configured your connection incorrectly.

Karl
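
For readers following along, the behaviour discussed above (a client that
does not follow a redirect hands the 302 back to its caller) can be
reproduced in isolation. The sketch below is illustrative only: it uses the
JDK's built-in java.net.http.HttpClient (Java 11 or later) and a throwaway
local server, not the Apache HttpClient and Axis stack that ManifoldCF uses,
and it issues a plain GET rather than a SOAP POST. The class name
RedirectDemo, the /old and /new paths, and the helper method names are all
hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RedirectDemo {

    /** Starts a local test server: /old answers 302 with Location: /new, /new answers 200. */
    public static HttpServer startServer() throws Exception {
        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/old", exchange -> {
            exchange.getResponseHeaders().add("Location", "/new");
            exchange.sendResponseHeaders(302, -1); // -1 means no response body
            exchange.close();
        });
        server.createContext("/new", exchange -> {
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Requests /old with the given redirect policy and returns the status the caller sees. */
    public static int fetchStatus(HttpClient.Redirect policy) throws Exception {
        HttpServer server = startServer();
        try {
            HttpClient client = HttpClient.newBuilder().followRedirects(policy).build();
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/old");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("NEVER:  " + fetchStatus(HttpClient.Redirect.NEVER));
        System.out.println("NORMAL: " + fetchStatus(HttpClient.Redirect.NORMAL));
    }
}
```

With Redirect.NEVER the caller receives the 302 itself, much like the status
reported in the stack trace; with Redirect.NORMAL the client follows the
Location header and the caller receives the final 200. Whether ManifoldCF's
SOAP requests get redirected is a separate question that this sketch does not
answer.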


On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Thanks Karl -
>
> I’ve replaced the actual URL with <URL> below, but here is the stack trace:
>
> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: (302)HTTP/1.0 302 Found
>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>
>
>
> Regards,
>
> *Phil Riethmuller*
> Technical Consultant
>
> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>
> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>
> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback> -
> *Twitter*
>
>
> From: Karl Wright <da...@gmail.com>
> Reply-To: <us...@manifoldcf.apache.org>
> Date: Tuesday, 16 February 2016 6:54 pm
> To: "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
> Subject: Re: HTTP 302 error causing job to abort
>
> Hi Phil,
>
> A HTTP 302 response is simply a redirection.  It should not, by itself,
> cause a job to abort.  I would expect that to go by in wire/http logging,
> but you should not see it anywhere else.  So it is not clear to me what you
> are really seeing here.
>
> Can you include an example stack trace from the manifoldcf log?
>
> Karl
>
>
> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
> priethmuller@funnelback.com> wrote:
>
>> Hi -
>>
>> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
>> which is causing the manifold job to abort. How do I prevent the crawler
>> from aborting the job?
>>
>> I’m using v2.3 of Manifold with a postgres database.
>>
>> Regards,
>> Phil
>>
>
>

Re: HTTP 302 error causing job to abort

Posted by Phil Riethmuller <pr...@funnelback.com>.
Thanks Karl -

I've replaced the actual URL with <URL> below, but here is the stack trace:

ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
Caused by: (302)HTTP/1.0 302 Found
        at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
        at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
        at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
        at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
        at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
        at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
        at org.apache.axis.client.Call.invoke(Call.java:2767)
        at org.apache.axis.client.Call.invoke(Call.java:2443)
        at org.apache.axis.client.Call.invoke(Call.java:2366)
        at org.apache.axis.client.Call.invoke(Call.java:1812)
        at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
        at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)




Regards,

Phil Riethmuller
Technical Consultant
 
Funnelback | 437 Kent Street, Sydney, NSW 2000
T +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>

AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES


Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>  -
Twitter


From:  Karl Wright <da...@gmail.com>
Reply-To:  <us...@manifoldcf.apache.org>
Date:  Tuesday, 16 February 2016 6:54 pm
To:  "user@manifoldcf.apache.org" <us...@manifoldcf.apache.org>
Subject:  Re: HTTP 302 error causing job to abort

Hi Phil,

A HTTP 302 response is simply a redirection.  It should not, by itself,
cause a job to abort.  I would expect that to go by in wire/http logging,
but you should not see it anywhere else.  So it is not clear to me what you
are really seeing here.

Can you include an example stack trace from the manifoldcf log?

Karl
 

On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
<pr...@funnelback.com> wrote:
> Hi -
> 
> When crawling a Sharepoint repository, I'm receiving a HTTP 302 error which is
> causing the manifold job to abort. How do I prevent the crawler from aborting
> the job?
> 
> I'm using v2.3 of Manifold with a postgres database.
> 
> Regards,
> Phil




Re: HTTP 302 error causing job to abort

Posted by Karl Wright <da...@gmail.com>.
Hi Phil,

A HTTP 302 response is simply a redirection.  It should not, by itself,
cause a job to abort.  I would expect that to go by in wire/http logging,
but you should not see it anywhere else.  So it is not clear to me what you
are really seeing here.

Can you include an example stack trace from the manifoldcf log?

Karl


On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
priethmuller@funnelback.com> wrote:

> Hi -
>
> When crawling a Sharepoint repository, I’m receiving a HTTP 302 error
> which is causing the manifold job to abort. How do I prevent the crawler
> from aborting the job?
>
> I’m using v2.3 of Manifold with a postgres database.
>
> Regards,
> Phil
>