Posted to user@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2012/06/28 11:26:02 UTC

Re: Crawling behind an ISA proxy (iis 7.5)

I was wondering if you'd picked up and tried the patch for
CONNECTORS-483.  This patch adds official proxy support for the Web
Connector.  Alternatively, you could try to build and run with trunk
code.

Karl

On Wed, May 16, 2012 at 12:12 PM, Karl Wright <da...@gmail.com> wrote:
> Hi Rene,
>
> The URL that is causing the RFC2617 challenge/response is being
> authenticated with basic auth, not NTLM.  This could yield a 401.  You
> may want to check the URL in a browser other than IE (Firefox, for
> instance) to see if basic auth is being used for this URL rather than
> NTLM.
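>
> (For reference, a minimal sketch of this distinction with Commons
> HttpClient 3.x, the library that appears in the stack trace further
> down this thread.  The host is the one from this thread; the
> credentials, workstation, and domain values are purely illustrative
> assumptions.)
>
> import org.apache.commons.httpclient.HttpClient;
> import org.apache.commons.httpclient.NTCredentials;
> import org.apache.commons.httpclient.UsernamePasswordCredentials;
> import org.apache.commons.httpclient.auth.AuthScope;
>
> public class CredentialsSketch {
>   public static void main(String[] args) {
>     HttpClient client = new HttpClient();
>     // NTLM needs a workstation and a domain in addition to the
>     // user name and password.
>     client.getState().setCredentials(
>         new AuthScope("bb.helo.hanze.nl", 443),
>         new NTCredentials("loginname", "mypassword", "MYHOST", "MYDOMAIN"));
>     // Basic auth, by contrast, takes plain user/password credentials;
>     // an RFC2617 "Basic" challenge expects these instead.
>     // client.getState().setCredentials(
>     //     new AuthScope("bb.helo.hanze.nl", 443),
>     //     new UsernamePasswordCredentials("loginname", "mypassword"));
>   }
> }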
>
> The redirection you describe to GetLogon is pretty standard practice.
> You can easily tell the web connector that that is part of the logon
> sequence by following the steps I laid out in the earlier email.
>
> Once you have set up what you think is the right set of logon pages,
> it's very helpful to attempt a crawl and then see what the simple
> history shows.  There are specific activities logged when logon begins
> and ends, so this is enormously helpful as a diagnostic aid.  If you
> see a continuous loop (entering logon sequence, doing stuff, exiting
> logon sequence, and repeating) then it is clear that the cookie has
> not been set.
>
> I won't be able to look at your packet log for a while, probably at
> least a week.
>
> Karl
>
> On Wed, May 16, 2012 at 10:23 AM, Rene Nederhand <re...@nederhand.net> wrote:
>> Hi Karl,
>>
>> Thank you so much for putting so much time into educating a newbie. I
>> appreciate your help enormously.
>>
>> I've tried to follow each of the steps below. So far it doesn't work, but
>> I will continue this evening to see if I can get this thing going.
>>
>> In the meantime, I have switched the log level of the crawling process to
>> "INFO" and found something interesting in the logs. Perhaps this could shed
>> some light on my issues:
>>
>> ERROR 2012-05-16 16:04:13,581 (Thread-1019) - Invalid challenge: Basic
>> org.apache.commons.httpclient.auth.MalformedChallengeException: Invalid challenge: Basic
>>   at org.apache.commons.httpclient.auth.AuthChallengeParser.extractParams(Unknown Source)
>>   at org.apache.commons.httpclient.auth.RFC2617Scheme.processChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.auth.BasicScheme.processChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.auth.AuthChallengeProcessor.processChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.HttpMethodDirector.processWWWAuthChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.HttpMethodDirector.processAuthenticationResponse(Unknown Source)
>>   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
>>   at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
>>   at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
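>>
>> (As an aside: a minimal sketch that reproduces this exact error with
>> Commons HttpClient 3.x, assuming the server sent a bare "Basic"
>> challenge with no realm parameter, which the challenge parser rejects
>> as malformed.)
>>
>> import org.apache.commons.httpclient.auth.BasicScheme;
>> import org.apache.commons.httpclient.auth.MalformedChallengeException;
>>
>> public class ChallengeRepro {
>>   public static void main(String[] args) {
>>     try {
>>       // A well-formed challenge would look like: Basic realm="intranet"
>>       new BasicScheme().processChallenge("Basic");
>>     } catch (MalformedChallengeException e) {
>>       System.out.println(e.getMessage()); // prints: Invalid challenge: Basic
>>     }
>>   }
>> }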
>>
>> Please note that I have set NTLM (not BASIC) authentication on
>> "bb.helo.hanze.nl" and nothing else. The error does not occur when I try to
>> crawl our intranet (also with NTLM). Does this mean something? At the very
>> least, I think it is the source of the 401 I see in the simple report,
>> isn't it?
>>
>> In addition, I've used Charles proxy to monitor all interaction between my
>> browser and the server. I have found that it doesn't matter which URL I use
>> to enter Blackboard; they all get redirected to
>> https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon. Shouldn't page-based
>> authentication handle this?
>>
>> To make the information complete, I've attached the HAR file with the
>> Charles proxy output. It can be displayed at
>> http://www.softwareishard.com/har/viewer/, for example. You'll be able to
>> see all requests/responses when I start with a clean browser (cookies
>> removed) and enter https://bb.helo.hanze.nl. Maybe this helps.
>>
>> Again, thanks a lot for your help!
>>
>> René
>>
>> On Tue, May 15, 2012 at 5:59 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> Hi Rene,
>>>
>>> You will need both NTLM auth (page auth, which you have already set
>>> up), and Session auth (which you haven't yet set up).
>>>
>>> In order to set up session-based auth, you should first identify the
>>> set of pages that you want access to that are protected by a cookie
>>> requirement.  You will need to write a regular expression that matches
>>> these pages and ONLY these pages.  This expression gets entered as the
>>> "URL regular expression" in the Session-based Access Credentials section
>>> of the Access Credentials tab.  Then, click the Add button.
>>>
>>> The next thing you will need to do is specify how the connector
>>> recognizes pages that belong to the logon sequence.  The actual
>>> sequence you need to understand is what happens in the browser when
>>> you try to access a specific protected URL and you don't have the
>>> right cookie.  You did not actually specify that; I think you are
>>> presuming that you'd be entering directly through the logon page, but
>>> that is not how it works.  The crawler will have a URL in mind and
>>> will need access to the content of that URL.  It will fetch the URL,
>>> and if the actual content is NOT fetched, we need to detect that
>>> situation and consider it part of the logon sequence.
>>>
>>> So let's pretend that what happens when the cookie is not present is
>>> that you get a redirection to the logon page, instead of the actual
>>> page content.  In that case, you would create a login sequence page
>>> description consisting of the same URL regular expression that
>>> describes the protected content pages, plus the "redirection" radio
>>> button, plus a target URL regular expression that would match
>>> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  You then click the Add
>>> button for login pages to add that description to the set of login
>>> pages.
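>>>
>>> (A brief aside: these fields take regular expressions, so '.' and '?'
>>> are metacharacters and should be escaped when the target URL contains
>>> them.  A minimal sketch of the idea, using java.util.regex to test the
>>> pattern text; that the connector applies find-anywhere match semantics
>>> like this is an assumption here.)
>>>
>>> import java.util.regex.Pattern;
>>>
>>> public class RegexCheck {
>>>   public static void main(String[] args) {
>>>     // Escaped form of the GetLogon target described above.
>>>     Pattern logon =
>>>         Pattern.compile("bb\\.helo\\.hanze\\.nl/CookieAuth\\.dll\\?GetLogon");
>>>     // find() looks for the pattern anywhere in the fetched URL.
>>>     System.out.println(logon.matcher(
>>>         "https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F").find());
>>>     // prints: true
>>>   }
>>> }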
>>>
>>> Next, the GetLogon page itself needs to be added as a login sequence
>>> page.  The regular expression should match only
>>> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  The type of the page is
>>> "form" because you said this was a form where you could fill in your
>>> login credentials.  If there is only one form on the page you can
>>> leave the regexp that matches the form name blank since that will
>>> match everything.  Once you click "Add" for this page, you will have
>>> the opportunity to fill in form names and values to post when the form
>>> gets posted.
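>>>
>>> (For example, judging from the form capture later in this thread, the
>>> values to post would presumably be username, password, and
>>> SubmitCreds=Log On, with the hidden fields curl, flags, forcedownlevel,
>>> formdir, and trusted left at their page-supplied defaults; treat that
>>> as an assumption to verify against your own capture.)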
>>>
>>> It was not clear from your description, once again, what happens after
>>> the Logon page is posted.  If there is a special target page, you need
>>> to include that also in the login sequence so that its content is not
>>> taken.  If there is a redirection back to the original content page,
>>> you'd include that redirection.
>>>
>>> Hopefully this is beginning to make a bit of sense to you; but this is
>>> the general picture, not related to your actual site that closely.
>>> For example, the Javascript redirection you mentioned will not be
>>> processed by ManifoldCF, but that is unnecessary because at the end of
>>> the whole login sequence ManifoldCF automatically goes back to the
>>> original URL when the login sequence is chased to its end.  So all you
>>> need to do is make sure that all pages that are part of that sequence
>>> are specified.
>>>
>>> On the other hand, it's not clear that the code you have "protecting"
>>> the site sets cookies any other way than through Javascript.  The
>>> cookie that this Javascript actually sets is a really stupid
>>> non-specific cookie, but unless it is set by the standard response
>>> header method, I don't think it's going to wind up being set at all.
>>> Can you confirm that this is the only way the cookie gets set?
>>>
>>> Karl
>>>
>>> On Tue, May 15, 2012 at 10:57 AM, Rene Nederhand <re...@nederhand.net>
>>> wrote:
>>> > Hi Karl,
>>> >
>>> > Thank you so much for your detailed explanation. I am trying each
>>> > step you've pointed out. Unfortunately, I cannot get this thing going.
>>> > Hopefully you can help me if I give you more detailed information.
>>> >
>>> > The sequence of steps is (when accessing https://bb.helo.hanze.nl):
>>> >
>>> > 1.
>>> > https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
>>> > This indeed gives me NTLM authentication. When I create a crawler that
>>> > only crawls the above page, I get a 200 response. So this works; no
>>> > 401.
>>> >
>>> > 2. I submit my username and password, and this request is sent to the
>>> > server. This is also the only form I'll ever see:
>>> >
>>> > https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302)
>>> > Request:
>>> > curl    Z2F
>>> > flags   0
>>> > forcedownlevel  0
>>> > formdir 3
>>> > trusted 0
>>> > username        loginname
>>> > password        mypassword
>>> > SubmitCreds     Log On
>>> >
>>> > 3. The response sets a cookie and redirects to the first URL (but now
>>> > with the cookie set):
>>> >
>>> > Response:
>>> >        HTTP/1.1 302 Moved Temporarily
>>> > Location        https://bb.helo.hanze.nl/
>>> > Set-Cookie      noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"; HttpOnly; Domain=.hanze.nl; secure; path=/
>>> > Content-Length  0
>>> > Connection      close
>>> >
>>> > Request:
>>> >        GET / HTTP/1.1
>>> > Host    bb.helo.hanze.nl
>>> > User-Agent      Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0)
>>> > Gecko/20100101 Firefox/12.0
>>> > Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>> > Accept-Language en-us,en;q=0.5
>>> > Accept-Encoding gzip, deflate
>>> > Connection      keep-alive
>>> > Referer
>>> > https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
>>> > Cookie  noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"
>>> >
>>> > 4. Lastly, a redirect is made to the Blackboard site (a JavaScript
>>> > cookie check followed by a redirect):
>>> >
>>> > Response:
>>> > <HTML dir='ltr'><HEAD>
>>> > <META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META
>>> > HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
>>> > <script language="Javascript">
>>> >  cookie_name = "cookies_enabled";
>>> >  document.cookie=cookie_name+"=yes";
>>> >  if (!document.cookie) {
>>> >    document.location.href="/nocookies.html";
>>> >  }
>>> >  document.cookie=cookie_name+"yes;expires=Thu, 01-Jan-1970 00:00:01
>>> > GMT";
>>> > </script>
>>> > <SCRIPT language="Javascript"><!--
>>> >
>>> > document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp');
>>> > //--></SCRIPT></HEAD>
>>> > <BODY BGCOLOR='#FFFFFF' LINK='#000000' ALINK='#000000'>
>>> > <br><br><br><br><div style="text-align: center;"><hr width='350'
>>> > height='5'><br>
>>> > <strong>You are being redirected to another page</strong>
>>> > <p><strong>Please Wait...</strong><br><br><hr width='350' height='5'>
>>> > <br><A
>>> > HREF='https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'><strong>Click
>>> > here to access the page to which you are being
>>> > forwarded.</strong></A></div>
>>> > </BODY></HTML>
>>> >
>>> > Although the first form used NTLM authentication, this doesn't work
>>> > out. Therefore, I would think that session-based auth would work
>>> > better, as I can configure each step myself. I still don't have a clue
>>> > how to approach this, though. What do I fill in for those boxes?
>>> >
>>> > Thanks for helping me.
>>> >
>>> > Cheers,
>>> > René
>>> >
>>> > On Fri, May 11, 2012 at 4:26 PM, Karl Wright <da...@gmail.com> wrote:
>>> >> Hi Rene,
>>> >>
>>> >> Crawling through a proxy is usually easy, but crawling a session-based
>>> >> site is always a challenge.
>>> >>
>>> >> ISA proxies usually authenticate with NTLM.  So you will want to set
>>> >> up your web connection with NTLM authentication in order to even be
>>> >> able to reach the pages.  It's not clear that you've got that right
>>> >> yet, because if you don't have it right you will get 401 errors back.
>>> >> Getting this right is a prerequisite; you won't be able to proceed
>>> >> until it is correct.  To see that you do, try a very limited crawl
>>> >> that fetches ONLY the login page (or some other un-session-protected
>>> >> content).  If you get a 401 you'll need to figure out what's not right
>>> >> before proceeding.
>>> >>
>>> >> It sounds like the site may also be secured using session-based
>>> >> authentication.  If a cookie is involved then you need to configure
>>> >> session auth in order to get to any session-protected pages.  The
>>> >> trick is that, for session-based auth, you need to fully understand
>>> >> the sequence of pages and forms that happen when a user visits the
>>> >> site and is granted the cookie(s) - the login process, what content
>>> >> URLs are protected, what URLs are part of the login sequence, etc.
>>> >> The end-user documentation describes this in some detail.  It can be a
>>> >> challenge to get it all set up right.
>>> >>
>>> >> Finally, for SharePoint sites, if you are intending to index
>>> >> documents, you might well find the SharePoint Connector a better
>>> >> choice than trying to crawl the site with the web connector.
>>> >>
>>> >> Thanks,
>>> >> Karl
>>> >>
>>> >> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <re...@nederhand.net>
>>> >> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I am trying to get ManifoldCF to crawl our electronic learning
>>> >>> environment (Blackboard). To enable single sign-on, our institution
>>> >>> has placed an ISA server as a proxy in front of Blackboard.
>>> >>> This is giving me a lot of problems.
>>> >>>
>>> >>> I've managed to get past the ISA server using session-based
>>> >>> authentication, but then I am stuck at a 401 error message. According
>>> >>> to our architect, ISA is responsible for the communication with
>>> >>> Blackboard and will set a cookie so Blackboard knows that a
>>> >>> legitimate user is accessing its service. I think ManifoldCF is not
>>> >>> able to handle this cookie and hence is not able to access Blackboard.
>>> >>> Am I right? If so, is there a possibility to get Blackboard indexed?
>>> >>>
>>> >>> By the way, the same authentication is used for our SharePoint. I
>>> >>> would like to index that as well...
>>> >>>
>>> >>> Any help on solving this problem is appreciated.
>>> >>>
>>> >>> Cheers,
>>> >>>
>>> >>> René
>>
>>

Re: Crawling behind an ISA proxy (iis 7.5)

Posted by Jan van Haarst <ja...@vanhaarst.net>.
Dear All,

We are now able to connect to the IIS proxy. Thanks to the logging
facilities Karl added, we were able to see that this is the fix:

Index: connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
===================================================================
--- connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (revision 1357379)
+++ connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (working copy)
@@ -361,7 +361,7 @@
       String emailAddress = params.getParameter(WebcrawlerConfig.PARAMETER_EMAIL);
       if (emailAddress == null)
         throw new ManifoldCFException("Missing email address");
-      userAgent = "ApacheManifoldCFWebCrawler; "+emailAddress+")";
+      userAgent = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; "+emailAddress+")";
       from = emailAddress;
 
       x = params.getParameter(WebcrawlerConfig.PARAMETER_ROBOTSUSAGE);

Yes, this is weird; a proxy shouldn't reject requests based on the
User-Agent header, but apparently this one does. Even Googlebot
identifies itself with a Mozilla/5.0-prefixed User-Agent:
http://www.useragentstring.com/pages/Googlebot/
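
(To double-check a fix like this outside ManifoldCF, here is a minimal
standalone sketch using Commons HttpClient 3.x, the library the web
connector used at the time. The email address is illustrative, and the
expectation that the proxy accepts the request once the Mozilla/5.0
prefix is present is an assumption based on the behavior described
above.)

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class UserAgentProbe {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    // The patched User-Agent value from the diff above.
    client.getParams().setParameter(HttpMethodParams.USER_AGENT,
        "Mozilla/5.0 (ApacheManifoldCFWebCrawler; user@example.com)");
    GetMethod get = new GetMethod("https://bb.helo.hanze.nl/");
    try {
      int status = client.executeMethod(get);
      System.out.println("HTTP status: " + status);
    } finally {
      get.releaseConnection();
    }
  }
}
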
Now we 'just' have to get the crawling working, but the main (unique)
hurdle has been cleared!

Karl, a big thank you for your help, and for the openssl s_client
suggestion that enabled us to debug this.

Dag,
Jan

On Thu, Jun 28, 2012 at 11:05 PM, Jan van Haarst <ja...@vanhaarst.net> wrote:

> On Thu, Jun 28, 2012 at 11:26 AM, Karl Wright <da...@gmail.com> wrote:
>
>> I was wondering if you'd picked up and tried the patch for
>> CONNECTORS-483.  This patch adds official proxy support for the Web
>> Connector.  Alternatively, you could try to build and run with trunk
>> code.
>>
>> Karl
>>
>
> I'm going the build-from-trunk route, and all seems to go well up to the
> creation of the zip and tar.gz files.
> Is there anything special to do after running the build process like this?
>
> ant clean clean-core-deps clean-deps && ant make-core-deps make-deps build
> && ant image
>
> Did I miss anything?
> If not, I'll replace the old binary installation with my source-built one,
> and see where it leads me.
>
> --
> Dag,
> Jan
>



-- 
Dag,
Jan

Re: Crawling behind an ISA proxy (iis 7.5)

Posted by Jan van Haarst <ja...@vanhaarst.net>.
On Thu, Jun 28, 2012 at 11:26 AM, Karl Wright <da...@gmail.com> wrote:

> I was wondering if you'd picked up and tried the patch for
> CONNECTORS-483.  This patch adds official proxy support for the Web
> Connector.  Alternatively, you could try to build and run with trunk
> code.
>
> Karl
>

I'm going the build-from-trunk route, and all seems to go well up to the
creation of the zip and tar.gz files.
Is there anything special to do after running the build process like this?

ant clean clean-core-deps clean-deps && ant make-core-deps make-deps build
&& ant image

Did I miss anything?
If not, I'll replace the old binary installation with my source-built one,
and see where it leads me.

-- 
Dag,
Jan