Posted to user@manifoldcf.apache.org by Jan van Haarst <ja...@vanhaarst.net> on 2012/07/08 12:39:37 UTC

Re: Crawling behind an ISA proxy (iis 7.5)

Dear All,

We are now able to connect to the IIS proxy. Thanks to the logging
facilities Karl added, we could see that this is the fix:

Index: connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
===================================================================
--- connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (revision 1357379)
+++ connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (working copy)
@@ -361,7 +361,7 @@
       String emailAddress = params.getParameter(WebcrawlerConfig.PARAMETER_EMAIL);
       if (emailAddress == null)
         throw new ManifoldCFException("Missing email address");
-      userAgent = "ApacheManifoldCFWebCrawler; "+emailAddress+")";
+      userAgent = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; "+emailAddress+")";
       from = emailAddress;

       x = params.getParameter(WebcrawlerConfig.PARAMETER_ROBOTSUSAGE);

Yes, this is weird: a proxy shouldn't fail on User-Agent settings, but
apparently this one does.
Even Googlebot apparently sends a "Mozilla/5.0" prefix in its User-Agent:
http://www.useragentstring.com/pages/Googlebot/
Now we 'just' have to get the crawling working, but the main (and only)
hurdle has now been cleared!
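
For anyone who wants to compare the two header values without rebuilding
the connector, here is a minimal standalone sketch. It only mirrors the
string concatenation from the patch above; the class name, the method
names, and the example address are made up for illustration and are not
part of ManifoldCF.

```java
public class UserAgentSketch {
    // Hypothetical helper: builds the value the patched line produces.
    static String buildUserAgent(String emailAddress) {
        if (emailAddress == null) {
            throw new IllegalArgumentException("Missing email address");
        }
        return "Mozilla/5.0 (ApacheManifoldCFWebCrawler; " + emailAddress + ")";
    }

    // Hypothetical helper: builds the value from before the patch.
    // Note that it ends in ")" without ever opening a parenthesis.
    static String buildOldUserAgent(String emailAddress) {
        return "ApacheManifoldCFWebCrawler; " + emailAddress + ")";
    }

    public static void main(String[] args) {
        String email = "crawler-admin@example.org"; // placeholder address
        System.out.println("old: " + buildOldUserAgent(email));
        System.out.println("new: " + buildUserAgent(email));
    }
}
```

Printing both side by side shows that the old value had an unbalanced
closing parenthesis, while the new one has the usual
"product/version (comment)" shape.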

Karl, a big Thank You for your help, and for the openssl s_client that
enabled us to debug this.
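
In case it helps others debug a similar setup outside ManifoldCF, below is
a rough sketch of a probe that sends a HEAD request with each of the two
User-Agent values and prints the status code it gets back. It is not what
we actually ran (we used openssl s_client); the class name, the command
line arguments, and the timeouts are my own assumptions, and it uses only
the standard java.net classes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class UserAgentProbe {
    // Prepares (but does not yet open) a HEAD request with the given User-Agent.
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
            conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java UserAgentProbe <url> <email>");
            return;
        }
        String url = args[0];
        String email = args[1];
        String[] agents = {
            "ApacheManifoldCFWebCrawler; " + email + ")",           // before the patch
            "Mozilla/5.0 (ApacheManifoldCFWebCrawler; " + email + ")" // after the patch
        };
        for (String agent : agents) {
            HttpURLConnection conn = prepare(url, agent);
            try {
                // getResponseCode() is what actually sends the request.
                System.out.println(conn.getResponseCode() + "  " + agent);
            } finally {
                conn.disconnect();
            }
        }
    }
}
```

If the two requests come back with different status codes, that points at
the User-Agent check rather than at authentication or TLS.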

Dag,
Jan

On Thu, Jun 28, 2012 at 11:05 PM, Jan van Haarst <ja...@vanhaarst.net> wrote:

> On Thu, Jun 28, 2012 at 11:26 AM, Karl Wright <da...@gmail.com> wrote:
>
>> I was wondering if you'd picked up and tried the patch for
>> CONNECTORS-483.  This patch adds official proxy support for the Web
>> Connector.  Alternatively, you could try to build and run with trunk
>> code.
>>
>> Karl
>>
>
> I'm going the build-from-trunk route, and all seems to go well up to the
> creation of the zip and tar.gz files.
> Is there anything special to do after running the build process like this?
>
> ant clean clean-core-deps clean-deps && ant make-core-deps make-deps build && ant image
>
> Did I miss anything?
> If not, I'll replace the old binary installation with my source-built one,
> and see where it leads me.
>
> --
> Dag,
> Jan
>