Posted to user@manifoldcf.apache.org by Jan van Haarst <ja...@vanhaarst.net> on 2012/06/18 14:09:36 UTC

Crawling behind an ISA proxy (iis 7.5) revisited

Hello all,

I'm a colleague of the original poster [1].

We got a lot further in figuring out the flow of the website, and thus the
way ManifoldCF should crawl it.
In that process, we discovered that our problem might lie with
httpclient, as the server responds with a 401.2 response because the
client doesn't send authentication headers, as mentioned in [2].

My question is this:
Is the raw response of the server stored somewhere in case of a 401 return
code?
If so, I can check whether my idea is right, and after that try to fix it.

With kind regards,

Jan van Haarst

[1]
http://mail-archives.apache.org/mod_mbox/incubator-connectors-user/201205.mbox/%3CCAFxWV0WY_Vojsshbfr0PSs%3DG-Xpd1wUJXFcbVVsOvntbXs1zRg%40mail.gmail.com%3E

[2]
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/8feeaa51-c634-4de3-bfdc-e922d195a45e.mspx?mfr=true
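
One way to check the idea above, once the raw 401 response is available, is to
look at which authentication schemes the server offers in its WWW-Authenticate
headers. The following is only a minimal sketch: the class and method names are
hypothetical and are not part of ManifoldCF or httpclient, and it assumes the
server sends one challenge per header line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ManifoldCF or httpclient. */
class AuthChallenges {
    /**
     * Returns the scheme name (the first token) of each WWW-Authenticate value.
     * Assumption: each header value carries exactly one challenge.
     */
    static List<String> schemes(List<String> headerValues) {
        List<String> result = new ArrayList<String>();
        for (String value : headerValues) {
            if (value == null) {
                continue;
            }
            String trimmed = value.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // The scheme is everything up to the first run of whitespace.
            String[] parts = trimmed.split("\\s+", 2);
            result.add(parts[0]);
        }
        return result;
    }
}
```

If the list contains a scheme such as NTLM or Negotiate and the client never
answers with an Authorization header for it, that would support the suspicion
that the client side is not sending credentials.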

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Karl Wright <da...@gmail.com>.
I've created CONNECTORS-482 to cover the addition of a feature that
would allow the body to be included, at least in part, in the activity
record (so you can see it in the Simple History report).

Karl
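
To illustrate what "included, at least in part" could look like, here is a
minimal sketch that reads only the first part of a response body so it can be
attached to a log or history entry. It is not the CONNECTORS-482 patch; the
class name, method name, and the maxChars parameter are assumptions made for
this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical helper for illustration; not the actual ManifoldCF code. */
class BodyExcerpt {
    /**
     * Reads at most maxChars characters from the stream and returns them.
     * The caller keeps ownership of the stream and is responsible for closing it.
     */
    static String read(InputStream in, Charset charset, int maxChars) {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            while (sb.length() < maxChars) {
                // Never request more than the remaining budget.
                int want = Math.min(buf.length, maxChars - sb.length());
                int n = reader.read(buf, 0, want);
                if (n < 0) {
                    break; // end of stream
                }
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Capping the excerpt keeps a large error page from bloating the record while
still showing the start of the server's explanation.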

On Mon, Jun 18, 2012 at 8:37 AM, Karl Wright <da...@gmail.com> wrote:
> Another way to proceed at this point, if the connection is not an SSL
> connection, would be to use Wireshark to log the interaction.  Seeing
> headers is trivial and would tell you everything you need to know
> without patching the code.
>
> Karl
>
> On Mon, Jun 18, 2012 at 8:21 AM, Karl Wright <da...@gmail.com> wrote:
>> HTTPClient 3.1 itself does not seem to provide a logging option for
>> logging the body.  However, it should be straightforward to add this
>> to the ManifoldCF code.  What version are you running, so that I can
>> provide the appropriate patch?
>>
>> Karl

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Karl Wright <da...@gmail.com>.
Another way to proceed at this point, if the connection is not an SSL
connection, would be to use Wireshark to log the interaction.  Seeing
headers is trivial and would tell you everything you need to know
without patching the code.

Karl

On Mon, Jun 18, 2012 at 8:21 AM, Karl Wright <da...@gmail.com> wrote:
> HTTPClient 3.1 itself does not seem to provide a logging option for
> logging the body.  However, it should be straightforward to add this
> to the ManifoldCF code.  What version are you running, so that I can
> provide the appropriate patch?
>
> Karl

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Karl Wright <da...@gmail.com>.
Posted two patches for CONNECTORS-482 also, if you want enhanced debugging.

Karl

On Mon, Jun 18, 2012 at 5:24 PM, Karl Wright <da...@gmail.com> wrote:
> Please add the patch for CONNECTORS-483.  This adds the NT proxy
> feature, ported from the RSS connector.
>
> Karl
>
> On Mon, Jun 18, 2012 at 2:10 PM, Jan van Haarst <ja...@vanhaarst.net> wrote:
>> OK, we'll do.
>>
>> On Mon, Jun 18, 2012 at 3:18 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> I'll be committing any changes to trunk.  I'm happy to also include a
>>> patch, which should work with 0.5-incubating, but you'll need to build
>>> it, of course, with the patch in place.
>>>
>>> Karl
>>>
>>> On Mon, Jun 18, 2012 at 9:09 AM, Jan van Haarst <ja...@vanhaarst.net> wrote:
>>> > Hello Karl,
>>> >
>>> > The version we have running is ManifoldCF 0.5-incubating.
>>> > It would be great to be able to get to the bottom of this.
>>> >
>>> > Dag,
>>> > Jan
>>> >
>>> > On Mon, Jun 18, 2012 at 2:21 PM, Karl Wright <da...@gmail.com> wrote:
>>> >>
>>> >> HTTPClient 3.1 itself does not seem to provide a logging option for
>>> >> logging the body.  However, it should be straightforward to add this
>>> >> to the ManifoldCF code.  What version are you running, so that I can
>>> >> provide the appropriate patch?
>>> >>
>>> >> Karl
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Dag,
>>> > Jan
>>
>>
>>
>>
>> --
>> Dag,
>> Jan

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Karl Wright <da...@gmail.com>.
Please add the patch for CONNECTORS-483.  This adds the NT proxy
feature, ported from the RSS connector.

Karl

On Mon, Jun 18, 2012 at 2:10 PM, Jan van Haarst <ja...@vanhaarst.net> wrote:
> OK, we'll do.
>
> On Mon, Jun 18, 2012 at 3:18 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> I'll be committing any changes to trunk.  I'm happy to also include a
>> patch, which should work with 0.5-incubating, but you'll need to build
>> it, of course, with the patch in place.
>>
>> Karl
>>
>> On Mon, Jun 18, 2012 at 9:09 AM, Jan van Haarst <ja...@vanhaarst.net> wrote:
>> > Hello Karl,
>> >
>> > The version we have running is ManifoldCF 0.5-incubating.
>> > It would be great to be able to get to the bottom of this.
>> >
>> > Dag,
>> > Jan
>> >
>> > On Mon, Jun 18, 2012 at 2:21 PM, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >> HTTPClient 3.1 itself does not seem to provide a logging option for
>> >> logging the body.  However, it should be straightforward to add this
>> >> to the ManifoldCF code.  What version are you running, so that I can
>> >> provide the appropriate patch?
>> >>
>> >> Karl
>> >
>> >
>> >
>> >
>> > --
>> > Dag,
>> > Jan
>
>
>
>
> --
> Dag,
> Jan

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Jan van Haarst <ja...@vanhaarst.net>.
OK, will do.

On Mon, Jun 18, 2012 at 3:18 PM, Karl Wright <da...@gmail.com> wrote:

> I'll be committing any changes to trunk.  I'm happy to also include a
> patch, which should work with 0.5-incubating, but you'll need to build
> it, of course, with the patch in place.
>
> Karl
>
> On Mon, Jun 18, 2012 at 9:09 AM, Jan van Haarst <ja...@vanhaarst.net> wrote:
> > Hello Karl,
> >
> > The version we have running is ManifoldCF 0.5-incubating.
> > It would be great to be able to get to the bottom of this.
> >
> > Dag,
> > Jan
> >
> > On Mon, Jun 18, 2012 at 2:21 PM, Karl Wright <da...@gmail.com> wrote:
> >>
> >> HTTPClient 3.1 itself does not seem to provide a logging option for
> >> logging the body.  However, it should be straightforward to add this
> >> to the ManifoldCF code.  What version are you running, so that I can
> >> provide the appropriate patch?
> >>
> >> Karl
> >
> >
> >
> >
> > --
> > Dag,
> > Jan
>



-- 
Dag,
Jan

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Karl Wright <da...@gmail.com>.
I'll be committing any changes to trunk.  I'm happy to also include a
patch, which should work with 0.5-incubating, but you'll need to build
it, of course, with the patch in place.

Karl

On Mon, Jun 18, 2012 at 9:09 AM, Jan van Haarst <ja...@vanhaarst.net> wrote:
> Hello Karl,
>
> The version we have running is ManifoldCF 0.5-incubating.
> It would be great to be able to get to the bottom of this.
>
> Dag,
> Jan
>
> On Mon, Jun 18, 2012 at 2:21 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> HTTPClient 3.1 itself does not seem to provide a logging option for
>> logging the body.  However, it should be straightforward to add this
>> to the ManifoldCF code.  What version are you running, so that I can
>> provide the appropriate patch?
>>
>> Karl
>
>
>
>
> --
> Dag,
> Jan

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Jan van Haarst <ja...@vanhaarst.net>.
Hello Karl,

The version we have running is ManifoldCF 0.5-incubating.
It would be great to be able to get to the bottom of this.

Dag,
Jan

On Mon, Jun 18, 2012 at 2:21 PM, Karl Wright <da...@gmail.com> wrote:

> HTTPClient 3.1 itself does not seem to provide a logging option for
> logging the body.  However, it should be straightforward to add this
> to the ManifoldCF code.  What version are you running, so that I can
> provide the appropriate patch?
>
> Karl



-- 
Dag,
Jan

Re: Crawling behind an ISA proxy (iis 7.5) revisited

Posted by Karl Wright <da...@gmail.com>.
HTTPClient 3.1 itself does not seem to provide a logging option for
logging the body.  However, it should be straightforward to add this
to the ManifoldCF code.  What version are you running, so that I can
provide the appropriate patch?

Karl


