Posted to user@manifoldcf.apache.org by Hiroshi Tatsumi <ho...@comet.ocn.ne.jp> on 2014/04/09 04:31:50 UTC

Question about WebcrawlerConnector

Hi,

I'm using MCF 1.5.1 and Solr 4.6.1.
I have a question about the WebcrawlerConnector.


[Question] WebcrawlerConnector - Session based access credentials
Can CookieManager handle cookies from multiple domains?
My company's intranet web site requires this.
To access the web site below, I need to send two cookies.

Procedure
access the target URL -> automatic redirect to the login page -> if the login
succeeds, automatic redirect back to the target URL (needs two cookies)

Target URL
https://foo.bar-network.jp/trac/repository/

Cookie domain/name
(1) .bar-network.jp/comauth-req
(2) hoge.bar-network.jp/JSESSIONID

"Session based access credentials" can simulate this login process,
but in the final automatic redirect only one cookie is sent.
So the login fails, and I cannot crawl the target web site.

(1) .bar-network.jp/comauth-req  -> not sent
(2) hoge.bar-network.jp/JSESSIONID  -> sent

Do you have any idea how to make this login procedure succeed?
Or should I modify the MCF source code?

Regards,
Hiroshi Tatsumi 


Re: Question about WebcrawlerConnector

Posted by Karl Wright <da...@gmail.com>.
Hi Hiroshi,

Are both cookies being set at the same time?

The ManifoldCF web connector records *all* the cookies that have been set
at the time the login sequence ends.  So there are two possibilities:

(1) You did not specify all of the login sequence.  You may have missed,
for instance, a last redirection, which sets the second cookie.
(2) There is some sort of problem with HttpClient, or how we configure it,
which prevents it from accepting one of the cookies.  HttpClient has many
different cookie policies; we may need to change the one we use.

If both cookies are set at the same time on the same response, then we know
that the problem is not (1).  So please let me know.
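
As a quick sanity check for possibility (2) outside of ManifoldCF, here is a
minimal sketch that asks whether each cookie's domain matches the target host.
It relies on the JDK's own java.net.HttpCookie.domainMatches, which is NOT the
matcher HttpClient applies (each HttpClient cookie policy has its own rules),
so treat the result only as a hint. The class name CookieDomainCheck is made
up for illustration; the hosts and cookie domains are the ones from the
original post.

```java
import java.net.HttpCookie;

public class CookieDomainCheck {
    // Returns true when a cookie whose Domain attribute is cookieDomain
    // domain-matches the given host according to the JDK's rules.
    // Assumption: this only approximates what HttpClient's cookie specs do.
    public static boolean matches(String cookieDomain, String host) {
        return HttpCookie.domainMatches(cookieDomain, host);
    }

    public static void main(String[] args) {
        String targetHost = "foo.bar-network.jp";
        String[] cookieDomains = { ".bar-network.jp", "hoge.bar-network.jp" };
        for (String domain : cookieDomains) {
            System.out.println(domain + " vs " + targetHost + ": "
                + matches(domain, targetHost));
        }
    }
}
```

If a cookie matches here but still is not sent during the crawl, that points
towards the cookie having been rejected when it was set, which the context
logging should confirm.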

For debugging this on 1.5, the best thing to do is to turn on HttpClient
logging of various sorts.  You do this through the ManifoldCF logging.ini
file.  See the section on log4j at:
http://hc.apache.org/httpcomponents-client-4.2.x/logging.html
Wire debugging is very helpful for determining when a cookie has been
transmitted.  If you need to know why a cookie has been rejected, context
logging is helpful.
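
As a rough sketch, the lines below could be added to logging.ini. The logger
names are the ones given in the HttpClient logging guide linked above; that
they can be appended unchanged to your logging.ini is an assumption about
your setup, so adjust the levels and appenders to match what is already
there.

```
# Context logging for HttpClient, which should include cookie
# accept/reject decisions
log4j.logger.org.apache.http=DEBUG
# Header logging: shows the Set-Cookie and Cookie headers on each exchange
log4j.logger.org.apache.http.headers=DEBUG
# Full wire logging is very verbose; switch this to DEBUG only if the
# header output is not enough
log4j.logger.org.apache.http.wire=ERROR
```

The logger settings typically take effect only after the agents process is
restarted.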

I'd be happy to look at the logging output and let you know what I think,
if you want to send it to me.

Thanks,
Karl


