Posted to user@nutch.apache.org by Dinçer Kavraal <dk...@gmail.com> on 2011/08/02 00:17:28 UTC

redirect and cookie

Hi, (using Nutch 1.3 on Ubuntu 11.04, kernel 2.6.38-10-generic-pae)

I would like to crawl a news site and am having trouble with cookie support
(which I haven't understood well enough yet).

Here is the current status. When I crawl the URLs, I get these messages in
the Nutch log (and on the console):
> Skipping http://ntvmsnbc.com/id/25237248 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237249 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237253 as content is not fetched successfully


I looked at the segment dump and saw that the site redirects the client to
an address that sets a cookie and then redirects back to the original
address, which then shows the content. To clarify:
1. The client requests the address: http://ntvmsnbc.com/id/25237248
2. The server responds with a Location header redirecting to:
http://www.ntvmsnbc.com/redirect.aspx?to=http%3a%2f%2fwww.ntvmsnbc.com%2fid%2f25237248%2f&from=http%3a%2f%2fntvmsnbc.com%2fid%2f25237248%2f&mskey=323dcf07efd1a45a3851a0199679e65e
3. That address sets a cookie and redirects the client back to the
originally requested address: http://ntvmsnbc.com/id/25237248
4. This time the content is shown.
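
The four steps above can be sketched as a small offline simulation in Java. This is only an illustration under stated assumptions: the class RedirectCookieSketch, the cookie name "visited", and the stand-in server() method are all hypothetical, and the real site's cookie and redirect logic were not inspected. It shows why a client that ignores Set-Cookie would keep bouncing between the two addresses until its redirect limit is reached, while a client that stores the cookie gets the content on the third request.

```java
import java.util.HashMap;
import java.util.Map;

// Offline simulation of the redirect/cookie handshake; no network access.
// All names here (class, methods, cookie name) are hypothetical.
class RedirectCookieSketch {

    // Minimal response: status code, optional Location, optional Set-Cookie, optional body.
    static final class Response {
        final int status;
        final String location;
        final String setCookie;
        final String body;

        Response(int status, String location, String setCookie, String body) {
            this.status = status;
            this.location = location;
            this.setCookie = setCookie;
            this.body = body;
        }
    }

    static final String ARTICLE = "http://ntvmsnbc.com/id/25237248";
    static final String REDIRECTOR = "http://www.ntvmsnbc.com/redirect.aspx";

    // Stand-in for the site: the article is served only when the cookie is present.
    static Response server(String url, Map<String, String> cookies) {
        if (url.equals(REDIRECTOR)) {
            // Step 3: set a cookie and send the client back to the article.
            return new Response(302, ARTICLE, "visited=1", null);
        }
        if (cookies.containsKey("visited")) {
            // Step 4: the cookie is present, so the content is shown.
            return new Response(200, null, null, "article content");
        }
        // Step 2: no cookie yet, redirect to the cookie-setting address.
        return new Response(302, REDIRECTOR, null, null);
    }

    // Follows at most maxRedirects redirects; returns the body, or null if the limit is hit.
    static String fetch(String url, int maxRedirects, boolean acceptCookies) {
        Map<String, String> jar = new HashMap<>();
        String current = url;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            Response r = server(current, jar);
            if (r.setCookie != null && acceptCookies) {
                String[] kv = r.setCookie.split("=", 2);
                jar.put(kv[0], kv[1]);
            }
            if (r.status == 200) {
                return r.body;
            }
            current = r.location;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("with cookies:    " + fetch(ARTICLE, 5, true));
        System.out.println("without cookies: " + fetch(ARTICLE, 5, false));
    }
}
```

With a limit of 5 redirects, the cookie-accepting run returns the body and the cookie-ignoring run returns null, which is consistent with the "content is not fetched successfully" messages, assuming the fetcher behaves like the second case.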

With the standard configuration, I cannot get any content at all. I thought
I should try to accept cookies. However, *I don't know how to accept cookies
yet*. My settings work for other sites that do not require cookies.

I tried changing these settings in nutch-default.xml:
http.redirect.max = 5
http.useHttp11 = true
plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)

Do you think something is wrong with my config?

Appreciate any ideas,
Dincer

Re: redirect and cookie

Posted by Dinçer Kavraal <dk...@gmail.com>.
To whom it may concern,

I solved this by overriding the protocol-http plugin so that it sends an
additional header such as:
"Cookie: mycookie=1"

Regards.
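
For anyone reading this later, here is a minimal Java sketch of the idea, not the actual plugin change. It assumes the request is assembled as a raw string before being written to the socket. The class CookieHeaderSketch, the method buildRequest, and the user agent "my-crawler" are hypothetical, and "mycookie=1" is only the placeholder from the message above. The real cookie name and value depend on what the site expects.

```java
// Hypothetical helper: builds a raw HTTP/1.0 GET request with one extra Cookie header.
// This is not Nutch's own code; it only illustrates where such a header would go.
class CookieHeaderSketch {

    static String buildRequest(String host, String path, String userAgent, String cookie) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        req.append("User-Agent: ").append(userAgent).append("\r\n");
        // The extra header: sent only when a cookie string is configured.
        if (cookie != null && !cookie.isEmpty()) {
            req.append("Cookie: ").append(cookie).append("\r\n");
        }
        // A blank line terminates the header section.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("ntvmsnbc.com", "/id/25237248", "my-crawler", "mycookie=1"));
    }
}
```

Sending a fixed cookie this way skips the redirect handshake entirely, so it only works as long as the site accepts that static value.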

