Posted to user@nutch.apache.org by Dinçer Kavraal <dk...@gmail.com> on 2011/08/02 00:17:28 UTC
redirect and cookie
Hi, (using Nutch 1.3 on Ubuntu 11.04, kernel 2.6.38-10-generic-pae)
I would like to crawl a news site and am having trouble with cookie support
(which I haven't understood well enough yet).
Here is the status of the project... When I crawl the URLs I get this
message in the log of nutch (also in console):
> Skipping http://ntvmsnbc.com/id/25237248 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237249 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237253 as content is not fetched successfully
I looked at the segment dump and saw that the site redirects the client to an
address that sets a cookie, then redirects back to the original address and
shows the content. To clarify:
1. The client requests the address: http://ntvmsnbc.com/id/25237248
2. The server sends a Location header redirecting to:
http://www.ntvmsnbc.com/redirect.aspx?to=http%3a%2f%2fwww.ntvmsnbc.com%2fid%2f25237248%2f&from=http%3a%2f%2fntvmsnbc.com%2fid%2f25237248%2f&mskey=323dcf07efd1a45a3851a0199679e65e
3. That address sets a cookie and redirects the client back to the originally
requested address: http://ntvmsnbc.com/id/25237248
4. This time the content is shown.
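The four steps above can be illustrated with a small offline simulation. This is only a sketch of why a client that drops cookies never reaches the content: the serve method stands in for the site's behaviour as described, the cookie name "seen" is made up (the real cookie name is not given in this thread), and nothing here is Nutch code or touches the network.

```java
class RedirectCookieDemo {
    // URLs taken from the steps above; the redirector's query string is omitted for brevity.
    static final String ARTICLE = "http://ntvmsnbc.com/id/25237248";
    static final String REDIRECTOR = "http://www.ntvmsnbc.com/redirect.aspx";

    /** Minimal response: status code, optional Location, optional Set-Cookie value. */
    static final class Response {
        final int status;
        final String location;
        final String setCookie;

        Response(int status, String location, String setCookie) {
            this.status = status;
            this.location = location;
            this.setCookie = setCookie;
        }
    }

    /** Simulated server (assumption): serves the article only when the cookie is sent. */
    static Response serve(String url, String cookieHeader) {
        if (url.equals(REDIRECTOR)) {
            // Step 3: set the (hypothetical) cookie and send the client back.
            return new Response(302, ARTICLE, "seen=1");
        }
        if (cookieHeader != null && cookieHeader.contains("seen=1")) {
            // Step 4: cookie present, content is shown.
            return new Response(200, null, null);
        }
        // Step 2: no cookie yet, redirect to the cookie-setting address.
        return new Response(302, REDIRECTOR, null);
    }

    /** Follows redirects; returns 200 on success or -1 once maxRedirects is exceeded. */
    static int fetch(String url, int maxRedirects, boolean keepCookies) {
        String cookie = null;
        String current = url;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            Response r = serve(current, cookie);
            if (r.status == 200) {
                return 200;
            }
            if (keepCookies && r.setCookie != null) {
                cookie = r.setCookie; // remember the cookie for later requests
            }
            current = r.location;
        }
        return -1; // redirect limit reached without ever seeing the content
    }

    public static void main(String[] args) {
        System.out.println("without cookies: " + fetch(ARTICLE, 5, false));
        System.out.println("with cookies:    " + fetch(ARTICLE, 5, true));
    }
}
```

In this simulation, raising the redirect limit alone does not help: without a cookie store the client bounces between the two addresses until the limit runs out, while a client that keeps the cookie gets the content on the third request.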
With the standard configuration, I cannot get any content at all. I thought
I should try to accept cookies. However, *I don't know how to accept cookies
yet*. My settings work for other sites that do not require cookies.
I tried changing these settings in nutch-default.xml:
http.redirect.max = 5
http.useHttp11 = true
plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
Do you think something is wrong with my config?
Appreciate any ideas,
Dincer
Re: redirect and cookie
Posted by Dinçer Kavraal <dk...@gmail.com>.
To whom it may concern,
I solved this by overriding the protocol-http plugin to send an
additional header such as:
"Cookie: mycookie=1"
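For anyone wanting to try the same workaround, here is a minimal sketch of the idea. It assumes the plugin assembles the request header text itself before writing it to the socket; the class and method names (RawRequestDemo, buildRequest) and the User-Agent value are hypothetical and are not the plugin's actual code.

```java
class RawRequestDemo {
    /**
     * Builds a plain HTTP GET request as text, adding a fixed Cookie header
     * when one is supplied. Illustrative only, not Nutch's own implementation.
     */
    static String buildRequest(String host, String path, String userAgent, String cookie) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        req.append("User-Agent: ").append(userAgent).append("\r\n");
        if (cookie != null && !cookie.isEmpty()) {
            // The extra header: sent on every request, so the site sees the cookie on the first hit.
            req.append("Cookie: ").append(cookie).append("\r\n");
        }
        req.append("\r\n"); // blank line terminates the header section
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("ntvmsnbc.com", "/id/25237248", "my-crawler", "mycookie=1"));
    }
}
```

Note that this sends a fixed cookie rather than accepting whatever the server sets, so it only helps when the site is satisfied by a known, static cookie value.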
Regards.