Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/04/04 16:38:32 UTC

Meta-Refresh Question

Silly question but nutch won't follow meta-refreshes will it?

Dennis



RE: Meta-Refresh Question

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I searched through the code, and the problem is that the URL returned for
the meta-refresh looks like this:

http://www.oneforever.com/tohomepage.do;jsessionid=F3C8BBAC224990A9214A1785E5001AFD

This matches the following RegexURLFilter exclusion pattern:

-[?*!@=] (because of the = sign)
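To make the match concrete, here is a minimal standalone sketch. It does not use the Nutch classes; it assumes the rule is applied as a partial match anywhere in the URL, and the class and method names (SessionIdFilterCheck, isExcluded) are made up for illustration.

```java
import java.util.regex.Pattern;

final class SessionIdFilterCheck {
    // Regex part of the "-[?*!@=]" rule. The leading '-' is the rule's
    // exclude sign and is not part of the regex itself.
    static final Pattern EXCLUDE = Pattern.compile("[?*!@=]");

    // Assumption: a URL is rejected when the pattern is found anywhere in it.
    static boolean isExcluded(String url) {
        return EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        String refreshUrl = "http://www.oneforever.com/tohomepage.do"
                + ";jsessionid=F3C8BBAC224990A9214A1785E5001AFD";
        // The '=' after "jsessionid" triggers the exclusion.
        System.out.println(isExcluded(refreshUrl)); // true
        // Without the session id there is nothing for the rule to match.
        System.out.println(isExcluded("http://www.oneforever.com/tohomepage.do")); // false
    }
}
```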

So my question is: should the URL be cleaned up inside HttpBase, where it
is grabbed from the page content, or would it be better to add a URL filter
that matches it before it gets eliminated by the filter above?
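Whichever place it ends up in, the cleanup itself could look roughly like the sketch below. This is only an illustration, not existing Nutch code: SessionIdStripper and strip are hypothetical names, and it assumes the session id always appears as a ";jsessionid=..." path parameter.

```java
import java.util.regex.Pattern;

final class SessionIdStripper {
    // Matches ";jsessionid=<value>", where the value runs up to the next
    // '?', '#' or ';' (or the end of the string). Case-insensitive.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#;]*", Pattern.CASE_INSENSITIVE);

    // Removes the session id path parameter and leaves the rest untouched.
    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String refreshUrl = "http://www.oneforever.com/tohomepage.do"
                + ";jsessionid=F3C8BBAC224990A9214A1785E5001AFD";
        String cleaned = strip(refreshUrl);
        System.out.println(cleaned); // http://www.oneforever.com/tohomepage.do
        // The cleaned URL no longer contains any character from [?*!@=].
        System.out.println(Pattern.compile("[?*!@=]").matcher(cleaned).find()); // false
    }
}
```

Doing this before the regex filter runs would let the redirect target through, while URLs with real query strings would still be rejected by the existing rule.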

Dennis

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Tuesday, April 04, 2006 9:56 AM
To: nutch-user@lucene.apache.org
Subject: Re: Meta-Refresh Question

Dennis Kubes wrote:
> Silly question but nutch won't follow meta-refreshes will it?
>

It should have, parse-html has support for this
(ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can see
that one of the necessary pieces (in Fetcher) didn't make it to 0.8.
Please create a JIRA issue so that it doesn't escape our attention.
Thank you!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Meta-Refresh Question

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> Silly question but nutch won't follow meta-refreshes will it?
>   

It should have, parse-html has support for this 
(ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can 
see that one of the necessary pieces (in Fetcher) didn't make it to 0.8. 
Please create a JIRA issue so that it doesn't escape our attention. 
Thank you!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com