[jira] Updated: (NUTCH-926) Nutch follows wrong url in <META http-equiv="refresh" tag

Posted to dev@nutch.apache.org by "Marco Novo (JIRA)" <ji...@apache.org> on 2010/10/27 17:48:23 UTC
     [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Novo updated NUTCH-926:
-----------------------------

    Description: 
We have Nutch set to crawl a list of domain URLs, and we want to fetch only the listed domains (hosts), not their subdomains.
For example:

WWW.DOMAIN1.COM
..
..
..
WWW.RIGHTDOMAIN.COM
..
..
..
..
WWW.DOMAIN.COM

We set Nutch to:
NOT FOLLOW EXTERNAL LINKS


During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WRONG.RIGHTDOMAIN.COM">
</head>
<body>
</body>
</html>

Nutch goes on to crawl the WRONG subdomain! It should not do this!

During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WWW.WRONGDOMAIN.COM">
</head>
<body>
</body>
</html>


Nutch goes on to crawl the WRONG domain! It should not do this! Otherwise we will end up spidering the whole web.


We think the problem is in the ParseOutputFormat class in the org.apache.nutch.parse package. We have written a patch and will attach it.
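
To make the expected behaviour concrete, here is a minimal Java sketch of the check we have in mind: take the target URL from the refresh content value and follow it only if its host equals the host of the page being crawled. This is an illustration only, not the attached patch and not Nutch code; the class name MetaRefreshHostCheck and its methods are hypothetical, and it assumes that "external" means "a different host name, compared case-insensitively".

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: decides whether a meta refresh target stays on the same host. */
public class MetaRefreshHostCheck {

    // Matches a refresh content value such as "0; url=http://example.com" and captures the URL.
    // Whitespace (including line breaks) is allowed around the delay, the separator and "url=".
    private static final Pattern REFRESH_URL = Pattern.compile(
        "^\\s*[\\d.]*\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"\\s>]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the target URL found in a refresh content value, or null if there is none. */
    public static String extractRefreshUrl(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = REFRESH_URL.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    /** Returns true only if toUrl, resolved against fromUrl, has the same host as fromUrl. */
    public static boolean isSameHost(String fromUrl, String toUrl) {
        try {
            URL from = new URL(fromUrl);
            URL to = new URL(from, toUrl); // resolves relative targets against the page URL
            String fromHost = from.getHost().toLowerCase(Locale.ROOT);
            String toHost = to.getHost().toLowerCase(Locale.ROOT);
            return fromHost.equals(toHost);
        } catch (MalformedURLException e) {
            // An unparsable URL is treated as external, so it is not followed.
            return false;
        }
    }

    /** Returns true if the refresh found in content may be followed from pageUrl. */
    public static boolean shouldFollowRefresh(String pageUrl, String content) {
        String target = extractRefreshUrl(content);
        return target != null && isSameHost(pageUrl, target);
    }
}
```

With this sketch, shouldFollowRefresh("http://WWW.RIGHTDOMAIN.COM/", ...) rejects both refresh targets shown above (WRONG.RIGHTDOMAIN.COM and WWW.WRONGDOMAIN.COM), while a refresh to another page on WWW.RIGHTDOMAIN.COM is still followed.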

  was:
We have Nutch set to crawl a list of domain URLs, and we want to fetch only the listed domains (hosts), not their subdomains.
For example:

WWW.DOMAIN1.COM
..
..
..
WWW.RIGHTDOMAIN.COM
..
..
..
..
WWW.DOMAIN.COM

We set Nutch to:
NOT FOLLOW EXTERNAL LINKS


During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WRONG.RIGHTDOMAIN.COM">
</head>
<body>
</body>
</html>

Nutch goes on to crawl the WRONG subdomain! It should not do this!

During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WWW.WRONGDOMAIN.COM">
</head>
<body>
</body>
</html>


Nutch goes on to crawl the WRONG domain! It should not do this! Otherwise we will end up spidering the whole web.


We think the problem is in the handleRedirect method of the Fetcher class, but we are not able to fix it.

     Patch Info: [Patch Available]

> Nutch follows wrong url in <META http-equiv="refresh" tag
> ---------------------------------------------------------
>
>                 Key: NUTCH-926
>                 URL: https://issues.apache.org/jira/browse/NUTCH-926
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>         Environment: gnu/linux centOs
>            Reporter: Marco Novo
>            Priority: Critical
>             Fix For: 1.3
>
>
> We have Nutch set to crawl a list of domain URLs, and we want to fetch only the listed domains (hosts), not their subdomains.
> For example:
> WWW.DOMAIN1.COM
> ..
> ..
> ..
> WWW.RIGHTDOMAIN.COM
> ..
> ..
> ..
> ..
> WWW.DOMAIN.COM
> We set Nutch to:
> NOT FOLLOW EXTERNAL LINKS
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
>     <META http-equiv="refresh" content="0;
>     url=http://WRONG.RIGHTDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch goes on to crawl the WRONG subdomain! It should not do this!
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
>     <META http-equiv="refresh" content="0;
>     url=http://WWW.WRONGDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch goes on to crawl the WRONG domain! It should not do this! Otherwise we will end up spidering the whole web.
> We think the problem is in the ParseOutputFormat class in the org.apache.nutch.parse package. We have written a patch and will attach it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.