Posted to dev@nutch.apache.org by "Marco Novo (JIRA)" <ji...@apache.org> on 2010/10/27 16:46:19 UTC

[jira] Created: (NUTCH-926) Nutch follows wrong url in
Nutch follows wrong url in <META http-equiv="refresh" tag
---------------------------------------------------------

                 Key: NUTCH-926
                 URL: https://issues.apache.org/jira/browse/NUTCH-926
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.2
         Environment: gnu/linux centOs
            Reporter: Marco Novo
            Priority: Critical
             Fix For: 1.3


We have Nutch set up to crawl a list of domains (a URL list), and we want to fetch only the hosts we pass in, not their subdomains.
So

WWW.DOMAIN1.COM
..
..
..
WWW.RIGHTDOMAIN.COM
..
..
..
..
WWW.DOMAIN.COM

We set Nutch to:
NOT FOLLOW EXTERNAL LINKS


During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WRONG.RIGHTDOMAIN.COM">
</head>
<body>
</body>
</html>

Nutch continues to crawl the WRONG subdomains! But it should not do this!!

During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WWW.WRONGDOMAIN.COM">
</head>
<body>
</body>
</html>


Nutch continues to crawl the WRONG domain! But it should not do this! If it does, we will end up spidering the whole web....


We think the problem is in the Fetcher class, in the handleRedirect method, but we have not been able to fix it ourselves.
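To make the expected behaviour concrete, here is a minimal sketch of the kind of check we have in mind (hypothetical class and method names, not the actual Nutch code): when external links are ignored, a meta-refresh or redirect target should only be kept if its host matches the host of the page that produced it.

{noformat}
// Sketch only: hypothetical names illustrating the check we expect when
// external links are ignored (db.ignore.external.links). Not the Nutch code.
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectHostCheck {

  /** Returns toUrl if it stays on the same host as fromUrl, otherwise null. */
  static String filterRedirect(String fromUrl, String toUrl,
                               boolean ignoreExternalLinks) {
    if (!ignoreExternalLinks) {
      return toUrl;
    }
    try {
      String fromHost = new URL(fromUrl).getHost().toLowerCase();
      String toHost = new URL(toUrl).getHost().toLowerCase();
      return fromHost.equals(toHost) ? toUrl : null;
    } catch (MalformedURLException e) {
      return null; // drop unparsable redirect targets
    }
  }

  public static void main(String[] args) {
    // The meta refresh from this report: www -> wrong subdomain is dropped.
    System.out.println(filterRedirect("http://www.rightdomain.com/page.html",
        "http://wrong.rightdomain.com/", true));                 // null
    // A refresh that stays on the same host is kept.
    System.out.println(filterRedirect("http://www.rightdomain.com/page.html",
        "http://www.rightdomain.com/other.html", true));         // the URL
  }
}
{noformat}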

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Created: (NUTCH-926) Nutch follows wrong url in Posted by David Stuart <da...@progressivealliance.co.uk>.
Have you tried restricting the crawl range in regex-urlfilter.txt instead of having

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

Change to

# Crawl right domain
+^http://www\.rightdomain\.com/

# Deny anything else
-.



David


On 27 Oct 2010, at 15:46, Marco Novo (JIRA) wrote:

>  If it does, we will end up spidering the whole web....


[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925467#action_12925467 ] 

Markus Jelsma commented on NUTCH-926:
-------------------------------------

I've seen the patch, but my Java is still too rusty, and Nutch's lack of comments doesn't help either. What exactly does the patch do? I still think this should be solved by updating your regex filters as dynamically as you inject seeds. Perhaps someone else can shed some light on this one.
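A rough sketch of that approach (the file locations below are just examples, not fixed Nutch paths): regenerate the filter file from the seed list every time you inject, with one allow rule per seed host and a catch-all deny at the end.

{noformat}
// Sketch only: turn a seed list into regex-urlfilter rules that allow
// exactly the injected hosts and deny everything else.
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SeedsToUrlFilter {
  public static void main(String[] args) throws IOException {
    List<String> rules = new ArrayList<>();
    for (String seed : Files.readAllLines(Paths.get("urls/seeds.txt"))) {
      seed = seed.trim();
      if (seed.isEmpty() || seed.startsWith("#")) continue;
      String host = new URL(seed).getHost().toLowerCase();
      // Escape the dots and anchor on the scheme so only this exact host matches.
      rules.add("+^http://" + host.replace(".", "\\.") + "/");
    }
    rules.add("-.");  // deny every URL that no allow rule matched
    Files.write(Paths.get("conf/regex-urlfilter.txt"), rules);
  }
}
{noformat}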

Thanks



[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925455#action_12925455 ] 

Marco Novo commented on NUTCH-926:
----------------------------------

No, we are using an external file with the URL list (to be injected). This list is really a seed list, and it is generated dynamically (so it changes from time to time). That is why we want to allow everything in the filter and skip nothing.
We attached a patch; please have a look at it.

Thanks




[jira] Updated: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Novo updated NUTCH-926:
-----------------------------

    Description: 
We have Nutch set up to crawl a list of domains (a URL list), and we want to fetch only the hosts we pass in, not their subdomains.
So

WWW.DOMAIN1.COM
..
..
..
WWW.RIGHTDOMAIN.COM
..
..
..
..
WWW.DOMAIN.COM

We set Nutch to:
NOT FOLLOW EXTERNAL LINKS


During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WRONG.RIGHTDOMAIN.COM">
</head>
<body>
</body>
</html>

Nutch continues to crawl the WRONG subdomains! But it should not do this!!

During crawling of WWW.RIGHTDOMAIN.COM
if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
    <META http-equiv="refresh" content="0;
    url=http://WWW.WRONGDOMAIN.COM">
</head>
<body>
</body>
</html>


Nutch continues to crawl the WRONG domain! But it should not do this! If it does, we will end up spidering the whole web....


We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have written a patch and will attach it.

     Patch Info: [Patch Available]



[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ] 

Andrzej Bialecki  commented on NUTCH-926:
-----------------------------------------

bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!!
No need to shout, we hear you :)

Indeed, Nutch behavior when following redirects doesn't play well with the rule of ignoring external outlinks. Strictly speaking, redirects are not outlinks, but the silent assumption behind ignoreExternalOutlinks is that we crawl content only from that hostname.

And your patch would solve this particular issue. However, it is not as simple as it seems... My favorite example is www.ibm.com -> www8.ibm.com/index.html. If we apply your fix, you won't be able to crawl www.ibm.com unless you inject all of the wwwNNN load-balanced hosts... so simple equality of hostnames may not be sufficient. We have utilities to extract domain names, so we could compare domains instead, but then we may mistreat money.cnn.com vs. weather.cnn.com ...
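To illustrate the trade-off (using a deliberately naive "last two labels" notion of domain; a real implementation would need a proper, suffix-aware utility):

{noformat}
// Host equality vs. domain equality. The domain extraction below is
// intentionally naive and is not the Nutch utility.
import java.net.URL;

public class HostVsDomain {

  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  public static void main(String[] args) throws Exception {
    String www   = new URL("http://www.ibm.com/").getHost();
    String www8  = new URL("http://www8.ibm.com/index.html").getHost();
    String money = new URL("http://money.cnn.com/").getHost();
    String wthr  = new URL("http://weather.cnn.com/").getHost();

    // Strict host equality: the load-balanced www8 host is rejected.
    System.out.println(www.equals(www8));                             // false
    // Domain equality: www8.ibm.com is accepted...
    System.out.println(naiveDomain(www).equals(naiveDomain(www8)));   // true
    // ...but money.cnn.com and weather.cnn.com get lumped together too.
    System.out.println(naiveDomain(money).equals(naiveDomain(wthr))); // true
  }
}
{noformat}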



[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925389#action_12925389 ] 

Markus Jelsma commented on NUTCH-926:
-------------------------------------

Are you using the regex-urlfilter plugin? If so, please show us your regex-urlfilter.txt configuration file.



[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925728#action_12925728 ] 

Marco Novo commented on NUTCH-926:
----------------------------------

I'm sorry, I did not mean to shout. I know you can hear me; I was just desperate, and the capital letters were only meant to make the problem more visible. :)

From what I understand, the problem is already known: we would need another property (and probably a plugin) to handle crawling through web load balancers that use a different hostname than the original but serve the relevant content.
In the meantime, without our patch, one unfortunate redirect outside the domain (with no load balancer involved) means Nutch could end up downloading the entire web at high depth levels....
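Just to sketch what such a property could look like, purely hypothetically (nothing like this exists in Nutch today): a strict same-host check on redirects, relaxed by a configurable list of extra host patterns for load-balanced hosts.

{noformat}
// Purely hypothetical sketch; the property and class names are invented.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RedirectPolicy {

  private final List<Pattern> extraHosts = new ArrayList<>();

  // hostPatterns would come from a (hypothetical) config property,
  // e.g. "db.redirect.allowed.hosts" = "www\\d+\\.ibm\\.com"
  RedirectPolicy(List<String> hostPatterns) {
    for (String p : hostPatterns) {
      extraHosts.add(Pattern.compile(p));
    }
  }

  /** Keep a redirect if it stays on the same host or matches an allowed pattern. */
  boolean accept(String fromHost, String toHost) {
    if (fromHost.equalsIgnoreCase(toHost)) {
      return true;
    }
    for (Pattern p : extraHosts) {
      if (p.matcher(toHost).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    RedirectPolicy policy = new RedirectPolicy(List.of("www\\d+\\.ibm\\.com"));
    System.out.println(policy.accept("www.ibm.com", "www8.ibm.com"));        // true
    System.out.println(policy.accept("www.rightdomain.com",
                                     "wrong.rightdomain.com"));              // false
  }
}
{noformat}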



[jira] Updated: (NUTCH-926) Nutch follows wrong url in Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-926:
--------------------------------

    Fix Version/s:     (was: 1.3)

Won't be addressed in 1.3, more discussion needed



[jira] [Commented] (NUTCH-926) Nutch follows wrong url in Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180556#comment-13180556 ] 

Lewis John McGibbney commented on NUTCH-926:
--------------------------------------------

Hey guys, I was just looking through our critical issues and hadn't noticed this one previously. Did anyone have a look at it, and can we reproduce it?
                

        

[jira] Updated: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Novo updated NUTCH-926:
-----------------------------

    Component/s:     (was: fetcher)
                 parser



[jira] Issue Comment Edited: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925429#action_12925429 ] 

Marco Novo edited comment on NUTCH-926 at 10/27/10 11:54 AM:
-------------------------------------------------------------

OK, here is our file: /opt/nutch/conf/crawl-urlfilter.txt


but we don't think the problem is in this file....
                                                                                           
{noformat} 
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# Original filter line (without images parsing) 2010-09-06
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# Modified filter line to add images downloadin (and parsing?) 2010-09-06
-\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|js)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
# -.

# do everything
+.
{noformat} 




[jira] Issue Comment Edited: (NUTCH-926) Nutch follows wrong url in Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925439#action_12925439 ] 

Markus Jelsma edited comment on NUTCH-926 at 10/27/10 12:27 PM:
----------------------------------------------------------------

Marco, the following parts of your config are interesting:

{noformat}
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
{noformat}

This is commented out; if it weren't, it would certainly allow all kinds of subdomains of MY.DOMAIN.NAME.

{noformat}
# skip everything else
# -.
{noformat}

This is also commented out, so nothing is skipped.

{noformat}
# do everything
+.
{noformat}

How about this? You allow everything.

You'll need to allow your injected URLs without also allowing subdomains, so

{noformat}
+^http://www.MY.DOMAIN.NAME/ 
{noformat}

and then disallowing everything else should be good enough.
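Put together (and with the dots escaped so they only match literal dots), the whole filter boils down to something like:

{noformat}
# accept only the exact injected host
+^http://www\.MY\.DOMAIN\.NAME/

# deny everything else
-.
{noformat}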




[jira] Updated: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Novo updated NUTCH-926:
-----------------------------

    Attachment: ParseOutputFormat.java.patch

Sorry for the different indentation ;)

> Nutch follows wrong url in <META http-equiv="refresh" tag
> ---------------------------------------------------------
>
>                 Key: NUTCH-926
>                 URL: https://issues.apache.org/jira/browse/NUTCH-926
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>         Environment: gnu/linux centOs
>            Reporter: Marco Novo
>            Priority: Critical
>             Fix For: 1.3
>
>         Attachments: ParseOutputFormat.java.patch
>
>
> We have nutch set to crawl a domain urllist and we want to fetch only passed domains (hosts) not subdomains.
> So
> WWW.DOMAIN1.COM
> ..
> ..
> ..
> WWW.RIGHTDOMAIN.COM
> ..
> ..
> ..
> ..
> WWW.DOMAIN.COM
> We sets nutch to:
> NOT FOLLOW EXERNAL LINKS
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
>     <META http-equiv="refresh" content="0;
>     url=http://WRONG.RIGHTDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG subdomains! But it should not do this!!
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
>     <META http-equiv="refresh" content="0;
>     url=http://WWW.WRONGDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG domain! But it should not do this! If that we will spider all the web....
> We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have done a patch so we will attach it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Marco Novo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925429#action_12925429 ] 

Marco Novo commented on NUTCH-926:
----------------------------------

OK, here is our file: /opt/nutch/conf/crawl-urlfilter.txt


but we don't think the problem is in this file....
                                                                                           

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# Original filter line (without images parsing) 2010-09-06
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# Modified filter line to add images downloadin (and parsing?) 2010-09-06
-\.(css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|js)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
# -.

# do everything
+.





[jira] Commented: (NUTCH-926) Nutch follows wrong url in Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925439#action_12925439 ] 

Markus Jelsma commented on NUTCH-926:
-------------------------------------

Marco, the following part of your config is interesting:

{noformat}
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
{noformat}

This is commented out; if it weren't, it would certainly allow all kinds of subdomains of MY.DOMAIN.NAME.

{noformat}
# skip everything else
# -.
{noformat}

This is also commented out, so nothing is skipped.

{noformat}
# do everything
+.
{noformat}

How about this? You allow everything.

You'll need to allow your injected URLs without also allowing subdomains, so

{noformat}
+^http://www.MY.DOMAIN.NAME/ 
{noformat}

is good enough.

