Posted to dev@nutch.apache.org by "Uros Gruber (JIRA)" <ji...@apache.org> on 2006/10/05 21:35:41 UTC

[jira] Created: (NUTCH-381) Ignore external link not work as expected

Ignore external link not work as expected
-----------------------------------------

                 Key: NUTCH-381
                 URL: http://issues.apache.org/jira/browse/NUTCH-381
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8.1
            Reporter: Uros Gruber
            Priority: Critical


Currently there is no way to properly limit the fetcher without regexp rules, so we use the ignore.external.link option, but it seems it doesn't work in all cases. Here are example URLs being fetched, even though

cat urls1 urls2 urls3 urls/urls | grep yahoo.com

doesn't return any hit:

fetching http://help.yahoo.com/help/sports
fetching http://www.turkish-xxx.com/adult-traffic-trade.php
fetching http://help.yahoo.com/help/us/astr/
fetching http://www.polish-xxx.com/de-index.html
fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
fetching http://help.yahoo.com/help/groups
fetching http://help.yahoo.com/help/fin/
fetching http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
fetching http://help.yahoo.com/help/us/edit/
fetching http://www.polish-xxx.com/es-index.html

Has anyone else noticed this?

I assume it has something to do with expired domains whose pages are generated randomly. But it is still unclear why URLs from other domains were added at all; perhaps the urlregexp filter's catch-all +* rule lets them through.
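For context, the behavior that ignore.external.link is expected to enforce boils down to a host comparison between each outlink and the page it was found on. A minimal illustrative sketch in Python (not Nutch's actual code, just the intended semantics):

```python
from urllib.parse import urlparse

def keep_outlink(from_url: str, to_url: str) -> bool:
    """Illustrative same-host check: keep an outlink only if its host
    matches the host of the page it was discovered on."""
    return urlparse(from_url).hostname == urlparse(to_url).hostname

# Same host: kept.
print(keep_outlink("http://help.yahoo.com/help/fin/",
                   "http://help.yahoo.com/help/groups"))        # True
# External host: should be dropped when external links are ignored.
print(keep_outlink("http://help.yahoo.com/help/fin/",
                   "http://www.polish-xxx.com/de-index.html"))  # False
```

By this rule none of the non-yahoo.com URLs in the list above should ever have been generated from a yahoo.com seed.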

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-381) Ignore external link not work as expected

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482266 ] 

Andrzej Bialecki  commented on NUTCH-381:
-----------------------------------------

Your last comment confirms my suspicions. After analyzing the code in Fetcher, I can confirm that this is indeed the effect of handling redirects immediately: Fetcher doesn't check whether the URLs it redirects to belong to the same host.

The solution is to disable immediate redirects (set http.redirect.max to 0 in your configuration).
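In configuration terms, the workaround is a single property override in nutch-site.xml; a sketch of what that would look like (property name taken from the comment above, the description text is an assumption about the intended behavior):

```xml
<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>With a value of 0 the Fetcher does not follow redirects
  immediately; redirected URLs are recorded for a later fetch round,
  where they pass through the normal URL filtering.</description>
</property>
```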




[jira] Commented: (NUTCH-381) Ignore external link not work as expected

Posted by "Uros Gruber (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-381?page=comments#action_12440453 ] 
            
Uros Gruber commented on NUTCH-381:
-----------------------------------

I tried to find out what happened from the logs, but because of the multiple fetcher threads I couldn't trace any connection. I also tried with the linkdb. For example, I searched for www.polish-xxx.com but found only the fromUrl entry, which is strange; if I understand this correctly, in that case no URL points to this URL.

I have the linkdb gzipped at 15 MB. I can send it to you or put it on our server if that is any help.

As for -noAdditions, I'm too late: I have already run updatedb with those links.




[jira] Commented: (NUTCH-381) Ignore external link not work as expected

Posted by "nutch.newbie (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-381?page=comments#action_12440304 ] 
            
nutch.newbie commented on NUTCH-381:
------------------------------------

Yes, I can confirm this. I have a list of 5000+ URLs and it didn't work, so I went back to the regex include/exclude method.

Regards



[jira] Commented: (NUTCH-381) Ignore external link not work as expected

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-381?page=comments#action_12440351 ] 
            
Andrzej Bialecki  commented on NUTCH-381:
-----------------------------------------

It would be good to investigate where this problem occurs. Is it somewhere in the redirects? You should have log messages about redirection to these URLs.

In the meantime, you may wish to try 'updatedb -noAdditions' in the recent trunk; it specifically discards all URLs that are not already present in the crawldb. If you use this option, the only way to add URLs to your crawldb is through injection.
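The effect of -noAdditions can be pictured as a filter over the update step: newly discovered URLs are dropped unless they are already known. A toy Python model of that semantics (hypothetical names; Nutch's real implementation is a MapReduce job over the crawldb):

```python
def updatedb(crawldb: set, discovered: list, no_additions: bool = False) -> set:
    """Toy model of 'updatedb': merge discovered URLs into the crawldb.
    With no_additions=True, URLs not already in the db are discarded,
    so injection becomes the only way to grow the db."""
    if no_additions:
        discovered = [u for u in discovered if u in crawldb]
    return crawldb | set(discovered)

db = {"http://help.yahoo.com/help/fin/"}
found = ["http://help.yahoo.com/help/fin/",
         "http://www.polish-xxx.com/de-index.html"]
print(sorted(updatedb(db, found, no_additions=True)))
# ['http://help.yahoo.com/help/fin/']  -- the external URL is not added
```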



[jira] Closed: (NUTCH-381) Ignore external link not work as expected

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-381.
-----------------------------------

       Resolution: Won't Fix
    Fix Version/s: 0.9.0
         Assignee: Andrzej Bialecki 

This was caused by following redirected pages immediately in Fetcher. Set http.redirect.max to 0 to avoid this problem.

