Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2008/12/03 02:42:44 UTC

[jira] Created: (NUTCH-668) Domain URL Filter

Domain URL Filter
-----------------

                 Key: NUTCH-668
                 URL: https://issues.apache.org/jira/browse/NUTCH-668
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.0.0
         Environment: All
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


A URLFilter that adds the ability to filter out URLs by top level domain or by hostname.  A configuration file with a listing of URLs is used to denote accepted urls.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658118#action_12658118 ] 

Dennis Kubes commented on NUTCH-668:
------------------------------------

Anybody have a problem if I commit this today or tomorrow?



[jira] Updated: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-668:
-------------------------------

    Attachment: NUTCH-668-3-20081213.patch

New domain filter patch that matches against suffix, domain, and hostname in that order.



[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658121#action_12658121 ] 

Andrzej Bialecki  commented on NUTCH-668:
-----------------------------------------

+1. One minor cosmetic change I would make: DomainURLFilter.java uses StringUtils twice, each time with the full package name. Just add an import statement to make this shorter.



[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653881#action_12653881 ] 

Dennis Kubes commented on NUTCH-668:
------------------------------------

I agree.  Being able to search for TLDs like .com would make it much more flexible.  Let me work up the changes and I will post a new patch (without my local path :)).  Although I do want to get this in quickly, I think the new functionality is worth the wait.



[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658165#action_12658165 ] 

Dennis Kubes commented on NUTCH-668:
------------------------------------

It uses two different StringUtils classes, one from commons lang and one from Hadoop (org.apache.hadoop.util.StringUtils).  I just chose to import the commons one as I thought I would use it more often.  As it happens I only use it once in this patch.



[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663224#action_12663224 ] 

Hudson commented on NUTCH-668:
------------------------------

Integrated in Nutch-trunk #691 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/691/])
    



[jira] Closed: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-668.
------------------------------




[jira] Updated: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-668:
-------------------------------

    Attachment: NUTCH-668-2-20081204.patch

Updated to include URLUtil methods that were missing.  Sorry.



[jira] Resolved: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-668.
--------------------------------

    Resolution: Fixed

Committed with revision 729958.



[jira] Updated: (NUTCH-668) Domain URL Filter

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-668:
-------------------------------

    Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files.  Domains can be filtered either by top level domain, ignoring subdomains, or by hostname, depending on configuration.  There is a configuration file where valid domains are placed one per line.  Those domains are used to create a set of valid domains against which we validate URLs at runtime.  Only URLs which match a domain in that set are considered valid.
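
The one-domain-per-line file and the runtime lookup described above could look roughly like the sketch below. It is an illustration only, not the code from the patch: the class DomainSetSketch and its methods are hypothetical names, skipping lines that start with '#' is an assumption, and only the exact-hostname case is shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch.
final class DomainSetSketch {

    // Reads one entry per line, ignoring blank lines and (by assumption) '#' comments.
    static Set<String> load(Reader in) {
        Set<String> domains = new HashSet<>();
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim().toLowerCase(Locale.ROOT);
                if (!entry.isEmpty() && !entry.startsWith("#")) {
                    domains.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return domains;
    }

    // True only if the URL's host is listed in the set (exact hostname match).
    static boolean isValidHost(Set<String> domains, String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && domains.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: treated as not valid in this sketch
        }
    }
}
```

Loading the set once and probing it per URL keeps the runtime check to a single hash lookup.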



[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "julien nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672994#action_12672994 ] 

julien nioche commented on NUTCH-668:
-------------------------------------

At line 173, shouldn't we return 'url' instead of null? Otherwise we contradict the comment
// if an error happens, allow the url to pass
and block URLs which are pure IP addresses.
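
The behaviour being asked for can be sketched as below. This is a hedged illustration, not the patch itself: FailOpenFilterSketch is a hypothetical class, java.net.URI parsing stands in for whatever step fails at line 173 of the patch, and the convention assumed is that filter returns the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical class, not the DomainURLFilter from the patch.
final class FailOpenFilterSketch {
    private final Set<String> allowedHosts;

    FailOpenFilterSketch(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    // Returns the URL if it is accepted, or null if it is rejected.
    String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return url; // no host could be extracted: let the url pass
            }
            return allowedHosts.contains(host.toLowerCase(Locale.ROOT)) ? url : null;
        } catch (URISyntaxException e) {
            // if an error happens, allow the url to pass
            return url;
        }
    }
}
```

Returning null in the catch block instead would silently reject every URL that triggers the error, which is the contradiction pointed out above.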




[jira] Commented: (NUTCH-668) Domain URL Filter

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653856#action_12653856 ] 

Andrzej Bialecki  commented on NUTCH-668:
-----------------------------------------

The test case contains a reference to a path on your local machine ...

Also, the issue of domain vs. subdomain vs. host matching ... I'd love to be able to specify patterns like this:

edu
example.com
blurfl.foobar.org

meaning: accept everything from the .edu TLD, everything from example.com including subdomains and hosts, and anything from blurfl.foobar.org, whether that's a hostname or a subdomain.

We could do it with a suffix tree, or by matching the increasing number of hostname elements to the HashSet, e.g. for www.blurfl.foobar.org we would check:

 org - no match
 foobar.org - no match
 blurfl.foobar.org - match, break and return

For www.foobar.com we would check:

 com - no match
 foobar.com - no match
 www.foobar.com - no match
 return null

The price is that we need to make as many probes in the HashSet as there are domain elements, but the advantage is the increased flexibility in configuring allowed domains / hosts.
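
The HashSet probing scheme above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the patch: the class SuffixProbe and the method firstMatch are hypothetical names, and the hostname is split naively on dots.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch.
final class SuffixProbe {

    // Probes the set with an increasing number of trailing hostname elements
    // (e.g. "org", "foobar.org", "blurfl.foobar.org", ...) and returns the
    // first entry that matches, or null if none does.
    static String firstMatch(Set<String> allowed, String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        String candidate = "";
        for (int i = labels.length - 1; i >= 0; i--) {
            candidate = candidate.isEmpty() ? labels[i] : labels[i] + "." + candidate;
            if (allowed.contains(candidate)) {
                return candidate; // match, break and return
            }
        }
        return null;
    }
}
```

With the set {edu, example.com, blurfl.foobar.org}, firstMatch returns blurfl.foobar.org for www.blurfl.foobar.org and null for www.foobar.com, at a cost of one HashSet probe per hostname element.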

I'm also fine if you want to commit it as it is, and create an issue to enhance this plugin later.


