Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2008/12/03 02:42:44 UTC
[jira] Created: (NUTCH-668) Domain URL Filter
Domain URL Filter
-----------------
Key: NUTCH-668
URL: https://issues.apache.org/jira/browse/NUTCH-668
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0
A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658118#action_12658118 ]
Dennis Kubes commented on NUTCH-668:
------------------------------------
Does anybody have a problem if I commit this today or tomorrow?
> Domain URL Filter
> -----------------
>
> Key: NUTCH-668
> URL: https://issues.apache.org/jira/browse/NUTCH-668
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, NUTCH-668-3-20081213.patch
>
>
> A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls.
[jira] Updated: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-668:
-------------------------------
Attachment: NUTCH-668-3-20081213.patch
New domain filter patch that matches against suffix, domain, and hostname in that order.
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658121#action_12658121 ]
Andrzej Bialecki commented on NUTCH-668:
-----------------------------------------
+1. One minor cosmetic change I would make: DomainURLFilter.java uses StringUtils twice, each time with the full package name. Just add an import statement to make this shorter.
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653881#action_12653881 ]
Dennis Kubes commented on NUTCH-668:
------------------------------------
I agree. Being able to match on TLDs like .com would make it much more flexible. Let me work up the changes and I will post a new patch (without my local path :)). Although I do want to get this in quickly, I think the new functionality is worth the wait.
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658165#action_12658165 ]
Dennis Kubes commented on NUTCH-668:
------------------------------------
It uses two different StringUtils classes, one from commons lang and one from Hadoop (org.apache.hadoop.util.StringUtils). I imported neither and fully qualified the commons one because I thought I would use it more often. As it happens, I only use it once in this patch.
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663224#action_12663224 ]
Hudson commented on NUTCH-668:
------------------------------
Integrated in Nutch-trunk #691 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/691/])
[jira] Closed: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes closed NUTCH-668.
------------------------------
[jira] Updated: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-668:
-------------------------------
Attachment: NUTCH-668-2-20081204.patch
Updated to include URLUtil methods that were missing. Sorry.
[jira] Resolved: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes resolved NUTCH-668.
--------------------------------
Resolution: Fixed
Committed with revision 729958.
[jira] Updated: (NUTCH-668) Domain URL Filter
Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-668:
-------------------------------
Attachment: NUTCH-668-1-20081202.patch
Includes the DomainURLFilter and test files. Domains can be filtered either by top-level domain, ignoring subdomains, or by hostname, depending on configuration. A configuration file lists the valid domains, one per line. Those domains are used to build a set of valid domains against which URLs are validated at runtime. Only URLs that match a domain in that set are considered valid.
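As a rough illustration of the one-domain-per-line configuration described above, the sketch below reads such a listing into a set. It is not the code from the patch: the class name DomainFileReader, the method readDomains, the lower-casing, and the skipping of blank lines and '#' comments are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not the class from the patch. */
final class DomainFileReader {

  /** Reads one domain per line and returns the entries as a set. */
  static Set<String> readDomains(Reader in) {
    Set<String> domains = new LinkedHashSet<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String entry = line.trim().toLowerCase(Locale.ROOT);
        // Assumption: blank lines and lines starting with '#' are ignored.
        if (!entry.isEmpty() && !entry.startsWith("#")) {
          domains.add(entry);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return domains;
  }

  public static void main(String[] args) {
    String config = "# allowed domains\napache.org\n\nExample.COM\n";
    // Prints the set built from the two non-comment, non-blank lines.
    System.out.println(readDomains(new StringReader(config)));
  }
}
```

A URL would then be accepted only if the domain or hostname extracted from it is contained in the returned set.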
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "julien nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672994#action_12672994 ]
julien nioche commented on NUTCH-668:
-------------------------------------
At line 173, shouldn't we return 'url' instead of null? Otherwise we contradict the comment
// if an error happens, allow the url to pass
and block URLs which are pure IP addresses.
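To make the point concrete, here is a minimal sketch of a filter that fails open, using the convention that returning the URL accepts it and returning null rejects it. It is not the patch's code: FailOpenFilter, domainOf, and the naive "last two labels" domain extraction are hypothetical, and the sketch assumes that IP-only hosts end up on the error path.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch for illustration; not the class from the patch. */
final class FailOpenFilter {

  /**
   * Naive stand-in for a real domain lookup: returns the last two labels of
   * the host. Assumption: hosts that are plain IPv4 literals (or missing)
   * have no domain name and raise an exception.
   */
  static String domainOf(String url) {
    String host = URI.create(url).getHost();
    if (host == null || host.matches("[0-9.]+")) {
      throw new IllegalArgumentException("no domain name in: " + url);
    }
    String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
    int n = labels.length;
    return n < 2 ? labels[0] : labels[n - 2] + "." + labels[n - 1];
  }

  /** Returns the url to accept it, or null to reject it. */
  static String filter(String url, Set<String> allowedDomains) {
    try {
      return allowedDomains.contains(domainOf(url)) ? url : null;
    } catch (RuntimeException e) {
      // if an error happens, allow the url to pass
      return url;
    }
  }

  public static void main(String[] args) {
    Set<String> allowed = Collections.singleton("apache.org");
    System.out.println(filter("http://www.apache.org/nutch", allowed));
    System.out.println(filter("http://www.example.com/", allowed));
    System.out.println(filter("http://127.0.0.1/index.html", allowed));
  }
}
```

With `return null` in the catch block instead, the IP-only URL in the last call would be rejected, which is the behavior questioned above.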
[jira] Commented: (NUTCH-668) Domain URL Filter
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653856#action_12653856 ]
Andrzej Bialecki commented on NUTCH-668:
-----------------------------------------
The test case contains a reference to a path on your local machine ...
Also, the issue of domain vs. subdomain vs. host matching ... I'd love to be able to specify patterns like this:
edu
example.com
blurfl.foobar.org
meaning: accept everything from the .edu TLD, everything from example.com including subdomains and hosts, and anything from blurfl.foobar.org, whether that's a hostname or a subdomain.
We could do it with a suffix tree, or by matching the increasing number of hostname elements to the HashSet, e.g. for www.blurfl.foobar.org we would check:
org - no match
foobar.org - no match
blurfl.foobar.org - match, break and return
For www.foobar.com we would check:
com - no match
foobar.com - no match
www.foobar.com - no match
return null
The price is that we need to make as many probes in the HashSet as there are domain elements, but the advantage is the increased flexibility in configuring allowed domains / hosts.
I'm also fine if you want to commit it as it is, and create an issue to enhance this plugin later.
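The HashSet probing described above could be sketched roughly as follows. This is only an illustration of the idea, not code from any of the patches; the class name SuffixProbe and the method firstMatch are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not the class from the patch. */
final class SuffixProbe {

  /**
   * Probes the set with progressively longer suffixes of the host, starting
   * at the TLD. Returns the first configured entry that matches, or null
   * when no suffix of the host is in the set.
   */
  static String firstMatch(String host, Set<String> allowed) {
    String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
    String candidate = "";
    // Walk from the rightmost label (the TLD) towards the full hostname.
    for (int i = labels.length - 1; i >= 0; i--) {
      candidate = candidate.isEmpty() ? labels[i] : labels[i] + "." + candidate;
      if (allowed.contains(candidate)) {
        return candidate; // match, break and return
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Set<String> allowed = new HashSet<>(
        Arrays.asList("edu", "example.com", "blurfl.foobar.org"));
    System.out.println(firstMatch("www.blurfl.foobar.org", allowed));
    System.out.println(firstMatch("www.foobar.com", allowed));
  }
}
```

Each lookup costs one HashSet probe per hostname element, which matches the trade-off mentioned above.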