Posted to dev@nutch.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2015/04/18 18:50:58 UTC

[jira] [Commented] (NUTCH-1989) Handling invalid URLs in CommonCrawlDataDumper

    [ https://issues.apache.org/jira/browse/NUTCH-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501465#comment-14501465 ] 

Hudson commented on NUTCH-1989:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3069 (See [https://builds.apache.org/job/Nutch-trunk/3069/])
Fix for NUTCH-1989 Handling invalid URLs in CommonCrawlDataDumper contributed by Giuseppe Totaro. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1674536)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java


> Handling invalid URLs in CommonCrawlDataDumper
> ----------------------------------------------
>
>                 Key: NUTCH-1989
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1989
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: memex
>             Fix For: 1.10
>
>         Attachments: NUTCH-1989.patch
>
>
> Hi all,
> while running the {{CommonCrawlDataDumper}} tool ({{bin/nutch commoncrawldump}}) with the new options (described in [NUTCH-1975|https://issues.apache.org/jira/browse/NUTCH-1975]), I noticed that it runs into problems when it encounters an invalid URL.
> For example, the following URLs (which I found in crawled data) break the naming schema enabled by the {{-epochFilename}} command-line option:
> * http://www/
> * http:/
> In more detail: with the {{-epochFilename}} option, extracted files are organized in a reversed-DNS tree based on the FQDN of the web page, followed by a SHA1 hash of the complete URL. When the tool encounters URLs like those above, it cannot build the reversed-DNS tree.
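
The reversed-DNS naming described above can be sketched roughly as follows. This is only an illustration using the JDK's java.net.URI and MessageDigest, not the code in CommonCrawlDataDumper; the class name ReversedDnsPath, its method names, and the "/" separator between labels are assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: maps a URL to "<reversed host>/<sha1 of url>".
class ReversedDnsPath {

  // Reverses the dot-separated labels of a host name and joins them with '/'.
  static String reverseHost(String host) {
    String[] labels = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      sb.append(labels[i]);
      if (i > 0) {
        sb.append('/');
      }
    }
    return sb.toString();
  }

  // Lower-case hex SHA-1 digest of the UTF-8 bytes of the input.
  static String sha1Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 not available", e);
    }
  }

  // Returns the relative path for a URL, or null when no host can be
  // extracted, which is the case in which no reversed-DNS tree can be built.
  static String pathFor(String url) {
    String host;
    try {
      host = new URI(url).getHost();
    } catch (URISyntaxException e) {
      host = null;
    }
    if (host == null || host.isEmpty()) {
      return null;
    }
    return reverseHost(host) + "/" + sha1Hex(url);
  }

  public static void main(String[] args) {
    System.out.println(pathFor("http://www.example.com/index.html"));
    System.out.println(pathFor("http:/")); // no host, so no path
  }
}
```

With this sketch, a URL such as http:/ yields no host at all, so there is nothing to reverse and the caller has to decide what to do with the record.
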
> The attached patch is a simple way to detect invalid URLs. It uses the [Apache Commons Validator|http://commons.apache.org/proper/commons-validator/] APIs:
> {code}
> UrlValidator urlValidator = new UrlValidator();
> if (!urlValidator.isValid(url)) {
>   LOG.warn("Not valid URL detected: " + url);
> }
> {code}
> The tool logs a warning message when an invalid URL is detected. I am wondering whether we should take a specific action when invalid URLs occur. We could skip them, but I noticed that the following URLs are also detected as invalid:
> {noformat}
> 2015-04-15 13:49:40,386 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
> 2015-04-15 13:49:41,603 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
> 2015-04-15 13:49:41,632 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http:/
> 2015-04-15 13:49:44,601 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
> 2015-04-15 13:50:34,821 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
> 2015-04-15 13:50:35,847 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
> 2015-04-15 13:50:35,866 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http:/
> 2015-04-15 13:50:38,605 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
> 2015-04-15 13:51:20,013 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://antilop.cc/sr/users/nomad bloodbath
> 2015-04-15 13:51:20,499 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
> 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
> 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com\/gaming\/2015\/04\/mortal-kombat-x-charges-players-for-easy-fatalities\/
> 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
> 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/civis
> 2015-04-15 13:51:20,588 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/ars.to\/1tECmHU
> 2015-04-15 13:51:20,589 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com
> 2015-04-15 13:51:20,589 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com\/tech-policy\/2014\/11\/prosecutor-silk-road-2-0-suspect-did-admit-to-everything\/
> 2015-04-15 13:51:20,590 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
> 2015-04-15 13:51:20,590 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/civis
> {noformat}
> I would be very pleased to get your feedback on what action to take when invalid URLs are detected, so that we avoid dropping data or breaking the naming schema when the {{-epochFilename}} option is used.
> I am now going to add a counter for invalid URLs. Thanks [~lewismc] for supporting me on this work.
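
One possible shape for the invalid-URL counter mentioned above is sketched here. It is an assumption about how such a counter could look, not the contents of the patch: the class InvalidUrlCounter and the stand-in check hasHost are hypothetical names. The validator is passed in as a Predicate so that a stricter check (for example the UrlValidator from the snippet in the description) could be plugged in; the stand-in based on java.net.URI is looser and, for instance, accepts http://www/.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical counter, not part of Nutch: tallies URLs rejected by a validator.
class InvalidUrlCounter {
  private final Predicate<String> validator;
  private long total;
  private long invalid;

  InvalidUrlCounter(Predicate<String> validator) {
    this.validator = validator;
  }

  // Records the URL and returns true when the validator accepts it.
  boolean check(String url) {
    total++;
    boolean ok = validator.test(url);
    if (!ok) {
      invalid++;
    }
    return ok;
  }

  long totalCount() {
    return total;
  }

  long invalidCount() {
    return invalid;
  }

  // Stand-in check (an assumption for this example): the URL parses and has a host.
  static boolean hasHost(String url) {
    try {
      return new URI(url).getHost() != null;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    InvalidUrlCounter counter = new InvalidUrlCounter(InvalidUrlCounter::hasHost);
    List<String> urls = Arrays.asList(
        "http://www.example.com/",
        "http:/",
        "http://antilop.cc/sr/users/nomad bloodbath");
    for (String url : urls) {
      if (!counter.check(url)) {
        System.out.println("Not valid URL detected: " + url);
      }
    }
    System.out.println(counter.invalidCount() + " of " + counter.totalCount() + " URLs invalid");
  }
}
```

Keeping the count separate from the decision to skip or keep a record leaves the question raised in the description open: the dump can still write the data somewhere while reporting how many URLs failed validation.
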
