Posted to dev@nutch.apache.org by "Giuseppe Totaro (JIRA)" <ji...@apache.org> on 2015/04/16 17:50:58 UTC

[jira] [Created] (NUTCH-1989) Handling invalid URLs in CommonCrawlDataDumper

Giuseppe Totaro created NUTCH-1989:
--------------------------------------

             Summary: Handling invalid URLs in CommonCrawlDataDumper
                 Key: NUTCH-1989
                 URL: https://issues.apache.org/jira/browse/NUTCH-1989
             Project: Nutch
          Issue Type: Improvement
          Components: tool
    Affects Versions: 1.10
            Reporter: Giuseppe Totaro
            Priority: Minor


Hi all,
while running the {{CommonCrawlDataDumper}} tool ({{bin/nutch commoncrawldump}}) with the new options (described in [NUTCH-1975|https://issues.apache.org/jira/browse/NUTCH-1975]), I noticed some problems when the crawled data contains an invalid URL.
For example, the following URLs (which I found in crawled data) break the naming scheme used by the {{-epochFilename}} command-line option:
* http://www/
* http:/

In more detail, with the {{-epochFilename}} option the extracted files are organized in a reversed-DNS tree based on the FQDN of the webpage, followed by a SHA1 hash of the complete URL. When the tool encounters URLs like the ones above, it cannot build the reversed-DNS tree.
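
As a rough illustration of that layout (not the actual Nutch code), the sketch below derives a reversed-DNS path from the host and appends a SHA-1 hash of the complete URL. The class and method names ({{ReversedDnsPath}}, {{reverseHost}}, {{sha1Hex}}, {{buildPath}}) and the use of {{/}} as separator are assumptions made for this example only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, only meant to illustrate the naming scheme described above.
class ReversedDnsPath {

  /** Reverses the labels of a host name, e.g. "www.example.com" becomes "com/example/www". */
  static String reverseHost(String host) {
    String[] labels = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      sb.append(labels[i]);
      if (i > 0) {
        sb.append('/');
      }
    }
    return sb.toString();
  }

  /** Returns the hex-encoded SHA-1 digest of the complete URL. */
  static String sha1Hex(String url) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 must be available on every Java platform implementation.
      throw new IllegalStateException(e);
    }
  }

  /** Returns "reversed/host/labels/sha1", or null when no host can be extracted. */
  static String buildPath(String url) {
    String host;
    try {
      host = new URI(url).getHost();
    } catch (URISyntaxException e) {
      host = null;
    }
    if (host == null || host.isEmpty()) {
      return null; // nothing to build the tree from
    }
    return reverseHost(host) + "/" + sha1Hex(url);
  }
}
```

In this sketch, {{buildPath("http:/")}} returns {{null}} because the URL has no host at all, while {{http://www/}} has a host with a single label and therefore no domain hierarchy to reverse.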

Attached is a simple patch for detecting invalid URLs. It uses the [Apache Commons Validator|http://commons.apache.org/proper/commons-validator/] API:
{code}
UrlValidator urlValidator = new UrlValidator();
if (!urlValidator.isValid(url)) {
  LOG.warn("Not valid URL detected: " + url);
}
{code}

With the patch, the tool logs a warning message whenever an invalid URL is detected. I am wondering whether we should take a specific action when invalid URLs occur. We could skip them, but I noticed that the following URLs are also reported as invalid:
{noformat}
2015-04-15 13:49:40,386 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
2015-04-15 13:49:41,603 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
2015-04-15 13:49:41,632 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http:/
2015-04-15 13:49:44,601 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
2015-04-15 13:50:34,821 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
2015-04-15 13:50:35,847 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
2015-04-15 13:50:35,866 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http:/
2015-04-15 13:50:38,605 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
2015-04-15 13:51:20,013 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://antilop.cc/sr/users/nomad bloodbath
2015-04-15 13:51:20,499 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com\/gaming\/2015\/04\/mortal-kombat-x-charges-players-for-easy-fatalities\/
2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/civis
2015-04-15 13:51:20,588 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/ars.to\/1tECmHU
2015-04-15 13:51:20,589 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com
2015-04-15 13:51:20,589 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com\/tech-policy\/2014\/11\/prosecutor-silk-road-2-0-suspect-did-admit-to-everything\/
2015-04-15 13:51:20,590 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
2015-04-15 13:51:20,590 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/civis
{noformat}
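
Some entries in the log above, such as the reddit.com one, presumably fail because their path contains non-ASCII characters. One option, sketched here as an assumption and not as part of the attached patch, is to percent-encode such characters with the standard {{java.net.URI}} class before validating. The class name {{UrlNormalizer}} and the method {{toAscii}} are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: percent-encodes non-ASCII characters before validation.
class UrlNormalizer {

  /**
   * Returns the URL with non-ASCII characters percent-encoded as UTF-8,
   * or null when the string cannot be parsed as a URI at all.
   */
  static String toAscii(String url) {
    try {
      return new URI(url).toASCIIString();
    } catch (URISyntaxException e) {
      // Raised, for example, for URLs containing spaces or backslashes.
      return null;
    }
  }
}
```

With this sketch, the "ö" and "å" in the reddit.com URL become {{%C3%B6}} and {{%C3%A5}}, while the URLs containing a space or backslashes still cannot be parsed and yield {{null}}. Whether the encoded form then passes {{UrlValidator}} would need to be checked against the validator's configuration.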

I would be very pleased to get your feedback on which action to perform when invalid URLs are detected, so that we neither drop data nor break the naming scheme when the {{-epochFilename}} option is used.
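
To make the question concrete, one possible action is sketched below purely as a starting point for discussion: records whose URL yields no usable host go into a single fallback directory instead of being dropped, and a counter tracks how many were seen. The class {{InvalidUrlBucket}}, the directory name {{invalid-urls}}, and the host check via {{java.net.URI}} are all assumptions for this example; none of them is part of the attached patch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: keep records with unusable URLs in one fallback directory.
class InvalidUrlBucket {

  // Assumed directory name for records whose URL has no usable host.
  static final String FALLBACK_DIR = "invalid-urls";

  private final AtomicLong invalidCount = new AtomicLong();

  /** Returns the host to build the tree from, or FALLBACK_DIR when none can be parsed. */
  String treeRootFor(String url) {
    try {
      String host = new URI(url).getHost();
      if (host != null && !host.isEmpty()) {
        return host;
      }
    } catch (URISyntaxException e) {
      // Unparseable URL: handled by the fallback below.
    }
    invalidCount.incrementAndGet();
    return FALLBACK_DIR;
  }

  /** Number of URLs that were routed to the fallback directory so far. */
  long getInvalidCount() {
    return invalidCount.get();
  }
}
```

Since the file name is still a hash of the complete URL, records placed in the fallback directory would remain distinguishable from each other.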

Next, I am going to add a counter for invalid URLs. Thanks to [~lewismc] for supporting me on this work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)