Posted to commits@nutch.apache.org by ab...@apache.org on 2007/09/18 21:07:40 UTC
svn commit: r577018 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Generator.java
Author: ab
Date: Tue Sep 18 12:07:39 2007
New Revision: 577018
URL: http://svn.apache.org/viewvc?rev=577018&view=rev
Log:
NUTCH-554 - Generator throws IOException on invalid urls.
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
Modified: lucene/nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev=577018&r1=577017&r2=577018&view=diff
==============================================================================
--- lucene/nutch/trunk/CHANGES.txt (original)
+++ lucene/nutch/trunk/CHANGES.txt Tue Sep 18 12:07:39 2007
@@ -133,6 +133,9 @@
45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
+46. NUTCH-554 - Generator throws IOException on invalid urls.
+ (Brian Whitman via ab)
+
Release 0.9 - 2007-04-02
1. Changed log4j confiquration to log to stdout on commandline
Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?rev=577018&r1=577017&r2=577018&view=diff
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java Tue Sep 18 12:07:39 2007
@@ -184,7 +184,13 @@
       Text url = entry.url;
       if (maxPerHost > 0) { // are we counting hosts?
-        URL u = new URL(url.toString());
+        URL u = null;
+        try {
+          u = new URL(url.toString());
+        } catch (MalformedURLException e) {
+          LOG.info("Bad protocol in url: " + url.toString());
+          continue;
+        }
         String host = u.getHost();
         if (host == null) {
           // unknown host, skip
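
For readers who want to see the pattern from this change in isolation, here is a minimal standalone sketch. It is not code from Generator.java: the class name SkipBadUrls, the method hosts, and the sample URLs are hypothetical and chosen only for illustration. It shows the same idea as the patch: MalformedURLException is a subclass of IOException, so catching it inside the loop and continuing lets one bad entry be skipped instead of aborting the whole pass.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Nutch.
class SkipBadUrls {

    // Returns the host of every URL string that parses.
    // Entries that fail to parse are logged and skipped.
    static List<String> hosts(List<String> urls) {
        List<String> result = new ArrayList<>();
        for (String s : urls) {
            URL u;
            try {
                // Throws MalformedURLException (an IOException subclass)
                // for a missing or unknown protocol.
                u = new URL(s);
            } catch (MalformedURLException e) {
                System.err.println("Skipping malformed url: " + s);
                continue; // skip this entry, keep processing the rest
            }
            String host = u.getHost();
            if (host == null || host.isEmpty()) {
                continue; // unknown host, skip
            }
            result.add(host);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input (assumed for the example): two valid and two invalid URLs.
        List<String> input = Arrays.asList(
            "http://example.com/a", "foo://bar", "no-scheme", "https://apache.org/");
        System.out.println(hosts(input));
    }
}
```

Without the try/catch, the first invalid entry would propagate the exception out of the loop, which matches the failure described in NUTCH-554.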