Posted to user@nutch.apache.org by brad <br...@bcs-mail.net> on 2010/08/08 06:16:35 UTC
Possible issue in OutlinkExtractor.java and Outlink.java

Hi,
I have been having some problems with OPICScoringFilter generating
MalformedURLException errors recently, so I have been trying to trace back
why it is occurring.  I put in a few display statements and was able to
trace the issue to invalid URLs being permitted to progress through the
process, in some cases all the way to the OPICScoringFilter.  The issue
began to occur when I switched from urlfilter-regex to urlfilter-automaton;
the same issue does not occur with urlfilter-regex (at least not at the
OPICScoringFilter level).

I believe the problem can be traced to OutlinkExtractor.java and
Outlink.java.

OutlinkExtractor.java appears to use the following Perl5 regex statement to
try to create/filter an outlink:
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)

  //loop the matches
  while (matcher.contains(input, pattern)) {
	// if this is taking too long, stop matching
	//   (SHOULD really check cpu time used so that heavily loaded
	//   systems do not unnecessarily hit this limit.)
	if (System.currentTimeMillis() - start >= 60000L) {
	  if (LOG.isWarnEnabled()) {
		LOG.warn("Time limit exceeded for getOutLinks");
	  }
	  break;
	}
	result = matcher.getMatch();
	url = result.group(0);
	try {
	  outlinks.add(new Outlink(url, anchor));
	} catch (MalformedURLException mue) {
	  LOG.warn("Invalid url: '" + url + "', skipping.");
	}
  }

The regex matches strings such as the following and treats them as URLs:
Nd:YAG
neodymium:yttrium-aluminum-garnet
Reportshttp://www.lipperweb.com/Research/FundIndustry.aspxLipper
st1:place
upside.0:14:182010,Q2,fixed
Webcastshttp://www.lipperweb.com/Commentary/Webcasts.aspxAvailable
qxd:00
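
To make the over-matching easy to reproduce, here is a minimal sketch that
runs the same pattern text against a short sample.  It is not the Nutch
code: it assumes java.util.regex in place of the Perl5 matcher used in
OutlinkExtractor.java, and the class name OutlinkRegexDemo, the extract
helper, and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class OutlinkRegexDemo {

  // Pattern text copied from the message, joined onto one line.
  // Assumption: java.util.regex stands in for the Perl5 matcher here.
  static final Pattern URL_PATTERN = Pattern.compile(
      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
      + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
      + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

  // Returns every substring of the text that the pattern accepts.
  static List<String> extract(String text) {
    List<String> found = new ArrayList<>();
    Matcher m = URL_PATTERN.matcher(text);
    while (m.find()) {
      found.add(m.group(0));
    }
    return found;
  }

  public static void main(String[] args) {
    String text = "An Nd:YAG laser; see http://example.com/page for details.";
    // Both the real link and the "Nd:YAG" token are reported as matches.
    for (String candidate : extract(text)) {
      System.out.println(candidate);
    }
  }
}
```

The pattern only requires "letters, a colon, more characters", so any
word:word token in the page text qualifies as a candidate outlink.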

Then the program creates and adds a new outlink for the url using the catch
statement to prevent MalformedURLExceptions:
try {
  outlinks.add(new Outlink(url, anchor));
} catch (MalformedURLException mue) {
  LOG.warn("Invalid url: '" + url + "', skipping.");
}

The problem is that the call to new Outlink(url, anchor) in Outlink.java
does not appear to do any kind of validation that could ever throw a
MalformedURLException:

public Outlink(String toUrl, String anchor) throws MalformedURLException {
  this.toUrl = toUrl;
  if (anchor == null) anchor = "";
  this.anchor = anchor;
}

As a result, it appears that the try/catch block in OutlinkExtractor.java
never catches a MalformedURLException, so no URL is ever rejected there.
Some of the bad URLs may be caught further along in the process through URL
filtering or other types of filtering, but some are not.  In any case, I
would expect it to be better to catch the error up front rather than
further down the line.

Just as a test case, I added the following to Outlink(String toUrl, String
anchor) in Outlink.java:

URL u = new URL(toUrl);

This should throw a MalformedURLException for a problem URL.
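
For anyone who wants to try the check outside of Nutch, here is a minimal
standalone sketch of the same idea.  The class ValidatedOutlink and the
isValid helper are hypothetical stand-ins for Outlink, not the real class,
and the only validation is the new URL(toUrl) call from the test above.
Note that the java.net.URL constructors are deprecated in recent JDKs, so
newer compilers will print a deprecation note.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-in for Outlink, used only to illustrate the check.
class ValidatedOutlink {
  private final String toUrl;
  private final String anchor;

  ValidatedOutlink(String toUrl, String anchor) throws MalformedURLException {
    // Throws MalformedURLException if the string cannot be parsed or its
    // scheme has no protocol handler in this JVM (e.g. "nd", "javascript").
    new URL(toUrl);
    this.toUrl = toUrl;
    this.anchor = (anchor == null) ? "" : anchor;
  }

  String getToUrl() { return toUrl; }

  String getAnchor() { return anchor; }

  // Convenience helper (an assumption, not in Nutch): true if the
  // constructor accepts the candidate string.
  static boolean isValid(String candidate) {
    try {
      new ValidatedOutlink(candidate, null);
      return true;
    } catch (MalformedURLException mue) {
      return false;
    }
  }

  public static void main(String[] args) {
    String[] candidates = {
      "http://www.lipperweb.com/Research/FundIndustry.aspx",
      "Nd:YAG",
      "javascript:expand",
      "display:none;"
    };
    for (String c : candidates) {
      System.out.println((isValid(c) ? "ok:      " : "invalid: ") + c);
    }
  }
}
```

This check is fairly coarse: it mostly confirms that the scheme is one the
JVM knows how to handle, which is enough to reject the samples listed in
this message but is not a full syntax check.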

Running a very small crawl, the change generated 139 MalformedURLException
errors.  Here is a sampling of the URLs:
Invalid url: 'T15:16:07Z', skipping.
Invalid url: 'tag:www.nabp.net,2010:/news//4.2171', skipping.
Invalid url: 'atftp://ftp.cbo.gov/24xx/doc2421/bank-', skipping.
Invalid url: 'display:none;', skipping.
Invalid url: 'javascript:expand', skipping.
Invalid url: 'doi:10.1016/j.amjmed.2008.05.005', skipping.
Invalid url: 'Signature:X__________________________________________________', skipping.
Invalid url: 'am-11:30am', skipping.
Invalid url: 'tag:www.floridaprobatelitigationlawyers.com,2010://1701.15565', skipping.

As a result, I believe that this is probably an issue that needs fixing.

I'm new to Java and Nutch, so I'm not sure if this is something I should
report.  Heck, I'm not even sure I understood the classes and imports.  If
it is something to report, how do I do it?  (I realize there is Jira, but
I'm unsure of how to get an ID and submit a case.)  And who would fix it?
While what I did does catch malformed URLs, I'm not sure it is the best or
fastest way to do it, or that it correctly follows the intent of the Nutch
functionality for the specified modules.

Your help would be appreciated.

Thanks
Brad