Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/09/19 19:37:25 UTC
[jira] Commented: (NUTCH-364) Javascript parser creates some fairly bogus URLs

    [ http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ] 
            
Doug Cook commented on NUTCH-364:
---------------------------------

I've been looking into this a little bit. I see two problems:

(1) The current "two pass" heuristic URL-like string extractor has some flaws (I know, it was intended to be simple). 

The biggest is that it considers a URL to be more or less anything with a "." or a "/" in it. This is problematic in that lots of Javascript outputs HTML, where "/" commonly occurs in closing HTML tags. The philosophy seems to be to keep the extraction simple and rely on the URLNormalizer to throw an exception for a malformed URL, but the URLNormalizer doesn't seem to do much checking.

The problem can be fixed either here or with a more robust validity checker in the URL normalizers. I'd be inclined to fix it here, to avoid slowing down every normalization of the mostly valid URLs that come from elsewhere. A simple, though imperfect, improvement would be to reject strings containing tag-like markers ("<", ">", "&gt;", "&lt;", and so on). I can run some tests to see what other common "garbage URLs" occur.
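
A minimal sketch of that check, in Java. This is not the existing JSParseFilter code: the class name, method name, and marker list are all made up for illustration, and the real fix might look quite different.

```java
import java.util.Locale;

public class JsLinkCandidateFilter {
    // Hypothetical marker list (an assumption for this sketch): raw angle
    // brackets plus their HTML-escaped forms.
    private static final String[] TAG_MARKERS = {"<", ">", "&lt;", "&gt;"};

    /** Returns false for empty strings and for strings containing a tag-like marker. */
    public static boolean looksLikeUrl(String candidate) {
        if (candidate == null || candidate.isEmpty()) {
            return false;
        }
        // Lower-case first so that "&GT;" is caught as well as "&gt;".
        String lower = candidate.toLowerCase(Locale.ROOT);
        for (String marker : TAG_MARKERS) {
            if (lower.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The first three samples are the suffixes of the bogus outlinks in
        // the issue description; the last one is a plausible link string.
        String[] samples = {"</IFRAME>", "</SCR", "<\\/a>", "/tours_marchesi.asp"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeUrl(s));
        }
    }
}
```

This only rejects candidates; it does not try to decide whether what remains is a well-formed URL, so other kinds of garbage would still get through.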

I see that there is a more robust URL pattern string commented out. This one is better, but still has the same problem, in that it would allow &gt; and &lt;.

(2) Absolute URLs (here, paths beginning with "/") are also not handled properly. For example:

http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/products.html

refers to ./menu.js, which in turn creates a menu linking to (among others):
/tours_marchesi.asp

This should resolve to http://www.palmbayimports.com/tours_marchesi.asp, but instead resolves to:
http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/tours_marchesi.html

This won't be perfectly solvable given the current heuristic string-extraction approach, because a string beginning with a "/" may in fact be a suffix to which the Javascript prepends some other directory, and not actually an absolute URL. However, given that we don't know the prefix string, interpreting it as relative is as likely to be incorrect as interpreting it as absolute, and it will create a lot more unique URLs. We may want to interpret such strings as absolute to avoid creating a lot of garbage (as in the palmbayimports example above, which creates tens of thousands of garbage URLs).
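
To make the two interpretations concrete, here is a small sketch using the standard java.net.URI resolver. The class and method names are hypothetical, and the "directory-relative" variant is only my guess at an equivalent of what the parser currently does, not its actual code.

```java
import java.net.URI;

public class JsLinkResolution {
    /** Standard resolution: a leading "/" replaces the whole path of the base URL. */
    public static String resolveAsAbsolutePath(String base, String candidate) {
        return URI.create(base).resolve(candidate).toString();
    }

    /**
     * Assumed approximation of the other interpretation: drop the leading
     * slashes so the string resolves against the base URL's directory.
     */
    public static String resolveAsDirectoryRelative(String base, String candidate) {
        String stripped = candidate.replaceFirst("^/+", "");
        return URI.create(base).resolve(stripped).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/products.html";
        String link = "/tours_marchesi.asp";
        // Prints http://www.palmbayimports.com/tours_marchesi.asp
        System.out.println(resolveAsAbsolutePath(base, link));
        // Prints http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/tours_marchesi.asp
        System.out.println(resolveAsDirectoryRelative(base, link));
    }
}
```

With the absolute interpretation every page on the site maps "/tours_marchesi.asp" to the same URL, whereas the directory-relative one yields a different URL for each distinct base directory, which is where the explosion comes from.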

Comments?


> Javascript parser creates some fairly bogus URLs
> ------------------------------------------------
>
>                 Key: NUTCH-364
>                 URL: http://issues.apache.org/jira/browse/NUTCH-364
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: OS X 10.4.7
>            Reporter: Doug Cook
>
> If one crawls, say, 
>      http://www.metropoleparis.com/2000/501/
> with the Javascript parser enabled, one gets outlinks of the form:
> 2006-09-08 16:55:06,301 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</IFRAME>'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</SCR'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</DIV>'
> Another example would be:
> http://www.wein-plus.de/glossar/G.htm
> which yields the URL (among others):
> 2006-09-08 16:55:10,499 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.wein-plus.de/glossar/<\/a>'
> I have seen these form "crawler traps" and make small sites explode to many, many URLs. For the moment, I have the worst offenders plugged with specific filter rules, but it would be nice to see if there is a way to improve the JSParseFilter's heuristics to avoid these.
