Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 17:27:05 UTC

[jira] [Closed] (NUTCH-364) Javascript parser creates some fairly bogus URLs

     [ https://issues.apache.org/jira/browse/NUTCH-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-364.
-------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Javascript parser creates some fairly bogus URLs
> ------------------------------------------------
>
>                 Key: NUTCH-364
>                 URL: https://issues.apache.org/jira/browse/NUTCH-364
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: OS X 10.4.7
>            Reporter: Doug Cook
>
> If one crawls, say, 
>      http://www.metropoleparis.com/2000/501/
> with the Javascript parser enabled, one gets outlinks of the form:
> 2006-09-08 16:55:06,301 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</IFRAME>'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</SCR'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</DIV>'
> Another example would be:
> http://www.wein-plus.de/glossar/G.htm
> which yields the URL (among others):
> 2006-09-08 16:55:10,499 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.wein-plus.de/glossar/<\/a>'
> I have seen these URLs form "crawler traps" that make small sites explode into many, many URLs. For the moment, I have plugged the worst offenders with specific filter rules, but it would be nice to find a way to improve the JSParseFilter's heuristics to avoid these.
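
For illustration only, the kind of stopgap filter rule mentioned in the report could be sketched as below. This is not Nutch code and not the JSParseFilter's actual heuristic; the function names `looks_like_markup_fragment` and `filter_outlinks` are hypothetical, and the sketch is in Python rather than Nutch's Java purely for brevity. It assumes that a literal '<', '>' or backslash in a candidate outlink means the "URL" was really a piece of HTML inside a JavaScript string, which holds for every example logged above.

```python
import re

# Assumption: these characters should not appear unescaped in a real URL,
# so their presence suggests a markup fragment such as "</DIV>" or "<\/a>".
_MARKUP_CHARS = re.compile(r"[<>\\]")


def looks_like_markup_fragment(url: str) -> bool:
    """Return True if the candidate outlink contains '<', '>' or a backslash."""
    return _MARKUP_CHARS.search(url) is not None


def filter_outlinks(urls):
    """Keep only the candidate outlinks that do not look like markup fragments."""
    return [u for u in urls if not looks_like_markup_fragment(u)]


if __name__ == "__main__":
    candidates = [
        "http://www.metropoleparis.com/2000/501/</IFRAME>",
        "http://www.metropoleparis.com/2000/501/</SCR",
        "http://www.wein-plus.de/glossar/<\\/a>",
        "http://www.wein-plus.de/glossar/G.htm",
    ]
    for url in filter_outlinks(candidates):
        print(url)
```

A check like this only drops the bogus outlinks after they have been extracted; it does not address why the parser treats closing tags inside script strings as links in the first place.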

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira