Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/07/18 11:28:59 UTC

[jira] [Resolved] (NUTCH-1043) Add pattern for filtering .js in default url filters

     [ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1043.
----------------------------------

    Resolution: Fixed

Committed revision 1147796 -> 1.4
Committed revision 1147798 -> 2.0 (trunk)


> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
>                 Key: NUTCH-1043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1043.patch
>
>
> The JavaScript parser is not used by default because it is extremely noisy. However, the default URL filters do not filter out URLs ending in .js, and the default parser (Tika) can't parse them. In a nutshell, we are fetching URLs that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. People who are interested in fetching and parsing .js files can activate the plugin in their conf and remove the regex from the URL filters.
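
For illustration only, here is a minimal sketch of the kind of suffix-based exclusion the description proposes. It is not the content of NUTCH-1043.patch. The class name JsUrlFilterSketch, the method filter, and the pattern itself are assumptions made for this example, and the code uses plain java.util.regex rather than any Nutch plugin API. It follows the common URL-filter convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.util.regex.Pattern;

public class JsUrlFilterSketch {

    // Hypothetical exclusion pattern (an assumption, not taken from the patch):
    // matches URLs that end in ".js" or ".JS". In a regex-based filter file
    // this would presumably be written as an exclusion rule such as
    //   -\.(js|JS)$
    private static final Pattern JS_SUFFIX = Pattern.compile("\\.(js|JS)$");

    /** Returns the URL unchanged if it passes the filter, or null if it is rejected. */
    static String filter(String url) {
        return JS_SUFFIX.matcher(url).find() ? null : url;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/static/app.js",
            "http://example.com/static/APP.JS",
            "http://example.com/js/page.html"
        };
        for (String url : urls) {
            String verdict = (filter(url) == null) ? "rejected" : "accepted";
            System.out.println(url + " -> " + verdict);
        }
    }
}
```

Because the pattern is anchored at the end of the string, a URL such as http://example.com/app.js?v=2 would still pass this sketch; whether that case matters depends on how URLs are normalized before filtering.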

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira